# 03_QUEST_NEW_DATA

Great, we now have a working recreation of Figure 3 using real data. Next, let's
find some new datasets to display.

## 1. Objective and goal

Using the same data repositories and exclusion criteria as the original paper,
find a novel dataset that was not included in the training or testing
of the SCimilarity model.

### Search Specification
We will use the CZ CELLxGENE Discover API to programmatically find a dataset that meets the following strict criteria:

**Inclusion/Exclusion Criteria:**
*   **Organism:** Strictly Homo sapiens (Human).
*   **Assay:** 10x 3' v3 (or similar high-throughput scRNA-seq).
*   **Disease:** Duchenne muscular dystrophy.
*   **Tissue:** Muscle.
*   **Cell Count:** Minimum 2,000 cells to ensure a robust UMAP visualization.
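
The criteria above can be sketched as a plain predicate over one dataset metadata record. This is a minimal sketch, not the Discover API itself: it assumes each record exposes `organism`, `assay`, `disease`, and `tissue` as lists of `{"label": ...}` objects plus an integer `cell_count`, which should be checked against an actual API response. The names `SearchCriteria` and `matches_criteria` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchCriteria:
    """Hypothetical container for the criteria listed above."""
    organism: str = "Homo sapiens"
    # Widen this tuple to admit "similar" high-throughput assays.
    assay_keywords: tuple = ("10x 3' v3",)
    disease_keyword: str = "Duchenne muscular dystrophy"
    tissue_keyword: str = "muscle"
    min_cells: int = 2000


def _labels(record, field):
    """Lower-cased labels of a list-of-dicts field (assumed record shape)."""
    return [str(term.get("label", "")).lower() for term in record.get(field) or []]


def matches_criteria(record, criteria=None):
    """Return True if a dataset metadata record satisfies every criterion."""
    criteria = criteria or SearchCriteria()

    # Strictly human: at least one organism, and none other than Homo sapiens.
    organisms = _labels(record, "organism")
    if not organisms or any(o != criteria.organism.lower() for o in organisms):
        return False

    # Assay: any assay label containing one of the accepted keywords.
    assays = _labels(record, "assay")
    keywords = [k.lower() for k in criteria.assay_keywords]
    if not any(k in a for a in assays for k in keywords):
        return False

    # Disease and tissue: case-insensitive substring match on the labels.
    if not any(criteria.disease_keyword.lower() in d for d in _labels(record, "disease")):
        return False
    if not any(criteria.tissue_keyword.lower() in t for t in _labels(record, "tissue")):
        return False

    # Cell count: enough cells for a robust UMAP.
    return int(record.get("cell_count") or 0) >= criteria.min_cells
```

Substring matching on `tissue` is a deliberate choice here so that labels such as "skeletal muscle tissue" also qualify; matching on ontology term IDs would be stricter if the records provide them.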

### Processing Specification
Once a valid `.h5ad` dataset is identified and downloaded, the agent must:
1. Load the dataset using `scanpy`.
2. Ensure `.var.index` contains HGNC gene symbols and make them unique.
3. Subsample the data if necessary to maintain performance (e.g., at most 5,000 cells).
4. Use `scimilarity.utils.align_dataset` to match the model's expected gene space.
5. Generate SCimilarity predictions and embeddings.
6. Map the original dataset's author cell types (`author_cell_type` or `cell_type` from CELLxGENE) to the predictions using the 1:1 majority voting method developed in Quest 2.
7. Export the standard visualization JSONs (`umap_data.json`, `metadata.json`, `concordance_matrix.json`).
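
Step 6 can be illustrated without any single-cell libraries. The sketch below is one plausible reading of "1:1 majority voting" (each author cell type is mapped to the single predicted label that most of its cells received); the method actually developed in Quest 2 may differ. The function names and the `rows`/`columns`/`counts` layout of the matrix are assumptions, not the established schema of `concordance_matrix.json`.

```python
from collections import Counter, defaultdict


def majority_vote_mapping(author_labels, predicted_labels):
    """Map each author cell type to its most frequent predicted label.

    Ties are broken in favour of the label encountered first.
    """
    if len(author_labels) != len(predicted_labels):
        raise ValueError("author_labels and predicted_labels must have equal length")
    votes = defaultdict(Counter)
    for author, predicted in zip(author_labels, predicted_labels):
        votes[author][predicted] += 1
    return {author: counts.most_common(1)[0][0] for author, counts in votes.items()}


def concordance_matrix(author_labels, predicted_labels):
    """Count cells per (author label, predicted label) pair.

    Rows are author labels and columns are predicted labels, both sorted.
    """
    rows = sorted(set(author_labels))
    columns = sorted(set(predicted_labels))
    pair_counts = Counter(zip(author_labels, predicted_labels))
    counts = [[pair_counts[(r, c)] for c in columns] for r in rows]
    return {"rows": rows, "columns": columns, "counts": counts}
```

In practice the two inputs would be the per-cell author annotation column and the per-cell SCimilarity predictions, converted to lists of strings in the same cell order.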

### Implement as a Skill
After successfully executing the search and processing script, the logic used to query the CELLxGENE API based on the criteria above should be abstracted into a reusable agent "skill."
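
One way to shape such a skill is a small entry point that takes the matching rule and the data source as parameters, so the same function serves other diseases or tissues. This is a sketch under assumptions: the endpoint URL, the `dataset_id` and `cell_count` fields, and the `assets` list with `filetype` and `url` keys should all be verified against the current CELLxGENE Discover API documentation, and every function name here is hypothetical.

```python
import json
import urllib.request

# Assumed public listing endpoint; confirm against the Discover API docs.
DATASETS_URL = "https://api.cellxgene.cziscience.com/curation/v1/datasets"


def fetch_public_datasets(url=DATASETS_URL, timeout=60):
    """Download the list of dataset metadata records as parsed JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def find_datasets(is_match, fetch=fetch_public_datasets, exclude_ids=(), limit=5):
    """Return up to `limit` matching records, largest cell count first.

    is_match    -- callable taking one record and returning True or False
    fetch       -- callable returning an iterable of records (injectable for tests)
    exclude_ids -- dataset IDs to skip, e.g. ones used to train the model
    """
    excluded = set(exclude_ids)
    candidates = [
        record for record in fetch()
        if record.get("dataset_id") not in excluded and is_match(record)
    ]
    candidates.sort(key=lambda record: record.get("cell_count") or 0, reverse=True)
    return candidates[:limit]


def h5ad_url(record):
    """Return the download URL of the record's H5AD asset, or None."""
    for asset in record.get("assets") or []:
        if str(asset.get("filetype", "")).upper() == "H5AD":
            return asset.get("url")
    return None
```

Passing `fetch` as an argument keeps the skill testable offline, and `exclude_ids` is where a list of datasets already used for SCimilarity training or testing could be supplied.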

## N. Web application

{{Update the web application so that it can display a dataset of your choosing instead
of overwriting the files in web/data/}}
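
On the export side, one possible approach is to write each dataset's JSON files into its own subdirectory and keep a small index that the web application can read to offer a choice. This is a sketch only: the per-dataset folder layout, the `datasets.json` index file, and the `export_dataset` helper are hypothetical and not part of the existing application.

```python
import json
from pathlib import Path


def export_dataset(dataset_id, payloads, web_data_dir="web/data"):
    """Write one dataset's JSON files to <web_data_dir>/<dataset_id>/.

    payloads -- dict mapping a filename (e.g. "umap_data.json") to a
                JSON-serialisable object
    Also records dataset_id in <web_data_dir>/datasets.json (hypothetical index).
    """
    root = Path(web_data_dir)
    out_dir = root / dataset_id
    out_dir.mkdir(parents=True, exist_ok=True)

    for filename, payload in payloads.items():
        (out_dir / filename).write_text(json.dumps(payload), encoding="utf-8")

    # Keep a sorted, de-duplicated list of available dataset IDs.
    index_path = root / "datasets.json"
    known = []
    if index_path.exists():
        known = json.loads(index_path.read_text(encoding="utf-8"))
    if dataset_id not in known:
        known.append(dataset_id)
    index_path.write_text(json.dumps(sorted(known), indent=2), encoding="utf-8")
    return out_dir
```

With this layout, exporting a second dataset leaves the first one's files untouched, and the front end only needs to load the index and then the files under the selected ID.
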
copyright: © 2026 Sonia Timberlake & Ryan Bellmore
license: Proprietary - Authorized Workshop Participants Only
distribution_allowed: false