# 03_QUEST_NEW_DATA

Great, we now have a working recreation of Figure 3 using real data. Next, let's
find some new datasets to display.

## 1. Objective and goal

Using the same data repositories and exclusion criteria as the original paper,
find a novel dataset that was not included in the training or testing
of the SCimilarity model.

*   **Data Acquisition:** Write a script using the cellxgene_census API to identify and download a human heart single-cell dataset that was not part of the SCimilarity training set. Save this as data/heart_data.h5ad.
*   **Processing Pipeline:** Create a script (e.g., src/process_heart_data.py) to process the data:
    *   Load the heart .h5ad file.
    *   Utilize the SCimilarity foundation model to generate cell type predictions.
    *   Perform label harmonization to align predicted types with the author's original annotations.
    *   Export the resulting UMAP coordinates and metadata to web/data/heart_data.json.
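
The harmonization and export steps above can be sketched with the standard library alone. This is a minimal illustration, not the workshop's reference solution: the entries in `LABEL_MAP`, the function names, and the JSON layout (a `cells` list with `x`, `y`, `author`, and `predicted` fields) are all assumptions, and the real format should match whatever web/visualization.js already reads for the kidney data.

```python
import json
from collections import Counter
from pathlib import Path

# Hypothetical mapping from model-predicted labels to the author's vocabulary.
# The entries are illustrative only; a real mapping must be curated per dataset.
LABEL_MAP = {
    "cardiac muscle cell": "cardiomyocyte",
    "endothelial cell of vascular tree": "endothelial cell",
}


def harmonize(predicted, label_map):
    """Map predicted labels onto the author's vocabulary.

    Labels without a mapping are trimmed, lower-cased, and kept,
    so no cell is dropped.
    """
    cleaned = [p.strip().lower() for p in predicted]
    return [label_map.get(p, p) for p in cleaned]


def concordance_counts(author, harmonized):
    """Count cells for every (author label, harmonized prediction) pair."""
    return Counter(zip(author, harmonized))


def export_json(umap_xy, author, harmonized, out_path):
    """Write one record per cell: UMAP coordinates plus both labels."""
    if not (len(umap_xy) == len(author) == len(harmonized)):
        raise ValueError("coordinates and labels must have the same length")
    cells = [
        {"x": float(x), "y": float(y), "author": a, "predicted": p}
        for (x, y), a, p in zip(umap_xy, author, harmonized)
    ]
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps({"cells": cells}, indent=2))
    return out_path
```

In a real script, the UMAP coordinates and labels would come from the processed .h5ad file; the pair counts from `concordance_counts` are the kind of input a concordance heatmap needs.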

**Creating a Reusable Skill:**
Encapsulate the data acquisition and processing steps into a reusable AI "Skill". The skill should allow an agent to accept a generic tissue name (e.g., 'lung', 'heart'), automatically query the CELLxGENE API for a valid dataset, process it through the SCimilarity model, and output the required JSON format for visualization.

An AI Skill is a specialized, bundled package of instructions, workflows, and scripts that extends an AI agent's capabilities for a specific domain. You use it to give the agent procedural knowledge to execute complex, multi-step tasks repeatedly and reliably without needing to explain the process every time. 
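
As a rough sketch of the script side of such a skill, the entry point below takes a tissue name, picks the first dataset that is not on an exclusion list, and hands it to a processing step. Everything here is an assumption made for illustration: the names `run_skill`, `pick_dataset`, and `SkillResult`, the `<tissue>_data.json` naming (extrapolated from heart_data.json), and the idea that the SCimilarity training dataset IDs are available as a set. `list_census_dataset_ids` shows one possible way to query the Census through `cellxgene_census`; it needs the package and network access, and it is not exercised by the rest of the sketch.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Optional


@dataclass
class SkillResult:
    tissue: str
    dataset_id: str
    json_path: Path


def pick_dataset(candidates: Iterable[str], excluded: set) -> Optional[str]:
    """Return the first candidate ID that is not in the exclusion set."""
    for dataset_id in candidates:
        if dataset_id not in excluded:
            return dataset_id
    return None


def list_census_dataset_ids(tissue: str) -> list:
    """List dataset IDs holding primary human cells for a tissue (needs network)."""
    import cellxgene_census  # lazy import so the rest runs without the package

    with cellxgene_census.open_soma() as census:
        obs = (
            census["census_data"]["homo_sapiens"]
            .obs.read(
                value_filter=f"tissue_general == '{tissue}' and is_primary_data == True",
                column_names=["dataset_id"],
            )
            .concat()
            .to_pandas()
        )
    return sorted(set(obs["dataset_id"].astype(str)))


def run_skill(tissue, excluded, list_ids, process, out_dir="web/data"):
    """Select an unseen dataset for `tissue` and run `process` on it.

    `list_ids(tissue)` returns candidate dataset IDs; `process(dataset_id,
    json_path)` is expected to download, annotate, and export the data.
    """
    tissue = tissue.strip().lower()
    if not tissue:
        raise ValueError("tissue name must not be empty")
    dataset_id = pick_dataset(list_ids(tissue), excluded)
    if dataset_id is None:
        raise LookupError(f"no unseen dataset found for tissue {tissue!r}")
    json_path = Path(out_dir) / f"{tissue}_data.json"
    process(dataset_id, json_path)
    return SkillResult(tissue, dataset_id, json_path)
```

Passing the lookup and processing steps in as callables keeps the entry point testable with stubs, for example `run_skill("heart", excluded_ids, list_census_dataset_ids, process_fn)`, where `excluded_ids` and `process_fn` are supplied by the caller.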

## 2. Web application

*   **UI Update:** Add a dataset selector (dropdown) to web/index.html allowing users to choose between "Kidney" and "Heart".
*   **Dynamic Loading:** Update web/visualization.js to fetch the appropriate JSON file (web/data/kretzler_kidney.json or web/data/heart_data.json) based on the user's selection and update the Plotly visualizations (UMAP and Concordance Heatmap) without a full page reload.
*   **Consistency:** Ensure the "Starry Night" aesthetic is preserved across both datasets.

copyright: © 2026 Sonia Timberlake & Ryan Bellmore
license: Proprietary - Authorized Workshop Participants Only
distribution_allowed: false
