# 03_QUEST_NEW_DATA SPEC

## Instructions:
You will EDIT this SPEC so that your agent builds on your work from Quest 02. You will process a completely novel dataset, upgrade your web application to handle dynamic dataset selection, and abstract your processing script into a reusable command-line tool (a "skill").

You can overwrite anything inside `{{}}` to tailor how Gemini works with you, though based on your previous answers, this SPEC is already mostly complete.

---

## 1. GOALS and Background

**Objective:** Process a novel dataset (not seen in the SCimilarity training corpus) to test the model's zero-shot capabilities. Upgrade the web application to dynamically load and visualize multiple datasets. Finally, refactor the data processing pipeline into a reusable CLI tool.

### Deliverables:
1. A refactored Python CLI tool (`src/run_scimilarity_pipeline.py`) that can take *any* `.h5ad` file and output web-ready JSON payloads.
2. An upgraded `web/index.html` and `web/visualization.js` that support dynamic dataset switching via a top-level dropdown.
3. Successful processing and visualization of the GSE282570 (Small Intestine Celiac Disease) dataset alongside the original Kidney dataset.

---

## 2. IMPLEMENTATION

### Data Processing & Pipeline Abstraction (The "Skill")
We need to transition from a hardcoded script to a flexible tool.
- **CLI Tool:** Create `src/run_scimilarity_pipeline.py`. It must accept command-line arguments:
  - `--input`: Path to the input `.h5ad` file.
  - `--out-dir`: The specific output directory for the generated JSON files (e.g., `web/data/celiac`).
  - `--dataset-name`: A display name for the dataset.
- **UMAP Generation:** The novel dataset lacks pre-calculated UMAPs. The script must use the SCimilarity model's internal representations (or simulated equivalents for the workshop) to calculate UMAP coordinates for the cells.
- **Rigorous Ontology Mapping:** Continue to use the "Standardized Ontology Mapping" approach. The script must attempt to map both the novel dataset's author labels and the SCimilarity predictions to Cell Ontology (CL) IDs.
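
The command-line interface described above could be sketched with `argparse` as follows. This is a minimal sketch, not the required implementation: the flag names come from this SPEC, while the helper names `build_parser` and `parse_args`, the choice to make all three flags required, and the `.h5ad` suffix check are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags named in this SPEC."""
    parser = argparse.ArgumentParser(
        prog="run_scimilarity_pipeline.py",
        description="Convert an .h5ad file into web-ready JSON payloads.",
    )
    # Assumption: all three flags are required; the SPEC does not say.
    parser.add_argument("--input", type=Path, required=True,
                        help="Path to the input .h5ad file.")
    parser.add_argument("--out-dir", type=Path, required=True,
                        help="Output directory for the JSON files, "
                             "e.g. web/data/celiac.")
    parser.add_argument("--dataset-name", required=True,
                        help="Display name for the dataset.")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse argv (defaults to sys.argv[1:]) and apply a basic sanity check."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: reject anything that does not end in .h5ad.
    if args.input.suffix != ".h5ad":
        parser.error(f"--input must be an .h5ad file, got: {args.input}")
    return args


# Example call with an explicit argument list; no files are read or written.
example = parse_args([
    "--input", "data/GSE282570.h5ad",
    "--out-dir", "web/data/celiac",
    "--dataset-name", "Celiac Intestine",
])
print(example.input, example.out_dir, example.dataset_name)
```

Note that `argparse` exposes `--out-dir` and `--dataset-name` as `args.out_dir` and `args.dataset_name`. Passing an explicit list to `parse_args` keeps the parser easy to test without touching `sys.argv`.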

### Web Architecture Upgrades
To support multiple datasets, the data directory structure must be updated:

```text
web/data/
├── datasets.json           # Master index: [{"id": "kidney", "name": "KPMP Kidney"}, {"id": "celiac", "name": "Celiac Intestine"}]
├── kidney/                 # Output from Quest 02
│   ├── umap_data.json
│   ├── metadata.json
│   └── concordance_matrix.json
└── celiac/                 # Output from Quest 03
    ├── umap_data.json
    ├── metadata.json
    └── concordance_matrix.json
```
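
One way to keep the master index in sync with the layout above is to have the pipeline register each dataset after writing its JSON files. The sketch below is hypothetical: `register_dataset` is an illustrative helper name, and deriving the dataset `id` from the output directory name (e.g. `celiac`) is an assumption, not something this SPEC mandates.

```python
import json
from pathlib import Path


def register_dataset(data_root: Path, dataset_id: str, name: str) -> list:
    """Add or update one entry in <data_root>/datasets.json; return the index."""
    data_root.mkdir(parents=True, exist_ok=True)
    index_path = data_root / "datasets.json"
    # Start from the existing index if there is one, else from an empty list.
    entries = json.loads(index_path.read_text()) if index_path.exists() else []
    for entry in entries:
        if entry.get("id") == dataset_id:
            # Rerunning the pipeline updates the name instead of duplicating.
            entry["name"] = name
            break
    else:
        entries.append({"id": dataset_id, "name": name})
    index_path.write_text(json.dumps(entries, indent=2))
    return entries
```

A call such as `register_dataset(Path("web/data"), "celiac", "Celiac Intestine")` would then append the Celiac entry while leaving an existing Kidney entry untouched.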

#### UI/UX Changes
- Add a new "Select Dataset" dropdown element at the top of the page, either directly above or directly below the main title.
- The dropdown should be populated dynamically by fetching `web/data/datasets.json` when the page loads.
- When the user changes the dataset, the `visualization.js` script must clear the current plots, fetch the JSON files from the appropriate subdirectory, and re-render everything using the new data.

---

## 3. INPUTS

**Data Sources:**
- **SCimilarity Model:** Local path `/data/models/model_v1.1`
- **Novel Dataset:** GSE282570 (Small Intestine Celiac Disease)
  - Location: `data/GSE282570.h5ad`

Use the existing project structure and virtual environment.

---

## 4. OUTPUTS

- A master `web/data/datasets.json` file.
- Populated `web/data/celiac/` and `web/data/kidney/` directories.
- A fully functional, multi-dataset web application running on port 8000.

---

## 5. TESTS

### VALIDATION Tests (Accuracy & Science)
- *Note: Because this is a novel dataset evaluating zero-shot capabilities on unseen disease biology (Celiac), we are explicitly bypassing strict concordance validation thresholds. We expect the model to have higher error rates here than on the Kidney dataset.*
- No automated validation tests are required for the Celiac dataset output.

### VERIFICATION (Correct execution)
- Check that the `src/run_scimilarity_pipeline.py` CLI tool runs to completion without raising Python exceptions.
- Check that the web server serves the application and that switching datasets via the new dropdown successfully re-renders the UMAPs and Heatmap without JavaScript console errors.
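
Part of the verification above can be automated with a static check of the generated files. The following is a minimal sketch under stated assumptions: `check_outputs` and `EXPECTED_FILES` are hypothetical names, and the check only confirms that each dataset listed in `datasets.json` has the three JSON files from the directory layout in Section 2 and that they parse. It does not replace opening the page in a browser and watching the JavaScript console.

```python
import json
from pathlib import Path

# File names taken from the directory layout in Section 2.
EXPECTED_FILES = ("umap_data.json", "metadata.json", "concordance_matrix.json")


def check_outputs(data_root: Path) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    index_path = data_root / "datasets.json"
    try:
        datasets = json.loads(index_path.read_text())
    except (OSError, json.JSONDecodeError) as exc:
        return [f"cannot read {index_path}: {exc}"]

    problems = []
    for entry in datasets:
        for filename in EXPECTED_FILES:
            path = data_root / entry["id"] / filename
            try:
                json.loads(path.read_text())
            except OSError:
                problems.append(f"missing file: {path}")
            except json.JSONDecodeError:
                problems.append(f"invalid JSON: {path}")
    return problems
```

Running `check_outputs(Path("web/data"))` after the pipeline finishes and treating any non-empty result as a failure would give a quick signal before the manual dropdown test.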

--- 
copyright: © 2026 Sonia Timberlake & Ryan Bellmore
license: Proprietary - Authorized Workshop Participants Only
distribution_allowed: false
