# 02_QUEST_FIG3 SPEC

## Instructions:
You will EDIT this SPEC so your agent will analyze real single-cell data and create 3+ figures (replacing the mock data and placeholder figures generated in 01_QUEST_START_HERE).  
You should also write a series of tests to:
- verify the code works
- validate the code's results reflect the underlying biology

In this document, overwrite anything in {{}} to get Gemini to work with you. 

(We have filled in other sections to save you time and focus your attention on the more valuable educational aspects.)

This is a suggested SPEC structure, but feel free to add or change content.


---

## 1. GOALs and Background 

**Objective:** Start a single-cell analysis with the SCimilarity foundation model by creating cell *embeddings* and *labels* from new scRNA data. **Reproduce published results: SCimilarity paper Figure 3**, comparing cell *labels* from SCimilarity predictions against held-out, author-annotated results to assess model processing and output. 

### Figure 3 description: 
- **3b (Author Annotations):** UMAP 2D scatter plot showing the author cell type annotations from Figure 3b in the Heimberg et al. paper.
- **3c (SCimilarity Predictions):** UMAP 2D scatter plot showing the cell types predicted by the SCimilarity foundation model.
- **3d (Concordance Heatmap):** Heatmap comparing the author annotations with the SCimilarity predictions. Higher concordance is a darker color. Author annotations are in columns and SCimilarity predictions are in rows. Normalize percentages such that each column (author annotation) sums to 100%. Display each non-zero percentage inside its heatmap cell in a color that contrasts with the cell's fill.
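
The column normalization described for panel 3d can be sketched in plain Python. This is a minimal illustration, not the paper's code: the function name `concordance_percentages` and the toy labels are assumptions made for the example.

```python
from collections import Counter


def concordance_percentages(author_labels, predicted_labels):
    """Return {predicted: {author: percent}}; each author column sums to 100."""
    if len(author_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    # Count cells per (row, column) pair and per column (author annotation).
    pair_counts = Counter(zip(predicted_labels, author_labels))
    author_totals = Counter(author_labels)

    table = {}
    for (predicted, author), count in pair_counts.items():
        # Divide by the column total so that every author column sums to 100%.
        table.setdefault(predicted, {})[author] = 100.0 * count / author_totals[author]
    return table


# Toy labels, invented purely for illustration.
author = ["podocyte", "podocyte", "podocyte", "podocyte", "T cell", "T cell"]
predicted = ["podocyte", "podocyte", "podocyte", "T cell", "T cell", "T cell"]
table = concordance_percentages(author, predicted)
```

Only non-zero cells are stored, which matches the requirement to print only non-zero percentages. With pandas, `pd.crosstab(predicted, author, normalize="columns") * 100` should give the same layout as a DataFrame.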

### Visual Style:
- **Color Palette:** Use the palette inspired by Salvador Dalí's "The Hallucinogenic Toreador" (Gold: #D4AF37, Brown: #8B4513, Platinum: #E5E4E2, Slate: #2F4F4F, Red: #CD5C5C, Blue: #4682B4). Apply these colors consistently across UMAP markers, heatmap scales, and the web UI.


### Deliverables:
1. Python scripts that extract author annotations and calculate SCimilarity predictions from a real dataset
2. Tests that verify and validate the output data

## 2. IMPLEMENTATION

### BIOLOGY
#### **Cell label terminology** might need standardizing
   - We need a harmonized way, based on the paper, to compare SCimilarity model output, Fig 3 annotations, and author annotations.
   - **Methodology chosen: Canonical Ontology Mapping via LLM.**
   - Use an LLM to map both Author annotations and SCimilarity model outputs to standardized names in the Cell Ontology (CL). 
   - **Fallback:** If a SCimilarity output or Author annotation cannot be mapped to the Cell Ontology, retain and use the original output label.
   - This approach provides scientific rigor while ensuring no data is lost due to mapping failures.
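
The fallback rule above can be sketched as follows. The LLM call itself is out of scope here: `cl_mapping` stands in for whatever label-to-ontology table the LLM step produces, and both it and the function name `harmonize_labels` are hypothetical. The mapping entries are illustrative and should be checked against the Cell Ontology.

```python
def harmonize_labels(labels, cl_mapping):
    """Map labels to Cell Ontology names, keeping the original label when unmapped.

    cl_mapping: dict from lower-cased raw label to a CL name (assumed to come
    from the LLM mapping step).
    Returns the harmonized labels and the sorted list of labels that fell back.
    """
    harmonized = []
    unmapped = set()
    for label in labels:
        mapped = cl_mapping.get(label.strip().lower())
        if mapped:
            harmonized.append(mapped)
        else:
            # Fallback: retain the original label so that no cell is dropped.
            harmonized.append(label)
            unmapped.add(label)
    return harmonized, sorted(unmapped)


# Illustrative mapping only; not verified against the Cell Ontology.
example_mapping = {"pod": "podocyte", "t cells": "T cell"}
harmonized, unmapped = harmonize_labels(["POD", "T cells", "Mystery cluster 7"], example_mapping)
```

Returning the unmapped labels alongside the harmonized ones makes it easy to report how often the fallback was used.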


### Tech stack and project structure.

#### Languages
- Use Python for most processing {{ Edit if author prefers R}}

#### Architecture
- Use the existing project structure as outlined in 01_QUEST_START_HERE.md.

#### Interactive plot infrastructure
1. **Local File Serving:** 
   - *Gotcha:* Standard Python `http.server` often fails to resolve symlinked data directories outside its root, leading to 404s.
   - *Fix:* Physically copy the JSON files: `mkdir -p web/data && cp results/data_subsample/* web/data/` before serving.
2. **Plotly Performance:**
   - *Gotcha:* 10,000 points using standard `scatter` SVG rendering is slow.
   - *Fix:* Use `type: 'scattergl'` for WebGL hardware acceleration.
3. **Interactive Highlighting:**
   - *Gotcha:* Updating the data arrays for every dropdown change is too slow.
   - *Fix:* Group cells into separate Plotly traces by cell type. To highlight, change `marker.opacity` to `0.05` for non-matching traces and keep it at full opacity for the matching trace.
4. Do not invent new file formats for biological data. Use standard formats where possible. 
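
As a rough sketch of items 2 and 3, the Python side could write one `scattergl` trace per cell type into the JSON that the web page loads. The input schema (dicts with `x`, `y`, `cell_type`) and the function name `build_traces` are assumptions for this example, not the existing project format.

```python
import json


def build_traces(cells, highlight=None):
    """Group cells into one Plotly scattergl trace per cell type.

    cells: iterable of dicts with keys "x", "y" and "cell_type" (assumed schema).
    highlight: optional cell type; every other trace is dimmed to opacity 0.05.
    """
    grouped = {}
    for cell in cells:
        trace = grouped.setdefault(
            cell["cell_type"],
            {
                "type": "scattergl",  # WebGL rendering instead of SVG
                "mode": "markers",
                "name": cell["cell_type"],
                "x": [],
                "y": [],
                "marker": {"opacity": 1.0},
            },
        )
        trace["x"].append(cell["x"])
        trace["y"].append(cell["y"])

    for name, trace in grouped.items():
        if highlight is not None and name != highlight:
            trace["marker"]["opacity"] = 0.05
    return list(grouped.values())


# Toy coordinates, invented for illustration.
cells = [
    {"x": 0.0, "y": 0.0, "cell_type": "podocyte"},
    {"x": 1.0, "y": 1.0, "cell_type": "T cell"},
    {"x": 2.0, "y": 2.0, "cell_type": "podocyte"},
]
traces = build_traces(cells, highlight="podocyte")
traces_json = json.dumps(traces)
```

In the browser, the dropdown handler would then only need to change each trace's `marker.opacity` (for example with `Plotly.restyle`) rather than rebuild the coordinate arrays.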


## 3. INPUTS

**Data Sources:**
- **SCimilarity Model:** Local path `/data/models/model_v1.1` (Symlinked to `models/model_v1.1` in the workspace).
- **Kidney Dataset:** Kretzler/KPMP kidney dataset is used in Figure 3.
  - Source: `gs://rb-wkshp-bioit26-data/data/kretzler_kidney.h5ad`
  - Destination: `data/kretzler_kidney.h5ad`


Use the existing Python virtualenv installed in this directory.

## 4. OUTPUTS

Use the JSON data in web/data/ to understand the structure of the output data. Document that structure here.

### TABLEs
- Create a data table (`cell_type_counts.csv`) containing the counts of each cell type.
- Columns: `cell_type`, `author_count`, `scimilarity_count`.
- Rows: One row for each harmonized cell type.
- **Web UI Update:** Display this table directly in the web application (`index.html`), positioned below the LLM interpretation summary.
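
A minimal sketch of how `cell_type_counts.csv` could be produced with the standard library. The function name `write_cell_type_counts` and the toy labels are assumptions; the column names follow the list above.

```python
import csv
import io
from collections import Counter


def write_cell_type_counts(author_labels, scimilarity_labels, handle):
    """Write one row per harmonized cell type with both counts."""
    author_counts = Counter(author_labels)
    scimilarity_counts = Counter(scimilarity_labels)

    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["cell_type", "author_count", "scimilarity_count"])
    # Use the union so a type seen by only one source still gets a row (count 0).
    for cell_type in sorted(set(author_counts) | set(scimilarity_counts)):
        writer.writerow([cell_type, author_counts[cell_type], scimilarity_counts[cell_type]])


# Toy labels, invented for illustration; a real run would pass an open file.
buffer = io.StringIO()
write_cell_type_counts(
    ["B cell", "podocyte", "podocyte"],
    ["podocyte", "podocyte", "T cell"],
    buffer,
)
csv_text = buffer.getvalue()
```

To write the actual file, pass `open("cell_type_counts.csv", "w", newline="")` as `handle` instead of the in-memory buffer.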

### Figure 3 test and interpretation: 
- Use an LLM to generate a scientific summary comparing the Author and SCimilarity visualizations.
- Focus specifically on identifying and explaining any large clusters of cells in the UMAP plot where the Author and SCimilarity annotations differ significantly.
- **Web UI Update:** Display this LLM-generated summary directly in the web application (`index.html`), positioned directly below the concordance heatmap.

## 5. TESTS

### VALIDATION Tests (Accuracy & Science)

- **Kidney Tissue Verification:** Verify that all cell types identified by SCimilarity (after ontology mapping) are biologically plausible cell types found within human kidney tissue.
  - Automated check: Cross-reference identified cell types against a list of known kidney cell types (e.g., from the KPMP or Human Protein Atlas).
- **Concordance Check:** Calculate the percentage of cells where Author and SCimilarity labels match (global concordance) and ensure it meets a reasonable threshold for this dataset.
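
Both validation checks reduce to small pure functions, sketched below. The function names are hypothetical, and the allow-list here is a toy stand-in; a real list would be curated from a source such as KPMP or the Human Protein Atlas. The pass threshold is deliberately left out because the SPEC does not fix one.

```python
def global_concordance(author_labels, predicted_labels):
    """Percentage of cells whose author and predicted labels are identical."""
    if not author_labels or len(author_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == p for a, p in zip(author_labels, predicted_labels))
    return 100.0 * matches / len(author_labels)


def implausible_cell_types(predicted_labels, kidney_cell_types):
    """Predicted cell types that are missing from the kidney allow-list."""
    return sorted(set(predicted_labels) - set(kidney_cell_types))


# Toy data, invented for illustration only.
author = ["podocyte", "podocyte", "T cell", "T cell"]
predicted = ["podocyte", "podocyte", "T cell", "hepatocyte"]
toy_allow_list = {"podocyte", "T cell"}

concordance = global_concordance(author, predicted)
flagged = implausible_cell_types(predicted, toy_allow_list)
```

A validation test would assert that `flagged` is empty and that `concordance` is at or above whatever threshold is chosen for this dataset.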

### VERIFICATION (Correct execution)

- Write a small text file summarizing the input data, including structure, dimensions and how author data can be used for our task. 

#### Unit tests (working, reproducible code)

Add a suite of unit tests in tests/unit/ that run with the standard
library unittest package. These tests should aim for >75% test coverage
of the Python files in src/.

- All APIs should be mocked
- Create a set of mock data from a file in data/ that is structurally identical to, but much smaller than, the real data to aid with testing
- Write a testing README with an overview of the testing strategy and coverage
- Check that the tests pass
- Write a series of tests that ensure that the server in web/ can be run without errors, including JavaScript errors.
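
As one possible shape for a mocked-API unit test, the sketch below replaces an LLM client with `unittest.mock.Mock`. Everything about the client is an assumption: `map_label_with_llm` and the `client.complete(prompt)` method are hypothetical names used only to show the pattern, not the interface of any real SDK.

```python
import unittest
from unittest.mock import Mock


def map_label_with_llm(label, client):
    """Hypothetical wrapper: ask an LLM client for a CL name, fall back to the input."""
    try:
        answer = client.complete(f"Map this cell type to a Cell Ontology name: {label}")
    except Exception:
        # Any API failure triggers the fallback to the original label.
        return label
    answer = (answer or "").strip()
    return answer or label


class MapLabelTests(unittest.TestCase):
    def test_uses_llm_answer(self):
        client = Mock()
        client.complete.return_value = " podocyte \n"
        self.assertEqual(map_label_with_llm("Podo", client), "podocyte")
        client.complete.assert_called_once()

    def test_falls_back_when_llm_fails(self):
        client = Mock()
        client.complete.side_effect = RuntimeError("offline")
        self.assertEqual(map_label_with_llm("Podo", client), "Podo")
```

Placed in a file such as `tests/unit/test_mapping.py` (an assumed name), this runs with `python -m unittest discover -s tests/unit` and never touches the network.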


--- 
copyright: © 2026 Sonia Timberlake & Ryan Bellmore
license: Proprietary - Authorized Workshop Participants Only
distribution_allowed: false