# 02_QUEST_FIG3 SPEC

> **Status:** ✅ COMPLETED & IMPLEMENTED. 
> - Real dataset (`kretzler_kidney.h5ad`) processed.
> - Author labels mapped 1:1 strictly to the 22 labels published in Heimberg et al. Figure 3.
> - Global concordance verified at >84%.
> - Unit and biological validation tests passed.
> - Web UI updated with automated margins to prevent overlapping labels.

## Instructions:
You will EDIT this SPEC so your agent will analyze real single-cell data and create 3+ figures (replacing the mock data and placeholder figures generated in 01_QUEST_START_HERE).  
You should also write a series of tests to:
- verify the code works
- validate the code's results reflect the underlying biology

In this document, overwrite anything in {{}} to get Gemini to work with you. 

(We have filled in other sections to save you time and focus your attention on the more valuable educational aspects.)

This is a suggested SPEC structure, but feel free to add or change content.


---

## 1. GOALs and Background 

**Objective:** Start a single-cell analysis with the SCimilarity foundation model by creating cell *embeddings* and *labels* from new scRNA-seq data. **Reproduce published results: SCimilarity paper Figure 3**, comparing cell *labels* from SCimilarity predictions against held-out, author-annotated results to assess model processing and output. 

### Figure 3 description: 
- **3b (Author Annotations):** A UMAP embedding of cells from a held-out kidney scRNA-seq dataset, visualized using SCimilarity's latent representation. The points are colored according to the **original expert labels** provided by the study authors, showing distinct clusters for various kidney cell types.
- **3c (SCimilarity Predictions):** The same UMAP embedding as in 3b, but points are colored by **SCimilarity's automated cell type predictions**. The visual overlap illustrates how well the model generalizes to annotate new tissue data.
- **3d (Concordance Heatmap):** A confusion matrix quantifying the agreement between SCimilarity's predicted labels (rows) and the authors' original annotations (columns). The high diagonal intensity demonstrates strong concordance.


### Deliverables:
1. Python scripts that extract author annotations and compute SCimilarity predictions from a real dataset
2. Tests that verify and validate the output data

## 2. IMPLEMENTATION

### BIOLOGY
#### **Cell label terminology** might need standardizing
   - We need a harmonized way, based on the paper, to compare SCimilarity model output, Fig 3 annotations, and author annotations.
   - **Methodology chosen:** A hybrid approach. We will primarily use **Strict Ontology Mapping** (e.g., using `pronto` or similar tools to map author labels to valid Cell Ontology terms output by the model). For any labels that fail this strict mapping, we will fall back to **Lexical Matching (Fuzzy)** using string similarity algorithms to find the closest match.
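
The two-stage fallback described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the strict stage here is a plain exact match on normalized strings standing in for a real Cell Ontology lookup (for example via `pronto`), and the function names, the `cutoff` value, and the sample labels are all assumptions.

```python
import difflib


def normalize(label: str) -> str:
    """Lowercase and collapse separators so trivially different spellings compare equal."""
    return " ".join(label.lower().replace("_", " ").replace("-", " ").split())


def harmonize(author_label, model_labels, cutoff=0.6):
    """Map one author label onto the model vocabulary.

    Returns (model_label_or_None, method), where method is 'exact', 'fuzzy',
    or 'unmapped'. The 'exact' stage is a stand-in for strict ontology mapping.
    """
    by_norm = {normalize(m): m for m in model_labels}
    key = normalize(author_label)

    # Stage 1: strict match (assumption: replaced by an ontology lookup in practice).
    if key in by_norm:
        return by_norm[key], "exact"

    # Stage 2: lexical fallback using difflib's similarity ratio.
    close = difflib.get_close_matches(key, list(by_norm), n=1, cutoff=cutoff)
    if close:
        return by_norm[close[0]], "fuzzy"

    return None, "unmapped"


if __name__ == "__main__":
    vocabulary = ["B cell", "podocyte", "kidney proximal convoluted tubule epithelial cell"]
    for raw in ["b_cell", "podocytes", "zzz"]:
        print(raw, "->", harmonize(raw, vocabulary))
```

Recording the `method` alongside each match makes it easy to review every fuzzy or unmapped label by hand before it reaches the concordance calculation.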


### Tech stack and project structure.

#### Languages
- Use python for most processing {{ Edit if author prefers R}}

#### Architecture
- Use the existing project structure as outlined in 01_QUEST_START_HERE.md.

#### Interactive plot infrastructure
1. **Local File Serving:** 
   - *Gotcha:* Standard Python `http.server` often fails to resolve symlinked data directories outside its root, leading to 404s.
   - *Fix:* Physically copy the JSON files: `mkdir -p web/data && cp results/data_subsample/* web/data/` before serving.
2. **Plotly Performance:**
   - *Gotcha:* 10,000 points using standard `scatter` SVG rendering is slow.
   - *Fix:* Use `type: 'scattergl'` for WebGL hardware acceleration.
3. **Interactive Highlighting:**
   - *Gotcha:* Updating the data arrays for every dropdown change is too slow.
   - *Fix:* Group cells into separate Plotly traces by cell type. To highlight, change `marker.opacity` to `0.05` for non-matching traces and keep it at full opacity for the matching trace.
4. Do not invent new file formats for biological data. Use standard formats where possible. 
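
As a rough sketch of items 2 and 3, the Python side could emit one `scattergl` trace per cell type as plain JSON-serializable dicts, plus a per-trace opacity list for the highlighting step. The input record keys (`x`, `y`, `cell_type`) and both function names are hypothetical; the actual JSON layout in `web/data/` may differ.

```python
from collections import defaultdict


def build_traces(cells):
    """Group cells into one WebGL trace per cell type.

    `cells` is an iterable of dicts with the hypothetical keys
    'x', 'y', and 'cell_type'.
    """
    grouped = defaultdict(lambda: {"x": [], "y": []})
    for cell in cells:
        grouped[cell["cell_type"]]["x"].append(cell["x"])
        grouped[cell["cell_type"]]["y"].append(cell["y"])

    return [
        {
            "type": "scattergl",  # WebGL rendering instead of SVG
            "mode": "markers",
            "name": name,
            "x": points["x"],
            "y": points["y"],
            "marker": {"opacity": 1.0},
        }
        for name, points in sorted(grouped.items())
    ]


def highlight_opacities(traces, selected, dim=0.05):
    """Return one opacity per trace; `selected=None` restores full opacity everywhere."""
    return [1.0 if selected is None or t["name"] == selected else dim for t in traces]
```

Because only one opacity value per trace changes, the browser can restyle the existing traces instead of receiving new coordinate arrays on every dropdown change.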


## 3. INPUTS

**Data Sources:**
- **SCimilarity Model:** Local path `/data/models/model_v1.1` (Symlinked to `models/model_v1.1` in the workspace).
- **Kidney Dataset:** Kretzler/KPMP kidney dataset is used in Figure 3.
  - Source: `gs://rb-wkshp-bioit26-data/data/kretzler_kidney.h5ad`
  - Destination: `data/kretzler_kidney.h5ad`

Use the existing project structure as outlined in 01_QUEST_START_HERE.md.

Use the existing python virtualenv installed in this directory.

## 4. OUTPUTS

Use the json data in web/data/ to understand the structure of the output data. Document that here.

### TABLEs
- Write processing results to files (csv or appropriate format) for additional testing/inspection.
- **Additional Outputs:** 
  1. A CSV file documenting the label harmonizations (`mapping.csv`): Author Label -> Model Label -> Harmonized Label.
  2. A metrics table (`concordance_metrics.csv`) detailing the concordance score (accuracy) per cell type.
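
A minimal sketch of how the per-cell-type metrics table might be computed and written, assuming the author and predicted labels have already been harmonized into two parallel lists. The column names `cell_type` and `concordance` and the function names are illustrative choices, not a fixed schema.

```python
import csv
from collections import Counter


def per_type_concordance(author, predicted):
    """Fraction of cells of each author label whose prediction matches that label."""
    total = Counter(author)
    correct = Counter(a for a, p in zip(author, predicted) if a == p)
    return {label: correct[label] / total[label] for label in sorted(total)}


def write_metrics(handle, metrics):
    """Write the metrics as CSV rows to an open text file handle."""
    writer = csv.writer(handle)
    writer.writerow(["cell_type", "concordance"])
    for label, score in metrics.items():
        writer.writerow([label, f"{score:.4f}"])
```

When writing to disk, open the file with `newline=""` as the `csv` module documentation recommends, for example `open("concordance_metrics.csv", "w", newline="")`.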

### Figure 3 test and interpretation: 
- **Automated Interpretation:** Gemini will provide a script (or function) that analyzes the generated concordance matrix to identify and print the "top 3 most confused" cell types (e.g., where the model frequently predicted label X instead of the author's label Y). This will prompt users to investigate the underlying biology behind these specific discrepancies.
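
One possible shape for that helper, assuming harmonized author and predicted labels are available as parallel lists rather than as a prebuilt matrix. The function name and the printed format are illustrative only.

```python
from collections import Counter


def top_confusions(author, predicted, n=3):
    """Return the n most frequent (author_label, predicted_label) mismatches with counts."""
    mismatches = Counter((a, p) for a, p in zip(author, predicted) if a != p)
    return mismatches.most_common(n)


def print_confusions(confusions):
    """Print each confusion as a prompt for biological follow-up."""
    for (author_label, predicted_label), count in confusions:
        print(f"{count} cells annotated '{author_label}' were predicted as '{predicted_label}'")
```

Counting off-diagonal pairs directly is equivalent to ranking the off-diagonal entries of the concordance matrix by raw count; ranking by row-normalized rate instead would surface confusions in rare cell types.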


## 5. TESTS

### VALIDATION Tests (Accuracy & Science)

- **Biological Accuracy:** Write a test to calculate the global concordance score across all cells. To confirm we have "built the right thing" and match the published biology, this test should assert that the overall concordance is >80% (approaching the ~86.5% reported in the paper).
- **Lineage Distinction:** Write an automated test to verify that major, biologically distinct lineages (e.g., Immune cells vs. Epithelial cells) have a confusion rate of <1%. If the model heavily confuses drastically different cell types, it indicates a fundamental mapping or processing error.
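
The two quantities these tests assert on could be computed along the following lines. This is a sketch under assumptions: labels are already harmonized, `lineage_of` is a hypothetical dict mapping each cell type to a coarse lineage name, and "confusion rate" is taken to mean the fraction of cells from either lineage that were predicted into the other one.

```python
def global_concordance(author, predicted):
    """Fraction of all cells whose predicted label equals the author label."""
    if not author:
        raise ValueError("no cells to score")
    return sum(a == p for a, p in zip(author, predicted)) / len(author)


def lineage_confusion_rate(author, predicted, lineage_of, lineage_a, lineage_b):
    """Fraction of cells in lineage_a or lineage_b that were predicted into the other lineage."""
    relevant = 0
    crossed = 0
    for a, p in zip(author, predicted):
        source = lineage_of.get(a)
        if source not in (lineage_a, lineage_b):
            continue  # cell belongs to neither lineage under test
        relevant += 1
        target = lineage_of.get(p)
        if target in (lineage_a, lineage_b) and target != source:
            crossed += 1
    return crossed / relevant if relevant else 0.0
```

A validation test would then assert `global_concordance(...) > 0.80` and `lineage_confusion_rate(..., "immune", "epithelial") < 0.01` on the real outputs.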

### VERIFICATION (Correct execution)

- Write a small text file summarizing the input data, including structure, dimensions and how author data can be used for our task. 

#### Unit tests (working, reproducible code)

Add a suite of unittests in tests/unit/ that run with the standard
library unittest package. These tests should aim for >75% test coverage
of the python files in src/.

- All APIs should be mocked
- Create a set of mock data from a file in data/ that is structurally identical to, but much smaller than, the real data to aid with testing
- Write a testing README with an overview of the testing strategy and coverage
- Check that the tests pass
- Write a series of tests that ensure the server in web/ can be run without errors, including JavaScript errors.
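
To illustrate the "all APIs should be mocked" requirement, here is a small sketch using only the standard library. `predict_labels` and the annotator's `predict` method are hypothetical stand-ins for whatever function in `src/` wraps the model; they are not part of the SCimilarity API.

```python
import unittest
from unittest import mock


def predict_labels(annotator, expression_rows):
    """Hypothetical pipeline step: delegate to an annotator and check the result shape."""
    labels = annotator.predict(expression_rows)
    if len(labels) != len(expression_rows):
        raise ValueError("annotator returned a different number of labels than cells")
    return labels


class PredictLabelsTest(unittest.TestCase):
    def setUp(self):
        # Tiny mock expression matrix: 3 cells x 2 genes.
        self.rows = [[0.0, 1.2], [3.4, 0.0], [0.5, 0.5]]

    def test_returns_one_label_per_cell(self):
        annotator = mock.Mock()
        annotator.predict.return_value = ["podocyte", "B cell", "T cell"]
        self.assertEqual(
            predict_labels(annotator, self.rows),
            ["podocyte", "B cell", "T cell"],
        )
        annotator.predict.assert_called_once_with(self.rows)

    def test_rejects_length_mismatch(self):
        annotator = mock.Mock()
        annotator.predict.return_value = ["podocyte"]
        with self.assertRaises(ValueError):
            predict_labels(annotator, self.rows)
```

Placed in a file such as `tests/unit/test_predict.py`, this would be picked up by `python -m unittest discover tests/unit` without loading the real model.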


--- 
copyright: © 2026 Sonia Timberlake & Ryan Bellmore
license: Proprietary - Authorized Workshop Participants Only
distribution_allowed: false