# 02_QUEST_FIG3 SPEC

## Instructions:
You will EDIT this SPEC so your agent will analyze real single-cell data and create 3+ figures (replacing the mock data and placeholder figures generated in 01_QUEST_START_HERE).  
You should also write a series of tests to:
- verify the code works
- validate the code's results reflect the underlying biology

In this document, overwrite anything in {{}} to get Gemini to work with you.

(We have filled in other sections to save you time and focus your attention on the more valuable educational aspects.)

This is a suggested SPEC structure, but feel free to add or change content.

---

## 1. GOALs and Background 

**Objective:** Start a single-cell analysis with the SCimilarity foundation model by creating cell *embeddings* and *labels* from new scRNA-seq data. **Reproduce published results: SCimilarity paper Figure 3**, comparing cell *labels* from SCimilarity predictions against held-out, author-annotated results to assess model processing and output.

### Figure 3 description: 
- **3b (Author Annotations):** UMAP embedding of cell profiles from SCimilarity's latent representation of a held-out kidney dataset (ref. 25 in the paper), colored by author-provided cell type annotation.
- **3c (SCimilarity Predictions):** UMAP embedding of cell profiles from SCimilarity's latent representation of the same held-out kidney dataset, colored by SCimilarity-predicted cell type annotation.
- **3d (Concordance Heatmap):** Concordance heatmap between the two annotations, colored by % concordance.

### Deliverables:
1. Python scripts that extract author annotations and calculate SCimilarity predictions from a real dataset
2. Tests that verify and validate the output data

## 2. IMPLEMENTATION

### BIOLOGY
#### **Cell label terminology** might need standardizing
   - We need a way to compare SCimilarity model output, Fig 3 annotations, and author annotations in a harmonized way, based on the paper.
   - The user has chosen **Option 2: Integrating a Formal Ontology**. We will use standard ontology terms to map the author annotations (e.g., "PT-S1/S2") and SCimilarity predicted labels to a common naming convention for fair comparison.
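
The mapping step above can be sketched as a plain lookup that fails loudly on any label without a standard term. This is a minimal sketch, not the curated mapping: the `LABEL_MAP` entries and the `harmonize` helper are hypothetical, and the real table should be built from the `cell_type` names in the `h5ad` metadata.

```python
# Hypothetical mapping table: keys and targets are illustrative placeholders,
# not the curated mapping for the kidney dataset.
LABEL_MAP = {
    "PT-S1/S2": "epithelial cell of proximal tubule",
    "TAL": "kidney loop of Henle thick ascending limb epithelial cell",
}


def harmonize(labels, mapping):
    """Map raw label strings to standard names.

    Raises KeyError listing every label that has no entry in `mapping`,
    so silent mismatches cannot leak into the concordance calculation.
    """
    cleaned = [str(label).strip() for label in labels]
    unmapped = sorted(set(cleaned) - set(mapping))
    if unmapped:
        raise KeyError(f"No standard term for labels: {unmapped}")
    return [mapping[label] for label in cleaned]
```

Apply the same function to both the author column and the SCimilarity prediction column (each with its own mapping table) so both sides end up in one vocabulary before comparison.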

### Tech stack and project structure.

#### Languages
- Use Python for most processing

#### Architecture
- Use the existing project structure as outlined in 01_QUEST_START_HERE.md.

#### Interactive plot infrastructure
1. **Local File Serving:** 
   - *Gotcha:* Standard Python `http.server` often fails to resolve symlinked data directories outside its root, leading to 404s.
   - *Fix:* Physically copy the JSON files: `mkdir -p web/data && cp results/data_subsample/* web/data/` before serving.
2. **Plotly Performance:**
   - *Gotcha:* 10,000 points using standard `scatter` SVG rendering is slow.
   - *Fix:* Use `type: 'scattergl'` for WebGL hardware acceleration.
3. **Interactive Highlighting:**
   - *Gotcha:* Updating the data arrays for every dropdown change is too slow.
   - *Fix:* Group cells into separate Plotly traces by cell type. To highlight, change `marker.opacity` to `0.05` for non-matching traces and keep it at full opacity for the matching trace.
4. Do not invent new file formats for biological data. Use standard formats where possible. 
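
Items 2 and 3 above can be prepared on the Python side before the JSON reaches the browser. The sketch below builds one `scattergl` trace dict per cell type and computes the per-trace opacity list for a selected type. The helper names `build_traces` and `highlight_opacities` are hypothetical, and the trace layout is an assumption about how the front end consumes the data.

```python
from collections import defaultdict


def build_traces(xs, ys, labels):
    """Return one Plotly trace dict per cell type, rendered with WebGL."""
    grouped = defaultdict(lambda: {"x": [], "y": []})
    for x, y, label in zip(xs, ys, labels):
        grouped[label]["x"].append(x)
        grouped[label]["y"].append(y)
    return [
        {
            "type": "scattergl",  # WebGL rendering instead of SVG
            "mode": "markers",
            "name": label,
            "x": points["x"],
            "y": points["y"],
            "marker": {"opacity": 1.0},
        }
        for label, points in sorted(grouped.items())
    ]


def highlight_opacities(traces, selected, dim=0.05):
    """Opacity per trace: full for the selected cell type, dim otherwise."""
    return [1.0 if trace["name"] == selected else dim for trace in traces]
```

On the JavaScript side, the opacity list would typically be passed to a single restyle call on `marker.opacity`, so the coordinate arrays are never rebuilt when the dropdown changes.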

#### Data Processing & Pipeline Architecture
1. **Biological Normalization:**
   - *Gotcha:* SCimilarity requires normalized gene expression data to generate accurate embeddings. Passing raw counts causes mathematically invalid distances.
   - *Fix:* Apply `sc.pp.normalize_total(adata, target_sum=1e4)` and `sc.pp.log1p(adata)` before passing the matrix to `CellEmbedding`.
2. **Latent Space Consistency (Figures 3b vs 3c):**
   - *Gotcha:* If you run standard `sc.pp.pca` -> `sc.tl.umap` for the author data, and `X_scimilarity` -> `sc.tl.umap` for the predictions, the plots will look completely different and cannot be compared visually.
   - *Fix:* To match the paper, generate a **single** UMAP representation computed entirely on the `X_scimilarity` latent embedding (`sc.pp.neighbors(adata, use_rep='X_scimilarity')`). Then color that exact same layout by author labels (Fig 3b) and predicted labels (Fig 3c).
3. **Pandas Index Misalignment:**
   - *Gotcha:* When adding `CellAnnotation` predictions back to the `AnnData` object, Pandas will try to align the prediction Series index (integers) against the AnnData observation index (cell barcodes). This results in an entire column of `NaN`s.
   - *Fix:* Assign the raw array using `.values`: `adata.obs['scimilarity_pred'] = predictions.values`.
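
The index misalignment in item 3 is easy to reproduce with plain pandas. In this sketch a small `DataFrame` stands in for `adata.obs` and a `Series` stands in for the prediction output; the barcodes and labels are made up for illustration.

```python
import pandas as pd

# Stand-in for adata.obs: indexed by cell barcode (made-up values).
obs = pd.DataFrame(index=["AAAC-1", "AAAG-1", "AACT-1"])

# Stand-in for the predictions: default integer index 0..n-1.
predictions = pd.Series(["podocyte", "podocyte", "endothelial cell"])

# Direct assignment aligns on index labels. No barcode equals 0, 1, or 2,
# so every row ends up NaN.
obs["misaligned"] = predictions

# Positional assignment. Check the lengths first, because .values drops
# the only safeguard that index alignment would otherwise provide.
assert len(predictions) == len(obs)
obs["scimilarity_pred"] = predictions.values
```

Because `.values` assigns by position, it is only correct when the predictions are in the same cell order as the `AnnData` object; avoid filtering or reordering cells between prediction and assignment.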

## 3. INPUTS

**Data Sources:**
- **SCimilarity Model:** Local path `/data/models/model_v1.1` (Symlinked to `models/model_v1.1` in the workspace).
- **Kidney Dataset:** Kretzler/KPMP kidney dataset is used in Figure 3.
  - Source: `gs://rb-wkshp-bioit26-data/data/kretzler_kidney.h5ad`
  - Destination: `data/kretzler_kidney.h5ad`

Use the existing project structure as outlined in 01_QUEST_START_HERE.md.

Use the existing Python virtualenv installed in this directory.

## 4. OUTPUTS

Use the JSON data in `web/data/` to understand the structure of the output data. Document that here.
- `umap_data.json`: Contains `x_author`, `y_author`, `x_scim`, `y_scim`, `cell_id`, `author_label_idx`, and `pred_label_idx`.
- `metadata.json`: Contains `cell_types` (the union of standard labels), `color_map`, and `total_cells`.
- `concordance_matrix.json`: Contains `values` (a 2D matrix of percentages), `author_labels`, and `pred_labels`.

### TABLEs
- Write processing results to files (`results/data_subsample/`)
- Generate a table that summarizes overall percentage of cell types
- Calculate the concordance of author annotated cell types with SCimilarity annotated cell types. There may be some mismatch between the cell type labels by the two methods due to abbreviation. 
  - **Resolution Method chosen by user: Integrating a Formal Ontology** (Mapping arbitrary author strings and SCimilarity outputs to standardized `cell_type` names found in the `h5ad` metadata like "kidney loop of Henle thick ascending limb epithelial cell").
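
A minimal sketch of the concordance calculation, assuming both label lists are already harmonized and that each row is normalized by the author-label count (the row normalization and the function name `concordance_matrix` are assumptions, not taken from the paper). The returned dict uses the `values`, `author_labels`, and `pred_labels` keys listed for `concordance_matrix.json`.

```python
from collections import Counter


def concordance_matrix(author_labels, pred_labels):
    """Row-normalized percentage of each author label per predicted label."""
    author_labels = list(author_labels)
    pred_labels = list(pred_labels)
    if len(author_labels) != len(pred_labels):
        raise ValueError("Label lists must have one entry per cell")

    pair_counts = Counter(zip(author_labels, pred_labels))
    row_totals = Counter(author_labels)
    rows = sorted(row_totals)
    cols = sorted(set(pred_labels))

    # values[i][j]: % of cells with author label rows[i] predicted as cols[j]
    values = [
        [round(100.0 * pair_counts[(a, p)] / row_totals[a], 2) for p in cols]
        for a in rows
    ]
    return {"values": values, "author_labels": rows, "pred_labels": cols}
```

With row normalization, each row sums to roughly 100 and the entry where the author label equals the predicted label is the per-type concordance.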

### Figure 3 test and interpretation: 

Do the identified cell types make sense for a kidney sample?
Yes, the mapped cell types include proximal tubule cells, loop of Henle cells, collecting duct cells, etc., which are specific to kidney anatomy.

## 5. TESTS

### VALIDATION Tests (Accuracy & Science)

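- Inspect Figures 3b and 3c. Because both panels share one UMAP layout, similarity means label agreement: about 90% of cells should carry the same harmonized label in both panels.
- Are most of the per-cell-type concordance values >90%?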
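
The two validation questions above can be turned into numbers. This sketch assumes harmonized labels and reads "most" as "more than half of the author cell types"; that reading, and the helper names, are assumptions rather than part of the paper.

```python
from collections import Counter


def overall_agreement(author_labels, pred_labels):
    """Percentage of cells whose author and predicted labels match."""
    pairs = list(zip(author_labels, pred_labels))
    if not pairs:
        raise ValueError("No cells to compare")
    return 100.0 * sum(a == p for a, p in pairs) / len(pairs)


def per_type_agreement(author_labels, pred_labels):
    """Percentage of matching predictions for each author cell type."""
    totals, hits = Counter(), Counter()
    for a, p in zip(author_labels, pred_labels):
        totals[a] += 1
        hits[a] += int(a == p)
    return {a: 100.0 * hits[a] / totals[a] for a in totals}


def share_of_types_above(per_type, threshold=90.0):
    """Fraction of cell types whose agreement is strictly above threshold."""
    return sum(v > threshold for v in per_type.values()) / len(per_type)
```

A validation test could then assert that `overall_agreement(...)` is near 90 and that `share_of_types_above(...)` exceeds 0.5, and report the cell types that fall below the threshold for manual review.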

### VERIFICATION (Correct execution)

- Write a small text file summarizing the input data, including structure, dimensions and how author data can be used for our task. 
  - Will be output to `results/data_summary.txt`.

#### Unit tests (working, reproducible code)

Add a suite of unit tests in `tests/unit/` that run with the standard library `unittest` package. These tests should aim for >75% test coverage of the Python files in `src/`.

- All APIs should be mocked
- Create a set of mock data from a file in data/ that is structurally identical to, but much smaller than, the real data to aid with testing
- Write a testing README with an overview of the testing strategy and coverage
- Check that the tests pass
- Write a series of tests that ensure that the server in web/ can be run without errors, including JavaScript errors.
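
A server smoke test can be sketched with the standard library alone. The example assumes the site is served by Python's `http.server` (as in the file-serving note above) and uses a temporary directory with a made-up `metadata.json` in place of the real `web/` folder. It does not detect JavaScript errors; that would need a headless browser, which is outside this sketch.

```python
import json
import tempfile
import threading
import unittest
import urllib.request
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


class QuietHandler(SimpleHTTPRequestHandler):
    def log_message(self, format, *args):
        pass  # keep test output clean


class ServerSmokeTest(unittest.TestCase):
    def setUp(self):
        # Temporary stand-in for web/ with one physically copied JSON file.
        self.tmp = tempfile.TemporaryDirectory()
        data_dir = Path(self.tmp.name) / "data"
        data_dir.mkdir()
        (data_dir / "metadata.json").write_text(json.dumps({"total_cells": 3}))

        handler = partial(QuietHandler, directory=self.tmp.name)
        # Port 0 lets the OS pick a free port.
        self.server = ThreadingHTTPServer(("127.0.0.1", 0), handler)
        self.thread = threading.Thread(target=self.server.serve_forever, daemon=True)
        self.thread.start()

    def tearDown(self):
        self.server.shutdown()
        self.server.server_close()
        self.thread.join()
        self.tmp.cleanup()

    def test_metadata_json_is_served(self):
        port = self.server.server_address[1]
        url = f"http://127.0.0.1:{port}/data/metadata.json"
        with urllib.request.urlopen(url, timeout=5) as response:
            self.assertEqual(response.status, 200)
            payload = json.load(response)
        self.assertEqual(payload["total_cells"], 3)
```

Saved under `tests/unit/`, this runs with `python -m unittest`. Pointing `directory` at the real `web/` folder turns it into a check that the copied JSON files resolve without 404s.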