# 02_QUEST_FIG3 SPEC

## Instructions:
You will EDIT this SPEC so your agent will analyze real single-cell data and create 3+ figures (replacing the mock data and placeholder figures generated in 01_QUEST_START_HERE).  
You should also write a series of tests to:
- verify the code works
- validate the code's results reflect the underlying biology

In this document, overwrite anything in {{}} to get Gemini to work with you.

(We have filled in other sections to save you time and focus your attention on the more valuable educational aspects.)

This is a suggested SPEC structure, but feel free to add or change content.


---

## 1. GOALs and Background 

**Objective:** Start a single-cell analysis with the SCimilarity foundation model by creating cell *embeddings* and *labels* from new scRNA-seq data. **Reproduce published results: SCimilarity paper Figure 3**, comparing cell *labels* from SCimilarity predictions against held-out, author-annotated labels to assess model processing and output.

### Figure 3 description: 
- **3b (Author Annotations):** {{Each point on the plot represents a cell, color coded by cell type. The x and y axes are a dimensionality reduction of the neural network embeddings. The cell type labels were assigned by an expert.}}
- **3c (SCimilarity Predictions):** {{The same as 3b, except that the cell type labels are generated by the neural network.}}
- **3d (Concordance Heatmap):** {{A heatmap showing agreement between the author annotations and the predicted annotations. The author annotations are along the x axis and the predicted annotations are along the y axis. The expectation is that the labels match, so the high values fall along the diagonal. Each heatmap cell represents the percentage of annotations that agree. The rows and columns must be sorted biologically (e.g., via hierarchical clustering) rather than alphabetically so that related cell types are grouped together.}}
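
As a rough illustration of the concordance values described for 3d, the sketch below counts label pairs with the standard library only. It assumes each row is normalized per author label (so one author label's percentages sum to 100); the paper's exact normalization is not stated in this SPEC, and the function name `concordance_percent` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def concordance_percent(author_labels, predicted_labels):
    """Return {author_label: {predicted_label: percent}}.

    Assumption: each row is normalized so the percentages for one
    author label sum to 100.
    """
    if len(author_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")

    # Count how often each (author, predicted) pair occurs.
    counts = defaultdict(Counter)
    for author, predicted in zip(author_labels, predicted_labels):
        counts[author][predicted] += 1

    # Convert counts to row-wise percentages.
    matrix = {}
    for author, row in counts.items():
        total = sum(row.values())
        matrix[author] = {pred: 100.0 * n / total for pred, n in row.items()}
    return matrix


# Toy labels, invented for illustration only (not from the kidney dataset).
author = ["T cell", "T cell", "T cell", "T cell", "B cell", "B cell"]
predicted = ["T cell", "T cell", "T cell", "NK cell", "B cell", "B cell"]

matrix = concordance_percent(author, predicted)
print(matrix["T cell"]["T cell"])  # 75.0
print(matrix["B cell"]["B cell"])  # 100.0
```

The diagonal of the heatmap corresponds to entries such as `matrix["T cell"]["T cell"]`, where both annotations name the same cell type.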


### Deliverables:
1. Python scripts that extract author annotations and calculate SCimilarity predictions from a real dataset
2. Tests that verify and validate the output data

## 2. IMPLEMENTATION

### BIOLOGY
#### **Cell label terminology** might need standardizing
   - We need a harmonized way, based on the paper, to compare SCimilarity model output, Figure 3 annotations, and author annotations.
   - Gemini should suggest 4 ways to harmonize cell label terminology and then have the user pick one.
   - {{Describe your strategy for harmonizing cell type labels. Use one of these, or something similar: an ontology lookup, simple string matching.}}
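
One possible shape for the "simple string matching" option is sketched below. It is a minimal example, not the paper's method: the `SYNONYMS` table, its entries, and the function names are all hypothetical and would need to be rebuilt from the labels actually present in the data.

```python
import re

# Hypothetical synonym table: keys are normalized strings, values are the
# harmonized label. A real table would be built from the labels in the data.
SYNONYMS = {
    "t cells": "T cell",
    "t cell": "T cell",
    "cd4 t cell": "T cell",
    "b cells": "B cell",
    "b cell": "B cell",
    "pt": "proximal tubule cell",
    "proximal tubule": "proximal tubule cell",
}


def normalize(label):
    """Lowercase, replace runs of non-alphanumerics with a space, and trim."""
    text = re.sub(r"[^a-z0-9]+", " ", label.lower())
    return text.strip()


def harmonize(label, synonyms=SYNONYMS):
    """Map a raw label to a harmonized one.

    Labels missing from the table are returned in normalized form so they
    stay visible for manual review instead of being dropped.
    """
    key = normalize(label)
    return synonyms.get(key, key)


print(harmonize("CD4+ T-cell"))  # T cell
print(harmonize("  B_cells "))   # B cell
print(harmonize("Podocyte"))     # podocyte
```

Applying the same `harmonize` function to both the author labels and the predicted labels keeps the two axes of the concordance heatmap comparable.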


### Tech stack and project structure.

#### Languages
- Use Python for most processing {{}}

#### Architecture
- Use the existing project structure as outlined in 01_QUEST_START_HERE.md.

#### Interactive plot infrastructure
1. **Local File Serving:** 
   - *Gotcha:* Standard Python `http.server` often fails to resolve symlinked data directories outside its root, leading to 404s.
   - *Fix:* Physically copy the JSON files: `mkdir -p web/data && cp results/data_subsample/* web/data/` before serving.
2. **Plotly Performance:**
   - *Gotcha:* Rendering 10,000 points with the standard SVG-based `scatter` trace is slow.
   - *Fix:* Use `type: 'scattergl'` for WebGL hardware acceleration.
3. **Interactive Highlighting:**
   - *Gotcha:* Updating the data arrays for every dropdown change is too slow.
   - *Fix:* Group cells into separate Plotly traces by cell type. To highlight, change `marker.opacity` to `0.05` for non-matching traces and keep it at full opacity for the matching trace.
4. **Heatmap Sorting:**
   - *Gotcha:* Sorting the concordance matrix alphabetically scatters misclassifications randomly.
   - *Fix:* Use hierarchical clustering (e.g., `scipy.cluster.hierarchy`) on the concordance profiles to sort the rows and columns so biologically similar cell lineages are grouped together.
5. Do not invent new file formats for biological data. Use standard formats where possible. 
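
The heatmap sorting fix in item 4 could be sketched as follows, using `scipy.cluster.hierarchy.linkage` and `leaves_list`. This is an illustration under stated assumptions: the matrix is square with the same label set on both axes, the values are invented, and average linkage with Euclidean distance is one reasonable choice rather than a requirement of the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage

# Toy data, invented for illustration. Rows are author labels, columns are
# predicted labels in the same order, values are percent agreement.
labels = ["B cell", "podocyte", "T cell", "proximal tubule cell"]
concordance = np.array([
    [80.0,  0.0, 20.0,  0.0],
    [ 0.0, 90.0,  0.0, 10.0],
    [25.0,  0.0, 75.0,  0.0],
    [ 0.0, 15.0,  0.0, 85.0],
])

# Cluster the rows (each row is one label's concordance profile).
Z = linkage(concordance, method="average", metric="euclidean")

# Leaf order of the dendrogram; similar profiles end up next to each other.
order = leaves_list(Z)

# Apply the same order to rows and columns so matches stay on the diagonal.
sorted_labels = [labels[i] for i in order]
sorted_matrix = concordance[np.ix_(order, order)]

print(sorted_labels)
```

In this toy matrix the two immune labels confuse mostly with each other, as do the two kidney labels, so each pair ends up adjacent after sorting.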



## 3. INPUTS

**Data Sources:**
- **SCimilarity Model:** Local path `/data/models/model_v1.1` (Symlinked to `models/model_v1.1` in the workspace).
- **Kidney Dataset:** Kretzler/KPMP kidney dataset is used in Figure 3.
  - Source: `gs://rb-wkshp-bioit26-data/data/kretzler_kidney.h5ad`
  - Destination: `data/kretzler_kidney.h5ad`

Use the existing project structure as outlined in 01_QUEST_START_HERE.md.

Use the existing Python virtualenv installed in this directory.

## 4. OUTPUTS

Use the JSON data in `web/data/` to understand the structure of the output data, and document that structure here.

### TABLEs
- Write processing results to files (csv or appropriate format) for additional testing/inspection.
- {{Produce a summary table based on Figure 3d that contains the average and median agreement between author annotations and predicted annotations.}}
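
A summary table of this kind could be produced along these lines. The sketch assumes the per-cell-type diagonal agreement values have already been computed; the values, the column names, and the `summarize` helper are hypothetical.

```python
import csv
import io
import statistics

# Toy diagonal agreement values (percent), invented for illustration.
diagonal_agreement = {
    "T cell": 75.0,
    "B cell": 100.0,
    "podocyte": 90.0,
    "proximal tubule cell": 55.0,
}


def summarize(agreement):
    """Return the mean and median diagonal agreement across cell types."""
    values = list(agreement.values())
    return {
        "n_cell_types": len(values),
        "mean_agreement_pct": statistics.mean(values),
        "median_agreement_pct": statistics.median(values),
    }


summary = summarize(diagonal_agreement)

# Write the one-row table as CSV. An in-memory buffer is used here; a real
# script would write to a file under the project's results directory.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(summary))
writer.writeheader()
writer.writerow(summary)
print(buffer.getvalue())
```

For the toy values above, the mean is 80.0 and the median is 82.5.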

### Figure 3 test and interpretation: 
{{Write an accompanying markdown file that has an analysis of all the figures generated.}}


## 5. TESTS

### VALIDATION Tests (Accuracy & Science)

- {{At least two-thirds of the diagonal entries of the concordance heatmap (where the author and predicted annotations name the same cell type) show at least two-thirds (about 67%) agreement.}}

- {{Verify that the generated UMAP coordinates are mathematically valid (for example, finite values with no NaNs, and one 2D coordinate per cell).}}
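
The two validation checks above might be expressed as small helper functions like these. This is a sketch: the function names are hypothetical, and "mathematically valid" is interpreted here as one finite (x, y) pair per cell, which is an assumption rather than a definition given in this SPEC.

```python
import math


def diagonal_passes(agreement_pct):
    """True if at least 2/3 of cell types have diagonal agreement >= 2/3.

    `agreement_pct` maps cell type to diagonal agreement in percent.
    Integer arithmetic is used for the fraction to avoid rounding surprises.
    """
    values = list(agreement_pct.values())
    if not values:
        return False
    n_ok = sum(1 for v in values if v >= 200.0 / 3.0)
    return 3 * n_ok >= 2 * len(values)


def umap_is_valid(coords, n_cells):
    """True if there is exactly one finite (x, y) pair per cell."""
    if len(coords) != n_cells:
        return False
    return all(
        len(point) == 2 and all(math.isfinite(v) for v in point)
        for point in coords
    )


# Toy inputs, invented for illustration.
good = {"T cell": 75.0, "B cell": 100.0, "podocyte": 90.0, "PT": 55.0}
bad = {"a": 70.0, "b": 50.0, "c": 40.0}

print(diagonal_passes(good))  # True (3 of 4 types reach the threshold)
print(diagonal_passes(bad))   # False (1 of 3 types reaches the threshold)
print(umap_is_valid([(0.1, 2.0), (3.5, -1.2)], n_cells=2))          # True
print(umap_is_valid([(0.1, float("nan")), (3.5, -1.2)], n_cells=2))  # False
```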

### VERIFICATION (Correct execution)

- {{Write a small text file summarizing the input data, including structure, dimensions and how author data can be used for our task.}}

#### Unit tests (working, reproducible code)

Add a suite of unit tests in `tests/unit/` that run with the standard library `unittest` package. These tests should aim for >75% test coverage of the Python files in `src/`.

- All APIs should be mocked
- Create a set of mock data from a file in data/ that is structurally identical to, but much smaller than, the real data to aid with testing
- Write a testing README with an overview of the testing strategy and coverage
- Check that the tests pass
- Write a series of tests that ensure the server in web/ runs without errors, including JavaScript errors.
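
For the "All APIs should be mocked" requirement, a unit test could follow the pattern below, using `unittest.mock` from the standard library. The `Pipeline` class is a hypothetical stand-in for whatever code in `src/` loads the model; the real class and method names will differ.

```python
import unittest
from unittest import mock


class Pipeline:
    """Hypothetical stand-in for code in src/ that calls an external model."""

    def load_model(self, path):
        # The real implementation would load the model from disk.
        raise RuntimeError("real model loading is not available in unit tests")

    def annotate(self, model_path, cells):
        model = self.load_model(model_path)
        return [model.predict(cell) for cell in cells]


class AnnotateTest(unittest.TestCase):
    def test_annotate_uses_model_predictions(self):
        # The fake model returns one canned label per call to predict().
        fake_model = mock.Mock()
        fake_model.predict.side_effect = ["T cell", "B cell"]

        pipeline = Pipeline()
        # Replace load_model so no real model is touched during the test.
        with mock.patch.object(
            pipeline, "load_model", return_value=fake_model
        ) as loader:
            labels = pipeline.annotate("models/model_v1.1", ["cell_0", "cell_1"])

        self.assertEqual(labels, ["T cell", "B cell"])
        loader.assert_called_once_with("models/model_v1.1")
```

Saved under `tests/unit/`, a file like this can be run with `python -m unittest discover tests/unit`.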




--- 
copyright: © 2026 Sonia Timberlake & Ryan Bellmore
license: Proprietary - Authorized Workshop Participants Only
distribution_allowed: false
