# 02_QUEST_FIG3 SPEC

## Instructions:
You will EDIT this SPEC so your agent will analyze real single-cell data and create 3+ figures (replacing the mock data and placeholder figures generated in 01_QUEST_START_HERE).  
You should also write a series of tests to:
- verify the code works
- validate the code's results reflect the underlying biology

In this document, overwrite anything in {{}} to get Gemini to work with you.

(We have filled in other sections to save you time and focus your attention on the more valuable educational aspects.)

This is a suggested SPEC structure, but feel free to add or change content.


---

## 1. GOALs and Background 

**Objective:** Start a single-cell analysis with the SCimilarity foundation model by creating cell *embeddings* and *labels* from new scRNA-seq data. **Reproduce published results: SCimilarity paper Figure 3**, comparing cell *labels* from SCimilarity predictions against held-out, author-annotated labels to assess model processing and output.

### Figure 3 description: 
- **3b (Author Annotations):** Uniform manifold approximation and projection (UMAP) embedding of cell profiles (dots) from SCimilarity's latent representation of a held-out kidney dataset (reference 25 in the paper), colored by author-provided cell type annotations.
- **3c (SCimilarity Predictions):** The same UMAP embedding of the same kidney dataset, colored by SCimilarity-predicted cell type annotations.
- **3d (Concordance Heatmap):** Heatmap of the agreement between author-provided cell type annotations and SCimilarity predictions, showing for each author-provided label the proportion of cells assigned to each SCimilarity annotation.


### Deliverables:
1. Python scripts that extract author annotations and calculate SCimilarity predictions from a real data set
2. Tests that verify and validate the output data

## 2. IMPLEMENTATION

### BIOLOGY
#### **Cell label terminology** harmonization
   - We will use a hardcoded mapping dictionary to harmonize author annotations and SCimilarity predictions.
   - This dictionary will be generated from a CSV file (`data/label_mapping.csv`) containing three columns: `Label` (the display name for figures), `Author-label` (the original string in the dataset), and `SCimilarity-label` (the string output by the model).
   - The processing script will load this CSV and use it to normalize all cell labels before calculating the concordance matrix and generating UMAP visualizations.
   - **Visualization:** Show how well the author annotations and the SCimilarity predictions agree for each cell type. Present this as a dot plot of label correlation, with author-annotated labels on the Y axis and SCimilarity labels on the X axis.
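
The harmonization step above could be sketched as follows. This is a minimal illustration using only the standard library, not the required implementation: the function names `load_label_mapping` and `harmonize`, the `"Unmapped"` sentinel, and the example rows (`POD`, `B`) are assumptions made for the example. Only the three column names come from this SPEC.

```python
import csv
import io


def load_label_mapping(csv_file):
    """Build author->display and SCimilarity->display lookups from the mapping CSV."""
    author_to_label = {}
    scim_to_label = {}
    for row in csv.DictReader(csv_file):
        label = row["Label"].strip()
        author_key = row["Author-label"].strip()
        scim_key = row["SCimilarity-label"].strip()
        # Skip empty cells so a row may map only one side if needed.
        if author_key:
            author_to_label[author_key] = label
        if scim_key:
            scim_to_label[scim_key] = label
    return author_to_label, scim_to_label


def harmonize(labels, mapping, unmapped="Unmapped"):
    """Map raw labels to display names; unknown labels get a visible sentinel."""
    return [mapping.get(lbl, unmapped) for lbl in labels]


# Hypothetical rows for illustration; the real file is data/label_mapping.csv.
example_csv = io.StringIO(
    "Label,Author-label,SCimilarity-label\n"
    "Podocyte,POD,podocyte\n"
    "B cell,B,B cell\n"
)
author_map, scim_map = load_label_mapping(example_csv)
harmonized_author = harmonize(["POD", "B", "XYZ"], author_map)
harmonized_scim = harmonize(["podocyte", "B cell"], scim_map)
```

Mapping unknown labels to a sentinel rather than dropping them keeps the cell count unchanged, which makes gaps in the mapping file easy to spot in the concordance matrix.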



### Tech stack and project structure.

#### Languages
- Use Python for most processing.
- Optional: test whether the results differ if R is used.

#### Architecture
- Use the existing project structure as outlined in 01_QUEST_START_HERE.md.

#### Interactive plot infrastructure
1. **Local File Serving:** 
   - *Gotcha:* Standard Python `http.server` often fails to resolve symlinked data directories outside its root, leading to 404s.
   - *Fix:* Physically copy the JSON files: `mkdir -p web/data && cp results/data_subsample/* web/data/` before serving.
2. **Plotly Performance:**
   - *Gotcha:* 10,000 points using standard `scatter` SVG rendering is slow.
   - *Fix:* Use `type: 'scattergl'` for WebGL hardware acceleration.
3. **Interactive Highlighting:**
   - *Gotcha:* Updating the data arrays for every dropdown change is too slow.
   - *Fix:* Group cells into separate Plotly traces by cell type. To highlight, change `marker.opacity` to `0.05` for non-matching traces and keep it at full opacity for the matching trace.
4. Do not invent new file formats for biological data. Use standard formats where possible. 
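
On the Python side, the per-cell-type trace grouping from item 3 might look like the sketch below. The helper names `build_traces` and `highlight_opacities` are hypothetical; the `scattergl` trace type and the `0.05` dimmed opacity are taken from the list above.

```python
import json
from collections import defaultdict


def build_traces(x, y, labels):
    """Group points into one Plotly trace dict per cell type."""
    grouped = defaultdict(lambda: {"x": [], "y": []})
    for xi, yi, lbl in zip(x, y, labels):
        grouped[lbl]["x"].append(xi)
        grouped[lbl]["y"].append(yi)
    return [
        {
            "type": "scattergl",  # WebGL rendering instead of SVG
            "mode": "markers",
            "name": lbl,
            "x": pts["x"],
            "y": pts["y"],
            "marker": {"opacity": 1.0},
        }
        for lbl, pts in sorted(grouped.items())
    ]


def highlight_opacities(traces, selected, dim=0.05):
    """One opacity per trace: full for the selected cell type, dimmed otherwise."""
    return [1.0 if trace["name"] == selected else dim for trace in traces]


traces = build_traces([0.0, 1.0, 2.0], [0.0, 1.0, 2.0], ["A", "B", "A"])
opacities = highlight_opacities(traces, "B")
payload = json.dumps(traces)  # what the page would load and pass to Plotly
```

Because each cell type is its own trace, a dropdown change only needs to send a short list of opacities (one per trace) to the plot rather than rebuilding the coordinate arrays.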



## 3. INPUTS

**Data Sources:**
- **SCimilarity Model:** Local path `/data/models/model_v1.1` (Symlinked to `models/model_v1.1` in the workspace).
- **Kidney Dataset:** Kretzler/KPMP kidney dataset is used in Figure 3.
  - Source: `gs://rb-wkshp-bioit26-data/data/kretzler_kidney.h5ad`
  - Destination: `data/kretzler_kidney.h5ad`

Use the existing project structure as outlined in 01_QUEST_START_HERE.md.

Use the existing python virtualenv installed in this directory.

## 4. OUTPUTS

### Data Structures (JSON)
The processing script will output the following JSON files to `web/data/`:

1.  **`umap_data.json`**:
    *   `x_author`, `y_author`: Array of floats (UMAP coordinates)
    *   `x_scim`, `y_scim`: Array of floats
    *   `author_label_idx`: Array of integers (mapping to the `cell_types` list in metadata)
    *   `pred_label_idx`: Array of integers
    *   `cell_id`: Array of strings
2.  **`metadata.json`**:
    *   `cell_types`: List of strings (unique labels)
    *   `color_map`: Dictionary of `{ label: hex_code }`
    *   `n_cells`: Total cell count
3.  **`concordance_matrix.json`**:
    *   `values`: 2D array of proportions (floats)
    *   `author_labels`, `pred_labels`: Lists of strings for the axes
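
A minimal sketch of how the index-encoded labels and the concordance matrix might be assembled is shown below. It assumes the matrix is row-normalized, so that each row gives the proportion of cells with one author label that received each predicted label; the SPEC says only "proportions", so this is one reading. The function names and toy labels are illustrative.

```python
import json
from collections import Counter


def encode_labels(author_labels, pred_labels):
    """Return the shared cell_types list plus integer indices into it."""
    cell_types = sorted(set(author_labels) | set(pred_labels))
    index = {lbl: i for i, lbl in enumerate(cell_types)}
    author_idx = [index[lbl] for lbl in author_labels]
    pred_idx = [index[lbl] for lbl in pred_labels]
    return cell_types, author_idx, pred_idx


def concordance_matrix(author_labels, pred_labels):
    """Row-normalized cross-tabulation of author labels vs. predicted labels."""
    author_axis = sorted(set(author_labels))
    pred_axis = sorted(set(pred_labels))
    pair_counts = Counter(zip(author_labels, pred_labels))
    row_totals = Counter(author_labels)
    values = [
        [pair_counts[(a, p)] / row_totals[a] for p in pred_axis]
        for a in author_axis
    ]
    return {"values": values, "author_labels": author_axis, "pred_labels": pred_axis}


# Toy labels for illustration only.
author = ["T cell", "T cell", "B cell", "B cell"]
pred = ["T cell", "B cell", "B cell", "B cell"]

cell_types, author_idx, pred_idx = encode_labels(author, pred)
matrix = concordance_matrix(author, pred)
metadata = {"cell_types": cell_types, "n_cells": len(author)}
serialized = json.dumps({"metadata": metadata, "concordance": matrix})
```

Storing integer indices instead of repeated label strings keeps `umap_data.json` small when the dataset has many cells and few distinct cell types.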

### TABLEs
- Write processing results to files (csv or appropriate format) for additional testing/inspection.
- Create a CSV table of annotation correlation across all provided labels, with an additional column giving the label that has the highest correlation.
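
One possible shape for that table is sketched below. It assumes the input is a matrix in the `concordance_matrix.json` layout and uses its values as the agreement score; the column names `author_label` and `best_match` and the function name are hypothetical.

```python
import csv
import io


def write_concordance_table(matrix, out_file):
    """Write one row per author label, plus the best-matching predicted label."""
    writer = csv.writer(out_file)
    writer.writerow(["author_label", *matrix["pred_labels"], "best_match"])
    for author, row in zip(matrix["author_labels"], matrix["values"]):
        # Ties resolve to the first predicted label with the maximum value.
        best = matrix["pred_labels"][row.index(max(row))]
        writer.writerow([author, *row, best])


# Toy matrix for illustration only.
toy_matrix = {
    "values": [[0.9, 0.1], [0.3, 0.7]],
    "author_labels": ["B cell", "T cell"],
    "pred_labels": ["B cell", "T cell"],
}
buffer = io.StringIO()
write_concordance_table(toy_matrix, buffer)
table_rows = list(csv.reader(io.StringIO(buffer.getvalue())))
```

In the real script the `out_file` argument would be an open file handle rather than an in-memory buffer.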

### Figure 3 test and interpretation: 
Identify which annotated cell types are better represented by author-provided labels and which are best annotated by SCimilarity. Best is defined as highest correlation.


## 5. TESTS

### VALIDATION Tests (Accuracy & Science)

- **Concordance Threshold:** Test that the average diagonal value of the concordance matrix (author-label vs. predicted-label) is greater than **85%**. This validates that the SCimilarity model successfully reproduces author-level annotations.
- **Proportion Table:** Provide a table of the proportion of cells assigned to each annotated type (annotated type vs. all cells). Show this alongside each of the UMAP plots 3b and 3c described above.
- **Label Coverage:** Test that the intermediate output provides a label for each cell. Verify that an embeddings file is created for each UMAP requested.
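
The three validation checks above could be backed by small helpers like the ones sketched here. The names are hypothetical, and the sketch assumes that an author label with no identically named predicted label contributes 0 to the diagonal average, which is a deliberately strict choice.

```python
from collections import Counter


def mean_diagonal(matrix):
    """Average, over author labels, of the proportion predicted as the same label."""
    pred_index = {lbl: j for j, lbl in enumerate(matrix["pred_labels"])}
    diagonal = [
        row[pred_index[lbl]] if lbl in pred_index else 0.0
        for lbl, row in zip(matrix["author_labels"], matrix["values"])
    ]
    return sum(diagonal) / len(diagonal)


def label_proportions(labels):
    """Proportion of all cells carrying each label."""
    counts = Counter(labels)
    total = len(labels)
    return {lbl: n / total for lbl, n in counts.items()}


def unlabeled_cells(cell_ids, labels):
    """Return IDs of cells whose label is missing or blank."""
    return [
        cid for cid, lbl in zip(cell_ids, labels)
        if lbl is None or not str(lbl).strip()
    ]


# Toy inputs for illustration only.
toy_matrix = {
    "values": [[0.9, 0.1], [0.1, 0.9]],
    "author_labels": ["B cell", "T cell"],
    "pred_labels": ["B cell", "T cell"],
}
score = mean_diagonal(toy_matrix)
proportions = label_proportions(["T cell", "T cell", "B cell", "B cell"])
missing = unlabeled_cells(["c1", "c2", "c3"], ["T cell", "", None])
```

A validation test would then assert `mean_diagonal(matrix) > 0.85` on the real concordance matrix and `unlabeled_cells(...) == []` on the real label columns.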

### VERIFICATION (Correct execution)

- Write a small text file summarizing the input data, including structure, dimensions and how author data can be used for our task. 

#### Unit tests (working, reproducible code)

Add a suite of unit tests in `tests/unit/` that run with the standard library `unittest` package. These tests should aim for >75% test coverage of the Python files in `src/`.

- All APIs should be mocked
- Create a set of mock data from a file in data/ that is structurally identical to, but much smaller than, the real data to aid with testing
- Write a testing README with an overview of the testing strategy and coverage
- Check that the tests pass
- Write a series of tests that ensure that the server in web/ can be run without errors, including JavaScript errors.
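
As a starting point for the server check, the sketch below starts a static file server on a free local port, fetches one JSON file, and shuts down again. It uses only the standard library, and the helper name `fetch_from_static_server` is an assumption. It covers the "serves without errors" part only: detecting JavaScript errors would need a headless browser, which is outside this sketch.

```python
import functools
import http.server
import json
import tempfile
import threading
import urllib.request
from pathlib import Path


class QuietHandler(http.server.SimpleHTTPRequestHandler):
    """Static file handler that does not write request logs to stderr."""

    def log_message(self, format, *args):
        pass


def fetch_from_static_server(directory, relative_url):
    """Serve `directory` on a free local port and fetch one file from it."""
    handler = functools.partial(QuietHandler, directory=str(directory))
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        # An empty ProxyHandler bypasses any proxy configured in the environment.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        url = f"http://127.0.0.1:{port}/{relative_url}"
        with opener.open(url, timeout=5) as response:
            return response.status, response.read()
    finally:
        server.shutdown()
        server.server_close()
        thread.join(timeout=5)


# Stand-in for web/: a temporary directory containing data/metadata.json.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    (data_dir / "metadata.json").write_text(json.dumps({"n_cells": 3}))
    status, body = fetch_from_static_server(tmp, "data/metadata.json")
```

Binding to port `0` lets the operating system pick a free port, so the test does not collide with a development server that is already running.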




--- 
copyright: © 2026 Sonia Timberlake & Ryan Bellmore
license: Proprietary - Authorized Workshop Participants Only
distribution_allowed: false
