# SCimilarity Competitive Intelligence & Visualization Platform

## Overview
This repository contains the codebase developed during the **Bio-IT World 2026 AI Upskilling Workshop for Computational Biologists**. The project demonstrates the end-to-end integration of large language models (LLMs) with computational biology workflows, from processing single-cell RNA-sequencing data to extracting clinical competitive intelligence.

The application serves two primary functions:
1.  **SCimilarity Validation Dashboard:** An interactive web visualization validating the performance of the [SCimilarity foundation model](https://www.nature.com/articles/s41586-023-06886-0) against real-world human kidney and heart single-cell datasets.
2.  **Translational Strategy Dashboard:** An automated competitive intelligence report generator that queries external databases (Open Targets) to assess the clinical viability and "whitespace" of newly identified pro-fibrotic targets.

## Key Features

*   **Foundation Model Integration:** Leverages the SCimilarity model to generate automated cell-type predictions from raw `.h5ad` single-cell datasets.
*   **Label Harmonization:** Implements a tiered mapping strategy to align and compare model predictions against original author annotations via a concordance matrix.
*   **Dynamic Web Visualizations:** A responsive UI built with vanilla JavaScript and Plotly that renders UMAP coordinates and prediction concordance without a full page reload, styled with a custom "Starry Night" aesthetic.
*   **Reusable AI Skills:** Includes the `cellxgene-scimilarity` Gemini CLI Skill, a modular tool that queries the `CELLxGENE` Census API, downloads tissue-specific datasets, runs SCimilarity predictions, and generates UI-ready JSON artifacts.
*   **Automated Competitive Intelligence:** A Python data pipeline (`generate_intelligence_report.py`) that queries the Open Targets GraphQL API to generate a clinical landscape report (Phase 1-4 trial distributions, top indications) for specific biological markers (e.g., SPP1, MARCO, CD163).
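
The label harmonization idea above can be illustrated with a minimal, standard-library sketch. It is not the repository's implementation: the three tiers, the `SYNONYMS` and `COARSE_KEYWORDS` tables, and the function names are all hypothetical, chosen only to show how a tiered mapping feeds a concordance matrix of author labels (rows) against predicted labels (columns).

```python
from collections import Counter

# Hypothetical tier-2 synonym table; the real mapping is not shown in this README.
SYNONYMS = {
    "mac": "macrophage",
    "cardiac muscle cell": "cardiomyocyte",
}

# Hypothetical tier-3 keyword fallbacks that collapse subtypes to a coarse label.
COARSE_KEYWORDS = {
    "fibroblast": "fibroblast",
    "endothelial": "endothelial cell",
}

# Shared vocabulary that both label sets are mapped onto.
VOCABULARY = set(SYNONYMS.values()) | set(COARSE_KEYWORDS.values())


def harmonize(label: str) -> str:
    """Map a raw cell-type label onto the shared vocabulary using three tiers."""
    norm = label.strip().lower().replace("_", " ")
    # Tier 1: the normalized label is already in the shared vocabulary.
    if norm in VOCABULARY:
        return norm
    # Tier 2: the normalized label is a known synonym.
    if norm in SYNONYMS:
        return SYNONYMS[norm]
    # Tier 3: fall back to a coarse label when a keyword appears in the text.
    for keyword, coarse in COARSE_KEYWORDS.items():
        if keyword in norm:
            return coarse
    return "unmapped"


def concordance_matrix(author_labels, predicted_labels):
    """Return (row_labels, column_labels, counts) for harmonized label pairs."""
    if len(author_labels) != len(predicted_labels):
        raise ValueError("author and predicted labels must have the same length")
    pairs = Counter(
        (harmonize(a), harmonize(p)) for a, p in zip(author_labels, predicted_labels)
    )
    rows = sorted({a for a, _ in pairs})
    cols = sorted({p for _, p in pairs})
    # A Counter returns 0 for pairs that never occurred.
    matrix = [[pairs[(r, c)] for c in cols] for r in rows]
    return rows, cols, matrix
```

Cells where the row and column labels match count agreement between the model and the authors; cells where they differ show where predictions diverge from the original annotations.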

## Project Structure

```text
.
├── src/
│   ├── download_heart_data.py         # Script to fetch targeted datasets from CELLxGENE
│   ├── process_heart_data.py          # Data pipeline for SCimilarity embedding & prediction
│   ├── process_real_data.py           # Core processing logic for kidney dataset
│   └── generate_intelligence_report.py # Queries Open Targets API & generates Plotly dashboard
├── skills/
│   └── cellxgene-scimilarity/         # Source code for the reusable AI data ingestion skill
├── web/
│   ├── index.html                     # SCimilarity Visualization Dashboard
│   ├── visualization.js               # Frontend logic for dynamic UMAP rendering
│   ├── intelligence_report.html       # Automated Competitive Intelligence Dashboard
│   └── data/                          # JSON artifacts (UMAP coords, Metadata, Concordance)
├── data/                              # Raw .h5ad single-cell datasets
├── models/                            # SCimilarity model weights
└── tests/                             # Unit testing suite for data processing validation
```
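
As an illustration of what a UI-ready artifact in `web/data/` could look like, the sketch below groups UMAP points by cell-type label into Plotly-style scatter traces and serializes them to JSON. The `{"traces": [...]}` layout and the function names are assumptions made for this example; the actual schema read by `visualization.js` is not documented here.

```python
import json
from collections import defaultdict


def build_umap_traces(coords, labels):
    """Group (x, y) UMAP points by label into Plotly-style scatter traces."""
    if len(coords) != len(labels):
        raise ValueError("coords and labels must have the same length")
    grouped = defaultdict(lambda: {"x": [], "y": []})
    for (x, y), label in zip(coords, labels):
        # Rounding keeps the JSON payload small for large datasets.
        grouped[label]["x"].append(round(float(x), 4))
        grouped[label]["y"].append(round(float(y), 4))
    return [
        {"name": label, "type": "scattergl", "mode": "markers",
         "x": points["x"], "y": points["y"]}
        for label, points in sorted(grouped.items())
    ]


def dump_artifact(traces) -> str:
    """Serialize traces to compact JSON (hypothetical artifact layout)."""
    return json.dumps({"traces": traces}, separators=(",", ":"))
```

One trace per label lets Plotly build the legend and toggle cell types without extra front-end logic.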

## Workshop Progression (Quests)

This repository is the result of four guided workshop "quests":

1.  **Quest 1 (Start Here):** Environment setup and deployment of a minimal web specification.
2.  **Quest 2 (Fig 3 Recreation):** Writing robust data processing pipelines and tests to recreate Figure 3 from the SCimilarity paper using real single-cell data.
3.  **Quest 3 (New Data & Skills):** Expanding the pipeline to novel datasets (Tabula Sapiens Heart) and packaging the complex workflow into a reusable AI "Skill" for the Gemini CLI.
4.  **Quest 4 (Open Targets):** Transitioning from basic research to translational strategy by building a pipeline to assess the clinical landscape of discovered biomarkers (SPP1, MARCO, CD163).
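
The Quest 4 pipeline can be pictured with the following sketch, which is not the contents of `generate_intelligence_report.py`. It assumes the public Open Targets Platform GraphQL endpoint and a `knownDrugs` field on `target`; field names vary between schema versions, so verify the query against the current schema before use. The function names are hypothetical.

```python
import json
import urllib.request
from collections import Counter

# Public Open Targets Platform GraphQL endpoint (confirm against the current docs).
OPEN_TARGETS_URL = "https://api.platform.opentargets.org/api/v4/graphql"

# Assumed query shape; field names may differ in newer schema versions.
KNOWN_DRUGS_QUERY = """
query KnownDrugs($ensemblId: String!) {
  target(ensemblId: $ensemblId) {
    approvedSymbol
    knownDrugs {
      rows {
        phase
        disease { name }
      }
    }
  }
}
"""


def fetch_known_drug_rows(ensembl_id, timeout=30):
    """POST the query for one Ensembl gene ID and return the drug rows (may be empty)."""
    payload = json.dumps(
        {"query": KNOWN_DRUGS_QUERY, "variables": {"ensemblId": ensembl_id}}
    ).encode("utf-8")
    request = urllib.request.Request(
        OPEN_TARGETS_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    target = (body.get("data") or {}).get("target") or {}
    return (target.get("knownDrugs") or {}).get("rows") or []


def summarize_landscape(rows, top_n=3):
    """Tally Phase 1 to 4 counts and the most frequent indications."""
    phase_counts = Counter()
    indication_counts = Counter()
    for row in rows:
        try:
            phase = int(float(row.get("phase")))
        except (TypeError, ValueError):
            phase = None  # Missing or non-numeric phase: skip the phase tally.
        if phase in (1, 2, 3, 4):
            phase_counts[phase] += 1
        name = (row.get("disease") or {}).get("name")
        if name:
            indication_counts[name] += 1
    return {
        "phase_distribution": {p: phase_counts[p] for p in (1, 2, 3, 4)},
        "top_indications": indication_counts.most_common(top_n),
    }
```

Keeping the network call separate from the summary step means `summarize_landscape` can be unit-tested on saved responses. A target with few or no rows would point to the kind of "whitespace" the dashboard is meant to surface.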

## Deployment

The static visualization application and intelligence dashboards can be hosted locally or deployed to a static file server (e.g., Google Cloud Storage, AWS S3, GitHub Pages).

To run locally:
```bash
cd web
python -m http.server 8000
```
Then open `http://localhost:8000` for the SCimilarity visualization dashboard, or `http://localhost:8000/intelligence_report.html` for the competitive intelligence dashboard.

## License
**Proprietary - Authorized Workshop Participants Only**
© 2026 Sonia Timberlake & Ryan Bellmore
