A self-maintaining single-cell multi-omics knowledge base built using the LLM Wiki pattern.
Instead of retrieving from raw documents at query time like traditional RAG, CellWiki uses an LLM to incrementally build and maintain a persistent wiki — a structured, interlinked collection of Markdown pages that sits between curated scientific papers and the user. Knowledge is compiled once and kept current, not re-derived on every query.
CellWiki automates the lifecycle of single-cell biology knowledge:
- Extract — Automatically extract cell types, marker genes, tissues, diseases, methods, and trajectories from single-cell biology papers (PDFs)
- Integrate — Merge information across papers, detecting contradictions and building consensus
- Reason — Infer relationships (cell type ↔ gene ↔ tissue ↔ disease ↔ trajectory) across modalities
- Query — Answer structured questions with citations back to sources
- Self-maintain — Automatically detect knowledge gaps, broken links, and stale claims; propose updates
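
The Integrate step above can be pictured with a small sketch. This is not CellWiki's actual merger: the `MarkerClaim` record, the `integrate` function, and the sample claims are all hypothetical, made up purely to show how per-paper statements might be grouped into consensus entries and flagged contradictions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MarkerClaim:
    """One extracted statement (hypothetical record, not CellWiki's data model)."""
    cell_type: str
    gene: str
    expressed: bool  # True = reported as a marker, False = reported as absent
    source: str      # placeholder citation key


def integrate(claims):
    """Group claims by (cell_type, gene); split into consensus and contradictions."""
    grouped = defaultdict(list)
    for claim in claims:
        grouped[(claim.cell_type, claim.gene)].append(claim)

    consensus, contradictions = {}, {}
    for key, group in grouped.items():
        verdicts = {c.expressed for c in group}
        sources = sorted({c.source for c in group})
        if len(verdicts) == 1:
            # Every paper agrees: record the shared verdict with all citations.
            consensus[key] = {"expressed": verdicts.pop(), "sources": sources}
        else:
            # Papers disagree: keep the citations so a reviewer can follow up.
            contradictions[key] = sources
    return consensus, contradictions


# Illustrative claims only; they are not taken from any real paper.
claims = [
    MarkerClaim("exhausted CD8+ T cell", "PDCD1", True, "paper_a"),
    MarkerClaim("exhausted CD8+ T cell", "PDCD1", True, "paper_b"),
    MarkerClaim("exhausted CD8+ T cell", "TCF7", True, "paper_a"),
    MarkerClaim("exhausted CD8+ T cell", "TCF7", False, "paper_b"),
]
consensus, contradictions = integrate(claims)
```

Keeping the source keys on both outputs is what lets a later query cite its evidence, and lets a contradiction entry point a reviewer at the papers that disagree.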
The project currently focuses on tumor immunology (T cell exhaustion, tumor microenvironment), with 96 cell type wiki pages already generated from example papers.
Raw Sources (PDFs) ──► LLM Extraction ──► Wiki Pages (Markdown)
                              │                     │
                        Contradiction         Schema Rules
                          Detection            (schema.md)
                              │                     │
                       Knowledge Graph ◄─────── Multi-Omics Merger
CellWiki v2.0 uses LangGraph state machines for four composable workflows:
| Subgraph | Description |
|---|---|
| Ingest | Two-step pipeline: LLM analysis → human review interrupt → wiki generation |
| Query | Entity match → graph expansion → budget-aware context assembly → LLM synthesis |
| Lint | Self-healing loop: scan → classify → auto-fix → recheck |
| Research | Gap-driven: generate queries → web search → synthesize → propose updates |
All subgraphs persist state to SQLite, enabling cross-session recovery and state history inspection.
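
To make the Query row more concrete, here is a minimal standard-library sketch of its first three stages (entity match, graph expansion, budget-aware context assembly). It deliberately avoids LangGraph and any LLM call. The toy `GRAPH`, the `PAGES` bodies, the function names, and the character-based budget are assumptions for illustration, not the project's implementation.

```python
from collections import deque

# Toy knowledge graph: page name -> linked page names (hypothetical content).
GRAPH = {
    "exhausted CD8+ T cell": ["PDCD1", "tumor microenvironment"],
    "PDCD1": ["exhausted CD8+ T cell"],
    "tumor microenvironment": ["exhausted CD8+ T cell", "melanoma"],
    "melanoma": ["tumor microenvironment"],
}
# Stand-in page bodies; a real wiki would read these from Markdown files.
PAGES = {name: f"# {name}\n(summary of {name})" for name in GRAPH}


def match_entities(question):
    """Entity match: pages whose name occurs in the question (case-insensitive)."""
    lowered = question.lower()
    return [name for name in GRAPH if name.lower() in lowered]


def expand(seeds, max_hops=1):
    """Graph expansion: breadth-first walk, pages nearest the seeds come first."""
    seen = list(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not walk further than the hop limit
        for neighbour in GRAPH.get(node, []):
            if neighbour not in seen:
                seen.append(neighbour)
                queue.append((neighbour, depth + 1))
    return seen


def assemble_context(names, budget_chars):
    """Budget-aware assembly: add whole pages in order until the budget is spent."""
    chosen, used = [], 0
    for name in names:
        cost = len(PAGES[name])
        if used + cost > budget_chars:
            break
        chosen.append(name)
        used += cost
    return chosen


seeds = match_entities(
    "What markers distinguish exhausted CD8+ T cells in the tumor microenvironment?"
)
context = assemble_context(expand(seeds), budget_chars=150)
```

Because expansion returns pages in order of distance from the matched entities, a tight budget drops the most distant pages first. A production version would more likely count tokens than characters.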
Six entity types with defined schemas and relationships:
| Entity | Description | Pages |
|---|---|---|
| `cell_type` | Identity/state/context triplet model | 96 |
| `marker_gene` | Specificity/sensitivity/stability/detectability scoring | Planned |
| `tissue` | Cell type abundance mapping | Planned |
| `disease` | Cell type/gene associations | Planned |
| `method` | Experimental technique descriptions | Planned |
| `trajectory` | State transition paths | Planned |
Each page includes YAML frontmatter with structured metadata, evidence tier classification (Tier 1–5), and source citations.
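
As a rough illustration of that page header, the sketch below validates an evidence tier and renders a YAML frontmatter block. The field names (`title`, `entity_type`, `evidence_tier`, `sources`) are guesses; the real schema is presumably whatever the project's schema rules define. The renderer uses plain string formatting instead of a YAML library to stay dependency-free.

```python
from dataclasses import dataclass, field


@dataclass
class PageMeta:
    """Hypothetical frontmatter fields; not CellWiki's actual schema."""
    title: str
    entity_type: str
    evidence_tier: int
    sources: list = field(default_factory=list)

    def __post_init__(self):
        # Tiers are described as running from 1 to 5, so reject anything else.
        if not 1 <= self.evidence_tier <= 5:
            raise ValueError(f"evidence_tier must be 1-5, got {self.evidence_tier}")


def render_frontmatter(meta):
    """Render a YAML frontmatter block (assumes the title contains no double quotes)."""
    lines = [
        "---",
        f'title: "{meta.title}"',
        f"entity_type: {meta.entity_type}",
        f"evidence_tier: {meta.evidence_tier}",
    ]
    if meta.sources:
        lines.append("sources:")
        lines += [f"  - {source}" for source in meta.sources]
    else:
        lines.append("sources: []")
    lines.append("---")
    return "\n".join(lines)


meta = PageMeta("Exhausted CD8+ T cell", "cell_type", 2, ["paper_a"])
frontmatter = render_frontmatter(meta)
```

Validating the tier when the object is built means a malformed page fails at generation time, not later during an audit.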
git clone https://github.com/BrainStOrmics/CellWiki.git
cd CellWiki
pip install -e . # Core installation
pip install -e ".[dev]" # Include testing dependencies
pip install -e ".[analysis]" # Include graph analysis (NetworkX, Louvain)

- Python >= 3.10
- OpenAI-compatible API key (configured for qwen3.6-plus by default)
Set environment variables or create a .env file:
OPENAI_API_KEY=your_api_key
OPENAI_BASE_URL=your_api_base_url # Optional, for non-OpenAI providers
OPENAI_MODEL=qwen3.6-plus # Default model

cellwiki init

Downloads the Cell Ontology, initializes workspace directories, and generates the index.
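
For readers wiring the three `OPENAI_*` variables into their own scripts, a minimal sketch of how they might be resolved follows. `LLMSettings` and `load_settings` are hypothetical names, and CellWiki's own loader (including its `.env` handling) may behave differently. Only the variable names and the default model come from the settings listed earlier.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LLMSettings:
    """Resolved connection settings (hypothetical container)."""
    api_key: str
    base_url: Optional[str]
    model: str


def load_settings(env=None):
    """Read the OPENAI_* variables, applying the documented default model."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # OPENAI_BASE_URL is optional and only needed for non-OpenAI providers.
    base_url = env.get("OPENAI_BASE_URL", "").strip() or None
    model = env.get("OPENAI_MODEL", "").strip() or "qwen3.6-plus"
    return LLMSettings(api_key=api_key, base_url=base_url, model=model)


# Passing a plain dict keeps the example independent of the real environment.
settings = load_settings({"OPENAI_API_KEY": "your_api_key"})
```

Accepting an optional mapping makes the function easy to test without touching the process environment.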
# Interactive mode with human review
cellwiki ingest paper.pdf
# Batch mode (skip review)
cellwiki ingest --no-review paper.pdf

# v1: Simple lookup
cellwiki query "CD8+ T cell"
# v2: LangGraph-powered query with graph expansion
cellwiki query-v2 "What markers distinguish exhausted CD8+ T cells in the tumor microenvironment?"

cellwiki graph # Generate JSON, DOT, Mermaid, and SVG visualizations

cellwiki audit # Run audit checks (missing CL IDs, marker conflicts, orphan pages)
cellwiki review # Interactive review of audit issues

cellwiki lint-v2 # Run self-healing lint with auto-fix
cellwiki lint-v2 --no-fix # Report only, no fixes
cellwiki list # List all cell types
cellwiki status # Show wiki status panel
cellwiki diff # Show what the last paper changed

cellwiki memory list # List all checkpoints
cellwiki memory history # Show checkpoint history
cellwiki memory stats # Show checkpoint database stats
cellwiki memory cleanup # Clean old checkpoints

cellwiki debug graph # Visualize LangGraph structure
cellwiki debug trace # Trace execution flow
cellwiki debug history # Show state history

wiki/
├── index.md # Global index of all pages
├── overview.md # Knowledge base overview
├── log.md # Append-only operation log
├── contradictions.md # Contradiction tracking
├── statistics.md # Quality metrics
├── cell_types/ # Cell type pages (96 generated)
├── marker_genes/ # Marker gene pages
├── tissues/ # Tissue pages
├── diseases/ # Disease pages
├── methods/ # Method pages
├── trajectories/ # Trajectory pages
└── state_spaces/ # State space definitions
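
Given a layout like the one above, broken links and orphan pages can be detected with a simple pass over the Markdown. The sketch below is an assumption-laden illustration and not the `cellwiki audit` implementation: it assumes pages cross-reference each other with ordinary relative Markdown links ending in `.md`, and it works on an in-memory mapping of wiki-relative paths to page text. The file names in the example are invented.

```python
import posixpath
import re

# Matches [text](relative/path.md) with an optional #anchor (assumed link style).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)#\s]+\.md)(?:#[^)]*)?\)")


def check_links(pages):
    """Return (broken, orphans) for a mapping of wiki-relative path -> Markdown text."""
    broken, linked = [], set()
    for path, text in pages.items():
        base = posixpath.dirname(path)
        for target in LINK_RE.findall(text):
            if "://" in target:
                continue  # external URL, not a wiki page
            resolved = posixpath.normpath(posixpath.join(base, target))
            if resolved not in pages:
                broken.append((path, target))
            elif resolved != path:
                linked.add(resolved)  # self-links do not rescue a page from orphan status
    # index.md is the entry point, so it is never reported as an orphan.
    orphans = sorted(p for p in pages if p not in linked and p != "index.md")
    return broken, orphans


pages = {
    "index.md": "[Exhausted CD8+ T cell](cell_types/exhausted_cd8_t_cell.md)",
    "cell_types/exhausted_cd8_t_cell.md": (
        "Marker: [PDCD1](../marker_genes/PDCD1.md); "
        "context: [TME](../tissues/tme.md)"
    ),
    "marker_genes/PDCD1.md": "No outgoing links.",
    "methods/scrna_seq.md": "No outgoing links.",
}
broken, orphans = check_links(pages)
```

Here the link to `../tissues/tme.md` is reported as broken because no such page exists, and `methods/scrna_seq.md` is reported as an orphan because nothing links to it.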
| Package | Purpose |
|---|---|
| `openai` | LLM API calls |
| `pdfplumber` | PDF text extraction |
| `pydantic` | Data model validation |
| `langgraph` | State machine orchestration |
| `langchain-core` | LangGraph foundation |
| `networkx` | Knowledge graph analysis |
| `graphviz` | Graph visualization |
| `rich` | Terminal formatting |
This project is licensed under the MIT License — see the LICENSE file for details.