feat: integrate issue #32 field-gap additions (papers, datasets, software, databases, resources) #38
Merged
Add 27 primary-research references plus 9 reviews from the field-gap analysis in issue #32. Matrix cells and reference entries updated in lockstep against the post-#33 Taxonomy-linked matrix.
Rebased onto post-#33 main: the reference ids were renumbered from the original 198-233 to 200-235 to avoid colliding with #33's new #198 (Thevenot 2015) and #199 (Rohart 2017). Primary refs #200-225 plus CellFM #235; reviews #226-234. Citation text is unchanged from the prior caail-citation-reviewer pass; only the ids shifted. Matrix placements use the renamed "Bioprocess & Scale-Up" column and the Taxonomy.md axis links.
Fills two previously-empty cells: Reinforcement Learning x Bioprocess & Scale-Up (#200-203) and CNN x Cellular Engineering (#218, Yang L 2025).
Two placements were re-grounded against the methods via the caail-classification-reviewer:
- Xu et al. 2025 (#208) is a comparative Raman/capacitance viability-monitoring study; multi-anchored SVM + Ensemble Learning + Chemometrics x Bioprocess & Scale-Up (SVR and PLS are both implemented and competitive, per the Roell #32 multi-anchor precedent).
- Serpe et al. 2025 (#217) uses a Fast Random Forest pixel classifier (Trainable Weka), not a CNN as the title's imaging framing suggests; placed Ensemble Learning x Scaffolding.
CellFM (#235, Zeng et al. 2025) classified to Foundation Models: Masked Language Modeling x Cellular Engineering after confirming its masked gene-expression pretraining objective. ProCyon (#224) carries the 24-author bioRxiv v3 with APA 21+-author truncation.
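The renumbering described above (198-233 shifted to 200-235) can be sketched as a single-pass rewrite. This is only an illustration: the `<a id="N">` and `(#N)` shapes are assumptions about how Papers.md encodes reference ids, and `renumber` is a hypothetical helper, not the script used in this PR. It would apply only to the branch's own pre-rebase entries.

```python
import re

# Range and offset taken from the commit message: 198-233 -> 200-235.
OLD_FIRST, OLD_LAST, SHIFT = 198, 233, 2

# Assumed id shapes: an HTML anchor (<a id="N">) or a matrix link target ((#N)).
ID_PATTERN = re.compile(r'(<a id=")(\d+)(")|(\(#)(\d+)(\))')


def renumber(markdown: str) -> str:
    """Shift every in-range reference id by SHIFT in one pass over the text."""

    def shift(match: re.Match) -> str:
        # Exactly one of the two alternatives matched; pick its three groups.
        if match.group(2) is not None:
            prefix, number, suffix = match.group(1, 2, 3)
        else:
            prefix, number, suffix = match.group(4, 5, 6)
        n = int(number)
        if OLD_FIRST <= n <= OLD_LAST:
            n += SHIFT
        return f"{prefix}{n}{suffix}"

    return ID_PATTERN.sub(shift, markdown)
```

Doing the shift inside one `re.sub` callback matters here: two sequential find-and-replace passes (198 to 200, then 200 to 202) would move some ids twice, whereas the callback sees each original id exactly once.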
Add 15 single-cell / chromatin atlases, foundation-model corpora, and bioprocess-characterization datasets from issue #32, plus the BioML-bench benchmark companion (Papers.md ref #223):
- Sheep: hypoxia-acclimatization multi-omics atlas (PRJNA1001505) + new chromatin-accessibility cluster
- Duck: follicular-granulosa multi-omics map (PRJNA1254901) + new follicle-development cluster
- Fish: Atlantic salmon spleen single-nucleus atlas
- Crustacean: four shrimp / prawn single-cell atlases (new cluster)
- Mollusk: oyster gill + Pacific oyster PGC + scallop adductor scRNA (new single-cell cluster)
- HumanReference: Genecorpus-104M, Mouse-Genecorpus-20M, scPerturb
- CHOReference: pseudo-perfusion + DMFA datasets (new section)
- Benchmarks: BioML-bench (benchmark triangle for ref #223)
Every entry verified against source full text by caail-claim-reviewer; corrected the oyster species (Crassostrea hongkongensis, not the issue's C. virginica), removed unsupported counts and superlatives, and cited the now-published Biotechnology and Bioengineering version of the pseudo-perfusion characterization.
Add 16 tools from issue #32, verified against each repo's README and canonical paper by caail-claim-reviewer:
- Media Optimization & Cell Line Engineering: ESMFold, ColabFold, OmegaFold, RFdiffusion, ProteinMPNN, EvoDiff, Boltz, Chai-1, IgFold, AbLang (protein structure prediction, generative design, inverse folding for growth-factor and recombinant-media-protein engineering)
- Metabolic Modeling & Strain Design: moped, mergem
- Quantitative Genetics & Multi-Omics Analysis: OmicVerse, CellRank, CellChat, Giotto Suite
Repointed three moved repos to their current canonical homes (omicverse/omicverse, scverse/cellrank, giotto-suite/Giotto) and moped to gitlab.com/qtb-hhu/moped; softened Boltz ("approach" not "reproduce" AlphaFold3 accuracy) and other claims to exactly what the sources support. g2f from the issue is held out: its repo is dead and the CRAN package is archived (no live canonical home).
Add 11 databases from issue #32, verified against each canonical site by caail-claim-reviewer:
- Sequence, Genome & Expression: RNAcentral
- Protein & Structure: OMA Browser; ProteomeXchange (promoted from an inline PRIDE mention to a dedicated entry, placed beside PRIDE rather than the issue's "spectral databases" section, which is for spectral libraries)
- NEW "Glycomics & Glycoprotein Databases" section: GlyGen, GlyTouCan, GlyConnect
- Cell Line & Single-Cell: Tabula Sapiens; scBaseCount (dual-listed with Datasets/HumanReference.md)
- Pathways, Metabolism & Metabolic Models: PathBank, SABIO-RK, MetaNetX (promoted from an inline BiGG mention)
Corrected reviewer-flagged claims: the issue's dead Tabula Sapiens / scBaseCount URLs (repointed to live homes) and its wrong GlyGen attribution (University of Georgia + GWU, not "Boston College"); Tabula Sapiens counts (1.1M cells / 28 organs / 24 donors); PathBank scale (600,000+ pathways).
…primers (#32)
Add 12 OtherResources.md entries and 2 Primers/AI.md learning playlists from issue #32, verified against canonical sources by caail-claim-reviewer:
- NEW "University centers & consortia" subsection: iCAMP (UC Davis), NICA (USDA/Tufts), Bezos Centre for Sustainable Protein (Imperial)
- New Harvest initiatives: CAPE
- GFI initiatives: GFI Research Grant Program
- Courses: TU Delft cellular-agriculture advanced course; UMN Cellular Bioprocess Technology summer course
- Curated Bibliographies: awesome-CRISPR, awesome-protein-design-software, awesome-lipidomics (General bioinformatics) + a new "Scientific & biomolecule language models" subgroup (Awesome-Biomolecule-Language-Cross-Modeling, Awesome-Scientific-Language-Models)
- Primers/AI.md: Stanford CS230 (Learn the fundamentals), CS224N (Go deeper)
Corrected reviewer-flagged claims: CAPE is New-Harvest-led and multi-partner (not a two-party New Harvest / Alberta Innovates effort); GFI funds plant-based + fermentation + cultivated, not cultivated alone; dropped an unsupported "EMNLP 2024" venue tag; UMN ML workshop is an optional half-day add-on.
BioML-bench (Miller et al. 2025) is reference #225 in this branch's Papers.md; correct the Datasets/Benchmarks.md companion cross-ref (link text and anchor) from the stale #223.
Bump the real-corpus ground-truth assertions to match the integrated field-gap content (counts regenerated by `pnpm parse`): papers 195->231, software 70->86, databases 72->83, datasets 130->146. Also: refs-with-code-URL 71->75 (Cosenza 2021, ESCARGOT, PloverDB, BioML-bench); AI primer fundamentals playlists 5->6 (vitest) and the page-wide external playlist cards 5->7 (e2e: Stanford CS230 + CS224N); homepage Papers-count link 195->231. Fixture-based unit counts are unchanged.
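As a quick arithmetic cross-check, the count bumps above line up with the additions described in the earlier commits of this PR. The sketch below is illustrative only: `COUNTS`, `ADDED`, and `count_drift` are hypothetical names, and treating the BioML-bench benchmark companion as the sixteenth dataset entry is an inference from the commit messages, not something the source states outright.

```python
# Before -> after counts quoted in the commit message.
COUNTS = {
    "papers": (195, 231),
    "software": (70, 86),
    "databases": (72, 83),
    "datasets": (130, 146),
}

# Additions described in the earlier commits of this PR.
ADDED = {
    "papers": 27 + 9,    # primary-research references + reviews
    "software": 16,      # tools
    "databases": 11,
    "datasets": 15 + 1,  # 15 datasets + BioML-bench companion (assumed to count)
}


def count_drift(counts, added):
    """Return {corpus: unexplained delta} for bumps the additions do not explain."""
    drift = {}
    for name, (before, after) in counts.items():
        unexplained = (after - before) - added[name]
        if unexplained != 0:
            drift[name] = unexplained
    return drift


print(count_drift(COUNTS, ADDED))
```

Under these assumptions the function returns an empty dict, meaning every bump is fully accounted for by the listed additions.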
Add reference #236 (Tac, Gardner & Kuhl 2026, arXiv 2602.03092) and its matrix anchor in Deep Learning x Sensory Prediction. The paper builds a multinomial-diffusion + score-based generative model that learns the human palate from 500k+ recipes and designs burgers validated in a blinded 101-participant sensory evaluation. Placed via caail-classification-reviewer: diffusion/score-based models have no dedicated row, so the Deep Learning catch-all is the best-available home (GAN/VAE rejected — diffusion is neither); column is Sensory Prediction (not Media Optimization, which is culture medium). Bibliographic entry verified by caail-citation-reviewer.
Add the CRISPR/Cas9 GGTA1-knockout bovine satellite cell RNA-seq deposit (GEO GSE330550; D'Costa et al. 2026, bioRxiv 10.64898/2026.05.20.726299) to the 'Bovine satellite cells & cultured-meat differentiation' cluster and the inventory table in Datasets/Cow.md. The study disrupts GGTA1 (alpha-1,3-galactosyltransferase) in immortalized bovine satellite cells to remove the alpha-gal epitope behind Alpha-gal Syndrome, toward AGS-compatible cultivated beef. Dataset deposit only (the paper applies CRISPR/Cas9, not an AI/ML method, so it gets no Papers.md matrix entry); the grkenney/AGAL analysis package is left out as too thin to stand alone in Software.md. Claims verified against the paper's full text by caail-claim-reviewer.
…Models)
Add Context7 (upstash/context7) to Software.md's AI Agents & Foundation Models section: an open-source, MIT-licensed MCP server from Upstash that injects up-to-date, version-specific library docs into LLM prompts. Framed honestly as general developer infrastructure (not cell-ag-specific) relevant to CAAIL's AI-agent audience. Verified by caail-claim-reviewer.
…base
Add Pando (pando.foraybio.com) to Databases.md's Ecosystem & Industry Directories section as a commercial in vitro plant knowledge base / AI workspace from Foray Bioscience (plant-side cell-ag), with the intro updated to acknowledge it as a second non-GFI industry addition. The 'largest' claim is attributed to Foray, not asserted. Routed here per CAAIL's split (databases live in Databases.md). Verified by caail-claim-reviewer.
Update real-corpus ground-truth to the parsed actuals after adding the burger paper (#236), the GSE330550 bovine deposit, Context7, and Pando: papers 231->232, software 86->87, databases 83->84, datasets 146->147, and the homepage '231 Papers' -> '232 Papers' e2e assertion. Fixture counts untouched.
Wrap 115 bare backtick code-span accessions across the per-species Datasets pages in their canonical registry URLs, in the existing [`ACC`](url) form (clickable on GitHub + the site). Covers GEO (GSE), NCBI BioProject (PRJNA/PRJEB), SRA (SRP/SRX/SRR/SRS + legacy), GenBank nuccore (JR), PRIDE (PXD), CNCB GSA (CRA)/BioProject (PRJCA)/GVM/OMIX, and CNGB CNSA (CNP). Already-linked accessions (inside study-title URLs or already wrapped) are skipped via a negative-lookbehind guard; non-accession tokens (pcPigMNet2025, GSB-codes, B8/CHO) are excluded by per-namespace full-match patterns with digit-count minimums. Databases.md/Software.md were already fully linked (no change). Counts unchanged; build, 296 vitest, and 34 e2e all pass.
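The guard-and-match approach described above might look roughly like the following. It is a minimal sketch, not the script used in this PR: the registry table covers only three of the listed namespaces, the digit-count minimums are illustrative, the URL templates are assumed landing-page forms, and `link_accessions` is a hypothetical name.

```python
import re

# Hypothetical per-namespace table: full-match pattern -> URL template.
REGISTRIES = [
    (re.compile(r"GSE\d{3,}"),
     "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={acc}"),
    (re.compile(r"PRJ(?:NA|EB)\d{4,}"),
     "https://www.ncbi.nlm.nih.gov/bioproject/{acc}"),
    (re.compile(r"PXD\d{6}"),
     "https://www.ebi.ac.uk/pride/archive/projects/{acc}"),
]

# A backtick code span NOT preceded by "[", i.e. not already in [`ACC`](url) form.
BARE_SPAN = re.compile(r"(?<!\[)`([A-Za-z0-9_/-]+)`")


def link_accessions(markdown: str) -> str:
    """Wrap bare accession code spans in their registry URL; leave the rest alone."""

    def wrap(match: re.Match) -> str:
        token = match.group(1)
        for pattern, template in REGISTRIES:
            # fullmatch keeps look-alike tokens (e.g. B8/CHO) from being linked.
            if pattern.fullmatch(token):
                return f"[`{token}`]({template.format(acc=token)})"
        return match.group(0)

    return BARE_SPAN.sub(wrap, markdown)
```

Because the lookbehind skips spans that already sit inside a link, running the function a second time over its own output should leave the text unchanged, which keeps such a pass safe to re-run.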
Rename the 'Tools' stat label to 'Software' in the homepage hero (Hero.astro) and the By the Numbers dashboard (MetricsDashboard.astro) so the count is labelled consistently with the Software section, nav, and page.
'87 Software' read awkwardly; label the count 'Software tools' on the homepage hero and the By the Numbers dashboard so it reads '87 Software tools' (consistent with the two-word 'Research areas' stat).
Summary
Integrates the 92 field-gap candidate additions from the issue #32 analysis into the CAAIL canonical files, following the `zotero-to-caail-sync` workflow with mandatory adversarial reviewer subagents. The branch was rebased onto post-#33 `main` and reconciled against the new `Taxonomy.md`-linked matrix.
Closes #32.
What's included (by category)
- `Papers.md`: 27 primary-research references (#200–225, #235) + 9 reviews (#226–234)
- `Datasets/`
- `Software.md`
- `Databases.md`
- `OtherResources.md` + `Primers/`
Corpus counts: papers 195 → 231, software 70 → 86, databases 72 → 83, datasets 130 → 146.
Rebase onto post-#33 `main`
PR #33 (feat: matrix classification audit, Taxonomy, and Papers Explorer upgrades) merged after this branch was cut and rewrote the `Papers.md` matrix (Taxonomy-linked axes, the Bioprocess control → Bioprocess & Scale-Up rename, a new Chemometrics row, and reference IDs #198/#199). Reconciliation:
- Reference IDs renumbered 198–233 → 200–235 to avoid colliding with #33's new #198 (Thévenot) / #199 (Rohart). Citation text is unchanged from the prior `caail-citation-reviewer` pass; only the `<a id>` numbers and matrix `(#N)` anchors shifted.
- Fills two previously-empty cells: Reinforcement Learning × Bioprocess & Scale-Up (#200–203), CNN × Cellular Engineering (#218).
Classification re-grounding (`caail-classification-reviewer`)
Two placements were re-verified against the papers' own methods under the stricter post-#33 taxonomy:
- Xu et al. 2025 (#208): comparative Raman/capacitance viability-monitoring study; multi-anchored SVM + Ensemble Learning + Chemometrics × Bioprocess & Scale-Up (SVR and PLS are both genuinely implemented and competitive, per the Roell #32 multi-anchor precedent).
- Serpe et al. 2025 (#217): corrected from CNN × Scaffolding to Ensemble Learning × Scaffolding. The imaging ML is a Trainable-Weka Fast Random Forest pixel classifier, not a CNN (the only CNN mention is a related-work citation). Column stays Scaffolding (the deliverable is plant-scaffold quality assessment; no bioreactor was used).
CellFM (#235) classified to Foundation Models: Masked Language Modeling × Cellular Engineering (masked gene-expression pretraining objective).
Curation deviations from the raw issue list (carried from the #32 checkpoints)
- `g2f` held out of `Software.md`: repo is dead and the CRAN package is archived (no live canonical home).
- ProteomeXchange promoted from an inline PRIDE mention to a dedicated `Databases.md` entry, placed beside PRIDE rather than the issue's "spectral databases" section (which is reserved for spectral libraries).
- The BioML-bench companion cross-ref in `Datasets/Benchmarks.md` points at `Papers.md` #225.
Verification
- `pnpm --dir site lint:papers` → 0 errors (matrix ↔ reference integrity OK).
- `pnpm --dir site build` → succeeds (Taxonomy build guard + count-drift guards pass).
- `pnpm --dir site test` → 296 passed; `pnpm --dir site test:e2e` → 34 passed (parser + e2e ground-truth synced to the new counts).
Every drafted entry was verified before commit by the read-only adversarial reviewer subagents (`caail-citation-reviewer`, `caail-claim-reviewer`, `caail-classification-reviewer`); the agent that wrote an entry never reviewed it.
Update: 4 additional new-since-#32 Zotero items
A follow-up Zotero ⇄ repo sync pass appended four resources added to the caail Zotero library after the #32 set above (Papers IDs continue from 236):
- `feat(papers)`: Tac, Gardner & Kuhl 2026, Generative AI creates delicious, sustainable, and nutritious burgers (arXiv 2602.03092), ref #236, anchored in Deep Learning × Sensory Prediction. The model is a multinomial-diffusion + score-based generative architecture; the classification reviewer flagged a taxonomy gap (no diffusion/score-based row exists) and placed it in the Deep Learning catch-all as the best available row. Worth considering a dedicated "Diffusion & Score-Based Generative Models" row later.
- `feat(data)`: CRISPR/Cas9 GGTA1-knockout bovine satellite-cell RNA-seq deposit (GEO GSE330550; D'Costa et al. 2026) → `Datasets/Cow.md`. Dataset-only (the paper applies CRISPR, not an AI/ML method, so no matrix entry); the thin `grkenney/AGAL` R package was left out.
- `feat(software)`: Context7 (`upstash/context7`, MIT) → `Software.md` § AI Agents & Foundation Models, framed honestly as general AI-agent developer infrastructure (not cell-ag-specific).
- `feat(databases)`: Pando (Foray Bioscience) → `Databases.md` § Ecosystem & Industry Directories, a commercial in vitro plant knowledge base (plant-side cell-ag); the "largest" claim is attributed to Foray.
- `feat(data)`: hyperlinked 115 bare accession IDs (GSE / PRJNA / SRA / CRA / PRJCA / PXD / CNP / GVM / OMIX / JR…) across the per-species `Datasets/*.md` pages to their canonical registries (GitHub + site), skipping already-linked ones.
Counts re-synced: papers 231→232, software 86→87, databases 83→84, datasets 146→147. `lint:papers` (0 errors), build, 296 vitest, and 34 e2e all pass. Each entry was verified by the read-only reviewer subagents before commit.
🤖 Generated with Claude Code