feat: integrate issue #32 field-gap additions (papers, datasets, software, databases, resources) by benjibromberg · Pull Request #38 · tucca-cellag/caail

benjibromberg · 2026-06-10T22:55:50Z

Summary

Integrates the 92 field-gap candidate additions from the issue #32 analysis into the CAAIL canonical files, following the zotero-to-caail-sync workflow with mandatory adversarial reviewer subagents. The branch was rebased onto post-#33 main and reconciled against the new Taxonomy.md-linked matrix.

Closes #32.

What's included (by category)

| File(s) | Added | Notes |
| --- | --- | --- |
| `Papers.md` | 36 papers: 27 primary refs (`#200–225`, `#235`) + 9 reviews (`#226–234`) | matrix cells + reference entries in lockstep |
| `Datasets/` | 16 atlases / FM corpora / bioprocess + benchmark deposits | across Sheep, Duck, Fish, Crustacean, Mollusk, Human/CHO reference, Benchmarks |
| `Software.md` | 16 open-source tools | protein structure/design, metabolic modeling, multi-omics |
| `Databases.md` | 11 repositories | incl. a new Glycomics & Glycoprotein section |
| `OtherResources.md` + `Primers/` | 14 initiatives, courses, bibliographies, primers | |

Corpus counts: papers 195 → 231, software 70 → 86, databases 72 → 83, datasets 130 → 146.

Rebase onto post-#33 `main`

PR #33 merged after this branch was cut and rewrote the Papers.md matrix (Taxonomy-linked axes, the Bioprocess control → Bioprocess & Scale-Up rename, a new Chemometrics row, and reference IDs #198/#199). Reconciliation:

Renumbered 198–233 → 200–235 to avoid colliding with the new #198 (Thévenot) / #199 (Rohart) from PR #33 (feat: matrix classification audit, Taxonomy, and Papers Explorer upgrades). Citation text is unchanged from the prior caail-citation-reviewer pass; only the `<a id>` numbers and matrix (#N) anchors shifted.
Re-applied every matrix anchor into #33's reclassified cells (append, not overwrite) using the renamed Bioprocess & Scale-Up column and Taxonomy axis links (no Wikipedia links reintroduced).
Two empty cells filled: Reinforcement Learning × Bioprocess & Scale-Up (#200–203), CNN × Cellular Engineering (#218).
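
The renumbering described above can be sketched as one pass per pattern that shifts only the IDs in the colliding range. This is a minimal illustration rather than the script used in this PR: the `<a id="N">` and `(#N)` shapes are assumed from the description, and `shiftRefIds` is a hypothetical helper.

```typescript
// Hypothetical helper: shift reference IDs in [lo, hi] by `delta`.
// Assumes reference entries look like <a id="N"> and matrix anchors like (#N).
function shiftRefIds(text: string, lo: number, hi: number, delta: number): string {
  const shift = (n: string): string => {
    const id = Number(n);
    return String(id >= lo && id <= hi ? id + delta : id);
  };
  // Each match is rewritten exactly once, so 198 -> 200 is never shifted again.
  return text
    .replace(/<a id="(\d+)">/g, (_m: string, n: string) => `<a id="${shift(n)}">`)
    .replace(/\(#(\d+)\)/g, (_m: string, n: string) => `(#${shift(n)})`);
}

// Made-up sample line; IDs outside 198-233 are left alone.
const sampleLine = '[A](#197) [B](#198) <a id="233"></a>';
console.log(shiftRefIds(sampleLine, 198, 233, 2));
```

Using a callback replacement instead of chained per-ID substitutions avoids double-shifting an ID that lands back inside the range.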

Classification re-grounding (`caail-classification-reviewer`)

Two placements were re-verified against the papers' own methods under the stricter post-#33 taxonomy:

Xu et al. 2025 (#208) — comparative Raman/capacitance viability-monitoring study; multi-anchored SVM + Ensemble Learning + Chemometrics × Bioprocess & Scale-Up (SVR and PLS are both genuinely implemented and competitive, per the Roell #32 multi-anchor precedent).
Serpe et al. 2025 (#217) — corrected from CNN × Scaffolding to Ensemble Learning × Scaffolding: the imaging ML is a Trainable-Weka Fast Random Forest pixel classifier, not a CNN (the only CNN mention is a related-work citation). Column stays Scaffolding (the deliverable is plant-scaffold quality assessment; no bioreactor was used).
CellFM (#235) classified to Foundation Models: Masked Language Modeling × Cellular Engineering (masked gene-expression pretraining objective).
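
As a rough sketch of what "multi-anchored" and "append, not overwrite" mean for the matrix, the snippet below models each cell as a list of reference IDs. The `Matrix` type, `cellKey`, and `anchorRef` are hypothetical names for illustration, and the pre-existing ID `1` is made up; the real matrix lives in Markdown in `Papers.md`.

```typescript
// Hypothetical model: one cell per "row × column", holding reference IDs.
type Matrix = Map<string, number[]>;

const cellKey = (row: string, column: string): string => `${row} × ${column}`;

// Append `refId` to every listed row of one column, keeping existing anchors.
function anchorRef(matrix: Matrix, rows: string[], column: string, refId: number): void {
  for (const row of rows) {
    const key = cellKey(row, column);
    const cell = matrix.get(key) ?? [];
    if (cell.indexOf(refId) === -1) cell.push(refId); // idempotent append
    cell.sort((a, b) => a - b);
    matrix.set(key, cell);
  }
}

const matrix: Matrix = new Map();
// Made-up existing anchor, to show that it survives the append.
matrix.set(cellKey("Chemometrics", "Bioprocess & Scale-Up"), [1]);

// Xu et al. 2025 (#208) is anchored in three method rows of one column.
anchorRef(matrix, ["SVM", "Ensemble Learning", "Chemometrics"], "Bioprocess & Scale-Up", 208);
console.log(matrix.get(cellKey("Chemometrics", "Bioprocess & Scale-Up")));
```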

Curation deviations from the raw issue list (carried from the #32 checkpoints)

g2f held out of Software.md — repo is dead and the CRAN package is archived (no live canonical home).
ProteomeXchange promoted from an inline PRIDE mention to a dedicated Databases.md entry, placed beside PRIDE rather than the issue's "spectral databases" section (which is reserved for spectral libraries).
Reviewer-flagged source corrections applied (oyster species → Crassostrea hongkongensis; dead Tabula Sapiens / scBaseCount URLs repointed; GlyGen attribution; unsupported counts/superlatives removed). The BioML-bench companion in Datasets/Benchmarks.md points at Papers.md #225.

Verification

`pnpm --dir site lint:papers` → 0 errors (matrix ↔ reference integrity OK).
`pnpm --dir site build` → succeeds (Taxonomy build guard + count-drift guards pass).
`pnpm --dir site test` → 296 passed; `pnpm --dir site test:e2e` → 34 passed (parser + e2e ground-truth synced to the new counts).
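
For readers unfamiliar with a "matrix ↔ reference integrity" check, a two-way version might look like the sketch below. It is not the implementation behind `lint:papers`, which this PR does not show; the anchor shapes (`(#N)` in the matrix, `<a id="N">` on reference entries) and the names `collectIds` and `checkIntegrity` are assumptions.

```typescript
interface IntegrityReport {
  missingReferences: number[];    // anchored in the matrix, no reference entry
  unanchoredReferences: number[]; // reference entry never anchored in the matrix
}

// Collect the unique numeric IDs captured by a global regex, in ascending order.
function collectIds(text: string, pattern: RegExp): number[] {
  const ids: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const id = Number(match[1]);
    if (ids.indexOf(id) === -1) ids.push(id);
  }
  return ids.sort((a, b) => a - b);
}

function checkIntegrity(markdown: string): IntegrityReport {
  const anchors = collectIds(markdown, /\(#(\d+)\)/g);
  const entries = collectIds(markdown, /<a id="(\d+)">/g);
  return {
    missingReferences: anchors.filter((id) => entries.indexOf(id) === -1),
    unanchoredReferences: entries.filter((id) => anchors.indexOf(id) === -1),
  };
}

// Made-up fragment: #201 has no entry, #202 is never anchored.
const fragment = '[X](#200) [Y](#201)\n<a id="200"></a> <a id="202"></a>';
console.log(checkIntegrity(fragment));
```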

Every drafted entry was verified before commit by the read-only adversarial reviewer subagents (caail-citation-reviewer, caail-claim-reviewer, caail-classification-reviewer); the agent that wrote an entry never reviewed it.

Update — 4 additional new-since-#32 Zotero items

A follow-up Zotero ⇄ repo sync pass appended four resources added to the caail Zotero library after the #32 set above (Papers.md IDs continue from 236) and hyperlinked the bare accession IDs on the Datasets pages:

feat(papers) — Tac, Gardner & Kuhl 2026, Generative AI creates delicious, sustainable, and nutritious burgers (arXiv 2602.03092), ref #236, anchored in Deep Learning × Sensory Prediction. The model is a multinomial-diffusion + score-based generative architecture; the classification reviewer flagged a taxonomy gap (no diffusion/score-based row exists) and placed it in the Deep Learning catch-all as the best available row — worth considering a dedicated "Diffusion & Score-Based Generative Models" row later.
feat(data) — CRISPR/Cas9 GGTA1-knockout bovine satellite-cell RNA-seq deposit (GEO GSE330550; D'Costa et al. 2026) → Datasets/Cow.md. Dataset-only (the paper applies CRISPR, not an AI/ML method, so no matrix entry); the thin grkenney/AGAL R package was left out.
feat(software) — Context7 (upstash/context7, MIT) → Software.md § AI Agents & Foundation Models, framed honestly as general AI-agent developer infrastructure (not cell-ag-specific).
feat(databases) — Pando (Foray Bioscience) → Databases.md § Ecosystem & Industry Directories, a commercial in vitro plant knowledge base (plant-side cell-ag); the "largest" claim is attributed to Foray.
feat(data) — hyperlinked 115 bare accession IDs (GSE / PRJNA / SRA / CRA / PRJCA / PXD / CNP / GVM / OMIX / JR…) across the per-species Datasets/*.md pages to their canonical registries (GitHub + site), skipping already-linked ones.

Counts re-synced: papers 231→232, software 86→87, databases 83→84, datasets 146→147. lint:papers (0 errors), build, 296 vitest, and 34 e2e all pass. Each entry was verified by the read-only reviewer subagents before commit.
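
The count arithmetic across both passes can be checked mechanically. The sketch below uses only the numbers stated in this PR; the `Counts` type, `addCounts`, and `findDrift` are hypothetical names, and the repository's actual count-drift guard may work differently.

```typescript
type Counts = { papers: number; software: number; databases: number; datasets: number };

// Numbers taken from this PR description.
const baseline: Counts = { papers: 195, software: 70, databases: 72, datasets: 130 };
const fieldGapAdditions: Counts = { papers: 36, software: 16, databases: 11, datasets: 16 };
const followUpAdditions: Counts = { papers: 1, software: 1, databases: 1, datasets: 1 };

function addCounts(a: Counts, b: Counts): Counts {
  return {
    papers: a.papers + b.papers,
    software: a.software + b.software,
    databases: a.databases + b.databases,
    datasets: a.datasets + b.datasets,
  };
}

// Report every category whose parsed count differs from the expected one.
function findDrift(expected: Counts, actual: Counts): string[] {
  const keys: (keyof Counts)[] = ["papers", "software", "databases", "datasets"];
  return keys
    .filter((k) => expected[k] !== actual[k])
    .map((k) => `${k}: expected ${expected[k]}, parsed ${actual[k]}`);
}

const expected = addCounts(addCounts(baseline, fieldGapAdditions), followUpAdditions);
console.log(expected); // papers 232, software 87, databases 84, datasets 147
```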

🤖 Generated with Claude Code

Add 27 primary-research references plus 9 reviews from the field-gap analysis in issue #32. Matrix cells and reference entries updated in lockstep against the post-#33 Taxonomy-linked matrix.

Rebased onto post-#33 main: the reference ids were renumbered from the original 198-233 to 200-235 to avoid colliding with #33's new #198 (Thevenot 2015) and #199 (Rohart 2017). Primary refs #200-225 plus CellFM #235; reviews #226-234. Citation text is unchanged from the prior caail-citation-reviewer pass; only the ids shifted. Matrix placements use the renamed "Bioprocess & Scale-Up" column and the Taxonomy.md axis links. Fills two previously-empty cells: Reinforcement Learning x Bioprocess & Scale-Up (#200-203) and CNN x Cellular Engineering (#218, Yang L 2025).

Two placements were re-grounded against the methods via the caail-classification-reviewer:

- Xu et al. 2025 (#208) is a comparative Raman/capacitance viability-monitoring study; multi-anchored SVM + Ensemble Learning + Chemometrics x Bioprocess & Scale-Up (SVR and PLS are both implemented and competitive, per the Roell #32 multi-anchor precedent).
- Serpe et al. 2025 (#217) uses a Fast Random Forest pixel classifier (Trainable Weka), not a CNN as the title's imaging framing suggests; placed Ensemble Learning x Scaffolding.

CellFM (#235, Zeng et al. 2025) classified to Foundation Models: Masked Language Modeling x Cellular Engineering after confirming its masked gene-expression pretraining objective. ProCyon (#224) carries the 24-author bioRxiv v3 with APA 21+-author truncation.
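
For context on the "APA 21+-author truncation": APA 7th edition lists every author up to 20, and for 21 or more lists the first 19, an ellipsis, and then the final author with no ampersand. A minimal sketch follows; `formatApaAuthors` is a hypothetical helper, not part of the CAAIL tooling, and the placeholder author names are invented.

```typescript
// Format an author list for an APA 7 reference entry.
// Each element is assumed to be pre-formatted, e.g. "Smith, J. A."
function formatApaAuthors(authors: string[]): string {
  const n = authors.length;
  if (n <= 1) return authors.join("");
  const last = authors[n - 1];
  if (n <= 20) {
    // Up to 20 authors: list all, with ", &" before the final author.
    return `${authors.slice(0, n - 1).join(", ")}, & ${last}`;
  }
  // 21 or more: first 19, an ellipsis, then the final author (no ampersand).
  return `${authors.slice(0, 19).join(", ")}, ... ${last}`;
}

// Placeholder names standing in for a 24-author paper.
const placeholderAuthors: string[] = [];
for (let i = 1; i <= 24; i++) placeholderAuthors.push(`Author${i}, A.`);
console.log(formatApaAuthors(placeholderAuthors));
```

With 24 authors, authors 20 through 23 are dropped from the printed list and only the first 19 and the 24th appear.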

Add 15 single-cell / chromatin atlases, foundation-model corpora, and bioprocess-characterization datasets from issue #32, plus the BioML-bench benchmark companion (Papers.md ref #223):

- Sheep: hypoxia-acclimatization multi-omics atlas (PRJNA1001505) + new chromatin-accessibility cluster
- Duck: follicular-granulosa multi-omics map (PRJNA1254901) + new follicle-development cluster
- Fish: Atlantic salmon spleen single-nucleus atlas
- Crustacean: four shrimp / prawn single-cell atlases (new cluster)
- Mollusk: oyster gill + Pacific oyster PGC + scallop adductor scRNA (new single-cell cluster)
- HumanReference: Genecorpus-104M, Mouse-Genecorpus-20M, scPerturb
- CHOReference: pseudo-perfusion + DMFA datasets (new section)
- Benchmarks: BioML-bench (benchmark triangle for ref #223)

Every entry verified against source full text by caail-claim-reviewer; corrected the oyster species (Crassostrea hongkongensis, not the issue's C. virginica), removed unsupported counts and superlatives, and cited the now-published Biotechnology and Bioengineering version of the pseudo-perfusion characterization.

Add 16 tools from issue #32, verified against each repo's README and canonical paper by caail-claim-reviewer:

- Media Optimization & Cell Line Engineering: ESMFold, ColabFold, OmegaFold, RFdiffusion, ProteinMPNN, EvoDiff, Boltz, Chai-1, IgFold, AbLang (protein structure prediction, generative design, inverse folding for growth-factor and recombinant-media-protein engineering)
- Metabolic Modeling & Strain Design: moped, mergem
- Quantitative Genetics & Multi-Omics Analysis: OmicVerse, CellRank, CellChat, Giotto Suite

Repointed three moved repos to their current canonical homes (omicverse/omicverse, scverse/cellrank, giotto-suite/Giotto) and moped to gitlab.com/qtb-hhu/moped; softened Boltz ("approach" not "reproduce" AlphaFold3 accuracy) and other claims to exactly what the sources support. g2f from the issue is held out: its repo is dead and the CRAN package is archived (no live canonical home).

Add 11 databases from issue #32, verified against each canonical site by caail-claim-reviewer:

- Sequence, Genome & Expression: RNAcentral
- Protein & Structure: OMA Browser; ProteomeXchange (promoted from an inline PRIDE mention to a dedicated entry, placed beside PRIDE rather than the issue's "spectral databases" section, which is for spectral libraries)
- NEW "Glycomics & Glycoprotein Databases" section: GlyGen, GlyTouCan, GlyConnect
- Cell Line & Single-Cell: Tabula Sapiens; scBaseCount (dual-listed with Datasets/HumanReference.md)
- Pathways, Metabolism & Metabolic Models: PathBank, SABIO-RK, MetaNetX (promoted from an inline BiGG mention)

Corrected reviewer-flagged claims: the issue's dead Tabula Sapiens / scBaseCount URLs (repointed to live homes) and its wrong GlyGen attribution (University of Georgia + GWU, not "Boston College"); Tabula Sapiens counts (1.1M cells / 28 organs / 24 donors); PathBank scale (600,000+ pathways).

…primers (#32)

Add 12 OtherResources.md entries and 2 Primers/AI.md learning playlists from issue #32, verified against canonical sources by caail-claim-reviewer:

- NEW "University centers & consortia" subsection: iCAMP (UC Davis), NICA (USDA/Tufts), Bezos Centre for Sustainable Protein (Imperial)
- New Harvest initiatives: CAPE
- GFI initiatives: GFI Research Grant Program
- Courses: TU Delft cellular-agriculture advanced course; UMN Cellular Bioprocess Technology summer course
- Curated Bibliographies: awesome-CRISPR, awesome-protein-design-software, awesome-lipidomics (General bioinformatics) + a new "Scientific & biomolecule language models" subgroup (Awesome-Biomolecule-Language-Cross-Modeling, Awesome-Scientific-Language-Models)
- Primers/AI.md: Stanford CS230 (Learn the fundamentals), CS224N (Go deeper)

Corrected reviewer-flagged claims: CAPE is New-Harvest-led and multi-partner (not a two-party New Harvest / Alberta Innovates effort); GFI funds plant-based + fermentation + cultivated, not cultivated alone; dropped an unsupported "EMNLP 2024" venue tag; UMN ML workshop is an optional half-day add-on.

Bump the real-corpus ground-truth assertions to match the integrated field-gap content (counts regenerated by `pnpm parse`): papers 195->231, software 70->86, databases 72->83, datasets 130->146. Also: refs-with-code-URL 71->75 (Cosenza 2021, ESCARGOT, PloverDB, BioML-bench); AI primer fundamentals playlists 5->6 (vitest) and the page-wide external playlist cards 5->7 (e2e: Stanford CS230 + CS224N); homepage Papers-count link 195->231. Fixture-based unit counts are unchanged.

Add reference #236 (Tac, Gardner & Kuhl 2026, arXiv 2602.03092) and its matrix anchor in Deep Learning x Sensory Prediction. The paper builds a multinomial-diffusion + score-based generative model that learns the human palate from 500k+ recipes and designs burgers validated in a blinded 101-participant sensory evaluation. Placed via caail-classification-reviewer: diffusion/score-based models have no dedicated row, so the Deep Learning catch-all is the best-available home (GAN/VAE rejected — diffusion is neither); column is Sensory Prediction (not Media Optimization, which is culture medium). Bibliographic entry verified by caail-citation-reviewer.

Add the CRISPR/Cas9 GGTA1-knockout bovine satellite cell RNA-seq deposit (GEO GSE330550; D'Costa et al. 2026, bioRxiv 10.64898/2026.05.20.726299) to the 'Bovine satellite cells & cultured-meat differentiation' cluster and the inventory table in Datasets/Cow.md. The study disrupts GGTA1 (alpha-1,3-galactosyltransferase) in immortalized bovine satellite cells to remove the alpha-gal epitope behind Alpha-gal Syndrome, toward AGS-compatible cultivated beef. Dataset deposit only (the paper applies CRISPR/Cas9, not an AI/ML method, so it gets no Papers.md matrix entry); the grkenney/AGAL analysis package is left out as too thin to stand alone in Software.md. Claims verified against the paper's full text by caail-claim-reviewer.

…Models)

Add Context7 (upstash/context7) to Software.md's AI Agents & Foundation Models section: an open-source, MIT-licensed MCP server from Upstash that injects up-to-date, version-specific library docs into LLM prompts. Framed honestly as general developer infrastructure (not cell-ag-specific) relevant to CAAIL's AI-agent audience. Verified by caail-claim-reviewer.

…base

Add Pando (pando.foraybio.com) to Databases.md's Ecosystem & Industry Directories section as a commercial in vitro plant knowledge base / AI workspace from Foray Bioscience (plant-side cell-ag), with the intro updated to acknowledge it as a second non-GFI industry addition. The 'largest' claim is attributed to Foray, not asserted. Routed here per CAAIL's split (databases live in Databases.md). Verified by caail-claim-reviewer.

Update real-corpus ground-truth to the parsed actuals after adding the burger paper (#236), the GSE330550 bovine deposit, Context7, and Pando: papers 231->232, software 86->87, databases 83->84, datasets 146->147, and the homepage '231 Papers' -> '232 Papers' e2e assertion. Fixture counts untouched.

Wrap 115 bare backtick code-span accessions across the per-species Datasets pages in their canonical registry URLs, in the existing [`ACC`](url) form (clickable on GitHub + the site). Covers GEO (GSE), NCBI BioProject (PRJNA/PRJEB), SRA (SRP/SRX/SRR/SRS + legacy), GenBank nuccore (JR), PRIDE (PXD), CNCB GSA (CRA)/BioProject (PRJCA)/GVM/OMIX, and CNGB CNSA (CNP). Already-linked accessions (inside study-title URLs or already wrapped) are skipped via a negative-lookbehind guard; non-accession tokens (pcPigMNet2025, GSB-codes, B8/CHO) are excluded by per-namespace full-match patterns with digit-count minimums. Databases.md/Software.md were already fully linked (no change). Counts unchanged; build, 296 vitest, and 34 e2e all pass.
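
The linking pass described above could be approximated as follows. This is a sketch under stated assumptions, not the script that produced the commit: only three namespaces are shown, the digit-count minimums are guesses, and `NAMESPACES`, `BARE_ACCESSION_SPAN`, and `linkAccessions` are hypothetical names. The registry URL shapes are the public GEO, NCBI BioProject, and PRIDE ones.

```typescript
interface Namespace {
  pattern: RegExp;               // full-match pattern with a digit-count minimum
  url: (acc: string) => string;  // canonical registry URL for the accession
}

// Illustrative subset; the digit minimums are assumptions.
const NAMESPACES: Namespace[] = [
  { pattern: /^GSE\d{3,}$/, url: (acc) => `https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=${acc}` },
  { pattern: /^PRJ(NA|EB)\d{4,}$/, url: (acc) => `https://www.ncbi.nlm.nih.gov/bioproject/${acc}` },
  { pattern: /^PXD\d{6}$/, url: (acc) => `https://www.ebi.ac.uk/pride/archive/projects/${acc}` },
];

// Match a code span holding letters followed by digits, unless the opening
// backtick is preceded by "[" (negative lookbehind: already a link's text).
const BARE_ACCESSION_SPAN = new RegExp("(?<!\\[)`([A-Z]+\\d+)`", "g");

function linkAccessions(markdown: string): string {
  return markdown.replace(BARE_ACCESSION_SPAN, (span: string, token: string) => {
    for (const ns of NAMESPACES) {
      if (ns.pattern.test(token)) return `[\`${token}\`](${ns.url(token)})`;
    }
    return span; // non-accession tokens such as `B8` are left untouched
  });
}

console.log(linkAccessions("Deposit `GSE330550`; medium `B8`."));
```

Anchoring each namespace pattern with `^…$` is what keeps look-alike tokens out: a code span is linked only when the whole token is a well-formed accession for some registry.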

benjibromberg added 15 commits June 10, 2026 18:43

fix(data): point BioML-bench companion to Papers.md #225

9358832

BioML-bench (Miller et al. 2025) is reference #225 in this branch's Papers.md; correct the Datasets/Benchmarks.md companion cross-ref (link text and anchor) from the stale #223.

fix(site): label the software stat "Software" on home + dashboard

6a51e9e

Rename the 'Tools' stat label to 'Software' in the homepage hero (Hero.astro) and the By the Numbers dashboard (MetricsDashboard.astro) so the count is labelled consistently with the Software section, nav, and page.

fix(site): clarify the software stat label to "Software tools"

ac29f4e

'87 Software' read awkwardly; label the count 'Software tools' on the homepage hero and the By the Numbers dashboard so it reads '87 Software tools' (consistent with the two-word 'Research areas' stat).

benjibromberg merged commit efdd2f8 into main Jun 12, 2026
1 check passed

benjibromberg deleted the worktree-feat+issue-32-field-gap-additions branch June 12, 2026 16:10
