feat: integrate issue #32 field-gap additions (papers, datasets, software, databases, resources) #38

Merged: benjibromberg merged 15 commits into main from worktree-feat+issue-32-field-gap-additions on Jun 12, 2026

Conversation

@benjibromberg benjibromberg commented Jun 10, 2026

Member

Summary

Integrates the 92 field-gap candidate additions from the issue #32 analysis into the CAAIL canonical files, following the zotero-to-caail-sync workflow with mandatory adversarial reviewer subagents. The branch was rebased onto post-#33 main and reconciled against the new Taxonomy.md-linked matrix.

Closes #32.

What's included (by category)

| File(s) | Added | Notes |
|---|---|---|
| Papers.md | 36 papers: 27 primary refs (#200–225, #235) + 9 reviews (#226–234) | matrix cells + reference entries in lockstep |
| Datasets/ | 16 | atlases / FM corpora / bioprocess + benchmark deposits across Sheep, Duck, Fish, Crustacean, Mollusk, Human/CHO reference, Benchmarks |
| Software.md | 16 open-source tools | protein structure/design, metabolic modeling, multi-omics |
| Databases.md | 11 repositories | incl. a new Glycomics & Glycoprotein section |
| OtherResources.md + Primers/ | 14 | initiatives, courses, bibliographies, primers |

Corpus counts: papers 195 → 231, software 70 → 86, databases 72 → 83, datasets 130 → 146.
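
The corpus counts can be cross-checked against the per-file additions with a few lines of arithmetic. This is only a minimal sketch: the numbers are copied from this summary, and `count_drift` is a hypothetical helper, not the site's actual count-drift guard.

```python
# (before, after) corpus counts as quoted in this PR summary.
corpus_counts = {
    "papers": (195, 231),
    "software": (70, 86),
    "databases": (72, 83),
    "datasets": (130, 146),
}

# Additions per corpus, taken from the "What's included" table.
added = {"papers": 36, "software": 16, "databases": 11, "datasets": 16}


def count_drift(counts, additions):
    """Return {corpus: (expected_after, stated_after)} for every mismatch."""
    drift = {}
    for name, (before, after) in counts.items():
        expected = before + additions[name]
        if expected != after:
            drift[name] = (expected, after)
    return drift


# An empty dict means every stated total equals before + added.
print(count_drift(corpus_counts, added))
```

The same check applies to the paper split: 27 primary refs plus 9 reviews gives the 36 papers listed.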

Rebase onto post-#33 main

PR #33 merged after this branch was cut and rewrote the Papers.md matrix (Taxonomy-linked axes, the Bioprocess control → Bioprocess & Scale-Up rename, a new Chemometrics row, and reference IDs #198/#199). Reconciliation:

Classification re-grounding (caail-classification-reviewer)

Two placements were re-verified against the papers' own methods under the stricter post-#33 taxonomy:

  • Xu et al. 2025 (#208) — comparative Raman/capacitance viability-monitoring study; multi-anchored SVM + Ensemble Learning + Chemometrics × Bioprocess & Scale-Up (SVR and PLS are both genuinely implemented and competitive, per the Roell #32 multi-anchor precedent).
  • Serpe et al. 2025 (#217) — corrected from CNN × Scaffolding to Ensemble Learning × Scaffolding: the imaging ML is a Trainable-Weka Fast Random Forest pixel classifier, not a CNN (the only CNN mention is a related-work citation). Column stays Scaffolding (the deliverable is plant-scaffold quality assessment; no bioreactor was used).
  • Separately from the two re-verified placements above, CellFM (#235) was classified to Foundation Models: Masked Language Modeling × Cellular Engineering after confirming its masked gene-expression pretraining objective.

Curation deviations from the raw issue list (carried from the #32 checkpoints)

  • g2f held out of Software.md — repo is dead and the CRAN package is archived (no live canonical home).
  • ProteomeXchange promoted from an inline PRIDE mention to a dedicated Databases.md entry, placed beside PRIDE rather than the issue's "spectral databases" section (which is reserved for spectral libraries).
  • Reviewer-flagged source corrections applied (oyster species → Crassostrea hongkongensis; dead Tabula Sapiens / scBaseCount URLs repointed; GlyGen attribution; unsupported counts/superlatives removed). The BioML-bench companion in Datasets/Benchmarks.md points at Papers.md #225.

Verification

  • pnpm --dir site lint:papers → 0 errors (matrix ↔ reference integrity OK).
  • pnpm --dir site build → succeeds (Taxonomy build guard + count-drift guards pass).
  • pnpm --dir site test → 296 passed; pnpm --dir site test:e2e → 34 passed (parser + e2e ground-truth synced to the new counts).
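
For readers unfamiliar with the "matrix ↔ reference integrity" idea, a rough sketch of that kind of check follows. The PR does not show what `lint:papers` actually does, so the function name, its inputs, and the decision to treat uncited references as errors are all assumptions made for illustration.

```python
def integrity_errors(matrix_ids, reference_ids):
    """Compare the IDs cited in matrix cells with the IDs defined as references.

    Hypothetical helper: both arguments are iterables of integer reference IDs
    that a real linter would first have to parse out of Papers.md.
    """
    matrix_ids, reference_ids = set(matrix_ids), set(reference_ids)
    return {
        # Cited in a matrix cell, but no reference entry exists.
        "cited_but_undefined": sorted(matrix_ids - reference_ids),
        # Reference entry exists, but no matrix cell points at it.
        "defined_but_uncited": sorted(reference_ids - matrix_ids),
    }


errors = integrity_errors(matrix_ids={200, 201, 236}, reference_ids={200, 201, 235})
print(errors)
```

Two empty lists would correspond to the "0 errors" result reported above.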

Every drafted entry was verified before commit by the read-only adversarial reviewer subagents (caail-citation-reviewer, caail-claim-reviewer, caail-classification-reviewer); the agent that wrote an entry never reviewed it.

Update — 4 additional new-since-#32 Zotero items

A follow-up Zotero ⇄ repo sync pass appended four resources added to the caail Zotero library after the #32 set above (Papers IDs continue from 236):

  • feat(papers): Tac, Gardner & Kuhl 2026, Generative AI creates delicious, sustainable, and nutritious burgers (arXiv 2602.03092), ref #236, anchored in Deep Learning × Sensory Prediction. The model is a multinomial-diffusion + score-based generative architecture; the classification reviewer flagged a taxonomy gap (no diffusion/score-based row exists) and placed it in the Deep Learning catch-all as the best available row. A dedicated "Diffusion & Score-Based Generative Models" row is worth considering later.

  • feat(data): CRISPR/Cas9 GGTA1-knockout bovine satellite-cell RNA-seq deposit (GEO GSE330550; D'Costa et al. 2026) → Datasets/Cow.md. Dataset-only (the paper applies CRISPR, not an AI/ML method, so no matrix entry); the thin grkenney/AGAL R package was left out.

  • feat(software): Context7 (upstash/context7, MIT) → Software.md § AI Agents & Foundation Models, framed honestly as general AI-agent developer infrastructure (not cell-ag-specific).

  • feat(databases): Pando (Foray Bioscience) → Databases.md § Ecosystem & Industry Directories, a commercial in vitro plant knowledge base (plant-side cell-ag); the "largest" claim is attributed to Foray.

  • feat(data): hyperlinked 115 bare accession IDs (GSE / PRJNA / SRA / CRA / PRJCA / PXD / CNP / GVM / OMIX / JR…) across the per-species Datasets/*.md pages to their canonical registries (GitHub + site), skipping already-linked ones.

Counts re-synced: papers 231→232, software 86→87, databases 83→84, datasets 146→147. lint:papers (0 errors), build, 296 vitest, and 34 e2e all pass. Each entry was verified by the read-only reviewer subagents before commit.

🤖 Generated with Claude Code

Add 27 primary-research references plus 9 reviews from the field-gap
analysis in issue #32. Matrix cells and reference entries updated in
lockstep against the post-#33 Taxonomy-linked matrix.

Rebased onto post-#33 main: the reference ids were renumbered from the
original 198-233 to 200-235 to avoid colliding with #33's new #198
(Thevenot 2015) and #199 (Rohart 2017). Primary refs #200-225 plus
CellFM #235; reviews #226-234. Citation text is unchanged from the
prior caail-citation-reviewer pass — only the ids shifted.
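
A renumbering like this is safest done in a single regex pass, so an id that has already been shifted is never shifted again. The sketch below assumes a uniform +2 offset (which the two ranges imply) and assumes ids appear as `#N` in the text; `shift_reference_ids` is a hypothetical helper, not the script used on this branch.

```python
import re


def shift_reference_ids(text, first=198, last=233, offset=2):
    """Shift every '#N' with first <= N <= last by `offset`; leave others alone."""

    def bump(match):
        n = int(match.group(1))
        # Ids outside the range (e.g. issue numbers such as #32) stay untouched.
        return f"#{n + offset}" if first <= n <= last else match.group(0)

    # One pass over the text: each id is rewritten at most once.
    return re.sub(r"#(\d+)\b", bump, text)


print(shift_reference_ids("Old ids #198 and #233; issue #32 is not a reference id."))
```

Chaining one replacement per id would instead risk cascading rewrites (198 → 200 → 202), which the single pass avoids.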

Matrix placements use the renamed "Bioprocess & Scale-Up" column and
the Taxonomy.md axis links. Fills two previously-empty cells:
Reinforcement Learning x Bioprocess & Scale-Up (#200-203) and CNN x
Cellular Engineering (#218, Yang L 2025).

Two placements were re-grounded against the methods via the
caail-classification-reviewer:
- Xu et al. 2025 (#208) is a comparative Raman/capacitance viability-
  monitoring study; multi-anchored SVM + Ensemble Learning +
  Chemometrics x Bioprocess & Scale-Up (SVR and PLS are both
  implemented and competitive, per the Roell #32 multi-anchor
  precedent).
- Serpe et al. 2025 (#217) uses a Fast Random Forest pixel classifier
  (Trainable Weka), not a CNN as the title's imaging framing suggests;
  placed Ensemble Learning x Scaffolding.

CellFM (#235, Zeng et al. 2025) classified to Foundation Models: Masked
Language Modeling x Cellular Engineering after confirming its masked
gene-expression pretraining objective. ProCyon (#224) carries the
24-author bioRxiv v3 with APA 21+-author truncation.
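
The truncation mentioned for ProCyon follows the APA 7 reference-list rule: with 21 or more authors, list the first 19, then an ellipsis, then the final author. A minimal sketch, assuming each author is already formatted as a string; `apa_author_list` is a hypothetical helper, not CAAIL tooling.

```python
def apa_author_list(authors):
    """Join pre-formatted author names following the APA 7 reference-list rule."""
    n = len(authors)
    if n == 1:
        return authors[0]
    if n <= 20:
        # Up to 20 authors: list all, with an ampersand before the last.
        return ", ".join(authors[:-1]) + ", & " + authors[-1]
    # 21 or more: first 19, an ellipsis, then the final author (no ampersand).
    return ", ".join(authors[:19]) + ", ... " + authors[-1]


# Placeholder names only; these are not the real ProCyon authors.
names = [f"Author{i}" for i in range(1, 25)]
print(apa_author_list(names))
```

For a 24-author paper this keeps authors 1 to 19 and author 24, dropping authors 20 to 23.
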
Add 15 single-cell / chromatin atlases, foundation-model corpora, and
bioprocess-characterization datasets from issue #32, plus the
BioML-bench benchmark companion (Papers.md ref #223):

- Sheep: hypoxia-acclimatization multi-omics atlas (PRJNA1001505) +
  new chromatin-accessibility cluster
- Duck: follicular-granulosa multi-omics map (PRJNA1254901) + new
  follicle-development cluster
- Fish: Atlantic salmon spleen single-nucleus atlas
- Crustacean: four shrimp / prawn single-cell atlases (new cluster)
- Mollusk: oyster gill + Pacific oyster PGC + scallop adductor scRNA
  (new single-cell cluster)
- HumanReference: Genecorpus-104M, Mouse-Genecorpus-20M, scPerturb
- CHOReference: pseudo-perfusion + DMFA datasets (new section)
- Benchmarks: BioML-bench (benchmark triangle for ref #223)

Every entry verified against source full text by caail-claim-reviewer;
corrected the oyster species (Crassostrea hongkongensis, not the
issue's C. virginica), removed unsupported counts and superlatives, and
cited the now-published Biotechnology and Bioengineering version of the
pseudo-perfusion characterization.

Add 16 tools from issue #32, verified against each repo's README and
canonical paper by caail-claim-reviewer:

- Media Optimization & Cell Line Engineering: ESMFold, ColabFold,
  OmegaFold, RFdiffusion, ProteinMPNN, EvoDiff, Boltz, Chai-1, IgFold,
  AbLang (protein structure prediction, generative design, inverse
  folding for growth-factor and recombinant-media-protein engineering)
- Metabolic Modeling & Strain Design: moped, mergem
- Quantitative Genetics & Multi-Omics Analysis: OmicVerse, CellRank,
  CellChat, Giotto Suite

Repointed three moved repos to their current canonical homes
(omicverse/omicverse, scverse/cellrank, giotto-suite/Giotto) and moped
to gitlab.com/qtb-hhu/moped; softened Boltz ("approach" not "reproduce"
AlphaFold3 accuracy) and other claims to exactly what the sources
support. g2f from the issue is held out — its repo is dead and the CRAN
package is archived (no live canonical home).

Add 11 databases from issue #32, verified against each canonical site
by caail-claim-reviewer:

- Sequence, Genome & Expression: RNAcentral
- Protein & Structure: OMA Browser; ProteomeXchange (promoted from an
  inline PRIDE mention to a dedicated entry, placed beside PRIDE rather
  than the issue's "spectral databases" section, which is for spectral
  libraries)
- NEW "Glycomics & Glycoprotein Databases" section: GlyGen, GlyTouCan,
  GlyConnect
- Cell Line & Single-Cell: Tabula Sapiens; scBaseCount (dual-listed
  with Datasets/HumanReference.md)
- Pathways, Metabolism & Metabolic Models: PathBank, SABIO-RK, MetaNetX
  (promoted from an inline BiGG mention)

Corrected reviewer-flagged claims: the issue's dead Tabula Sapiens /
scBaseCount URLs (repointed to live homes) and its wrong GlyGen
attribution (University of Georgia + GWU, not "Boston College");
Tabula Sapiens counts (1.1M cells / 28 organs / 24 donors); PathBank
scale (600,000+ pathways).

…primers (#32)

Add 12 OtherResources.md entries and 2 Primers/AI.md learning playlists
from issue #32, verified against canonical sources by caail-claim-reviewer:

- NEW "University centers & consortia" subsection: iCAMP (UC Davis),
  NICA (USDA/Tufts), Bezos Centre for Sustainable Protein (Imperial)
- New Harvest initiatives: CAPE
- GFI initiatives: GFI Research Grant Program
- Courses: TU Delft cellular-agriculture advanced course; UMN Cellular
  Bioprocess Technology summer course
- Curated Bibliographies: awesome-CRISPR, awesome-protein-design-software,
  awesome-lipidomics (General bioinformatics) + a new "Scientific &
  biomolecule language models" subgroup (Awesome-Biomolecule-Language-
  Cross-Modeling, Awesome-Scientific-Language-Models)
- Primers/AI.md: Stanford CS230 (Learn the fundamentals), CS224N (Go deeper)

Corrected reviewer-flagged claims: CAPE is New-Harvest-led and
multi-partner (not a two-party New Harvest / Alberta Innovates effort);
GFI funds plant-based + fermentation + cultivated, not cultivated alone;
dropped an unsupported "EMNLP 2024" venue tag; UMN ML workshop is an
optional half-day add-on.

BioML-bench (Miller et al. 2025) is reference #225 in this branch's
Papers.md; correct the Datasets/Benchmarks.md companion cross-ref
(link text and anchor) from the stale #223.

Bump the real-corpus ground-truth assertions to match the integrated
field-gap content (counts regenerated by `pnpm parse`):
papers 195->231, software 70->86, databases 72->83, datasets 130->146.
Also: refs-with-code-URL 71->75 (Cosenza 2021, ESCARGOT, PloverDB,
BioML-bench); AI primer fundamentals playlists 5->6 (vitest) and the
page-wide external playlist cards 5->7 (e2e: Stanford CS230 + CS224N);
homepage Papers-count link 195->231. Fixture-based unit counts are
unchanged.

Add reference #236 (Tac, Gardner & Kuhl 2026, arXiv 2602.03092) and its
matrix anchor in Deep Learning x Sensory Prediction. The paper builds a
multinomial-diffusion + score-based generative model that learns the
human palate from 500k+ recipes and designs burgers validated in a
blinded 101-participant sensory evaluation.

Placed via caail-classification-reviewer: diffusion/score-based models
have no dedicated row, so the Deep Learning catch-all is the
best-available home (GAN/VAE rejected — diffusion is neither); column is
Sensory Prediction (not Media Optimization, which is culture medium).
Bibliographic entry verified by caail-citation-reviewer.

Add the CRISPR/Cas9 GGTA1-knockout bovine satellite cell RNA-seq deposit
(GEO GSE330550; D'Costa et al. 2026, bioRxiv 10.64898/2026.05.20.726299)
to the 'Bovine satellite cells & cultured-meat differentiation' cluster
and the inventory table in Datasets/Cow.md. The study disrupts GGTA1
(alpha-1,3-galactosyltransferase) in immortalized bovine satellite cells
to remove the alpha-gal epitope behind Alpha-gal Syndrome, toward
AGS-compatible cultivated beef.

Dataset deposit only (the paper applies CRISPR/Cas9, not an AI/ML method,
so it gets no Papers.md matrix entry); the grkenney/AGAL analysis package
is left out as too thin to stand alone in Software.md. Claims verified
against the paper's full text by caail-claim-reviewer.

…Models)

Add Context7 (upstash/context7) to Software.md's AI Agents & Foundation
Models section: an open-source, MIT-licensed MCP server from Upstash
that injects up-to-date, version-specific library docs into LLM prompts.
Framed honestly as general developer infrastructure (not cell-ag-
specific) relevant to CAAIL's AI-agent audience. Verified by
caail-claim-reviewer.

…base

Add Pando (pando.foraybio.com) to Databases.md's Ecosystem & Industry
Directories section as a commercial in vitro plant knowledge base / AI
workspace from Foray Bioscience (plant-side cell-ag), with the intro
updated to acknowledge it as a second non-GFI industry addition. The
'largest' claim is attributed to Foray, not asserted. Routed here per
CAAIL's split (databases live in Databases.md). Verified by
caail-claim-reviewer.

Update real-corpus ground-truth to the parsed actuals after adding the
burger paper (#236), the GSE330550 bovine deposit, Context7, and Pando:
papers 231->232, software 86->87, databases 83->84, datasets 146->147,
and the homepage '231 Papers' -> '232 Papers' e2e assertion. Fixture
counts untouched.

Wrap 115 bare backtick code-span accessions across the per-species
Datasets pages in their canonical registry URLs, in the existing
[`ACC`](url) form (clickable on GitHub + the site). Covers GEO (GSE),
NCBI BioProject (PRJNA/PRJEB), SRA (SRP/SRX/SRR/SRS + legacy), GenBank
nuccore (JR), PRIDE (PXD), CNCB GSA (CRA)/BioProject (PRJCA)/GVM/OMIX,
and CNGB CNSA (CNP).

Already-linked accessions (inside study-title URLs or already wrapped)
are skipped via a negative-lookbehind guard; non-accession tokens
(pcPigMNet2025, GSB-codes, B8/CHO) are excluded by per-namespace
full-match patterns with digit-count minimums. Databases.md/Software.md
were already fully linked (no change). Counts unchanged; build, 296
vitest, and 34 e2e all pass.
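
The linking pass described above can be sketched roughly as follows. This is not the script used for the commit: the three namespaces shown, their digit-count minimums, and the URL templates are illustrative assumptions, and `link_accessions` is a hypothetical name.

```python
import re

# Assumed registry table: (full-match pattern, URL template). The digit
# minimums and URL forms are illustrative; the PR does not list them.
REGISTRIES = [
    (re.compile(r"GSE\d{3,}"), "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={acc}"),
    (re.compile(r"PRJ(?:NA|EB)\d{4,}"), "https://www.ncbi.nlm.nih.gov/bioproject/{acc}"),
    (re.compile(r"PXD\d{6}"), "https://www.ebi.ac.uk/pride/archive/projects/{acc}"),
]

# A backtick code span NOT preceded by "[" (the negative-lookbehind guard),
# i.e. not already the text of a Markdown link.
CODE_SPAN = re.compile(r"(?<!\[)`([A-Za-z0-9_./-]+)`")


def link_accessions(markdown):
    """Wrap bare `ACC` code spans as [`ACC`](url); leave everything else as-is."""

    def wrap(match):
        token = match.group(1)
        for pattern, url in REGISTRIES:
            # fullmatch rejects tokens that merely contain an accession-like run.
            if pattern.fullmatch(token):
                return f"[`{token}`]({url.format(acc=token)})"
        return match.group(0)

    return CODE_SPAN.sub(wrap, markdown)


print(link_accessions("Deposit `GSE330550`; tokens like `pcPigMNet2025` stay bare."))
```

Because an already-wrapped accession is preceded by `[`, running the pass a second time changes nothing, which is what makes it safe to re-run after later additions.
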
Rename the 'Tools' stat label to 'Software' in the homepage hero
(Hero.astro) and the By the Numbers dashboard (MetricsDashboard.astro)
so the count is labelled consistently with the Software section, nav,
and page.

'87 Software' read awkwardly; label the count 'Software tools' on the
homepage hero and the By the Numbers dashboard so it reads '87 Software
tools' (consistent with the two-word 'Research areas' stat).

@benjibromberg benjibromberg merged commit efdd2f8 into main Jun 12, 2026
1 check passed
@benjibromberg benjibromberg deleted the worktree-feat+issue-32-field-gap-additions branch June 12, 2026 16:10

Development

Successfully merging this pull request may close these issues.

CAAIL field-gap analysis: 92 candidate additions (papers, datasets, software, databases)