get_md_format_spec(peptide): document dual-file requirement, Unique column, cross-table ProteinGroupId rule #4
Merged
infusini merged 2 commits on Jun 3, 2026
…olumn, and cross-table ProteinGroupId rule
The peptide md_format path has three converter-enforced requirements that the
spec tool did not surface, so an agent trusting get_md_format_spec("peptide")
produces files that fail ingestion. The API masks the reader.py error as an
indefinite "processing" status, so the failure is invisible without server logs.
The three gaps (each reproduced as a real upload failure):
1. peptide is a DUAL-FILE upload (peptide file + companion protein file in
filenames=); a peptide file alone fails with "Protein data file not found"
(md_format/reader.py:47).
2. the REQUIRED Unique column (boolean) was missing from the spec.
3. ProteinGroupId/ProteinGroup must use an identical mapping across the two
files; the old peptide conversion template factorized them independently,
actively generating the mismatch (observed 99.9% mismatch -> silent fail).
Changes:
- _MD_FORMAT_PEPTIDE_SPEC: add Unique (required) + OtherProteinGroupIds,
ProteinNames, Description (optional); strengthen ProteinGroup/ProteinGroupId
text with the cross-table rule. Matches https://help.massdynamics.com/home/md-format
- get_md_format_spec peptide notes: add dual-file / Unique / id-consistency notes.
- new _GENERIC_PEPTIDE_TEMPLATE: emits Unique and derives ProteinGroupId from
the protein companion's map (replaces the independent-factorize template).
- create_upload docstring: add the md_format peptide dual-file subsection.
- test_format.py: assert Unique, dual-file notes, and the corrected template.
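A minimal sketch of the id-derivation idea behind the corrected template, not the actual _GENERIC_PEPTIDE_TEMPLATE code. The helper names build_protein_group_ids and peptide_group_ids are hypothetical, and the 1-based, first-seen id numbering is an assumption; the point is only that peptide ids are looked up in the protein companion's map rather than factorized independently.

```python
from typing import Dict, Iterable, List


def build_protein_group_ids(protein_groups: Iterable[str]) -> Dict[str, int]:
    """Assign ids to the protein file's ProteinGroup values in first-seen order.

    The 1-based numbering is an assumption made for this sketch.
    """
    mapping: Dict[str, int] = {}
    for group in protein_groups:
        if group not in mapping:
            mapping[group] = len(mapping) + 1
    return mapping


def peptide_group_ids(peptide_groups: Iterable[str], protein_map: Dict[str, int]) -> List[int]:
    """Look up each peptide row's id in the protein map.

    Groups that appear only in the peptide file get fresh ids above the
    protein maximum, so they can never collide with a protein-file id.
    """
    mapping = dict(protein_map)  # copy so the caller's protein map is not mutated
    next_id = max(mapping.values(), default=0) + 1
    ids: List[int] = []
    for group in peptide_groups:
        if group not in mapping:
            mapping[group] = next_id
            next_id += 1
        ids.append(mapping[group])
    return ids


protein_map = build_protein_group_ids(["P12345", "Q9Y6K9", "P12345;O00161"])
print(peptide_group_ids(["Q9Y6K9", "P12345", "A0A024R161", "Q9Y6K9"], protein_map))
# prints [2, 1, 4, 2]
```

Factorizing the peptide column on its own would instead number Q9Y6K9 first, which is exactly the mismatch described in gap 3.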
Follow-up (not in this PR): extend validate_upload_inputs (or add a
validate_md_format_files tool) to assert Unique present + ProteinGroupId mapping
identical + SampleName sets equal across the peptide/protein files, so the class
of error is caught locally before upload.
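A rough sketch of what such a local check could look like, offered as an illustration rather than the proposed tool's implementation. The function names scan_table and check_peptide_protein_pair are hypothetical, and the tab delimiter is an assumption; the column names come from the spec text above. Each file is streamed once, so memory grows with the number of distinct groups and samples, not with file size.

```python
import csv
import io
from typing import Dict, List, Optional, Set, TextIO, Tuple


def scan_table(handle: TextIO, delimiter: str = "\t") -> Tuple[List[str], Dict[Optional[str], Optional[str]], Set[Optional[str]]]:
    """Stream one table; return (header, ProteinGroup -> ProteinGroupId, SampleName set)."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    group_to_id: Dict[Optional[str], Optional[str]] = {}
    samples: Set[Optional[str]] = set()
    for row in reader:
        # Keep the first id seen for each group; .get tolerates missing columns.
        group_to_id.setdefault(row.get("ProteinGroup"), row.get("ProteinGroupId"))
        samples.add(row.get("SampleName"))
    return list(reader.fieldnames or []), group_to_id, samples


def check_peptide_protein_pair(peptide: TextIO, protein: TextIO) -> List[str]:
    """Return a list of human-readable problems; an empty list means the pair looks consistent."""
    problems: List[str] = []
    pep_header, pep_map, pep_samples = scan_table(peptide)
    _, prot_map, prot_samples = scan_table(protein)
    if "Unique" not in pep_header:
        problems.append("peptide file has no Unique column")
    # Only groups present in both files can disagree; peptide-only groups are allowed.
    shared = pep_map.keys() & prot_map.keys()
    mismatched = [g for g in shared if pep_map[g] != prot_map[g]]
    if mismatched:
        problems.append(f"{len(mismatched)} ProteinGroup value(s) map to a different ProteinGroupId across files")
    if pep_samples != prot_samples:
        problems.append("SampleName sets differ between the peptide and protein files")
    return problems


protein_text = "ProteinGroup\tProteinGroupId\tSampleName\nP12345\t1\tS1\nQ9Y6K9\t2\tS1\n"
peptide_text = "ProteinGroup\tProteinGroupId\tSampleName\tModifiedSequence\nQ9Y6K9\t1\tS1\tPEPTIDEK\nP12345\t2\tS1\tAAAK\n"
for problem in check_peptide_protein_pair(io.StringIO(peptide_text), io.StringIO(protein_text)):
    print(problem)
```

On this toy pair the check reports the missing Unique column and the two swapped ids, while the SampleName sets agree.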
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ids guard
PTM/peptide uploads map sites onto UniProt protein SEQUENCES. ProteinGroup
populated with Ensembl ids (ENSP/ENSG) or gene symbols (common in source data,
e.g. CPTAC umich) resolves to 0 sequences and the upload fails SILENTLY (sits
in "processing", produces no dataset, surfaces no error). This was the final
phospho upload failure and was only diagnosable from the server reader.py log.
Spec/docs:
- _MD_FORMAT_PROTEIN_SPEC and _MD_FORMAT_PEPTIDE_SPEC: ProteinGroup MUST be
UniProt accession(s), NOT Ensembl/gene ids, with the silent-failure symptom
and the Ensembl->UniProt remediation.
- peptide notes: add UniProt requirement, the verify-peptide-in-sequence step,
and the ';'-joined-peptide-form caveat.
New programmatic guard (the "never again"):
- validate_md_format_ids(file_path): reads header + a sample of rows only
(safe on multi-GB files, respects the entity-data boundary), and WARNs when
ProteinGroup looks like Ensembl ids or is mostly non-UniProt. Registered in
files/__init__ and the TOOL CATEGORIES prose; verified to WARN on the real
ENSP file that failed and pass the UniProt-fixed file.
- tests: pass/warn/error cases for validate_md_format_ids.
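The bounded-read idea can be sketched as follows. This is an illustration under stated assumptions, not the code of validate_md_format_ids: the function name classify_protein_groups, the tab delimiter, the ';' separator between accessions, the 1000-row sample, and the 50% threshold are all hypothetical. The UniProt pattern follows the accession format UniProt publishes, with an optional isoform suffix added.

```python
import csv
import io
import re
from itertools import islice
from typing import TextIO

# UniProt accession format (6 or 10 characters), plus an optional "-N" isoform suffix.
UNIPROT = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})(?:-\d+)?$"
)
# Ensembl stable ids such as ENSP00000354587 or ENSG00000141510, optionally versioned.
ENSEMBL = re.compile(r"^ENS[A-Z]*[GPT]\d{11}(?:\.\d+)?$")


def classify_protein_groups(handle: TextIO, sample_rows: int = 1000, delimiter: str = "\t") -> str:
    """Read the header and at most sample_rows rows, then report PASS, WARN, or ERROR."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    if "ProteinGroup" not in (reader.fieldnames or []):
        return "ERROR: no ProteinGroup column"
    total = uniprot = ensembl = 0
    for row in islice(reader, sample_rows):  # never reads past the sample
        for accession in (row["ProteinGroup"] or "").split(";"):
            accession = accession.strip()
            if not accession:
                continue
            total += 1
            uniprot += bool(UNIPROT.match(accession))
            ensembl += bool(ENSEMBL.match(accession))
    if total == 0:
        return "ERROR: no ProteinGroup values in the sampled rows"
    if ensembl:
        return "WARN: ProteinGroup contains Ensembl-style ids"
    if uniprot / total < 0.5:
        return "WARN: ProteinGroup is mostly non-UniProt"
    return "PASS"


print(classify_protein_groups(io.StringIO("ProteinGroup\tSampleName\nENSP00000354587\tS1\n")))
print(classify_protein_groups(io.StringIO("ProteinGroup\tSampleName\nP12345;A0A024R161\tS1\n")))
```

Because only the header and a capped number of rows are consumed, the same check costs about the same on a multi-GB file as on a small one.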
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Author
Update: added the ROOT-CAUSE fix (UniProt ProteinGroup) + a programmatic guard
After this branch's original dual-file/Unique/id-consistency fixes, a real CPTAC phospho upload still failed, and the server […]
Added in the latest commit: […]
This makes the failure catchable before upload rather than as a multi-hour silent hang.
infusini added a commit that referenced this pull request on Jun 4, 2026
…od/PD note
Layers on top of #4 (which added the peptide Unique spec, cross-table ProteinGroupId rule, and dual-file docs). Net-new here:
- create_upload / create_upload_from_csv: _check_md_format_composition rejects a peptide-only md_format upload via a bounded header read (the code-level guard #4 deferred). Plus the md_format ID-SHAPE PREFLIGHT in the docstring and md_format_metabolite added to the from_csv source list.
- get_md_format_spec(peptide) notes: ModifiedSequence must be inline UniMod, not a tool's native annotation; Proteome Discoverer conversion (UniMod map + Nx multiplier) and the unlocalised-mod disclaimer (ask: drop vs assign-to-first).
- plan_wide_to_md_format header: add md_format_metabolite.
- _workflow_guide: bounded-exception carve-out to the never-read rule, composition rule, Workflow A preflight, and Workflow D PD-conversion guidance.
- Tests for the composition guard and the inline-UniMod/unlocalised peptide notes.
No {p} localization-probability handling (reverted earlier; the md_format extractor is not {p}-aware, and that needs an md-converter fix first).
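For illustration, a composition guard based on a bounded header read might look like the sketch below. It is not the code of _check_md_format_composition: the names header_kind and check_composition are hypothetical, the tab delimiter is an assumption, and treating a ModifiedSequence column as the marker of a peptide file is also an assumption made for this example.

```python
import csv
import io
from typing import Dict, TextIO


def header_kind(handle: TextIO, delimiter: str = "\t") -> str:
    """Classify a table from its first line only; the data rows are never read."""
    header = next(csv.reader([handle.readline()], delimiter=delimiter), [])
    # Assumption: only peptide files carry a ModifiedSequence column.
    return "peptide" if "ModifiedSequence" in header else "protein"


def check_composition(files: Dict[str, TextIO]) -> None:
    """Raise ValueError when a peptide file is supplied without a protein companion."""
    kinds = {name: header_kind(handle) for name, handle in files.items()}
    if "peptide" in kinds.values() and "protein" not in kinds.values():
        raise ValueError("md_format peptide upload needs a companion protein file in filenames=")


peptide = "ProteinGroup\tProteinGroupId\tModifiedSequence\tUnique\n"
protein = "ProteinGroup\tProteinGroupId\tSampleName\n"

check_composition({"pep.tsv": io.StringIO(peptide), "prot.tsv": io.StringIO(protein)})  # accepted
try:
    check_composition({"pep.tsv": io.StringIO(peptide)})
except ValueError as exc:
    print(f"rejected: {exc}")
```

Rejecting the upload locally turns the server-side "Protein data file not found" failure into an immediate, visible error.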
Problem
get_md_format_spec("peptide") under-specifies the peptide/dual-file contract relative to what the md-converter enforces and the public spec (https://help.massdynamics.com/home/md-format). An agent that trusts the tool output produces a peptide file that fails ingestion, and the API surfaces the reader.py error as an indefinite "processing" status, so the failure is invisible without server logs.
Reproduced as three sequential real upload failures (a CPTAC phospho upload):
1. Dual-file upload: a peptide upload needs the peptide file plus a companion protein file (both in filenames=). Peptide alone → Protein data file not found (md_format/reader.py:47). The spec said nothing about this.
2. Unique: a REQUIRED boolean column (TRUE if the peptide is unique to its protein group), absent from _MD_FORMAT_PEPTIDE_SPEC.
3. ProteinGroupId inconsistency: the old peptide conversion template did pd.factorize(ProteinGroup) on the peptide file independently of the protein file, producing mismatched ids (observed 99.9% mismatch) → silent ingestion failure. The spec requires an identical ProteinGroup → ProteinGroupId mapping across both files.
Changes (src/mcp_tools/files/md_format.py, uploads/create.py)
- _MD_FORMAT_PEPTIDE_SPEC: add Unique (required) + OtherProteinGroupIds / ProteinNames / Description (optional); strengthen ProteinGroup / ProteinGroupId descriptions with the cross-table rule.
- notes: add dual-file, Unique-required, and id-consistency notes (previously the protein/peptide branch was shared and silent on all three).
- _GENERIC_PEPTIDE_TEMPLATE: emits Unique and derives ProteinGroupId from the protein companion's map (peptide-only groups get fresh ids above the protein max); replaces the independent-factorize template that generated bug 3 above.
- create_upload docstring: add the md_format peptide dual-file subsection.
- test_format.py: assert Unique, the dual-file notes, and the corrected template.
Tests
847 passed (846 existing + 1 new). No existing assertions changed behaviour; the existing test_peptide_spec subset-check still holds and is now strengthened.
Suggested follow-up (not in this PR)
Extend validate_upload_inputs (or add a validate_md_format_files tool) to assert, given the peptide + protein files: Unique present, ProteinGroupId ↔ ProteinGroup mapping identical across both, and SampleName sets equal, so this class of error is caught locally before upload rather than as a masked "processing" hang. I scoped it out of this PR because it changes the tool surface and warrants its own review.
A companion docs PR is open on md-skills-public (md-mcp-ops uploads.md / workflows.md): MassDynamics/md-skills-public#1.
🤖 Generated with Claude Code