get_md_format_spec(peptide): document dual-file requirement, Unique column, cross-table ProteinGroupId rule #4

Merged
infusini merged 2 commits into MassDynamics:feature/mcp-server from webwebb56:fix/peptide-md-format-spec
Jun 3, 2026

@webwebb56

Problem

get_md_format_spec("peptide") under-specifies the peptide/dual-file contract relative to what the md-converter enforces and the public spec (https://help.massdynamics.com/home/md-format). An agent that trusts the tool output produces a peptide file that fails ingestion — and the API surfaces the reader.py error as an indefinite "processing" status, so the failure is invisible without server logs.

Reproduced as three sequential real upload failures (a CPTAC phospho upload):

  1. Dual-file — a peptide upload needs a companion protein-level md_format file (both in filenames=). Peptide alone → Protein data file not found (md_format/reader.py:47). The spec said nothing about this.
  2. Missing Unique — a REQUIRED boolean column (TRUE if the peptide is unique to its protein group), absent from _MD_FORMAT_PEPTIDE_SPEC.
  3. ProteinGroupId inconsistency — the old peptide conversion template did pd.factorize(ProteinGroup) on the peptide file independently of the protein file, producing mismatched ids (observed 99.9% mismatch) → silent ingestion failure. The spec requires an identical ProteinGroup→ProteinGroupId mapping across both files.
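
To make failure 3 concrete, here is a minimal sketch of how independent factorization goes wrong. The toy accessions and row order are invented for illustration; only pd.factorize and the ProteinGroup/ProteinGroupId column names come from the description above.

```python
import pandas as pd

# Toy data (illustrative): the peptide file lists protein groups in a
# different order than the protein file.
protein = pd.DataFrame({"ProteinGroup": ["P12345", "Q67890", "O11111"]})
peptide = pd.DataFrame({"ProteinGroup": ["Q67890", "Q67890", "P12345", "O11111"]})

# Independent factorization: ids follow first-appearance order in each file.
protein["ProteinGroupId"] = pd.factorize(protein["ProteinGroup"])[0]
peptide["ProteinGroupId"] = pd.factorize(peptide["ProteinGroup"])[0]

protein_map = dict(zip(protein["ProteinGroup"], protein["ProteinGroupId"]))
peptide_map = dict(zip(peptide["ProteinGroup"], peptide["ProteinGroupId"]))

# Groups whose id differs between the two files.
mismatched = {g for g in peptide_map if peptide_map[g] != protein_map.get(g)}
print(sorted(mismatched))  # ['P12345', 'Q67890']
```

Because the ids depend only on row order, any reordering or filtering of the peptide file relative to the protein file shifts them, which is consistent with the near-total mismatch observed.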

Changes (src/mcp_tools/files/md_format.py, uploads/create.py)

  • _MD_FORMAT_PEPTIDE_SPEC: add Unique (required) + OtherProteinGroupIds/ProteinNames/Description (optional); strengthen ProteinGroup/ProteinGroupId descriptions with the cross-table rule.
  • peptide notes: add dual-file, Unique-required, and id-consistency notes (previously the protein/peptide branch was shared and silent on all three).
  • new _GENERIC_PEPTIDE_TEMPLATE: emits Unique and derives ProteinGroupId from the protein companion's map (peptide-only groups get fresh ids above the protein max). Replaces the independent-factorize template that caused failure 3 above.
  • create_upload docstring: add the md_format peptide dual-file subsection.
  • test_format.py: assert Unique, the dual-file notes, and the corrected template.
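
The id-derivation rule in the new template can be sketched roughly as follows. This is not the template's actual code: the function name assign_shared_group_ids and the toy frames are hypothetical, and the sketch assumes the protein file already carries integer ProteinGroupId values.

```python
import pandas as pd


def assign_shared_group_ids(protein: pd.DataFrame, peptide: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `peptide` whose ProteinGroupId reuses the protein file's map."""
    # Start from the protein companion's ProteinGroup -> ProteinGroupId map.
    group_to_id = dict(zip(protein["ProteinGroup"], protein["ProteinGroupId"]))
    # Peptide-only groups get fresh ids above the protein maximum.
    next_id = max(group_to_id.values(), default=0) + 1
    for group in peptide["ProteinGroup"].unique():
        if group not in group_to_id:
            group_to_id[group] = next_id
            next_id += 1
    out = peptide.copy()
    out["ProteinGroupId"] = out["ProteinGroup"].map(group_to_id)
    return out


# Illustrative data: one group appears only in the peptide file.
protein = pd.DataFrame({"ProteinGroup": ["P12345", "Q67890"], "ProteinGroupId": [1, 2]})
peptide = pd.DataFrame({"ProteinGroup": ["Q67890", "P12345", "A0A024R161"]})
fixed = assign_shared_group_ids(protein, peptide)
print(fixed["ProteinGroupId"].tolist())  # [2, 1, 3]
```

Building the peptide ids from the protein map, rather than from the peptide file's own row order, is what keeps the two files consistent by construction.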

Tests

847 passed (846 existing + 1 new). No existing assertion was changed; the existing test_peptide_spec subset check still holds and is now stricter.

Suggested follow-up (not in this PR)

Extend validate_upload_inputs (or add a validate_md_format_files tool) to assert, given the peptide + protein files: Unique present, ProteinGroupId↔ProteinGroup mapping identical across both, and SampleName sets equal — so this class of error is caught locally before upload rather than as a masked "processing" hang. I scoped it out of this PR because it changes the tool surface and warrants its own review.
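
As a rough illustration of what such a check could look like (not a proposed implementation), the sketch below runs the three assertions on rows already loaded as dicts, for example from csv.DictReader. The function name check_md_format_pair is hypothetical, and it compares ids only for groups present in both files, on the assumption that peptide-only groups are allowed.

```python
def check_md_format_pair(protein_rows, peptide_rows):
    """Return a list of problems found across a protein/peptide file pair."""
    problems = []

    # 1. Unique must be present on the peptide side.
    if any("Unique" not in row for row in peptide_rows):
        problems.append("peptide file is missing the required Unique column")

    # 2. Groups present in both files must carry the same id(s).
    def group_ids(rows):
        mapping = {}
        for row in rows:
            mapping.setdefault(row["ProteinGroup"], set()).add(row["ProteinGroupId"])
        return mapping

    protein_ids, peptide_ids = group_ids(protein_rows), group_ids(peptide_rows)
    shared = protein_ids.keys() & peptide_ids.keys()
    conflicts = sorted(g for g in shared if protein_ids[g] != peptide_ids[g])
    if conflicts:
        problems.append(f"ProteinGroupId differs between files for: {conflicts}")

    # 3. Both files must describe the same samples.
    if {r["SampleName"] for r in protein_rows} != {r["SampleName"] for r in peptide_rows}:
        problems.append("SampleName sets differ between the two files")

    return problems


# Illustrative rows: the peptide side lacks Unique and uses a different id.
protein_rows = [{"ProteinGroup": "P12345", "ProteinGroupId": "1", "SampleName": "S1"}]
peptide_rows = [{"ProteinGroup": "P12345", "ProteinGroupId": "7", "SampleName": "S1"}]
for problem in check_md_format_pair(protein_rows, peptide_rows):
    print(problem)
```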

A companion docs PR is open on md-skills-public (md-mcp-ops uploads.md / workflows.md): MassDynamics/md-skills-public#1.

🤖 Generated with Claude Code

Andrew Webb and others added 2 commits June 1, 2026 16:46
…olumn, and cross-table ProteinGroupId rule

The peptide md_format path has three converter-enforced requirements that the
spec tool did not surface, so an agent trusting get_md_format_spec("peptide")
produces files that fail ingestion. The API masks the reader.py error as an
indefinite "processing" status, so the failure is invisible without server logs.

The three gaps (each reproduced as a real upload failure):
  1. peptide is a DUAL-FILE upload (peptide file + companion protein file in
     filenames=); a peptide file alone fails with "Protein data file not found"
     (md_format/reader.py:47).
  2. the REQUIRED Unique column (boolean) was missing from the spec.
  3. ProteinGroupId/ProteinGroup must use an identical mapping across the two
     files; the old peptide conversion template factorized them independently,
     actively generating the mismatch (observed 99.9% mismatch -> silent fail).

Changes:
  - _MD_FORMAT_PEPTIDE_SPEC: add Unique (required) + OtherProteinGroupIds,
    ProteinNames, Description (optional); strengthen ProteinGroup/ProteinGroupId
    text with the cross-table rule. Matches https://help.massdynamics.com/home/md-format
  - get_md_format_spec peptide notes: add dual-file / Unique / id-consistency notes.
  - new _GENERIC_PEPTIDE_TEMPLATE: emits Unique and derives ProteinGroupId from
    the protein companion's map (replaces the independent-factorize template).
  - create_upload docstring: add the md_format peptide dual-file subsection.
  - test_format.py: assert Unique, dual-file notes, and the corrected template.

Follow-up (not in this PR): extend validate_upload_inputs (or add a
validate_md_format_files tool) to assert Unique present + ProteinGroupId mapping
identical + SampleName sets equal across the peptide/protein files, so the class
of error is caught locally before upload.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ids guard

PTM/peptide uploads map sites onto UniProt protein SEQUENCES. ProteinGroup
populated with Ensembl ids (ENSP/ENSG) or gene symbols (common in source data,
e.g. CPTAC umich) resolves to 0 sequences and the upload fails SILENTLY (sits
in "processing", produces no dataset, surfaces no error). This was the final
phospho upload failure and was only diagnosable from the server reader.py log.

Spec/docs:
  - _MD_FORMAT_PROTEIN_SPEC and _MD_FORMAT_PEPTIDE_SPEC: ProteinGroup MUST be
    UniProt accession(s), NOT Ensembl/gene ids, with the silent-failure symptom
    and the Ensembl->UniProt remediation.
  - peptide notes: add UniProt requirement, the verify-peptide-in-sequence step,
    and the ';'-joined-peptide-form caveat.

New programmatic guard (the "never again"):
  - validate_md_format_ids(file_path): reads header + a sample of rows only
    (safe on multi-GB files, respects the entity-data boundary), and WARNs when
    ProteinGroup looks like Ensembl ids or is mostly non-UniProt. Registered in
    files/__init__ and the TOOL CATEGORIES prose; verified to WARN on the real
    ENSP file that failed and pass the UniProt-fixed file.
  - tests: pass/warn/error cases for validate_md_format_ids.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@webwebb56

Author

Update: added the ROOT-CAUSE fix (UniProt ProteinGroup) + a programmatic guard

After this branch's original dual-file/Unique/id-consistency fixes, a real CPTAC phospho upload still failed — and the server reader.py log revealed why: ProteinGroup held Ensembl ids (ENSP), not UniProt accessions. PTM sites are mapped onto UniProt protein sequences, so non-UniProt ids match 0 sequences and the upload fails silently (stuck "processing", no dataset, no surfaced error). None of the prior format fixes could catch this.

Added in the latest commit:

  • Spec/docs: ProteinGroup MUST be UniProt accession(s) (protein + peptide specs), with the silent-failure symptom and the Ensembl→UniProt remediation; peptide notes now include a verify-peptide-falls-within-UniProt-sequence step and the ;-joined-peptide caveat.
  • New tool validate_md_format_ids(file_path) — the programmatic guard: reads header + a row sample only (safe on multi-GB files, respects the entity-data boundary), WARNs when ProteinGroup looks like Ensembl/gene ids. Verified to WARN on the real ENSP file that failed and pass the UniProt-fixed file. Registered + documented + tested.
  • Tests: 850 passed (was 847).

This makes the failure catchable before upload rather than as a multi-hour silent hang.
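
For readers who want a feel for the guard, here is a minimal sketch of a header-plus-sample check in the same spirit. It is not the code of validate_md_format_ids: the function name, the tab delimiter, the ';' separator for multi-accession groups, the 50% threshold, and the Ensembl pattern are all assumptions; the UniProt pattern follows the accession format published by UniProt, with an optional isoform suffix added.

```python
import csv
import io
import re
from itertools import islice

# UniProt accession format, plus an optional isoform suffix such as "-2".
UNIPROT_RE = re.compile(
    r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})(?:-[0-9]+)?"
)
# Assumed shape of Ensembl gene/protein/transcript ids, e.g. ENSP00000354587.
ENSEMBL_RE = re.compile(r"ENS[A-Z]*[GPT][0-9]+(?:\.[0-9]+)?")


def classify_protein_groups(handle, delimiter="\t", sample_rows=1000, min_uniprot_fraction=0.5):
    """Inspect the header and at most `sample_rows` rows of an open text file."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, None)
    if not header or "ProteinGroup" not in header:
        return "ERROR: no ProteinGroup column in header"
    col = header.index("ProteinGroup")

    ids = []
    for row in islice(reader, sample_rows):  # bounded read, never the whole file
        if col < len(row):
            ids.extend(part.strip() for part in row[col].split(";") if part.strip())
    if not ids:
        return "ERROR: no ProteinGroup values in the sampled rows"

    if any(ENSEMBL_RE.fullmatch(i) for i in ids):
        return "WARN: ProteinGroup looks like Ensembl ids"
    uniprot_fraction = sum(1 for i in ids if UNIPROT_RE.fullmatch(i)) / len(ids)
    if uniprot_fraction < min_uniprot_fraction:
        return "WARN: ProteinGroup is mostly non-UniProt"
    return "OK"


# Illustrative inputs: in-memory stand-ins for a failing and a fixed file.
ensembl_file = io.StringIO("ProteinGroup\tSampleName\nENSP00000354587\tS1\n")
uniprot_file = io.StringIO("ProteinGroup\tSampleName\nP04637;Q9Y6K9\tS1\n")
ensembl_status = classify_protein_groups(ensembl_file)
uniprot_status = classify_protein_groups(uniprot_file)
print(ensembl_status)
print(uniprot_status)
```

Reading through islice keeps the cost independent of file size, which is the property that makes a check like this safe to run on multi-GB uploads.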

@infusini infusini merged commit 19e1618 into MassDynamics:feature/mcp-server Jun 3, 2026
infusini added a commit that referenced this pull request Jun 4, 2026
…od/PD note

Layers on top of #4 (which added the peptide Unique spec, cross-table ProteinGroupId
rule, and dual-file docs). Net-new here:

- create_upload / create_upload_from_csv: _check_md_format_composition rejects a
  peptide-only md_format upload via a bounded header read (the code-level guard #4
  deferred). Plus the md_format ID-SHAPE PREFLIGHT in the docstring and
  md_format_metabolite added to the from_csv source list.
- get_md_format_spec(peptide) notes: ModifiedSequence must be inline UniMod, not a
  tool's native annotation; Proteome Discoverer conversion (UniMod map + Nx
  multiplier) and the unlocalised-mod disclaimer (ask: drop vs assign-to-first).
- plan_wide_to_md_format header: add md_format_metabolite.
- _workflow_guide: bounded-exception carve-out to the never-read rule, composition
  rule, Workflow A preflight, and Workflow D PD-conversion guidance.
- Tests for the composition guard and the inline-UniMod/unlocalised peptide notes.

No {p} localization-probability handling (reverted earlier — md_format extractor is
not {p}-aware; that needs an md-converter fix first).
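
A composition guard of the kind described for _check_md_format_composition might look roughly like the sketch below. It is an illustration only: the function names are hypothetical, and treating the presence of a ModifiedSequence column as the marker of a peptide file is an assumption rather than something the commit message states.

```python
import csv
import io

PEPTIDE_MARKER = "ModifiedSequence"  # assumption: only peptide files carry this column


def read_header(handle, delimiter="\t"):
    """Bounded read: return only the first row of an open text file."""
    return next(csv.reader(handle, delimiter=delimiter), [])


def check_composition(headers):
    """`headers` maps filename -> column names; reject a peptide-only upload."""
    peptide_files = [name for name, cols in headers.items() if PEPTIDE_MARKER in cols]
    protein_files = [name for name in headers if name not in peptide_files]
    if peptide_files and not protein_files:
        raise ValueError(
            f"peptide file(s) {peptide_files} need a companion protein-level md_format file"
        )


# Illustrative upload: peptide and protein headers read from in-memory files.
peptide_header = read_header(io.StringIO("ProteinGroup\tModifiedSequence\tUnique\n"))
protein_header = read_header(io.StringIO("ProteinGroup\tProteinGroupId\tSampleName\n"))
check_composition({"peptides.tsv": peptide_header, "proteins.tsv": protein_header})  # passes
```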

2 participants