Mismatch between JSON sidecar formats accepted by MetaCat and those required by declad dropbox ingestion #68

@DouglasLeeTucker

Summary

During a large‑scale MetaCat stress test (100k files) performed on dtucker@fifeutilgpvm03.fnal.gov, I generated JSON sidecar metadata files that MetaCat accepted without issue. However, when attempting to ingest the same files via declad’s dropbox mechanism on hypotpro@fermicloud848.fnal.gov, declad rejected them.

To proceed, I had to rewrite all JSON sidecars into a different structure — one that declad accepts but MetaCat does not require when ingesting directly.

This indicates a mismatch between:

  • the JSON structure MetaCat accepts directly, and
  • the JSON structure declad requires before it will ingest and forward metadata to MetaCat.

This mismatch caused ingestion failures and required a script to rewrite the JSON metadata files.

Environment

MetaCat stress test environment

  • Host: dtucker@fifeutilgpvm03.fnal.gov
  • MetaCat version: 4.1.?
  • JSON sidecars accepted directly by MetaCat

declad ingestion environment

  • Host: hypotpro@fermicloud848.fnal.gov
  • declad version: 2.3.8
  • declad dropbox ingestion using /home/hypotpro/declad_848/declad_config.yaml
  • declad rejected the original JSON sidecars

Example of the Mismatch
1. JSON sidecar that MetaCat accepted directly
This file (/home/dtucker/WORK/GitHub/MetacatStressTest/python/synthetic_minimal_n100000/data_ffff9d76-0e95-4062-81a5-edd7d0279791.parquet.json.orig) is representative of the structure used during the MetaCat stress test:

{
  "dh.type": "other",
  "fn.configuration": "c20240226",
  "fn.description": "metacat_stress_test_20260216_5",
  "fn.format": "txt",
  "fn.owner": "dtucker",
  "fn.tier": "etc",
  "rs.runs": [
    1000002
  ]
}

MetaCat accepted this structure without requiring additional top‑level fields.

2. Equivalent JSON sidecar required by declad
To make declad ingest the same file, I had to rewrite the JSON into the following structure (see /home/dtucker/WORK/GitHub/MetacatStressTest/python/synthetic_minimal_n100000/data_ffff9d76-0e95-4062-81a5-edd7d0279791.parquet.json):

{
  "name": "data_ffff9d76-0e95-4062-81a5-edd7d0279791.parquet",
  "namespace": "hypotpro",
  "size": 6967,
  "checksums": {
    "adler32": "b9a2d1fc"
  },
  "metadata": {
    "dh.type": "other",
    "fn.configuration": "c20240226",
    "fn.description": "metacat_stress_test_20260216_5",
    "fn.format": "txt",
    "fn.owner": "hypotraw",
    "fn.tier": "etc",
    "rs.runs": [
      1000002
    ]
  }
}

Key differences required by declad:

  • Mandatory top‑level fields:
    -- name
    -- namespace
    -- size
    -- checksums
  • All metadata must be nested under "metadata"
  • Dot‑notation keys must be preserved inside "metadata"
  • rs.runs must be a list, not a scalar
  • Missing or differently‑named fields cause declad to reject the file silently or stall
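
For illustration, the rewrite from the flat structure to the nested one can be sketched in Python as below. This is not the script actually used for the stress test; the helper names adler32_hex and wrap_sidecar are hypothetical. It assumes the "adler32" value is the Adler-32 checksum of the data file's contents written as 8 lowercase hex digits (inferred from the example value "b9a2d1fc"), and that the namespace is supplied by the caller. It copies the metadata values unchanged, so any per-field changes (such as the different fn.owner in the example above) would have to be applied separately.

```python
import json
import os
import zlib


def adler32_hex(path, chunk_size=1 << 20):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits.

    Assumption: this hex format matches what declad expects.
    """
    checksum = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            checksum = zlib.adler32(chunk, checksum)
    return f"{checksum & 0xFFFFFFFF:08x}"


def wrap_sidecar(flat_metadata, data_path, namespace):
    """Nest flat, dot-notation metadata under "metadata" and add the
    top-level fields listed above (name, namespace, size, checksums)."""
    return {
        "name": os.path.basename(data_path),
        "namespace": namespace,
        "size": os.path.getsize(data_path),
        "checksums": {"adler32": adler32_hex(data_path)},
        "metadata": dict(flat_metadata),  # dot-notation keys kept as-is
    }


def rewrite_sidecar_file(flat_json_path, data_path, namespace, out_path):
    """Read a flat sidecar, wrap it, and write the nested version."""
    with open(flat_json_path) as handle:
        flat_metadata = json.load(handle)
    with open(out_path, "w") as handle:
        json.dump(wrap_sidecar(flat_metadata, data_path, namespace), handle, indent=2)
```

A call such as rewrite_sidecar_file("x.parquet.json.orig", "x.parquet", "hypotpro", "x.parquet.json") would produce a file with the same layout as the second example, provided the assumptions above hold.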

Possible Actions (suggested by Microsoft Copilot)

  • Document the required JSON structure for declad dropbox ingestion, including:
    -- required top‑level fields
    -- required nesting under "metadata"
    -- required checksum formats
    -- required run‑number fields
  • Clarify whether declad should accept the same JSON structure that MetaCat accepts directly, or whether the two systems are intentionally different.
  • Improve error reporting when JSON sidecars are malformed or missing required fields.
  • (Optional) Add a validation mode to declad that checks JSON sidecars and reports structural issues before ingestion.
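
As a rough sketch of what such a pre-ingestion check could look like, the Python function below reports structural problems in a sidecar before it is dropped into the dropbox. It is not part of declad and does not reflect declad's actual validation logic; the function name find_sidecar_problems is hypothetical, and the rules are only those listed under "Key differences required by declad" above.

```python
# Top-level fields the nested sidecar layout is reported to need.
REQUIRED_TOP_LEVEL = ("name", "namespace", "size", "checksums", "metadata")


def find_sidecar_problems(sidecar):
    """Return a list of human-readable problems; an empty list means
    the sidecar passed these (assumed, incomplete) structural checks."""
    if not isinstance(sidecar, dict):
        return ["sidecar is not a JSON object"]

    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in sidecar:
            problems.append(f"missing top-level field: {key}")

    if "size" in sidecar:
        size = sidecar["size"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(size, bool) or not isinstance(size, int) or size < 0:
            problems.append("size must be a non-negative integer")

    if "checksums" in sidecar:
        checksums = sidecar["checksums"]
        if not isinstance(checksums, dict) or not checksums:
            problems.append("checksums must be a non-empty object")

    if "metadata" in sidecar:
        metadata = sidecar["metadata"]
        if not isinstance(metadata, dict):
            problems.append("metadata must be an object")
        elif "rs.runs" in metadata and not isinstance(metadata["rs.runs"], list):
            problems.append("rs.runs must be a list, not a scalar")

    return problems
```

Run against the flat sidecar from the first example, this reports all five top-level fields as missing, which is the kind of explicit message that would have made the rejection easier to diagnose.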

(Gory details can be found in this Microsoft Copilot conversation: link)
