feat!: provenance for reconstruction and simulation files (#225) by Helveg · Pull Request #236 · dbbs-lab/bsb

Helveg · 2026-05-28T17:29:56Z

Status: first proposal, feedback wanted

This is a first proposal for end-to-end provenance, addressing #225. The set of attributes below is a starting point, not a finished spec; the bsb_schema_version fields exist precisely so we can evolve the layout. Please comment on missing attributes, naming, or anything that should be reshaped.

Notation: attributes in [brackets] are optional / diagnostic, best-effort, and hold the last known value (useful for sanity-checking, not load-bearing).

Revised per review feedback (thread above)

bsb_version -> bsb_core_version (explicit; every package and plugin is also in plugins).
Dropped the modified_at timestamps (not worth a write per mutation; state_id / revision already signal change, created_at is kept).
host, mpi_size, and the result file's scaffold.root are now bracketed optional/last-known, since reconstructions can be built in parts (redo / append).
Level 7 gains bsb_recording_kind so cells, compartments, synapses, LFP, ... each declare their own required fields (Level 8), instead of assuming every recording is a whole cell. (Named for the recording it annotates, distinct from bsb_device_kind.)
The schema-version fields identify this layout so future BSB versions can read / write / migrate older and newer schemas.

What this adds

Two artefacts gain provenance: the reconstruction file (the compiled network, written by a storage engine) and the simulation result file (the .nio Neo container). A result file back-references the reconstruction it was run against, so a recording can be traced to the exact network state, the software that produced it, and the cell it came from.

Full developer documentation lives under For Developers -> Interfaces -> Storage engines and Simulating Networks -> Simulation results.

Level 1 - Reconstruction: root metadata

Written on engine create(); state_id bumps on every mutating write. Exposed read-only as scaffold.storage_id, scaffold.state_id, scaffold.provenance.

Attribute	Type	Meaning
`storage_id`	str (UUID4)	Permanent identity of the file. Never rewritten.
`state_id`	int	Monotonic revision counter, bumped on every mutating write. Not a content hash.
`bsb_schema_version`	int	Version of this provenance layout.
`created_at`	str (ISO 8601 UTC)	Engine creation time.
`bsb_core_version`	str	`bsb-core` version at creation (all packages/plugins are also in `plugins`).
`engine_name`	str	`"hdf5"` or `"fs"`.
`engine_version`	str	Engine package version at creation.
`plugins`	dict	`{category: {entry_name: {package, version}}}` over all plugin categories.
`[host]`	dict	`{platform, python_version, hostname, user, cwd}` of the last writer. Diagnostic.
`[mpi_size]`	int	`comm.get_size()` of the last writer. Diagnostic.
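
As a rough illustration of the Level 1 bundle, the sketch below assembles the non-bracketed root attributes with the standard library only. `new_root_bundle` and `bump_state` are hypothetical names, not the engine's API; the starting `state_id` of 0, the schema version of 1, and the placeholder version strings are assumptions made for the example.

```python
import uuid
from datetime import datetime, timezone


def new_root_bundle(core_version, engine_name, engine_version, plugins=None):
    """Build a Level 1 bundle (hypothetical helper, not the engine's API)."""
    return {
        "storage_id": str(uuid.uuid4()),  # permanent identity, never rewritten
        "state_id": 0,  # assumed starting value of the revision counter
        "bsb_schema_version": 1,  # assumed value for this layout
        "created_at": datetime.now(timezone.utc).isoformat(),
        "bsb_core_version": core_version,
        "engine_name": engine_name,
        "engine_version": engine_version,
        "plugins": plugins or {},
    }


def bump_state(bundle):
    """Return a copy with state_id advanced by one; the identity is untouched."""
    return {**bundle, "state_id": bundle["state_id"] + 1}


bundle = new_root_bundle("0.0.0", "hdf5", "0.0.0")  # placeholder versions
bumped = bump_state(bundle)
```

The point of the split is that `storage_id` answers "which file is this?" while `state_id` answers "which revision of it?", so a mutating write changes only the latter.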

Level 2 - Reconstruction: per PlacementSet

Attribute	Type	Meaning
`revision`	int	Per-set counter, bumped on every write.
`created_at`	str (ISO 8601)	Set creation.
`morphology_hashes`	JSON list[str]	Per-loader content hashes, refreshed when morphology data changes.

(Alongside the existing len, morphology_loaders, labelsets, chunks attributes.)

Level 3 - Reconstruction: per ConnectivitySet

Attribute	Type	Meaning
`revision`	int	Per-set counter, bumped on every write.
`created_at`	str (ISO 8601)	Set creation.

Level 4 - Reconstruction: file store meta (per file)

Attribute	Type	Meaning
`content_sha256`	str	SHA-256 of the stored bytes.
`producer`	dict	`{package, version}` of whoever stored the file.
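
The per-file metadata can be pictured with a few lines of `hashlib`. This is only a sketch: `file_meta` is a hypothetical helper, and the package name and version passed to it are placeholders.

```python
import hashlib


def file_meta(data: bytes, package: str, version: str) -> dict:
    """Build Level 4 metadata for stored bytes (hypothetical helper)."""
    return {
        # Hex digest of exactly the bytes that end up in the file store.
        "content_sha256": hashlib.sha256(data).hexdigest(),
        "producer": {"package": package, "version": version},
    }


meta = file_meta(b"", "example-package", "0.0.0")  # placeholder producer
```

Unlike `state_id`, which is a counter, `content_sha256` is a content hash, so two stored files with identical bytes share the same value.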

Level 5 - Simulation result: Block annotation `bsb_provenance`

Attribute	Type	Meaning
`schema_version`	int	Version of this provenance layout.
`simulation_id`	str (UUID4)	Identity of this run.
`simulation_name`	str	Configured simulation name.
`started_at` / `finished_at`	str (ISO 8601)	Run start / end.
`wall_seconds`	float	Wall-clock run duration.
`seed`	int / null	Simulation seed.
`duration_ms` / `resolution_ms`	float	Simulated duration and step.
`scaffold`	dict	Back-reference: `{storage_id, state_id, [root]}` (`root` = best-effort last-known absolute path of the reconstruction file).
`plugins`	dict	Plugin manifest (as Level 1).
`simulator`	dict	`{name, version, extra}` (e.g. NEST modules loaded).
`[host]`	dict	`{platform, python_version, hostname, user, cwd}`. Diagnostic.
`[mpi_size]`	int	Number of ranks. Diagnostic.
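
One way a consumer might use the `scaffold` back-reference is to compare it against the root metadata of a reconstruction before trusting cell indices. The function below is a minimal sketch over plain dicts; `check_backref` and its return strings are hypothetical, and it deliberately ignores the optional `[root]` path.

```python
def check_backref(result_provenance, reconstruction_root):
    """Compare a result's scaffold back-reference with a reconstruction's root."""
    ref = result_provenance["scaffold"]
    if ref["storage_id"] != reconstruction_root["storage_id"]:
        return "different file"
    if ref["state_id"] != reconstruction_root["state_id"]:
        # Same file, but it was mutated after (or before) the run.
        return "same file, different state"
    return "match"


root = {"storage_id": "abc", "state_id": 7}  # made-up values
result = {"scaffold": {"storage_id": "abc", "state_id": 5}}
status = check_backref(result, root)
```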

Level 6 - Simulation result: per Segment annotations

Attribute	Type	Meaning
`segment_id`	str (UUID4)	Identity of this flush.
`checkpoint_index`	int	0-based flush index.
`t_start_ms` / `t_stop_ms`	float	Segment time window.
`simulator_state`	dict	Free slot for simulator-specific notes.

Level 7 - Simulation result: per recorded object, baseline (every recorder)

The convention is documented, not enforced; a recorder may emit any number of objects. Every recorded Neo object carries this baseline, regardless of what it records. What is recorded uses Neo's native name / units.

Annotation	Type	Meaning
`bsb_device_name`	str	Configured device name.
`bsb_device_kind`	str	Device `classmap_entry` (e.g. `spike_recorder`, `multimeter`).
`bsb_recording_kind`	str	What kind of thing is recorded: `cell`, `compartment`, `synapse`, `lfp`, `stimulus`, ... Selects the Level 8 fields.
`bsb_simulation_id`	str	Mirror of the Block `simulation_id`.
`bsb_segment_id`	str	Mirror of the Segment `segment_id`.

Level 8 - Simulation result: per recorded object, recording-kind extension

On top of the baseline, each bsb_recording_kind adds first-class flat bsb_* fields (siblings of the baseline keys) that locate its target, using BSB-native morphology addressing (branch / point / arc), never simulator-internal names. Open to feedback / new kinds.

`bsb_recording_kind`	Adds	Records
`cell`	`bsb_ps_name`, `bsb_cell_id`, `bsb_cell_model`	a whole cell (placement set, index within it, cell model)
`compartment`	the `cell` fields + `bsb_branch`, `bsb_point`, `bsb_arc` (+ proposed `bsb_coordinates` `{x,y,z,r}`)	a location on a cell's morphology
`synapse`	the postsynaptic `cell` fields + `bsb_branch`, `bsb_point`, `bsb_arc`, `bsb_synapse_type` + presynaptic identity (proposed: `bsb_pre_ps_name`, `bsb_pre_cell_id`)	a synapse on a post cell
`lfp`	electrode/probe identity + position (proposed: `bsb_probe`, `bsb_position`)	a field potential over a region
`stimulus`	`bsb_target_count`	a stimulator's own emitted output (e.g. a Poisson generator's spikes)

Built-in recorders: NEST spike_recorder / multimeter and Arbor spike_recorder -> cell; NEURON voltage_recorder / current_clamp -> compartment; NEURON synapse_recorder -> synapse; NEST poisson_generator / sinusoidal_poisson_generator -> stimulus. The lfp kind has no built-in recorder yet. The proposed bsb_coordinates is also the per-segment geometry an LFP probe needs.
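
Since the convention is documented rather than enforced, a reader may want to check annotations itself. The sketch below encodes the Level 7 baseline and the Level 8 additions from the table as sets and reports which keys are missing. `missing_fields` is a hypothetical helper; fields marked "proposed" in the table are left out, so `lfp` has no kind-specific requirements here.

```python
BASELINE = {
    "bsb_device_name", "bsb_device_kind", "bsb_recording_kind",
    "bsb_simulation_id", "bsb_segment_id",
}
CELL = {"bsb_ps_name", "bsb_cell_id", "bsb_cell_model"}
LOCATION = {"bsb_branch", "bsb_point", "bsb_arc"}

# Kind-specific fields; "proposed" fields from the table are omitted.
KIND_FIELDS = {
    "cell": CELL,
    "compartment": CELL | LOCATION,
    "synapse": CELL | LOCATION | {"bsb_synapse_type"},
    "lfp": set(),
    "stimulus": {"bsb_target_count"},
}


def missing_fields(annotations):
    """Return the sorted baseline + kind-specific keys absent from annotations."""
    kind = annotations.get("bsb_recording_kind")
    required = BASELINE | KIND_FIELDS.get(kind, set())
    return sorted(required - set(annotations))


# Made-up annotations for a compartment recording that lacks its arc.
example = {
    "bsb_device_name": "vrec", "bsb_device_kind": "voltage_recorder",
    "bsb_recording_kind": "compartment", "bsb_simulation_id": "s",
    "bsb_segment_id": "g", "bsb_ps_name": "cells", "bsb_cell_id": 3,
    "bsb_cell_model": "model", "bsb_branch": 0, "bsb_point": 2,
}
missing = missing_fields(example)
```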

Recorder interface (runtime inspection)

To support controller-style devices (e.g. an LFP probe per #50), a SimulationRecorder is inspectable at runtime, before anything is written to file:

recorder.device_name links it back to the device that created it (every built-in recorder passes device=self).
recorder.meta(property) exposes recorder-level metadata (e.g. recorder.meta("lfp_source_geometry")).

Combined with the per-object bsb_device_name annotation, this lets a controller find the recorders of the devices it manages and query their metadata during a flush. The remaining piece for a functioning LFP probe, per-checkpoint flushing of results, is tracked separately in #50.
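
The runtime-inspection idea can be sketched with a stand-in class. `SketchRecorder` is not the real `SimulationRecorder`; only the `device_name` attribute and the `meta(...)` method come from the description above, and the constructor signature, the `None` default for unknown properties, and `recorders_of` are assumptions.

```python
class SketchRecorder:
    """Stand-in for a recorder that is inspectable before anything is written."""

    def __init__(self, device_name, metadata=None):
        self.device_name = device_name  # links back to the creating device
        self._metadata = dict(metadata or {})

    def meta(self, prop):
        # Assumed behaviour: unknown properties yield None.
        return self._metadata.get(prop)


def recorders_of(recorders, device_names):
    """Select the recorders created by the devices a controller manages."""
    return [r for r in recorders if r.device_name in device_names]


recorders = [
    SketchRecorder("probe_a", {"lfp_source_geometry": [(0.0, 0.0, 0.0, 1.0)]}),
    SketchRecorder("spikes"),
]
managed = recorders_of(recorders, {"probe_a"})
geometry = managed[0].meta("lfp_source_geometry")
```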

Reader helper

The helpers (`from bsb import read_nio, iter_recordings`) flatten a result file into Recording records; filter by ps_name, cell_id, device, recorded quantity, or any bsb_* annotation key (e.g. bsb_recording_kind, bsb_branch).
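
The signatures of `read_nio` and `iter_recordings` are not spelled out in this description, so the sketch below does not call them. It only illustrates filtering by annotation key on plain dicts that stand in for Recording records; `filter_recordings` and the sample records are hypothetical.

```python
def filter_recordings(records, **criteria):
    """Yield the records whose annotations equal every given key/value pair."""
    for record in records:
        if all(record.get(key) == value for key, value in criteria.items()):
            yield record


# Made-up records standing in for flattened Recording entries.
records = [
    {"bsb_recording_kind": "cell", "bsb_ps_name": "a", "bsb_cell_id": 0},
    {"bsb_recording_kind": "compartment", "bsb_ps_name": "a", "bsb_branch": 2},
    {"bsb_recording_kind": "cell", "bsb_ps_name": "b", "bsb_cell_id": 4},
]
cells_in_a = list(filter_recordings(records, bsb_recording_kind="cell", bsb_ps_name="a"))
```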

Compatibility

Legacy reconstruction files without a bundle are backfilled on first write (a one-shot BsbProvenanceUpgradeWarning); read-only opens leave storage_id / state_id as None.
Breaking: recorder output annotations moved from the ad-hoc device / senders / cell_type / cell_id keys to the layered bsb_* convention.

Test plan

bsb-hdf5 and bsb-core provenance unit tests (root attrs round-trip, state bumping, legacy auto-upgrade, FS metadata.json migration, Scaffold API, Block/Segment/recorder annotations, baseline + recording-kind layering, recorder device_name / meta() runtime inspection, iter_recordings filtering)
MPI-safe: suite runs clean under mpiexec -n 2 (FS provenance writes locked; single-rank-only assertions marked skip_parallel)
existing bsb-core / bsb-hdf5 suites still green
check-api passes; full docs build passes with zero warnings
review-feedback alignment applied: bsb_version -> bsb_core_version, dropped modified_at, bsb_target_kind -> bsb_recording_kind, bracketed host / mpi_size / root
reviewer feedback on the attribute set per level, especially the Level 8 recording-kind taxonomy and the LFP/LFPy integration #50 alignment

🤖 Generated with Claude Code

📚 Documentation preview 📚: https://bsb-nest--236.org.readthedocs.build/en/236/

📚 Documentation preview 📚: https://bsb-hdf5--236.org.readthedocs.build/en/236/

📚 Documentation preview 📚: https://bsb-arbor--236.org.readthedocs.build/en/236/

📚 Documentation preview 📚: https://bsb--236.org.readthedocs.build/en/236/

📚 Documentation preview 📚: https://bsb-core--236.org.readthedocs.build/en/236/

📚 Documentation preview 📚: https://bsb-neuron--236.org.readthedocs.build/en/236/

Reconstruction files (HDF5 and FS engines) now carry a root-level provenance bundle: a permanent storage_id (UUID4), a monotonic state_id revision counter, timestamps, the bsb-core/engine versions, a plugin manifest, host info and mpi_size. Placement and connectivity sets gain per-set revision/timestamps (and morphology_hashes for placement); the file store records content_sha256 and a producer per file. Legacy files without a bundle are backfilled on first write. Simulation result (.nio) files annotate the Block with a bsb_provenance dict that back-references the reconstruction (storage_id + state_id), the simulator and its version, the plugin manifest, timing, seed and host. Each recorder's Neo objects follow a documented bsb_* annotation convention identifying the source cell (ps_name, cell_id, cell_model) and device. Adds read_nio / iter_recordings helpers. BREAKING CHANGE: recorder output annotations changed from the ad-hoc device/senders/cell_type/cell_id keys to the bsb_* convention, and the storage root gains a provenance schema. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The FS engine's _bump_state and legacy-upgrade wrote metadata.json without the engine lock, so concurrent MPI ranks raced on the tmp+os.replace and could clobber each other (and double-upgrade a legacy root). Take the write lock for the read-modify-write, and re-check inside the lock during upgrade so only the first rank stamps the bundle. Mark the three single-rank provenance tests skip_parallel: they assert behaviour that only holds on one rank (the upgrade warning is emitted on the main rank only, and two use rank-local temp paths / exact direct bump counts). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
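
The race described in this commit is the classic unlocked read-modify-write. The sketch below shows the general pattern of taking a lock around the whole read-bump-replace sequence. It is not the engine's code: a `threading.Lock` stands in for the engine's write lock, threads stand in for MPI ranks, and `bump_state` / `_write_atomic` are hypothetical names.

```python
import json
import os
import tempfile
import threading

_write_lock = threading.Lock()  # stand-in for the engine's write lock


def _write_atomic(path, data):
    """Write JSON to a temp file in the same directory, then swap it in."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        json.dump(data, f)
    os.replace(tmp, path)


def bump_state(path):
    """Read, increment and rewrite state_id while holding the lock."""
    with _write_lock:
        with open(path) as f:
            meta = json.load(f)
        meta["state_id"] += 1
        _write_atomic(path, meta)
        return meta["state_id"]


with tempfile.TemporaryDirectory() as root:
    meta_path = os.path.join(root, "metadata.json")
    _write_atomic(meta_path, {"state_id": 0})
    # Four concurrent writers, ten bumps each.
    workers = [
        threading.Thread(target=lambda: [bump_state(meta_path) for _ in range(10)])
        for _ in range(4)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    with open(meta_path) as f:
        final_state = json.load(f)["state_id"]
```

Without the lock, two writers can read the same `state_id` and both write back the same incremented value, losing a bump; with it, every bump is counted.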

Helveg · 2026-05-28T20:06:49Z

A first note to self: Level 7 currently seems to assume that everything is a cell. We need a set of standardized attributes for synapses, compartments, and possibly other targets as well.

drodarie · 2026-05-29T09:24:39Z

Here are my first thoughts on this proposal (I did not check the code implementation yet):

Regarding reconstruction files:

Consider that reconstructions (unlike simulations) can be done in parts (thanks to redo, append), hence some attributes might not be relevant in this context, e.g. host, mpi_size.
bsb_version --> bsb_core_version: make the attribute explicit.
Regarding saving every mutating write in modified_at, I wonder if it is worth it: why not save the last mutation upon closing or when catching an exception? If there is an error, I do not think knowing the time of the last write to the file is worth an additional write for every write operation.
I am not sure what bsb_schema_version and schema_version in the simulation results file are supposed to represent.

Regarding simulation results files:

Does the scaffold root attribute correspond to the path to the reconstruction file, or to something else?
As was pointed out before, bsb_cell_id and bsb_cell_model indeed do not work with synaptic or LFP recordings.

Helveg · 2026-05-29T12:16:26Z

I'll edit my post and mark some attributes as "possibly helpful but not so important", like scaffold.root --> scaffold.[root]. It may be helpful for someone looking to diagnose or sanity-check the absolute path this file was (last/first) written to.

host, mpi_size, root --> [host], [mpi_size], [root]. Should indicate the last known value
bsb_version --> bsb_core_version (please note that all bsb plugins and packages listed in packages are also included in the plugin manifest)
will remove the modified_at timestamps
the schema versions refer to exactly this schema; by including a schema version we can evolve it over time and provide tools to read/write/migrate older/newer schemas as well
let's introduce a bsb_recorder_kind for level 7 so that we can define different required attributes.

…et kind Split the recorder convention into a baseline every recorder shares (bsb_device_name/kind, bsb_target_kind, bsb_simulation_id/segment_id) and a target-kind layer selected by bsb_target_kind ("cell", "compartment", "synapse", "lfp", ...). Per-kind fields are now first-class flat bsb_* annotations (e.g. bsb_section, bsb_arc, bsb_synapse_type) instead of a nested bsb_location dict. Built-in recorders emit cell (NEST/Arbor spikes, multimeter), compartment (NEURON voltage/current clamp) and synapse (NEURON synapse) kinds. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…pse recordings Replace the NEURON-flavoured section/segment location fields with the BSB-native morphology address: bsb_branch, bsb_point, bsb_arc, taken straight from the recorder's location accessor (loc.location -> (branch, point), loc.arc()). Reserve a proposed bsb_coordinates {x, y, z, r} dict for the resolved point position, which is also the per-segment geometry an LFP probe consumes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Extend SimulationRecorder with a device_name attribute and a meta(property) method, both queryable at runtime before anything is written to file. Every built-in recorder now passes device=self to create_recorder, and a recorder can carry metadata (e.g. an LFP source geometry). This lets a controller find the recorders of the devices it manages and inspect their metadata during a flush, the missing piece for LFP-style probes (see #50). Also migrate sinusoidal_poisson_generator to the bsb_* convention and tag both Poisson generators with the "stimulus" target kind so the baseline holds. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…rding_kind, drop modified_at)
Apply the agreed thread decisions on #236:
- rename the root-metadata `bsb_version` -> `bsb_core_version` (explicit; packages and plugins are already in the plugin manifest)
- drop the `modified_at` timestamps from the root bundle and from per-PlacementSet / per-ConnectivitySet attrs; `state_id` / `revision` already signal change and `created_at` is kept
- rename the recorder discriminator `bsb_target_kind` -> `bsb_recording_kind` (annotates a recording; avoids confusion with `bsb_device_kind`)
- document `host` / `mpi_size` (and the result file's `scaffold.root`) as optional, diagnostic, best-effort last-known values, since reconstructions can be partial
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…-225 # Conflicts: # packages/bsb-core/bsb/__init__.py # packages/bsb-core/bsb/storage/fs/file_store.py

- cache the plugin manifest so repeated engine create() stays within the storage interface test timeout
- emit one spiketrain per targeted cell in the NEST and Arbor spike recorders so population size stays recoverable, and update the simulation tests to the bsb_* annotation convention
- resolve ruff SIM105/I001/E501 findings surfaced by the merge
- reference neo classes via their neo.core.* targets so the bsb-core docs build clean under -nW
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The classmap entry is stored on the dynamic root, not on the leaf class, so reading self.__class__.classmap_entry crashed the poisson and sinusoidal generators. Reverse-look it up in _device_kind and add a stimulus_train helper so both generators share the baseline annotation path instead of building the SpikeTrain by hand. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

codecov · 2026-05-29T18:25:43Z

Codecov Report

❌ Patch coverage is 73.57724% with 130 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (v8@4274292). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
packages/bsb-core/bsb/storage/fs/__init__.py	51.72%	24 Missing and 4 partials ⚠️
packages/bsb-hdf5/bsb_hdf5/__init__.py	78.57%	10 Missing and 5 partials ⚠️
packages/bsb-hdf5/bsb_hdf5/placement_set.py	64.10%	13 Missing and 1 partial ⚠️
packages/bsb-core/bsb/simulation/results.py	85.71%	10 Missing and 3 partials ⚠️
...ges/bsb-neuron/bsb_neuron/devices/current_clamp.py	7.14%	13 Missing ⚠️
.../bsb-neuron/bsb_neuron/devices/synapse_recorder.py	7.14%	13 Missing ⚠️
packages/bsb-core/bsb/storage/provenance.py	80.35%	11 Missing ⚠️
.../bsb-neuron/bsb_neuron/devices/voltage_recorder.py	8.33%	11 Missing ⚠️
packages/bsb-hdf5/bsb_hdf5/connectivity_set.py	79.16%	4 Missing and 1 partial ⚠️
packages/bsb-core/bsb/storage/fs/file_store.py	77.77%	2 Missing ⚠️
... and 3 more

Additional details and impacted files

@@          Coverage Diff          @@
##             v8     #236   +/-   ##
=====================================
  Coverage      ?   84.00%           
=====================================
  Files         ?      132           
  Lines         ?    14332           
  Branches      ?     1677           
=====================================
  Hits          ?    12039           
  Misses        ?     1890           
  Partials      ?      403


A device already exposes the classmap entry it was configured under on its dynamic attribute (`self.device`), so read that directly instead of reverse-looking-up the dynamic root's classmap. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Helveg · 2026-05-29T18:37:16Z

@drodarie ready for review and for another round of feedback; especially L7 and L8 have been changed

… writer Drop the separate _atomic_write_json and route the provenance bundle through _atomic_write_bytes (staged outside the discovery dir + os.replace) so the engine keeps a single, reviewed race-safe write path instead of a parallel implementation. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ests The provenance tests opened storage in a per-rank tempfile.TemporaryDirectory while the engine broadcasts rank-0's root, so whichever rank left its `with` block first removed the shared directory out from under the others, flaking `test_scaffold_exposes_storage_id_state_id_provenance` under mpiexec with an empty provenance bundle. Route the parallel tests through RandomStorageFixture, which derives an MPI-safe root and cleans up collectively in tearDownClass; the single-rank @skip_parallel tests keep their own tempdir. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

nx may write project.json with // or /* */ comments, which json.loads rejects, breaking the monorepo docs conf.py that reads doc dependencies (surfacing as a failed bsb-otel Read the Docs build). Parse it as JSONC. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…json" This reverts commit 9783250.

drodarie · 2026-05-30T11:00:30Z

I believe stimulus and recorder devices should be stored the same way in level 8 since a stimulus is basically the inverse of a recorder. We could maybe add or reuse a flag in level 7 to indicate if the device is recording or stimulating?

drodarie · 2026-06-03T10:41:39Z

Another important point: we should provide a utility script to help users update their current reconstruction and simulation files to the new format. Otherwise, these would not be updated.

github-actions Bot added feat breaking change labels May 28, 2026

Helveg changed the base branch from main to v8 May 28, 2026 17:52

Helveg and others added 7 commits May 29, 2026 14:37

Merge remote-tracking branch 'origin/main' into feat/provenance-issue…

79f6457

…-225 # Conflicts: # packages/bsb-core/bsb/__init__.py # packages/bsb-core/bsb/storage/fs/file_store.py

Helveg and others added 4 commits May 29, 2026 20:52

Revert "fix(sphinxext): tolerate JSONC comments when reading project.…

68869d7

…json" This reverts commit 9783250.

Helveg wants to merge 14 commits into v8 from feat/provenance-issue-225
