feat!: provenance for reconstruction and simulation files (#225) #236

Open
Helveg wants to merge 14 commits into v8 from feat/provenance-issue-225

@Helveg
Contributor

@Helveg Helveg commented May 28, 2026

Status: first proposal, feedback wanted

This is a first proposal for end-to-end provenance, addressing #225. The set of attributes below is a starting point, not a finished spec; the bsb_schema_version fields exist precisely so we can evolve the layout. Please comment on missing attributes, naming, or anything that should be reshaped.

Notation: attributes in [brackets] are optional / diagnostic, best-effort, and hold the last known value (useful for sanity-checking, not load-bearing).

Revised per review feedback (thread below)

  • bsb_version -> bsb_core_version (explicit; every package and plugin is also in plugins).
  • Dropped the modified_at timestamps (not worth a write per mutation; state_id / revision already signal change, created_at is kept).
  • host, mpi_size, and the result file's scaffold.root are now bracketed optional/last-known, since reconstructions can be built in parts (redo / append).
  • Level 7 gains bsb_recording_kind so cells, compartments, synapses, LFP, ... each declare their own required fields (Level 8), instead of assuming every recording is a whole cell. (Named for the recording it annotates, distinct from bsb_device_kind.)
  • The schema-version fields identify this layout so future BSB versions can read / write / migrate older and newer schemas.

What this adds

Two artefacts gain provenance: the reconstruction file (the compiled network, written by a storage engine) and the simulation result file (the .nio Neo container). A result file back-references the reconstruction it was run against, so a recording can be traced to the exact network state, the software that produced it, and the cell it came from.

Full developer documentation lives under For Developers -> Interfaces -> Storage engines and Simulating Networks -> Simulation results.


Level 1 - Reconstruction: root metadata

Written on engine create(); state_id bumps on every mutating write. Exposed read-only as scaffold.storage_id, scaffold.state_id, scaffold.provenance.

| Attribute | Type | Meaning |
|---|---|---|
| storage_id | str (UUID4) | Permanent identity of the file. Never rewritten. |
| state_id | int | Monotonic revision counter, bumped on every mutating write. Not a content hash. |
| bsb_schema_version | int | Version of this provenance layout. |
| created_at | str (ISO 8601 UTC) | Engine creation time. |
| bsb_core_version | str | bsb-core version at creation (all packages/plugins are also in plugins). |
| engine_name | str | "hdf5" or "fs". |
| engine_version | str | Engine package version at creation. |
| plugins | dict | {category: {entry_name: {package, version}}} over all plugin categories. |
| [host] | dict | {platform, python_version, hostname, user, cwd} of the last writer. Diagnostic. |
| [mpi_size] | int | comm.get_size() of the last writer. Diagnostic. |
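As a rough illustration of the Level 1 layout, the sketch below builds such a bundle with the standard library only. The helper names (`new_root_provenance`, `bump_state`), the starting values of `state_id` and `bsb_schema_version`, and the placeholder version strings are assumptions made for this example, not the PR's implementation.

```python
import platform
import uuid
from datetime import datetime, timezone


def new_root_provenance(engine_name, engine_version, core_version, plugins):
    """Build a Level 1 style bundle (hypothetical helper, not the PR's code)."""
    return {
        "storage_id": str(uuid.uuid4()),  # permanent identity, never rewritten
        "state_id": 0,  # assumed starting value; a counter, not a content hash
        "bsb_schema_version": 1,  # assumed starting value
        "created_at": datetime.now(timezone.utc).isoformat(),
        "bsb_core_version": core_version,
        "engine_name": engine_name,
        "engine_version": engine_version,
        "plugins": plugins,
        # Optional / diagnostic: only the last known values of the last writer.
        "host": {
            "platform": platform.platform(),
            "python_version": platform.python_version(),
        },
    }


def bump_state(bundle):
    """Return a copy with state_id incremented; storage_id stays untouched."""
    return {**bundle, "state_id": bundle["state_id"] + 1}


# Placeholder versions and an empty plugin manifest, for illustration only.
bundle = new_root_provenance("hdf5", "0.0.0", "0.0.0", {})
bumped = bump_state(bundle)
```

The point of the split is that `storage_id` answers "which file is this?" while `state_id` answers "has it changed since I last looked?".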

Level 2 - Reconstruction: per PlacementSet

| Attribute | Type | Meaning |
|---|---|---|
| revision | int | Per-set counter, bumped on every write. |
| created_at | str (ISO 8601) | Set creation. |
| morphology_hashes | JSON list[str] | Per-loader content hashes, refreshed when morphology data changes. |

(Alongside the existing len, morphology_loaders, labelsets, chunks attributes.)

Level 3 - Reconstruction: per ConnectivitySet

| Attribute | Type | Meaning |
|---|---|---|
| revision | int | Per-set counter, bumped on every write. |
| created_at | str (ISO 8601) | Set creation. |

Level 4 - Reconstruction: file store meta (per file)

| Attribute | Type | Meaning |
|---|---|---|
| content_sha256 | str | SHA-256 of the stored bytes. |
| producer | dict | {package, version} of whoever stored the file. |
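A minimal sketch of how the Level 4 fields could be computed and later checked, using only `hashlib`. The function names and the `example-package` producer are hypothetical; the PR's file store may organise this differently.

```python
import hashlib


def file_store_meta(data: bytes, package: str, version: str) -> dict:
    """Build per-file meta in the Level 4 shape (hypothetical helper)."""
    return {
        "content_sha256": hashlib.sha256(data).hexdigest(),
        "producer": {"package": package, "version": version},
    }


def content_matches(data: bytes, meta: dict) -> bool:
    """Check stored bytes against the recorded digest."""
    return hashlib.sha256(data).hexdigest() == meta["content_sha256"]


# Illustrative values only.
meta = file_store_meta(b"", "example-package", "0.0.0")
```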

Level 5 - Simulation result: Block annotation bsb_provenance

| Attribute | Type | Meaning |
|---|---|---|
| schema_version | int | Version of this provenance layout. |
| simulation_id | str (UUID4) | Identity of this run. |
| simulation_name | str | Configured simulation name. |
| started_at / finished_at | str (ISO 8601) | Run start / end. |
| wall_seconds | float | Wall-clock run duration. |
| seed | int / null | Simulation seed. |
| duration_ms / resolution_ms | float | Simulated duration and step. |
| scaffold | dict | Back-reference: {storage_id, state_id, [root]} (root = best-effort last-known absolute path of the reconstruction file). |
| plugins | dict | Plugin manifest (as Level 1). |
| simulator | dict | {name, version, extra} (e.g. NEST modules loaded). |
| [host] | dict | {platform, python_version, hostname, user, cwd}. Diagnostic. |
| [mpi_size] | int | Number of ranks. Diagnostic. |
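To show how the `scaffold` back-reference might be consumed, here is a small sketch that compares a result file's `bsb_provenance` dict against a reconstruction's root bundle. Both are plain dicts here; the function name and the three return labels are assumptions for illustration, and `root` is deliberately ignored because it is only a best-effort hint.

```python
def check_back_reference(result_prov: dict, root_prov: dict) -> str:
    """Classify how a result file relates to a reconstruction (hypothetical)."""
    ref = result_prov["scaffold"]
    if ref["storage_id"] != root_prov["storage_id"]:
        return "different-file"
    if ref["state_id"] != root_prov["state_id"]:
        # Same file, but it was mutated before or after the run.
        return "same-file-different-state"
    return "exact-match"


# Illustrative identifiers only.
root = {"storage_id": "abc", "state_id": 7}
result = {"scaffold": {"storage_id": "abc", "state_id": 5, "root": "/tmp/net.hdf5"}}
status = check_back_reference(result, root)
```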

Level 6 - Simulation result: per Segment annotations

| Attribute | Type | Meaning |
|---|---|---|
| segment_id | str (UUID4) | Identity of this flush. |
| checkpoint_index | int | 0-based flush index. |
| t_start_ms / t_stop_ms | float | Segment time window. |
| simulator_state | dict | Free slot for simulator-specific notes. |

Level 7 - Simulation result: per recorded object, baseline (every recorder)

The convention is documented, not enforced; a recorder may emit any number of objects. Every recorded Neo object carries this baseline, regardless of what it records. What is recorded uses Neo's native name / units.

| Annotation | Type | Meaning |
|---|---|---|
| bsb_device_name | str | Configured device name. |
| bsb_device_kind | str | Device classmap_entry (e.g. spike_recorder, multimeter). |
| bsb_recording_kind | str | What kind of thing is recorded: cell, compartment, synapse, lfp, stimulus, ... Selects the Level 8 fields. |
| bsb_simulation_id | str | Mirror of the Block simulation_id. |
| bsb_segment_id | str | Mirror of the Segment segment_id. |

Level 8 - Simulation result: per recorded object, recording-kind extension

On top of the baseline, each bsb_recording_kind adds first-class flat bsb_* fields (siblings of the baseline keys) that locate its target, using BSB-native morphology addressing (branch / point / arc), never simulator-internal names. Open to feedback / new kinds.

| bsb_recording_kind | Adds | Records |
|---|---|---|
| cell | bsb_ps_name, bsb_cell_id, bsb_cell_model | a whole cell (placement set, index within it, cell model) |
| compartment | the cell fields + bsb_branch, bsb_point, bsb_arc (+ proposed bsb_coordinates {x,y,z,r}) | a location on a cell's morphology |
| synapse | the postsynaptic cell fields + bsb_branch, bsb_point, bsb_arc, bsb_synapse_type + presynaptic identity (proposed: bsb_pre_ps_name, bsb_pre_cell_id) | a synapse on a post cell |
| lfp | electrode/probe identity + position (proposed: bsb_probe, bsb_position) | a field potential over a region |
| stimulus | bsb_target_count | a stimulator's own emitted output (e.g. a Poisson generator's spikes) |

Built-in recorders: NEST spike_recorder / multimeter and Arbor spike_recorder -> cell; NEURON voltage_recorder / current_clamp -> compartment; NEURON synapse_recorder -> synapse; NEST poisson_generator / sinusoidal_poisson_generator -> stimulus. The lfp kind has no built-in recorder yet. The proposed bsb_coordinates is also the per-segment geometry an LFP probe needs.
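The layering of Level 7 and Level 8 can be sketched as a small checker over an annotation dict. Since the convention is documented rather than enforced, this is only an illustration: the `missing_fields` helper is hypothetical, fields marked "proposed" in the table are left out, and the example annotation values are invented.

```python
BASELINE = (
    "bsb_device_name",
    "bsb_device_kind",
    "bsb_recording_kind",
    "bsb_simulation_id",
    "bsb_segment_id",
)
CELL = ("bsb_ps_name", "bsb_cell_id", "bsb_cell_model")
LOCATION = ("bsb_branch", "bsb_point", "bsb_arc")

# Only non-proposed fields from the table; "lfp" has none yet.
KIND_FIELDS = {
    "cell": CELL,
    "compartment": CELL + LOCATION,
    "synapse": CELL + LOCATION + ("bsb_synapse_type",),
    "lfp": (),
    "stimulus": ("bsb_target_count",),
}


def missing_fields(annotations: dict) -> list:
    """List baseline and kind-specific keys absent from the annotations."""
    kind = annotations.get("bsb_recording_kind")
    required = BASELINE + KIND_FIELDS.get(kind, ())
    return [key for key in required if key not in annotations]


# Invented example values.
spike = {
    "bsb_device_name": "spikes",
    "bsb_device_kind": "spike_recorder",
    "bsb_recording_kind": "cell",
    "bsb_simulation_id": "sim-1",
    "bsb_segment_id": "seg-1",
    "bsb_ps_name": "granule_cell",
    "bsb_cell_id": 12,
    "bsb_cell_model": "granule_cell",
}
voltage = {**spike, "bsb_recording_kind": "compartment", "bsb_branch": 3, "bsb_point": 0}
```

Here `missing_fields(spike)` is empty, while `missing_fields(voltage)` reports the absent `bsb_arc`.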

Recorder interface (runtime inspection)

To support controller-style devices (e.g. an LFP probe per #50), a SimulationRecorder is inspectable at runtime, before anything is written to file:

  • recorder.device_name links it back to the device that created it (every built-in recorder passes device=self).
  • recorder.meta(property) exposes recorder-level metadata (e.g. recorder.meta("lfp_source_geometry")).

Combined with the per-object bsb_device_name annotation, this lets a controller find the recorders of the devices it manages and query their metadata during a flush. The remaining piece for a functioning LFP probe, per-checkpoint flushing of results, is tracked separately in #50.
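A toy model of the runtime inspection described above, under stated assumptions: `SketchRecorder` and `SketchDevice` are stand-ins invented for this example (they are not the `SimulationRecorder` class), the device is assumed to expose a `name` attribute, and recorder metadata is modelled as keyword arguments.

```python
class SketchDevice:
    """Stand-in for a configured device; only the name matters here."""

    def __init__(self, name):
        self.name = name


class SketchRecorder:
    """Stand-in recorder exposing device_name and meta(), as in the proposal."""

    def __init__(self, device=None, **metadata):
        self.device_name = getattr(device, "name", None)
        self._metadata = metadata

    def meta(self, key):
        # Returns None for unknown metadata keys (an assumption of this sketch).
        return self._metadata.get(key)


def recorders_of(recorders, device_names):
    """Select the recorders created by the devices a controller manages."""
    wanted = set(device_names)
    return [r for r in recorders if r.device_name in wanted]


probe = SketchDevice("lfp_probe")
other = SketchDevice("spikes")
recorders = [
    SketchRecorder(device=probe, lfp_source_geometry=[(0.0, 0.0, 0.0, 1.0)]),
    SketchRecorder(device=other),
]
managed = recorders_of(recorders, ["lfp_probe"])
geometry = managed[0].meta("lfp_source_geometry")
```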


Reader helper

The read_nio and iter_recordings helpers (from bsb import read_nio, iter_recordings) flatten a result file into Recording records, which can be filtered by ps_name, cell_id, device, recorded quantity, or any bsb_* annotation key (e.g. bsb_recording_kind, bsb_branch).
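The exact signatures of these helpers are not shown in this thread, so the sketch below does not call them. It only illustrates the filtering idea on plain dicts of annotations; `filter_recordings` and the sample records are hypothetical.

```python
def filter_recordings(records, **criteria):
    """Keep records whose annotations equal every given key/value pair."""
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]


# Invented flattened records, one dict of annotations per recorded object.
records = [
    {"bsb_recording_kind": "cell", "bsb_ps_name": "granule_cell", "bsb_cell_id": 0},
    {"bsb_recording_kind": "cell", "bsb_ps_name": "golgi_cell", "bsb_cell_id": 0},
    {"bsb_recording_kind": "compartment", "bsb_ps_name": "golgi_cell", "bsb_branch": 2},
]
golgi_cells = filter_recordings(
    records, bsb_recording_kind="cell", bsb_ps_name="golgi_cell"
)
```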

Compatibility

  • Legacy reconstruction files without a bundle are backfilled on first write (a one-shot BsbProvenanceUpgradeWarning); read-only opens leave storage_id / state_id as None.
  • Breaking: recorder output annotations moved from the ad-hoc device / senders / cell_type / cell_id keys to the layered bsb_* convention.

Test plan

  • bsb-hdf5 and bsb-core provenance unit tests (root attrs round-trip, state bumping, legacy auto-upgrade, FS metadata.json migration, Scaffold API, Block/Segment/recorder annotations, baseline + recording-kind layering, recorder device_name / meta() runtime inspection, iter_recordings filtering)
  • MPI-safe: suite runs clean under mpiexec -n 2 (FS provenance writes locked; single-rank-only assertions marked skip_parallel)
  • existing bsb-core / bsb-hdf5 suites still green
  • check-api passes; full docs build passes with zero warnings
  • review-feedback alignment applied: bsb_version -> bsb_core_version, dropped modified_at, bsb_target_kind -> bsb_recording_kind, bracketed host / mpi_size / root
  • reviewer feedback on the attribute set per level, especially the Level 8 recording-kind taxonomy and the LFP/LFPy integration #50 alignment

🤖 Generated with Claude Code


📚 Documentation preview 📚: https://bsb-nest--236.org.readthedocs.build/en/236/


📚 Documentation preview 📚: https://bsb-hdf5--236.org.readthedocs.build/en/236/


📚 Documentation preview 📚: https://bsb-arbor--236.org.readthedocs.build/en/236/


📚 Documentation preview 📚: https://bsb--236.org.readthedocs.build/en/236/


📚 Documentation preview 📚: https://bsb-core--236.org.readthedocs.build/en/236/


📚 Documentation preview 📚: https://bsb-neuron--236.org.readthedocs.build/en/236/

Reconstruction files (HDF5 and FS engines) now carry a root-level
provenance bundle: a permanent storage_id (UUID4), a monotonic state_id
revision counter, timestamps, the bsb-core/engine versions, a plugin
manifest, host info and mpi_size. Placement and connectivity sets gain
per-set revision/timestamps (and morphology_hashes for placement); the
file store records content_sha256 and a producer per file. Legacy files
without a bundle are backfilled on first write.

Simulation result (.nio) files annotate the Block with a bsb_provenance
dict that back-references the reconstruction (storage_id + state_id),
the simulator and its version, the plugin manifest, timing, seed and
host. Each recorder's Neo objects follow a documented bsb_* annotation
convention identifying the source cell (ps_name, cell_id, cell_model)
and device. Adds read_nio / iter_recordings helpers.

BREAKING CHANGE: recorder output annotations changed from the ad-hoc
device/senders/cell_type/cell_id keys to the bsb_* convention, and the
storage root gains a provenance schema.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Helveg Helveg changed the base branch from main to v8 May 28, 2026 17:52
The FS engine's _bump_state and legacy-upgrade wrote metadata.json without
the engine lock, so concurrent MPI ranks raced on the tmp+os.replace and
could clobber each other (and double-upgrade a legacy root). Take the write
lock for the read-modify-write, and re-check inside the lock during upgrade
so only the first rank stamps the bundle.

Mark the three single-rank provenance tests skip_parallel: they assert
behaviour that only holds on one rank (the upgrade warning is emitted on the
main rank only, and two use rank-local temp paths / exact direct bump counts).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
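The locked read-modify-write with a re-check inside the lock, as described in the commit message above, can be sketched as follows. This is not the FS engine's code: a `threading.Lock` stands in for the engine's write lock (which must work across MPI ranks), the helper names are invented, and the temporary file is staged next to the target for simplicity.

```python
import json
import os
import tempfile
import threading
import uuid

_write_lock = threading.Lock()  # stand-in for the engine's write lock


def _atomic_write_json(path, data):
    """Write to a temp file, then swap it in with os.replace."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def bump_state(path):
    """Read, increment and write state_id while holding the lock."""
    with _write_lock:
        with open(path) as f:
            meta = json.load(f)
        meta["state_id"] = meta.get("state_id", 0) + 1
        _atomic_write_json(path, meta)
        return meta["state_id"]


def upgrade_legacy(path):
    """Stamp a bundle only if nobody did so before we got the lock."""
    with _write_lock:
        with open(path) as f:
            meta = json.load(f)
        if "storage_id" in meta:  # re-check inside the lock
            return False
        meta["storage_id"] = str(uuid.uuid4())
        meta["state_id"] = 0
        _atomic_write_json(path, meta)
        return True
```

Without the re-check, two writers that both saw a legacy root before acquiring the lock would each stamp their own `storage_id`.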
@Helveg
Contributor Author

Helveg commented May 28, 2026

A first note to self: Level 7 currently seems to assume that everything is a cell. We need a set of standardized attributes for synapses, compartments, and possibly other targets as well.

@drodarie
Contributor

Here are my first thoughts on this proposal (I did not check the code implementation yet):

Regarding reconstruction files:

  • Consider that reconstructions (unlike simulations) can be done in parts (thanks to redo, append), hence some attributes might not be relevant in this context, e.g. host, mpi_size.
  • bsb_version --> bsb_core_version: make the attribute explicit.
  • Regarding updating modified_at on every mutating write, I wonder if it is worth it: why not save the last mutation time upon closing or when catching an exception? If there is an error, I do not think knowing the time of the last write to the file is worth an additional write for every write operation.
  • I am not sure what bsb_schema_version and schema_version in the simulation results file are supposed to represent.

Regarding simulation results files:

  • Does the scaffold root attribute correspond to the path to the reconstruction file, or to something else?
  • As was pointed out before, bsb_cell_id and bsb_cell_model do indeed not work with synaptic or LFP recordings.

@Helveg
Contributor Author

Helveg commented May 29, 2026

I'll edit my post and mark some attributes as "possibly helpful but not so important", like scaffold.root --> scaffold.[root]. It may help someone diagnose or sanity-check the absolute path this file was (last/first) written to.


  • host, mpi_size, root --> [host], [mpi_size], [root]. Should indicate the last known value
  • bsb_version --> bsb_core_version (please note that all bsb plugins and packages listed in packages are also included in the plugin manifest)
  • will remove the modified_at timestamps
  • the schema versions refer to exactly this schema; by including a schema version we can evolve it over time and provide tools to read/write/migrate older/newer schemas as well
  • let's introduce a bsb_recorder_kind for level 7 so that we can define different required attributes.

Helveg and others added 7 commits May 29, 2026 14:37
…et kind

Split the recorder convention into a baseline every recorder shares
(bsb_device_name/kind, bsb_target_kind, bsb_simulation_id/segment_id) and a
target-kind layer selected by bsb_target_kind ("cell", "compartment",
"synapse", "lfp", ...). Per-kind fields are now first-class flat bsb_*
annotations (e.g. bsb_section, bsb_arc, bsb_synapse_type) instead of a nested
bsb_location dict. Built-in recorders emit cell (NEST/Arbor spikes, multimeter),
compartment (NEURON voltage/current clamp) and synapse (NEURON synapse) kinds.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…pse recordings

Replace the NEURON-flavoured section/segment location fields with the BSB-native
morphology address: bsb_branch, bsb_point, bsb_arc, taken straight from the
recorder's location accessor (loc.location -> (branch, point), loc.arc()). Reserve
a proposed bsb_coordinates {x, y, z, r} dict for the resolved point position, which
is also the per-segment geometry an LFP probe consumes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extend SimulationRecorder with a device_name attribute and a meta(property)
method, both queryable at runtime before anything is written to file. Every
built-in recorder now passes device=self to create_recorder, and a recorder can
carry metadata (e.g. an LFP source geometry). This lets a controller find the
recorders of the devices it manages and inspect their metadata during a flush,
the missing piece for LFP-style probes (see #50).

Also migrate sinusoidal_poisson_generator to the bsb_* convention and tag both
Poisson generators with the "stimulus" target kind so the baseline holds.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…rding_kind, drop modified_at)

Apply the agreed thread decisions on #236:
- rename the root-metadata `bsb_version` -> `bsb_core_version` (explicit; packages
  and plugins are already in the plugin manifest)
- drop the `modified_at` timestamps from the root bundle and from per-PlacementSet
  / per-ConnectivitySet attrs; `state_id` / `revision` already signal change and
  `created_at` is kept
- rename the recorder discriminator `bsb_target_kind` -> `bsb_recording_kind`
  (annotates a recording; avoids confusion with `bsb_device_kind`)
- document `host` / `mpi_size` (and the result file's `scaffold.root`) as optional,
  diagnostic, best-effort last-known values, since reconstructions can be partial

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-225

# Conflicts:
#	packages/bsb-core/bsb/__init__.py
#	packages/bsb-core/bsb/storage/fs/file_store.py
- cache the plugin manifest so repeated engine create() stays within the
  storage interface test timeout
- emit one spiketrain per targeted cell in the NEST and Arbor spike
  recorders so population size stays recoverable, and update the
  simulation tests to the bsb_* annotation convention
- resolve ruff SIM105/I001/E501 findings surfaced by the merge
- reference neo classes via their neo.core.* targets so the bsb-core
  docs build clean under -nW

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The classmap entry is stored on the dynamic root, not on the leaf class,
so reading self.__class__.classmap_entry crashed the poisson and
sinusoidal generators. Reverse-look it up in _device_kind and add a
stimulus_train helper so both generators share the baseline annotation
path instead of building the SpikeTrain by hand.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@codecov

codecov Bot commented May 29, 2026

Codecov Report

❌ Patch coverage is 73.57724% with 130 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (v8@4274292). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| packages/bsb-core/bsb/storage/fs/__init__.py | 51.72% | 24 Missing and 4 partials ⚠️ |
| packages/bsb-hdf5/bsb_hdf5/__init__.py | 78.57% | 10 Missing and 5 partials ⚠️ |
| packages/bsb-hdf5/bsb_hdf5/placement_set.py | 64.10% | 13 Missing and 1 partial ⚠️ |
| packages/bsb-core/bsb/simulation/results.py | 85.71% | 10 Missing and 3 partials ⚠️ |
| ...ges/bsb-neuron/bsb_neuron/devices/current_clamp.py | 7.14% | 13 Missing ⚠️ |
| .../bsb-neuron/bsb_neuron/devices/synapse_recorder.py | 7.14% | 13 Missing ⚠️ |
| packages/bsb-core/bsb/storage/provenance.py | 80.35% | 11 Missing ⚠️ |
| .../bsb-neuron/bsb_neuron/devices/voltage_recorder.py | 8.33% | 11 Missing ⚠️ |
| packages/bsb-hdf5/bsb_hdf5/connectivity_set.py | 79.16% | 4 Missing and 1 partial ⚠️ |
| packages/bsb-core/bsb/storage/fs/file_store.py | 77.77% | 2 Missing ⚠️ |

... and 3 more
Additional details and impacted files
@@          Coverage Diff          @@
##             v8     #236   +/-   ##
=====================================
  Coverage      ?   84.00%           
=====================================
  Files         ?      132           
  Lines         ?    14332           
  Branches      ?     1677           
=====================================
  Hits          ?    12039           
  Misses        ?     1890           
  Partials      ?      403           


A device already exposes the classmap entry it was configured under on its
dynamic attribute (`self.device`), so read that directly instead of
reverse-looking-up the dynamic root's classmap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@Helveg
Contributor Author

Helveg commented May 29, 2026

@drodarie ready for review and for another round of feedback; especially L7 and L8 have been changed

Helveg and others added 4 commits May 29, 2026 20:52
… writer

Drop the separate _atomic_write_json and route the provenance bundle
through _atomic_write_bytes (staged outside the discovery dir + os.replace)
so the engine keeps a single, reviewed race-safe write path instead of a
parallel implementation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ests

The provenance tests opened storage in a per-rank tempfile.TemporaryDirectory
while the engine broadcasts rank-0's root, so whichever rank left its `with`
block first removed the shared directory out from under the others, flaking
`test_scaffold_exposes_storage_id_state_id_provenance` under mpiexec with an
empty provenance bundle. Route the parallel tests through RandomStorageFixture,
which derives an MPI-safe root and cleans up collectively in tearDownClass; the
single-rank @skip_parallel tests keep their own tempdir.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
nx may write project.json with // or /* */ comments, which json.loads
rejects, breaking the monorepo docs conf.py that reads doc dependencies
(surfacing as a failed bsb-otel Read the Docs build). Parse it as JSONC.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
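One way to parse such a file, sketched with the standard library only: strip `//` and `/* */` comments outside of string literals, then hand the result to `json.loads`. The function names are invented for this example and the PR may use a different approach or library; trailing commas are not handled here.

```python
import json


def strip_jsonc_comments(text: str) -> str:
    """Remove // and /* */ comments, leaving string literals untouched."""
    out = []
    i, n = 0, len(text)
    in_string = False
    while i < n:
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                # Copy the escaped character so \" does not end the string.
                out.append(text[i + 1])
                i += 2
                continue
            if ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Line comment: skip up to (not including) the newline.
            while i < n and text[i] != "\n":
                i += 1
        elif text.startswith("/*", i):
            # Block comment: skip past the closing marker, or to the end.
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def loads_jsonc(text: str):
    """Parse JSON that may contain comments."""
    return json.loads(strip_jsonc_comments(text))


sample = '{\n  // line comment\n  "url": "http://x/y", /* block */ "n": 1\n}'
parsed = loads_jsonc(sample)
```

Tracking string state matters because values such as URLs contain `//` that must not be treated as a comment.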
@drodarie
Contributor

I believe stimulus and recorder devices should be stored the same way in level 8 since a stimulus is basically the inverse of a recorder. We could maybe add or reuse a flag in level 7 to indicate if the device is recording or stimulating?

@drodarie
Contributor

drodarie commented Jun 3, 2026

Another important point: we should provide a utility script to help users update their current reconstruction and simulation files to the new format. Otherwise, these files would not be updated.
