diff --git a/.claude/DESIGN.md b/.claude/DESIGN.md index a059390..7624c52 100644 --- a/.claude/DESIGN.md +++ b/.claude/DESIGN.md @@ -1,6 +1,7 @@ # ngio-collections-py — Preliminary Design -**Status:** draft · 2026-06-11 +**Status:** draft · 2026-06-11 · **partly superseded by the functional rewrite +(see banner)** **Context:** Greenfield successor to the `fractal-collections-tools` (/Users/locerr/Projects/Fractal/fractal-v3-prototyping/fractal-collections-tools) prototype (an implementation of the OME-NGFF RFC-8 *Collections* draft). This document @@ -9,6 +10,43 @@ this package starts from. --- +## Implementation note — functional rewrite (2026-06-19) + +The shipped package took a **functional / immutable** direction that supersedes +several specifics below. The **rationale still holds** (RFC-8 round-trip +fidelity, lazy resolution, document-granular saves, async-native core, +URL-addressed stores, mixed-store as the eventual goal, graceful degradation of +unknown types/attributes). What changed in the implementation: + +- **Single frozen node layer, not a Stored/Resolved split (§11).** Nodes are + frozen Pydantic values (`Node` / `RefNode` and the `Collection*`/`Multiscale*` + subtypes); editing returns a NEW tree and never mutates the source. There is + no `StoredNode`/`ResolvedNode` pair and no `models/nodes.py` / `resolved.py`. +- **Provenance is `PrivateAttr` on the node** (`_document`, and `_origin` on a + collapsed boundary — an `Origin`/`NodeMetaInfos` snapshot), carried only via + `model_copy`. The §5 merge rule lives in one place — `merge` / `split` in + `models._base` — and `split` inverts the merge by origin (§9.4). +- **No registry.** Node type is chosen by the `type` discriminator with a + graceful fallback to generic `Node`/`RefNode` (`build_node` / `build_ref_node` + / `build_any_node`). `NodeRegistry` / `DEFAULT_REGISTRY` / validation-context + registration (§2.6, §3.4) are not implemented. 
+- **No typed `attrs` view and no attribute model classes** (§3.5, §7). + `attributes` stays a raw `dict[str, JsonValue]` for round-trip fidelity; + `PlateAttribute` / `LabelObj` / `SinglescaleNode` etc. do not exist. +- **No sync facade / `ngio_collections.api`** (§5). The Resolver is async; use + `asyncio.run(...)`. +- **Resolver surface is `inline` / `create` / `save` / `delete_subtree`** (not + `open` / `children` / `resolve_tree` / `save_tree`). `MetadataDocument` is a + Protocol over one file (`content` / `store` / `url` + `deserialize_payload` / + `serialize_payload`), not a root-bearing object with `form`/`version`/`stub_path`. + +The authoritative module map is **§8 (Module layout)** and the architecture +overview is **§4**, both updated to the rewrite. Sections §2–§3, §5 (API +sketch / sync API), §7, and §11 are kept as historical design narrative; read +them through this banner. + +--- + ## 1. Goals - A faithful, round-trip-safe implementation of RFC-8 collection metadata: @@ -21,8 +59,11 @@ this package starts from. filesystem (writable), referenced from one collection tree. - Extensible by third-party packages: new node types can be registered without forking. -- Graceful degradation: unknown node types, unknown attributes, and - custom-prefixed fields survive a read–modify–write cycle untouched. +- Graceful degradation: unknown node types degrade to a generic node, and + unknown / custom-prefixed *attributes* survive a read–modify–write cycle + untouched. Unknown *node-level* keys are rejected (`extra="forbid"`): a + node's structural fields are a closed set, so arbitrary metadata must live + in the open `attributes` dict. ### Current scope (revised 2026-06-11: simplicity over completeness) @@ -65,7 +106,10 @@ These were deliberate in the prototype and remain in force: 6. **Registry fallback to a generic node.** An unregistered `type` parses as an opaque `BaseNode` rather than failing, per the RFC's graceful-degradation rules. -7. 
**`extra="allow"` everywhere** so unknown/custom-prefixed fields round-trip. +7. **`extra="allow"` for non-node OME objects** (`BaseObj`: paths, references) + so unknown/custom keys round-trip; **`extra="forbid"` for nodes** (`NodeObj`) + so node-level structural fields stay a closed set — arbitrary data goes in + `attributes`. --- @@ -200,26 +244,29 @@ a stub using that document's `stub_path`. ``` ┌──────────────────────────────────────────────────────────┐ -│ models/ pure Pydantic: BaseNode, node types, │ -│ attributes, coordinates. No IO, no URLs. │ +│ models/_base.py pure Pydantic: frozen Node / RefNode │ +│ (+ Collection/Multiscale subtypes), │ +│ PathObj, the §5 merge/split rule, and │ +│ the functional edit engine. No IO. │ ├──────────────────────────────────────────────────────────┤ -│ document MetadataDocument: provenance + pure │ -│ (de)serialize of ONE metadata file │ -│ (json or zarr form). │ +│ _document.py MetadataDocument Protocol + Json/Zarr │ +│ impls: pure (de)serialize of ONE │ +│ metadata file's `ome` payload. │ ├──────────────────────────────────────────────────────────┤ -│ resolver async open / resolve / children / │ -│ resolve_tree / save / write. │ -│ URL-keyed MetadataDocument cache. │ -│ The only caller of the Store. │ +│ _resolver.py async Resolver: inline / create / │ +│ save / delete_subtree. URL-keyed │ +│ document cache. The only Store caller. │ ├──────────────────────────────────────────────────────────┤ -│ store/ ReadableStore / WritableStore protocols, │ -│ fsspec-backed default, zero-dep LocalStore.│ +│ store/ ReadableStore / WritableStore protocols│ +│ (_protocols), zero-dep LocalStore │ +│ (_local), FsspecStore skeleton (_fsspec)│ └──────────────────────────────────────────────────────────┘ - sync.py — thin synchronous facade over resolver ``` Dependency rule: each layer imports only downward. Models never import the -document layer; the document layer never imports the store. 
+document layer; the document layer never imports the store. Editing is +functional — every edit returns a new frozen tree; the parsed source is never +mutated. --- @@ -281,9 +328,11 @@ multiscale that lives on a read-only store). `inline()` is where the merge is materialized: when a stub is collapsed into its resolved subtree, the collapsed node carries the target root's attributes overlaid by the stub's own — **shallow, key-level, stub wins** (the stub annotates the reference; -the nearer scope overrides) — and the stub's `id`/`name`. The rule lives in -one pure function, `models.merged_attributes(stub, target_root)`, the single -home of the §5 merge. +the nearer scope overrides) — and the stub's `id`/`name`. The rule has a single +home in `models._base`: `merged_attributes(stub, target)` computes the overlay, +`merge(stub, target)` materializes the collapsed boundary node (recording an +`Origin` so the merge is invertible), and `split(node)` inverts it by origin on +write-back (§9.4). `inline()` is copy-building end to end: the input tree, the cached documents, and the resolver cache are never touched, and the result is a @@ -417,11 +466,16 @@ absolutely and local derived data relatively. ## 7. Models layer (mostly unchanged from the prototype) -- `BaseObj`: camelCase aliasing, `populate_by_name`, `extra="allow"`. -- `BaseNode`: `type`, `id` (pattern-validated, required), `name` - (`str | None`, optional), `path: ZarrPath | JsonPath | None`, raw `attributes` dict, - `attrs` typed view (§3.5). **No `version` field** — that lives on - `MetadataDocument`. +- `BaseObj`: camelCase aliasing, `populate_by_name`, `extra="allow"` — for + non-node OME objects (paths, references). +- `NodeObj`: same config but `extra="forbid"` — the base of the node hierarchy + (and of consumer field-mixins for custom node types), so node-level keys are + a closed set. 
+- `BaseNode` (subclasses `NodeObj`): `type` (required `str` — every node carries + one), `id` (pattern-validated, required), `name` (`str | None`, optional), raw + `attributes` dict, `attrs` typed view (§3.5); `nodes` / `path` come from the + concrete hierarchies (embedded vs reference). **No `version` field** — that + lives on `MetadataDocument`. - Built-in node types with their structural validators: - `CollectionNode` — exactly one of `nodes`/`path`. - `MultiscaleNode` — exactly one of `nodes`/`path`; full (inlined) form @@ -450,20 +504,29 @@ absolutely and local derived data relatively. ``` src/ngio_collections/ + __init__.py # the public surface (19 names): Resolver, stores + + # protocols, node/path model types + _document.py # MetadataDocument Protocol + Json/Zarr impls + _resolver.py # async Resolver (inline / create / save / delete_subtree) models/ - base.py # BaseObj, BaseNode, IdStr, Path objects, attrs view - nodes.py # CollectionNode, MultiscaleNode, SinglescaleNode - attributes.py # plate / well / acquisition / labels - coordinates.py # CoordinateSystem, CoordinateTransformation, scene - registry.py # NodeRegistry (no singletons) - document.py # MetadataDocument, parse_metadata_document, single serialize path - resolver.py # async Resolver + __init__.py # re-exports the model public subset + _base.py # BaseObj; frozen Node / RefNode (+ Collection/Multiscale + # subtypes); ZarrPath / JsonPath / PathObj; NodeState; + # the §5 merge/split rule; build_* constructors; + # the functional edit engine (update/add/remove/…) store/ - protocols.py # ReadableStore, WritableStore, StoreReadOnlyError - local.py # LocalStore (zero-dep) - fsspec.py # FsspecStore skeleton (optional dependency) + __init__.py # re-exports the store public subset + _protocols.py # ReadableStore, WritableStore, StoreReadOnlyError + _local.py # LocalStore (zero-dep) + _fsspec.py # FsspecStore skeleton (optional dependency) ``` +Every module under `models/` and `store/` is private 
(`_*.py`); the public +names are re-exported from each subpackage's `__init__` and from the top-level +`ngio_collections`. The merge engine, node constructors, provenance dataclasses +(`Origin` / `NodeMetaInfos`), and the document layer are intentionally NOT part +of the public surface. + --- ## 9. Open spec questions (RFC-8) @@ -488,7 +551,12 @@ Tracked here because the implementation takes a position on each: be able to override metadata on read-only targets; the merged view's `id`/`name` are likewise the stub's. Worth an RFC clarification, including whether a stub may satisfy an attribute MUST (e.g. - `coordinateSystems`) on the parent side. + `coordinateSystems`) on the parent side. **Write-back position (§11):** the + merge is invertible — *by origin, edge keeps overrides*. An edited key that + originated on the stub is written back to the parent edge (the target keeps + its original, shadowed value); every other current key — including + brand-new ones — is written to the home (target) document; a removed key + drops from both layers. --- @@ -528,3 +596,58 @@ future-work section): `gather` when frontier sizes get large). - Optional dirty tracking on top of document-granular saves. - A typed RFC-5 transformation union once that spec settles. + +--- + +## 11. Stored/Resolved node split (2026-06-18) + +The headline use case — open an inlined collection, edit it in memory, write it +back keeping the file structure and attributes correct — was blocked by +`BaseNode` wearing three hats: the on-disk wire model, the parsed/provenance +node, and (post-`inline`) the merged editing surface. The merge was lossy (a +key present on both stub and target lost the target's value) and the inlined +tree was one synthetic document, so saving it flattened the whole collection +into one file. The fix splits the node into two layers. 
+ +- **`StoredNode`** (`models/base.py`, `models/nodes.py`) — the faithful + Pydantic mirror of one document's node (`extra="allow"`, structural + validators, `path`/ref forms, `_document`/`_parent` provenance). The + pre-split names (`BaseNode`, `CollectionNode`, …) stay as back-compat + aliases. Each stored type gets a `resolved_form` ClassVar (mirroring + `ref_form`); `None` ⇒ the generic fallback. +- **`ResolvedNode`** (`models/resolved.py`) — produced ONLY by `inline()`: a + plain (non-Pydantic) mutable working model holding private references back + into the stored layer (`_home` document, `_stored` node, `_edge` → + `EdgeRef`), with the ergonomic edit API (`attrs`, `add`, `pop`, `walk`, + `find`, `target_path`). Typed twins exist for the built-ins; custom types + fall back to the generic `ResolvedNode` (or opt into a twin via + `resolved_form`). No on-mutation validation — invariants re-apply once, at + `to_stored_root()`. + +Resolution vocabulary, made consistent: `inline()` (verb) → `ResolvedNode` +(fully-resolved result); `resolve()` / `resolve_tree()` are the lazy partial +steps that leave stubs in place (§3.3). So `inline()` reframes as +**StoredNode-tree → ResolvedNode-tree**, and write-back as +**ResolvedNode-tree → StoredNode-documents**. + +**`Resolver.save_tree(root)`** is the inverse of `inline()`: it partitions the +resolved tree by home document (each boundary node — `_edge` set — roots its +own document and is re-emitted as a path stub in its parent), rebuilds each +document via `to_stored_root` (attributes un-merged by origin per §9.4; added +nodes embedded in their parent's document; unknown `extra` keys carried through +from the cached original `StoredNode` by `model_copy`), and saves only the +documents whose serialized payload changed. A tree saved with no edits writes +nothing. 
**`Resolver.delete_subtree(node)`** (with a new `WritableStore.delete`) +is the destructive companion to `pop()`'s in-memory unlink: deletes the +external file(s) of the boundary nodes in a subtree (call before popping). + +Sync API: `open_collection` / `open_multiscale` now return the `ResolvedNode` +root; `write_collection_back` / `write_multiscale_back` wrap `save_tree`. The +compose-by-reference writers (`write_collection` / `write_multiscale`) keep +taking `StoredNode`s — the document-granular `save()` editing path is +unchanged. + +Partly retires §10's `write()` item: bottom-up composition (writers) and +write-back of an opened tree (`save_tree`) are now covered; restructuring by +*externalizing* an added node into its own new document stays deferred (added +nodes embed in their parent's document). diff --git a/.claude/ROADMAP.md b/.claude/ROADMAP.md index 6714dca..aa8d796 100644 --- a/.claude/ROADMAP.md +++ b/.claude/ROADMAP.md @@ -1,14 +1,26 @@ # Roadmap **Status:** revised 2026-06-11 (simplicity over completeness) · companion to -[DESIGN.md](DESIGN.md) +[DESIGN.md](DESIGN.md) · **partly superseded by the functional rewrite (see banner)** + +> **Implementation note (2026-06-19).** The local read+write story is +> implemented and green, but via the **functional / immutable rewrite** — see +> DESIGN.md's "Implementation note" banner. The milestones below are kept as a +> historical record; their *specifics* that no longer apply: the registry +> (M1: `NodeRegistry` / `DEFAULT_REGISTRY`), the structural-validator / +> attribute-class / `SinglescaleNode` model (M1, in `models/nodes.py`), the +> `Stored`/`Resolved` split (M5, `models/resolved.py`), and the sync +> `ngio_collections.api` facade (M3/M5). The real surface is a single frozen +> `Node`/`RefNode` model in `models/_base.py` and an async `Resolver` +> (`inline` / `create` / `save` / `delete_subtree`); the §5 merge rule lives in +> `merge` / `split`. Use `asyncio.run(...)` — there is no sync facade. 
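
The merge rule named in the note above (DESIGN.md §5: shallow, key-level, stub wins) and its inversion on write-back (§9.4: by origin, edge keeps overrides) can be sketched over plain dicts. This is an illustrative sketch only: `overlay` and `split_by_origin` are hypothetical names, not the package's `merge` / `split`, and it assumes attributes are flat `dict`s with no `Origin` bookkeeping.

```python
from __future__ import annotations

from typing import Any

Attrs = dict[str, Any]


def overlay(stub: Attrs, target: Attrs) -> Attrs:
    """Shallow, key-level merge: on a key collision the stub's value wins."""
    return {**target, **stub}


def split_by_origin(current: Attrs, stub: Attrs, target: Attrs) -> tuple[Attrs, Attrs]:
    """Invert `overlay` after edits; returns `(new_stub, new_target)`.

    Hypothetical helper following the §9.4 write-back position:
    - a key that originated on the stub goes back to the stub (the edge),
      and the target keeps its original, shadowed value if it had one;
    - every other current key, including brand-new ones, goes to the target;
    - a key removed from `current` drops from both layers.
    """
    new_stub: Attrs = {}
    new_target: Attrs = {}
    for key, value in current.items():
        if key in stub:
            new_stub[key] = value
            if key in target:
                new_target[key] = target[key]  # preserve the shadowed value
        else:
            new_target[key] = value
    return new_stub, new_target


# Build a merged view; neither input dict is mutated.
stub = {"color": "red", "label": "A"}
target = {"color": "blue", "axes": "zyx"}
merged = overlay(stub, target)

# Edit functionally: override one key, add one, drop one.
edited = {k: v for k, v in {**merged, "color": "green", "new": 1}.items() if k != "axes"}
new_stub, new_target = split_by_origin(edited, stub, target)
print(new_stub)
print(new_target)
```

With no edits, `split_by_origin(merged, stub, target)` returns the original `stub` and `target` unchanged, which corresponds to the "a tree saved with no edits writes nothing" property described in DESIGN.md.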
Scope of this roadmap: a complete, round-trip-safe **local** implementation — parse, validate, navigate, edit, save on the local filesystem. Remote and mixed-store support remain the primary eventual use case (the core stays async-native for that reason), but their implementation is deferred: `FsspecStore` exists only as an interface skeleton; `RouterStore` is -design-only (DESIGN.md �6), with no code yet. Everything +design-only (DESIGN.md §6), with no code yet. Everything deferred is recorded under [Future work](#future-work) below and in DESIGN.md §10. @@ -20,8 +32,9 @@ Sequencing principles: the roadmap ends when the local write path round-trips. - CI from milestone 1, so every subsequent milestone lands gated. -Current state: structural skeleton (modules, signatures, and trivial pieces -in place; behavior stubbed), trimmed to the local-scope surface. +Current state: the full local story is implemented and tested green +(parse → inline → edit → write-back, document-granular) via the functional +rewrite; remote/mixed-store remain deferred (see Future work). --- @@ -114,7 +127,33 @@ Document-granular editing — the core value proposition. **Done when:** editing one node's attributes and saving touches exactly one file on disk, and a re-opened tree reflects the edit with everything else -byte-identical. That is also the end of this roadmap. +byte-identical. + +## M5 — Round-trip via the Stored/Resolved split (added 2026-06-18) + +The headline use case: open an inlined collection, edit it in memory, write it +back keeping the file structure and attributes correct. See DESIGN.md §11. + +- [x] Split `BaseNode` into `StoredNode` (wire/parse/serialize, `BaseNode` + alias kept) and a new `ResolvedNode` layer (`models/resolved.py`): plain + mutable working model with the `attrs` / `add` / `pop` edit API, typed twins + for the built-ins + generic fallback (opt-in via the `resolved_form` + ClassVar). 
+- [x] `Resolver.inline()` rebuilt as StoredNode-tree → ResolvedNode-tree, each + node retaining `_home` / `_stored` / `_edge` provenance. +- [x] `Resolver.save_tree(root)`: ResolvedNode-tree → StoredNode-documents — + attributes un-merged by origin (edge keeps overrides, DESIGN.md §9.4), added + nodes embedded in their parent's document, only changed documents rewritten, + unknown keys preserved. +- [x] `WritableStore.delete` + `Resolver.delete_subtree` (destructive companion + to `pop()`'s in-memory unlink). +- [x] Sync `write_collection_back` / `write_multiscale_back`; `open_*` return + the resolved tree. + +**Done when:** open → edit (attrs add/remove, add node, pop node) → write-back +lands each change in the right document, leaves untouched files byte-identical, +and a no-op write-back touches nothing (`tests/test_resolved_roundtrip.py`, +`examples/08_resolved_edit.py`). --- @@ -135,7 +174,10 @@ re-evaluate once M4 is done. See also DESIGN.md §10. - **`Resolver.write(node, url, stub_path=...)`** — externalizing a node into a new document (collection restructuring). Bottom-up *composition* by reference is covered 2026-06-12 by the sync writers returning reference - stubs (with `relativize` path rewriting); restructuring stays deferred. + stubs (with `relativize` path rewriting), and write-back of an opened tree + by M5's `save_tree` (2026-06-18); what remains deferred is externalizing an + *added* node into its own new document (added nodes embed in their parent's + document). - **Attribute-registry extensibility** — removed as dead code (the `attrs` view takes attribute classes directly); re-add only if a use case appears. 
- **Conformance suite** against the RFC-8 examples; revisit the open spec diff --git a/.gitignore b/.gitignore index ec173d0..9a48134 100644 --- a/.gitignore +++ b/.gitignore @@ -14,6 +14,9 @@ build/ .ruff_cache/ .mypy_cache/ +# benchmark dataset cache +benchmarks/.data/ + # editors / OS .DS_Store .idea/ diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..d783e36 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,20 @@ +# ngio-collections + +Python library for RFC-8 OME-Zarr collection metadata. Async-first; local-only +for now (`LocalStore`); `FsspecStore` is a skeleton. + +## Commands + +```bash +pixi run lint # ruff (includes Google-style docstring rules) +pixi run type-check # ty +pixi run --environment dev pytest +``` + +## Rules + +- **Immutability:** nodes are frozen Pydantic models. Every edit goes through + `model_copy` and returns a NEW tree — never mutate in place. +- **No IO in `models/`:** `models/_base.py` is pure Pydantic. The only IO + surface is the `store/` layer. +- **Docstrings:** Google-style. Use single backticks for code spans (Markdown). diff --git a/benchmarks/README.md b/benchmarks/README.md new file mode 100644 index 0000000..9a44a20 --- /dev/null +++ b/benchmarks/README.md @@ -0,0 +1,104 @@ +# benchmarks + +Performance benchmarks for ngio-collections. Builds a realistic RFC-8 HCS +collection at a configurable scale (up to ~1M nodes), caches it on disk, runs +the core operations, and prints average run time + peak memory per operation. +Standard-library only (no extra dependencies). 
+ +## Run + +```bash +# first run: generate + cache the dataset, then run all ops +pixi run --environment dev python -m benchmarks + +# run again: the dataset is reused from cache (no build/write), much faster +pixi run --environment dev python -m benchmarks + +# the real target: ~1M nodes (lower repeats — read/write are heavy) +pixi run --environment dev python -m benchmarks --target 1000000 --shard none --repeats 3 + +# run only some operations +pixi run --environment dev python -m benchmarks --ops read,walk +pixi run --environment dev python -m benchmarks --ops write + +# force regeneration of the cached dataset +pixi run --environment dev python -m benchmarks --rebuild +``` + +## Dataset + +``` +root -> 20 plates -> 240 wells/plate (4800 wells) -> n scenes/well +each scene: 3 multiscale images + 5 multiscale labels + 10 tables +total nodes = 4821 + 4800 * n * 19 +``` + +`--target` (default 90,000) is the only scale knob: the dataset uses the +smallest whole number of scenes-per-well that reaches at least `target` nodes. +Because each scenes-per-well step adds 91,200 nodes (4800 wells × 19), the +realized count rounds up to that granularity — e.g. `--target 100000` ⇒ 187,221 +nodes, `--target 1000000` ⇒ 1,008,021. + +### Local cache + +Datasets are expensive to build, so each is generated once into +`benchmarks/.data/-n/` (gitignored) and reused on later runs — +keyed by sharding + scale. A `.benchmark.json` marker records the entry document +and node count. Use `--rebuild` to regenerate, or `--data-root PATH` to store +the cache elsewhere (e.g. a faster disk). The cache key covers only +`(shard, scenes-per-well)`; if you change the builders in `dataset.py`, pass +`--rebuild`. 
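
The rounding described under Dataset can be checked by hand. The sketch below restates the layout constants from that section and mirrors the arithmetic of `scenes_for_target` / `estimate_nodes` in `benchmarks/dataset.py`; it is a standalone restatement for illustration and does not import the benchmark package.

```python
# Layout constants, restated from the Dataset section.
PLATES = 20
WELLS_PER_PLATE = 240
TOTAL_WELLS = PLATES * WELLS_PER_PLATE  # 4800
NODES_PER_SCENE = 1 + 3 + 5 + 10  # scene node + images + labels + tables = 19
FIXED_NODES = 1 + PLATES + TOTAL_WELLS  # root + plates + wells = 4821
NODES_PER_STEP = TOTAL_WELLS * NODES_PER_SCENE  # 91,200 per scenes-per-well step


def scenes_for_target(target: int) -> int:
    """Smallest scenes-per-well whose dataset has at least `target` nodes."""
    # Integer ceiling division; never fewer than one scene per well.
    return max(1, -(-(target - FIXED_NODES) // NODES_PER_STEP))


def estimate_nodes(scenes_per_well: int) -> int:
    """Total node count for the given scenes-per-well."""
    return FIXED_NODES + NODES_PER_STEP * scenes_per_well


for target in (90_000, 100_000, 1_000_000):
    n = scenes_for_target(target)
    print(f"target={target:,} scenes_per_well={n} nodes={estimate_nodes(n):,}")
```

A target of 100,000 needs 2 scenes per well (187,221 nodes) and 1,000,000 needs 11 (1,008,021), matching the figures quoted above; a target of 90,000 fits in a single scene per well (96,021 nodes).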
+ +## Operations (in order of importance) + +| operation | what it measures | +| ---------------- | ---------------------------------------------------------- | +| `read (inlined)` | `open_inlined` over the on-disk layout, cold cache | +| `walk` | full depth-first traversal of the in-memory tree | +| `find` | `find(id)` for a random existing id (an index lookup, O(1))| +| `edit` | `set_attrs` on a random node (new tree; copy-on-write map) | +| `write` | `save_inlined` snapshot of a tree to one document | + +`--ops` selects any subset (comma-separated, or `all`). `walk` / `find` / `edit` +share one in-memory `open_inlined` view; `read` and `write` are independent: +`read` builds and discards a view each run, and `write` owns a *separate* view +and writes to its own scratch dir (removed afterwards) — it never touches the +cached dataset or the read path. Running the walk-group and `write` together +holds two full trees in memory (~2× at 1M); isolate them with `--ops` for the +largest runs. + +`write` uses `save_inlined` to snapshot the resolved (inlined) view to one +self-contained document, repeatably, into a scratch dir. + +## Results & backend (v5) + +At ~1.0M nodes (in-memory, stdlib `PersistentMap`): build ~5.9 s, walk ~127 ms, +`find` ~90 ns, a single `set_attrs` ~9.3 ms (one 1M-entry dict copy), and 200 +edits batched through one `mutate()` evolver ~38 µs each. The single-edit cost is +fine for interactive use and batch edits amortise via the evolver, so the +**stdlib backend is sufficient — no `pyrsistent` needed**. Bulk construction goes +through `graph.TreeBuilder` (one O(n) pass); per-node `add_child` is O(n) and only +for incremental edits. + +## Sharding (`--shard`) + +Controls the document boundary of the on-disk layout, which dominates the +`read (inlined)` cost (inlining resolves across every boundary): + +- `scene` (default): root/plate/well are documents referencing per-scene docs. 
+- `well`: root/plate documents referencing per-well docs (scenes inline). +- `plate`: root referencing per-plate docs (everything below inline). +- `none`: a single monolithic document. +- `leaf`: every multiscale image, label, and table is its own single-node + document (scene docs reference them). **Warning:** at ~1M nodes this writes + ~1M files — use it at a small `--target` for comparison, not routine 1M runs. + +## Modifying + +- **Add an operation:** write a zero-arg callable, add it to `ALL_OPS` and append + `measure("", fn, args.repeats)` to `results` in `benchmarks/__main__.py`. +- **Change the layout / attributes:** edit the builder in + `benchmarks/dataset.py` (`build_monolithic` / `_scene_children`), then + re-run with `--rebuild`. Attributes are intentionally light (a `{"role": ...}` + marker) so node *count*, not attribute validation, is what scales. +- **Change measurement:** `benchmarks/harness.py` (timing, memory, formatting). diff --git a/benchmarks/__init__.py b/benchmarks/__init__.py new file mode 100644 index 0000000..e775e4b --- /dev/null +++ b/benchmarks/__init__.py @@ -0,0 +1,9 @@ +"""Performance benchmarks for ngio-collections. + +Run as a module from the repo root:: + + pixi run --environment dev python -m benchmarks # fast default + pixi run --environment dev python -m benchmarks --target 1000000 # ~1M nodes + +See ``benchmarks/README.md`` for the full set of knobs. +""" diff --git a/benchmarks/__main__.py b/benchmarks/__main__.py new file mode 100644 index 0000000..448acb7 --- /dev/null +++ b/benchmarks/__main__.py @@ -0,0 +1,193 @@ +"""Benchmark runner (v5): load (or generate) a dataset, time ops, print a table. 
+ +Operations, in the spec's order of importance: + + read (inlined) open_inlined over the on-disk (sharded) layout, cold cache + walk traverse every node of the in-memory tree + find look up a random existing id (an index lookup) + edit set_attrs on a random node (returns a new tree) + write snapshot a tree to one document, into a scratch dir + +The dataset is generated once and cached under ``benchmarks/.data/`` (keyed by +shard + scale); later runs reuse it. ``walk`` / ``find`` / ``edit`` share one +in-memory ``open_inlined`` view; ``write`` owns a separate view + scratch dir. + +Examples:: + + python -m benchmarks # all ops, default scale + python -m benchmarks --target 1000000 --shard scene # ~1M nodes + python -m benchmarks --ops read,walk +""" + +from __future__ import annotations + +import argparse +import itertools +import random +import shutil +import tempfile +from pathlib import Path + +import ngio_collections as ngc +from benchmarks import dataset as ds +from benchmarks._stores import DelayStore +from benchmarks.harness import Result, format_table, measure + +ALL_OPS = ("read", "walk", "find", "edit", "write") + + +def _parse_ops(value: str) -> list[str]: + if value.strip().lower() == "all": + return list(ALL_OPS) + ops = [o.strip() for o in value.split(",") if o.strip()] + unknown = [o for o in ops if o not in ALL_OPS] + if unknown: + raise argparse.ArgumentTypeError( + f"unknown op(s): {', '.join(unknown)}; choose from {', '.join(ALL_OPS)}" + ) + return [o for o in ALL_OPS if o in ops] + + +def _parse_args() -> argparse.Namespace: + p = argparse.ArgumentParser(prog="benchmarks", description=__doc__) + p.add_argument( + "--target", type=int, default=90_000, help="approx node count (default: 90,000)" + ) + p.add_argument( + "--shard", + choices=["leaf", "scene", "well", "plate", "none"], + default="scene", + help="document boundary for the on-disk layout (default: scene)", + ) + p.add_argument( + "--ops", + type=_parse_ops, + default=list(ALL_OPS), + 
help=f"subset of {{{','.join(ALL_OPS)}}} or 'all' (default: all)", + ) + p.add_argument( + "--repeats", type=int, default=5, help="timed runs per op (default: 5)" + ) + p.add_argument( + "--rebuild", action="store_true", help="regenerate the cached dataset" + ) + p.add_argument( + "--data-root", type=str, default=None, help="override the dataset cache root" + ) + p.add_argument( + "--seed", type=int, default=0, help="RNG seed for find/edit id sampling" + ) + p.add_argument( + "--io-latency-ms", + type=float, + default=0.0, + help="inject per-read latency (ms) via a DelayStore (default: 0)", + ) + return p.parse_args() + + +def _consume(iterable: object) -> int: + count = 0 + for _ in iterable: # type: ignore[attr-defined] + count += 1 + return count + + +def main() -> None: + """Load/generate the dataset, run the selected benchmarks, print the table.""" + args = _parse_args() + + scenes_per_well = ds.scenes_for_target(args.target) + total_nodes = ds.estimate_nodes(scenes_per_well) + counts = ds.node_counts(scenes_per_well) + n_docs = ds.document_count(args.shard, scenes_per_well) + data_root = Path(args.data_root) if args.data_root else None + ops: list[str] = args.ops + + target_dir = ds.dataset_dir(args.shard, scenes_per_well, data_root) + entry_url, setup_results = ds.ensure_dataset( + args.shard, scenes_per_well, rebuild=args.rebuild, data_root=data_root + ) + reused = not setup_results + + print() + print("ngio-collections benchmark (v5)") + print(f" shard level : {args.shard}") + print(f" target : {args.target:,}") + print(f" total nodes : {total_nodes:,}") + print(f" files : {n_docs:,}") + print(f" scenes : {counts['scenes']:,}") + print(f" ops : {', '.join(ops)}") + print(f" repeats : {args.repeats}") + print(f" io latency : {args.io_latency_ms} ms") + print(f" dataset : {target_dir} ({'reused' if reused else 'generated'})") + print() + + def _read_store() -> ngc.LocalStore | DelayStore: + """A fresh store honoring --io-latency-ms (cold cache: new each 
call).""" + store = ngc.LocalStore() + return ( + DelayStore(store, args.io_latency_ms / 1000) + if args.io_latency_ms + else store + ) + + results: list[Result] = list(setup_results) + scratch: Path | None = None + try: + view = None + if any(op in ops for op in ("walk", "find", "edit")): + view = ngc.open_inlined(entry_url, _read_store()) + ids = [n.id for n in view.walk() if n.id is not None] + sample = random.Random(args.seed).sample(ids, k=min(len(ids), 1000)) + find_ids = itertools.cycle(sample) + edit_ids = itertools.cycle(sample) + + if "read" in ops: + results.append( + measure( + "read (inlined)", + lambda: ngc.open_inlined(entry_url, _read_store()), + args.repeats, + ) + ) + if "walk" in ops: + assert view is not None + results.append(measure("walk", lambda: _consume(view.walk()), args.repeats)) + if "find" in ops: + assert view is not None + results.append( + measure( + "find (random id)", lambda: view.find(next(find_ids)), args.repeats + ) + ) + if "edit" in ops: + assert view is not None + results.append( + measure( + "edit (set_attrs)", + lambda: view.find(next(edit_ids)).set_attrs({"bench": 1}), + args.repeats, + ) + ) + if "write" in ops: + write_view = ngc.open_inlined(entry_url, _read_store()) + scratch = Path(tempfile.mkdtemp(prefix="ngio-bench-write-")) + out_url = str(scratch / "snapshot.json") + results.append( + measure( + "write (snapshot)", + lambda: ngc.save_inlined(write_view, out_url, overwrite=True), + args.repeats, + ) + ) + + print(format_table(results)) + print() + finally: + if scratch is not None: + shutil.rmtree(scratch, ignore_errors=True) + + +if __name__ == "__main__": + main() diff --git a/benchmarks/_stores.py b/benchmarks/_stores.py new file mode 100644 index 0000000..5596714 --- /dev/null +++ b/benchmarks/_stores.py @@ -0,0 +1,43 @@ +"""Benchmark-only store wrappers. 
+ +`DelayStore` injects a fixed per-read latency so the payoff of concurrent +document I/O is measurable deterministically — independent of the OS page cache +(which makes warm local reads ~µs and hides the win). It models the real target: +high-latency / remote stores. With sequential reads the cost is ~``N * delay``; +with `Resolver(prefetch_concurrency=k)` it drops toward ~``ceil(N/k) * delay``. +""" + +from __future__ import annotations + +import asyncio + +from ngio_collections.io.store import ReadableStore, WritableStore + + +class DelayStore: + """Wrap a store, sleeping `delay_s` before each `get` (simulated read latency).""" + + def __init__(self, inner: ReadableStore, delay_s: float) -> None: + """Wrap `inner`, adding `delay_s` seconds of latency to every `get`.""" + self.inner = inner + self.delay_s = delay_s + + async def get(self, url: str) -> bytes: + """Sleep `delay_s`, then delegate the read to the wrapped store.""" + await asyncio.sleep(self.delay_s) + return await self.inner.get(url) + + async def put(self, url: str, data: bytes) -> None: + """Delegate the write to the wrapped store (no injected delay).""" + await _writable(self.inner).put(url, data) + + async def delete(self, url: str) -> None: + """Delegate the delete to the wrapped store (no injected delay).""" + await _writable(self.inner).delete(url) + + +def _writable(store: ReadableStore) -> WritableStore: + """Narrow `store` to a `WritableStore`, raising if it is read-only.""" + if not isinstance(store, WritableStore): + raise TypeError(f"{type(store).__name__} is not writable") + return store diff --git a/benchmarks/dataset.py b/benchmarks/dataset.py new file mode 100644 index 0000000..3a67ace --- /dev/null +++ b/benchmarks/dataset.py @@ -0,0 +1,296 @@ +"""Builds the benchmark dataset (v5) and writes it to disk at a chosen sharding. 
+ +The dataset mirrors a realistic RFC-8 HCS collection: + + root -> 20 plates -> 240 wells/plate -> n scenes/well + +and every scene holds 3 multiscale images, 5 multiscale labels and 10 tables +(18 child nodes). Node count scales linearly with ``scenes_per_well``:: + + total = 4821 + 4800 * scenes_per_well * 19 + +so ``scenes_per_well == 11`` yields ~1.0M nodes (see ``scenes_for_target``). + +``build_monolithic`` returns the full detached in-memory ``NodeTree`` (every node +inline), built in one O(n) pass with ``TreeBuilder``. ``write_sharded`` writes it +to disk, splitting it into one document per node at or above the chosen boundary +(``leaf`` / ``scene`` / ``well`` / ``plate``) or a single document (``none``); +parent documents reference their children with relativized ``path`` stubs. That is +the layout the ``open_inlined`` read benchmark resolves across. Datasets are cached +under ``benchmarks/.data/-n/`` and reused across runs. +""" + +from __future__ import annotations + +import asyncio +import json +import shutil +from dataclasses import replace +from pathlib import Path +from typing import Literal + +from benchmarks.harness import Result, measure_value +from ngio_collections.api._node import reference_path +from ngio_collections.graph import ( + ROOT, + NodeId, + NodeRecord, + NodeTree, + Reference, + TreeBuilder, +) +from ngio_collections.io.store import LocalStore, WritableStore +from ngio_collections.resolve import write_document + +# --- fixed layout (per the spec) -------------------------------------------- + +PLATES = 20 +WELLS_PER_PLATE = 240 +TOTAL_WELLS = PLATES * WELLS_PER_PLATE # 4800 +IMAGES = 3 +LABELS = 5 +TABLES = 10 +NODES_PER_SCENE = 1 + IMAGES + LABELS + TABLES # 19 (scene node + 18 children) +FIXED_NODES = 1 + PLATES + TOTAL_WELLS # 4821 (root + plates + wells) + +ShardLevel = Literal["leaf", "scene", "well", "plate", "none"] + +# Depth at (and above) which a node becomes its own document. 
root=0, plate=1, +# well=2, scene=3, leaf (multiscale/table)=4. +_BOUNDARY_DEPTH: dict[ShardLevel, int] = { + "none": 0, + "plate": 1, + "well": 2, + "scene": 3, + "leaf": 4, +} + + +def register_tables() -> None: + """No-op: tables are plain `bench:table` records (no typed handle needed).""" + + +# --- node-count helpers ------------------------------------------------------ + + +def estimate_nodes(scenes_per_well: int) -> int: + """Total node count for a dataset with ``scenes_per_well`` scenes per well.""" + return FIXED_NODES + TOTAL_WELLS * scenes_per_well * NODES_PER_SCENE + + +def scenes_for_target(target: int = 1_000_000) -> int: + """Smallest ``scenes_per_well`` whose dataset has at least ``target`` nodes.""" + per_n = TOTAL_WELLS * NODES_PER_SCENE + return max(1, -(-(target - FIXED_NODES) // per_n)) # ceil division + + +def node_counts(scenes_per_well: int) -> dict[str, int]: + """Per-kind node counts for the generated dataset.""" + scenes = TOTAL_WELLS * scenes_per_well + return { + "plates": PLATES, + "wells": TOTAL_WELLS, + "scenes": scenes, + "images": scenes * IMAGES, + "labels": scenes * LABELS, + "tables": scenes * TABLES, + } + + +def document_count(shard: ShardLevel, scenes_per_well: int) -> int: + """Number of documents (files) written for the given sharding.""" + scenes = TOTAL_WELLS * scenes_per_well + leaves = scenes * (IMAGES + LABELS + TABLES) + per_depth = (1, PLATES, TOTAL_WELLS, scenes, leaves) # depth 0..4 + return sum(per_depth[: _BOUNDARY_DEPTH[shard] + 1]) + + +# --- in-memory builder (detached, one O(n) pass) ---------------------------- + + +def _scene_children(tb: TreeBuilder, scene_key: NodeId, scene_id: str) -> None: + for i in range(IMAGES): + tb.add_child( + scene_key, + NodeRecord( + type="multiscale", + id=f"{scene_id}-img{i}", + name=f"img{i}", + attributes={"role": "image"}, + ), + ) + for i in range(LABELS): + tb.add_child( + scene_key, + NodeRecord( + type="multiscale", + id=f"{scene_id}-lbl{i}", + name=f"lbl{i}", + 
attributes={"role": "label"}, + ), + ) + for i in range(TABLES): + tb.add_child( + scene_key, + NodeRecord( + type="bench:table", + id=f"{scene_id}-tbl{i}", + name=f"tbl{i}", + attributes={"role": "table"}, + ), + ) + + +def build_monolithic(scenes_per_well: int) -> NodeTree: + """Build the full detached collection tree with every node inline (O(n)).""" + tb = TreeBuilder( + NodeRecord( + type="collection", + id="root", + name="root", + attributes={"role": "root"}, + children=(), + ) + ) + for p in range(PLATES): + pid = f"p{p}" + pk = tb.add_child( + ROOT, + NodeRecord( + type="collection", + id=pid, + name=pid, + attributes={"role": "plate"}, + children=(), + ), + ) + for w in range(WELLS_PER_PLATE): + wid = f"{pid}-w{w}" + wk = tb.add_child( + pk, + NodeRecord( + type="collection", + id=wid, + name=wid, + attributes={"role": "well"}, + children=(), + ), + ) + for s in range(scenes_per_well): + sid = f"{wid}-s{s}" + sk = tb.add_child( + wk, + NodeRecord( + type="collection", + id=sid, + name=sid, + attributes={"role": "scene"}, + children=(), + ), + ) + _scene_children(tb, sk, sid) + return tb.finish() + + +# --- on-disk sharding -------------------------------------------------------- + + +def _writable(store: object) -> WritableStore: + if not isinstance(store, WritableStore): + raise TypeError(f"{type(store).__name__} is not writable") + return store + + +async def _write_node( + tree: NodeTree, + node_id: NodeId, + depth: int, + boundary: int, + store: object, + parent_dir: Path, +) -> tuple[str, Reference]: + """Write `node_id`'s subtree to disk; return its URL and a reference to it.""" + rec = tree.record(node_id) + url = str(parent_dir / (rec.id or "node") / "collection.json") + children = tree.children_ids(node_id) + if depth >= boundary or not children: + await write_document(store, url, tree, root_id=node_id) + else: + # container document: children written as their own docs, referenced here + container = TreeBuilder(replace(rec, children=())) + for 
child in children: + child_url, child_ref = await _write_node( + tree, child, depth + 1, boundary, store, parent_dir / (rec.id or "node") + ) + crec = tree.record(child) + container.add_child( + ROOT, NodeRecord(type=crec.type, name=crec.name, ref=child_ref) + ) + await write_document(store, url, container.finish(), root_id=ROOT) + return url, Reference(path=reference_path(url), id=rec.id) + + +async def write_sharded( + tree: NodeTree, shard: ShardLevel, store: object, workdir: str | Path +) -> str: + """Write `tree` to `workdir` at the given sharding; return the entry URL.""" + _writable(store) + url, _ = await _write_node( + tree, ROOT, 0, _BOUNDARY_DEPTH[shard], store, Path(workdir) + ) + return url + + +# --- local dataset cache ----------------------------------------------------- + +DATA_ROOT = Path(__file__).parent / ".data" +_MARKER = ".benchmark.json" + + +def dataset_dir( + shard: ShardLevel, scenes_per_well: int, data_root: Path | None = None +) -> Path: + """Cache directory for a given (shard, scale) dataset.""" + return (data_root or DATA_ROOT) / f"{shard}-n{scenes_per_well}" + + +def ensure_dataset( + shard: ShardLevel, + scenes_per_well: int, + *, + rebuild: bool = False, + data_root: Path | None = None, +) -> tuple[str, list[Result]]: + """Return the entry-document URL for the dataset, generating it if needed.""" + target_dir = dataset_dir(shard, scenes_per_well, data_root) + marker = target_dir / _MARKER + + if rebuild and target_dir.exists(): + shutil.rmtree(target_dir) + if not rebuild and marker.exists(): + return json.loads(marker.read_text())["entry_url"], [] + + build_res, tree = measure_value( + "build (in-mem)", lambda: build_monolithic(scenes_per_well) + ) + + async def _write() -> str: + return await write_sharded(tree, shard, LocalStore(), target_dir) + + write_res, entry_url = measure_value( + "dataset write (disk)", lambda: asyncio.run(_write()) + ) + + target_dir.mkdir(parents=True, exist_ok=True) + marker.write_text( + json.dumps( + 
{ + "entry_url": entry_url, + "shard": shard, + "scenes_per_well": scenes_per_well, + "nodes": estimate_nodes(scenes_per_well), + }, + indent=2, + ) + ) + return entry_url, [build_res, write_res] diff --git a/benchmarks/harness.py b/benchmarks/harness.py new file mode 100644 index 0000000..f73b8b0 --- /dev/null +++ b/benchmarks/harness.py @@ -0,0 +1,112 @@ +"""Tiny measurement harness: average wall time + peak memory, no dependencies. + +Timing and memory are measured in separate passes on purpose: ``tracemalloc`` +roughly doubles allocation cost, so timing runs without it for clean numbers and +peak memory is captured in one extra, untimed run. All measurement uses the +standard library (``time.perf_counter`` + ``tracemalloc``). +""" + +from __future__ import annotations + +import time +import tracemalloc +from dataclasses import dataclass +from typing import Callable, TypeVar + +T = TypeVar("T") + + +@dataclass +class Result: + """One operation's measured cost.""" + + name: str + avg_s: float + min_s: float + repeats: int + peak_bytes: int + + +def _peak_of(fn: Callable[[], object]) -> int: + """Run ``fn`` once and return the peak traced memory (bytes) of that run.""" + tracemalloc.start() + try: + fn() + _, peak = tracemalloc.get_traced_memory() + finally: + tracemalloc.stop() + return peak + + +def measure(name: str, fn: Callable[[], object], repeats: int) -> Result: + """Time ``fn`` over ``repeats`` runs, then measure its peak memory once.""" + times: list[float] = [] + for _ in range(repeats): + start = time.perf_counter() + fn() + times.append(time.perf_counter() - start) + peak = _peak_of(fn) + return Result(name, sum(times) / len(times), min(times), repeats, peak) + + +def measure_value(name: str, fn: Callable[[], T]) -> tuple[Result, T]: + """Run ``fn`` once, capturing time and peak memory, returning its value too. + + Used for setup steps (building / writing the dataset) where the produced + value is needed and a second run would be wasteful. 
+ """ + tracemalloc.start() + start = time.perf_counter() + value = fn() + elapsed = time.perf_counter() - start + _, peak = tracemalloc.get_traced_memory() + tracemalloc.stop() + return Result(name, elapsed, elapsed, 1, peak), value + + +# --- formatting -------------------------------------------------------------- + + +def human_time(seconds: float) -> str: + """Format a duration with a sensible unit.""" + if seconds < 1e-3: + return f"{seconds * 1e6:.1f} us" + if seconds < 1.0: + return f"{seconds * 1e3:.2f} ms" + return f"{seconds:.3f} s" + + +def human_bytes(n: int) -> str: + """Format a byte count with a binary unit.""" + size = float(n) + for unit in ("B", "KiB", "MiB", "GiB"): + if size < 1024.0 or unit == "GiB": + return f"{size:.1f} {unit}" + size /= 1024.0 + return f"{size:.1f} GiB" + + +def format_table(results: list[Result]) -> str: + """Render results as an aligned text table.""" + headers = ("operation", "avg", "min", "peak mem", "runs") + rows = [ + ( + r.name, + human_time(r.avg_s), + human_time(r.min_s), + human_bytes(r.peak_bytes), + str(r.repeats), + ) + for r in results + ] + widths = [ + max(len(headers[i]), *(len(row[i]) for row in rows)) + for i in range(len(headers)) + ] + + def fmt(cells: tuple[str, ...]) -> str: + return " ".join(cell.ljust(widths[i]) for i, cell in enumerate(cells)) + + sep = " ".join("-" * w for w in widths) + lines = [fmt(headers), sep, *(fmt(row) for row in rows)] + return "\n".join(lines) diff --git a/docs/optional-pydantic-plan.md b/docs/optional-pydantic-plan.md new file mode 100644 index 0000000..912408a --- /dev/null +++ b/docs/optional-pydantic-plan.md @@ -0,0 +1,196 @@ +# Make pydantic an optional dependency + +## Context + +`ngio-collections` declares `pydantic>=2` as a hard runtime dependency. 
But the +v5 redesign (flat indexed immutable graph: `graph/` `NodeTree`+`NodeRecord`, +`resolve/`, `validate/`, `api/`) keeps the node spine **off pydantic** — a +`NodeRecord` is a frozen `dataclass`, and node `attributes` ride as a raw +`dict[str, JsonValue]` validated only on typed access (`node[WellAttribute]`). As +a result the **core read/write/resolve/inline path no longer needs pydantic** — +the only remaining couplings are: + +- `JsonValue` type-alias imports across `graph/`, `resolve/`, and `api/`; +- the small value types `DocPath` (`models/_paths.py`) and `ReferenceObj` + (`models/_references.py`), plus the `BaseObj` base; +- the genuinely validation-heavy **typed attribute layer** (`models/attributes/*`: + discriminated transformation union, `RootModel` lists, `Field` constraints); +- the **built-in capability validators** (`validate/_builtins.py`), which import + those typed models to do their checks. + +Goal: `import ngio_collections` plus open/edit/save/resolve/inline work with **no +pydantic installed**. Typed attribute models and the built-in validators stay on +pydantic but behind an optional extra — `pip install ngio-collections[validation]`. +Accessing a typed attribute symbol without pydantic raises a clear, actionable +`ImportError`. + +The hard part (reimplementing pydantic's discriminated unions / constraints) is +explicitly **out of scope** — we keep pydantic for what it's good at and gate it. + +## Approach + +### 1. Packaging — `pyproject.toml` +- Remove `pydantic` from `[project].dependencies` (leave `dependencies = []`). +- Add to `[project.optional-dependencies]`: + `validation = ["pydantic>=2.0.0,<3.0.0"]`. +- Add `pydantic = "*"` to `[tool.pixi.feature.dev.dependencies]` and + `[tool.pixi.feature.test.dependencies]` so the existing suites still exercise + attributes (today pydantic reaches those envs via the editable install's + `dependencies`). 
+- Add a pydantic-free guard environment: a `core` pixi feature/environment with + pytest but **no pydantic**, used to prove the core stays import-clean (see + Verification). + +### 2. A pydantic-free home for shared primitives — `models/_config.py` +`_config.py` becomes fully pydantic-free: +- Add `JsonValue` — a local recursive alias replacing `from pydantic import + JsonValue` (`dict[str, JsonValue] | list[JsonValue] | str | int | float | bool + | None`). +- Keep `NodeStateError`. +- **Move `BaseObj`** (the frozen, camelCase-aliased pydantic base) out to the + attributes layer — after this change only the attributes layer uses it. + +### 3. De-pydantic the two core value types — stdlib dataclasses +Make `DocPath` and `ReferenceObj` plain **`@dataclass(frozen=True, slots=True)`** +with no pydantic import. `frozen=True` gives value `__eq__`/`__hash__` for free +(callers/tests rely on it). Validation goes in `__post_init__`. + +The key design point (verified against pydantic v2): **pydantic consumes a stdlib +dataclass natively when it appears as a field type** — it coerces an incoming +`dict` into the dataclass, runs the dataclass's `__post_init__` (so pattern +validation still fires through pydantic), and serializes it back on `model_dump`. +So `ReferenceObj` is defined *once*, pydantic-free, and the pydantic +transformation models keep `input: ReferenceObj | None` / `output: ReferenceObj | +None` (`models/attributes/_transformation.py`) unchanged — no duplication, no +`arbitrary_types_allowed`, no `__get_pydantic_core_schema__`. + +- **`models/_references.py`** — `ReferenceObj`: `id: str` (validated against + `ID_PATTERN` in `__post_init__`), `path: DocPath | None = None`. Drop the + `BaseObj` base and `from pydantic import Field`. Add thin + `model_validate` (classmethod, `dict -> ReferenceObj`) / `model_dump` + (`ReferenceObj -> dict`) shims so the **pydantic-free core** (`Node.ref()`, + `resolve` serialization) can (de)serialize without pydantic. 
Keep `ID_PATTERN` + here. +- **`models/_paths.py`** — `DocPath` (and `ZarrPath`/`JsonPath` subclasses): + `type: Literal["zarr","json"]`, `path: str`, validated in `__post_init__`. Keep + `resolve()` / `relativize()` delegating to the module functions, plus + `model_validate`/`model_dump`/`model_copy` shims for the core path. `PathObj = + DocPath` alias stays. (`ReferenceObj.path` nests `DocPath`; pydantic handles the + nested dataclass-in-dataclass too.) +- **`IdStr`** (decision): keep `ID_PATTERN` in `_references.py` (pydantic-free) + and validate it in `ReferenceObj.__post_init__`. **Keep the pydantic + `IdStr = Annotated[str, Field(pattern=ID_PATTERN)]` in the attributes tier** + (move its definition to `models/attributes/`), because `CoordinateSystem.id` + (a pydantic model in `_coordinate.py`) relies on it — flattening `IdStr` to a + plain `str` would silently drop pattern validation there. +- **`extra="allow"` (decision):** the old `BaseObj`/`DocPath` round-tripped + unknown keys; a fixed-field dataclass cannot. These are tiny closed-shape + locators, so **drop extra-allow for `DocPath`/`ReferenceObj`** rather than carry + an `extras: dict` field + custom schema. (Revisit only if real documents are + found stashing extra keys on a reference/path.) The two serialization paths — + the core's `model_dump` shim and pydantic's nested-dataclass serializer — must + agree on the wire dict; trivial here since `id`/`path`/`type` need no camelCase + aliasing. + +### 4. Repoint the `JsonValue` imports to `_config` +In each of these, `from pydantic import JsonValue` → `from +ngio_collections.models._config import JsonValue` (all annotation / raw-dict +walking — no validation involved): +`graph/_record.py`, `graph/_tree.py`, `api/_node.py`, `resolve/_build.py`, +`resolve/_jsonrefs.py`. + +### 5. 
Decouple the core handle & validator engine from the attributes import +`api/_node.py`, `validate/_engine.py`, and `validate/_views.py` import +`AnyAttribute` from `models.attributes` only for a `TypeVar` bound plus type +annotations. All three already have `from __future__ import annotations`, so their +annotations are strings already. For each: +- Move `from ngio_collections.models.attributes import AnyAttribute` under a + `TYPE_CHECKING` block. +- Change `A = TypeVar("A", bound=AnyAttribute)` → `A = TypeVar("A", + bound="AnyAttribute")` (a string forward-ref bound is not evaluated at import). + +No inline guard is needed in the runtime methods (`set_attr`/`drop_attrs`/ +`__contains__`, `read_attribute`/`get_attribute`/`has_attribute`): they only ever +call `.key` / `.model_dump` / `.model_validate` on a caller-supplied attribute +*instance or class*, and you cannot obtain one without importing the attributes +layer — which already requires pydantic. Add a one-line comment noting this +invariant. + +### 6. Attributes **and** built-in validators stay on pydantic, behind a guard +- `models/attributes/_base.py` — define `BaseObj` here (moved from `_config`); + keep `_AttributeKey`, `BaseAttribute`, `BaseListAttribute(RootModel[...])`. +- `models/attributes/__init__.py` — top-of-module guard: + `try: import pydantic` / `except ModuleNotFoundError: raise ImportError("Typed + attribute models require pydantic — install ngio-collections[validation]")`. The + `_coordinate` / `_hcs` / `_attributes` / `_transformation` modules stay + unchanged pydantic code. +- `validate/_builtins.py` is genuinely pydantic (it imports `PlateAttribute`, + `ScaleTransformation`, … and does `isinstance` checks on them). It belongs to + the validation tier. Keep it as-is, but make `validate/__init__.py` serve its + symbols (`ScaleMatchesAxes`, `WellUnderPlate`, `register_builtins`) **lazily** + via a module `__getattr__`, eager-importing only the engine (`_engine`) and the + lenses (`_views`). 
+ +### 7. Make the public surface lazy — the crux +Today `models/__init__.py` (eager attribute imports), `api/__init__.py` +(`from ngio_collections.models import *`), and the top-level `__init__.py` +(`from .api import *`) eagerly re-export every attribute/transformation symbol, so +`import ngio_collections` still needs pydantic. Convert all three to PEP 562 lazy +loading. + +Key gotcha: **`from X import *` resolves every name in `X.__all__` eagerly, +calling module `__getattr__` for each lazy name, so it imports the attributes +layer (and pydantic) immediately.** So the two star-imports must be +replaced with explicit eager imports of the core names plus a forwarding +`__getattr__` for the lazy names. + +- `models/__init__.py`: import the **core** symbols eagerly (`NodeStateError`; + `DocPath`/`JsonPath`/`PathObj`/`ZarrPath`; `IdStr`/`ReferenceObj`). The ~34 + attribute/transformation names **plus `BaseObj`** become lazy: define + `__getattr__(name)` importing them from `ngio_collections.models.attributes` on + first access (surfacing the guard's `ImportError` if pydantic is absent). Keep + `__all__` complete and add `__dir__`. +- `api/__init__.py`: replace `from ngio_collections.models import *` with an + explicit eager import of the core model names, and add a module `__getattr__` + forwarding the lazy attribute names to `ngio_collections.models`. Keep `__all__`. + The composition root must **not** eagerly register pydantic-bound validators: + guard `register_builtins(DEFAULT_VALIDATORS)` on pydantic availability + (`try: import pydantic` → register; `except ModuleNotFoundError: pass`). + `register_node_types()` stays unconditional (no pydantic). +- top-level `src/ngio_collections/__init__.py`: replace `from .api import *` with + an eager import of `api`'s eager names plus a `__getattr__` forwarding the lazy + names to `.api`. Keep `__all__`. + +## Critical files +- `pyproject.toml` — dependency move + extra + pixi envs. +- `src/ngio_collections/models/_config.py` — drop pydantic; add `JsonValue`; + remove `BaseObj`.
+- `src/ngio_collections/models/_paths.py` — hand-rolled `DocPath`. +- `src/ngio_collections/models/_references.py` — hand-rolled `ReferenceObj`, + plain `IdStr`. +- `src/ngio_collections/graph/_record.py`, `graph/_tree.py`, + `resolve/_build.py`, `resolve/_jsonrefs.py` — `JsonValue` import from `_config`. +- `src/ngio_collections/api/_node.py` — `JsonValue` import; `TYPE_CHECKING` + `AnyAttribute` + string `TypeVar` bound. +- `src/ngio_collections/validate/_engine.py`, `validate/_views.py` — + `TYPE_CHECKING` `AnyAttribute` + string `TypeVar` bound. +- `src/ngio_collections/models/attributes/_base.py` — host `BaseObj`. +- `src/ngio_collections/models/attributes/__init__.py` — pydantic guard. +- `src/ngio_collections/validate/__init__.py` — lazy `_builtins` symbols. +- `src/ngio_collections/models/__init__.py`, `api/__init__.py`, + `src/ngio_collections/__init__.py` — lazy `__getattr__`, replace star-imports, + guard composition-root validator registration. + +## Verification +1. Full suite with pydantic: `pixi run --environment dev pytest` — green + (attributes + built-in validators still validated). +2. `pixi run lint` and `pixi run type-check` — clean. +3. Pydantic-free core (new `core` env, no pydantic). A small test asserting: + - `import ngio_collections` succeeds; + - an open → edit (`set_attrs`/`rename`/`add`) → save round-trip works + (attributes round-trip as raw dicts); + - reference resolve + `open_inlined` works; + - accessing a typed symbol (e.g. `ngio_collections.WellAttribute` or + `node[WellAttribute]`) raises `ImportError` mentioning + `ngio-collections[validation]`. + Run via `pixi run --environment core pytest tests/test_no_pydantic.py` (or a + subprocess that blocks `pydantic` from `sys.modules`). diff --git a/examples/01_sync_api.py b/examples/01_sync_api.py deleted file mode 100644 index 5ee4fe3..0000000 --- a/examples/01_sync_api.py +++ /dev/null @@ -1,99 +0,0 @@ -"""Quickstart: the sync convenience API for scripts and notebooks. 
- -No ``asyncio.run()`` anywhere — that's the point. ``write_multiscale`` / -``write_collection`` emit one document each (children embedded) and return a -reference stub for it, so a collection can reference the written document -instead of embedding a copy; ``open_collection`` / ``open_multiscale`` return -the fully inlined tree. ``walk()`` / ``find()`` navigate the result. - -Run with: - - pixi run -e dev python examples/01_sync_api.py -""" - -import shutil -from pathlib import Path - -import ngio_collections as ngc - -ROOT = Path(__file__).parent / "data" / "sync_api" - - -def build_multiscale() -> ngc.MultiscaleNode: - systems = ngc.CoordinateSystemsAttribute( - [ - ngc.CoordinateSystem( - id="physical", - axes=[{"name": "y", "type": "space"}, {"name": "x", "type": "space"}], - ) - ] - ) - return ngc.MultiscaleNode( - id="image", - name="DAPI", - nodes=[ - ngc.SinglescaleNode( - id="s0", - name="s0", - path=ngc.ZarrPath(path="./s0"), - attributes={"coordinateTransformations": []}, - ) - ], - attributes={systems.key: systems.model_dump(mode="json", by_alias=True)}, - ) - - -def show(node: ngc.BaseNode, depth: int = 0) -> None: - print(f"{' ' * depth}{node.type} {node.id!r} attrs={list(node.attributes)}") - for child in getattr(node, "nodes", None) or []: - if isinstance(child, ngc.BaseNode): - show(child, depth + 1) - - -def main() -> None: - shutil.rmtree(ROOT, ignore_errors=True) - - # A multiscale as its own zarr-form document (singlescales embedded). - # The writer hands back the reference form: {type, id, name, path}. - ref: ngc.MultiscaleRef = ngc.write_multiscale( - build_multiscale(), str(ROOT / "image.zarr") - ) - # Parent-level attributes live on the stub; they win over the target's - # attributes when the reference is inlined on read. - ref.attributes["ngio:description"] = "sync api demo" - - # The collection references the existing document instead of embedding it; - # the stub path is relativized on write ("./image.zarr"). 
- collection = ngc.CollectionNode( - id="experiment", - name="My Experiment", - nodes=[ref], - ) - ngc.write_collection(collection, str(ROOT / "collection.json")) - root = ngc.open_collection(str(ROOT / "collection.json")) - print("collection tree (fully inlined):") - show(root) - - # Navigation: walk() flattens the subtree (self first, depth-first) and - # find() looks a node up by id — no hand-rolled recursion needed. - print("\nwalk() — flat view of the tree:") - for node in root.walk(): - print(f" {node.type} {node.id!r}") - - s0 = root.find("s0") - assert s0 is not None - print(f"\nfind('s0') -> {s0.type} {s0.id!r} name={s0.name!r}") - - # A multiscale document can also be opened directly, and its attributes - # read through the typed attrs view. - image = ngc.open_multiscale(str(ROOT / "image.zarr")) - systems = image.attrs[ngc.CoordinateSystemsAttribute] - print(f"\nopen_multiscale: coordinate systems = {[cs.id for cs in systems.root]}") - - # the stub's path is the target URL, even on the inlined document - # this can be used for data access - print(f"\nimage.nodes[0].target_url -> {image.nodes[0].target_url}") - - -if __name__ == "__main__": - main() diff --git a/examples/02_models_and_documents.py b/examples/02_models_and_documents.py deleted file mode 100644 index 91b6866..0000000 --- a/examples/02_models_and_documents.py +++ /dev/null @@ -1,106 +0,0 @@ -"""The pure model layer: nodes, typed attributes, and document round-trips. - -No IO in this script — nodes and documents are plain Pydantic objects -(DESIGN.md §7). Shows structural validation at construction, the typed -``attrs`` view over the raw attributes dict, ``walk()`` / ``find()`` -navigation, and the ``parse_metadata_document`` / ``serialize`` round-trip. 
- -Run with: - - pixi run -e dev python examples/02_models_and_documents.py -""" - -import json - -from pydantic import ValidationError - -import ngio_collections as ngc -from ngio_collections.models import LabelObj - - -def build_tree() -> ngc.CollectionNode: - s0 = ngc.SinglescaleNode( - id="s0", - name="s0", - path=ngc.ZarrPath(path="./s0"), - attributes={ - "coordinateTransformations": [ - { - "type": "scale", - "scale": [0.65, 0.65], - "input": {"id": "s0"}, - "output": {"id": "physical"}, - } - ] - }, - ) - systems = ngc.CoordinateSystemsAttribute( - [ - ngc.CoordinateSystem( - id="physical", - axes=[{"name": "y", "type": "space"}, {"name": "x", "type": "space"}], - ) - ] - ) - image = ngc.MultiscaleNode( - id="image", - name="DAPI", - nodes=[s0], - attributes={systems.key: systems.model_dump(mode="json", by_alias=True)}, - ) - return ngc.CollectionNode(id="experiment", name="My Experiment", nodes=[image]) - - -def main() -> None: - # --- Structural rules are enforced at construction ---------------------- - try: - ngc.CollectionNode(id="c", name="c") # neither `nodes` nor `path` - except ValidationError as err: - print("validation error:", err.errors()[0]["msg"]) - - root = build_tree() - - # --- walk() / find(): flat traversal and id lookup ---------------------- - print("\nwalk:", [node.id for node in root.walk()]) - image = root.find("image") - assert isinstance(image, ngc.MultiscaleNode) - - # --- The attrs view: typed, validating reads and writes ----------------- - # Reads validate the raw JSON into the attribute model; assignment dumps - # the spec shape back into the dict. The raw dict stays the source of - # truth, so unknown attributes round-trip untouched. 
- systems = image.attrs[ngc.CoordinateSystemsAttribute] - print("axes:", [axis["name"] for axis in systems.root[0].axes]) - - image.attrs[ngc.LabelsAttribute] = ngc.LabelsAttribute( - label_attributes=[LabelObj(label_value=1, color=[255, 0, 0, 255])] - ) - print("labels set:", ngc.LabelsAttribute in image.attrs) - - # --- Documents: serialize and re-parse, no IO --------------------------- - # A MetadataDocument is the unit of serialization; the `ome` version - # lives on it, off the node models. - doc = ngc.MetadataDocument( - root=root, url="memory://collection.json", form="json", version="0.x" - ) - payload = doc.serialize() - print("\nserialized document:") - print(json.dumps(payload, indent=2)[:300], "...") - - reparsed = ngc.parse_metadata_document(payload, url="memory://collection.json") - assert [n.id for n in reparsed.root.walk()] == [n.id for n in root.walk()] - print("\nround-trip preserves the tree:", [n.id for n in reparsed.root.walk()]) - - # --- Graceful degradation: unknown types stay opaque -------------------- - custom = ngc.CollectionNode( - id="c", - name="c", - nodes=[{"type": "mobie:view", "id": "v1", "name": "view", "customField": 42}], - ) - view = custom.nodes[0] - print("\nunknown type parses as:", type(view).__name__) - print("extras round-trip:", view.model_dump(by_alias=True)["customField"]) - - -if __name__ == "__main__": - main() diff --git a/examples/03_resolver_read_write.py b/examples/03_resolver_read_write.py deleted file mode 100644 index 670db6d..0000000 --- a/examples/03_resolver_read_write.py +++ /dev/null @@ -1,187 +0,0 @@ -"""The async core: externalized documents, lazy reads, document-granular saves. 
-
-Writes an RFC-8 collection with one document per externalized node:
-
-    data/resolver/
-    ├── collection.json              <- root collection (stubs for the children)
-    ├── tables/measurements.json     <- nested collection, its own document
-    └── image.zarr/zarr.json         <- multiscale, stored in zarr.json form
-
-Then reads it back through the ``Resolver``: ``open()`` reads only the root
-document (children stay stubs), ``resolve()`` fetches one child on demand,
-``resolve_tree()`` warms the cache for the whole reachable tree, and editing
-a node rewrites only its owning document on ``save()``.
-
-Run with:
-
-    pixi run -e dev python examples/03_resolver_read_write.py
-"""
-
-import asyncio
-import hashlib
-import shutil
-from pathlib import Path
-
-import ngio_collections as ngc
-
-ROOT = Path(__file__).parent / "data" / "resolver"
-VERSION = "0.x"
-
-
-def build_image() -> ngc.MultiscaleNode:
-    """A multiscale image with one resolution level.
-
-    The singlescale's path points at the array data; its scale transformation
-    maps it into the "physical" coordinate system declared on the multiscale.
-    """
-    s0 = ngc.SinglescaleNode(
-        id="s0",
-        name="s0",
-        path=ngc.ZarrPath(path="./s0"),
-        attributes={
-            "coordinateTransformations": [
-                {
-                    "type": "scale",
-                    "scale": [1.0, 0.65, 0.65],
-                    "input": {"id": "s0"},
-                    "output": {"id": "physical"},
-                }
-            ]
-        },
-    )
-    physical = ngc.CoordinateSystem(
-        id="physical",
-        axes=[
-            {"name": "z", "type": "space", "unit": "micrometer"},
-            {"name": "y", "type": "space", "unit": "micrometer"},
-            {"name": "x", "type": "space", "unit": "micrometer"},
-        ],
-    )
-    systems = ngc.CoordinateSystemsAttribute([physical])
-    return ngc.MultiscaleNode(
-        id="image",
-        name="DAPI",
-        nodes=[s0],
-        attributes={
-            systems.key: systems.model_dump(
-                mode="json", by_alias=True, exclude_none=True
-            )
-        },
-    )
-
-
-async def write_fixture(resolver: ngc.Resolver) -> None:
-    """One MetadataDocument per externalized node; the root references them
-    through path stubs. ``stub_path`` is how the parent document will
-    reference each child document."""
-    image_doc = ngc.MetadataDocument(
-        root=build_image(),
-        url=str(ROOT / "image.zarr" / "zarr.json"),
-        form="zarr",
-        version=VERSION,
-        stub_path=ngc.ZarrPath(path="./image.zarr"),
-    )
-    await resolver.save(image_doc)
-
-    tables = ngc.CollectionNode(
-        id="tables",
-        name="Tables",
-        nodes=[
-            # Unregistered node types are perfectly valid: readers treat them
-            # as opaque nodes and keep their custom fields.
-            {"type": "fractal:table", "id": "t1", "name": "regionprops"},
-        ],
-    )
-    tables_doc = ngc.MetadataDocument(
-        root=tables,
-        url=str(ROOT / "tables" / "measurements.json"),
-        form="json",
-        version=VERSION,
-        stub_path=ngc.JsonPath(path="./tables/measurements.json"),
-    )
-    await resolver.save(tables_doc)
-
-    root = ngc.CollectionNode(
-        id="my-experiment",
-        name="My Experiment",
-        nodes=[
-            ngc.MultiscaleNode(
-                id="image", name="DAPI", path=ngc.ZarrPath(path="./image.zarr")
-            ),
-            ngc.CollectionNode(
-                id="tables",
-                name="Tables",
-                path=ngc.JsonPath(path="./tables/measurements.json"),
-            ),
-        ],
-    )
-    root_doc = ngc.MetadataDocument(
-        root=root, url=str(ROOT / "collection.json"), form="json", version=VERSION
-    )
-    await resolver.save(root_doc)
-
-
-def print_tree(node: ngc.BaseNode, indent: int = 0) -> None:
-    stub = f" -> {node.path.path}" if node.path is not None else ""
-    print(f"{' ' * indent}[{node.type}] {node.id}{stub}")
-    for child in getattr(node, "nodes", None) or []:
-        if isinstance(child, ngc.BaseNode):
-            print_tree(child, indent + 1)
-
-
-def snapshot() -> dict[Path, str]:
-    return {p: hashlib.sha256(p.read_bytes()).hexdigest() for p in ROOT.rglob("*.json")}
-
-
-async def main() -> None:
-    shutil.rmtree(ROOT, ignore_errors=True)
-    await write_fixture(ngc.Resolver(ngc.LocalStore()))
-
-    # A fresh resolver, so open() parses the documents from disk.
-    resolver = ngc.Resolver(ngc.LocalStore())
-
-    # --- Open only reads the root document; children stay as stubs ----------
-    doc = await resolver.open(str(ROOT / "collection.json"))
-    root = doc.root
-    print("After open() (lazy, one file read):")
-    print_tree(root)
-
-    # --- Resolve a single child on demand ------------------------------------
-    # Resolution never mutates the tree: `root` keeps its stub, the resolved
-    # document lives in the resolver's URL-keyed cache.
-    image_stub = root.find("image")
-    image_doc = await resolver.resolve(image_stub)
-    systems = image_doc.root.attrs[ngc.CoordinateSystemsAttribute]
-    print(
-        f"\nResolved {image_doc.root.id!r}: "
-        f"coordinate systems = {[cs.id for cs in systems.root]}"
-    )
-
-    # --- Or warm the cache for the whole reachable tree ----------------------
-    # Stubs whose path points at plain Zarr data rather than an OME metadata
-    # document (the singlescale's `./s0` array) are skipped by default
-    # (`on_error="skip"`). Afterwards children() / resolve() are cache reads.
-    documents = await resolver.resolve_tree(doc)
-    print(f"\nresolve_tree() fetched {len(documents)} documents:")
-    for d in documents:
-        print(f" {d.form:>4} {d.url}")
-
-    # children() transparently replaces stubs with their resolved roots.
-    for child in await resolver.children(root):
-        print(f"child {child.id!r}: attrs={list(child.attributes)}")
-
-    # --- Edit one node and save: only its owning document is rewritten ------
-    before = snapshot()
-    tables_doc = await resolver.resolve(root.find("tables"))
-    tables_doc.root.attributes["fractal:status"] = "validated"
-    await resolver.save(tables_doc)
-    after = snapshot()
-
-    print("\nFiles changed by save(tables):")
-    for path in after:
-        if before[path] != after[path]:
-            print(f" {path.relative_to(ROOT)}")
-
-
-if __name__ == "__main__":
-    asyncio.run(main())
diff --git a/examples/04_inline_and_merge.py b/examples/04_inline_and_merge.py
deleted file mode 100644
index 044c566..0000000
--- a/examples/04_inline_and_merge.py
+++ /dev/null
@@ -1,124 +0,0 @@
-"""Inline a resolved tree: the attribute merge materialized (DESIGN.md §5).
-
-The collection's `image` stub carries its own attributes
-(`ngio:description`) on top of the target multiscale's
-(`coordinateSystems`, `labels`). `Resolver.inline()` builds a NEW document
-in which the stub is collapsed into a copy of its resolved subtree, with
-the merged attributes: target root's overlaid by the stub's (stub wins) and
-the stub's id/name. The originals are never touched.
-
-Writes stay explicit and document-granular — annotating a multiscale on a
-read-only store means writing to the stub and saving only the collection
-document, which this script also demonstrates.
-
-Run with:
-
-    pixi run -e dev python examples/04_inline_and_merge.py
-"""
-
-import asyncio
-import hashlib
-import shutil
-from pathlib import Path
-
-import ngio_collections as ngc
-from ngio_collections.models import LabelObj
-
-ROOT = Path(__file__).parent / "data" / "inline"
-VERSION = "0.x"
-
-
-def digest(path: Path) -> str:
-    return hashlib.sha256(path.read_bytes()).hexdigest()[:12]
-
-
-async def write_fixture() -> None:
-    """A collection whose stub annotates an externalized multiscale."""
-    resolver = ngc.Resolver(ngc.LocalStore())
-    systems = ngc.CoordinateSystemsAttribute(
-        [ngc.CoordinateSystem(id="physical", axes=[{"name": "x", "type": "space"}])]
-    )
-    image = ngc.MultiscaleNode(
-        id="image",
-        name="DAPI",
-        nodes=[
-            ngc.SinglescaleNode(
-                id="s0",
-                name="s0",
-                path=ngc.ZarrPath(path="./s0"),
-                attributes={"coordinateTransformations": []},
-            )
-        ],
-        attributes={systems.key: systems.model_dump(mode="json", by_alias=True)},
-    )
-    image.attrs[ngc.LabelsAttribute] = ngc.LabelsAttribute(
-        label_attributes=[LabelObj(label_value=1, color=[255, 0, 0, 255])]
-    )
-    await resolver.save(
-        ngc.MetadataDocument(
-            root=image,
-            url=str(ROOT / "image.zarr" / "zarr.json"),
-            form="zarr",
-            version=VERSION,
-            stub_path=ngc.ZarrPath(path="./image.zarr"),
-        )
-    )
-    root = ngc.CollectionNode(
-        id="my-experiment",
-        name="My Experiment",
-        nodes=[
-            ngc.MultiscaleNode(
-                id="image",
-                name="DAPI",
-                path=ngc.ZarrPath(path="./image.zarr"),
-                attributes={"ngio:description": "stub-side annotation"},
-            )
-        ],
-    )
-    await resolver.save(
-        ngc.MetadataDocument(
-            root=root, url=str(ROOT / "collection.json"), form="json", version=VERSION
-        )
-    )
-
-
-async def main() -> None:
-    shutil.rmtree(ROOT, ignore_errors=True)
-    await write_fixture()
-
-    # A fresh resolver, so open() parses the documents from disk.
-    resolver = ngc.Resolver(ngc.LocalStore())
-    doc = await resolver.open(str(ROOT / "collection.json"))
-    stub = doc.root.nodes[0]
-
-    print("stub attributes: ", list(stub.attributes))
-    target = (await resolver.resolve(stub)).root
-    print("target attributes:", list(target.attributes))
-
-    # The §5 merge, materialized: the stub collapsed into its resolved
-    # subtree, target attributes overlaid by the stub's (stub wins).
-    inlined = await resolver.inline(doc)
-    image = inlined.root.nodes[0]
-    print("merged attributes:", list(image.attributes))
-
-    # The inlined node is a real node: typed reads via the normal attrs view.
-    labels = image.attrs[ngc.LabelsAttribute]
-    print("label colors:", [label.color for label in labels.label_attributes])
-
-    # The originals are untouched: the parsed tree keeps its stub.
-    print("original stub intact:", stub.path is not None and stub.nodes is None)
-
-    # Annotate the (possibly read-only) multiscale via the stub: only the
-    # collection document is rewritten, image.zarr/zarr.json is untouched.
-    zarr_json = ROOT / "image.zarr" / "zarr.json"
-    before = digest(zarr_json)
-    stub.attributes["ngio:reviewed"] = True
-    await resolver.save(doc)
-    print("\nsaved", doc.url)
-    print("image.zarr/zarr.json untouched:", digest(zarr_json) == before)
-    reinlined = await resolver.inline(doc)
-    print("merged attributes:", list(reinlined.root.nodes[0].attributes))
-
-
-if __name__ == "__main__":
-    asyncio.run(main())
diff --git a/examples/05_custom_node_types.py b/examples/05_custom_node_types.py
deleted file mode 100644
index b4e3aee..0000000
--- a/examples/05_custom_node_types.py
+++ /dev/null
@@ -1,76 +0,0 @@
-"""Registering custom node types (DESIGN.md §3.4).
-
-A third-party package subclasses ``BaseNode``, registers it under its
-``type`` key in a ``NodeRegistry``, and children parse as the custom class.
-Registries are plain objects, not singletons: pass one to
-``parse_metadata_document(..., registry=...)`` or ``Resolver(store,
-registry=...)``. Unregistered types degrade gracefully to a plain
-``BaseNode`` and round-trip their extra fields untouched.
-
-Run with:
-
-    pixi run -e dev python examples/05_custom_node_types.py
-"""
-
-from typing import Literal
-
-import ngio_collections as ngc
-
-
-class TableNode(ngc.BaseNode):
-    """A custom node type with its own typed field."""
-
-    type: Literal["fractal:table"] = "fractal:table"
-    region: str | None = None
-
-
-def build_registry() -> ngc.NodeRegistry:
-    # A fresh registry starts empty: register the built-ins you need plus
-    # your own types (DEFAULT_REGISTRY keeps the everyday ergonomics).
-    registry = ngc.NodeRegistry()
-    registry.register("collection", ngc.CollectionNode)
-    registry.register("multiscale", ngc.MultiscaleNode)
-    registry.register("singlescale", ngc.SinglescaleNode)
-    registry.register("fractal:table", TableNode)
-    return registry
-
-
-DATA = {
-    "ome": {
-        "version": "0.x",
-        "type": "collection",
-        "id": "experiment",
-        "name": "My Experiment",
-        "nodes": [
-            {
-                "type": "fractal:table",
-                "id": "t1",
-                "name": "regionprops",
-                "region": "FOV_1",
-            },
-            {"type": "mobie:view", "id": "v1", "name": "view", "customField": 42},
-        ],
-    }
-}
-
-
-def main() -> None:
-    doc = ngc.parse_metadata_document(
-        DATA, url="memory://collection.json", registry=build_registry()
-    )
-    table = doc.root.find("t1")
-    print(f"registered type parses as {type(table).__name__}, region={table.region!r}")
-
-    # Unregistered types stay opaque BaseNodes; their extras round-trip.
-    view = doc.root.find("v1")
-    dumped = doc.serialize()["ome"]["nodes"][1]
-    print(f"unregistered type parses as {type(view).__name__}, dump={dumped}")
-
-    # Without the registry, the custom type is opaque too (graceful
-    # degradation) — same document, no error, no custom field typing.
-    plain = ngc.parse_metadata_document(DATA, url="memory://collection.json")
-    print("without registry:", type(plain.root.find("t1")).__name__)
-
-
-if __name__ == "__main__":
-    main()
diff --git a/examples/06_hcs_plate_single_collection.py b/examples/06_hcs_plate_single_collection.py
deleted file mode 100644
index 31cd2a5..0000000
--- a/examples/06_hcs_plate_single_collection.py
+++ /dev/null
@@ -1,133 +0,0 @@
-"""An HCS plate as a SINGLE collection document, images externalized.
-
-One ``collection.json`` holds the whole plate->well hierarchy inline: the
-plate is a collection carrying the ``plate`` attribute, each well a child
-collection carrying the ``well`` attribute. Only the multiscale images are
-externalized, into per-field subdirectories the wells reference by path:
-
-    data/hcs_single/
-    ├── collection.json          <- plate + wells inline, image stubs
-    ├── A/1/0.zarr/zarr.json     <- multiscale image (field 0 of well A/1)
-    ├── A/2/0.zarr/zarr.json
-    ├── B/1/0.zarr/zarr.json
-    └── B/2/0.zarr/zarr.json
-
-Built bottom-up with the sync API: ``write_multiscale`` emits each image and
-hands back a reference stub, the wells embed those stubs, and
-``write_collection`` emits the one plate document — relativizing every nested
-image path against it (``./A/1/0.zarr`` …).
-
-Run with:
-
-    pixi run -e dev python examples/06_hcs_plate_single_collection.py
-"""
-
-import shutil
-from pathlib import Path
-
-import ngio_collections as ngc
-from ngio_collections.models import ColumnObj, RowObj
-
-ROOT = Path(__file__).parent / "data" / "hcs_single"
-
-ROWS = ["A", "B"]
-COLUMNS = ["1", "2"]
-
-
-def build_image(row: str, col: str) -> ngc.MultiscaleNode:
-    """A one-level multiscale for field 0 of well ``{row}/{col}``.
-
-    Node ids are unique per well so the whole plate stays collision-free once
-    every image is inlined into one tree on read.
-    """
-    systems = ngc.CoordinateSystemsAttribute(
-        [
-            ngc.CoordinateSystem(
-                id="physical",
-                axes=[{"name": "y", "type": "space"}, {"name": "x", "type": "space"}],
-            )
-        ]
-    )
-    return ngc.MultiscaleNode(
-        id=f"img_{row}{col}",
-        name="0",
-        nodes=[
-            ngc.SinglescaleNode(
-                id=f"s0_{row}{col}",
-                name="s0",
-                path=ngc.ZarrPath(path="./s0"),
-                attributes={"coordinateTransformations": []},
-            )
-        ],
-        attributes={systems.key: systems.model_dump(mode="json", by_alias=True)},
-    )
-
-
-def build_well(row: str, col: str) -> ngc.CollectionNode:
-    """A well collection whose single child is the externalized image stub."""
-    image_ref = ngc.write_multiscale(
-        build_image(row, col), str(ROOT / row / col / "0.zarr")
-    )
-    well = ngc.WellAttribute(
-        row=ngc.ReferenceObj(id=row), column=ngc.ReferenceObj(id=col)
-    )
-    return ngc.CollectionNode(
-        id=f"well_{row}{col}",
-        name=f"{row}{col}",
-        nodes=[image_ref],
-        attributes={
-            well.key: well.model_dump(mode="json", by_alias=True, exclude_none=True)
-        },
-    )
-
-
-def show(node: ngc.BaseNode, depth: int = 0) -> None:
-    stub = f" -> {node.path.path}" if node.path is not None else ""
-    print(f"{' ' * depth}[{node.type}] {node.id} attrs={list(node.attributes)}{stub}")
-    for child in getattr(node, "nodes", None) or []:
-        if isinstance(child, ngc.BaseNode):
-            show(child, depth + 1)
-
-
-def main() -> None:
-    shutil.rmtree(ROOT, ignore_errors=True)
-
-    # Wells (with their image stubs) stay inline; only images are externalized.
-    wells = [build_well(row, col) for row in ROWS for col in COLUMNS]
-    plate = ngc.PlateAttribute(
-        rows=[RowObj(id=row) for row in ROWS],
-        columns=[ColumnObj(id=col) for col in COLUMNS],
-    )
-    plate_node = ngc.CollectionNode(
-        id="plate",
-        name="My Plate",
-        nodes=wells,
-        attributes={
-            plate.key: plate.model_dump(mode="json", by_alias=True, exclude_none=True)
-        },
-    )
-    ngc.write_collection(plate_node, str(ROOT / "collection.json"))
-
-    print("written files:")
-    for file in sorted(ROOT.rglob("*.json")):
-        print(f" {file.relative_to(ROOT)}")
-
-    # Read back: open_collection inlines the image documents into the plate.
-    root = ngc.open_collection(str(ROOT / "collection.json"))
-    print("\nplate tree (fully inlined):")
-    show(root)
-
-    # Navigate the flattened plate with walk() / find().
-    plate_attr = root.attrs[ngc.PlateAttribute]
-    print(
-        f"\nplate: {len(plate_attr.rows)} rows x {len(plate_attr.columns)} columns, "
-        f"{sum(n.type == 'collection' for n in root.walk()) - 1} wells"
-    )
-    well_b2 = root.find("well_B2")
-    assert well_b2 is not None
-    location = well_b2.attrs[ngc.WellAttribute]
-    print(f"well_B2 at row={location.row.id!r} column={location.column.id!r}")
-
-
-if __name__ == "__main__":
-    main()
diff --git a/examples/07_hcs_plate_nested.py b/examples/07_hcs_plate_nested.py
deleted file mode 100644
index 19ca726..0000000
--- a/examples/07_hcs_plate_nested.py
+++ /dev/null
@@ -1,136 +0,0 @@
-"""An HCS plate as a fully externalized tree, one document per node.
-
-Mirrors the on-disk OME-Zarr plate layout: the plate at the top level, each
-well its own document in a ``{row}/{col}`` subdirectory, and each image in
-``{row}/{col}/{image}``. Every parent references its children by path stub:
-
-    data/hcs_nested/
-    ├── collection.json          <- plate, well stubs -> ./A/1/well.json …
-    ├── A/1/well.json            <- well A/1, image stubs -> ./0.zarr
-    ├── A/1/0.zarr/zarr.json     <- multiscale image (field 0)
-    ├── A/2/well.json
-    ├── A/2/0.zarr/zarr.json
-    └── …
-
-Built bottom-up with the sync API: each ``write_*`` emits one document and
-returns a reference stub, which the parent embeds; ``write_collection``
-relativizes the embedded stub paths against the parent's URL, so the well
-document references ``./0.zarr`` and the plate references ``./A/1/well.json``.
-
-Run with:
-
-    pixi run -e dev python examples/07_hcs_plate_nested.py
-"""
-
-import shutil
-from pathlib import Path
-
-import ngio_collections as ngc
-from ngio_collections.models import ColumnObj, RowObj
-
-ROOT = Path(__file__).parent / "data" / "hcs_nested"
-
-ROWS = ["A", "B"]
-COLUMNS = ["1", "2"]
-
-
-def build_image(row: str, col: str) -> ngc.MultiscaleNode:
-    """A one-level multiscale for field 0 of well ``{row}/{col}``.
-
-    Node ids stay unique across the plate so the inlined-on-read tree (every
-    document collapsed into one) has no id collisions.
-    """
-    systems = ngc.CoordinateSystemsAttribute(
-        [
-            ngc.CoordinateSystem(
-                id="physical",
-                axes=[{"name": "y", "type": "space"}, {"name": "x", "type": "space"}],
-            )
-        ]
-    )
-    return ngc.MultiscaleNode(
-        id=f"img_{row}{col}",
-        name="0",
-        nodes=[
-            ngc.SinglescaleNode(
-                id=f"s0_{row}{col}",
-                name="s0",
-                path=ngc.ZarrPath(path="./s0"),
-                attributes={"coordinateTransformations": []},
-            )
-        ],
-        attributes={systems.key: systems.model_dump(mode="json", by_alias=True)},
-    )
-
-
-def write_well(row: str, col: str) -> ngc.CollectionRef:
-    """Write image then well, each its own document; return the well stub."""
-    image_ref = ngc.write_multiscale(
-        build_image(row, col), str(ROOT / row / col / "0.zarr")
-    )
-    well = ngc.WellAttribute(
-        row=ngc.ReferenceObj(id=row), column=ngc.ReferenceObj(id=col)
-    )
-    well_node = ngc.CollectionNode(
-        id=f"well_{row}{col}",
-        name=f"{row}{col}",
-        nodes=[image_ref],
-        attributes={
-            well.key: well.model_dump(mode="json", by_alias=True, exclude_none=True)
-        },
-    )
-    # Writing the well relativizes the image stub against it: ./0.zarr
-    return ngc.write_collection(well_node, str(ROOT / row / col / "well.json"))
-
-
-def show(node: ngc.BaseNode, depth: int = 0) -> None:
-    stub = f" -> {node.path.path}" if node.path is not None else ""
-    print(f"{' ' * depth}[{node.type}] {node.id} attrs={list(node.attributes)}{stub}")
-    for child in getattr(node, "nodes", None) or []:
-        if isinstance(child, ngc.BaseNode):
-            show(child, depth + 1)
-
-
-def main() -> None:
-    shutil.rmtree(ROOT, ignore_errors=True)
-
-    # Each well is its own document; the plate references them by path.
-    well_refs = [write_well(row, col) for row in ROWS for col in COLUMNS]
-    plate = ngc.PlateAttribute(
-        rows=[RowObj(id=row) for row in ROWS],
-        columns=[ColumnObj(id=col) for col in COLUMNS],
-    )
-    plate_node = ngc.CollectionNode(
-        id="plate",
-        name="My Plate",
-        nodes=well_refs,
-        attributes={
-            plate.key: plate.model_dump(mode="json", by_alias=True, exclude_none=True)
-        },
-    )
-    ngc.write_collection(plate_node, str(ROOT / "collection.json"))
-
-    print("written files (one document per node):")
-    for file in sorted(ROOT.rglob("*.json")):
-        print(f" {file.relative_to(ROOT)}")
-
-    # The lazy view: open() reads only the plate; wells stay stubs.
-    print("\nplate document alone (wells are stubs):")
-    show(ngc.open_collection(str(ROOT / "collection.json"), max_depth=0))
-
-    # The hydrated view: open_collection inlines wells and their images.
-    root = ngc.open_collection(str(ROOT / "collection.json"))
-    print("\nplate tree (fully inlined):")
-    show(root)
-
-    well_a1 = root.find("well_A1")
-    assert well_a1 is not None
-    location = well_a1.attrs[ngc.WellAttribute]
-    print(
-        f"\nwell_A1 at row={location.row.id!r} column={location.column.id!r}, "
-        f"images={[n.id for n in well_a1.walk() if n.type == 'multiscale']}"
-    )
-
-
-if __name__ == "__main__":
-    main()
diff --git a/examples/README.md b/examples/README.md
deleted file mode 100644
index 2b59af8..0000000
--- a/examples/README.md
+++ /dev/null
@@ -1,18 +0,0 @@
-# Examples
-
-Each script is self-contained (it writes its own fixture data under
-`examples/data/`, which is gitignored) and runnable with:
-
-```bash
-pixi run -e dev python examples/