Binary container format for episodic data. 64-byte header, O(1) name lookup, per-entry compression, CRC32C checksums, five native language implementations.
any_file.shard
├── Header (64 bytes) Magic "SHRD", version, role byte, alignment, compression default
├── Index (N × 48 bytes) xxHash64 name → offset, size, compression flags, CRC32C
├── String Table Entry names (UTF-8, null-terminated)
└── Data Blocks Aligned to 16/32/64 bytes, independently compressed
Shard is the container. Profiles — concrete schemas for what the entries mean — live downstream of core. The role byte identifies the profile; readers that understand a given role validate it, others treat the file as a generic named-entry archive.
The flagship downstream profile is WShard — episodic world-model data (signal / action / omen / uncert / residual lanes).
| Role | Profile | Status |
|---|---|---|
| 0x04 | Manifest | Core: multi-file coordination |
| 0x05 | WShard | Active downstream profile |
| 0x01 | ModelShard (.mosh) | Parked: see attic/profiles/mosh/ |
| 0x02 | SampleShard (.smpl) | Parked: see attic/profiles/sampleshard/ |
| 0x08 | ColumnShard (.cshard) | Parked: see attic/profiles/columnshard/ |
Parked profiles still decode (the container reader does not care about role semantics) but are not part of the lead surface or default build.
Five independent, idiomatic implementations of the core container. All produce byte-identical output verified by golden file tests.
| Language | Location | Status |
|---|---|---|
| Go | go/ | Production |
| Rust | rs/ | Production |
| Python | py/ | Production |
| C | c/ | Production |
| TypeScript | ts/ | Production |
# Go
import "github.com/Neumenon/shard/go/shard"
# Rust
[dependencies]
shard = { git = "https://github.com/Neumenon/shard.git", path = "rs" }
# Python
cd py && pip install -e .
# TypeScript
cd ts && npm install
# C
cd c && make

import "github.com/Neumenon/shard/go/shard"
// Write a generic named-entry archive
w, _ := shard.NewShardStreamWriter("data.shard", shard.ShardOptions{
	Alignment:   64,
	Compression: shard.CompressionZstd,
	MaxEntries:  1000,
})
w.WriteEntry("entry/0", payloadBytes, shard.ContentTypeRaw)
w.Finalize()
// Read
r, _ := shard.OpenShard("data.shard")
data, _ := r.ReadEntry("entry/0")

from shard_format import ShardWriter, ShardReader
# Write
with ShardWriter("data.shard", alignment=64) as w:
    w.write_entry("entry/0", payload_bytes, compression="zstd")
# Read
reader = ShardReader("data.shard")
data = reader.read_entry("entry/0")

O(1) name lookup. xxHash64 of the entry name maps directly to a 48-byte index entry. Finding one entry in a 400-entry file costs one hash + one 48-byte read.
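A rough sketch of what a hash-keyed lookup over fixed-size index entries could look like. Everything beyond the 48-byte entry size is an assumption made for illustration: the field layout in `ENTRY` is hypothetical, `blake2b` stands in for xxHash64 so the example needs only the standard library, and the in-memory `hash -> slot` dict is just one possible way to get from a hash to an entry.

```python
import hashlib
import struct

ENTRY_SIZE = 48  # index entry size stated in the format overview

# Hypothetical field layout, chosen only so the struct is 48 bytes:
# name_hash, offset, stored_size, raw_size (u64 each), flags, crc32c (u32 each), 8 pad bytes.
ENTRY = struct.Struct("<QQQQII8x")
assert ENTRY.size == ENTRY_SIZE


def name_hash(name: str) -> int:
    """Stand-in 64-bit hash; the real format uses xxHash64 with seed 0."""
    digest = hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def build_index(entries):
    """Pack (name, offset, size) tuples into index bytes plus a hash -> slot map."""
    blob = bytearray()
    slots = {}
    for slot, (name, offset, size) in enumerate(entries):
        h = name_hash(name)
        slots[h] = slot
        blob += ENTRY.pack(h, offset, size, size, 0, 0)
    return bytes(blob), slots


def lookup(index: bytes, slots: dict, name: str):
    """One hash, one dict probe, one 48-byte read. Returns (offset, size) or None."""
    slot = slots.get(name_hash(name))
    if slot is None:
        return None
    start = slot * ENTRY_SIZE
    _, offset, size, _, _, _ = ENTRY.unpack(index[start:start + ENTRY_SIZE])
    return offset, size


index, slots = build_index([("meta/episode", 4096, 120), ("signal/rgb", 8192, 5000)])
```

The cost of `lookup` does not depend on how many entries the file holds, which is the property the paragraph above describes.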
Per-entry compression. Each entry independently chooses none/zstd/lz4. JSON configs compress 10:1. Pre-quantized payloads stay raw. The reader auto-detects from entry flags.
CRC32C checksums. Hardware-accelerated (SSE4.2/ARM CRC32) integrity verification on every entry. Checksum mismatch is a hard error, not a warning.
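For reference, a minimal software CRC32C in Python, using the reflected Castagnoli polynomial `0x82F63B78`. This is a table-driven sketch, not the hardware-accelerated path the implementations use, and `verify` is a hypothetical helper showing the "mismatch is a hard error" behaviour.

```python
POLY = 0x82F63B78  # reflected Castagnoli polynomial


def _make_table():
    """Precompute the CRC of every possible byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """CRC32C with the usual initial value and final XOR of 0xFFFFFFFF."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def verify(payload: bytes, expected: int) -> None:
    """Raise instead of warning when the stored checksum does not match."""
    actual = crc32c(payload)
    if actual != expected:
        raise ValueError(f"CRC32C mismatch: got {actual:#010x}, want {expected:#010x}")
```

The standard check input `b"123456789"` yields `0xE3069283`, which is a quick way to confirm any CRC32C routine before comparing it against the golden values in this repository.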
Aligned mmap. Data blocks start at 32/64-byte boundaries. mmap() + pointer cast gives AVX-aligned arrays with zero copies.
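The offset arithmetic behind aligned data blocks is small enough to show directly. The helper names below are illustrative, not part of the library; treating an alignment of 0 as "no alignment" follows the configurable 0/16/32/64 values listed later in this README.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment (a power of two)."""
    if alignment == 0:
        return offset  # 0 is taken to mean "no alignment"
    if alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two")
    return (offset + alignment - 1) & ~(alignment - 1)


def padding_for(offset: int, alignment: int) -> int:
    """Number of pad bytes a writer would emit before the next data block."""
    return align_up(offset, alignment) - offset
```

For example, a block that would otherwise start at byte 100 moves to byte 128 under 32-byte alignment, at a cost of 28 pad bytes.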
Streaming writes. Pre-reserve header space, write entries sequentially, finalize header at the end. Memory stays constant regardless of entry count.
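The reserve, append, finalize sequence can be sketched as follows. The header and index layouts used here are hypothetical placeholders (only the 64-byte header, 48-byte index entries, and the "SHRD" magic come from this README), and reserving the index up front is an assumption suggested by the `MaxEntries` option in the Go example.

```python
import io
import struct

HEADER_SIZE = 64       # from the format overview
INDEX_ENTRY_SIZE = 48  # from the format overview
MAGIC = b"SHRD"        # from the format overview


def stream_write(out, payloads, max_entries):
    """Write payloads sequentially, then go back and fill in index and header."""
    # 1. Reserve space for the header and index with zero bytes.
    data_start = HEADER_SIZE + max_entries * INDEX_ENTRY_SIZE
    out.write(b"\x00" * data_start)

    # 2. Append payloads one at a time, remembering only (offset, size).
    records = []
    for payload in payloads:
        if len(records) == max_entries:
            raise ValueError("more entries than reserved index slots")
        records.append((out.tell(), len(payload)))
        out.write(payload)

    # 3. Finalize: seek back and fill in the index, then the header.
    out.seek(HEADER_SIZE)
    for offset, size in records:
        # Hypothetical entry layout: offset, size, zero padding up to 48 bytes.
        out.write(struct.pack("<QQ32x", offset, size))
    out.seek(0)
    # Hypothetical header layout: magic, entry count, zero padding up to 64 bytes.
    out.write(struct.pack("<4sI56x", MAGIC, len(records)))
    return records


buf = io.BytesIO()
records = stream_write(buf, [b"aaa", b"bbbbb"], max_entries=4)
```

Payload bytes are never buffered: each one is written as soon as it arrives, and only a small `(offset, size)` record per entry is kept until finalize.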
Security limits. Four hard limits enforced on read: MAX_ENTRY_COUNT (10M), MAX_INDEX_SIZE (1GB), MAX_STRING_TABLE_SIZE (100MB), MAX_DECOMPRESS_SIZE (1GB). No code execution risk.
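A sketch of how a reader might apply these limits before allocating anything. The constant names come from the paragraph above; reading "10M" as 10,000,000 and "1GB"/"100MB" as powers of two is an assumption, and `LimitError` and both check functions are hypothetical names.

```python
MAX_ENTRY_COUNT = 10_000_000
MAX_INDEX_SIZE = 1 << 30                 # assuming 1GB means 2**30 bytes
MAX_STRING_TABLE_SIZE = 100 * (1 << 20)  # assuming 100MB means 100 * 2**20 bytes
MAX_DECOMPRESS_SIZE = 1 << 30
INDEX_ENTRY_SIZE = 48


class LimitError(ValueError):
    """Raised when a file declares sizes beyond the hard limits."""


def check_header_limits(entry_count: int, string_table_size: int) -> None:
    """Validate sizes declared by the file before trusting them."""
    if entry_count > MAX_ENTRY_COUNT:
        raise LimitError(f"entry count {entry_count} exceeds {MAX_ENTRY_COUNT}")
    if entry_count * INDEX_ENTRY_SIZE > MAX_INDEX_SIZE:
        raise LimitError("index size exceeds MAX_INDEX_SIZE")
    if string_table_size > MAX_STRING_TABLE_SIZE:
        raise LimitError("string table size exceeds MAX_STRING_TABLE_SIZE")


def check_decompress_limit(declared_size: int) -> None:
    """Reject an entry whose declared uncompressed size is too large."""
    if declared_size > MAX_DECOMPRESS_SIZE:
        raise LimitError("decompressed size exceeds MAX_DECOMPRESS_SIZE")
```

Checking declared sizes before reading or decompressing is what keeps a malformed or hostile file from forcing a huge allocation.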
WShard: the file format for world model episode data. One file = one episode. Cross-language. Zero-copy reads.
episode.wshard
├── meta/episode → {"episode_id": "ep_001", "env_id": "Manip-v2", "length_T": 500}
├── signal/rgb → [500, 84, 84, 3] uint8 (zstd compressed)
├── signal/joint_pos → [500, 7] float32 (32-byte aligned)
├── action/ctrl → [500, 7] float32
├── omen/rgb/dreamer → [500, 84, 84, 3] uint8 (model predictions)
├── uncert/rgb/std → [500, 84, 84, 3] float32 (uncertainty estimates)
├── reward → [500] float32
└── done → [500] bool
Semantic lane prefixes give meaning to raw data:
| Prefix | Purpose |
|---|---|
| signal/ | Ground truth observations |
| action/ | Agent actions |
| omen/ | Model predictions (dreamed trajectories) |
| uncert/ | Uncertainty estimates (ensemble variance, entropy) |
| residual/ | Compressed residuals (sign-2nd-diff encoding) |
| meta/ | JSON metadata |
| time/ | Timestamps |
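Because lanes are plain name prefixes, a consumer can route entries with string handling alone. The sketch below is illustrative and not the WShard API; `lane_of` and `group_by_lane` are hypothetical helpers, and names without a known prefix (such as `reward` and `done` in the example layout) are reported as having no lane.

```python
LANES = {
    "signal": "Ground truth observations",
    "action": "Agent actions",
    "omen": "Model predictions",
    "uncert": "Uncertainty estimates",
    "residual": "Compressed residuals",
    "meta": "JSON metadata",
    "time": "Timestamps",
}


def lane_of(name: str):
    """Return the lane prefix of an entry name, or None if it has none."""
    prefix, sep, _ = name.partition("/")
    return prefix if sep and prefix in LANES else None


def group_by_lane(names):
    """Bucket entry names by lane; unprefixed names land under None."""
    groups = {}
    for name in names:
        groups.setdefault(lane_of(name), []).append(name)
    return groups


groups = group_by_lane([
    "meta/episode", "signal/rgb", "signal/joint_pos", "action/ctrl",
    "omen/rgb/dreamer", "uncert/rgb/std", "reward", "done",
])
```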
Key capabilities:
- Crash-safe streaming writes via .partial file pattern
- Chunked episodes with manifest-tracked continuity
- Per-block compression (zstd video, raw scalars)
- Format conversion from DreamerV3 NPZ, Minari, D4RL
- 13 data types including bf16
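The `.partial` file pattern in the first bullet is commonly implemented as write, sync, then rename. The sketch below shows that general technique under that assumption; it is not taken from the WShard writer, and `write_atomically` is a hypothetical name.

```python
import os
import tempfile


def write_atomically(final_path: str, chunks) -> None:
    """Stream chunks to <final_path>.partial, then rename into place."""
    partial_path = final_path + ".partial"
    with open(partial_path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach disk before the rename
    # os.replace is atomic on the same filesystem, so readers never observe
    # a half-written file under the final name.
    os.replace(partial_path, final_path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "episode.wshard")
    write_atomically(path, [b"abc", b"def"])
    with open(path, "rb") as f:
        written = f.read()
```

If the process dies mid-write, only the `.partial` file is left behind, so a later run can detect and discard it.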
Implementations: Python (wshard/py/), TypeScript (wshard/js/), Go (go/shard/).
See wshard/README.md for full API docs. See wshard/docs/DEEP_DIVE.md for the byte-level format spec.
The Manifest profile (role 0x04) handles multi-file coordination: references to other shard files with range keys. It covers chunked WShard episodes, split datasets, and file groupings, and is part of core (all 5 languages).
shard/
├── go/ Go core
├── rs/ Rust core
├── py/ Python core
├── c/ C core
├── ts/ TypeScript core
├── wshard/ WShard profile (Python + TypeScript + golden tests)
│ ├── py/
│ ├── js/
│ ├── golden/ Cross-language golden test fixtures
│ └── docs/
├── attic/profiles/ Parked profiles (mosh, sampleshard, columnshard)
├── ucodec/ Golden test data + safety test fixtures
└── explainer.html Interactive format explainer
All implementations agree on:
- CRC32C polynomial: Castagnoli (0x82F63B78)
- Name hashing: xxHash64 (seed 0)
- Alignment: Configurable 0/16/32/64 bytes
- Compression threshold: Only compress entries > 256 bytes, only keep if ratio < 0.9
- Security limits: Identical across all 5 languages
Verified by golden files and a shared manifest (testdata/golden_manifest.json).
CRC32C("hello") = 0x9a71bb4c
xxHash64("signal/obs") = 0x86f8c8413116a0ae
xxHash64("meta/manifest") = 0x9a191dcd325813d3
| Document | Description |
|---|---|
| WShard Deep Dive | Byte-level format spec + cross-language interop |
License: MIT