fix(runner): init std.testing.io_instance — fixes #44#45
Merged
Conversation
In Zig 0.16 `std.testing` declares:
pub var io_instance: Io.Threaded = undefined;
pub const io = if (builtin.is_test) io_instance.io() else ...;
`Threaded.io(t)` bakes `&io_instance` into the returned Io's
`userdata` *at comptime* (lib/std/Io/Threaded.zig:1806). Without
a runtime init, the global memory at that address stays zero —
worker_threads = null, etc. — and every operation that needs to
enqueue work onto the thread pool deadlocks on linux. macOS lets
more zero-init paths through so the bug only surfaces in CI.
The stdlib's terminal runner already handles this (see
lib/compiler/test_runner.zig:273-282) — it does
`testing.io_instance = .init(testing.allocator, .{...})` before
every test and tears down after. zspec's `mode: .simple` runner
never did, so every consumer routing `.simple`-mode tests through
zspec was actually running against an uninitialized Io.Threaded
the whole time — only ops that took fast paths happened to work.
Add a single-shot init at the top of `runner.main()`. A `.simple`
runner doesn't need the per-test re-init the stdlib does — once
suffices and is cheaper than spinning up a worker pool 160 times.
Same memory address as the comptime-baked userdata pointers, so
every previously-broken call site now reaches a real Threaded.
Plus a regression test under tests/ that exercises tmpDir +
writeFile + readFileAlloc and the file-not-found readFileAlloc
path — both deadlocked on linux before this fix; both return in
ms after. Routed through the `.simple` runner on purpose: a
default-runner version would always pass because the stdlib
runner does its own init.
Verified end-to-end via labelle-toolkit/labelle-gui@6e11736: that
project's test suite hung for 6h on every linux CI run since the
0.16 migration and now passes 162/162 in <1s after adopting an
equivalent fix as a project-local beforeAll hook. This PR pulls
the fix into zspec so consumers don't have to.
Closes #44.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Fixes a Linux-only deadlock in zspec’s .simple test_runner when tests use std.testing.io by initializing std.testing.io_instance in the custom runner, and adds a regression test to ensure this remains covered.
Changes:
- Initialize/deinitialize
std.testing.io_instanceinsrc/runner.zigsostd.testing.iohas a properly initializedIo.Threadedworker pool. - Add a regression test that exercises the previously-deadlocking
std.testing.iocode paths under the.simplerunner. - Wire the new regression test into
zig build testviabuild.zig, explicitly using the.simplerunner.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/runner.zig |
Adds one-time initialization of std.testing.io_instance for the .simple runner to prevent Linux deadlocks. |
tests/testing_io_test.zig |
New regression tests covering tmpDir + writeFile + readFileAlloc and missing-path readFileAlloc under std.testing.io. |
build.zig |
Adds the new regression test executable and includes it in the main test build step, routed through the .simple runner. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+46
to
+48
| // worker_threads = null, etc. — and every operation that needs to | ||
| // enqueue work onto the thread pool deadlocks on linux. macOS | ||
| // satisfies more zero-init paths, so the bug only shows up there. |
Copilot flagged that the sentence "macOS satisfies more zero-init paths, so the bug only shows up there" reads as if the deadlock is a macOS bug, while the rest of the comment correctly attributes it to linux. Reworded to make the asymmetry explicit: macOS hides the bug, linux exposes it.
3 tasks
apotema
added a commit
to labelle-toolkit/labelle-gui
that referenced
this pull request
May 15, 2026
…esourceDef (#121) * test(fixtures): introduce zspec Factory pattern for ProjectConfig + ResourceDef Bumps zspec to v0.9.2 — picks up the `.simple`-runner io_instance init fix shipped this session (apotema/zspec#45) — and lands a small proof-of-value slice for zspec's `Factory` module before tackling #120's TE-test triage and #109's Game View TE coverage. Why factories now: `project_settings_edit_save` (one of the 5 long- silent-passing tests #120 will triage) needs a populated ProjectConfig built per-test. Today's tests either run on defaults or hand-build the struct with arena.dupe per string field. Factories collapse that to `Factory.build(.{ ...overrides })` and validate field names at comptime. What changed: - `src/test_fixtures.zig` + `src/test_fixtures/resource.zon`: two factories, one via `Factory.defineFrom` (leaf type — `ResourceDef`, fixture in a .zon file with field-name typo detection) and one via `Factory.define` (outer type — `ProjectConfig`, in-Zig literal so the `resources: []const ResourceDef = &.{}` slice field can be expressed as `@as([]const ResourceDef, &.{})`. `defineFrom` can't coerce empty slices from .zon syntax; the module doc-comment captures this). - `src/tests.zig`: converts `resources round-trip through save + load` to build its `ResourceDef` via `ResourceFactory.build(.{})`, dropping the manual `arena.dupe` per field. Adds a sibling `ProjectConfigFactory overrides round-trip through save + load` that builds a 7-field-override config, round-trips through save/load, asserts every override survived and untouched fields kept their fixture defaults — direct precursor to #120's `project_settings_edit_save` shape. - `build.zig.zon`: zspec v0.9.1 → v0.9.2. Test plan: `zig build test` — 175/175 pass (was 174 on v0.9.2 baseline, +1 for the new factory-override round-trip test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(build): restore nfd import on gui_tests_exe (regression from #115 squash) main has been red on Linux CI since da97953 (#115 squash-merge) with: src/app.zig:11: error: no module named 'nfd' available within module 'root' The original #106 commit added `gui_tests_exe.root_module.addImport("nfd", nfd.module("nfd"))` because once TE callbacks were made `pub fn` (so `std.meta.hasFn` could find them), the `App.renderFrame` paths the tests exercise pulled in the nfd-using file-dialog code, which wasn't imported on the test target. When #115 was rebased / cherry-picked through review-fix cycles, that single line got dropped from build.zig. Subsequent merges of #115 itself and #119 inherited the broken state and CI started failing on every push without it being obvious whose change introduced it. This restores the addImport with an explanatory comment so future rebases preserve it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * review(#121): address gemini, cursor, copilot bot comments - build.zig.zon: leave the `zspec-0.9.1-...` hash prefix as-is and explain why in a comment. Both gemini and copilot suggested changing it to `zspec-0.9.2-...` to match the URL, but Zig 0.16's package manager bakes upstream's own `build.zig.zon` `.version` into the hash and rejects manual edits ("hash mismatch: manifest declares zspec-0.9.2-... but the fetched package has zspec-0.9.1-..."). Upstream's `.version` still reads "0.9.1" at the v0.9.2 tag; the prefix is cosmetic metadata sourced from there. Only the base64 portion is integrity-relevant. - src/tests.zig (cursor + copilot): the `engine_version = "1.35.0"` override matched both the factory default and the ProjectConfig struct default, making it a no-op. Bumped to "1.36.0" so the assertion actually exercises an override path. - src/tests.zig (copilot): the "non-overridden fields keep their fixture defaults" comment claimed coverage of every untouched field but only checked `ecs` + `target_fps`. Expanded the assertions to also cover `description`, `core_version`, `gfx_version`, `assembler_version`, and `resources.len = 0`. `zig build test` — 180/180 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
apotema
added a commit
to labelle-toolkit/labelle-engine
that referenced
this pull request
May 26, 2026
#583) (#590) * fix(deps): bump zspec to v0.9.2 to init std.testing.io_instance (closes #583) The CI hang under registry_scan_spec on x86_64-linux had nothing to do with `Io.Threaded` worker-pool exhaustion or `prefab_cache.scanDir` itself — the engine's pinned `zspec` was v0.9.1, the release *before* apotema/zspec#45 ("init std.testing.io_instance"). With v0.9.1 the spec runner never ran `std.testing.io_instance = .init(allocator, .{})`, so the global stayed as Zig's `undefined`-pattern `0xaaaaaaaa…` bytes. The first `std.testing.io.*` call any spec made — concretely `tmpDir()`'s opening `io.random(...)` — deadlocked deterministically: 1. `random(userdata=&io_instance, …)` casts the uninitialized bytes to `*Threaded` and hands them to `randomMainThread`. 2. `randomMainThread` calls `mutexLock(&t.mutex)`. 3. `t.mutex.state.raw == 0xaaaaaaaa` — neither `.unlocked` (0) nor `.locked_once` (1) nor `.contended` (2). The opening `cmpxchgStrong(.unlocked, .locked_once, …)` fails; the `swap(.contended, .acquire)` returns the garbage value, which compares `!= .unlocked`, so the thread enters `Thread.futexWaitUncancelable` on a futex no other thread will ever wake. The process has one TID parked in `futex_wait_queue` forever (confirmed via `/proc/$pid/task` under `docker --platform=linux/amd64 --cpus=2 ubuntu:24.04`). macOS and Windows masked the bug because their memory-init paths gave the mutex bytes a value `mutexLock` could recover from — the exact same bug, but only deterministic on x86_64-linux Debug. apotema/zspec#45 fixes it by initializing `io_instance` once in the runner's `main()`. The first release containing it is v0.9.2; this commit bumps the dependency hash to point there. Verification: - `docker --platform=linux/amd64 --cpus=2 -m 7g`: `zig build spec` now passes 28/28 in 102 ms (was: hang → CI timeout / exit 124). - macOS arm64: `zig build spec` 28/28 in 46 ms; `zig build test` full suite green in 16 s. - Docker amd64 `zig build test` full suite green in 1 m 22 s — well within the 10-min CI timeout that #585 added. Drops the Linux gate that #585 introduced as an emergency workaround; re-exports `RegistryScanSpec` unconditionally so every platform now runs the full RFC #561 / #577 registry-scan coverage. Also corrects the `io_helper.io()` comment, which had attributed the original hang to dual-pool `sigaction` racing — the real cause is documented above. * ci: retrigger after engine v1.45.0 tag
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a linux-only deadlock for every consumer using zspec's
mode: .simpletest_runner whose tests touchstd.testing.io.std.testingdeclarespub var io_instance: Io.Threaded = undefinedandpub const io = io_instance.io().Threaded.io(t)bakes&io_instanceinto the returned Io'suserdataat comptime — so without a runtime init, every call site is talking to a zero-initialized Threaded (no worker pool). On linux that deadlocks; on macOS more zero-init paths happen to work, hiding the bug.The stdlib's terminal runner already does the right thing (lib/compiler/test_runner.zig:273-282 — per-test init+deinit). zspec's
.simplerunner didn't.Fix
Single-shot
testing.io_instance = .init(testing.allocator, .{})at the top ofrunner.main(). Same memory address that comptime-bakedIo.userdatapointers already reference — they now point at a properly initialized Threaded with a real worker pool.Per-test re-init (matching the stdlib) is cheaper to skip in a
.simplerunner: 160 tests × spinning up a worker pool is way more than one global init covers.Test plan
tests/testing_io_test.zigexercises the two deadlock signatures from runner does not initialize std.testing.io_instance — tests using std.testing.io deadlock on linux #44 (tmpDir + writeFile round-trip; readFileAlloc on a missing path) under the.simplerunner. Both passed in <5 ms locally and in anubuntu:24.04docker container; both would have hung for 6 h on linux before this fix.zig build testpasses locally on macOS (57/57) and in the linux container.beforeAllhook — that project's test suite had been timing out at 6 h on every linux CI run since the 0.16 migration; it now passes 162/162 in <1 s.Closes #44.
🤖 Generated with Claude Code