Skip to content

fix(runner): init std.testing.io_instance — fixes #44#45

Merged
apotema merged 2 commits into
mainfrom
fix/44-init-testing-io-instance
May 15, 2026
Merged

fix(runner): init std.testing.io_instance — fixes #44#45
apotema merged 2 commits into
mainfrom
fix/44-init-testing-io-instance

Conversation

@apotema

@apotema apotema commented May 15, 2026

Copy link
Copy Markdown
Owner

Summary

Fixes a linux-only deadlock for every consumer using zspec's mode: .simple test_runner whose tests touch std.testing.io.

std.testing declares pub var io_instance: Io.Threaded = undefined and pub const io = io_instance.io(). Threaded.io(t) bakes &io_instance into the returned Io's userdata at comptime — so without a runtime init, every call site is talking to a zero-initialized Threaded (no worker pool). On linux that deadlocks; on macOS more zero-init paths happen to work, hiding the bug.

The stdlib's terminal runner already does the right thing (lib/compiler/test_runner.zig:273-282 — per-test init+deinit). zspec's .simple runner didn't.

Fix

Single-shot testing.io_instance = .init(testing.allocator, .{}) at the top of runner.main(). Same memory address that comptime-baked Io.userdata pointers already reference — they now point at a properly initialized Threaded with a real worker pool.

Per-test re-init (matching the stdlib) is cheaper to skip in a .simple runner: 160 tests × spinning up a worker pool is way more than one global init covers.

Test plan

  • New tests/testing_io_test.zig exercises the two deadlock signatures from runner does not initialize std.testing.io_instance — tests using std.testing.io deadlock on linux #44 (tmpDir + writeFile round-trip; readFileAlloc on a missing path) under the .simple runner. Both passed in <5 ms locally and in an ubuntu:24.04 docker container; both would have hung for 6 h on linux before this fix.
  • Full zig build test passes locally on macOS (57/57) and in the linux container.
  • End-to-end consumer verification: labelle-toolkit/labelle-gui@6e11736 carried an equivalent fix as a project-local beforeAll hook — that project's test suite had been timing out at 6 h on every linux CI run since the 0.16 migration; it now passes 162/162 in <1 s.

Closes #44.

🤖 Generated with Claude Code

In Zig 0.16 `std.testing` declares:

    pub var io_instance: Io.Threaded = undefined;
    pub const io = if (builtin.is_test) io_instance.io() else ...;

`Threaded.io(t)` bakes `&io_instance` into the returned Io's
`userdata` *at comptime* (lib/std/Io/Threaded.zig:1806). Without
a runtime init, the global memory at that address stays zero —
worker_threads = null, etc. — and every operation that needs to
enqueue work onto the thread pool deadlocks on linux. macOS lets
more zero-init paths through so the bug only surfaces in CI.

The stdlib's terminal runner already handles this (see
lib/compiler/test_runner.zig:273-282) — it does
`testing.io_instance = .init(testing.allocator, .{...})` before
every test and tears down after. zspec's `mode: .simple` runner
never did, so every consumer routing `.simple`-mode tests through
zspec was actually running against an uninitialized Io.Threaded
the whole time — only ops that took fast paths happened to work.

Add a single-shot init at the top of `runner.main()`. A `.simple`
runner doesn't need the per-test re-init the stdlib does — once
suffices and is cheaper than spinning up a worker pool 160 times.
Same memory address as the comptime-baked userdata pointers, so
every previously-broken call site now reaches a real Threaded.

Plus a regression test under tests/ that exercises tmpDir +
writeFile + readFileAlloc and the file-not-found readFileAlloc
path — both deadlocked on linux before this fix; both return in
ms after. Routed through the `.simple` runner on purpose: a
default-runner version would always pass because the stdlib
runner does its own init.

Verified end-to-end via labelle-toolkit/labelle-gui@6e11736: that
project's test suite hung for 6h on every linux CI run since the
0.16 migration and now passes 162/162 in <1s after adopting an
equivalent fix as a project-local beforeAll hook. This PR pulls
the fix into zspec so consumers don't have to.

Closes #44.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a Linux-only deadlock in zspec’s .simple test_runner when tests use std.testing.io by initializing std.testing.io_instance in the custom runner, and adds a regression test to ensure this remains covered.

Changes:

  • Initialize/deinitialize std.testing.io_instance in src/runner.zig so std.testing.io has a properly initialized Io.Threaded worker pool.
  • Add a regression test that exercises the previously-deadlocking std.testing.io code paths under the .simple runner.
  • Wire the new regression test into zig build test via build.zig, explicitly using the .simple runner.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
src/runner.zig Adds one-time initialization of std.testing.io_instance for the .simple runner to prevent Linux deadlocks.
tests/testing_io_test.zig New regression tests covering tmpDir + writeFile + readFileAlloc and missing-path readFileAlloc under std.testing.io.
build.zig Adds the new regression test executable and includes it in the main test build step, routed through the .simple runner.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/runner.zig Outdated
Comment on lines +46 to +48
// worker_threads = null, etc. — and every operation that needs to
// enqueue work onto the thread pool deadlocks on linux. macOS
// satisfies more zero-init paths, so the bug only shows up there.
Copilot flagged that the sentence "macOS satisfies more zero-init
paths, so the bug only shows up there" reads as if the deadlock is
a macOS bug, while the rest of the comment correctly attributes it
to linux. Reworded to make the asymmetry explicit: macOS hides the
bug, linux exposes it.
@apotema apotema merged commit ebe3b85 into main May 15, 2026
4 checks passed
@apotema apotema deleted the fix/44-init-testing-io-instance branch May 15, 2026 21:13
apotema added a commit to labelle-toolkit/labelle-gui that referenced this pull request May 15, 2026
…esourceDef (#121)

* test(fixtures): introduce zspec Factory pattern for ProjectConfig + ResourceDef

Bumps zspec to v0.9.2 — picks up the `.simple`-runner io_instance init
fix shipped this session (apotema/zspec#45) — and lands a small
proof-of-value slice for zspec's `Factory` module before tackling #120's
TE-test triage and #109's Game View TE coverage.

Why factories now: `project_settings_edit_save` (one of the 5 long-
silent-passing tests #120 will triage) needs a populated ProjectConfig
built per-test. Today's tests either run on defaults or hand-build the
struct with arena.dupe per string field. Factories collapse that to
`Factory.build(.{ ...overrides })` and validate field names at comptime.

What changed:

- `src/test_fixtures.zig` + `src/test_fixtures/resource.zon`: two
  factories, one via `Factory.defineFrom` (leaf type — `ResourceDef`,
  fixture in a .zon file with field-name typo detection) and one via
  `Factory.define` (outer type — `ProjectConfig`, in-Zig literal so the
  `resources: []const ResourceDef = &.{}` slice field can be expressed
  as `@as([]const ResourceDef, &.{})`. `defineFrom` can't coerce empty
  slices from .zon syntax; the module doc-comment captures this).

- `src/tests.zig`: converts `resources round-trip through save + load`
  to build its `ResourceDef` via `ResourceFactory.build(.{})`, dropping
  the manual `arena.dupe` per field. Adds a sibling
  `ProjectConfigFactory overrides round-trip through save + load` that
  builds a 7-field-override config, round-trips through save/load,
  asserts every override survived and untouched fields kept their
  fixture defaults — direct precursor to #120's
  `project_settings_edit_save` shape.

- `build.zig.zon`: zspec v0.9.1 → v0.9.2.

Test plan: `zig build test` — 175/175 pass (was 174 on v0.9.2 baseline,
+1 for the new factory-override round-trip test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(build): restore nfd import on gui_tests_exe (regression from #115 squash)

main has been red on Linux CI since da97953 (#115 squash-merge) with:

    src/app.zig:11: error: no module named 'nfd' available within module 'root'

The original #106 commit added
`gui_tests_exe.root_module.addImport("nfd", nfd.module("nfd"))` because
once TE callbacks were made `pub fn` (so `std.meta.hasFn` could find
them), the `App.renderFrame` paths the tests exercise pulled in the
nfd-using file-dialog code, which wasn't imported on the test target.

When #115 was rebased / cherry-picked through review-fix cycles, that
single line got dropped from build.zig. Subsequent merges of #115
itself and #119 inherited the broken state and CI started failing on
every push without it being obvious whose change introduced it.

This restores the addImport with an explanatory comment so future
rebases preserve it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* review(#121): address gemini, cursor, copilot bot comments

- build.zig.zon: leave the `zspec-0.9.1-...` hash prefix as-is and
  explain why in a comment. Both gemini and copilot suggested
  changing it to `zspec-0.9.2-...` to match the URL, but Zig 0.16's
  package manager bakes upstream's own `build.zig.zon` `.version`
  into the hash and rejects manual edits ("hash mismatch: manifest
  declares zspec-0.9.2-... but the fetched package has zspec-0.9.1-...").
  Upstream's `.version` still reads "0.9.1" at the v0.9.2 tag; the
  prefix is cosmetic metadata sourced from there. Only the base64
  portion is integrity-relevant.

- src/tests.zig (cursor + copilot): the `engine_version = "1.35.0"`
  override matched both the factory default and the ProjectConfig
  struct default, making it a no-op. Bumped to "1.36.0" so the
  assertion actually exercises an override path.

- src/tests.zig (copilot): the "non-overridden fields keep their
  fixture defaults" comment claimed coverage of every untouched
  field but only checked `ecs` + `target_fps`. Expanded the
  assertions to also cover `description`, `core_version`,
  `gfx_version`, `assembler_version`, and `resources.len = 0`.

`zig build test` — 180/180 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
apotema added a commit to labelle-toolkit/labelle-engine that referenced this pull request May 26, 2026
#583) (#590)

* fix(deps): bump zspec to v0.9.2 to init std.testing.io_instance (closes #583)

The CI hang under registry_scan_spec on x86_64-linux had nothing to do
with `Io.Threaded` worker-pool exhaustion or `prefab_cache.scanDir`
itself — the engine's pinned `zspec` was v0.9.1, the release *before*
apotema/zspec#45 ("init std.testing.io_instance"). With v0.9.1 the
spec runner never ran `std.testing.io_instance = .init(allocator, .{})`,
so the global stayed as Zig's `undefined`-pattern `0xaaaaaaaa…` bytes.

The first `std.testing.io.*` call any spec made — concretely
`tmpDir()`'s opening `io.random(...)` — deadlocked deterministically:

  1. `random(userdata=&io_instance, …)` casts the uninitialized
     bytes to `*Threaded` and hands them to `randomMainThread`.
  2. `randomMainThread` calls `mutexLock(&t.mutex)`.
  3. `t.mutex.state.raw == 0xaaaaaaaa` — neither `.unlocked` (0)
     nor `.locked_once` (1) nor `.contended` (2). The opening
     `cmpxchgStrong(.unlocked, .locked_once, …)` fails; the
     `swap(.contended, .acquire)` returns the garbage value, which
     compares `!= .unlocked`, so the thread enters
     `Thread.futexWaitUncancelable` on a futex no other thread
     will ever wake. The process has one TID parked in
     `futex_wait_queue` forever (confirmed via `/proc/$pid/task`
     under `docker --platform=linux/amd64 --cpus=2 ubuntu:24.04`).

macOS and Windows masked the bug because their memory-init paths
gave the mutex bytes a value `mutexLock` could recover from — the
exact same bug, but only deterministic on x86_64-linux Debug.

apotema/zspec#45 fixes it by initializing `io_instance` once in the
runner's `main()`. The first release containing it is v0.9.2; this
commit bumps the dependency hash to point there.

Verification:
- `docker --platform=linux/amd64 --cpus=2 -m 7g`: `zig build spec`
  now passes 28/28 in 102 ms (was: hang → CI timeout / exit 124).
- macOS arm64: `zig build spec` 28/28 in 46 ms; `zig build test`
  full suite green in 16 s.
- Docker amd64 `zig build test` full suite green in 1 m 22 s — well
  within the 10-min CI timeout that #585 added.

Drops the Linux gate that #585 introduced as an emergency workaround;
re-exports `RegistryScanSpec` unconditionally so every platform now
runs the full RFC #561 / #577 registry-scan coverage. Also corrects
the `io_helper.io()` comment, which had attributed the original hang
to dual-pool `sigaction` racing — the real cause is documented above.

* ci: retrigger after engine v1.45.0 tag
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

runner does not initialize std.testing.io_instance — tests using std.testing.io deadlock on linux

2 participants