fix(deps): bump zspec to v0.9.2 to init std.testing.io_instance (closes #583) #590

Merged
apotema merged 2 commits into main from fix/583-root-cause on May 26, 2026

Conversation

@apotema apotema commented May 26, 2026

Contributor

Summary

The CI deadlock under registry_scan_spec on x86_64-linux had nothing to do with Io.Threaded worker-pool exhaustion or prefab_cache.scanDir. The engine pinned zspec at v0.9.1, the release before apotema/zspec#45 ("init std.testing.io_instance"). v0.9.1's spec runner never ran std.testing.io_instance = .init(allocator, .{}), so the stdlib global stayed as Zig's undefined-pattern bytes (0xaaaaaaaa…).

The first std.testing.io.* call any spec made — concretely tmpDir()'s opening io.random(...) — then deadlocked deterministically:

  1. random(userdata = &io_instance, …) casts the uninitialized bytes to *Threaded and hands them to randomMainThread.
  2. randomMainThread calls mutexLock(&t.mutex).
  3. t.mutex.state.raw == 0xaaaaaaaa — neither .unlocked (0) nor .locked_once (1) nor .contended (2). The opening cmpxchgStrong(.unlocked, .locked_once, …) fails; the swap(.contended, .acquire) returns the garbage value, which compares != .unlocked, so the thread enters Thread.futexWaitUncancelable on a futex no other thread will ever wake.
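
The three steps above can be modelled outside Zig. The sketch below is a single-threaded Python model of a three-state futex mutex, not the stdlib's actual code: `AtomicU32`, `mutex_lock_outcome`, and the returned strings are hypothetical names, and the futex wait is represented by a return value rather than a real block.

```python
# State values as listed in step 3; the 0xAA fill is Zig's `undefined` pattern.
UNLOCKED, LOCKED_ONCE, CONTENDED = 0, 1, 2
UNDEFINED_PATTERN = 0xAAAAAAAA


class AtomicU32:
    """Hypothetical stand-in for an atomic u32 (single-threaded model)."""

    def __init__(self, raw):
        self.raw = raw

    def cmpxchg_strong(self, expected, new):
        # Returns None on success, otherwise the value actually observed.
        if self.raw == expected:
            self.raw = new
            return None
        return self.raw

    def swap(self, new):
        # Stores `new` and returns the previous value.
        old, self.raw = self.raw, new
        return old


def mutex_lock_outcome(state):
    """Report what a lock attempt would do, without actually blocking."""
    # Fast path: only succeeds when the state is exactly UNLOCKED.
    if state.cmpxchg_strong(UNLOCKED, LOCKED_ONCE) is None:
        return "acquired"
    # Slow path: mark the mutex contended and inspect the previous value.
    if state.swap(CONTENDED) == UNLOCKED:
        return "acquired"
    # Any other previous value, including garbage, means "someone holds it":
    # the real code would now sleep on the futex until an unlock wakes it.
    return "futex wait"
```

In this model `mutex_lock_outcome(AtomicU32(UNDEFINED_PATTERN))` returns `"futex wait"`, while starting from `UNLOCKED` returns `"acquired"`. With no thread holding the lock, nothing ever issues the wake, which is the hang described in step 3.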

Reproduced under docker --platform=linux/amd64 --cpus=2 -m 7g ubuntu:24.04: a single TID parked in futex_wait_queue forever (confirmed via /proc/$pid/task/*/wchan). macOS and Windows masked the bug because their memory-init paths happened to leave the mutex bytes in a state mutexLock could recover from.
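
For anyone repeating the wchan check, here is a minimal sketch of one way to list the threads parked in a given kernel wait symbol. The function name `threads_in_wchan` and the `proc_root` parameter are hypothetical (the latter exists only so the function can be pointed at a fake tree); it assumes the kernel exposes `/proc/<pid>/task/<tid>/wchan` and reports `futex_wait_queue` there, as observed in the reproduction above.

```python
from pathlib import Path


def threads_in_wchan(pid, symbol="futex_wait_queue", proc_root="/proc"):
    """Return the TIDs of `pid` whose wchan file names `symbol`."""
    task_dir = Path(proc_root) / str(pid) / "task"
    parked = []
    for wchan in sorted(task_dir.glob("*/wchan")):
        try:
            if wchan.read_text().strip() == symbol:
                parked.append(int(wchan.parent.name))
        except OSError:
            # The thread exited, or the file is not readable; skip it.
            continue
    return parked
```

A result that keeps returning the same single TID across repeated calls matches the "one TID parked forever" observation.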

This PR bumps the zspec dependency hash to v0.9.2, the first release that includes apotema/zspec#45's io_instance init. With that change the runner initializes the global before any test runs, so the mutex starts out .unlocked and tmpDir() returns immediately.

Drops the Linux gate that #585 introduced as an emergency workaround. Re-exports RegistryScanSpec unconditionally so every platform now runs the full RFC #561 / #577 registry-scan coverage. Also corrects the io_helper.io() comment, which had attributed the original hang to dual-pool sigaction racing — the real cause is what's described above.

Test plan

  • docker --platform=linux/amd64 --cpus=2 -m 7g: zig build spec passes 28/28 in 102 ms (was: hang → exit 124).
  • Docker amd64 zig build test full suite green in 1 m 22 s, well within the 10-min CI timeout that #585 added.
  • macOS arm64 zig build spec 28/28 in 46 ms; zig build test full suite green in 16 s.
  • CI run on this PR confirms the same on real ubuntu-latest.

fix(deps): bump zspec to v0.9.2 to init std.testing.io_instance (closes #583)

The CI hang under registry_scan_spec on x86_64-linux had nothing to do
with `Io.Threaded` worker-pool exhaustion or `prefab_cache.scanDir`
itself — the engine's pinned `zspec` was v0.9.1, the release *before*
apotema/zspec#45 ("init std.testing.io_instance"). With v0.9.1 the
spec runner never ran `std.testing.io_instance = .init(allocator, .{})`,
so the global stayed as Zig's `undefined`-pattern `0xaaaaaaaa…` bytes.

The first `std.testing.io.*` call any spec made — concretely
`tmpDir()`'s opening `io.random(...)` — deadlocked deterministically:

  1. `random(userdata=&io_instance, …)` casts the uninitialized
     bytes to `*Threaded` and hands them to `randomMainThread`.
  2. `randomMainThread` calls `mutexLock(&t.mutex)`.
  3. `t.mutex.state.raw == 0xaaaaaaaa` — neither `.unlocked` (0)
     nor `.locked_once` (1) nor `.contended` (2). The opening
     `cmpxchgStrong(.unlocked, .locked_once, …)` fails; the
     `swap(.contended, .acquire)` returns the garbage value, which
     compares `!= .unlocked`, so the thread enters
     `Thread.futexWaitUncancelable` on a futex no other thread
     will ever wake. The process has one TID parked in
     `futex_wait_queue` forever (confirmed via `/proc/$pid/task`
     under `docker --platform=linux/amd64 --cpus=2 ubuntu:24.04`).

macOS and Windows masked the bug because their memory-init paths
gave the mutex bytes a value `mutexLock` could recover from — the
exact same bug, but only deterministic on x86_64-linux Debug.

apotema/zspec#45 fixes it by initializing `io_instance` once in the
runner's `main()`. The first release containing it is v0.9.2; this
commit bumps the dependency hash to point there.

Verification:
- `docker --platform=linux/amd64 --cpus=2 -m 7g`: `zig build spec`
  now passes 28/28 in 102 ms (was: hang → CI timeout / exit 124).
- macOS arm64: `zig build spec` 28/28 in 46 ms; `zig build test`
  full suite green in 16 s.
- Docker amd64 `zig build test` full suite green in 1 m 22 s — well
  within the 10-min CI timeout that #585 added.

Drops the Linux gate that #585 introduced as an emergency workaround;
re-exports `RegistryScanSpec` unconditionally so every platform now
runs the full RFC #561 / #577 registry-scan coverage. Also corrects
the `io_helper.io()` comment, which had attributed the original hang
to dual-pool `sigaction` racing — the real cause is documented above.

cursor Bot commented May 26, 2026

PR Summary

Low Risk
Test-only dependency bump and re-enabling specs on Linux; no production runtime or auth/data-path changes.

Overview
Fixes the x86_64-linux CI hang in BDD specs by upgrading the lazy test dependency zspec from v0.9.1 to v0.9.2, which initializes std.testing.io_instance before any spec runs (apotema/zspec#45). Without that init, the first std.testing.io.* use (e.g. tmpDir() → io.random()) could deadlock on an uninitialized mutex on Linux runners.

Removes the Linux-only workaround that skipped RegistryScanSpec in spec/spec_tests.zig, so registry-scan specs run on every platform again as part of zig build spec / zig build test.

Updates comments in io_helper.zig to attribute the hang to the zspec runner bug rather than dual Io.Threaded pools; the existing test-binary path that reuses std.testing.io_instance is unchanged and kept as defence-in-depth.

Reviewed by Cursor Bugbot for commit 0b3250a.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request upgrades the zspec dependency to v0.9.2 to resolve a deadlock issue on Linux runners caused by uninitialized testing IO instances. This allows the registry_scan_spec tests to be re-enabled on Linux. A review comment points out that the dependency hash prefix in build.zig.zon still references 0.9.1, which could cause build failures and should be updated to 0.9.2.

Comment thread build.zig.zon
// bug (different memory-init paths); the x86_64-linux runner
// hit it deterministically. Closes #583.
.url = "https://github.com/apotema/zspec/archive/v0.9.2.tar.gz",
.hash = "zspec-0.9.1-jaKLbXgMBACFwbNjflhbMyP113leoYjboxn-1UOP-FGw",

high

The hash prefix zspec-0.9.1- is used for the v0.9.2 dependency. If the zspec package updated its version to "0.9.2" in its own build.zig.zon for the v0.9.2 release, this mismatch will cause a build failure during dependency resolution. Please verify the correct hash by running zig fetch https://github.com/apotema/zspec/archive/v0.9.2.tar.gz and update the hash prefix accordingly (likely to zspec-0.9.2- if the version was bumped in the upstream package).

            .hash = "zspec-0.9.2-jaKLbXgMBACFwbNjflhbMyP113leoYjboxn-1UOP-FGw",

Contributor Author

Verified against the actual fetch — this is a wrong-direction correction. Running zig fetch https://github.com/apotema/zspec/archive/v0.9.2.tar.gz returns:

zspec-0.9.1-jaKLbXgMBACFwbNjflhbMyP113leoYjboxn-1UOP-FGw

The hash prefix is derived from the upstream package's own build.zig.zon .version field, and apotema/zspec v0.9.2 was tagged without bumping that internal version string (still "0.9.1"). The current hash matches zig fetch output verbatim; changing the prefix to zspec-0.9.2- would actually break dependency resolution. Declining.
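
To make the point about the prefix concrete, here is a small sketch that splits a hash string of the `name-version-digest` shape shown above. `split_package_hash` is a hypothetical helper, not part of any Zig tooling, and it assumes the name contains no hyphen and the version has no pre-release suffix; the digest may itself contain hyphens, so only the first two are treated as separators.

```python
def split_package_hash(pkg_hash):
    """Split 'name-version-digest' into its three parts."""
    # maxsplit=2 keeps any hyphens inside the digest intact.
    name, version, digest = pkg_hash.split("-", 2)
    return name, version, digest


name, version, digest = split_package_hash(
    "zspec-0.9.1-jaKLbXgMBACFwbNjflhbMyP113leoYjboxn-1UOP-FGw"
)
# `version` reflects the package's own .version field, not the v0.9.2 tag
# in the download URL, so the two are allowed to differ.
```

Here `version` comes out as `"0.9.1"` even though the archive URL points at the v0.9.2 tag, which is the situation described in this reply.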

apotema commented May 26, 2026

Contributor Author

@copilot review

@apotema apotema merged commit 7ff4408 into main May 26, 2026
3 checks passed
@apotema apotema deleted the fix/583-root-cause branch May 26, 2026 14:44
apotema added a commit that referenced this pull request May 26, 2026
v1.45.0 was tagged with #588 alone (ScreenshotRequest) but shipped a
broken std.Thread.Mutex declaration in io_helper.zig that downstream
game builds hit (engine's own CI passed via builtin.is_test
short-circuit). v1.46.0 consolidates:

- #588 — engine.ScreenshotRequest env-var helper
- #591 — fix(io_helper): std.Thread.Mutex → std.atomic.Mutex
- #589 — engine.requestedScene() for cli#229 runtime override
- #590 — zspec v0.9.2 (closes #583 RegistryScanSpec deadlock)