fix(deps): bump zspec to v0.9.2 to init std.testing.io_instance (closes #583) #590
The CI hang under registry_scan_spec on x86_64-linux had nothing to do with `Io.Threaded` worker-pool exhaustion or `prefab_cache.scanDir` itself. The engine's pinned `zspec` was v0.9.1, the release *before* apotema/zspec#45 ("init std.testing.io_instance"). With v0.9.1 the spec runner never ran `std.testing.io_instance = .init(allocator, .{})`, so the global stayed as Zig's `undefined`-pattern `0xaaaaaaaa…` bytes.

The first `std.testing.io.*` call any spec made (concretely `tmpDir()`'s opening `io.random(...)`) deadlocked deterministically:

1. `random(userdata=&io_instance, …)` casts the uninitialized bytes to `*Threaded` and hands them to `randomMainThread`.
2. `randomMainThread` calls `mutexLock(&t.mutex)`.
3. `t.mutex.state.raw == 0xaaaaaaaa`, which is none of `.unlocked` (0), `.locked_once` (1), or `.contended` (2). The opening `cmpxchgStrong(.unlocked, .locked_once, …)` fails; the `swap(.contended, .acquire)` returns the garbage value, which compares `!= .unlocked`, so the thread enters `Thread.futexWaitUncancelable` on a futex no other thread will ever wake.

The process has one TID parked in `futex_wait_queue` forever (confirmed via `/proc/$pid/task` under `docker --platform=linux/amd64 --cpus=2 ubuntu:24.04`).

macOS and Windows masked the bug because their memory-init paths gave the mutex bytes a value `mutexLock` could recover from. It is the exact same bug, but only deterministic on x86_64-linux Debug.

apotema/zspec#45 fixes it by initializing `io_instance` once in the runner's `main()`. The first release containing it is v0.9.2; this commit bumps the dependency hash to point there.

Verification:

- `docker --platform=linux/amd64 --cpus=2 -m 7g`: `zig build spec` now passes 28/28 in 102 ms (was: hang → CI timeout / exit 124).
- macOS arm64: `zig build spec` 28/28 in 46 ms; `zig build test` full suite green in 16 s.
- Docker amd64 `zig build test` full suite green in 1 m 22 s, well within the 10-min CI timeout that #585 added.

Drops the Linux gate that #585 introduced as an emergency workaround; re-exports `RegistryScanSpec` unconditionally so every platform now runs the full RFC #561 / #577 registry-scan coverage. Also corrects the `io_helper.io()` comment, which had attributed the original hang to dual-pool `sigaction` racing — the real cause is documented above.
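
The lock sequence in step 3 can be modelled in a few lines. This is a minimal sketch and not the stdlib's implementation: the numeric encoding (0, 1, 2) is taken from the description above, while `wouldPark` and the three constants are hypothetical names introduced only for illustration. It performs the same `cmpxchgStrong` and `swap` on a `std.atomic.Value(u32)` and reports whether the caller would go on to wait on the futex, without ever blocking.

```zig
const std = @import("std");

// Assumed state encoding, as listed in the description above.
const unlocked: u32 = 0;
const locked_once: u32 = 1;
const contended: u32 = 2;

/// Returns true if a lock attempt on `state` would fall through to the
/// futex wait. Models only the two atomic steps; it never actually blocks.
fn wouldPark(state: *std.atomic.Value(u32)) bool {
    // Fast path: cmpxchgStrong returns null only if the state was `unlocked`.
    if (state.cmpxchgStrong(unlocked, locked_once, .acquire, .monotonic) == null) return false;
    // Slow path: mark the lock contended; the previous value decides whether to wait.
    return state.swap(contended, .acquire) != unlocked;
}

test "initialized state takes the fast path" {
    var s = std.atomic.Value(u32).init(unlocked);
    try std.testing.expect(!wouldPark(&s));
    try std.testing.expectEqual(locked_once, s.load(.monotonic));
}

test "0xaa-filled state would park" {
    var s = std.atomic.Value(u32).init(0xaaaaaaaa);
    try std.testing.expect(wouldPark(&s));
}
```

Run with `zig test` on the file. In the second case no thread holds the lock, so nothing would ever issue the wake that a parked waiter depends on, which matches the single TID observed in `futex_wait_queue`.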
PR Summary
Low Risk
Overview: Removes the Linux-only workaround that skipped […]. Updates comments in […].
Reviewed by Cursor Bugbot for commit 0b3250a.
Code Review
This pull request upgrades the zspec dependency to v0.9.2 to resolve a deadlock issue on Linux runners caused by uninitialized testing IO instances. This allows the registry_scan_spec tests to be re-enabled on Linux. A review comment points out that the dependency hash prefix in build.zig.zon still references 0.9.1, which could cause build failures and should be updated to 0.9.2.
    // bug (different memory-init paths); the x86_64-linux runner
    // hit it deterministically. Closes #583.
    .url = "https://github.com/apotema/zspec/archive/v0.9.2.tar.gz",
    .hash = "zspec-0.9.1-jaKLbXgMBACFwbNjflhbMyP113leoYjboxn-1UOP-FGw",
The hash prefix `zspec-0.9.1-` is used for the v0.9.2 dependency. If the zspec package updated its version to "0.9.2" in its own `build.zig.zon` for the v0.9.2 release, this mismatch will cause a build failure during dependency resolution. Please verify the correct hash by running `zig fetch https://github.com/apotema/zspec/archive/v0.9.2.tar.gz` and update the hash prefix accordingly (likely to `zspec-0.9.2-` if the version was bumped in the upstream package).
.hash = "zspec-0.9.2-jaKLbXgMBACFwbNjflhbMyP113leoYjboxn-1UOP-FGw",
Verified against the actual fetch: this is a wrong-direction correction. Running `zig fetch https://github.com/apotema/zspec/archive/v0.9.2.tar.gz` returns:
zspec-0.9.1-jaKLbXgMBACFwbNjflhbMyP113leoYjboxn-1UOP-FGw
The hash prefix is derived from the upstream package's own `build.zig.zon` `.version` field, and apotema/zspec v0.9.2 was tagged without bumping that internal version string (still "0.9.1"). The current hash matches `zig fetch` output verbatim; changing the prefix to `zspec-0.9.2-` would actually break dependency resolution. Declining.
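
For reference, the resulting dependency entry looks roughly like the fragment below. It is a sketch only: the `.zspec` key name and the surrounding `.dependencies` layout are assumptions, while the URL and hash are quoted from this thread.

```zig
// build.zig.zon (fragment). The `.zspec` key name is assumed for illustration.
.dependencies = .{
    .zspec = .{
        // The tag in the URL selects the v0.9.2 release tarball.
        .url = "https://github.com/apotema/zspec/archive/v0.9.2.tar.gz",
        // The "zspec-0.9.1-" prefix mirrors the .version field inside the
        // upstream package's own build.zig.zon, not the git tag, so it can
        // legitimately lag behind the URL.
        .hash = "zspec-0.9.1-jaKLbXgMBACFwbNjflhbMyP113leoYjboxn-1UOP-FGw",
    },
},
```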
@copilot review
v1.45.0 was tagged with #588 alone (ScreenshotRequest) but shipped a broken std.Thread.Mutex declaration in io_helper.zig that downstream game builds hit (engine's own CI passed via builtin.is_test short-circuit).

v1.46.0 consolidates:

- #588: engine.ScreenshotRequest env-var helper
- #591: fix(io_helper): std.Thread.Mutex → std.atomic.Mutex
- #589: engine.requestedScene() for cli#229 runtime override
- #590: zspec v0.9.2 (closes #583 RegistryScanSpec deadlock)
Summary
The CI deadlock under `registry_scan_spec` on x86_64-linux had nothing to do with `Io.Threaded` worker-pool exhaustion or `prefab_cache.scanDir`. The engine pinned `zspec` at v0.9.1, the release before apotema/zspec#45 ("init `std.testing.io_instance`"). v0.9.1's spec runner never ran `std.testing.io_instance = .init(allocator, .{})`, so the stdlib global stayed as Zig's undefined-pattern bytes (`0xaaaaaaaa…`).

The first `std.testing.io.*` call any spec made (concretely `tmpDir()`'s opening `io.random(...)`) then deadlocked deterministically:

1. `random(userdata = &io_instance, …)` casts the uninitialized bytes to `*Threaded` and hands them to `randomMainThread`.
2. `randomMainThread` calls `mutexLock(&t.mutex)`.
3. `t.mutex.state.raw == 0xaaaaaaaa`, which is none of `.unlocked` (0), `.locked_once` (1), or `.contended` (2). The opening `cmpxchgStrong(.unlocked, .locked_once, …)` fails; the `swap(.contended, .acquire)` returns the garbage value, which compares `!= .unlocked`, so the thread enters `Thread.futexWaitUncancelable` on a futex no other thread will ever wake.

Reproduced under `docker --platform=linux/amd64 --cpus=2 -m 7g ubuntu:24.04`: a single TID parked in `futex_wait_queue` forever (confirmed via `/proc/$pid/task/*/wchan`). macOS and Windows masked the bug because their memory-init paths happened to leave the mutex bytes in a state `mutexLock` could recover from.

This PR bumps the `zspec` dependency hash to v0.9.2, the first release that includes apotema/zspec#45's `io_instance` init. With that change the runner zero-initializes the global before any test runs, the mutex is `.unlocked` like it should be, and `tmpDir()` returns immediately.

Drops the Linux gate that #585 introduced as an emergency workaround. Re-exports `RegistryScanSpec` unconditionally so every platform now runs the full RFC #561 / #577 registry-scan coverage. Also corrects the `io_helper.io()` comment, which had attributed the original hang to dual-pool `sigaction` racing; the real cause is what's described above.

Test plan

- `docker --platform=linux/amd64 --cpus=2 -m 7g`: `zig build spec` passes 28/28 in 102 ms (was: hang → exit 124). `zig build test` full suite green in 1 m 22 s, well within the 10-min CI timeout that #585 added.
- macOS arm64: `zig build spec` 28/28 in 46 ms; `zig build test` full suite green in 16 s.
- ubuntu-latest.