fix(leyline): configurable daemon-start timeout + crash-vs-timeout diagnostics (mache-0a1ded) #468
…h-vs-timeout diagnostics

Auto-started leyline daemons failed with "socket did not appear within 5s" on cold starts (first run, arena init, or contention with co-tenant daemons on the shared ~/.mache/default.arena). Two problems: the 5s wait was hardcoded and too short, and a daemon that *crashed* on startup was reported as a timeout after a full 5s wait for a socket that could never appear.

- Timeout is now configurable via MACHE_LEYLINE_START_TIMEOUT (Go duration or bare seconds), default raised 5s → 15s.
- The poll loop watches the process: if it exits before the socket appears, return immediately with "exited during startup: <exit status>" instead of waiting out the timeout. The timeout error now names the arena being contended and points at MACHE_LEYLINE_START_TIMEOUT.

Tests: env-parsing table for leylineStartTimeout; the existing timeout test now uses a sleeper fake + 1s override (fast, still covers the timeout path); new crash-path test asserts fast exit-detection.

Cross-process arena isolation (the other half of the contention story) is left as follow-up on mache-0a1ded / the mache-823d91 reliability half.
find_smells (advisory)
Scoped to files changed in this PR. Rules below run on the standalone (no-LLO) cross-ref tables.
fan_out_skew: 1 finding(s) in changed files
Rules: the registry is |
Pull request overview
This PR improves internal/leyline daemon auto-start reliability by making the startup socket wait configurable and by distinguishing between a daemon that is still initializing vs one that exited before binding its socket.
Changes:
- Add `MACHE_LEYLINE_START_TIMEOUT` parsing (Go duration or bare seconds) and raise the default startup wait to 15s.
- Detect “daemon exited during startup” by watching the spawned process while polling for the socket, returning immediately on exit.
- Update and add tests to cover env parsing, the real timeout path (fast with `1s`), and the crash-during-startup path.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| internal/leyline/start_timeout_test.go | Adds table tests for timeout env parsing plus a guard on the default timeout. |
| internal/leyline/socket.go | Implements configurable startup timeout and crash-vs-timeout diagnostics during daemon auto-start. |
| internal/leyline/socket_test.go | Updates timeout-path test to use a sleeper fake with 1s timeout; adds a new crash-path test. |
…efore binding (Copilot #468)

fmt.Errorf(...%w, nil) doesn't return nil (Errorf is never nil), but it does render an ugly "%!w(<nil>)". Guard werr==nil in the crash branch and return an explicit 'exited cleanly (status 0)' error: a status-0 exit before the socket binds is still a failure. Adds a test for the exit-0 edge.
The bug
Auto-started leyline daemons failed with `socket did not appear within 5s` on cold starts: first run, arena init/enrichment setup, or contention with co-tenant daemons on the shared `~/.mache/default.arena` (surfaced with ≥1 `mache serve` already live). Two root problems:
- The 5s wait was hardcoded and too short.
- A daemon that crashed on startup was reported as a timeout, after a full 5s wait for a socket that could never appear.

Fix
- The timeout is now configurable via `MACHE_LEYLINE_START_TIMEOUT` (a Go duration like `30s`, or a bare integer = seconds). Default raised 5s → 15s.
- The poll loop watches the process: if it exits before the socket appears, it returns immediately with `exited during startup: <exit status>` instead of waiting out the timeout. The timeout error now names the contended arena and points at `MACHE_LEYLINE_START_TIMEOUT`.

Tests
- `leylineStartTimeout` env-parsing table (duration, bare-seconds, garbage/zero/negative fallback, whitespace) + a guard that the default stays ≥10s.
- The existing timeout test now uses a sleeper fake with `MACHE_LEYLINE_START_TIMEOUT=1s`: fast, and still exercises the real timeout path (previously a `/bin/sh` fake that exited immediately, which now correctly takes the crash branch).
- A new crash-path test asserts fast exit detection and the `exited during startup` message (not a timeout).

Not in scope (follow-up)
The cross-process arena isolation half of the contention story (two separate `mache serve` processes racing on the same `~/.mache/default.{arena,ctrl,sock}`) is left for follow-up under `mache-0a1ded` / the `mache-823d91` reliability half. This PR removes the spurious-timeout failures and makes the knob available.

Verification
- `go test ./internal/leyline/` → ok (full package)
- `go vet` + `golangci-lint` (pre-commit) → pass