Skip to content

fix(leyline): configurable daemon-start timeout + crash-vs-timeout diagnostics (mache-0a1ded)#468

Merged
jamestexas merged 2 commits into
mainfrom
dx/mache-0a1ded-leyline-start-timeout
Jul 2, 2026
Merged

fix(leyline): configurable daemon-start timeout + crash-vs-timeout diagnostics (mache-0a1ded)#468
jamestexas merged 2 commits into
mainfrom
dx/mache-0a1ded-leyline-start-timeout

Conversation

@jamestexas

Copy link
Copy Markdown
Contributor

The bug

Auto-started leyline daemons failed with socket did not appear within 5s on cold starts — first run, arena init/enrichment setup, or contention with co-tenant daemons on the shared ~/.mache/default.arena (surfaced with ≥1 mache serve already live). Two root problems:

  1. The 5s socket-appear wait was hardcoded and too short for a cold start.
  2. A daemon that crashed on startup was still reported as a timeout — after a full 5s wait for a socket that could never appear.

Fix

  • Configurable timeout via MACHE_LEYLINE_START_TIMEOUT (a Go duration like 30s, or a bare integer = seconds). Default raised 5s → 15s.
  • Crash vs timeout: the poll loop now watches the spawned process. If it exits before the socket appears, return immediately with exited during startup: <exit status> instead of waiting out the timeout. The timeout error now names the contended arena and points at MACHE_LEYLINE_START_TIMEOUT.

Tests

  • leylineStartTimeout env-parsing table (duration, bare-seconds, garbage/zero/negative fallback, whitespace) + a guard that the default stays ≥10s.
  • The existing timeout test now uses a sleeper fake + MACHE_LEYLINE_START_TIMEOUT=1s — fast, and still exercises the real timeout path (previously a /bin/sh fake that exited immediately, which now correctly takes the crash branch).
  • New crash-path test: a fast-exiting fake asserts exit-detection returns in <5s with the exited during startup message (not a timeout).

Not in scope (follow-up)

The cross-process arena isolation half of the contention story — two separate mache serve processes racing on the same ~/.mache/default.{arena,ctrl,sock} — is left for follow-up under mache-0a1ded / the mache-823d91 reliability half. This PR removes the spurious-timeout failures and makes the knob available.

Verification

  • go test ./internal/leyline/ → ok (full package)
  • go vet + golangci-lint (pre-commit) → pass

…h-vs-timeout diagnostics

Auto-started leyline daemons failed with "socket did not appear within 5s"
on cold starts (first run, arena init, or contention with co-tenant daemons
on the shared ~/.mache/default.arena). Two problems: the 5s wait was
hardcoded and too short, and a daemon that *crashed* on startup was reported
as a timeout after a full 5s wait for a socket that could never appear.

- Timeout is now configurable via MACHE_LEYLINE_START_TIMEOUT (Go duration or
  bare seconds), default raised 5s → 15s.
- The poll loop watches the process: if it exits before the socket appears,
  return immediately with "exited during startup: <exit status>" instead of
  waiting out the timeout. The timeout error now names the arena being
  contended and points at MACHE_LEYLINE_START_TIMEOUT.

Tests: env-parsing table for leylineStartTimeout; the existing timeout test
now uses a sleeper fake + 1s override (fast, still covers the timeout path);
new crash-path test asserts fast exit-detection. Cross-process arena
isolation (the other half of the contention story) is left as follow-up on
mache-0a1ded / the mache-823d91 reliability half.
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown

find_smells (advisory)

Scoped to files changed in this PR. Rules below run on the standalone (no-LLO) cross-ref tables; _ast rules (cyclomatic_complexity, long_function, long_file, magic_int_in_comparison) are not exercised here.

fan_out_skew — 1 finding(s) in changed files
Source Node Metric
internal/leyline/socket.go leyline/functions/DiscoverOrStart/source 33

Rules: the registry is cmd/smell_rules.go (source of truth); see the authoring guide to add one. Advisory only — these are heuristics, not gates.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves internal/leyline daemon auto-start reliability by making the startup socket wait configurable and by distinguishing between a daemon that is still initializing vs one that exited before binding its socket.

Changes:

  • Add MACHE_LEYLINE_START_TIMEOUT parsing (Go duration or bare seconds) and raise the default startup wait to 15s.
  • Detect “daemon exited during startup” by watching the spawned process while polling for the socket, returning immediately on exit.
  • Update and add tests to cover env parsing, the real timeout path (fast with 1s), and the crash-during-startup path.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
internal/leyline/start_timeout_test.go Adds table tests for timeout env parsing plus a guard on the default timeout.
internal/leyline/socket.go Implements configurable startup timeout and crash-vs-timeout diagnostics during daemon auto-start.
internal/leyline/socket_test.go Updates timeout-path test to use a sleeper fake with 1s timeout; adds a new crash-path test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/leyline/socket.go
…efore binding (Copilot #468)

fmt.Errorf(...%w, nil) doesn't return nil (Errorf is never nil), but it does
render an ugly "%!w(<nil>)". Guard werr==nil in the crash branch and return
an explicit 'exited cleanly (status 0)' error — a status-0 exit before the
socket binds is still a failure. Adds a test for the exit-0 edge.
@jamestexas jamestexas enabled auto-merge (squash) July 2, 2026 01:00
@jamestexas jamestexas merged commit 396b354 into main Jul 2, 2026
15 checks passed
@jamestexas jamestexas deleted the dx/mache-0a1ded-leyline-start-timeout branch July 2, 2026 01:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants