Make orch worker start launch a local background worker for the configured master #437

@proboscis

Problem

orch worker start is supposed to mean:

start or reuse the long-lived orch-worker on this host/profile

But the current behavior is still effectively:

tell the connected orch-master to spawn/manage a worker on the master's host

That breaks the desired remote-execution model. In the Zeus master + Mac worker flow, operators currently need to use a manual workaround such as:

tmux new-session -d -s orch-worker-mac 'ORCH_REMOTE=zeus:7777 orch worker run --worker-id host-mac'

The worker itself is supposed to be a background service. tmux should not be part of the normal operator flow.

Desired state

orch worker start and orch worker stop should manage the local host's worker process.

Mac
  orch worker start
    -> starts/reuses a background orch-worker on Mac
    -> that worker connects to the configured master (e.g. zeus:7777)

Zeus
  orch worker start
    -> starts/reuses a background orch-worker on Zeus
    -> that worker connects to the configured master for Zeus's profile

orch worker start must be:

  • local-host scoped
  • background by default
  • idempotent for the same host/profile
  • fail-fast, with a clear error if the worker exits before registering or sending its first heartbeat

orch worker stop must stop the corresponding local worker process, not ask the master host to stop something else.

orch worker status should make it clear whether:

  • the local background worker process exists
  • it successfully registered with the master
  • it is active/stale from the master's point of view

Why this matters

This is part of the desired host-worker model already documented in the branch specs:

  • one long-lived worker per host/profile
  • worker is background infrastructure, not a foreground command that operators need to wrap in tmux
  • --on <target> selects a host/profile, not a master-side path or master-side spawned worker

Acceptance criteria

  • orch worker start on a host starts/reuses a local background worker process on that same host.
  • With ORCH_REMOTE=zeus:7777, orch worker start on Mac starts/reuses the Mac worker that connects to Zeus.
  • orch worker stop on Mac stops that Mac worker.
  • orch worker start is idempotent for the same host/profile.
  • Startup is fail-fast: if the worker exits before registration, worker start returns a clear actionable error immediately.
  • worker status surfaces both local process state and master registration state clearly enough to debug startup failures.
  • E2E docs and automation no longer require tmux ... orch worker run ... as the normal path for starting a worker.

Session context for continuation

This issue is intentionally carrying forward the current branch/worktree context from the previous coding session.


Continue from this exact worktree

branch   : feature/remote-execution-phase0
worktree : /Users/s22625/repos/orch-feature-remote-execution-phase0
HEAD     : fb835c870a21f2b4dbe5dc34941ca48e77158213

Important:

  • Do not continue this work in /Users/s22625/repos/orch.
  • Continue in /Users/s22625/repos/orch-feature-remote-execution-phase0.

Related PR and previous tracking issue

What is already done on this branch

The remote-execution migration itself is mostly complete on this branch. Highlights from the session:

  • strict project_id-based runtime identity
  • removal of path-shaped transport fields (project_root / repo_root) from the new main runtime path
  • worker-local repo mapping for execution
  • host-worker routing (target.name -> target.host -> host worker id)
  • fail-fast managed worker startup in the current implementation
  • PR-safe E2E automation lanes
  • desired-state docs/spec updates for the host-worker model
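
The routing chain in the list above (target.name -> target.host -> host worker id) might look roughly like this. It is an illustrative sketch only: the `Target` type and `resolveWorkerID` are hypothetical, and the `host-` prefix is inferred from the `--worker-id host-mac` workaround command rather than confirmed from the code.

```go
package main

import "fmt"

// Target is a hypothetical shape for a configured --on target.
type Target struct {
	Name string
	Host string
}

// hostWorkerID derives the worker id for a host. The "host-" prefix is an
// assumption based on the "host-mac" id used elsewhere in this issue.
func hostWorkerID(host string) string {
	return "host-" + host
}

// resolveWorkerID maps a target name to the id of the worker on its host.
// Unknown or incomplete targets are errors, never a silent fallback.
func resolveWorkerID(targets map[string]Target, name string) (string, error) {
	t, ok := targets[name]
	if !ok {
		return "", fmt.Errorf("unknown target %q: no such host/profile is configured", name)
	}
	if t.Host == "" {
		return "", fmt.Errorf("target %q has no host; cannot select a host worker", name)
	}
	return hostWorkerID(t.Host), nil
}
```

With a target `mac` whose host is `mac`, this resolves to `host-mac`, matching the worker id in the tmux workaround.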

Latest branch head at the time of issue creation:

  • fb835c8 Drop remaining runtime path fallbacks

What is still wrong

Even after the remote-execution migration, worker lifecycle UX is still wrong for distributed use:

  • orch worker start is not yet a true local background worker bootstrap
  • operators still need to manually keep orch worker run alive
  • tmux was used only as a workaround to keep a foreground loop alive
  • this contradicts the desired architecture that the worker itself runs in the background

Concrete operator flow that should become normal

The intended flow should be as simple as:

# Zeus
orch master start --listen tcp://0.0.0.0:7777
orch daemon repo register /home/kento/repos/doeff

# Mac
ORCH_REMOTE=zeus:7777 orch worker start
ORCH_PROJECT=proboscis-doeff orch issue create sample-issue --title "Sample"
ORCH_PROJECT=proboscis-doeff orch run sample-issue --on mac --agent codex

And not require:

tmux new-session -d -s orch-worker-mac 'ORCH_REMOTE=zeus:7777 orch worker run --worker-id host-mac'

Philosophy to preserve while implementing

The user explicitly requested the following philosophy for this work:

  • no intermediate-state workaround kept as the desired UX
  • no silent fallback behavior
  • fail fast in unexpected cases
  • clear actionable error messages in every unexpected case
  • daemon/master remains the authoritative source of truth for state

Likely implementation areas

  • internal/cli/master_worker.go
  • local worker background supervision and lifecycle state under XDG dirs
  • local pid/log/metadata for workers, parallel to daemon background lifecycle
  • tests for local worker start/stop/status and fail-fast startup
  • docs / E2E automation updates so background worker startup is the standard path
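
For the pid/log/metadata files, a per-profile directory under the XDG state directory is one plausible layout. The sketch below is an assumption throughout: the `orch/workers/<profile>` layout, the file names, and the `workerPaths` helper are invented for illustration and may not match how the daemon's background lifecycle stores its state.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// WorkerPaths groups the local lifecycle files for one worker profile.
type WorkerPaths struct {
	Dir  string
	PID  string
	Log  string
	Meta string
}

// workerPaths resolves file locations under $XDG_STATE_HOME, defaulting to
// $HOME/.local/state as the XDG Base Directory spec describes. getenv is
// injected (pass os.Getenv in real use) so the function is easy to test.
func workerPaths(getenv func(string) string, home, profile string) (WorkerPaths, error) {
	if profile == "" {
		return WorkerPaths{}, fmt.Errorf("worker profile must not be empty")
	}
	state := getenv("XDG_STATE_HOME")
	if !filepath.IsAbs(state) {
		// The spec says unset or relative values must be ignored.
		if !filepath.IsAbs(home) {
			return WorkerPaths{}, fmt.Errorf("cannot locate state dir: home %q is not an absolute path", home)
		}
		state = filepath.Join(home, ".local", "state")
	}
	dir := filepath.Join(state, "orch", "workers", profile)
	return WorkerPaths{
		Dir:  dir,
		PID:  filepath.Join(dir, "worker.pid"),
		Log:  filepath.Join(dir, "worker.log"),
		Meta: filepath.Join(dir, "worker.json"),
	}, nil
}
```

Keeping the log path fixed and discoverable matters for the fail-fast requirement, since a startup error can then point the operator at the exact file to read.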

Notes for the next session

If a new session picks this up, it should begin by checking out the worktree above and then implementing this issue there, not on main.
