Problem
orch worker start is supposed to mean:
start or reuse the long-lived orch-worker on this host/profile
But the current behavior is still effectively:
tell the connected orch-master to spawn/manage a worker on the master's host
That breaks the desired remote-execution model. In the Zeus master + Mac worker flow, operators currently need to use a manual workaround such as:
tmux new-session -d -s orch-worker-mac 'ORCH_REMOTE=zeus:7777 orch worker run --worker-id host-mac'
The worker itself is supposed to be a background service. tmux should not be part of the normal operator flow.
Desired state
orch worker start and orch worker stop should manage the local host's worker process.
Mac
orch worker start
-> starts/reuses a background orch-worker on Mac
-> that worker connects to the configured master (e.g. zeus:7777)
Zeus
orch worker start
-> starts/reuses a background orch-worker on Zeus
-> that worker connects to the configured master for Zeus's profile
orch worker start must be:
- local-host scoped
- background by default
- idempotent for the same host/profile
- fail-fast with a clear error if the worker fails before register/heartbeat
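The idempotency requirement above can be sketched as a small decision function. This is a minimal illustration, not the branch's implementation: StartAction, processAlive, and decideStart are hypothetical names, and the sketch assumes the local worker's pid is recorded somewhere on disk and that the host is Unix-like.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// StartAction is the decision "worker start" makes for this host/profile.
// (Hypothetical type, for illustration only.)
type StartAction string

const (
	ActionSpawn StartAction = "spawn" // no live worker: start a new one
	ActionReuse StartAction = "reuse" // a live worker is already recorded: do nothing
)

// processAlive reports whether pid refers to a live process (Unix only).
// Signal 0 performs the existence check without delivering a signal.
func processAlive(pid int) bool {
	if pid <= 0 {
		return false
	}
	proc, err := os.FindProcess(pid)
	if err != nil {
		return false
	}
	err = proc.Signal(syscall.Signal(0))
	// EPERM means the process exists but belongs to another user.
	return err == nil || errors.Is(err, syscall.EPERM)
}

// decideStart makes start idempotent: a recorded pid that is still alive is
// reused; a missing (0) or dead pid leads to a fresh spawn.
func decideStart(recordedPID int, alive func(int) bool) StartAction {
	if recordedPID > 0 && alive(recordedPID) {
		return ActionReuse
	}
	return ActionSpawn
}

func main() {
	// Our own pid is alive, so a second "start" would reuse it.
	fmt.Println(decideStart(os.Getpid(), processAlive))
	// No recorded pid means there is nothing to reuse.
	fmt.Println(decideStart(0, processAlive))
}
```

Passing the liveness check in as a function keeps the decision testable without spawning real processes.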
orch worker stop must stop the corresponding local worker process, not ask the master host to stop something else.
orch worker status should make it clear whether:
- the local background worker process exists
- it successfully registered with the master
- it is active/stale from the master's point of view
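One way to picture such a status report is sketched below. The names MasterView and describeStatus, the output layout, and the hint wording are all assumptions made for illustration; the only part taken from this issue is that local process state and the master's view are reported side by side, with the master remaining authoritative for registration and liveness.

```go
package main

import "fmt"

// MasterView is the worker's state as reported by the master.
// (Hypothetical type; the real state names may differ.)
type MasterView string

const (
	MasterUnregistered MasterView = "unregistered"
	MasterActive       MasterView = "active"
	MasterStale        MasterView = "stale"
)

// describeStatus renders the three facts an operator needs and adds a hint
// whenever the local state and the master's view disagree.
func describeStatus(localRunning bool, master MasterView) string {
	local := "not running"
	if localRunning {
		local = "running"
	}
	registered := "yes"
	if master == MasterUnregistered {
		registered = "no"
	}
	out := fmt.Sprintf("local process: %s\nregistered: %s\nmaster view: %s", local, registered, master)

	switch {
	case localRunning && master == MasterUnregistered:
		out += "\nhint: worker is running but never registered; check ORCH_REMOTE and the worker log"
	case localRunning && master == MasterStale:
		out += "\nhint: master has stopped receiving heartbeats; check connectivity to the master"
	case !localRunning && master != MasterUnregistered:
		out += "\nhint: master knows this worker but no local process exists; run 'orch worker start'"
	}
	return out
}

func main() {
	fmt.Println(describeStatus(true, MasterActive))
	fmt.Println()
	fmt.Println(describeStatus(true, MasterUnregistered))
}
```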
Why this matters
This is part of the desired host-worker model already documented in the branch specs:
- one long-lived worker per host/profile
- worker is background infrastructure, not a foreground command that operators need to wrap in tmux
- --on <target> selects a host/profile, not a master-side path or master-side spawned worker
Acceptance criteria
- orch worker start on a host starts/reuses a local background worker process on that same host.
- With ORCH_REMOTE=zeus:7777, orch worker start on Mac starts/reuses the Mac worker that connects to Zeus.
- orch worker stop on Mac stops that Mac worker.
- orch worker start is idempotent for the same host/profile.
- Startup is fail-fast: if the worker exits before registration, worker start returns a clear actionable error immediately.
- worker status surfaces both local process state and master registration state clearly enough to debug startup failures.
- E2E docs and automation no longer require tmux ... orch worker run ... as the normal path for starting a worker.
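The fail-fast criterion above could be implemented by racing three events: registration, early process exit, and a deadline. The sketch below assumes the supervisor exposes a registered channel and an exited channel; those channels, the awaitStartup name, the timeout, and the log path argument are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// awaitStartup blocks until the worker registers, exits, or the deadline
// passes. Only registration counts as success; the other two outcomes return
// an error that points the operator at the worker log.
func awaitStartup(registered <-chan struct{}, exited <-chan error, timeout time.Duration, logPath string) error {
	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case <-registered:
		return nil
	case err := <-exited:
		if err == nil {
			// Even a clean exit is a failure if it happens before registration.
			err = errors.New("exit status 0")
		}
		return fmt.Errorf("worker exited before registering with the master: %w (see %s)", err, logPath)
	case <-timer.C:
		return fmt.Errorf("worker did not register within %s (see %s)", timeout, logPath)
	}
}

func main() {
	registered := make(chan struct{})
	exited := make(chan error, 1)
	// Simulate a worker that dies immediately because the master is unreachable.
	exited <- errors.New("dial tcp zeus:7777: connection refused")

	err := awaitStartup(registered, exited, 2*time.Second, "worker.log")
	fmt.Println(err)
}
```

Treating a timeout as an error rather than returning success optimistically keeps the command consistent with the no-silent-fallback rule stated later in this issue.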
Session context for continuation
This issue is intentionally carrying forward the current branch/worktree context from the previous coding session.
Continue from this exact worktree
branch : feature/remote-execution-phase0
worktree : /Users/s22625/repos/orch-feature-remote-execution-phase0
HEAD : fb835c870a21f2b4dbe5dc34941ca48e77158213
Important:
- Do not continue this work in /Users/s22625/repos/orch.
- Continue in /Users/s22625/repos/orch-feature-remote-execution-phase0.
Related PR and previous tracking issue
What is already done on this branch
The remote-execution migration itself is mostly complete on this branch. Highlights from the session:
- strict project_id-based runtime identity
- removal of main runtime path-shaped transport (project_root / repo_root) in the new runtime path
- worker-local repo mapping for execution
- host-worker routing (target.name -> target.host -> host worker id)
- fail-fast managed worker startup in the current implementation
- PR-safe E2E automation lanes
- desired-state docs/spec updates for the host-worker model
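For readers new to the branch, the host-worker routing chain listed above (target.name -> target.host -> host worker id) can be pictured roughly as follows. This is an illustrative sketch only: the Target type and resolveWorkerID are hypothetical, and the "host-" prefix is inferred from the single host-mac example in this issue rather than from the code.

```go
package main

import "fmt"

// Target is a hypothetical stand-in for a configured execution target.
type Target struct {
	Name string
	Host string
}

// resolveWorkerID follows target.name -> target.host -> host worker id.
// Unknown or incomplete targets fail fast instead of falling back silently.
func resolveWorkerID(targets map[string]Target, name string) (string, error) {
	t, ok := targets[name]
	if !ok {
		return "", fmt.Errorf("unknown target %q: define it before using --on %s", name, name)
	}
	if t.Host == "" {
		return "", fmt.Errorf("target %q has no host configured", name)
	}
	// Assumption: one long-lived worker per host, with an id derived from the host name.
	return "host-" + t.Host, nil
}

func main() {
	targets := map[string]Target{"mac": {Name: "mac", Host: "mac"}}

	id, err := resolveWorkerID(targets, "mac")
	fmt.Println(id, err)

	_, err = resolveWorkerID(targets, "gpu-box")
	fmt.Println(err)
}
```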
Latest branch head at the time of issue creation:
fb835c8 Drop remaining runtime path fallbacks
What is still wrong
Even after the remote-execution migration, worker lifecycle UX is still wrong for distributed use:
- orch worker start is not yet a true local background worker bootstrap
- operators still need to manually keep orch worker run alive
- tmux was used only as a workaround to keep a foreground loop alive
- this contradicts the desired architecture that the worker itself runs in the background
Concrete operator flow that should become normal
The intended flow should be as simple as:
# Zeus
orch master start --listen tcp://0.0.0.0:7777
orch daemon repo register /home/kento/repos/doeff
# Mac
ORCH_REMOTE=zeus:7777 orch worker start
ORCH_PROJECT=proboscis-doeff orch issue create sample-issue --title "Sample"
ORCH_PROJECT=proboscis-doeff orch run sample-issue --on mac --agent codex
And not require:
tmux new-session -d -s orch-worker-mac 'ORCH_REMOTE=zeus:7777 orch worker run --worker-id host-mac'
Philosophy to preserve while implementing
The user explicitly requested the following philosophy for this work:
- no intermediate-state workaround kept as the desired UX
- no silent fallback behavior
- fail fast in unexpected cases
- clear actionable error messages in every unexpected case
- daemon/master remains the authoritative source of truth for state
Likely implementation areas
- internal/cli/master_worker.go
- local worker background supervision and lifecycle state under XDG dirs
- local pid/log/metadata for workers, parallel to daemon background lifecycle
- tests for local worker start/stop/status and fail-fast startup
- docs / E2E automation updates so background worker startup is the standard path
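As a starting point for the local pid/log/metadata layout, the sketch below derives per-profile paths under the XDG state directory. The choice of XDG_STATE_HOME (rather than another XDG directory), the orch/workers/<profile> layout, and the file names are assumptions; the existing daemon background lifecycle on the branch should be treated as the reference.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// workerPaths groups the files that describe one local worker.
// (Hypothetical layout, for illustration only.)
type workerPaths struct {
	Dir  string
	PID  string
	Log  string
	Meta string
}

// stateHome resolves the XDG state directory. Per the XDG Base Directory
// Specification it defaults to $HOME/.local/state when XDG_STATE_HOME is unset.
func stateHome(getenv func(string) string) (string, error) {
	if dir := getenv("XDG_STATE_HOME"); dir != "" {
		return dir, nil
	}
	home := getenv("HOME")
	if home == "" {
		return "", errors.New("cannot locate worker state: neither XDG_STATE_HOME nor HOME is set")
	}
	return filepath.Join(home, ".local", "state"), nil
}

// pathsFor returns the pid, log, and metadata paths for one profile.
func pathsFor(stateDir, profile string) workerPaths {
	dir := filepath.Join(stateDir, "orch", "workers", profile)
	return workerPaths{
		Dir:  dir,
		PID:  filepath.Join(dir, "worker.pid"),
		Log:  filepath.Join(dir, "worker.log"),
		Meta: filepath.Join(dir, "worker.json"),
	}
}

func main() {
	dir, err := stateHome(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", pathsFor(dir, "default"))
}
```

Keying the directory by profile gives start, stop, and status a single place to look for the same host/profile, which is what makes the idempotency check possible.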
Notes for the next session
If a new session picks this up, it should begin by checking out the worktree above and then implementing this issue there, not on main.