Status: describes the target design. Current code still has separate DistChipProcess/DistSubWorker classes (target: merged into WorkerThread in PR-D) and passes `const WorkerPayload&` to `IWorker::run` (target: replaced in PR-C). See roadmap.md for the full landed-vs-planned breakdown.
WorkerManager and WorkerThread together implement the execution layer
of a Worker engine. WorkerManager owns two pools of WorkerThreads (one
for next-level workers, one for sub workers); each WorkerThread owns an
IWorker and a std::thread, and dispatches to it in either THREAD or
PROCESS mode.
For the high-level role of this layer among the three engine components, see
distributed_level_runtime.md. For what the
IWorker implementations actually do with task data, see
task-flow.md. For where dispatched tasks come from, see
scheduler.md.
```cpp
class WorkerManager {
public:
enum class Mode { THREAD, PROCESS };
explicit WorkerManager(Mode mode);
// Registration (before init)
void add_next_level(IWorker *worker);
void add_sub (IWorker *worker);
// Lifecycle
void start(OnCompleteFn on_complete); // starts all WorkerThreads
void stop();
// Scheduler API
WorkerThread *pick_idle(WorkerType type);
std::vector<WorkerThread *> pick_n_idle(WorkerType type, int n);
void dispatch(WorkerThread *wt, TaskSlot slot_id);
private:
Mode mode_;
std::vector<std::unique_ptr<WorkerThread>> next_level_;
std::vector<std::unique_ptr<WorkerThread>> sub_;
};
```

- Pool ownership: two `std::vector` pools, sized at init from `add_*` calls
- Idle selection: `pick_idle(type)` finds a WorkerThread whose queue is empty; blocks if none is available
- Mode propagation: every WorkerThread constructed under this manager inherits `mode_` (picked per Worker at construction)
| Deployment | Recommended mode |
|---|---|
| Onboard real hardware | THREAD — driver is thread-safe per device, no fork overhead |
| Simulation (sim runtime) | PROCESS — sim backend has shared state that needs isolation |
| ci.py parallel tests | PROCESS — test independence; per-test dlopen state |
| L4+ when L3 children are thread-safe composites | THREAD |
Mode is a per-Worker decision. Different levels in a nested hierarchy can
use different modes independently (e.g., L4 THREAD containing L3 PROCESS).
One WorkerThread per IWorker instance.
```cpp
class WorkerThread {
public:
enum class Mode { THREAD, PROCESS };
WorkerThread(Mode mode,
IWorker *worker,
TaskSlotState *parent_slots,
size_t mailbox_size = 0);
void start(OnCompleteFn on_done);
void stop();
void dispatch(TaskSlot slot_id);
bool is_idle() const;
private:
Mode mode_;
IWorker *worker_;
TaskSlotState *parent_slots_; // reference to parent's slot pool
std::thread parent_thread_;
LockFreeQueue<TaskSlot> queue_;
// PROCESS mode only
void *mailbox_ = nullptr; // shm
pid_t child_pid_ = -1;
size_t mailbox_size_ = 0;
void loop();
void dispatch_thread(TaskSlot slot_id);
void dispatch_process(TaskSlot slot_id);
[[noreturn]] void child_loop();
void fork_child();
};
```

The WorkerThread's std::thread always exists regardless of mode — it pumps the internal queue and either runs the worker in-process or drives the shm handshake to a forked child.
The simple case: same process, no shm, no serialization.
```cpp
void WorkerThread::dispatch_thread(TaskSlot slot_id) {
TaskSlotState &s = parent_slots_[slot_id];
worker_->run(s.callable, s.task_args.view(), s.config);
on_complete_(slot_id);
}
```

- `TaskArgs::view()` returns a zero-copy `TaskArgsView` pointing into the slot's `std::vector` backing (parent heap)
- `IWorker::run` dispatches polymorphically based on the actual worker type
When is THREAD mode safe?
- The IWorker implementation must be thread-safe relative to other concurrent calls and other system state
- `ChipWorker` (dlsym'd `runtime.so`) is safe when the `runtime.so` and its device driver support concurrent use
- `SubWorker` in THREAD mode is constrained by Python's GIL (all SubWorkers in the pool effectively serialize), but this is often fine for light Python callables
Each WorkerThread forks a child at init. Each dispatch encodes task data into a shm mailbox, signals the child, and polls for completion.
fork_child() is called once by WorkerThread::start() before any C++
worker thread spawns:
```cpp
void WorkerThread::fork_child() {
// Alloc mailbox in MAP_SHARED shm
mailbox_ = mmap(nullptr, mailbox_size_,
PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_ANONYMOUS, -1, 0);
// Initialize mailbox state to IDLE
write_state(mailbox_, MailboxState::IDLE);
pid_t pid = fork();
if (pid == 0) {
// Child
child_loop(); // never returns
} else {
child_pid_ = pid;
}
}
```

Fork must happen before any std::thread is created in the parent. The
Python Worker ensures this by:
- `Worker.register(fn)` registers Python callables (pre-fork)
- C++ `WorkerManager::add_*` registers IWorker pointers
- `Worker::init`:
  - First: `WorkerManager::start()` — this calls each WorkerThread's `start`, which forks the child, then spawns the parent's std::thread for that WT
  - Then: `Scheduler::start()` spawns the scheduler thread
- Fork ordering: at the moment `fork()` is called, the parent has only the Python main thread and zero C++ worker threads. Safe.
This avoids the classical "fork in multithreaded process" hazard where a child inherits locks held by threads that don't exist post-fork.
```cpp
void WorkerThread::dispatch_process(TaskSlot slot_id) {
TaskSlotState &s = parent_slots_[slot_id];
uint8_t *d = (uint8_t*)mailbox_ + HEADER_SIZE;
// Write task data
*reinterpret_cast<Callable*>(d) = s.callable;
*reinterpret_cast<CallConfig*>(d + 8) = s.config;
write_blob(d + 8 + sizeof(CallConfig), s.task_args);
// Signal child
write_state(mailbox_, MailboxState::TASK_READY);
// Poll for completion
while (read_state(mailbox_) != MailboxState::TASK_DONE)
std::this_thread::sleep_for(std::chrono::microseconds(50));
int err = read_error(mailbox_);
write_state(mailbox_, MailboxState::IDLE);
on_complete_(slot_id, err);
}
```

Parent-side cost per dispatch:

- One memcpy of `Callable` (8 B) + `CallConfig` (16 B) + blob (≤1.7 KB for L3)
- One signal (`write_state`)
- Poll loop with `sleep_for(50us)` (not busy-wait)
The parent-side writes cost well under a microsecond; end-to-end latency adds up to 50 µs of polling granularity, and the wait itself is dominated by actual kernel execution.
```cpp
void WorkerThread::child_loop() {
for (;;) {
// Wait until the parent posts a task or requests shutdown. The wait
// must check both states; a TASK_READY-only loop never observes SHUTDOWN.
MailboxState st;
while ((st = read_state(mailbox_)) != MailboxState::TASK_READY &&
       st != MailboxState::SHUTDOWN)
pause_short();
if (st == MailboxState::SHUTDOWN) _exit(0); // _exit: skip atexit/stdio teardown inherited from the parent
uint8_t *d = (uint8_t*)mailbox_ + HEADER_SIZE;
Callable cb = *reinterpret_cast<Callable*>(d);
CallConfig config = *reinterpret_cast<CallConfig*>(d + 8);
TaskArgsView view = read_blob(d + 8 + sizeof(CallConfig));
int err = 0;
try {
worker_->run(cb, view, config);
} catch (...) {
err = 1;
}
write_error(mailbox_, err);
write_state(mailbox_, MailboxState::TASK_DONE);
}
}
```

Child's `worker_` is polymorphic (ChipWorker / SubWorker / nested Worker).
The child inherits the parent's full address space at fork time, so:
- ChipCallable objects (pre-fork allocated) are COW-visible at the same VA
- `py_registry` (for SubWorker) is COW-visible
- Tensor data in `torch.share_memory_()` regions is fully shared (MAP_SHARED)
```
offset 0:                       int32 state (IDLE / TASK_READY / TASK_DONE / SHUTDOWN)
offset 4:                       int32 error
offset 8:                       uint64 callable
offset 16:                      CallConfig config
offset 16 + sizeof(CallConfig): bytes blob: [int32 T][int32 S][ContinuousTensor × T][uint64_t × S]
```
Sized at WorkerThread construction:

```cpp
mailbox_size_ = HEADER_SIZE        // 8 B (state + error)
              + sizeof(Callable)   // 8 B
              + sizeof(CallConfig) // ~16 B
              + MAX_BLOB_SIZE;     // per-level, e.g. 1672 B for L3
```

Per-worker total: ~2 KB. Typical pool: 4-8 workers → ~8-16 KB shm total.
```cpp
void WorkerThread::stop() {
if (mode_ == Mode::PROCESS) {
write_state(mailbox_, MailboxState::SHUTDOWN);
waitpid(child_pid_, nullptr, 0);
munmap(mailbox_, mailbox_size_);
}
// Signal parent thread to exit its loop
queue_.push_sentinel();
parent_thread_.join();
}
```

To add a new worker kind (e.g., a RemoteWorker over RPC):

- Implement `IWorker::run(Callable, TaskArgsView, const CallConfig&)` on the new class
- Register via `manager.add_next_level(ptr)` or `manager.add_sub(ptr)`
- If the new worker needs to run in PROCESS mode, ensure any resources it needs (shm regions, sockets) are established before fork
The dispatch path (THREAD vs PROCESS) is chosen by WorkerManager::mode_,
not by the IWorker type — so the same IWorker implementation works in both
modes. This is why ChipWorker, SubWorker, and Worker all share one
interface: the dispatch layer is orthogonal to the worker semantics.
Three decisions that led here:
Forking per submit eliminates the mailbox and serialization, but costs ~1-10 ms per fork (COW page-table setup for a large parent image). For thousands of tasks per DAG, the overhead dominates. Pre-forked pool amortizes fork across many dispatches.
The scheduling state (TaskSlotState.fanin_count, fanout_consumers, fanout_mu) is parent-only — Scheduler and Orchestrator read/write it, but children never do. Putting the slot in shm would force cross-process atomics and shm-safe containers for no benefit. See task-flow.md §11 for full rationale.
Alternative: N workers share one dispatch queue. Rejected because:

- `WorkerThread` queue is the natural unit of backpressure — if worker i is slow, its queue fills up and the scheduler falls back to another
- Simpler mental model: one IWorker = one thread that drives it
- Zero contention on queue access (only one producer, one consumer per queue)
- distributed_level_runtime.md — where this layer fits in the three-component engine
- task-flow.md — what `IWorker::run` receives
- scheduler.md — the producer of `WorkerThread::dispatch` calls