feat(spider-execution-manager): Add liveness actor with session ID tracker. by LinZhihao-723 · Pull Request #328 · y-scope/spider

LinZhihao-723 · 2026-05-22T22:01:37Z

Description

This PR depends on #327.

This PR adds the liveness actor and the shared session-tracker primitive that the rest of the execution-manager runtime will plug into.

spider_core::session::SessionTracker — a forward-only counter wrapping Arc<AtomicU64> for the runtime's view of storage's current session id. Cloneable, with current() / try_advance() semantics: writers always move the stored value forward via a CAS loop, and reads observe the latest committed value. Lives in spider-core so the future scheduler service can reuse the same primitive.
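As a rough illustration of the forward-only semantics described above, here is a minimal sketch using only the standard library. It is not the code from this PR: the constructor, the `bool` return type of `try_advance`, and the memory orderings are all assumptions chosen for the example.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the PR's `SessionTracker`; signatures are assumed.
#[derive(Clone, Debug, Default)]
pub struct SessionTracker {
    current: Arc<AtomicU64>,
}

impl SessionTracker {
    /// Creates a tracker starting at `initial` (assumed constructor).
    pub fn new(initial: u64) -> Self {
        Self {
            current: Arc::new(AtomicU64::new(initial)),
        }
    }

    /// Returns the latest committed session id.
    pub fn current(&self) -> u64 {
        self.current.load(Ordering::Acquire)
    }

    /// Moves the stored value forward to `candidate` via a CAS loop.
    /// Returns `true` only if this call changed the stored value.
    pub fn try_advance(&self, candidate: u64) -> bool {
        let mut observed = self.current.load(Ordering::Acquire);
        loop {
            if candidate <= observed {
                // Never move backwards, and treat an equal value as a no-op.
                return false;
            }
            match self.current.compare_exchange_weak(
                observed,
                candidate,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return true,
                // Another writer won the race (or the CAS failed spuriously); re-check.
                Err(actual) => observed = actual,
            }
        }
    }
}
```

Because clones share the same `Arc<AtomicU64>`, an advance made through one clone is visible through every other clone. `AtomicU64::fetch_max` would give the same forward-only effect in a single call; the explicit loop is shown only because the description mentions one.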

spider_execution_manager::liveness — a tokio-actor driving the periodic storage heartbeat:

A tokio::time::interval ticks every heartbeat_interval; each tick calls LivenessClient::heartbeat(em_id) and forwards storage's reply to the shared SessionTracker. The interval uses MissedTickBehavior::Skip as a defensive guard against starvation-induced burst-replay.
A LivenessCommand::Refresh lets the rest of the runtime ask for an off-schedule heartbeat. The command does not advance the tracker directly — storage's heartbeat reply is the only source of truth for the current session id, so the actor always re-checks rather than trusting the caller's observation.
interval.reset() runs at the end of every heartbeat call, so a Refresh-triggered heartbeat naturally rate-limits the next scheduled tick. Two consecutive heartbeats are never closer together than heartbeat_interval.
Terminal errors (MarkedDead / IllegalId) cancel the actor's CancellationToken, which the rest of the runtime will observe to tear everything down.
A storage reply with a session ID older than the locally tracked value is treated as a protocol invariant violation and also cancels the runtime.
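The reply-handling rules listed above can be sketched as a small, synchronous decision function. This is only a model under stated assumptions: the real actor is asynchronous and uses `CancellationToken`, whereas this sketch uses an `AtomicBool` as a stand-in for the token and a bare `AtomicU64` for the tracker. The `Action` enum, the `Unavailable` variant, and the retry behaviour for non-terminal errors are hypothetical and not taken from the PR.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Heartbeat failures. `MarkedDead` and `IllegalId` are named in the PR;
/// `Unavailable` is a hypothetical non-terminal error added for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HeartbeatError {
    MarkedDead,
    IllegalId,
    Unavailable,
}

/// What the actor does after one heartbeat (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Advanced(u64),
    Unchanged,
    RetryNextTick,
    CancelRuntime,
}

/// Applies one heartbeat reply to the tracked session id.
/// `cancelled` stands in for the runtime-wide cancellation token.
pub fn handle_heartbeat(
    tracked: &AtomicU64,
    cancelled: &AtomicBool,
    reply: Result<u64, HeartbeatError>,
) -> Action {
    match reply {
        Ok(session_id) => {
            // fetch_max never moves the value backwards and returns the old value.
            let previous = tracked.fetch_max(session_id, Ordering::AcqRel);
            if session_id < previous {
                // Storage reported an older session: protocol invariant violated.
                cancelled.store(true, Ordering::Release);
                Action::CancelRuntime
            } else if session_id == previous {
                Action::Unchanged
            } else {
                Action::Advanced(session_id)
            }
        }
        Err(HeartbeatError::MarkedDead | HeartbeatError::IllegalId) => {
            // Terminal errors tear the runtime down.
            cancelled.store(true, Ordering::Release);
            Action::CancelRuntime
        }
        // Assumed behaviour: other errors simply wait for the next heartbeat.
        Err(HeartbeatError::Unavailable) => Action::RetryNextTick,
    }
}
```

Note that the function takes only storage's reply as input, which mirrors the point that a `Refresh` command triggers a heartbeat rather than supplying a session id itself.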

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Ensure all workflows pass.
Add unit tests to cover basic actor behaviors.

Summary by CodeRabbit

Release Notes

New Features

Execution manager component for coordinating distributed task execution
Task executor subprocess with automatic process pool management
Timeout enforcement and crash recovery for task execution
Session tracking for execution lifecycle management

Tests

Comprehensive integration tests for task execution and process pool operations
Overhead instrumentation tests for performance measurement

coderabbitai · 2026-05-22T22:01:50Z

Walkthrough

This pull request introduces a complete execution manager subsystem that spawns task executor subprocesses, manages their lifecycle through a process pool, maintains liveness via periodic heartbeats, and coordinates task dispatch using a wire-protocol IPC layer. It includes the executor binary implementation, client abstractions for external service coordination, comprehensive integration testing, and foundational session tracking.

Changes

Execution Manager Feature

Layer / File(s)	Summary
Session tracking foundation `components/spider-core/Cargo.toml`, `components/spider-core/src/lib.rs`, `components/spider-core/src/session.rs`	`SessionTracker` wraps an atomic `SessionId` counter and provides forward-only advancement via CAS-loop, with unit tests covering single-thread and concurrent scenarios.
Client abstractions for storage/scheduler/liveness `components/spider-execution-manager/src/client.rs`, `components/spider-execution-manager/src/client/liveness.rs`, `components/spider-execution-manager/src/client/scheduler.rs`, `components/spider-execution-manager/src/client/storage.rs`	Three async trait interfaces define how the execution manager interacts with storage (task registration, success/failure reporting), scheduler (task polling), and liveness (heartbeat/registration); error types and response structures document failure modes.
Wire protocol for executor IPC `components/spider-task-executor/src/protocol.rs`, `components/spider-task-executor/src/error.rs`, `components/spider-task-executor/src/lib.rs`	`Request`/`Response`/`ExecutorOutcome` enums enable bincode-serialized message framing; `ExecutorError` made serde-compatible and string-backed for wire serialization.
Task executor subprocess implementation `components/spider-task-executor/Cargo.toml`, `components/spider-task-executor/src/bin/spider_task_executor.rs`, `components/spider-task-executor/src/manager.rs`	Executor binary reads framed stdin `Request`s, loads TDL packages via cache, executes tasks, and returns framed stdout `Response`s with timing; `TdlPackageManager::load` now returns package reference.
Liveness heartbeat actor `components/spider-execution-manager/src/liveness.rs`, `components/spider-execution-manager/src/lib.rs`	Tokio actor periodically sends heartbeats, advances `SessionTracker`, cancels runtime on terminal errors, and supports manual refresh via `LivenessHandle`; includes full test suite with mock client and deterministic `Notify`-based coordination.
Process pool executor management `components/spider-execution-manager/src/process_pool.rs`	Supervisor spawns and manages executor subprocesses, serializes concurrent requests via mutex, races against hard timeout, detects crashes via stdout EOF, and transparently respawns on failure.
Integration test harness and test tasks `tests/huntsman/integration-test-tasks/Cargo.toml`, `tests/huntsman/integration-test-tasks/src/lib.rs`, `tests/huntsman/task-executor/Cargo.toml`, `tests/huntsman/task-executor/src/lib.rs`, `taskfiles/test.yaml`	Introduces test tasks (fibonacci, always_fail, always_panic, instrument) as a TDL cdylib; test harness spawns executor binary, frames protocol messages, and provides payload builders; build config stages TDL artifacts.
Integration tests for executor and process pool `tests/huntsman/task-executor/tests/test_executor.rs`, `tests/huntsman/task-executor/tests/test_process_pool.rs`, `tests/huntsman/task-executor/tests/overhead_instrument.rs`	Tests verify executor correctness (output, errors, crashes), process pool semantics (dispatch, failure modes, respawn), and measure IPC/execution latency; includes crash recovery and timeout/retry scenarios.
Workspace and test configuration updates `Cargo.toml`, `tests/huntsman/tdl-integration/tests/complex.rs`	Workspace members extended to include execution manager and test crates; test config updated to build executor binary and stage TDL artifacts; existing TDL test adapted to new API.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

y-scope/spider#317: Introduces changes to ExecutorError and TdlPackageManager that this PR builds upon and refines for wire-protocol serialization.

Suggested reviewers

sitaowang1998

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely describes the main feature addition: a liveness actor with session ID tracker for the spider-execution-manager component.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

components/spider-execution-manager/src/process_pool.rs (1)
163-173: ⚡ Quick win

Serialize the request before taking the executor mutex.

build_request() only does local encoding, but it currently runs after Line 163 inside the same mutex scope that guards the child process. Moving it ahead of the lock keeps large input serializations and local encoding failures from extending head-of-line blocking on the single executor.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/spider-execution-manager/src/process_pool.rs` around lines 163 -
173, Call build_request(request) before acquiring the executor mutex so local
serialization/encoding work does not hold the child-process lock; specifically,
move the build_request(request)? invocation out of the critical section that
surrounds self.handle.lock().await and handle.run(...).await, so you compute
frame_request (via build_request) first, then acquire the mutex
(self.handle.lock().await), get handle
(handle_guard.as_mut().ok_or(InternalError::NotRunning)?), log and call
handle.run(frame_request, hard_timeout).await.
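The ordering this nitpick asks for can be sketched with standard-library types. Everything here is a simplified assumption: the real code uses an async mutex and `handle.run(...)`, while this sketch uses `std::sync::Mutex`, a hypothetical `FakeHandle` that just records frames, and a made-up length-prefixed `build_request`.

```rust
use std::sync::Mutex;

#[derive(Debug, PartialEq, Eq)]
pub enum DispatchError {
    Encode(String),
    NotRunning,
}

/// Hypothetical executor handle that records the frames it was asked to run.
#[derive(Debug, Default)]
pub struct FakeHandle {
    pub frames: Vec<Vec<u8>>,
}

/// Illustrative frame limit; not a value from the PR.
const MAX_PAYLOAD: usize = 1 << 20;

/// Hypothetical local encoder: 4-byte big-endian length prefix plus payload.
pub fn build_request(payload: &[u8]) -> Result<Vec<u8>, DispatchError> {
    if payload.len() > MAX_PAYLOAD {
        return Err(DispatchError::Encode(format!(
            "payload of {} bytes exceeds limit",
            payload.len()
        )));
    }
    let mut frame = (payload.len() as u32).to_be_bytes().to_vec();
    frame.extend_from_slice(payload);
    Ok(frame)
}

pub fn dispatch(
    handle: &Mutex<Option<FakeHandle>>,
    payload: &[u8],
) -> Result<(), DispatchError> {
    // Encode before locking: slow or failing encodes never hold the executor lock.
    let frame = build_request(payload)?;
    let mut guard = handle.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    let executor = guard.as_mut().ok_or(DispatchError::NotRunning)?;
    executor.frames.push(frame);
    Ok(())
}
```

With this ordering, an encoding failure returns before any other caller can be blocked behind it.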

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/spider-execution-manager/src/process_pool.rs`:
- Around line 278-325: The current run function (async fn run(&mut self,
request: Request, hard_timeout: Duration) -> Outcome) only starts the
tokio::select! timeout after awaiting self.requests.send(...).await, so a
blocked send can prevent the hard_timeout from ever firing; change run to cover
the full send+receive window by moving the timeout to wrap both send and
response handling (e.g., use tokio::time::timeout(hard_timeout, async {
self.requests.send(Bytes::from(bytes)).await?; self.responses.next().await }) or
include the send future in the same tokio::select! alongside responses and the
sleep), ensuring the send call (self.requests.send) is protected by hard_timeout
and still returns Outcome::Timeout on expiration.

In `@components/spider-task-executor/src/bin/spider_task_executor.rs`:
- Around line 73-77: The package identifier is used directly to build a
filesystem path in the else branch (the block that calls manager.get(package)
and manager.load(&path)), allowing path traversal; before joining
pkg_dir.join(package) validate/sanitize `package` (e.g., reject empty strings,
any path separators like '/' or '\\', any ".." components, and allow only a safe
whitelist such as [A-Za-z0-9_-]); if validation fails return an error instead of
constructing the path; apply this check where you construct `path` and before
calling `manager.load` so only safe package names are used.



ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: fed459f0-c414-490d-9e84-a0fc3c793720

📥 Commits

Reviewing files that changed from the base of the PR and between aadb9eb and 49b34d2.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (27)

Cargo.toml
components/spider-core/Cargo.toml
components/spider-core/src/lib.rs
components/spider-core/src/session.rs
components/spider-execution-manager/Cargo.toml
components/spider-execution-manager/src/client.rs
components/spider-execution-manager/src/client/liveness.rs
components/spider-execution-manager/src/client/scheduler.rs
components/spider-execution-manager/src/client/storage.rs
components/spider-execution-manager/src/lib.rs
components/spider-execution-manager/src/liveness.rs
components/spider-execution-manager/src/process_pool.rs
components/spider-task-executor/Cargo.toml
components/spider-task-executor/src/bin/spider_task_executor.rs
components/spider-task-executor/src/error.rs
components/spider-task-executor/src/lib.rs
components/spider-task-executor/src/manager.rs
components/spider-task-executor/src/protocol.rs
taskfiles/test.yaml
tests/huntsman/integration-test-tasks/Cargo.toml
tests/huntsman/integration-test-tasks/src/lib.rs
tests/huntsman/task-executor/Cargo.toml
tests/huntsman/task-executor/src/lib.rs
tests/huntsman/task-executor/tests/overhead_instrument.rs
tests/huntsman/task-executor/tests/test_executor.rs
tests/huntsman/task-executor/tests/test_process_pool.rs
tests/huntsman/tdl-integration/tests/complex.rs

coderabbitai · 2026-05-22T22:08:07Z

+    async fn run(&mut self, request: Request, hard_timeout: Duration) -> Outcome {
+        let bytes = bincode::serialize(&request).expect("bincode encode Request");
+        if let Err(err) = self.requests.send(Bytes::from(bytes)).await {
+            tracing::warn!(
+                executor_id = self.executor_id,
+                err = ? err,
+                "Failed to send request to executor."
+            );
+            return Outcome::ExecutorCrash {
+                exit_status: self.poll_exit_code(),
+            };
+        }
+
+        tokio::select! {
+            biased;
+            frame = self.responses.next() => match frame {
+                Some(Ok(bytes)) => match bincode::deserialize::<Response>(&bytes) {
+                    Ok(Response::Result { outcome, elapsed_us }) => match outcome {
+                        ExecutorOutcome::Success { outputs } => {
+                            Outcome::Success { outputs, elapsed_us }
+                        }
+                        ExecutorOutcome::Failure { error } => {
+                            Outcome::InTaskFailure { error, elapsed_us }
+                        }
+                    },
+                    Err(err) => {
+                        tracing::error!(
+                            executor_id = self.executor_id,
+                            err = ? err,
+                            "Failed to decode executor's response. Considered as crashed."
+                        );
+                        Outcome::ExecutorCrash { exit_status: self.poll_exit_code() }
+                    }
+                },
+                Some(Err(err)) => {
+                    tracing::error!(
+                        executor_id = self.executor_id,
+                        err = ? err,
+                        "Failed to receive executor's response."
+                    );
+                    Outcome::ExecutorCrash { exit_status: self.poll_exit_code() }
+                }
+                None => Outcome::ExecutorCrash { exit_status: self.poll_exit_code() },
+            },
+            () = tokio::time::sleep(hard_timeout) => {
+                tracing::warn!(executor_id = self.executor_id, "Executor time out triggered.");
+                Outcome::Timeout { hard_timeout }
+            }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the run() implementation and inspect the exact stdin write / pipe handling around the referenced lines.
FILE="components/spider-execution-manager/src/process_pool.rs"

# Show a window around the snippet lines
sed -n '240,360p' "$FILE" | cat -n

# Find where stdin (or any pipe writer) is written in run()
rg -n "stdin|write_all|AsyncWriteExt|pipe|requests\.send|Bytes::from" "$FILE"

# Print the send-related code with some surrounding context (where possible)
rg -n "self\.requests\.send" "$FILE" && \
  sed -n '260,330p' "$FILE" | cat -n

# Also locate poll_exit_code and respawn/timeout path logic for context
rg -n "poll_exit_code|respawn|Timeout|ExecutorCrash" "$FILE"

Repository: y-scope/spider

Length of output: 11483

Cover the stdin write with hard_timeout
In ExecutorHandle::run, the hard_timeout tokio::select! doesn’t start until after self.requests.send(...).await completes. If the child stops reading stdin (or the framed writer backpressures), that send can block indefinitely, preventing the timeout (and subsequent crash/respawn handling) from triggering. Move the timeout to cover the full send+receive exchange (e.g., tokio::time::timeout(hard_timeout, async { send; recv }) or include the send future in the same select!).
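To make the failure mode concrete without pulling in an async runtime, the following is a thread-based analogue rather than the tokio fix itself. It assumes a rendezvous `sync_channel` as a stand-in for a stdin pipe that exerts backpressure, and runs the whole send-then-receive exchange on a worker thread so that a single `recv_timeout` bounds both halves. The `Outcome` enum and `run_exchange` are hypothetical names for this sketch.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError, SyncSender};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Response(Vec<u8>),
    Crash,
    Timeout,
}

/// Sends `request` and waits for one response, with `hard_timeout`
/// covering the send as well as the receive.
pub fn run_exchange(
    requests: SyncSender<Vec<u8>>,
    responses: Receiver<Vec<u8>>,
    request: Vec<u8>,
    hard_timeout: Duration,
) -> Outcome {
    let (done_tx, done_rx) = mpsc::channel();
    thread::spawn(move || {
        // Both steps may block; neither is awaited outside the deadline.
        let result = match requests.send(request) {
            Err(_) => Outcome::Crash,
            Ok(()) => responses.recv().map_or(Outcome::Crash, Outcome::Response),
        };
        // The caller may already have timed out and dropped the receiver.
        let _ = done_tx.send(result);
    });
    match done_rx.recv_timeout(hard_timeout) {
        Ok(outcome) => outcome,
        Err(RecvTimeoutError::Timeout) => Outcome::Timeout,
        Err(RecvTimeoutError::Disconnected) => Outcome::Crash,
    }
}
```

If the peer never reads, the worker stays blocked in `send`, yet the caller still gets `Outcome::Timeout`. In the async code the same effect would come from wrapping the send and the receive in one timeout, as the comment above suggests; this sketch leaves the blocked worker thread behind, which a real implementation would need to clean up (for example by killing the child).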


coderabbitai · 2026-05-22T22:08:07Z

+    let pkg = if let Some(pkg) = manager.get(package) {
+        pkg
+    } else {
+        let path = pkg_dir.join(package).join(format!("lib{package}.so"));
+        manager.load(&path)?


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate package identifiers before filesystem path construction

Line 76 builds a library path from package without validating components. A crafted value (for example containing separators or ..) can escape the intended package directory and load an unintended shared object.

Proposed fix

 use std::{
-    path::{Path, PathBuf},
+    path::{Component, Path, PathBuf},
     time::Instant,
 };
@@ fn run_task(
@@
 ) -> Result<Vec<u8>, ExecutorError> {
     let pkg = if let Some(pkg) = manager.get(package) {
         pkg
     } else {
+        let mut components = Path::new(package).components();
+        let valid_package = matches!(components.next(), Some(Component::Normal(_)))
+            && components.next().is_none();
+        if !valid_package {
+            return Err(ExecutorError::InvalidLibrary(format!(
+                "invalid package identifier: {package}"
+            )));
+        }
         let path = pkg_dir.join(package).join(format!("lib{package}.so"));
         manager.load(&path)?
     };
     pkg.execute_task(task_func, raw_ctx, raw_inputs)
 }
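For comparison, a stricter character-whitelist check could look roughly like the sketch below. The function names and the exact allowed character set are assumptions for illustration; the project may well want to permit other characters in package names.

```rust
use std::path::{Path, PathBuf};

/// Accepts only non-empty names made of ASCII letters, digits, `_`, and `-`.
/// The allowed set is an assumption, not a rule taken from the project.
pub fn is_safe_package_name(package: &str) -> bool {
    !package.is_empty()
        && package
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'_' || b == b'-')
}

/// Builds `<pkg_dir>/<package>/lib<package>.so`, or `None` for unsafe names.
pub fn package_library_path(pkg_dir: &Path, package: &str) -> Option<PathBuf> {
    is_safe_package_name(package)
        .then(|| pkg_dir.join(package).join(format!("lib{package}.so")))
}
```

One difference from a component-based check is worth noting: `Path::components` normalizes trailing separators, so a name such as `foo/` parses as a single normal component, whereas the whitelist rejects it outright.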


LinZhihao-723 added 11 commits May 14, 2026 19:36

Implementation done.

320c2d0

Commit integration tests.

d25131c

gg, forgot to apply linters.

777cff5

Done with process pool implementation.

47a004f

Rename test source.

e8253e9

Merge branch 'task-executor-impl' into process-pool-basic

86bb2ed

Fix.

5946d18

Merge branch 'task-executor-impl' into process-pool-basic

cf6fdf8

Rename test binary.

c614fd3

Add client trait implement.

fd93247

Add liveness actor.

49b34d2

LinZhihao-723 requested review from a team and sitaowang1998 as code owners May 22, 2026 22:01

coderabbitai Bot reviewed May 22, 2026

LinZhihao-723 wants to merge 11 commits into y-scope:main from LinZhihao-723:liveness-actor
