
feat(spider-execution-manager): Add liveness actor with session ID tracker. #328

Open
LinZhihao-723 wants to merge 11 commits into y-scope:main from LinZhihao-723:liveness-actor

Conversation

LinZhihao-723 (Member) commented May 22, 2026

Description

This PR depends on #327.

This PR adds the liveness actor and the shared session-tracker primitive that the rest of the execution-manager runtime will plug into.

spider_core::session::SessionTracker — a forward-only counter wrapping Arc<AtomicU64> for the runtime's view of storage's current session id. Cloneable, with current() / try_advance() semantics: writers always move the stored value forward via a CAS loop, and reads observe the latest committed value. Lives in spider-core so the future scheduler service can reuse the same primitive.
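The forward-only behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: the `new` constructor, the `bool` return type of `try_advance`, and the memory orderings are assumptions, and the session id is modeled as a plain `u64`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical sketch of a cloneable, forward-only session id tracker.
#[derive(Clone, Debug, Default)]
pub struct SessionTracker {
    inner: Arc<AtomicU64>,
}

impl SessionTracker {
    /// Assumed constructor; the PR description does not show one.
    pub fn new(initial: u64) -> Self {
        Self { inner: Arc::new(AtomicU64::new(initial)) }
    }

    /// Returns the latest committed session id.
    pub fn current(&self) -> u64 {
        self.inner.load(Ordering::Acquire)
    }

    /// Moves the stored id forward to `candidate` if it is strictly newer.
    /// Returns `true` only when this call performed the update (assumed return type).
    pub fn try_advance(&self, candidate: u64) -> bool {
        let mut observed = self.inner.load(Ordering::Acquire);
        // CAS loop: retry while the candidate is still ahead of what we last observed.
        while candidate > observed {
            match self.inner.compare_exchange_weak(
                observed,
                candidate,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return true,
                // Another writer moved the value, or the weak CAS failed spuriously.
                Err(actual) => observed = actual,
            }
        }
        false
    }
}
```

Because every clone shares the same `Arc<AtomicU64>`, an advance made through one clone is visible through all others, and a stale candidate never moves the value backwards.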

spider_execution_manager::liveness — a tokio-actor driving the periodic storage heartbeat:

  • A tokio::time::interval ticks every heartbeat_interval; each tick calls LivenessClient::heartbeat(em_id) and forwards storage's reply to the shared SessionTracker. The interval uses MissedTickBehavior::Skip as a defensive guard against starvation-induced burst-replay.
  • A LivenessCommand::Refresh lets the rest of the runtime ask for an off-schedule heartbeat. The command does not advance the tracker directly — storage's heartbeat reply is the only source of truth for the current session id, so the actor always re-checks rather than trusting the caller's observation.
  • interval.reset() runs at the end of every heartbeat call, so a Refresh-triggered heartbeat naturally rate-limits the next scheduled tick. Two consecutive heartbeats are never closer together than heartbeat_interval.
  • Terminal errors (MarkedDead / IllegalId) cancel the actor's CancellationToken, which the rest of the runtime will observe to tear everything down.
  • A storage reply with a session ID older than the locally tracked value is treated as a protocol invariant violation and also cancels the runtime.
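The decision rules in the list above can be modeled without tokio as a small state machine. This is a sketch under stated assumptions, not the actor from this PR: `Liveness`, `Action`, `HeartbeatError::Transient`, and the handling of non-terminal errors are hypothetical, and the `next_tick` field only imitates what `interval.reset()` achieves.

```rust
use std::time::{Duration, Instant};

/// Error kinds named in the PR description, plus an assumed non-terminal variant.
#[derive(Debug, PartialEq)]
enum HeartbeatError {
    MarkedDead,
    IllegalId,
    Transient, // assumption: some errors are retried on the next tick
}

#[derive(Debug, PartialEq)]
enum Action {
    Continue,
    CancelRuntime,
}

/// Hypothetical model of the actor's state.
struct Liveness {
    tracked: u64,
    interval: Duration,
    next_tick: Instant,
}

impl Liveness {
    fn new(now: Instant, interval: Duration, tracked: u64) -> Self {
        Self { tracked, interval, next_tick: now + interval }
    }

    /// Applies one heartbeat reply, whether it came from a tick or a Refresh command.
    fn on_heartbeat(&mut self, now: Instant, reply: Result<u64, HeartbeatError>) -> Action {
        // Models interval.reset(): the next scheduled tick is a full interval away.
        self.next_tick = now + self.interval;
        match reply {
            // A session id older than the tracked one violates the protocol invariant.
            Ok(id) if id < self.tracked => Action::CancelRuntime,
            Ok(id) => {
                self.tracked = id;
                Action::Continue
            }
            Err(HeartbeatError::MarkedDead | HeartbeatError::IllegalId) => Action::CancelRuntime,
            Err(HeartbeatError::Transient) => Action::Continue,
        }
    }
}
```

In this model a Refresh-triggered heartbeat at `t` pushes the next tick to `t + interval`, which is how two heartbeats stay at least one interval apart.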

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Ensure all workflows pass.
  • Add unit tests to cover basic actor behaviors.

Summary by CodeRabbit

Release Notes

New Features

  • Execution manager component for coordinating distributed task execution
  • Task executor subprocess with automatic process pool management
  • Timeout enforcement and crash recovery for task execution
  • Session tracking for execution lifecycle management

Tests

  • Comprehensive integration tests for task execution and process pool operations
  • Overhead instrumentation tests for performance measurement


@LinZhihao-723 LinZhihao-723 requested review from a team and sitaowang1998 as code owners May 22, 2026 22:01
coderabbitai Bot (Contributor) commented May 22, 2026

Walkthrough

This pull request introduces a complete execution manager subsystem that spawns task executor subprocesses, manages their lifecycle through a process pool, maintains liveness via periodic heartbeats, and coordinates task dispatch using a wire-protocol IPC layer. It includes the executor binary implementation, client abstractions for external service coordination, comprehensive integration testing, and foundational session tracking.

Changes

Execution Manager Feature

| Layer | File(s) | Summary |
| --- | --- | --- |
| Session tracking foundation | components/spider-core/Cargo.toml, components/spider-core/src/lib.rs, components/spider-core/src/session.rs | SessionTracker wraps an atomic SessionId counter and provides forward-only advancement via CAS-loop, with unit tests covering single-thread and concurrent scenarios. |
| Client abstractions for storage/scheduler/liveness | components/spider-execution-manager/src/client.rs, components/spider-execution-manager/src/client/liveness.rs, components/spider-execution-manager/src/client/scheduler.rs, components/spider-execution-manager/src/client/storage.rs | Three async trait interfaces define how the execution manager interacts with storage (task registration, success/failure reporting), scheduler (task polling), and liveness (heartbeat/registration); error types and response structures document failure modes. |
| Wire protocol for executor IPC | components/spider-task-executor/src/protocol.rs, components/spider-task-executor/src/error.rs, components/spider-task-executor/src/lib.rs | Request/Response/ExecutorOutcome enums enable bincode-serialized message framing; ExecutorError made serde-compatible and string-backed for wire serialization. |
| Task executor subprocess implementation | components/spider-task-executor/Cargo.toml, components/spider-task-executor/src/bin/spider_task_executor.rs, components/spider-task-executor/src/manager.rs | Executor binary reads framed stdin Requests, loads TDL packages via cache, executes tasks, and returns framed stdout Responses with timing; TdlPackageManager::load now returns package reference. |
| Liveness heartbeat actor | components/spider-execution-manager/src/liveness.rs, components/spider-execution-manager/src/lib.rs | Tokio actor periodically sends heartbeats, advances SessionTracker, cancels runtime on terminal errors, and supports manual refresh via LivenessHandle; includes full test suite with mock client and deterministic Notify-based coordination. |
| Process pool executor management | components/spider-execution-manager/src/process_pool.rs | Supervisor spawns and manages executor subprocesses, serializes concurrent requests via mutex, races against hard timeout, detects crashes via stdout EOF, and transparently respawns on failure. |
| Integration test harness and test tasks | tests/huntsman/integration-test-tasks/Cargo.toml, tests/huntsman/integration-test-tasks/src/lib.rs, tests/huntsman/task-executor/Cargo.toml, tests/huntsman/task-executor/src/lib.rs, taskfiles/test.yaml | Introduces test tasks (fibonacci, always_fail, always_panic, instrument) as a TDL cdylib; test harness spawns executor binary, frames protocol messages, and provides payload builders; build config stages TDL artifacts. |
| Integration tests for executor and process pool | tests/huntsman/task-executor/tests/test_executor.rs, tests/huntsman/task-executor/tests/test_process_pool.rs, tests/huntsman/task-executor/tests/overhead_instrument.rs | Tests verify executor correctness (output, errors, crashes), process pool semantics (dispatch, failure modes, respawn), and measure IPC/execution latency; includes crash recovery and timeout/retry scenarios. |
| Workspace and test configuration updates | Cargo.toml, tests/huntsman/tdl-integration/tests/complex.rs | Workspace members extended to include execution manager and test crates; test config updated to build executor binary and stage TDL artifacts; existing TDL test adapted to new API. |
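The walkthrough mentions framed messages on the executor's stdin and stdout but does not say how frames are delimited. As an illustration only, the sketch below assumes a 4-byte big-endian length prefix followed by the payload; `write_frame` and `read_frame` are hypothetical helpers, and the payload would be the bincode-encoded Request or Response.

```rust
use std::io::{self, Read, Write};

/// Writes one frame: an assumed 4-byte big-endian length prefix, then the payload.
fn write_frame<W: Write>(writer: &mut W, payload: &[u8]) -> io::Result<()> {
    let len = u32::try_from(payload.len())
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidInput, "frame too large"))?;
    writer.write_all(&len.to_be_bytes())?;
    writer.write_all(payload)
}

/// Reads one frame. Returns `Ok(None)` when the stream ends before a full length
/// prefix arrives, which a supervisor could treat as the peer having exited.
fn read_frame<R: Read>(reader: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    match reader.read_exact(&mut len_buf) {
        Ok(()) => {}
        Err(err) if err.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(err) => return Err(err),
    }
    let mut payload = vec![0u8; u32::from_be_bytes(len_buf) as usize];
    // A stream that ends mid-payload surfaces as an UnexpectedEof error.
    reader.read_exact(&mut payload)?;
    Ok(Some(payload))
}
```

Length-prefixed framing lets the reader distinguish an empty payload from end of stream, which matters when EOF on stdout is used as the crash signal.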

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • y-scope/spider#317: Introduces changes to ExecutorError and TdlPackageManager that this PR builds upon and refines for wire-protocol serialization.

Suggested reviewers

  • sitaowang1998
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main feature addition: a liveness actor with session ID tracker for the spider-execution-manager component. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
components/spider-execution-manager/src/process_pool.rs (1)

163-173: ⚡ Quick win

Serialize the request before taking the executor mutex.

build_request() only does local encoding, but it currently runs after Line 163 inside the same mutex scope that guards the child process. Moving it ahead of the lock keeps large input serializations and local encoding failures from extending head-of-line blocking on the single executor.
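The reordering suggested here can be illustrated with a synchronous stand-in. This sketch swaps the async mutex for `std::sync::Mutex`; `Pool`, `Executor`, `DispatchError`, and this toy `build_request` are hypothetical and only mirror the shape of the code under review.

```rust
use std::sync::Mutex;

#[derive(Debug, PartialEq)]
enum DispatchError {
    Encode(String),
    NotRunning,
}

/// Stand-in for the child process handle; it just records the frames it was sent.
struct Executor {
    sent: Vec<Vec<u8>>,
}

struct Pool {
    handle: Mutex<Option<Executor>>,
}

/// Toy encoder: purely local work that can fail and needs no access to the child.
fn build_request(input: &str) -> Result<Vec<u8>, DispatchError> {
    if input.is_empty() {
        return Err(DispatchError::Encode("empty input".into()));
    }
    Ok(input.as_bytes().to_vec())
}

impl Pool {
    /// Encodes first, then takes the lock only for the exchange with the executor.
    fn dispatch(&self, input: &str) -> Result<usize, DispatchError> {
        // Encoding failures return here without ever touching the mutex.
        let frame = build_request(input)?;
        let mut guard = self.handle.lock().expect("executor mutex poisoned");
        let executor = guard.as_mut().ok_or(DispatchError::NotRunning)?;
        executor.sent.push(frame);
        Ok(executor.sent.len())
    }
}
```

With this ordering the critical section covers only the work that needs exclusive access to the executor, so a slow or failing encode does not delay other callers waiting on the lock.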

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/spider-execution-manager/src/process_pool.rs` around lines 163 -
173, Call build_request(request) before acquiring the executor mutex so local
serialization/encoding work does not hold the child-process lock; specifically,
move the build_request(request)? invocation out of the critical section that
surrounds self.handle.lock().await and handle.run(...).await, so you compute
frame_request (via build_request) first, then acquire the mutex
(self.handle.lock().await), get handle
(handle_guard.as_mut().ok_or(InternalError::NotRunning)?), log and call
handle.run(frame_request, hard_timeout).await.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: fed459f0-c414-490d-9e84-a0fc3c793720

📥 Commits

Reviewing files that changed from the base of the PR and between aadb9eb and 49b34d2.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (27)
  • Cargo.toml
  • components/spider-core/Cargo.toml
  • components/spider-core/src/lib.rs
  • components/spider-core/src/session.rs
  • components/spider-execution-manager/Cargo.toml
  • components/spider-execution-manager/src/client.rs
  • components/spider-execution-manager/src/client/liveness.rs
  • components/spider-execution-manager/src/client/scheduler.rs
  • components/spider-execution-manager/src/client/storage.rs
  • components/spider-execution-manager/src/lib.rs
  • components/spider-execution-manager/src/liveness.rs
  • components/spider-execution-manager/src/process_pool.rs
  • components/spider-task-executor/Cargo.toml
  • components/spider-task-executor/src/bin/spider_task_executor.rs
  • components/spider-task-executor/src/error.rs
  • components/spider-task-executor/src/lib.rs
  • components/spider-task-executor/src/manager.rs
  • components/spider-task-executor/src/protocol.rs
  • taskfiles/test.yaml
  • tests/huntsman/integration-test-tasks/Cargo.toml
  • tests/huntsman/integration-test-tasks/src/lib.rs
  • tests/huntsman/task-executor/Cargo.toml
  • tests/huntsman/task-executor/src/lib.rs
  • tests/huntsman/task-executor/tests/overhead_instrument.rs
  • tests/huntsman/task-executor/tests/test_executor.rs
  • tests/huntsman/task-executor/tests/test_process_pool.rs
  • tests/huntsman/tdl-integration/tests/complex.rs

Comment on lines +278 to +325
async fn run(&mut self, request: Request, hard_timeout: Duration) -> Outcome {
    let bytes = bincode::serialize(&request).expect("bincode encode Request");
    if let Err(err) = self.requests.send(Bytes::from(bytes)).await {
        tracing::warn!(
            executor_id = self.executor_id,
            err = ? err,
            "Failed to send request to executor."
        );
        return Outcome::ExecutorCrash {
            exit_status: self.poll_exit_code(),
        };
    }

    tokio::select! {
        biased;
        frame = self.responses.next() => match frame {
            Some(Ok(bytes)) => match bincode::deserialize::<Response>(&bytes) {
                Ok(Response::Result { outcome, elapsed_us }) => match outcome {
                    ExecutorOutcome::Success { outputs } => {
                        Outcome::Success { outputs, elapsed_us }
                    }
                    ExecutorOutcome::Failure { error } => {
                        Outcome::InTaskFailure { error, elapsed_us }
                    }
                },
                Err(err) => {
                    tracing::error!(
                        executor_id = self.executor_id,
                        err = ? err,
                        "Failed to decode executor's response. Considered as crashed."
                    );
                    Outcome::ExecutorCrash { exit_status: self.poll_exit_code() }
                }
            },
            Some(Err(err)) => {
                tracing::error!(
                    executor_id = self.executor_id,
                    err = ? err,
                    "Failed to receive executor's response."
                );
                Outcome::ExecutorCrash { exit_status: self.poll_exit_code() }
            }
            None => Outcome::ExecutorCrash { exit_status: self.poll_exit_code() },
        },
        () = tokio::time::sleep(hard_timeout) => {
            tracing::warn!(executor_id = self.executor_id, "Executor time out triggered.");
            Outcome::Timeout { hard_timeout }
        }

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the run() implementation and inspect the exact stdin write / pipe handling around the referenced lines.
FILE="components/spider-execution-manager/src/process_pool.rs"

# Show a window around the snippet lines
sed -n '240,360p' "$FILE" | cat -n

# Find where stdin (or any pipe writer) is written in run()
rg -n "stdin|write_all|AsyncWriteExt|pipe|requests\.send|Bytes::from" "$FILE"

# Print the send-related code with some surrounding context (where possible)
rg -n "self\.requests\.send" "$FILE" && \
  sed -n '260,330p' "$FILE" | cat -n

# Also locate poll_exit_code and respawn/timeout path logic for context
rg -n "poll_exit_code|respawn|Timeout|ExecutorCrash" "$FILE"

Repository: y-scope/spider

Length of output: 11483


Cover the stdin write with hard_timeout
In ExecutorHandle::run, the hard_timeout tokio::select! doesn’t start until after self.requests.send(...).await completes. If the child stops reading stdin (or the framed writer backpressures), that send can block indefinitely, preventing the timeout (and subsequent crash/respawn handling) from triggering. Move the timeout to cover the full send+receive exchange (e.g., tokio::time::timeout(hard_timeout, async { send; recv }) or include the send future in the same select!).
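The idea of one deadline covering both the send and the receive can be shown with a thread-based analogue that needs no async runtime. This is not the tokio fix itself: it swaps `tokio::time::timeout` for `recv_timeout` on a helper thread, uses a rendezvous channel to imitate a child that has stopped reading stdin, and `run_exchange` plus this simplified `Outcome` are hypothetical.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError, SyncSender};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Outcome {
    Success(Vec<u8>),
    ExecutorCrash,
    Timeout,
}

/// Performs send + receive on a helper thread and bounds the whole exchange
/// with a single deadline, so a blocked send cannot outlive `hard_timeout`.
fn run_exchange(
    requests: SyncSender<Vec<u8>>,
    responses: Receiver<Vec<u8>>,
    request: Vec<u8>,
    hard_timeout: Duration,
) -> Outcome {
    let (done_tx, done_rx) = mpsc::channel();
    thread::spawn(move || {
        // Either step may block; both run inside the timed window.
        let result = requests
            .send(request)
            .ok()
            .and_then(|()| responses.recv().ok());
        // The caller may already have timed out and dropped the receiver.
        let _ = done_tx.send(result);
    });
    match done_rx.recv_timeout(hard_timeout) {
        Ok(Some(bytes)) => Outcome::Success(bytes),
        // A closed request or response channel stands in for a dead executor.
        Ok(None) | Err(RecvTimeoutError::Disconnected) => Outcome::ExecutorCrash,
        Err(RecvTimeoutError::Timeout) => Outcome::Timeout,
    }
}
```

In this analogue the helper thread stays blocked after a timeout, so a real caller would still need to kill and respawn the executor. In the async version, dropping the timed-out future cancels the pending send directly.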

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/spider-execution-manager/src/process_pool.rs` around lines 278 -
325, The current run function (async fn run(&mut self, request: Request,
hard_timeout: Duration) -> Outcome) only starts the tokio::select! timeout after
awaiting self.requests.send(...).await, so a blocked send can prevent the
hard_timeout from ever firing; change run to cover the full send+receive window
by moving the timeout to wrap both send and response handling (e.g., use
tokio::time::timeout(hard_timeout, async {
self.requests.send(Bytes::from(bytes)).await?; self.responses.next().await }) or
include the send future in the same tokio::select! alongside responses and the
sleep), ensuring the send call (self.requests.send) is protected by hard_timeout
and still returns Outcome::Timeout on expiration.

Comment on lines +73 to +77
let pkg = if let Some(pkg) = manager.get(package) {
    pkg
} else {
    let path = pkg_dir.join(package).join(format!("lib{package}.so"));
    manager.load(&path)?

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate package identifiers before filesystem path construction

Line 76 builds a library path from package without validating components. A crafted value (for example containing separators or ..) can escape the intended package directory and load an unintended shared object.

Proposed fix
 use std::{
-    path::{Path, PathBuf},
+    path::{Component, Path, PathBuf},
     time::Instant,
 };
@@
 fn run_task(
@@
 ) -> Result<Vec<u8>, ExecutorError> {
     let pkg = if let Some(pkg) = manager.get(package) {
         pkg
     } else {
+        let mut components = Path::new(package).components();
+        let valid_package = matches!(components.next(), Some(Component::Normal(_)))
+            && components.next().is_none();
+        if !valid_package {
+            return Err(ExecutorError::InvalidLibrary(format!(
+                "invalid package identifier: {package}"
+            )));
+        }
         let path = pkg_dir.join(package).join(format!("lib{package}.so"));
         manager.load(&path)?
     };
     pkg.execute_task(task_func, raw_ctx, raw_inputs)
 }
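As a standalone illustration of the check, the sketch below combines the single-component test from the proposed fix with the character whitelist mentioned in the review prompt. `is_valid_package_id` and `library_path` are hypothetical helpers, and the whitelist `[A-Za-z0-9_-]` is the reviewer's example rather than a confirmed naming rule for packages.

```rust
use std::path::{Component, Path, PathBuf};

/// Accepts only identifiers that are a single normal path component and
/// consist solely of ASCII letters, digits, '_' and '-'.
fn is_valid_package_id(package: &str) -> bool {
    let allowed_chars = !package.is_empty()
        && package
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-');
    let mut components = Path::new(package).components();
    let single_normal = matches!(components.next(), Some(Component::Normal(_)))
        && components.next().is_none();
    allowed_chars && single_normal
}

/// Builds `<pkg_dir>/<package>/lib<package>.so` only for validated identifiers.
fn library_path(pkg_dir: &Path, package: &str) -> Option<PathBuf> {
    if !is_valid_package_id(package) {
        return None;
    }
    Some(pkg_dir.join(package).join(format!("lib{package}.so")))
}
```

The whitelist is the stricter of the two checks. `Path::components` normalizes away a trailing separator and non-leading `.` components, so an input such as `a/.` parses as a single normal component and would pass a components-only check, even though it still contains a separator once it is interpolated into the library file name.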
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/spider-task-executor/src/bin/spider_task_executor.rs` around lines
73 - 77, The package identifier is used directly to build a filesystem path in
the else branch (the block that calls manager.get(package) and
manager.load(&path)), allowing path traversal; before joining
pkg_dir.join(package) validate/sanitize `package` (e.g., reject empty strings,
any path separators like '/' or '\\', any ".." components, and allow only a safe
whitelist such as [A-Za-z0-9_-]); if validation fails return an error instead of
constructing the path; apply this check where you construct `path` and before
calling `manager.load` so only safe package names are used.
