feat(spider-task-executor): Add executor binary with bincode wire protocol and integration tests. #325

Open
LinZhihao-723 wants to merge 5 commits into y-scope:main from LinZhihao-723:task-executor-impl


@LinZhihao-723 LinZhihao-723 commented May 14, 2026

Description

Wire protocol (spider-task-executor/src/protocol.rs)

A new protocol module on the spider-task-executor library defines the three wire types the execution manager will use to drive the child:

  • Request::Execute { tdl_context, raw_ctx, raw_inputs } and Request::Shutdown. The parent owns the hard timeout in its entirety; the executor has no notion of timeouts and the request carries no deadline.
  • Response::Result { outcome, elapsed_us } — exactly one per Execute. elapsed_us is the in-FFI wall-clock measured by the executor and is what the overhead instrument uses to separate executor-side cost from parent-side IPC.
  • ExecutorOutcome::Success { outputs } | Failure { error }: outputs is the wire-format TaskOutputsSerializer buffer, ready to forward to storage; error is the msgpack-encoded ExecutorError.

Stderr is not carried over the protocol; how the spawner disposes of the executor's stderr (inherit / pipe / log file) is a parent-side decision.
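
As a rough illustration of the shape described above, the three wire types might look like the following. This is a std-only sketch rather than the code in protocol.rs: the real types are serialized with bincode, the `Vec<u8>` field types are assumptions, and `execute_once` is a hypothetical helper that shows the one-response-per-Execute rule and where `elapsed_us` would be measured.

```rust
use std::time::Instant;

/// Requests sent by the parent. The byte-buffer field types are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Request {
    Execute {
        tdl_context: Vec<u8>,
        raw_ctx: Vec<u8>,
        raw_inputs: Vec<u8>,
    },
    Shutdown,
}

/// Outcome of a single task execution.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ExecutorOutcome {
    /// Serialized outputs, ready to forward to storage.
    Success { outputs: Vec<u8> },
    /// Encoded error payload (msgpack in the PR; opaque bytes here).
    Failure { error: Vec<u8> },
}

/// Responses sent by the executor: exactly one per `Request::Execute`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Response {
    Result {
        outcome: ExecutorOutcome,
        elapsed_us: u64,
    },
}

/// Hypothetical helper: runs `task` inline and wraps its result in one `Response::Result`.
/// No deadline is enforced here; the parent owns the timeout.
pub fn execute_once<F>(raw_inputs: &[u8], task: F) -> Response
where
    F: FnOnce(&[u8]) -> Result<Vec<u8>, Vec<u8>>,
{
    let start = Instant::now();
    let result = task(raw_inputs);
    // Wall-clock time spent inside the task call, truncated to u64 microseconds.
    let elapsed_us = start.elapsed().as_micros() as u64;
    let outcome = match result {
        Ok(outputs) => ExecutorOutcome::Success { outputs },
        Err(error) => ExecutorOutcome::Failure { error },
    };
    Response::Result { outcome, elapsed_us }
}
```

Keeping `elapsed_us` on the response rather than deriving it on the parent is what lets the overhead measurement separate executor-side cost from IPC cost.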

Executor binary (spider-task-executor/src/bin/spider_task_executor.rs)

Single-threaded tokio runtime; requests are processed strictly sequentially with exactly one task running for the lifetime of the process. Tokio is here only to match the async I/O surface the execution manager uses (tokio_util::codec::LengthDelimitedCodec); the executor itself has no concurrency requirements.
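
To make the framing concrete, here is a minimal synchronous sketch of length-delimited frames processed strictly one at a time. It assumes a 4-byte big-endian length header (the default layout of `LengthDelimitedCodec`, to the best of my knowledge) and uses blocking `std::io` instead of tokio; `write_frame`, `read_frame`, and `serve` are hypothetical names, not functions from the executor binary.

```rust
use std::io::{self, Read, Write};

/// Writes one frame: a 4-byte big-endian length header followed by the payload.
pub fn write_frame<W: Write>(writer: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > u32::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame too large"));
    }
    writer.write_all(&(payload.len() as u32).to_be_bytes())?;
    writer.write_all(payload)?;
    writer.flush()
}

/// Reads one frame. Returns `Ok(None)` on a clean EOF before any header byte,
/// and an error if the stream ends in the middle of a frame.
pub fn read_frame<R: Read>(reader: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut header = [0u8; 4];
    let mut filled = 0;
    while filled < header.len() {
        let n = reader.read(&mut header[filled..])?;
        if n == 0 {
            return if filled == 0 {
                Ok(None)
            } else {
                Err(io::Error::new(
                    io::ErrorKind::UnexpectedEof,
                    "EOF inside frame header",
                ))
            };
        }
        filled += n;
    }
    let len = u32::from_be_bytes(header) as usize;
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload)?;
    Ok(Some(payload))
}

/// Handles frames sequentially until EOF, writing one reply frame per request frame.
/// Returns the number of requests served.
pub fn serve<R, W, F>(reader: &mut R, writer: &mut W, mut handle: F) -> io::Result<usize>
where
    R: Read,
    W: Write,
    F: FnMut(&[u8]) -> Vec<u8>,
{
    let mut served = 0;
    while let Some(request) = read_frame(reader)? {
        let reply = handle(request.as_slice());
        write_frame(writer, &reply)?;
        served += 1;
    }
    Ok(served)
}
```

The `Ok(None)` case is also the signal a parent would observe when the child dies before replying: EOF with no frame.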

Key shape decisions:

  • The FFI call runs inline on the runtime thread — no second OS thread, no oneshot, no tokio::select!. The previous design dispatched the FFI on a std::thread so the runtime could select! an in-process timer against the FFI; with the timer responsibility consolidated on the parent, none of that scaffolding earns its keep.
  • SPIDER_TDL_PACKAGE_DIR is validated once at startup. If unset the binary exits non-zero before processing any request, which surfaces a deployment misconfiguration immediately rather than per-request.
  • Package resolution: ${SPIDER_TDL_PACKAGE_DIR}/<package>/lib<package>.so. The first request for a package dlopens the library; subsequent requests reuse the cached TdlPackage.
  • Tracing init: JSON, ANSI off, env-filtered, written to stderr so it doesn't pollute the framed-stdout protocol channel. Both tracing and tracing-subscriber are pulled in with default-features = false and only the features actually used (std for the macros; fmt, env-filter, json for the subscriber).
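
The startup check and package resolution rules above can be sketched as follows. This is an illustration under assumptions, not the binary's code: `LoadedPackage` stands in for the real `TdlPackage`, no `dlopen` is performed, and `PackageCache`, `package_dir_from_env`, and `resolve_package_path` are hypothetical names.

```rust
use std::collections::HashMap;
use std::env;
use std::path::{Path, PathBuf};

pub const PACKAGE_DIR_ENV: &str = "SPIDER_TDL_PACKAGE_DIR";

/// Reads the package directory once at startup. An `Err` here would make the
/// binary exit non-zero before it processes any request.
pub fn package_dir_from_env() -> Result<PathBuf, String> {
    env::var_os(PACKAGE_DIR_ENV)
        .map(PathBuf::from)
        .ok_or_else(|| format!("{} is not set", PACKAGE_DIR_ENV))
}

/// Builds `<package_dir>/<package>/lib<package>.so`.
pub fn resolve_package_path(package_dir: &Path, package: &str) -> PathBuf {
    package_dir
        .join(package)
        .join(format!("lib{}.so", package))
}

/// Stand-in for a loaded library; the real cache stores a `TdlPackage`.
#[derive(Debug)]
pub struct LoadedPackage {
    pub path: PathBuf,
}

/// Loads each package at most once and reuses it for later requests.
pub struct PackageCache {
    package_dir: PathBuf,
    loaded: HashMap<String, LoadedPackage>,
    /// Number of (simulated) library loads, exposed for illustration only.
    pub load_count: usize,
}

impl PackageCache {
    pub fn new(package_dir: PathBuf) -> Self {
        PackageCache {
            package_dir,
            loaded: HashMap::new(),
            load_count: 0,
        }
    }

    /// Returns the cached package, loading it on the first request.
    pub fn get_or_load(&mut self, package: &str) -> &LoadedPackage {
        if !self.loaded.contains_key(package) {
            let path = resolve_package_path(&self.package_dir, package);
            // The real executor would dlopen `path` here.
            self.loaded
                .insert(package.to_string(), LoadedPackage { path });
            self.load_count += 1;
        }
        &self.loaded[package]
    }
}
```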

ExecutorError is now wire-friendly

ExecutorError derives serde::Serialize/Deserialize so the binary can ship it across the protocol as Failure { error: rmp_serde::to_vec(&err) } and the EM can decode it back to a typed value. Three variants used to wrap external types that don't implement Serialize (libloading::Error, std::str::Utf8Error, rmp_serde::decode::Error); they now carry the Display rendering of the source error as a String. Explicit From impls preserve the lossless ? propagation in manager.rs. The wildcard matches!(err, ExecutorError::InvalidLibrary(_)) pattern in the existing huntsman/tdl-integration tests still compiles unchanged.
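
A reduced sketch of the String-carrying pattern, assuming only the standard library: the serde derives are omitted, `InvalidUtf8` is a hypothetical variant name, and `decode_name` is a made-up helper that exists only to show `?` still propagating through an explicit `From` impl.

```rust
use std::fmt;
use std::str::Utf8Error;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ExecutorError {
    /// Carries the `Display` text of the loader error instead of the error value.
    InvalidLibrary(String),
    /// Hypothetical variant name for a UTF-8 decoding failure.
    InvalidUtf8(String),
}

impl fmt::Display for ExecutorError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ExecutorError::InvalidLibrary(msg) => write!(f, "invalid library: {}", msg),
            ExecutorError::InvalidUtf8(msg) => write!(f, "invalid UTF-8: {}", msg),
        }
    }
}

impl std::error::Error for ExecutorError {}

/// Explicit conversion: keep the source error's `Display` rendering as a `String`.
impl From<Utf8Error> for ExecutorError {
    fn from(err: Utf8Error) -> Self {
        ExecutorError::InvalidUtf8(err.to_string())
    }
}

/// Made-up helper showing that `?` still works through the `From` impl.
pub fn decode_name(raw: &[u8]) -> Result<&str, ExecutorError> {
    Ok(std::str::from_utf8(raw)?)
}
```

Because each variant still holds a single payload, wildcard patterns such as `matches!(err, ExecutorError::InvalidLibrary(_))` compile the same way whether the payload is a `String` or the original error type.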

Integration test crates

tests/huntsman/integration-test-tasks (TDL package)

A cdylib + rlib package (crate-type = ["cdylib", "rlib"]) registered under TDL package name integration_test_tasks. The dual crate-type lets the bench reference compile-time constants (notably INSTRUMENT_SLEEP_US) while keeping the cdylib for dlopen from the executor.

Tasks:

  • fibonacci(index: u64) -> u64 — naive recursion; correctness check.
  • always_fail() -> Result — returns TdlError::ExecutionError.
  • always_panic() -> ! — panics; the panic crosses the extern "C" FFI boundary and aborts the process, which is exactly the crash signal the parent test asserts on.
  • instrument(items: Vec<String>) -> Vec<String> — sleeps for a fixed INSTRUMENT_SLEEP_US (50µs) and echoes the payload back. Used by the overhead bench.
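
For reference, a naive recursive Fibonacci of the kind described for the first task could look like this. It is shown as a plain function without the `#[task]` attribute or `TaskContext` parameter, and the zero-based indexing (`fibonacci(0) == 0`) is an assumption; under that convention index 10 yields 55.

```rust
/// Naive recursive Fibonacci with `fibonacci(0) == 0` and `fibonacci(1) == 1`.
/// Exponential time; intended only as a correctness check on small indices.
pub fn fibonacci(index: u64) -> u64 {
    match index {
        0 => 0,
        1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}
```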

tests/huntsman/task-executor (executor integration tests)

A library crate that provides an ExecutorHandle harness (spawn the binary, frame requests on stdin, decode responses from stdout) and two [[test]] binaries.

The harness panics with descriptive .expect(...) messages on protocol / I/O / decode failures rather than threading errors through every helper — these are infrastructure, not production code, and a panic with backtrace points at the failure site immediately. (This pattern surfaced one subtle bug during development: a stale .so left over from a task rename produced TaskNotFound("instrument") in the panic message verbatim, which was the entire diagnosis.)

tests/executor.rs covers:

  • fibonacci_returns_correct_value — encodes a single u64 input; asserts Success and that the decoded u64 equals 55.
  • always_fail_reports_task_error — asserts Failure whose msgpack-decoded ExecutorError is TaskError(TdlError::ExecutionError(_)) and whose message contains the task name.
  • always_panic_crashes_the_process — sends Execute, expects stdout EOF before any frame arrives, then waits for the child to exit non-zero.

tests/overhead_instrument.rs runs the instrument task ten times against a long-lived executor (so dlopen happens once during a discarded warm-up) and writes a markdown table at ${SPIDER_TEST_INSTRUMENT_OUTPUT_DIR}/task_executor_overhead.md. With the work portion held constant at 50µs the table separates four metrics:

| Metric                          | What it measures                                                         |
| ------------------------------- | ------------------------------------------------------------------------ |
| E2E (parent)                    | Instant-to-Instant around send(Execute) → recv(Response::Result)         |
| Executor FFI                    | elapsed_us reported by the executor (sleep + in-FFI input/output serde)  |
| Executor internal (FFI - sleep) | In-executor input/output serde alone                                     |
| IPC overhead (E2E - FFI)        | Parent-side framing + bincode + pipe traversal                           |

Sample output from a local run:

| Metric                          | Count | Avg (µs) | P50 (µs) | P95 (µs) | P99 (µs) |
| ------------------------------- | ----- | -------- | -------- | -------- | -------- |
| E2E (parent)                    | 10    | 476.00   | 479.64   | 602.06   | 602.06   |
| Executor FFI                    | 10    | 157.80   | 164.00   | 175.00   | 175.00   |
| Executor internal (FFI - sleep) | 10    | 107.80   | 114.00   | 125.00   | 125.00   |
| IPC overhead (E2E - FFI)        | 10    | 318.20   | 315.64   | 429.06   | 429.06   |
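
The two derived rows follow from simple subtraction, which the sketch below reproduces for the averages in the sample run. `Overhead` and `derive_overhead` are hypothetical names, and the sleep constant is written as an `f64` here for convenience (its type in the test package is not shown in this description).

```rust
/// Fixed sleep performed by the instrument task, in microseconds.
pub const INSTRUMENT_SLEEP_US: f64 = 50.0;

/// The two derived metrics, in microseconds.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Overhead {
    /// Executor FFI time minus the known sleep.
    pub executor_internal_us: f64,
    /// Parent-side end-to-end time minus executor FFI time.
    pub ipc_overhead_us: f64,
}

/// Derives both overhead figures from one pair of measurements.
pub fn derive_overhead(e2e_us: f64, ffi_us: f64) -> Overhead {
    Overhead {
        executor_internal_us: ffi_us - INSTRUMENT_SLEEP_US,
        ipc_overhead_us: e2e_us - ffi_us,
    }
}
```

Calling `derive_overhead(476.00, 157.80)` gives roughly 107.80 µs of executor-internal time and 318.20 µs of IPC overhead, matching the Avg column. Percentile columns are not expected to subtract this way in general: the IPC P95 of 429.06 is not 602.06 - 175.00, presumably because percentiles are taken over per-sample differences rather than over differences of percentiles.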

Taskfile changes (taskfiles/test.yaml)

spider-huntsman-unit-tests-executor now:

  1. Builds three artifacts in separate cargo build invocations (combining --package <cdylib> with --bin <name> would silently exclude the cdylibs from the target selection): huntsman-complex, integration-test-tasks, and spider-task-executor --bin spider-task-executor.
  2. Stages the cdylibs under build/tdl_packages/<package>/lib<package>.so — the standard layout the executor binary reads via ${SPIDER_TDL_PACKAGE_DIR}/<package>/lib<package>.so.
  3. Sets SPIDER_TDL_PACKAGE_DIR and SPIDER_TASK_EXECUTOR_BIN, and relocates SPIDER_TDL_PACKAGE_COMPLEX, so that all three point at the staged paths. Existing huntsman-complex consumers see no behavior change because their tests load the .so by absolute path.
  4. Invokes cargo nextest run --all --all-features --run-ignored all --release. The new tests are gated #[ignore] so plain cargo test (which doesn't go through the taskfile) can't accidentally try to run them without env vars set.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Ensure all workflows pass.
  • Add integration tests to test task executions that:
    • Return results.
    • Return errors.
    • Panic/crash.

Summary by CodeRabbit

  • New Features

    • Added a standalone task-executor subprocess with a length-delimited request/response protocol for running TDL package tasks and returning serialized outcomes and timing.
  • Tests

    • Added end-to-end integration tests covering success, failure, and crash scenarios.
    • Added performance instrumentation tests to measure round-trip latency and IPC overhead.
  • Chores

    • Expanded workspace and test harness to build and stage test packages and the executor binary.

@LinZhihao-723 LinZhihao-723 requested review from a team and sitaowang1998 as code owners May 14, 2026 23:44

coderabbitai Bot commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 88c771f9-e241-4960-8e8e-9b8124d35fe1

📥 Commits

Reviewing files that changed from the base of the PR and between e8253e9 and 5946d18.

📒 Files selected for processing (1)
  • tests/huntsman/task-executor/Cargo.toml
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/huntsman/task-executor/Cargo.toml

Walkthrough

This PR adds a spider-task-executor subprocess: a new bincode-framed IPC protocol and serializable errors, a binary that loads and executes TDL packages via FFI, updates to package manager APIs and workspace/build config, plus integration tests and a benchmark harness with test task packages.

Changes

Spider Task Executor Subprocess

| Layer | File(s) | Summary |
| --- | --- | --- |
| Protocol and error serialization foundation | components/spider-task-executor/src/protocol.rs, components/spider-task-executor/src/error.rs, components/spider-task-executor/src/lib.rs | Defines Request/Response/ExecutorOutcome enums for framed bincode communication and makes ExecutorError serializable by storing errors as String payloads with explicit From conversions instead of error-type wrappers. |
| Executor binary implementation | components/spider-task-executor/src/bin/spider_task_executor.rs | Implements spider-task-executor binary that boots a single-threaded Tokio runtime, reads length-delimited bincode Request frames from stdin, loads/executes TDL tasks with elapsed-time measurement, and writes Response frames to stdout with JSON tracing to stderr. |
| Package manager API update | components/spider-task-executor/src/manager.rs, tests/huntsman/tdl-integration/tests/complex.rs | Changes TdlPackageManager::load to return a reference to the loaded TdlPackage instead of a package name string, adds Debug derive to TdlPackage, and updates a test to use the new return shape. |
| Workspace and build configuration | Cargo.toml, components/spider-task-executor/Cargo.toml, taskfiles/test.yaml | Adds tests/huntsman/integration-test-tasks and tests/huntsman/task-executor to workspace members, declares spider-task-executor binary and runtime deps, and updates the test task to build and stage cdylib artifacts into a TDL package layout and expose env vars. |
| Integration test harness infrastructure | tests/huntsman/task-executor/Cargo.toml, tests/huntsman/task-executor/src/lib.rs | Provides ExecutorHandle to spawn the executor subprocess, manage framed stdin/stdout bincode Request/Response exchange, and helpers to build/encode task contexts, inputs, and decode outputs. |
| Integration test task definitions | tests/huntsman/integration-test-tasks/Cargo.toml, tests/huntsman/integration-test-tasks/src/lib.rs | Adds a test TDL package integration_test_tasks with fibonacci, always_fail, always_panic, and instrument tasks (with INSTRUMENT_SLEEP_US constant). |
| Integration test cases and benchmarking | tests/huntsman/task-executor/tests/test_executor.rs, tests/huntsman/task-executor/tests/overhead_instrument.rs | Adds end-to-end tests for correctness, error propagation, and panic crash semantics, plus an ignored overhead benchmark that collects E2E, FFI, internal, and IPC overhead metrics and writes a Markdown report. |

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant Executor
  participant TdlPackageManager
  participant TdlPackage
  Client->>Executor: Send Request::Execute (stdin, bincode + length-delim)
  Executor->>TdlPackageManager: get or load package by name
  TdlPackageManager->>TdlPackage: load shared lib from ${SPIDER_TDL_PACKAGE_DIR}
  TdlPackage-->>Executor: TdlPackage reference
  Executor->>Executor: call task FFI, measure elapsed_us
  Executor->>Client: Send Response::Result { outcome, elapsed_us } (stdout, bincode + length-delim)
  Client->>Executor: Send Request::Shutdown
  Executor->>Executor: exit loop and terminate

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • y-scope/spider#317: Both PRs touch components/spider-task-executor (notably src/error.rs and src/manager.rs) and extend the executor/protocol work introduced in that PR.

Suggested reviewers

  • sitaowang1998
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main changes: adding an executor binary with bincode protocol and integration tests. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
tests/huntsman/integration-test-tasks/src/lib.rs (1)

17-17: 💤 Low value

Consider the reliability of a 50-microsecond sleep for benchmarking.

The INSTRUMENT_SLEEP_US constant sets a 50-microsecond sleep, which is quite short. On Linux, sleep() with sub-millisecond durations may be subject to scheduler granularity and could have higher variance. Since this is used for overhead measurement in benchmark tests, ensure the results account for potential timing imprecision.
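
One way to gauge how much this matters on a given machine is to time the sleep directly. The sketch below is an assumption-laden illustration, not part of the PR: `measure_sleep` is a hypothetical helper that records how long each `std::thread::sleep` call actually takes, so the gap between the requested 50 µs and the observed durations can be inspected.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Sleeps `iterations` times for `requested` and returns the observed wall-clock
/// duration of each sleep. `thread::sleep` may overshoot but is documented to
/// never sleep less than the requested duration.
pub fn measure_sleep(requested: Duration, iterations: usize) -> Vec<Duration> {
    (0..iterations)
        .map(|_| {
            let start = Instant::now();
            thread::sleep(requested);
            start.elapsed()
        })
        .collect()
}
```

Any overshoot measured this way is counted as serde cost in the "Executor internal (FFI - sleep)" row, because that row subtracts the nominal sleep rather than the observed one.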

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/huntsman/integration-test-tasks/src/lib.rs` at line 17, The 50µs
constant INSTRUMENT_SLEEP_US is too short and may suffer scheduler jitter; to
fix, increase it to a more reliable value (e.g., 1000 or 5000 µs) or make it
configurable so tests can select a stable duration (via an environment variable
or test flag) and update any places using INSTRUMENT_SLEEP_US to read the
configurable value; ensure the constant name and usages (INSTRUMENT_SLEEP_US)
are adjusted and document the change in the test notes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 8d75d100-9dd1-4402-8d2a-b879ab41f55d

📥 Commits

Reviewing files that changed from the base of the PR and between aadb9eb and 777cff5.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (15)
  • Cargo.toml
  • components/spider-task-executor/Cargo.toml
  • components/spider-task-executor/src/bin/spider_task_executor.rs
  • components/spider-task-executor/src/error.rs
  • components/spider-task-executor/src/lib.rs
  • components/spider-task-executor/src/manager.rs
  • components/spider-task-executor/src/protocol.rs
  • taskfiles/test.yaml
  • tests/huntsman/integration-test-tasks/Cargo.toml
  • tests/huntsman/integration-test-tasks/src/lib.rs
  • tests/huntsman/task-executor/Cargo.toml
  • tests/huntsman/task-executor/src/lib.rs
  • tests/huntsman/task-executor/tests/executor.rs
  • tests/huntsman/task-executor/tests/overhead_instrument.rs
  • tests/huntsman/tdl-integration/tests/complex.rs

Comment on lines +35 to +51
/// Initializes tracing logging.
fn init_tracing() {
    // Send tracing output to stderr so it doesn't pollute the framed-stdout protocol channel.
    tracing_subscriber::fmt()
        .event_format(
            tracing_subscriber::fmt::format()
                .with_level(true)
                .with_target(false)
                .with_file(true)
                .with_line_number(true)
                .json(),
        )
        .with_env_filter(tracing_subscriber::EnvFilter::from_default_env())
        .with_ansi(false)
        .with_writer(std::io::stderr)
        .init();
}
Collaborator


I think many more components will need to initialize tracing later on. How about putting this into a util file that is shared among components?

Member Author


I was thinking the same thing, but I realized other components may require different log rotation strategies and output destinations: for example, this component only writes logs to stderr, but that might not be true for the EM service.

Comment thread taskfiles/test.yaml
Comment on lines +245 to +248
cp "{{.G_RUST_RELEASE_DIR}}/libhuntsman_complex.so" \
  "{{.G_TDL_PACKAGES_DIR}}/complex/libcomplex.so"
cp "{{.G_RUST_RELEASE_DIR}}/libintegration_test_tasks.so" \
  "{{.G_TDL_PACKAGES_DIR}}/integration_test_tasks/libintegration_test_tasks.so"
Collaborator


Maybe use soft links instead of copying the .so files around?

Member Author


I think our preference is to copy things when possible. Soft links can be more confusing when we need to debug it...

Collaborator


But the objective of the tests is to find out if anything could go wrong instead of avoiding them. We need to make sure that our library loading works with valid soft links.

/// The fixed-cost body lets the overhead bench subtract the known sleep from the executor's
/// reported FFI duration, isolating the in-executor input/output serde overhead.
#[task(name = "instrument")]
pub fn instrument(_ctx: TaskContext, items: Vec<String>) -> Result<Vec<String>, TdlError> {
Collaborator


instrument is not a descriptive name. How about sleep_and_echo?


2 participants