feat(spider-task-executor): Add executor binary with bincode wire protocol and integration tests.#325
feat(spider-task-executor): Add executor binary with bincode wire protocol and integration tests.#325LinZhihao-723 wants to merge 5 commits into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughThis PR adds a spider-task-executor subprocess: a new bincode-framed IPC protocol and serializable errors, a binary that loads and executes TDL packages via FFI, updates to package manager APIs and workspace/build config, plus integration tests and a benchmark harness with test task packages. ChangesSpider Task Executor Subprocess
Sequence Diagram(s)sequenceDiagram
participant Client
participant Executor
participant TdlPackageManager
participant TdlPackage
Client->>Executor: Send Request::Execute (stdin, bincode + length-delim)
Executor->>TdlPackageManager: get or load package by name
TdlPackageManager->>TdlPackage: load shared lib from ${SPIDER_TDL_PACKAGE_DIR}
TdlPackage-->>Executor: TdlPackage reference
Executor->>Executor: call task FFI, measure elapsed_us
Executor->>Client: Send Response::Result { outcome, elapsed_us } (stdout, bincode + length-delim)
Client->>Executor: Send Request::Shutdown
Executor->>Executor: exit loop and terminate
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/huntsman/integration-test-tasks/src/lib.rs (1)
17-17: 💤 Low valueConsider the reliability of a 50-microsecond sleep for benchmarking.
The
INSTRUMENT_SLEEP_USconstant sets a 50-microsecond sleep, which is quite short. On Linux,sleep()with sub-millisecond durations may be subject to scheduler granularity and could have higher variance. Since this is used for overhead measurement in benchmark tests, ensure the results account for potential timing imprecision.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/huntsman/integration-test-tasks/src/lib.rs` at line 17, The 50µs constant INSTRUMENT_SLEEP_US is too short and may suffer scheduler jitter; to fix, increase it to a more reliable value (e.g., 1000 or 5000 µs) or make it configurable so tests can select a stable duration (via an environment variable or test flag) and update any places using INSTRUMENT_SLEEP_US to read the configurable value; ensure the constant name and usages (INSTRUMENT_SLEEP_US) are adjusted and document the change in the test notes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/huntsman/integration-test-tasks/src/lib.rs`:
- Line 17: The 50µs constant INSTRUMENT_SLEEP_US is too short and may suffer
scheduler jitter; to fix, increase it to a more reliable value (e.g., 1000 or
5000 µs) or make it configurable so tests can select a stable duration (via an
environment variable or test flag) and update any places using
INSTRUMENT_SLEEP_US to read the configurable value; ensure the constant name and
usages (INSTRUMENT_SLEEP_US) are adjusted and document the change in the test
notes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 8d75d100-9dd1-4402-8d2a-b879ab41f55d
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (15)
Cargo.tomlcomponents/spider-task-executor/Cargo.tomlcomponents/spider-task-executor/src/bin/spider_task_executor.rscomponents/spider-task-executor/src/error.rscomponents/spider-task-executor/src/lib.rscomponents/spider-task-executor/src/manager.rscomponents/spider-task-executor/src/protocol.rstaskfiles/test.yamltests/huntsman/integration-test-tasks/Cargo.tomltests/huntsman/integration-test-tasks/src/lib.rstests/huntsman/task-executor/Cargo.tomltests/huntsman/task-executor/src/lib.rstests/huntsman/task-executor/tests/executor.rstests/huntsman/task-executor/tests/overhead_instrument.rstests/huntsman/tdl-integration/tests/complex.rs
| /// Initializes tracing logging. | ||
| fn init_tracing() { | ||
| // Send tracing output to stderr so it doesn't pollute the framed-stdout protocol channel. | ||
| tracing_subscriber::fmt() | ||
| .event_format( | ||
| tracing_subscriber::fmt::format() | ||
| .with_level(true) | ||
| .with_target(false) | ||
| .with_file(true) | ||
| .with_line_number(true) | ||
| .json(), | ||
| ) | ||
| .with_env_filter(tracing_subscriber::EnvFilter::from_default_env()) | ||
| .with_ansi(false) | ||
| .with_writer(std::io::stderr) | ||
| .init(); | ||
| } |
There was a problem hiding this comment.
I think many more components will need to initialize tracing later one. How about putting this into a util file that is shared among components?
There was a problem hiding this comment.
I was thinking the same though. But I realized other components may require different log rotation strategies and output sources: for example, in this component logs are only written to stderr, but it might not be true for the EM service.
| cp "{{.G_RUST_RELEASE_DIR}}/libhuntsman_complex.so" \ | ||
| "{{.G_TDL_PACKAGES_DIR}}/complex/libcomplex.so" | ||
| cp "{{.G_RUST_RELEASE_DIR}}/libintegration_test_tasks.so" \ | ||
| "{{.G_TDL_PACKAGES_DIR}}/integration_test_tasks/libintegration_test_tasks.so" |
There was a problem hiding this comment.
Maybe use soft link instead of coping so files around?
There was a problem hiding this comment.
I think our preference is to copy things when possible. Soft links can be more confusing when we need to debug it...
There was a problem hiding this comment.
But the objective of the tests is to find out if anything could go wrong instead of avoiding them. We need to make sure that our library loading works with valid soft links.
| /// The fixed-cost body lets the overhead bench subtract the known sleep from the executor's | ||
| /// reported FFI duration, isolating the in-executor input/output serde overhead. | ||
| #[task(name = "instrument")] | ||
| pub fn instrument(_ctx: TaskContext, items: Vec<String>) -> Result<Vec<String>, TdlError> { |
There was a problem hiding this comment.
instrument is not a descriptive name. How about sleep_and_echo?
Description
Wire protocol (
spider-task-executor/src/protocol.rs)A new
protocolmodule on thespider-task-executorlibrary defines the three wire types the execution manager will use to drive the child:Request::Execute { tdl_context, raw_ctx, raw_inputs }andRequest::Shutdown. The parent owns the hard timeout in its entirety; the executor has no notion of timeouts and the request carries no deadline.Response::Result { outcome, elapsed_us }— exactly one perExecute.elapsed_usis the in-FFI wall-clock measured by the executor and is what the overhead instrument uses to separate executor-side cost from parent-side IPC.ExecutorOutcome::Success { outputs } | Failure { error }—outputsis the wire-formatTaskOutputsSerializerbuffer ready to forward to storage;erroris the msgpack-encodedExecutorError.Stderr is not carried over the protocol; how the spawner disposes of the executor's stderr (inherit / pipe / log file) is a parent-side decision.
Executor binary (
spider-task-executor/src/bin/spider_task_executor.rs)Single-threaded tokio runtime; requests are processed strictly sequentially with exactly one task running for the lifetime of the process. Tokio is here only to match the async I/O surface the execution manager uses (
tokio_util::codec::LengthDelimitedCodec); the executor itself has no concurrency requirements.Key shape decisions:
oneshot, notokio::select!. The previous design dispatched the FFI on astd::threadso the runtime couldselect!an in-process timer against the FFI; with the timer responsibility consolidated on the parent, none of that scaffolding earns its keep.SPIDER_TDL_PACKAGE_DIRis validated once at startup. If unset the binary exits non-zero before processing any request, which surfaces a deployment misconfiguration immediately rather than per-request.${SPIDER_TDL_PACKAGE_DIR}/<package>/lib<package>.so. The first request for a package dlopens the library; subsequent requests reuse the cachedTdlPackage.tracingandtracing-subscriberare pulled in withdefault-features = falseand only the features actually used (stdfor the macros;fmt,env-filter,jsonfor the subscriber).ExecutorErroris now wire-friendlyExecutorErrorderivesserde::Serialize/Deserializeso the binary can ship it across the protocol asFailure { error: rmp_serde::to_vec(&err) }and the EM can decode it back to a typed value. Three variants used to wrap external types that don't implementSerialize(libloading::Error,std::str::Utf8Error,rmp_serde::decode::Error); they now carry theDisplayrendering of the source error as aString. ExplicitFromimpls preserve the lossless?propagation inmanager.rs. The wildcardmatches!(err, ExecutorError::InvalidLibrary(_))pattern in the existinghuntsman/tdl-integrationtests still compiles unchanged.Integration test crates
tests/huntsman/integration-test-tasks(TDL package)A cdylib + rlib package (
crate-type = ["cdylib", "rlib"]) registered under TDL package nameintegration_test_tasks. The dual crate-type lets the bench reference compile-time constants (notablyINSTRUMENT_SLEEP_US) while keeping the cdylib fordlopenfrom the executor.Tasks:
fibonacci(index: u64) -> u64— naive recursion; correctness check.always_fail() -> Result— returnsTdlError::ExecutionError.always_panic() -> !— panics; the panic crosses theextern "C"FFI boundary and aborts the process, which is exactly the crash signal the parent test asserts on.instrument(items: Vec<String>) -> Vec<String>— sleeps for a fixedINSTRUMENT_SLEEP_US(50µs) and echoes the payload back. Used by the overhead bench.tests/huntsman/task-executor(executor integration tests)A library crate that provides an
ExecutorHandleharness (spawn the binary, frame requests on stdin, decode responses from stdout) and two[[test]]binaries.The harness panics with descriptive
.expect(...)messages on protocol / I/O / decode failures rather than threading errors through every helper — these are infrastructure, not production code, and a panic with backtrace points at the failure site immediately. (This pattern surfaced one subtle bug during development: a stale.soleft over from a task rename producedTaskNotFound("instrument")in the panic message verbatim, which was the entire diagnosis.)tests/executor.rscovers:fibonacci_returns_correct_value— encodes a singleu64input; assertsSuccessand that the decodedu64equals 55.always_fail_reports_task_error— assertsFailurewhose msgpack-decodedExecutorErrorisTaskError(TdlError::ExecutionError(_))and whose message contains the task name.always_panic_crashes_the_process— sendsExecute, expects stdout EOF before any frame arrives, then waits for the child to exit non-zero.tests/overhead_instrument.rsruns theinstrumenttask ten times against a long-lived executor (so dlopen happens once during a discarded warm-up) and writes a markdown table at${SPIDER_TEST_INSTRUMENT_OUTPUT_DIR}/task_executor_overhead.md. With the work portion held constant at 50µs the table separates four metrics:E2E (parent)Instant-to-Instantaroundsend(Execute)→recv(Response::Result)Executor FFIelapsed_usreported by the executor (sleep + in-FFI input/output serde)Executor internal (FFI - sleep)IPC overhead (E2E - FFI)Sample output from a local run:
Taskfile changes (
taskfiles/test.yaml)spider-huntsman-unit-tests-executornow:cargo buildinvocations (combining--package <cdylib>with--bin <name>would silently exclude the cdylibs from the target selection):huntsman-complex,integration-test-tasks, andspider-task-executor --bin spider-task-executor.build/tdl_packages/<package>/lib<package>.so— the standard layout the executor binary reads via${SPIDER_TDL_PACKAGE_DIR}/<package>/lib<package>.so.SPIDER_TDL_PACKAGE_DIR,SPIDER_TASK_EXECUTOR_BIN, and (relocates)SPIDER_TDL_PACKAGE_COMPLEXto point at the staged paths. Existinghuntsman-complexconsumers see no behavior change because their tests load the.soby absolute path.cargo nextest run --all --all-features --run-ignored all --release. The new tests are gated#[ignore]so plaincargo test(which doesn't go through the taskfile) can't accidentally try to run them without env vars set.Checklist
breaking change.
Validation performed
Summary by CodeRabbit
New Features
Tests
Chores