docs(sdk): codify FFI panic-safety contract at the napi/PyO3 boundary (#32)#50
Merged
Merged
Conversation
…#32) Audit outcome for #32 (FFI panic-unwind safety, napi 2.16 / pyo3 0.23): the boundary is panic-safe today, but the invariant holds by *convention*, not structure, so it can silently regress. Documents the contract in each SDK's module doc where future FFI edits will see it. Key finding (verified against the cargo cache, not memory): napi 2.x does NOT wrap sync #[napi] bodies in catch_unwind by default — a panic there aborts the Node process on Rust >= 1.81, it is not a catchable JS error. Only #[napi] async fns (-> rejected Promise) and #[napi(catch_unwind)] are safe; ThreadsafeFunction callbacks, spawned tasks, Drop, and module init are not. PyO3 0.23 does catch #[pyfunction]/#[pymethods]/module-init bodies (-> PanicException) but not worker-thread with_gil bridges or spawned tasks. No code change: independent verification confirmed the only production panic site in each SDK is the lazy Tokio-runtime .expect(), reached from caught contexts; the genuinely-uncaught paths (spawned tasks, the Python with_gil callback bridges, no Drop impls exist) are panic-free by construction via ? / unwrap_or_else / fail-closed defaults. Converting get_runtime() to a Result would thread through ~75 call sites to defend an OS-exhaustion panic that is already caught — not worth it.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Follow-up #32 (FFI panic-unwind safety audit, napi 2.16 / pyo3 0.23).
Audit outcome: the boundary is panic-safe — verified
A 4-lens audit (napi semantics, PyO3 semantics, per-site classification of both ~6–7K-line
lib.rsfiles) plus independent re-verification found zero production panic sites in an uncaught context:sdk/node/src/lib.rs: 90 panic-prone tokens, but 89 are in#[cfg(test)](line 5352+). The single production site is the lazy Tokio-runtime.expect()infallback_runtime()(line 86), reached from within#[napi]bodies.tokio::spawnbodies justawaitan innerResult(?-propagated);ThreadsafeFunctioncallbacks fail closed withunwrap_or_else; noimpl Drop.sdk/python/src/lib.rs: 42 tokens, 40 in#[cfg(test)](6567+). The single production site is the runtime.expect()inget_runtime()(line 109), caught by PyO3's pyfunction trampoline. The worker-threadwith_gilbridges (PythonCallbackHandler,PyBudgetGuard,PySlashCommand— the only truly-uncaught context) use.ok()/unwrap_or_else/fail-closed; notokio::spawn/thread::spawn/Drop.Why docs, not a refactor
The key finding (verified against the cargo cache, not memory): napi 2.x does not wrap sync
#[napi]bodies incatch_unwindby default — a panic there aborts the Node process on Rust ≥ 1.81, it is not a catchable JS error. Only#[napi] async fn(→ rejected Promise) and#[napi(catch_unwind)]are safe. So the boundary's safety rests on a convention (never panic in uncaught contexts), not structure — and a future stray.unwrap()in a TSFN callback or a newDropimpl would silently reintroduce a process-abort.This PR codifies that contract in each SDK's module doc, where FFI edits will see it. No code change: the only two production panic sites are in framework-caught contexts and fire only on OS thread exhaustion; converting
get_runtime()to aResultwould thread through ~75 call sites to defend an already-caught failure — out of proportion (Rule 0).Doc-only; both SDK crates
cargo fmt --checkclean.