[Hackathon] Redeem_Grimm: failure-detector layer with a phi-accrual liveness oracle#46
Open
Redeem-Grimm-Satoshi wants to merge 3 commits into
…iveness oracle Add failure detection as a new building block alongside the existing twelve layers. The framework currently models failure only by injection (message drop, Byzantine fraction, partitions) and checks that protocols stay correct in spite of it; it has no primitive by which an agent decides that a peer has stopped responding. This adds that primitive in the project's own shape: a FailureDetector Protocol, two reference plugins (an adaptive phi-accrual detector and a fixed-timeout baseline), a scenario that exercises them, and three property validators (completeness, accuracy, recovery). The accuracy validator is the discriminator: phi-accrual passes it on seeds 42, 7, and 1337, while the fixed timeout set just above the mean heartbeat interval fails it by false-suspecting a provably live peer on the upper tail of normal jitter. Both detectors still pass completeness, so accuracy is the property that separates them. Runs are byte-for-byte deterministic per seed: the only randomness is each emitter's seeded RNG drawing its jittered heartbeat interval, and all emitted floats are rounded to six places. The single change to an existing test extends the validator-registry completeness assertion to include the new scenario type.
Bring the failure-detector branch up to date with the latest upstream main (hackathon PRs projnanda#28, projnanda#30, projnanda#31, projnanda#33). The only conflicts were additive edits to the shared registration surfaces that every new layer touches; resolved by keeping both the upstream additions and the failure_detector entries:
- plugins.py _BUILTINS map and the entry-point discovery list
- scenarios.py scenario-name dispatch
- validators.py validator functions and the VALIDATORS registry
- pyproject.toml nest.plugins.* entry points
- tests/test_validators registry-completeness assertion
No existing behavior changed. make ci-local passes: 681 passed, 1 skipped, 1 deselected.
…tor branch Additive conflicts in the shared registration surfaces (scenario dispatch, validator functions and the VALIDATORS registry, and the registry-completeness test) resolved by keeping both the escrow_marketplace additions and the failure_detection entries. make ci-local passes: 751 passed, 1 skipped, 1 deselected.
Problem
NANDA Town models failure by injection. `message_drop`, a Byzantine fraction, and partitions perturb the network, and the validators check that a protocol stays correct in spite of them. The stack has no vocabulary for the dual question: when should an agent conclude that a peer has stopped responding, and how does it avoid crying wolf on a peer that is merely slow? That decision, the failure detector, is one of the oldest primitives in distributed systems. Chandra and Toueg formalize it in terms of two properties, completeness (every real crash is eventually suspected) and accuracy (a live process is not wrongly suspected), and it is the one classical building block that the twelve-layer stack does not have.
The absence is not academic. Several existing scenarios already presuppose that a participant can disappear. `supply_chain` stresses multi-hop reliability under drop and partition, `consensus` needs a quorum to make progress when members are missing, and `contract_net` coordination has to decide what happens when an assignee goes away mid-task. Today the framework can inject that disappearance but cannot let an agent observe it. This PR adds failure detection as a new building block, in the same shape as the existing twelve: a `Protocol` interface, two reference plugins, a scenario that exercises them, and property validators that separate a correct implementation from a plausible but wrong one.
Persona
I work on liveness and membership, so the reference I reached for is the one production systems actually run. Cassandra, Akka, and Hazelcast all ship a phi-accrual detector rather than a fixed timeout, for a concrete reason. A single timeout forces you to trade detection latency against false-positive rate at deploy time, and you lose that bet the moment the network gets jittery: set it tight and you suspect healthy-but-slow peers, set it loose and you are blind to real crashes for too long. An accrual detector defers the decision to the consumer by emitting a continuous suspicion level instead of a boolean, and it adapts that level to the observed inter-arrival distribution. I implemented the real algorithm and, deliberately, a naive fixed-timeout foil, so the layer ships with a built-in demonstration of why the adaptive version earns its keep.
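To put numbers on the fixed-timeout trade-off, here is a minimal sketch. It assumes heartbeat intervals uniform on [10, 20], the distribution cited in the reviewer notes; the helper name `false_suspect_probability` is hypothetical and is not part of the diff.

```python
def false_suspect_probability(timeout: float, low: float = 10.0, high: float = 20.0) -> float:
    """P(a healthy peer's next interval exceeds `timeout`) for uniform(low, high).

    Assumption: intervals are uniform on [low, high]; this is an illustration,
    not the detector shipped in the PR.
    """
    if timeout <= low:
        return 1.0
    if timeout >= high:
        return 0.0
    return (high - timeout) / (high - low)


# A tighter timeout reacts sooner but false-suspects more often;
# a looser one is quiet but needs more silence before it reacts.
for timeout in (12.0, 16.0, 20.0, 30.0):
    p = false_suspect_probability(timeout)
    print(f"timeout={timeout}: per-interval false-suspect probability={p:.2f}, "
          f"silence needed to suspect a crash={timeout}")
```

No single value of `timeout` gets both a low false-suspect probability and a short silence window, which is the bet the paragraph above describes.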
What is in the diff

| File | Contents |
| --- | --- |
| `packages/nest-core/nest_core/layers/failure_detector.py` | `FailureDetector` Protocol: `heartbeat`, `suspect`, `phi`, `report`, `known_peers`. Every method takes `now` as a keyword argument instead of reading a clock, so detection is a pure function of observed history and replays deterministically. |
| `.../failure_detection/phi_accrual.py` | `phi = -log10(P(next heartbeat still pending))` from the Gaussian upper tail. It uses population variance, a configurable standard-deviation floor, and a tail-probability floor to keep `phi` finite, and rounds every emitted float to six places. |
| `.../failure_detection/heartbeat.py` | Fixed-timeout baseline governed by `timeout`. Cheap, needs no warm-up, and structurally unable to tell a slow peer from a dead one. |
| `packages/nest-core/nest_core/scenarios_builtin/failure_detection.py` | Scenario with `fd:phase` markers; one emitter goes silent for a bounded window and then heals, another is never silent as an always-live control. |
| `scenarios/failure_detection.yaml` | |
| `packages/nest-core/nest_core/validators.py` | Validators for `failure_detection`: completeness, accuracy, and recovery, described below. |
| `packages/nest-plugins-reference/tests/test_failure_detection.py` | |

The remaining changes are registration only: exporting the layer and the `Suspicion` type, two plugin-registry entries plus the discovery list, scenario registration, and the entry points in `pyproject.toml`. The one edit to an existing test is a single line in `tests/test_validators.py` that adds `failure_detection` to the expected validator-registry set, which is the registry-completeness assertion, not a change to existing behavior.
Why this is a useful building block
It fills a named gap. Failure detection is the canonical membership primitive that the stack does not yet have and that the existing scenarios silently need. Reliability under drop and partition, quorum progress, and task reassignment all assume a participant can vanish; the framework could inject that, but until now nothing could react to it.
It is idiomatic. A structural `Protocol`, reference plugins discovered through entry points, an additive built-in scenario, property validators, and a deterministic trace are exactly how the other twelve layers are built. Nothing here changes existing behavior; the layer slots in beside the others.
It is a test, not a demo. The accuracy invariant discriminates. The phi-accrual detector passes it on every tested seed (42, 7, and 1337). The fixed timeout, set just above the mean heartbeat interval, fails it by false-suspecting a provably live peer on the upper tail of normal jitter. On the seed-42 trace the baseline raises nine false suspicions of the always-live peer after warm-up while the accrual detector raises none. Both detectors still pass completeness, because the genuine outage is large enough that even a naive detector catches it. Accuracy is the property that separates a careful detector from a careless one, which is the kind of check this project is built to reward.
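The accuracy gap can be illustrated with a small, self-contained simulation of one always-live peer. This is a sketch under stated assumptions, not the plugins or scenario from the diff: the class names `FixedTimeout` and `PhiAccrualSketch`, the helper `false_suspicions`, the `warmup` and `min_std` values, and the choice to query each detector just before every heartbeat arrives are all hypothetical. It does not reproduce the seed-42 trace or its count of nine, because the real scenario's emitters and sampling schedule differ.

```python
import math
import random
import statistics


class FixedTimeout:
    """Hypothetical baseline: suspect once silence exceeds a fixed timeout."""

    def __init__(self, timeout=16.0):
        self.timeout = timeout
        self.last = None  # time of the most recent heartbeat

    def heartbeat(self, *, now):
        self.last = now

    def suspect(self, *, now):
        return self.last is not None and (now - self.last) > self.timeout


class PhiAccrualSketch:
    """Hypothetical accrual detector over all observed inter-arrival gaps."""

    def __init__(self, threshold=8.0, min_std=1.0, tail_floor=1e-18, warmup=20):
        self.threshold = threshold
        self.min_std = min_std        # assumed floor value, for illustration
        self.tail_floor = tail_floor
        self.warmup = warmup          # assumed number of gaps before judging
        self.gaps = []
        self.last = None

    def heartbeat(self, *, now):
        if self.last is not None:
            self.gaps.append(now - self.last)
        self.last = now

    def phi(self, *, now):
        if self.last is None or len(self.gaps) < self.warmup:
            return 0.0  # not enough history to judge yet
        mean = statistics.fmean(self.gaps)
        std = max(statistics.pstdev(self.gaps), self.min_std)  # population std, floored
        z = (now - self.last - mean) / std
        tail = 0.5 * math.erfc(z / math.sqrt(2.0))  # Gaussian upper tail
        return round(-math.log10(max(tail, self.tail_floor)), 6)

    def suspect(self, *, now):
        return self.phi(now=now) >= self.threshold


def false_suspicions(seed, beats=200, warmup=20):
    """Count suspicions of a peer that never crashes: (baseline, accrual)."""
    rng = random.Random(seed)  # the only source of randomness
    baseline, accrual = FixedTimeout(), PhiAccrualSketch(warmup=warmup)
    now, counts = 0.0, [0, 0]
    for beat in range(beats):
        if beat > warmup:
            # Query at peak silence, just before this heartbeat is recorded.
            counts[0] += baseline.suspect(now=now)
            counts[1] += accrual.suspect(now=now)
        baseline.heartbeat(now=now)
        accrual.heartbeat(now=now)
        now += rng.uniform(10.0, 20.0)  # jittered interval to the next heartbeat
    return tuple(counts)


for seed in (42, 7, 1337):
    print(seed, false_suspicions(seed))
```

Because every method takes `now` explicitly and the only randomness is the seeded RNG, calling `false_suspicions` twice with the same seed returns the same pair, which mirrors the determinism argument in this PR.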
Honest scope. The detector is a self-contained primitive: it observes and reports, and no other layer consumes its verdict yet. I kept this PR to the primitive on purpose. The highest-value follow-up is to let `contract_net` reassign a task when the assignee is suspected and assert that the task still completes when an agent crashes mid-job, which turns an accurate oracle into a measurable robustness gain; I am happy to send that as a second, additive scenario if you want it. The current scenario also holds message drop at zero for byte-level determinism and does not exercise partitions, where a live peer can look dead; a lossy, heartbeat-redundant variant is the natural next step.
Verification
`make ci-local` runs the same five gates as CI and passes:
The layer's own suite:
Running the scenario and validating the trace:
The phi-accrual detector passes completeness, accuracy, and recovery across seeds 42, 7, and 1337. The fixed-timeout baseline at `timeout=16` passes completeness and recovery but fails accuracy, asserted directly in `test_scenario_baseline_fails_accuracy_but_accrual_passes`. Two runs at the same seed produce byte-identical traces, asserted in `test_scenario_is_byte_for_byte_deterministic`.
Reviewer notes worth surfacing
- Heartbeat intervals are drawn from `uniform(10, 20)`, so mean 15 and standard deviation near 2.9. The probability that a single interval exceeds 16 is 0.4, which is why a timeout near the mean false-suspects on the tail. The accrual detector learns the distribution and scores a 20-unit gap at about `phi` 1.4, well under the threshold of 8, so it stays quiet through jitter while still driving `phi` past the threshold within about one expected interval once a real crash begins.
- With `message_drop` at zero and no Byzantine agents the failure-injection RNG is never consumed, so traces are byte-stable per seed. All emitted floats are rounded to six places.
- A standard-deviation floor (`min_std`) keeps a near-constant heartbeat stream from producing a divide-by-zero or an absurdly sharp distribution. The upper-tail probability is floored at `1e-18`, so `phi` stays finite and bounded near 18.
- A `type[FailureDetector]` annotation catches any signature drift at import.
- The scenario emits `fd:phase` markers at start and on every reachability transition, so the validators check verdicts against what actually happened rather than against a second estimate.
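The figures in these notes can be checked by hand. The sketch below plugs the exact mean and standard deviation of `uniform(10, 20)` into the Gaussian upper-tail formula, rather than learning them from a trace as the real detector does. The function name `phi_gaussian` and the `min_std` default of 1.0 are assumptions for illustration, not values taken from the diff.

```python
import math

MEAN = 15.0                    # mean of uniform(10, 20)
STD = 10.0 / math.sqrt(12.0)   # standard deviation of uniform(10, 20), about 2.887


def phi_gaussian(gap, mean=MEAN, std=STD, min_std=1.0, tail_floor=1e-18):
    """phi = -log10(P(next heartbeat still pending)) under a Gaussian model.

    Assumption: mean and std are supplied directly; min_std=1.0 is illustrative.
    """
    std = max(std, min_std)                       # floor avoids division by ~0
    z = (gap - mean) / std
    tail = 0.5 * math.erfc(z / math.sqrt(2.0))    # Gaussian upper-tail probability
    tail = max(tail, tail_floor)                  # floor keeps phi finite
    return round(-math.log10(tail), 6)            # six places, as in the PR


for gap in (15.0, 20.0, 31.0, 32.0, 1000.0):
    print(f"silence={gap}: phi={phi_gaussian(gap)}")
```

Under these assumptions a 20-unit gap scores roughly 1.4, the threshold of 8 is crossed between 31 and 32 units of silence (about one mean interval past the expected arrival), and very long silences saturate at 18 because of the tail floor.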