[Hackathon] Redeem_Grimm: failure-detector layer with a phi-accrual liveness oracle #46

Open. Redeem-Grimm-Satoshi wants to merge 3 commits into projnanda:main from Redeem-Grimm-Satoshi:hackathon/redeem-grimm-failure-detector

@Redeem-Grimm-Satoshi

Problem

NANDA Town models failure by injection. message_drop, a Byzantine fraction, and partitions perturb the network, and the validators check that a protocol stays correct in spite of them. The stack has no vocabulary for the dual question: when should an agent conclude that a peer has stopped responding, and how does it avoid crying wolf on a peer that is merely slow? That decision, the failure detector, is one of the oldest primitives in distributed systems. Chandra and Toueg formalize it in terms of two properties, completeness (every real crash is eventually suspected) and accuracy (a live process is not wrongly suspected), and it is the one classical building block that the twelve-layer stack does not have.

The absence is not academic. Several existing scenarios already presuppose that a participant can disappear. supply_chain stresses multi-hop reliability under drop and partition, consensus needs a quorum to make progress when members are missing, and contract_net coordination has to decide what happens when an assignee goes away mid-task. Today the framework can inject that disappearance but cannot let an agent observe it. This PR adds failure detection as a new building block, in the same shape as the existing twelve: a Protocol interface, two reference plugins, a scenario that exercises them, and property validators that separate a correct implementation from a plausible but wrong one.

Persona

I work on liveness and membership, so the reference I reached for is the one production systems actually run. Cassandra, Akka, and Hazelcast all ship a phi-accrual detector rather than a fixed timeout, for a concrete reason. A single timeout forces you to trade detection latency against false-positive rate at deploy time, and you lose that bet the moment the network gets jittery: set it tight and you suspect healthy-but-slow peers, set it loose and you are blind to real crashes for too long. An accrual detector defers the decision to the consumer by emitting a continuous suspicion level instead of a boolean, and it adapts that level to the observed inter-arrival distribution. I implemented the real algorithm and, deliberately, a naive fixed-timeout foil, so the layer ships with a built-in demonstration of why the adaptive version earns its keep.

What is in the diff

| Path | Purpose |
| --- | --- |
| `packages/nest-core/nest_core/layers/failure_detector.py` | The layer contract. A runtime-checkable FailureDetector Protocol: heartbeat, suspect, phi, report, known_peers. Every method takes now as a keyword argument instead of reading a clock, so detection is a pure function of observed history and replays deterministically. |
| `.../failure_detection/phi_accrual.py` | The adaptive reference detector. Each peer gets a sliding window of heartbeat inter-arrival samples modeled as a normal distribution; suspicion is phi = -log10(P(next heartbeat still pending)) from the Gaussian upper tail. It uses population variance, a configurable standard-deviation floor, and a tail-probability floor to keep phi finite, and rounds every emitted float to six places. |
| `.../failure_detection/heartbeat.py` | The fixed-timeout baseline, included as the foil. Suspects once silence exceeds timeout. Cheap, needs no warm-up, and structurally unable to tell a slow peer from a dead one. |
| `packages/nest-core/nest_core/scenarios_builtin/failure_detection.py` | The agents and factory. An observer drives one injected detector and publishes a per-peer verdict on each evaluation tick; emitters heartbeat on a jittered interval and broadcast ground-truth fd:phase markers; one emitter goes silent for a bounded window and then heals, another is never silent as an always-live control. |
| `scenarios/failure_detection.yaml` | The runnable scenario: three agents, the twelve standard layers at their defaults, zero message drop, phi-accrual selected. |
| `packages/nest-core/nest_core/validators.py` | Three invariants registered under failure_detection: completeness, accuracy, and recovery, described below. |
| `packages/nest-plugins-reference/tests/test_failure_detection.py` | Detector unit tests, a three-seed end-to-end pass, the adversarial discriminator, and a byte-determinism check. |
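
To make the contract concrete, here is a minimal sketch of what a runtime-checkable Protocol with those five methods could look like, together with a duck-typed fixed-timeout plugin. The parameter names, return types, and the FixedTimeout class are assumptions for illustration; only the method names and the keyword-only now argument come from the description above.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class FailureDetector(Protocol):
    """Hypothetical shape of the layer contract; signatures are assumed."""

    def heartbeat(self, peer: str, *, now: float) -> None: ...
    def suspect(self, peer: str, *, now: float) -> bool: ...
    def phi(self, peer: str, *, now: float) -> float: ...
    def report(self, *, now: float) -> dict[str, bool]: ...
    def known_peers(self, *, now: float) -> list[str]: ...


class FixedTimeout:
    """Illustrative baseline: duck-typed, does not subclass the Protocol."""

    def __init__(self, timeout: float) -> None:
        self._timeout = timeout
        self._last_seen: dict[str, float] = {}

    def heartbeat(self, peer: str, *, now: float) -> None:
        self._last_seen[peer] = now

    def suspect(self, peer: str, *, now: float) -> bool:
        # No clock read: the verdict depends only on recorded history and now.
        return now - self._last_seen[peer] > self._timeout

    def phi(self, peer: str, *, now: float) -> float:
        # A boolean detector has no graded suspicion; map it to 0.0 or 1.0.
        return 1.0 if self.suspect(peer, now=now) else 0.0

    def report(self, *, now: float) -> dict[str, bool]:
        return {p: self.suspect(p, now=now) for p in self.known_peers(now=now)}

    def known_peers(self, *, now: float) -> list[str]:
        return sorted(self._last_seen)


# Static conformance hint: a type checker compares signatures on this line.
_conforms: type[FailureDetector] = FixedTimeout

detector = FixedTimeout(timeout=16.0)
detector.heartbeat("emitter_a", now=0.0)
```

Because now is passed in, replaying the same heartbeat history with the same timestamps yields the same verdicts. Note that isinstance against a runtime-checkable Protocol only confirms the methods exist, not that their signatures match.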

The remaining changes are registration only: exporting the layer and the Suspicion type, two plugin-registry entries plus the discovery list, scenario registration, and the entry points in pyproject.toml. The one edit to an existing test is a single line in tests/test_validators.py that adds failure_detection to the expected validator-registry set, which is the registry-completeness assertion, not a change to existing behavior.

Why this is a useful building block

It fills a named gap. Failure detection is the canonical membership primitive that the stack does not yet have and that the existing scenarios silently need. Reliability under drop and partition, quorum progress, and task reassignment all assume a participant can vanish; the framework could inject that, but until now nothing could react to it.

It is idiomatic. A structural Protocol, reference plugins discovered through entry points, an additive built-in scenario, property validators, and a deterministic trace are exactly how the other twelve layers are built. Nothing here changes existing behavior; the layer slots in beside the others.

It is a test, not a demo. The accuracy invariant discriminates. The phi-accrual detector passes it on every seed. The fixed timeout, set just above the mean heartbeat interval, fails it by false-suspecting a provably live peer on the upper tail of normal jitter. On the seed-42 trace the baseline raises nine false suspicions of the always-live peer after warm-up while the accrual detector raises none. Both detectors still pass completeness, because the genuine outage is large enough that even a naive detector catches it. Accuracy is the property that separates a careful detector from a careless one, which is the kind of check this project is built to reward.
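
The discrimination can be reproduced in miniature. The sketch below is not the PR's scenario: it draws inter-arrival gaps for a single always-live peer from uniform(10, 20) and asks each detector for a verdict at the worst moment, just before the next heartbeat lands. The warm-up length, window size, and standard-deviation floor are assumed values; the timeout of 16 and the phi threshold of 8 are the figures quoted in this PR.

```python
import math
import random
from collections import deque

TIMEOUT = 16.0        # fixed-timeout baseline, just above the mean of 15
PHI_THRESHOLD = 8.0   # suspicion threshold quoted in this PR
WARMUP = 30           # assumed: verdicts only count after this many samples
WINDOW = 100          # assumed sliding-window size
MIN_STD = 0.1         # assumed standard-deviation floor
TAIL_FLOOR = 1e-18    # keeps phi finite


def phi(gap: float, samples) -> float:
    """Suspicion level for a silence of `gap`, from a normal fit to the window."""
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)  # population variance
    std = max(math.sqrt(var), MIN_STD)
    tail = 0.5 * math.erfc((gap - mean) / (std * math.sqrt(2.0)))  # Gaussian upper tail
    return -math.log10(max(tail, TAIL_FLOOR))


def count_false_suspicions(seed: int, beats: int = 500) -> tuple[int, int]:
    """Return (fixed-timeout, phi-accrual) false suspicions of a live peer."""
    rng = random.Random(seed)
    window: deque[float] = deque(maxlen=WINDOW)
    timeout_false = 0
    phi_false = 0
    for _ in range(beats):
        gap = rng.uniform(10, 20)  # the peer is alive; this is ordinary jitter
        if len(window) >= WARMUP:
            timeout_false += gap > TIMEOUT
            phi_false += phi(gap, window) >= PHI_THRESHOLD
        window.append(gap)
    return timeout_false, phi_false


timeout_false, phi_false = count_false_suspicions(seed=42)
```

In this simplified model the fixed timeout trips on every gap in the upper tail, while the accrual detector's fitted distribution keeps a 20-unit gap far below the threshold. The exact counts depend on the seed and on the assumed parameters, so they will not match the scenario's figures.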

Honest scope. The detector is a self-contained primitive: it observes and reports, and no other layer consumes its verdict yet. I kept this PR to the primitive on purpose. The highest-value follow-up is to let contract_net reassign a task when the assignee is suspected and assert that the task still completes when an agent crashes mid-job, which turns an accurate oracle into a measurable robustness gain; I am happy to send that as a second, additive scenario if you want it. The current scenario also holds message drop at zero for byte-level determinism and does not exercise partitions, where a live peer can look dead; a lossy, heartbeat-redundant variant is the natural next step.

Verification

make ci-local runs the same five gates as CI and passes:

ruff check:          All checks passed!
ruff format --check: 169 files already formatted
pyright:             0 errors, 0 warnings, 0 informations
pytest:              751 passed, 1 skipped, 1 deselected

The layer's own suite:

uv run pytest packages/nest-plugins-reference/tests/test_failure_detection.py -v
15 passed

Running the scenario and validating the trace:

uv run nest run scenarios/failure_detection.yaml

PASS failure_detection_completeness  all 1 outage segment(s) detected and still suspected at end
PASS failure_detection_accuracy      no false suspicion of a provably-live peer across 2 peer(s)
PASS failure_detection_recovery      all 1 recovered peer(s) cleared by end
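
For readers who want the three invariants spelled out, the following is a rough sketch of how such checks could be written against a trace of verdicts and recorded ground truth. The Verdict record, the OUTAGES table, and the grace period are hypothetical; the real validators in validators.py may define the properties differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    t: float          # evaluation tick
    peer: str
    suspected: bool


# Hypothetical ground truth, as if rebuilt from fd:phase markers:
# peer -> list of (silent_from, healed_at) windows.
OUTAGES = {"emitter_a": [(100.0, 200.0)], "emitter_b": []}
GRACE = 40.0  # assumed slack after a heal before a peer counts as provably live


def provably_live(peer: str, t: float) -> bool:
    return not any(start <= t < end + GRACE for start, end in OUTAGES[peer])


def accuracy(verdicts: list[Verdict]) -> bool:
    """No peer is suspected while it is provably live."""
    return not any(v.suspected and provably_live(v.peer, v.t) for v in verdicts)


def completeness(verdicts: list[Verdict]) -> bool:
    """Every outage window contains at least one suspicion of that peer."""
    return all(
        any(v.suspected and v.peer == peer and start <= v.t < end for v in verdicts)
        for peer, windows in OUTAGES.items()
        for start, end in windows
    )


def recovery(verdicts: list[Verdict]) -> bool:
    """Every peer that had an outage is cleared by its final verdict."""
    for peer, windows in OUTAGES.items():
        if not windows:
            continue
        mine = [v for v in verdicts if v.peer == peer]
        last = max(mine, key=lambda v: v.t, default=None)
        if last is None or last.suspected:
            return False
    return True


careful = [
    Verdict(50.0, "emitter_a", False), Verdict(50.0, "emitter_b", False),
    Verdict(150.0, "emitter_a", True), Verdict(150.0, "emitter_b", False),
    Verdict(210.0, "emitter_a", True),   # still inside the grace period
    Verdict(260.0, "emitter_a", False), Verdict(260.0, "emitter_b", False),
]
# A careless detector also suspects the always-live control once.
careless = careful + [Verdict(300.0, "emitter_b", True)]
```

On these hand-written traces the careful detector satisfies all three checks, while the careless one still satisfies completeness and recovery and fails only accuracy, which mirrors the split reported for the two plugins.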

The phi-accrual detector passes completeness, accuracy, and recovery across seeds 42, 7, and 1337. The fixed-timeout baseline at timeout=16 passes completeness and recovery but fails accuracy, asserted directly in test_scenario_baseline_fails_accuracy_but_accrual_passes. Two runs at the same seed produce byte-identical traces, asserted in test_scenario_is_byte_for_byte_deterministic.
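
The byte-determinism claim rests on a simple discipline: one seeded RNG, no wall-clock reads, rounded floats, and a canonical serialization. A minimal sketch of that idea, with a made-up event schema and helper names that are not taken from the PR:

```python
import hashlib
import json
import random


def run_trace(seed: int, beats: int = 50) -> bytes:
    """Produce a canonical byte trace of heartbeat times for one emitter."""
    rng = random.Random(seed)  # the only source of randomness
    now = 0.0                  # simulated time; no wall-clock reads
    events = []
    for seq in range(beats):
        now += rng.uniform(10, 20)  # jittered heartbeat interval
        events.append({"seq": seq, "t": round(now, 6)})  # six-place rounding
    # Sorted keys and fixed separators keep the encoding stable across runs.
    return json.dumps(events, sort_keys=True, separators=(",", ":")).encode()


def digest(seed: int) -> str:
    return hashlib.sha256(run_trace(seed)).hexdigest()


same = run_trace(42) == run_trace(42)
different = run_trace(42) != run_trace(7)
```

Comparing digests rather than full traces is a convenient way to assert this in a test, since a single differing byte changes the hash.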

Reviewer notes worth surfacing

  • Why phi-accrual rather than a tuned timeout. The jitter is uniform(10, 20), so mean 15 and standard deviation near 2.9. The probability that a single interval exceeds 16 is 0.4, which is why a timeout near the mean false-suspects on the tail. The accrual detector learns the distribution and scores a 20-unit gap at about phi 1.4, well under the threshold of 8, so it stays quiet through jitter. Under that fitted normal, phi reaches 8 at roughly 5.6 standard deviations above the mean, a silence of about 31 units, so a real crash is flagged about one expected interval after the first missed heartbeat.
  • Determinism. The only randomness in the run is each emitter's seeded RNG drawing its next interval; there are no wall-clock reads. With message_drop at zero and no Byzantine agents the failure-injection RNG is never consumed, so traces are byte-stable per seed. All emitted floats are rounded to six places.
  • Numerical guards. Population variance with a standard-deviation floor (min_std) keeps a near-constant heartbeat stream from producing a divide-by-zero or an absurdly sharp distribution. The upper-tail probability is floored at 1e-18, so phi stays finite and is capped at 18, since -log10(1e-18) = 18.
  • Structural conformance. Both plugins are duck-typed against the Protocol rather than subclassing it, matching every other reference plugin. A module-level assignment to a type[FailureDetector] annotation lets the type checker flag any signature drift as soon as the module is checked.
  • Ground truth is recorded, not inferred. The emitters broadcast fd:phase markers at start and on every reachability transition, so the validators check verdicts against what actually happened rather than against a second estimate.
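
The arithmetic in the first and third notes can be checked directly. This sketch plugs the true parameters of uniform(10, 20) into a normal model instead of learning them from a window, so it illustrates the formula rather than reproducing the plugin; the min_std default used here is an assumed value.

```python
import math

MEAN = 15.0
STD = 10.0 / math.sqrt(12.0)  # standard deviation of uniform(10, 20), about 2.887
TAIL_FLOOR = 1e-18            # tail-probability floor from the notes above


def phi(gap: float, mean: float = MEAN, std: float = STD, min_std: float = 0.5) -> float:
    """phi = -log10(P(next heartbeat still pending)) under a normal model."""
    std = max(std, min_std)  # guard against a near-constant stream
    tail = 0.5 * math.erfc((gap - mean) / (std * math.sqrt(2.0)))  # upper tail
    return round(-math.log10(max(tail, TAIL_FLOOR)), 6)


def p_interval_exceeds(x: float, lo: float = 10.0, hi: float = 20.0) -> float:
    """P(interval > x) for uniform(lo, hi), valid for lo <= x <= hi."""
    return (hi - x) / (hi - lo)


false_alarm_rate = p_interval_exceeds(16.0)  # chance one interval beats timeout=16
phi_at_worst_jitter = phi(20.0)              # longest gap a live peer can produce
phi_after_long_silence = phi(1000.0)         # tail underflows, so the floor applies
```

The longest jitter-only gap scores far below 8, a silence a little over 31 units crosses the threshold, and a very long silence pins phi at the cap set by the tail floor.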

…iveness oracle

Add failure detection as a new building block alongside the existing twelve
layers. The framework currently models failure only by injection (message
drop, Byzantine fraction, partitions) and checks that protocols stay correct
in spite of it; it has no primitive by which an agent decides that a peer has
stopped responding. This adds that primitive in the project's own shape: a
FailureDetector Protocol, two reference plugins (an adaptive phi-accrual
detector and a fixed-timeout baseline), a scenario that exercises them, and
three property validators (completeness, accuracy, recovery).

The accuracy validator is the discriminator: phi-accrual passes it on seeds
42, 7, and 1337, while the fixed timeout set just above the mean heartbeat
interval fails it by false-suspecting a provably live peer on the upper tail
of normal jitter. Both detectors still pass completeness, so accuracy is the
property that separates them.

Runs are byte-for-byte deterministic per seed: the only randomness is each
emitter's seeded RNG drawing its jittered heartbeat interval, and all emitted
floats are rounded to six places. The single change to an existing test extends
the validator-registry completeness assertion to include the new scenario type.

Bring the failure-detector branch up to date with the latest upstream main
(hackathon PRs projnanda#28, projnanda#30, projnanda#31, projnanda#33). The only conflicts were additive edits to
the shared registration surfaces that every new layer touches; resolved by
keeping both the upstream additions and the failure_detector entries:

  - plugins.py            _BUILTINS map and the entry-point discovery list
  - scenarios.py          scenario-name dispatch
  - validators.py         validator functions and the VALIDATORS registry
  - pyproject.toml        nest.plugins.* entry points
  - tests/test_validators registry-completeness assertion

No existing behavior changed. make ci-local passes: 681 passed, 1 skipped,
1 deselected.
…tor branch

Additive conflicts in the shared registration surfaces (scenario dispatch,
validator functions and the VALIDATORS registry, and the registry-completeness
test) resolved by keeping both the escrow_marketplace additions and the
failure_detection entries. make ci-local passes: 751 passed, 1 skipped,
1 deselected.