Feat mev with duties prefetching (T-ssv-057) #2700

Draft
julienh-ssv wants to merge 45 commits into stage from feat/mev

@julienh-ssv

@julienh-ssv julienh-ssv commented Feb 24, 2026

Recently we observed issues where enabling MEV boost on a beacon node would cause our validators to miss/fail proposer duties. This PR attempts to address this issue.

Summary

SSV Nodes are run by operators on a distributed-validator network. For each validator, the validator key is split into key shares across multiple operators. For each duty, the operator set first reaches consensus on what should be signed, and then a threshold of operators produces partial signatures. Because Ethereum proposal slots are only 12 seconds long, extra latency in the proposal path can cause late or missed proposals.

When the beacon node is configured to use the Builder API, the builder path adds extra latency before the SSV Node receives the proposal data to sign. In our current setup, that latency can overlap with SSV pre-consensus and signing deadlines, which increases the risk of late proposals.

This PR adds an in-node Builder API component ("SSV-MEV") inside the SSV Node. SSV-MEV prefetches builder bids ahead of the critical path and caches the best bid seen so far. Later, when the beacon node asks for a builder header, SSV-MEV can respond immediately with the best prefetched bid instead of waiting on external requests at proposal time.

This PR does not change the beacon node's local-vs-builder selection logic. The beacon node still decides whether to use the local execution payload or the builder path. This PR only reduces latency on the builder side.

The diagrams below focus only on the latency-sensitive getHeader stage. Validator registration and post-sign publication still follow the standard Builder API / beacon-node flow.

Current flow

sequenceDiagram
    participant SSV as SSV Node
    participant Ops as SSV operator set
    participant BN as Beacon node
    participant Builder as Current Builder API endpoint
    participant EL as Execution client

    SSV->>Ops: Pre-consensus / proposal preparation
    SSV->>BN: GET /eth/v3/validator/blocks/{slot}
    BN->>Builder: GET /eth/v1/builder/header/{slot}/{parent_hash}/{pubkey}
    Builder-->>BN: Signed builder bid/header
    BN->>EL: Build local execution payload
    BN-->>SSV: Return blinded or unblinded block
    SSV->>Ops: Consensus + threshold signing

In the current flow, the builder-header fetch is on the critical proposer path.

New flow

sequenceDiagram
    participant SSV as SSV Node
    participant Ops as SSV operator set
    participant BN as Beacon node
    participant SSVMEV as SSV-MEV (in-node Builder API proxy/cache)
    participant Relays as External relays
    participant EL as Execution client

    SSV->>Ops: Pre-consensus / proposal preparation
    SSV->>SSVMEV: Start prefetch once slot/parent_hash/pubkey are known
    SSVMEV->>Relays: Query relays early
    Relays-->>SSVMEV: Signed builder bids
    SSV->>BN: GET /eth/v3/validator/blocks/{slot}
    BN->>SSVMEV: GET /eth/v1/builder/header/{slot}/{parent_hash}/{pubkey}
    SSVMEV-->>BN: Best cached bid/header (if available)
    BN->>EL: Build local execution payload
    BN-->>SSV: Return blinded or unblinded block
    SSV->>Ops: Consensus + threshold signing

This moves external builder communication earlier, so the beacon node can be served from local cache instead of waiting on external requests during the most time-sensitive part of the proposer path.

Why this helps

The key idea is:

  • external relay communication is slow and variable
  • SSV coordination already consumes part of the slot
  • by prefetching bids early and serving them locally, we remove one external round-trip from the beacon node’s critical path

This should reduce proposal latency and lower the chance of missing proposer duties due to late builder responses.

Relation to Vouch

This follows the same general pattern as Vouch: the beacon node talks to a local Builder API endpoint, and that endpoint talks to external relays. The difference is that we are moving that functionality into the SSV Node and adding bid prefetching.

Current rollout

At the moment, this PR runs both paths in parallel. The existing builder path remains authoritative, while SSV-MEV fetches the same bids ahead of time inside the SSV Node.

The in-node path is currently dry-run only. It is used for latency and bid comparison, but it does not yet replace the existing proposal path.

Stage results and statistics


Known limitation

This feature requires the beacon node to point its Builder API configuration at our SSV-MEV proxy/cache.

That does not compose well when a single beacon node is shared by multiple SSV clusters. Each cluster-local SSV-MEV instance only has visibility into its own validators, so no single in-node instance can serve builder bids for every validator known to the beacon node.

A separate external prefetch/cache service would be more general. Multiple clusters could register with the same service, and the beacon node could use that single service as its Builder API endpoint for all validators.


Linked: T-ssv-057

@julienh-ssv julienh-ssv self-assigned this Feb 24, 2026
@julienh-ssv julienh-ssv changed the base branch from main to stage February 24, 2026 07:35

@julienh-ssv julienh-ssv changed the title Feat mev Feat mev with duties prefetching Feb 25, 2026
Run shadow in-node get_header in parallel to legacy GetBeaconBlock, record comparisons via logs/metrics and API, and add --no-dry-run to serve the builder endpoint.
@julienh-ssv

Added dry-run mode to safely deploy to nodes without needing to change beacon node configuration

Add unit tests for dry-run service behavior and MEV dry-run HTTP API handler/route. Refactor dry-run service dependencies to narrow interfaces and reduce duplicated get_header result handling.
@julienh-ssv

Updated PR description with stats.

@julienh-ssv julienh-ssv marked this pull request as draft April 1, 2026 03:17
@julienh-ssv

julienh-ssv commented Apr 9, 2026

Stats and stage results

Conclusion: the in-node MEV path is working, materially improves proposer readiness, and is healthy enough to move forward. Node 63 is no longer a blocker; it still has later beacon timing than the others, but the dry-run path is still winning there.

  1. Exact Metrics And Timing Benchmarks

24h Mimir summary:

  • 61: getHeader hit rate 93.33%, cache-miss proxy 6.67%, slot offset avg 313ms, prefetch requests 22, cached 22, prefetch late 1
  • 62: getHeader hit rate 100.0%, cache-miss proxy 0.0%, slot offset avg 219ms, prefetch requests 22, cached 22, prefetch late 1
  • 63: getHeader hit rate 100.0%, cache-miss proxy 0.0%, slot offset avg 530ms, prefetch requests 22, cached 22, prefetch late 1
  • 64: getHeader hit rate 93.33%, cache-miss proxy 6.67%, slot offset avg 253ms, prefetch requests 23, cached 22, prefetch late 2

7d Mimir summary:

  • 61: hit rate 98.11%, cache-miss proxy 1.89%, slot offset avg 316ms, prefetch cached 155/155
  • 62: hit rate 100.0%, cache-miss proxy 0.0%, slot offset avg 206ms, prefetch cached 155/157
  • 63: hit rate 99.05%, cache-miss proxy 0.95%, slot offset avg 374ms, prefetch cached 155/157
  • 64: hit rate 98.11%, cache-miss proxy 1.89%, slot offset avg 229ms, prefetch cached 155/160

Live node counters right now:

  • 61: prefetch bid fetches 71, direct get_header bid fetches 2
  • 62: prefetch bid fetches 71, direct get_header bid fetches 1
  • 63: prefetch bid fetches 71, direct get_header bid fetches 1
  • 64: prefetch bid fetches 71, direct get_header bid fetches 2

What this means operationally:

  • The beacon node’s getHeader almost always lands on a warmed local cache rather than waiting on a live relay round-trip.
  • On cache hit, the builder endpoint’s own getHeader latency is effectively zero: about 0.002ms average across nodes.
  • On miss, the direct relay path costs about 370ms to 553ms server-side depending on node.

Observed proposer timing wins from comparison logs:

  • Recent healthy sample at 2026-04-09 06:14 UTC
  • 61: baseline 1.291s, shadow total 24.9ms, gain 1.266s
  • 62: baseline 1.211s, shadow total 31.1ms, gain 1.179s
  • 63: baseline 1.289s, shadow total 55.6ms, gain 1.234s
  • 64: baseline 1.327s, shadow total 6.55ms, gain 1.320s

Another recent sample at 2026-04-09 07:51 UTC

  • 61: 1.474s to 69.5ms, gain 1.405s
  • 62: 1.446s to 9.18ms, gain 1.437s
  • 63: 1.342s to 27.0ms, gain 1.315s
  • 64: 1.254s to 46.6ms, gain 1.208s

Important nuance on 63:

  • 63 still gets the beacon getHeader call later than the others. Its slot offset is 530ms over 24h and 374ms over 7d, versus roughly 206ms to 316ms on the other nodes.
  • Even so, the shadow path still wins cleanly on current data.
  • I found only one recoverable head_error event in the last 24h on 63; it was recovered by exact-parent lookup and has not become a recurring failure pattern.

Duty health:

  • All four nodes are actively and repeatedly logging ✅ successfully submitted attestations.
  • I did not find evidence of proposer-duty breakage tied to the MEV dry-run.
  • There is background noise from P2P validation ignored and some event lookup errors, but these are not MEV-specific and did not prevent duty execution.

  2. How Direct Relay Improves The Timing Game

Note: Timing Game is not the best term to use here, as it is often used for something different (increasing MEV value by delaying blocks), which is the opposite of what we're trying to achieve (we want to speed up blocks!).

The old timing game is:

  • proposer duty starts
  • SSV/beacon goes to external builder flow live
  • a relay round-trip happens on the critical path
  • proposal readiness completes roughly 0.5s to 1.5s into the slot in the observed samples

The new dry-run timing game is:

  • node prefetches bids before the beacon asks for them
  • when the beacon calls getHeader, the response is usually already cached
  • the remaining critical path is mostly just head-hash lookup plus cache read
  • proposal readiness completes in tens of milliseconds instead of hundreds or thousands

So the measured improvement is not incremental. It is typically:

  • around 1.2s to 1.45s faster on blinded proposal samples
  • around 0.48s to 0.80s faster on the smaller non-blinded samples we saw
  • with near-total elimination of live relay calls from the beacon-facing getHeader path

The strongest single metric is probably this:

  • in recent live counters, each node has 71 bids fetched via prefetch versus only 1-2 fetched via direct get_header

That is the feature doing exactly what it was intended to do: taking relay latency out of the beacon-critical path.

  3. APR Impact

The honest answer is: latency improvement only matters financially when it changes outcome, meaning when it converts a would-be missed or late builder opportunity into a captured bid.

A reasonable gross-value model is:

extra ETH per validator per year
≈ proposer_duties_per_year × conversion_gain × avg_builder_premium

Using:

  • about 2.39 proposer duties per validator per year
    assuming about 1.1M active validators and 7200 slots/day
  • observed stage bid values roughly in the 0.0036 to 0.0050 ETH range
  • a latency-conversion assumption of 5%, 10%, or 25%
    meaning “what share of proposer duties does the faster path turn from miss/lose into capture”

Gross validator-level uplift at 0.004 ETH average builder premium:

  • 5% conversion: 0.000478 ETH/year per validator, about 0.00149% APR absolute
  • 10% conversion: 0.000956 ETH/year per validator, about 0.00299% APR absolute
  • 25% conversion: 0.002389 ETH/year per validator, about 0.00747% APR absolute

Fleet-level gross uplift at 0.004 ETH average premium:

  • per 10k validators:
  • 5% conversion: about 4.78 ETH/year
  • 10% conversion: about 9.56 ETH/year
  • 25% conversion: about 23.89 ETH/year
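
The figures above can be reproduced with a few lines of Go. This is only a check of the arithmetic, under two assumptions not stated explicitly above: a 365-day year and a 32 ETH stake per validator as the APR denominator. All function and constant names are illustrative.

```go
package main

import "fmt"

const (
	slotsPerDay      = 7200.0
	daysPerYear      = 365.0 // assumption
	activeValidators = 1.1e6
	avgPremiumETH    = 0.004
	stakeETH         = 32.0 // assumption: APR is taken against a 32 ETH stake
)

// DutiesPerYear returns the expected proposer duties per validator per year.
func DutiesPerYear() float64 {
	return slotsPerDay * daysPerYear / activeValidators
}

// UpliftETH returns the gross extra ETH per validator per year
// for a given latency-conversion share (0.05 means 5%).
func UpliftETH(conversion float64) float64 {
	return DutiesPerYear() * conversion * avgPremiumETH
}

// UpliftAPRPercent expresses the uplift as absolute APR percentage points.
func UpliftAPRPercent(conversion float64) float64 {
	return UpliftETH(conversion) / stakeETH * 100
}

func main() {
	fmt.Printf("duties per validator per year: %.2f\n", DutiesPerYear())
	for _, c := range []float64{0.05, 0.10, 0.25} {
		fmt.Printf("%.0f%% conversion: %.6f ETH/year, %.5f%% APR, %.2f ETH/year per 10k validators\n",
			c*100, UpliftETH(c), UpliftAPRPercent(c), UpliftETH(c)*10000)
	}
}
```

Swapping `avgPremiumETH` for a mainnet premium is the quickest way to see how sensitive the fleet-level number is to that input.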

How to interpret that:

  • Per validator, annualized APR uplift is modest because proposer duties are rare.
  • Per fleet, the value becomes meaningful, and it scales linearly with validator count and with actual mainnet builder premiums.
  • These stage numbers likely understate upside if mainnet builder premiums are higher than the ~0.0036-0.0050 ETH values seen in the sampled stage proposals.

Recommendation

This is ready for mainnet rollout.

  • The direct in-node MEV path is clearly reducing proposal readiness latency.
  • Cache hit behavior is strong across all nodes.
  • 63 is no longer failing systematically; it remains later than peers, but the dry-run path is still winning.
  • I would not block production on the remaining stage observations.

Older stats:

Dry mode running on stage on nodes 61-64 for a few days.

The MEV dry-run is showing a significant latency win on nodes 61, 62, and 64. On 63, the dry-run path is mostly failing with head_error, so that node is the outlier.

24h Summary

  • 61: shadow path average 79ms, baseline average 1139ms, average gain about 1.06s. 15/15 shadow results were bid, all parent hashes matched.
  • 62: shadow path average 52ms, baseline average 1172ms, average gain about 1.12s. 15/15 shadow results were bid, all parent hashes matched.
  • 64: shadow path average 48ms, baseline average 1132ms, average gain about 1.08s. 15/15 shadow results were bid, all parent hashes matched.
  • 63: shadow path average 165ms, baseline average 2020ms, but only 1/15 direct shadow results was bid; 14/15 were head_error. 11 of those were later recoverable with exact-parent lookup, which is diagnostic but does not help the live fast path.

Prometheus / Builder Metrics

  • 61: getHeader hit rate 93.33% over 24h, 81.17% over 7d. Cache-miss proxy 6.67% over 24h.
  • 62: getHeader hit rate 93.33% over 24h, 79.06% over 7d. Cache-miss proxy 6.67% over 24h.
  • 64: getHeader hit rate 93.33% over 24h, 76.74% over 7d. Cache-miss proxy 6.67% over 24h.
  • 63: getHeader hit rate 85.71% over 24h, 68.29% over 7d, but its slot offset is much later: average 2534ms over 24h and 3014ms over 7d. Cache-miss proxy is also much worse at 14.29% over 24h and 31.71% over 7d.

Live Counter Snapshot

  • 61: 126 prefetch bid fetches, 3 direct get_header bid fetches.
  • 62: 126 prefetch bid fetches, 3 direct get_header bid fetches.
  • 64: 126 prefetch bid fetches, 5 direct get_header bid fetches.
  • 63: 117 prefetch bid fetches, 24 direct get_header no_bid, and no direct get_header winners.

A concrete healthy example from 61 on 2026-03-17T19:45:49Z: baseline 1.353s, shadow 38.3ms, shadow finished at 238ms from slot start, with a 0.0163 ETH bid. The matching bad example from 63 on 2026-03-17T19:45:50Z: baseline 2.183s, shadow failed with head_error after 150ms, then an exact-parent check later found the same 0.0163 ETH bid.

Conclusion: for 61, 62, and 64, the dry-run is materially speeding up proposal readiness by roughly 1.0s to 1.1s and looks healthy. 63 is not healthy enough to count as a win right now; it looks like a parent-hash/head lookup issue in the direct shadow path.

@julienh-ssv julienh-ssv marked this pull request as ready for review April 9, 2026 08:33

@olegshmuelov olegshmuelov marked this pull request as draft April 9, 2026 13:33
@julienh-ssv julienh-ssv changed the title Feat mev with duties prefetching Feat mev with duties prefetching (T-ssv-057) Apr 12, 2026
@julienh-ssv

Preliminary Research: What Vouch does.

  1. Validator registration flow
[Vouch]
  -- POST /eth/v1/validator/register_validator -->
[Beacon node]

[Vouch]
  -- POST /eth/v1/builder/validators -->
[External relay(s)]

The beacon-node API defines /eth/v1/validator/register_validator, and the Builder API defines /eth/v1/builder/validators. The builder-spec validator flow says validators submit registrations to their connected beacon node and also periodically upstream to builder software; Vouch’s config docs say Vouch talks to MEV-boost relays for validator registration, bid fetching, and block unblinding.

  2. Proposal flow when using external MEV
A. Optional early bid fetch / auction
[Vouch]
  -- GET /eth/v1/builder/header/{slot}/{parent_hash}/{pubkey} -->
[External relay A]
[Vouch]
  -- GET /eth/v1/builder/header/{slot}/{parent_hash}/{pubkey} -->
[External relay B]
[Vouch]
  <-- signed builder bid/header from relays --

B. Ask beacon node for the proposal
[Vouch]
  -- GET /eth/v3/validator/blocks/{slot}?randao_reveal=...&graffiti=...&builder_boost_factor=... -->
[Beacon node]

C. Inside that beacon-node request
[Beacon node]
  -- GET /eth/v1/builder/header/{slot}/{parent_hash}/{pubkey} -->
[Vouch local MEV-boost server]

[Vouch local MEV-boost server]
  -- GET /eth/v1/builder/header/{slot}/{parent_hash}/{pubkey} -->
[External relay(s)]

[Beacon node]
  -- local payload build -->
[Paired execution client]

D. Beacon node chooses local-vs-builder result
[Beacon node]
  <-- returns BLINDED block to Vouch (if builder path wins) --

E. Sign the blinded proposal
[Vouch]
  -- signing request -->
[Signer / validator key manager]

F. Unblind via relay
[Vouch]
  -- POST /eth/v1/builder/blinded_blocks -->
[Winning relay or selected relays]
[Relay]
  <-- full execution payload (+ blobs after Deneb) --

G. Publish final signed block
[Vouch]
  -- POST /eth/v2/beacon/blocks -->
[Beacon node]
[Beacon node]
  -- broadcast -->
[Ethereum network]

That is the standard external-builder path. The Builder API defines getHeader at /eth/v1/builder/header/{slot}/{parent_hash}/{pubkey} and submitBlindedBlock at /eth/v1/builder/blinded_blocks; getHeader returns a signed builder bid/header, and submitBlindedBlock returns the unblinded execution payload. The beacon-node API defines produceBlockV3 at /eth/v3/validator/blocks/{slot} and publishBlockV2 at /eth/v2/beacon/blocks. The spec for produceBlockV3 says the beacon node returns a blinded block only if it got an execution payload header from an MEV relay; otherwise it returns an unblinded block from its paired execution node.

  3. Builder API

Steps B. and C. above are the standard builder API flow.

@julienh-ssv

Node 69 MEV Dry-Run Report on stage

  • Node: ssv-node-69-0
  • Cluster / Namespace: stage / ssv
  • Window: last 1d

Overall

  • MEV dry-run is active and producing useful data on node 69.
  • Metrics collection is working.
  • Logs confirm real proposer-path comparisons, not just idle startup.

Prometheus Stats

  • getHeader requests: 12.01
  • getHeader cache hit rate: 100%
  • getHeader misses: 0
  • prefetch requests: 15.02
  • prefetch results: all observed as cached
  • prefetch late: 0
  • bid fetches from prefetch: 15.02
  • bid fetches from direct get_header: 0
  • slot offset avg: ~912ms

Time Gains

  • From the 16 mev dry-run comparison logs observed in the last 1d:
    • Average gain: ~808ms faster for shadow vs baseline
    • Median gain: ~806ms
    • Best observed gain: ~2.364s
    • Worst observed result: one case where shadow was ~18.8ms slower
    • Positive gain rate: 15 / 16 comparisons
  • Observed baseline times ranged from roughly 53ms to 2.375s.
  • Observed shadow total times ranged from roughly 5ms to 217ms in the sampled logs.

# Conflicts:
#	operator/node.go
#	ssvsigner/go.sum
@julienh-ssv

Stage Dry-Run Metrics Report

Collected from stage nodes 61-75 for the feat/mev dry-run rollout. The rollout has been running for 482h38m46s at collection time.

Summary

The dry-run is doing materially better than the baseline on the latency metric it is designed to improve.

Across the full 482h38m46s window, the legacy baseline GetBeaconBlock path averaged 1,153 ms, while the in-node SSV-MEV shadow getHeader total path averaged 31 ms. That is a 97.3% average latency reduction for the shadow builder-header path. The slot finish offset also improved substantially: baseline p95 finish offset was 2,426 ms into the slot, while shadow p95 finish offset was 463 ms.

The last 24h look especially clean: 337 dry-run comparisons, all baseline OK, all shadow calls returned bids, all known parent hashes matched, and no exact-parent fallback or recovered-bid cases were needed. Loki hourly summaries independently showed 330 comparisons in the last 24h with the same all-green outcome. The small count difference is expected from Prometheus increase() scrape extrapolation and slightly different query cutoffs.

Key caveat: this is still dry-run only. It proves that the in-node prefetch/cache path is much faster and highly available, but it does not yet prove live proposal success after making the beacon node use SSV-MEV as its authoritative Builder API endpoint.

Methodology

Data sources used:

| Source | What it validated |
| --- | --- |
| Live pod metrics scrape | In-pod MEV endpoint status, relay config, current counters, cache gauges. |
| Historical Prometheus/Mimir metrics | Builder endpoint, prefetch, dry-run comparison, and proposer metrics over 24h, 7d, and the full dry-run window. |
| Aggregate PromQL over Mimir | Cohort-level metrics for pods ssv-node-61-0 through ssv-node-75-0. |
| Loki logs | Hourly dry-run summaries for sanity-checking dry-run results. |
| feat/mev code instrumentation | Metric meaning and dry-run comparison semantics. |

Rollout selector:

cluster="stage", namespace="ssv", pod=~"ssv-node-(61|62|63|64|65|66|67|68|69|70|71|72|73|74|75)-0"

Live pod checks showed dry-run behavior as expected:

| Check | Result |
| --- | --- |
| Builder endpoint status | 000 on live status probe, expected for dry-run mode because the endpoint is not serving the beacon node as authoritative Builder API. |
| Relay config | hoodi.titanrelay.xyz, hoodi.aestus.live, boost-relay-hoodi.flashbots.net, relay.ultrasound.money. |
| Cache gauges at collection time | 0 cache entries, 0 provenance entries, 0 in-flight prefetches on all rollout pods. This is a point-in-time sample between duties, not evidence that caching is inactive. |
| Unblind path | No unblind requests observed, expected for dry-run. |

Prometheus note: counter increase() values below are rounded to whole events for readability. Raw PromQL returns fractional values because it extrapolates between scrapes.

Dry-Run vs Baseline

This compares the legacy proposer GetBeaconBlock baseline against the SSV-MEV shadow getHeader path recorded in the same proposer duties.

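| Window | Comparisons | Baseline OK | Baseline error | Shadow bid | Shadow no-bid | Shadow head-error | Avg baseline | Avg shadow total | Avg latency reduction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 24h | 337 | 337 | 0 | 337 | 0 | 0 | 1,203 ms | 27 ms | 97.7% |
| 7d | 2,101 | 2,098 | 3 | 2,096 | 3 | 2 | 1,168 ms | 30 ms | 97.4% |
| 482h38m46s | 6,250 | 6,247 | 3 | 6,209 | 9 | 12 | 1,153 ms | 31 ms | 97.3% |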

Latency distributions:

| Window | Baseline p50 | Baseline p90 | Baseline p95 | Baseline p99 | Shadow p50 | Shadow p90 | Shadow p95 | Shadow p99 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 24h | 1,547 ms | 2,328 ms | 2,426 ms | 2,894 ms | 8 ms | 30 ms | 47 ms | 693 ms |
| 7d | 1,538 ms | 2,316 ms | 2,413 ms | 2,491 ms | 8 ms | 35 ms | 54 ms | 680 ms |
| 482h38m46s | 1,535 ms | 2,312 ms | 2,410 ms | 2,487 ms | 9 ms | 42 ms | 69 ms | 666 ms |

Slot finish offsets:

| Window | Baseline avg finish | Baseline p95 finish | Shadow avg finish | Shadow p95 finish |
| --- | --- | --- | --- | --- |
| 24h | 1,361 ms | 2,436 ms | 185 ms | 437 ms |
| 7d | 1,315 ms | 2,428 ms | 178 ms | 407 ms |
| 482h38m46s | 1,315 ms | 2,426 ms | 192 ms | 463 ms |

Parent-hash and fallback checks:

| Window | Parent hash matches | Exact-parent checks | Exact-parent bid | Recovered bid |
| --- | --- | --- | --- | --- |
| 24h | 337 | 0 | 0 | 0 |
| 7d | 2,096 | 2 | 2 | 2 |
| 482h38m46s | 6,212 | 19 | 13 | 12 |

Interpretation:

| Finding | Assessment |
| --- | --- |
| Shadow path is much faster | Strong positive. The measured shadow path removes roughly 1.1s from the latency-sensitive builder-header portion compared with the baseline block request path. |
| 24h availability is clean | Strong positive. No dry-run shadow no-bid, head-error, timeout, baseline error, or recovered-bid cases in the last 24h. |
| Full-window recovered bids exist | Watch item. 12 recovered bids over 6,250 comparisons means the normal shadow parent-hash path occasionally missed a bid that exact-parent lookup later recovered. This is rare (~0.19%) and absent in the last 24h. |
| Baseline comparison is not perfectly apples-to-apples | Expected. Baseline is full GetBeaconBlock; shadow is SSV-MEV getHeader total path. This is still the right dry-run signal for the PR objective because the PR targets Builder API header latency. |

Builder Endpoint Cache and getHeader

| Window | getHeader total | Cache hits | Cache misses | Hit rate | Hit avg latency | Miss avg latency | Overall p99 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 24h | 337 | 329 | 8 | 97.6% | 0.006 ms | 676 ms | 682 ms |
| 7d | 2,102 | 2,043 | 59 | 97.2% | 0.003 ms | 643 ms | 667 ms |
| 482h38m46s | 6,207 | 6,073 | 134 | 97.8% | 0.003 ms | 651 ms | 629 ms |

The important result is the hit/miss split. Cache hits are effectively immediate. Misses still go to external relays and cost hundreds of milliseconds, which is exactly what the feature is trying to move out of the critical path.
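
To make the hit/miss split concrete, the blended average getHeader latency can be derived from the counts and per-class averages in the table above. This is a back-of-the-envelope derivation, not an exported metric, and the function name is illustrative.

```go
package main

import "fmt"

// BlendedLatencyMs returns the request-weighted average latency
// given hit/miss counts and their per-class average latencies.
func BlendedLatencyMs(hits, misses, hitAvgMs, missAvgMs float64) float64 {
	total := hits + misses
	if total == 0 {
		return 0
	}
	return (hits*hitAvgMs + misses*missAvgMs) / total
}

func main() {
	// 24h row: 329 hits at 0.006 ms, 8 misses at 676 ms.
	fmt.Printf("24h blended: %.1f ms\n", BlendedLatencyMs(329, 8, 0.006, 676))
}
```

Almost all of the blended figure comes from the handful of misses, which is why the miss rate, not the hit latency, is the number to keep driving down.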

Slot offset for shadow getHeader calls:

| Window | Avg offset | p50 offset | p90 offset |
| --- | --- | --- | --- |
| 24h | 174 ms | 174 ms | 289 ms |
| 7d | 165 ms | 168 ms | 244 ms |
| 482h38m46s | 184 ms | 175 ms | 299 ms |

Prefetch and Cache Warming

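| Window | Prefetch requests | Cached | No bid | Warm skips | Late prefetches | Late rate | Cached rate |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 24h | 390 | 381 | 0 | 7 | 19 | 4.9% | 97.7% |
| 7d | 2,461 | 2,406 | 3 | 46 | 129 | 5.2% | 97.8% |
| 482h38m46s | 7,225 | 7,061 | 10 | 105 | 341 | 4.7% | 97.7% |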

Prefetch timing:

| Window | Prefetch lead avg | Prefetch lead p50 | Prefetch lead p90 | First-cached lead avg | First-cached lead p50 | First-cached lead p90 | First-cached late |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 24h | 1,408 ms | 1,703 ms | 2,341 ms | 1,228 ms | 1,719 ms | 2,344 ms | 10 |
| 7d | 1,414 ms | 1,706 ms | 2,341 ms | 1,226 ms | 1,710 ms | 2,342 ms | 84 |
| 482h38m46s | 1,418 ms | 1,709 ms | 2,342 ms | 1,257 ms | 1,712 ms | 2,342 ms | 207 |

The configured prefetch lead target is around 1.5s, and the measured lead averages are close to that. The first-cached lead averages are lower because relay fetches take time, but still typically complete before the beacon asks for getHeader.

Prefetch parent-hash compare:

| Window | Match | Missing |
| --- | --- | --- |
| 24h | 337 | 0 |
| 7d | 2,103 | 0 |
| 482h38m46s | 6,252 | 0 |

Relay Bid Fetch Metrics

Bid fetch volume and latency:

| Window | Source | Result | Fetches | Avg latency | p95 latency (by source) |
| --- | --- | --- | --- | --- | --- |
| 24h | prefetch | bid | 381 | 2,289 ms | 2,422 ms |
| 24h | get_header | bid | 8 | 676 ms | 950 ms |
| 7d | prefetch | bid | 2,406 | 2,292 ms | 2,422 ms |
| 7d | prefetch | no_bid | 3 | 2,348 ms | 2,422 ms |
| 7d | get_header | bid | 56 | 678 ms | 816 ms |
| 7d | get_header | no_bid | 3 | <1 ms | 816 ms |
| 482h38m46s | prefetch | bid | 7,061 | 2,295 ms | 2,425 ms |
| 482h38m46s | prefetch | no_bid | 10 | 2,654 ms | 2,425 ms |
| 482h38m46s | get_header | bid | 119 | 693 ms | 748 ms |
| 482h38m46s | get_header | no_bid | 16 | 302 ms | 748 ms |

Relay winners:

| Window | Source | Relay | Wins | Avg winning value |
| --- | --- | --- | --- | --- |
| 24h | prefetch | boost-relay-hoodi.flashbots.net | 229 | 0.009 ETH |
| 24h | prefetch | hoodi.aestus.live | 148 | 0.008 ETH |
| 24h | get_header | boost-relay-hoodi.flashbots.net | 6 | 0.008 ETH |
| 24h | get_header | hoodi.aestus.live | 2 | 0.005 ETH |
| 7d | prefetch | boost-relay-hoodi.flashbots.net | 2,124 | 0.010 ETH |
| 7d | prefetch | hoodi.aestus.live | 278 | 0.008 ETH |
| 7d | get_header | boost-relay-hoodi.flashbots.net | 40 | 0.012 ETH |
| 7d | get_header | hoodi.aestus.live | 13 | 0.007 ETH |
| 482h38m46s | prefetch | boost-relay-hoodi.flashbots.net | 5,349 | 0.011 ETH |
| 482h38m46s | prefetch | hoodi.aestus.live | 1,670 | 0.006 ETH |
| 482h38m46s | get_header | boost-relay-hoodi.flashbots.net | 67 | 0.014 ETH |
| 482h38m46s | get_header | hoodi.aestus.live | 39 | 0.006 ETH |

Only Flashbots Hoodi and Aestus won bid selection in this sample. Titan and Ultrasound were configured but did not win according to the exported winner counters.

Loki Hourly Summary Sanity Check

Sanity check: queried Loki hourly summary logs for mev dry-run hourly summary across stage nodes 61-75 over the last 24h.

Last 24h log summary totals:

| Metric | Count |
| --- | --- |
| Hourly summary rows | 189 |
| Window total | 330 |
| Baseline OK | 330 |
| Baseline error | 0 |
| Shadow bid | 330 |
| Shadow no-bid | 0 |
| Shadow error | 0 |
| Shadow head-error | 0 |
| Shadow timeout | 0 |
| Parent hash known | 330 |
| Parent hash match | 330 |
| Parent hash mismatch | 0 |
| Exact-parent total | 0 |
| Recovered bid | 0 |

Per-node last 24h log totals:

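| Node | Comparisons | Shadow bid | Parent hash match |
| --- | --- | --- | --- |
| 61 | 21 | 21 | 21 |
| 62 | 21 | 21 | 21 |
| 63 | 22 | 22 | 22 |
| 64 | 22 | 22 | 22 |
| 65 | 17 | 17 | 17 |
| 66 | 16 | 16 | 16 |
| 67 | 16 | 16 | 16 |
| 68 | 16 | 16 | 16 |
| 69 | 17 | 17 | 17 |
| 70 | 17 | 17 | 17 |
| 71 | 17 | 17 | 17 |
| 72 | 17 | 17 | 17 |
| 73 | 37 | 37 | 37 |
| 74 | 37 | 37 | 37 |
| 75 | 37 | 37 | 37 |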

Full-Window Per-Pod Coverage

| Pod | Baseline comparisons | Shadow bids | Prefetch requests |
| --- | --- | --- | --- |
| ssv-node-61-0 | 339 | 339 | 528 |
| ssv-node-62-0 | 241 | 241 | 529 |
| ssv-node-63-0 | 340 | 337 | 526 |
| ssv-node-64-0 | 340 | 332 | 527 |
| ssv-node-65-0 | 323 | 321 | 328 |
| ssv-node-66-0 | 324 | 323 | 332 |
| ssv-node-67-0 | 324 | 323 | 330 |
| ssv-node-68-0 | 324 | 323 | 329 |
| ssv-node-69-0 | 314 | 310 | 321 |
| ssv-node-70-0 | 313 | 310 | 314 |
| ssv-node-71-0 | 314 | 306 | 330 |
| ssv-node-72-0 | 314 | 312 | 323 |
| ssv-node-73-0 | 813 | 810 | 832 |
| ssv-node-74-0 | 813 | 811 | 838 |
| ssv-node-75-0 | 813 | 810 | 837 |

Limitations and Follow-Ups

| Topic | Note |
| --- | --- |
| Dry-run scope | The existing builder path remains authoritative. This report validates latency/availability of the shadow in-node path, not live replacement behavior. |
| Baseline metric shape | Baseline measures full GetBeaconBlock; shadow measures in-node getHeader. The comparison is intentionally focused on removing builder-header latency from the critical path. |
| Negative histogram metric | ssv_runner_mev_dry_run_shadow_minus_baseline_seconds records negative observations. Prometheus histogram increase() on negative sums produced unusable average values, so this report uses baseline_avg - shadow_total_avg and Loki raw duration summaries instead. |
| Rare exact-parent recovery | Full-window data had 12 recovered bids from exact-parent checks. This was absent in the last 24h, but it is the main reliability item to keep watching before moving out of dry-run. |
| Proposer success counters | Rollout pods had proposer submission successes and failures during the window, but these counters are not a clean feature verdict because dry-run is not authoritative and stage proposer outcomes include consensus, validator assignment, and beacon-node factors outside SSV-MEV. |

Verdict

The dry-run is better than baseline for the intended objective: making builder headers available much earlier and avoiding external relay latency during the critical proposer path.

The current evidence supports continuing toward a live-routing test, with one explicit guardrail: keep tracking parent-hash lookup reliability and exact-parent recovery cases. The last 24h was clean, and the full-window recovery rate was low, but those rare cases are the most relevant correctness risk before making SSV-MEV authoritative.

@julienh-ssv

Production MEV Dry-Run Metrics Report

Collected from production mainnet SSV nodes 1-4:

cluster="production-ovh", namespace="ssv", pod=~"mainnet-ssv-node-(1|2|3|4)-0"

Summary

The production dry-run data points in the same direction as stage: the in-node SSV-MEV shadow path is faster than the baseline proposer block path. The production sample is much smaller, so treat this as a consistency check rather than a statistically strong production conclusion.

Across the full 482h38m46s comparison window, production had 48 dry-run comparisons. All 48 baseline calls were OK, all 48 shadow calls returned bids, all 48 parent hashes matched, and no recovered-bid fallback was needed. Baseline GetBeaconBlock averaged 1,095 ms; the SSV-MEV shadow total getHeader path averaged 70 ms, a 93.6% average latency reduction. Baseline p95 was 2,425 ms; shadow p95 was 850 ms.

Over 7d, production had only 12 Mimir dry-run comparisons. All were clean. Baseline averaged 1,078 ms; shadow averaged 270 ms, a 75.0% average reduction. The higher 7d shadow average is from a very small sample and should not be compared too aggressively against stage.

Data-quality note: Mimir returned 0 dry-run counter increases for the last 24h, but Loki hourly summaries showed 4 clean dry-run comparisons in that period, one on each production node. The report uses Mimir for quantitative 7d and full-window metrics, and Loki only as a last-24h sanity check.

Methodology

Data sources used:

| Source | What it validated |
| --- | --- |
| Historical Prometheus/Mimir metrics | Builder endpoint, prefetch, dry-run comparison, and proposer metrics over 24h, 7d, and 482h38m46s. |
| Loki logs | Hourly dry-run summaries for sanity-checking last-24h production behavior. |
| Production metric label discovery | Confirmed metrics exist for mainnet-ssv-node-1-0 through mainnet-ssv-node-4-0. |
| feat/mev code instrumentation | Metric meaning and dry-run comparison semantics. |

Prometheus note: counter increase() values are rounded to whole events for readability. Raw PromQL returns fractional values because it extrapolates between scrapes.
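
To illustrate why raw increase() values come back fractional, here is a deliberately simplified Python model of the extrapolation. It scales the observed counter delta from the sampled interval up to the full query window. It ignores counter resets and the limits Prometheus applies when samples sit far from the window edges, and the sample data is hypothetical:

```python
def extrapolated_increase(samples, window_start, window_end):
    """Simplified model of increase() over one query window.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Assumes no counter resets and samples close to both window edges.
    """
    if len(samples) < 2:
        # Prometheus returns no value in this case; 0.0 is a simplification.
        return 0.0
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    sampled_interval = t_last - t_first
    window = window_end - window_start
    # Scale the observed delta from the sampled interval to the full window.
    return (v_last - v_first) * window / sampled_interval


# Hypothetical 60 s window with 15 s scrapes; the counter rose by one event.
samples = [(5, 10), (20, 10), (35, 11), (50, 11)]
raw = extrapolated_increase(samples, 0, 60)  # fractional, about 1.33
events = round(raw)                          # whole events, as in this report
```

One real event is reported as roughly 1.33 by the raw query and rounds back to 1, which is the kind of rounding applied to the counters in this report.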

Dry-Run vs Baseline

This compares the legacy proposer GetBeaconBlock baseline against the SSV-MEV shadow getHeader path recorded for the same proposer duties.

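| Window | Mimir comparisons | Baseline OK | Baseline error | Shadow bid | Shadow no-bid | Shadow head-error | Avg baseline | Avg shadow total | Avg latency reduction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 24h | 0 | 0 | 0 | 0 | 0 | 0 | n/a | n/a | n/a |
| 7d | 12 | 12 | 0 | 12 | 0 | 0 | 1,078 ms | 270 ms | 75.0% |
| 482h38m46s | 48 | 48 | 0 | 48 | 0 | 0 | 1,095 ms | 70 ms | 93.6% |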
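
The reported reduction percentages are consistent with one minus the ratio of the two averages. The report does not state whether it used this formula or a mean of per-duty reductions, so the sketch below is an assumption that happens to reproduce the published figures:

```python
def latency_reduction_pct(baseline_ms: float, shadow_ms: float) -> float:
    """Percentage by which the shadow average undercuts the baseline average."""
    return (1 - shadow_ms / baseline_ms) * 100


# Averages copied from the 482h38m46s and 7d rows.
full_window = round(latency_reduction_pct(1095, 70), 1)
seven_day = round(latency_reduction_pct(1078, 270), 1)
```

With these inputs, full_window evaluates to 93.6 and seven_day to 75.0, matching the table.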

Latency and finish offset:

| Window | Baseline p95 | Shadow p95 | Baseline p95 finish offset | Shadow p95 finish offset |
| --- | --- | --- | --- | --- |
| 24h | n/a | n/a | n/a | n/a |
| 7d | 2,425 ms | 963 ms | 2,425 ms | 963 ms |
| 482h38m46s | 2,425 ms | 850 ms | 2,425 ms | 850 ms |

Parent-hash and fallback checks:

| Window | Parent hash matches | Exact-parent checks | Exact-parent bid | Recovered bid |
| --- | --- | --- | --- | --- |
| 24h | 0 | 0 | 0 | 0 |
| 7d | 12 | 0 | 0 | 0 |
| 482h38m46s | 48 | 0 | 0 | 0 |

Interpretation:

| Finding | Assessment |
| --- | --- |
| Shadow path is faster | Positive. Full-window production shadow total averaged 70 ms vs 1,095 ms baseline. |
| Reliability in observed samples is clean | Positive. No shadow no-bid, head-error, timeout, parent-hash mismatch, or recovered-bid cases in Mimir over 7d or the full window. |
| Production sample is sparse | Important caveat. The full-window production sample is only 48 comparisons across four nodes, and the 7d sample is only 12. |
| 24h metrics/logs disagree on count | Caveat. Mimir showed no 24h counter increase, while Loki had four clean hourly summaries. Treat last-24h production as too sparse for quantitative latency conclusions. |

Builder Endpoint Cache and getHeader

| Window | getHeader total | Cache hits | Cache misses | Hit rate | getHeader p99 |
| --- | --- | --- | --- | --- | --- |
| 24h | 0 | 0 | 0 | n/a | n/a |
| 7d | 8 | 8 | 0 | 100.0% | 0.99 ms |
| 482h38m46s | 44 | 44 | 0 | 100.0% | 0.99 ms |

The production cache-hit sample is small but clean: every Mimir-observed getHeader request was served from cache, with no miss-triggered relay fetches on the getHeader critical path.

Prefetch and Cache Warming

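| Window | Prefetch requests | Cached | No bid | Late prefetches | Late rate | Cached rate | Avg prefetch lead |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 24h | 0 | 0 | 0 | 0 | n/a | n/a | n/a |
| 7d | 12 | 11 | 0 | 0 | 0.0% | 91.7% | 996 ms |
| 482h38m46s | 48 | 47 | 0 | 0 | 0.0% | 97.9% | 1,371 ms |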

The full-window production prefetch lead averaged 1,371 ms, which is close to the intended early-prefetch behavior. Unlike stage, production had no observed late prefetches in the Mimir windows.
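
The cached-rate and late-rate columns can be reproduced from the raw counts, including the n/a entries for the empty 24h window. A small sketch, assuming both rates are taken over prefetch requests (the report does not spell out the denominator, but this choice matches the published values):

```python
def rate(count: int, total: int) -> str:
    """Format count/total as a percentage, or "n/a" when there were no events."""
    if total == 0:
        return "n/a"
    return f"{100 * count / total:.1f}%"


# (prefetch requests, cached, late prefetches) per window, from the table.
windows = {
    "24h": (0, 0, 0),
    "7d": (12, 11, 0),
    "482h38m46s": (48, 47, 0),
}
cached_rates = {w: rate(cached, reqs) for w, (reqs, cached, _) in windows.items()}
late_rates = {w: rate(late, reqs) for w, (reqs, _, late) in windows.items()}
```

Returning "n/a" for a zero denominator keeps an empty window from being read as a 0% rate.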

Relay Bid Fetch Metrics

Bid fetch volume:

| Window | Source | Result | Fetches |
| --- | --- | --- | --- |
| 24h | prefetch | bid | 0 |
| 24h | get_header | bid | 0 |
| 7d | prefetch | bid | 11 |
| 7d | get_header | bid | 0 |
| 482h38m46s | prefetch | bid | 47 |
| 482h38m46s | get_header | bid | 0 |

Relay winners:

| Window | Source | Relay | Wins |
| --- | --- | --- | --- |
| 24h | prefetch | boost-relay.flashbots.net | 0 |
| 7d | prefetch | boost-relay.flashbots.net | 11 |
| 482h38m46s | prefetch | boost-relay.flashbots.net | 47 |

Only boost-relay.flashbots.net won bid selection in the production sample. There were no observed get_header relay fetch winners because observed getHeader calls were cache hits.

Current Gauges

At collection time:

| Gauge | Observed value |
| --- | --- |
| Cache entries | 0 on nodes with current gauge samples |
| Cache provenance entries | 0 on nodes with current gauge samples |
| In-flight prefetches | 0 on nodes with current gauge samples |

Gauge samples were present for nodes 1, 2, and 4 at collection time. Node 3 had counter/history data in Mimir but no current gauge sample in the instant gauge query. This is a point-in-time scrape observation, not evidence that node 3 did not participate.

Loki Last-24h Sanity Check

Loki log lines matching "mev dry-run hourly summary" across production nodes 1-4 over the last 24h showed:

| Metric | Count |
| --- | --- |
| Hourly summary rows | 4 |
| Window total | 4 |
| Baseline OK | 4 |
| Baseline error | 0 |
| Shadow bid | 4 |
| Shadow no-bid | 0 |
| Shadow error | 0 |
| Shadow head-error | 0 |
| Shadow timeout | 0 |
| Parent hash known | 4 |
| Parent hash match | 4 |
| Parent hash mismatch | 0 |
| Exact-parent total | 0 |
| Recovered bid | 0 |

Per-node last-24h Loki summary:

| Node | Comparisons | Shadow bid | Parent hash match |
| --- | --- | --- | --- |
| mainnet-ssv-node-1 | 1 | 1 | 1 |
| mainnet-ssv-node-2 | 1 | 1 | 1 |
| mainnet-ssv-node-3 | 1 | 1 | 1 |
| mainnet-ssv-node-4 | 1 | 1 | 1 |

Full-Window Per-Pod Coverage

| Pod | Baseline comparisons | Shadow bids | Prefetch requests |
| --- | --- | --- | --- |
| mainnet-ssv-node-1-0 | 12 | 12 | 12 |
| mainnet-ssv-node-2-0 | 12 | 12 | 12 |
| mainnet-ssv-node-3-0 | 12 | 12 | 12 |
| mainnet-ssv-node-4-0 | 12 | 12 | 12 |
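
As a quick consistency check, the per-pod counts should sum to the full-window totals reported earlier (48 comparisons, 48 shadow bids, 48 prefetch requests). A short sketch of that check, using only the numbers from this report:

```python
# Per-pod full-window counts copied from the coverage table:
# (baseline comparisons, shadow bids, prefetch requests).
per_pod = {
    "mainnet-ssv-node-1-0": (12, 12, 12),
    "mainnet-ssv-node-2-0": (12, 12, 12),
    "mainnet-ssv-node-3-0": (12, 12, 12),
    "mainnet-ssv-node-4-0": (12, 12, 12),
}

# Sum each column across pods.
totals = tuple(sum(column) for column in zip(*per_pod.values()))

# Full-window totals from the dry-run and prefetch tables.
consistent = totals == (48, 48, 48)
```

The even 12-per-pod split also shows that no single node dominates the production sample.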

Limitations and Follow-Ups

| Topic | Note |
| --- | --- |
| Sparse production proposer sample | The full-window production sample is only 48 comparisons. This is enough to confirm the feature is behaving cleanly on production nodes, but not enough for a high-confidence production latency distribution. |
| Dry-run scope | The existing builder path remains authoritative. This report validates the shadow in-node path, not live authoritative Builder API routing. |
| 24h Mimir/log mismatch | Mimir showed 0 last-24h counter increase, while Loki showed 4 clean hourly summary records. This should be treated as telemetry sparsity or scrape-window mismatch until checked further. |
| Baseline metric shape | Baseline measures full GetBeaconBlock; shadow measures in-node getHeader. The comparison is still relevant because the PR targets builder-header latency on the critical proposer path. |
| No unblind data | No unblind requests were observed, which is expected for a dry-run. |

Verdict

Production nodes 1-4 show the same qualitative outcome as stage: SSV-MEV dry-run serves builder headers much faster than the baseline path and did not show reliability failures in observed samples.

The production evidence is positive but sample-limited. I would use it as a production sanity check that the dry-run behaves correctly on mainnet nodes, while relying on the larger stage sample for stronger latency-distribution confidence.
