Feat mev with duties prefetching (T-ssv-057) #2700
Run shadow in-node get_header in parallel to legacy GetBeaconBlock, record comparisons via logs/metrics and API, and add --no-dry-run to serve the builder endpoint.
Added dry-run mode to safely deploy to nodes without needing to change beacon node configuration.
Add unit tests for dry-run service behavior and MEV dry-run HTTP API handler/route. Refactor dry-run service dependencies to narrow interfaces and reduce duplicated get_header result handling.
Updated PR description with stats.
Stats and stage results
Conclusion: the in-node MEV path is working, materially improves proposer readiness, and is healthy enough to move forward. Node 63 is no longer a blocker; it still has later beacon timing than
24h Mimir summary:
7d Mimir summary:
Live node counters right now:
What this means operationally:
Observed proposer timing wins from comparison logs:
Another recent sample at 2026-04-09 07:51 UTC
Important nuance on 63:
Duty health:
Note: "Timing Game" is not the best term to use here, as it is often used for something different (increasing MEV value by delaying blocks), which is the opposite of what we're trying to achieve (we want to speed up blocks!).
The old timing game is:
The new dry-run timing game is:
So the measured improvement is not incremental. It is typically:
The strongest single metric for hierarchy is probably this:
That is the feature doing exactly what it was intended to do: taking relay latency out of the beacon-critical path.
The honest answer is: latency improvement only matters financially when it changes outcome, meaning when it converts a would-be missed or late builder opportunity into a captured bid. A reasonable gross-value model is: extra ETH per validator per year Using:
Gross validator-level uplift at 0.004 ETH average builder premium:
Fleet-level gross uplift at 0.004 ETH average premium:
How to interpret that:
Recommendation
This is ready for mainnet rollout.
Older stats:
Dry mode running on stage on nodes 61-64 for a few days. The MEV dry-run is showing a significant latency win on nodes 61, 62, and 64. On 63, the dry-run path is mostly failing with head_error, so
24h Summary
Prometheus / Builder Metrics
Live Counter Snapshot
A concrete healthy example from 61 on 2026-03-17T19:45:49Z: baseline 1.353s, shadow 38.3ms, shadow finished at 238ms from slot start, with a 0.0163 ETH bid.
Conclusion: for 61, 62, and 64, the dry-run is materially speeding up proposal readiness by roughly 1.0s to 1.1s and looks healthy. 63 is not healthy enough
Preliminary Research: What Vouch does.
The beacon-node API defines
That is the standard external-builder path. The Builder API defines
Steps B. and C. above are the standard Builder API flow.
• Node 69 MEV Dry-Run Report on stage
Overall
Prometheus Stats
Time Gains
# Conflicts: # operator/node.go # ssvsigner/go.sum
Stage Dry-Run Metrics Report
Collected from stage nodes
Summary
The dry-run is doing materially better than the baseline on the latency metric it is designed to improve. Across the full
The last 24h look especially clean:
Key caveat: this is still dry-run only. It proves that the in-node prefetch/cache path is much faster and highly available, but it does not yet prove live proposal success after making the beacon node use SSV-MEV as its authoritative Builder API endpoint.
Methodology
Data sources used:
Rollout selector:
Live pod checks showed dry-run behavior as expected:
Prometheus note: counter
Dry-Run vs Baseline
This compares the legacy proposer
Latency distributions:
Slot finish offsets:
Parent-hash and fallback checks:
Interpretation:
Builder Endpoint Cache and
| Window | getHeader total | Cache hits | Cache misses | Hit rate | Hit avg latency | Miss avg latency | Overall p99 |
|---|---|---|---|---|---|---|---|
| 24h | 337 | 329 | 8 | 97.6% | 0.006 ms | 676 ms | 682 ms |
| 7d | 2,102 | 2,043 | 59 | 97.2% | 0.003 ms | 643 ms | 667 ms |
| 482h38m46s | 6,207 | 6,073 | 134 | 97.8% | 0.003 ms | 651 ms | 629 ms |
The important result is the hit/miss split. Cache hits are effectively immediate. Misses still go to external relays and cost hundreds of milliseconds, which is exactly what the feature is trying to move out of the critical path.
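The hit rates in the table can be re-derived from the raw hit and miss counters. The following is a minimal sketch of that arithmetic; the `window` type and `hitRatePercent` helper are illustrative names, not identifiers from this PR, and the counts are copied from the table above.

```go
package main

import "fmt"

// window holds getHeader cache counters for one reporting window.
// Hypothetical type for illustration; values come from the table above.
type window struct {
	Name   string
	Hits   uint64
	Misses uint64
}

// hitRatePercent returns hits / (hits + misses) as a percentage,
// or 0 when the window had no requests at all.
func hitRatePercent(hits, misses uint64) float64 {
	total := hits + misses
	if total == 0 {
		return 0
	}
	return 100 * float64(hits) / float64(total)
}

func main() {
	windows := []window{
		{"24h", 329, 8},
		{"7d", 2043, 59},
		{"482h38m46s", 6073, 134},
	}
	for _, w := range windows {
		fmt.Printf("%s: %.1f%%\n", w.Name, hitRatePercent(w.Hits, w.Misses))
	}
}
```

Running it prints 97.6%, 97.2%, and 97.8%, matching the Hit rate column.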
Slot offset for shadow getHeader calls:
| Window | Avg offset | p50 offset | p90 offset |
|---|---|---|---|
| 24h | 174 ms | 174 ms | 289 ms |
| 7d | 165 ms | 168 ms | 244 ms |
| 482h38m46s | 184 ms | 175 ms | 299 ms |
Prefetch and Cache Warming
| Window | Prefetch requests | Cached | No bid | Warm skips | Late prefetches | Late rate | Cached rate |
|---|---|---|---|---|---|---|---|
| 24h | 390 | 381 | 0 | 7 | 19 | 4.9% | 97.7% |
| 7d | 2,461 | 2,406 | 3 | 46 | 129 | 5.2% | 97.8% |
| 482h38m46s | 7,225 | 7,061 | 10 | 105 | 341 | 4.7% | 97.7% |
Prefetch timing:
| Window | Prefetch lead avg | Prefetch lead p50 | Prefetch lead p90 | First-cached lead avg | First-cached lead p50 | First-cached lead p90 | First-cached late |
|---|---|---|---|---|---|---|---|
| 24h | 1,408 ms | 1,703 ms | 2,341 ms | 1,228 ms | 1,719 ms | 2,344 ms | 10 |
| 7d | 1,414 ms | 1,706 ms | 2,341 ms | 1,226 ms | 1,710 ms | 2,342 ms | 84 |
| 482h38m46s | 1,418 ms | 1,709 ms | 2,342 ms | 1,257 ms | 1,712 ms | 2,342 ms | 207 |
The configured prefetch lead target is around 1.5s, and the measured lead averages are close to that. The first-cached lead averages are lower because relay fetches take time, but still typically complete before the beacon asks for getHeader.
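To make the lead-time idea concrete, here is a minimal sketch of how a prefetch start time could be derived from a slot number. It assumes the lead is measured relative to slot start, which the report does not state explicitly. `exampleGenesis`, `defaultLead`, and the helper names are hypothetical and not taken from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// slotDuration is the Ethereum slot length (12 seconds).
const slotDuration = 12 * time.Second

// defaultLead mirrors the "around 1.5s" prefetch lead target from the report.
const defaultLead = 1500 * time.Millisecond

// exampleGenesis is a hypothetical genesis time used only for illustration.
var exampleGenesis = time.Unix(1_700_000_000, 0).UTC()

// slotStart returns the wall-clock start of slot for the given genesis time.
func slotStart(genesis time.Time, slot uint64) time.Time {
	return genesis.Add(time.Duration(slot) * slotDuration)
}

// prefetchAt returns the time at which a prefetch should start so that it
// begins lead before the slot starts (assumption: lead is relative to slot start).
func prefetchAt(genesis time.Time, slot uint64, lead time.Duration) time.Time {
	return slotStart(genesis, slot).Add(-lead)
}

// leadAt returns how far ahead of the slot start an event at t happened.
// A negative result means the event happened after the slot started.
func leadAt(genesis time.Time, slot uint64, t time.Time) time.Duration {
	return slotStart(genesis, slot).Sub(t)
}

func main() {
	const slot = 10
	at := prefetchAt(exampleGenesis, slot, defaultLead)
	fmt.Println("slot start: ", slotStart(exampleGenesis, slot))
	fmt.Println("prefetch at:", at)
	fmt.Println("lead:       ", leadAt(exampleGenesis, slot, at))
}
```

Under this assumption, a "first-cached lead" would be `leadAt` evaluated at the moment the first bid lands in the cache, which is why it is shorter than the prefetch lead by roughly the relay round-trip time.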
Prefetch parent-hash compare:
| Window | Match | Missing |
|---|---|---|
| 24h | 337 | 0 |
| 7d | 2,103 | 0 |
| 482h38m46s | 6,252 | 0 |
Relay Bid Fetch Metrics
Bid fetch volume and latency:
| Window | Source | Result | Fetches | Avg latency | p95 latency by source |
|---|---|---|---|---|---|
| 24h | prefetch | bid | 381 | 2,289 ms | 2,422 ms |
| 24h | get_header | bid | 8 | 676 ms | 950 ms |
| 7d | prefetch | bid | 2,406 | 2,292 ms | 2,422 ms |
| 7d | prefetch | no_bid | 3 | 2,348 ms | 2,422 ms |
| 7d | get_header | bid | 56 | 678 ms | 816 ms |
| 7d | get_header | no_bid | 3 | <1 ms | 816 ms |
| 482h38m46s | prefetch | bid | 7,061 | 2,295 ms | 2,425 ms |
| 482h38m46s | prefetch | no_bid | 10 | 2,654 ms | 2,425 ms |
| 482h38m46s | get_header | bid | 119 | 693 ms | 748 ms |
| 482h38m46s | get_header | no_bid | 16 | 302 ms | 748 ms |
Relay winners:
| Window | Source | Relay | Wins | Avg winning value |
|---|---|---|---|---|
| 24h | prefetch | boost-relay-hoodi.flashbots.net | 229 | 0.009 ETH |
| 24h | prefetch | hoodi.aestus.live | 148 | 0.008 ETH |
| 24h | get_header | boost-relay-hoodi.flashbots.net | 6 | 0.008 ETH |
| 24h | get_header | hoodi.aestus.live | 2 | 0.005 ETH |
| 7d | prefetch | boost-relay-hoodi.flashbots.net | 2,124 | 0.010 ETH |
| 7d | prefetch | hoodi.aestus.live | 278 | 0.008 ETH |
| 7d | get_header | boost-relay-hoodi.flashbots.net | 40 | 0.012 ETH |
| 7d | get_header | hoodi.aestus.live | 13 | 0.007 ETH |
| 482h38m46s | prefetch | boost-relay-hoodi.flashbots.net | 5,349 | 0.011 ETH |
| 482h38m46s | prefetch | hoodi.aestus.live | 1,670 | 0.006 ETH |
| 482h38m46s | get_header | boost-relay-hoodi.flashbots.net | 67 | 0.014 ETH |
| 482h38m46s | get_header | hoodi.aestus.live | 39 | 0.006 ETH |
Only Flashbots Hoodi and Aestus won bid selection in this sample. Titan and Ultrasound were configured but did not win according to the exported winner counters.
Loki Hourly Summary Sanity Check
Sanity check: queried Loki hourly summary logs for mev dry-run hourly summary across stage nodes 61-75 over the last 24h.
Last 24h log summary totals:
| Metric | Count |
|---|---|
| Hourly summary rows | 189 |
| Window total | 330 |
| Baseline OK | 330 |
| Baseline error | 0 |
| Shadow bid | 330 |
| Shadow no-bid | 0 |
| Shadow error | 0 |
| Shadow head-error | 0 |
| Shadow timeout | 0 |
| Parent hash known | 330 |
| Parent hash match | 330 |
| Parent hash mismatch | 0 |
| Exact-parent total | 0 |
| Recovered bid | 0 |
Per-node last 24h log totals:
| Node | Comparisons | Shadow bid | Parent hash match |
|---|---|---|---|
| 61 | 21 | 21 | 21 |
| 62 | 21 | 21 | 21 |
| 63 | 22 | 22 | 22 |
| 64 | 22 | 22 | 22 |
| 65 | 17 | 17 | 17 |
| 66 | 16 | 16 | 16 |
| 67 | 16 | 16 | 16 |
| 68 | 16 | 16 | 16 |
| 69 | 17 | 17 | 17 |
| 70 | 17 | 17 | 17 |
| 71 | 17 | 17 | 17 |
| 72 | 17 | 17 | 17 |
| 73 | 37 | 37 | 37 |
| 74 | 37 | 37 | 37 |
| 75 | 37 | 37 | 37 |
Full-Window Per-Pod Coverage
| Pod | Baseline comparisons | Shadow bids | Prefetch requests |
|---|---|---|---|
| ssv-node-61-0 | 339 | 339 | 528 |
| ssv-node-62-0 | 241 | 241 | 529 |
| ssv-node-63-0 | 340 | 337 | 526 |
| ssv-node-64-0 | 340 | 332 | 527 |
| ssv-node-65-0 | 323 | 321 | 328 |
| ssv-node-66-0 | 324 | 323 | 332 |
| ssv-node-67-0 | 324 | 323 | 330 |
| ssv-node-68-0 | 324 | 323 | 329 |
| ssv-node-69-0 | 314 | 310 | 321 |
| ssv-node-70-0 | 313 | 310 | 314 |
| ssv-node-71-0 | 314 | 306 | 330 |
| ssv-node-72-0 | 314 | 312 | 323 |
| ssv-node-73-0 | 813 | 810 | 832 |
| ssv-node-74-0 | 813 | 811 | 838 |
| ssv-node-75-0 | 813 | 810 | 837 |
Limitations and Follow-Ups
| Topic | Note |
|---|---|
| Dry-run scope | The existing builder path remains authoritative. This report validates latency/availability of the shadow in-node path, not live replacement behavior. |
| Baseline metric shape | Baseline measures full GetBeaconBlock; shadow measures in-node getHeader. The comparison is intentionally focused on removing builder-header latency from the critical path. |
| Negative histogram metric | ssv_runner_mev_dry_run_shadow_minus_baseline_seconds records negative observations. Prometheus histogram increase() on negative sums produced unusable average values, so this report uses baseline_avg - shadow_total_avg and Loki raw duration summaries instead. |
| Rare exact-parent recovery | Full-window data had 12 recovered bids from exact-parent checks. This was absent in the last 24h, but it is the main reliability item to keep watching before moving out of dry-run. |
| Proposer success counters | Rollout pods had proposer submission successes and failures during the window, but these counters are not a clean feature verdict because dry-run is not authoritative and stage proposer outcomes include consensus, validator assignment, and beacon-node factors outside SSV-MEV. |
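The workaround described in the "Negative histogram metric" row (comparing window averages instead of using the negative-valued histogram) can be sketched as follows. This is an illustration of the arithmetic only: `histogramDelta` and `avgGainSeconds` are hypothetical names, and the numbers in `main` are made up for the example rather than taken from the report.

```go
package main

import "fmt"

// histogramDelta is the increase of a Prometheus histogram's _sum and _count
// series over a reporting window. Hypothetical type for illustration.
type histogramDelta struct {
	SumSeconds float64
	Count      float64
}

// avgSeconds returns sum/count, and false when the window had no observations.
func (h histogramDelta) avgSeconds() (float64, bool) {
	if h.Count <= 0 {
		return 0, false
	}
	return h.SumSeconds / h.Count, true
}

// avgGainSeconds returns baseline average minus shadow average.
// A positive result means the shadow path was faster on average.
func avgGainSeconds(baseline, shadow histogramDelta) (float64, bool) {
	b, ok := baseline.avgSeconds()
	if !ok {
		return 0, false
	}
	s, ok := shadow.avgSeconds()
	if !ok {
		return 0, false
	}
	return b - s, true
}

func main() {
	// Hypothetical window values, not taken from the report.
	baseline := histogramDelta{SumSeconds: 30, Count: 20} // average 1.5s
	shadow := histogramDelta{SumSeconds: 5, Count: 20}    // average 0.25s
	if gain, ok := avgGainSeconds(baseline, shadow); ok {
		fmt.Printf("average gain: %.3fs\n", gain) // prints "average gain: 1.250s"
	}
}
```

Because both inputs are non-negative duration histograms, this avoids the negative-sum problem entirely, at the cost of comparing averages rather than per-proposal differences.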
Verdict
The dry-run is better than baseline for the intended objective: making builder headers available much earlier and avoiding external relay latency during the critical proposer path.
The current evidence supports continuing toward a live-routing test, with one explicit guardrail: keep tracking parent-hash lookup reliability and exact-parent recovery cases. The last 24h was clean, and the full-window recovery rate was low, but those rare cases are the most relevant correctness risk before making SSV-MEV authoritative.
Production MEV Dry-Run Metrics Report
Collected from production mainnet SSV nodes
Summary
The production dry-run data points in the same direction as stage: the in-node SSV-MEV shadow path is faster than the baseline proposer block path. The production sample is much smaller, so treat this as a consistency check rather than a statistically strong production conclusion. Across the full
Over
Data-quality note: Mimir returned
Methodology
Data sources used:
Prometheus note: counter
Dry-Run vs Baseline
This compares the legacy proposer
Latency and finish offset:
Parent-hash and fallback checks:
Interpretation:
Builder Endpoint Cache and
| Window | getHeader total | Cache hits | Cache misses | Hit rate | getHeader p99 |
|---|---|---|---|---|---|
| 24h | 0 | 0 | 0 | n/a | n/a |
| 7d | 8 | 8 | 0 | 100.0% | 0.99 ms |
| 482h38m46s | 44 | 44 | 0 | 100.0% | 0.99 ms |
The production cache-hit sample is small but clean: every Mimir-observed getHeader request was served from cache, with no miss-triggered relay fetches on the getHeader critical path.
Prefetch and Cache Warming
| Window | Prefetch requests | Cached | No bid | Late prefetches | Late rate | Cached rate | Avg prefetch lead |
|---|---|---|---|---|---|---|---|
| 24h | 0 | 0 | 0 | 0 | n/a | n/a | n/a |
| 7d | 12 | 11 | 0 | 0 | 0.0% | 91.7% | 996 ms |
| 482h38m46s | 48 | 47 | 0 | 0 | 0.0% | 97.9% | 1,371 ms |
The full-window production prefetch lead average is close to the intended early-prefetch behavior. Unlike stage, production had no observed late prefetches in the Mimir windows.
Relay Bid Fetch Metrics
Bid fetch volume:
| Window | Source | Result | Fetches |
|---|---|---|---|
| 24h | prefetch | bid | 0 |
| 24h | get_header | bid | 0 |
| 7d | prefetch | bid | 11 |
| 7d | get_header | bid | 0 |
| 482h38m46s | prefetch | bid | 47 |
| 482h38m46s | get_header | bid | 0 |
Relay winners:
| Window | Source | Relay | Wins |
|---|---|---|---|
| 24h | prefetch | boost-relay.flashbots.net | 0 |
| 7d | prefetch | boost-relay.flashbots.net | 11 |
| 482h38m46s | prefetch | boost-relay.flashbots.net | 47 |
Only boost-relay.flashbots.net won bid selection in the production sample. There were no observed get_header relay fetch winners because observed getHeader calls were cache hits.
Current Gauges
At collection time:
| Gauge | Observed value |
|---|---|
| Cache entries | 0 on nodes with current gauge samples |
| Cache provenance entries | 0 on nodes with current gauge samples |
| In-flight prefetches | 0 on nodes with current gauge samples |
Gauge samples were present for nodes 1, 2, and 4 at collection time. Node 3 had counter/history data in Mimir but no current gauge sample in the instant gauge query. This is a point-in-time scrape observation, not evidence that node 3 did not participate.
Loki Last-24h Sanity Check
Loki hourly summary logs for mev dry-run hourly summary across production nodes 1-4 over the last 24h showed:
| Metric | Count |
|---|---|
| Hourly summary rows | 4 |
| Window total | 4 |
| Baseline OK | 4 |
| Baseline error | 0 |
| Shadow bid | 4 |
| Shadow no-bid | 0 |
| Shadow error | 0 |
| Shadow head-error | 0 |
| Shadow timeout | 0 |
| Parent hash known | 4 |
| Parent hash match | 4 |
| Parent hash mismatch | 0 |
| Exact-parent total | 0 |
| Recovered bid | 0 |
Per-node last-24h Loki summary:
| Node | Comparisons | Shadow bid | Parent hash match |
|---|---|---|---|
| mainnet-ssv-node-1 | 1 | 1 | 1 |
| mainnet-ssv-node-2 | 1 | 1 | 1 |
| mainnet-ssv-node-3 | 1 | 1 | 1 |
| mainnet-ssv-node-4 | 1 | 1 | 1 |
Full-Window Per-Pod Coverage
| Pod | Baseline comparisons | Shadow bids | Prefetch requests |
|---|---|---|---|
| mainnet-ssv-node-1-0 | 12 | 12 | 12 |
| mainnet-ssv-node-2-0 | 12 | 12 | 12 |
| mainnet-ssv-node-3-0 | 12 | 12 | 12 |
| mainnet-ssv-node-4-0 | 12 | 12 | 12 |
Limitations and Follow-Ups
| Topic | Note |
|---|---|
| Sparse production proposer sample | The full-window production sample is only 48 comparisons. This is enough to confirm the feature is behaving cleanly on production nodes, but not enough for a high-confidence production latency distribution. |
| Dry-run scope | The existing builder path remains authoritative. This report validates the shadow in-node path, not live authoritative Builder API routing. |
| 24h Mimir/log mismatch | Mimir showed 0 last-24h counter increase, while Loki showed 4 clean hourly summary records. This should be treated as telemetry sparsity or scrape-window mismatch until checked further. |
| Baseline metric shape | Baseline measures full GetBeaconBlock; shadow measures in-node getHeader. The comparison is still relevant because the PR targets builder-header latency on the critical proposer path. |
| No unblind data | No unblind requests were observed, expected for dry-run. |
Verdict
Production nodes 1-4 show the same qualitative outcome as stage: SSV-MEV dry-run serves builder headers much faster than the baseline path and did not show reliability failures in observed samples.
The production evidence is positive but sample-limited. I would use it as a production sanity check that the dry-run behaves correctly on mainnet nodes, while relying on the larger stage sample for stronger latency-distribution confidence.
Recently we observed issues where enabling MEV boost on a beacon node would cause our validators to miss/fail proposer duties. This PR attempts to address this issue.
Summary
SSV Nodes are run by operators on a distributed-validator network. For each validator, the validator key is split into key shares across multiple operators. For each duty, the operator set first reaches consensus on what should be signed, and then a threshold of operators produces partial signatures. Because Ethereum proposal slots are only 12 seconds long, extra latency in the proposal path can cause late or missed proposals.
When the beacon node is configured to use the Builder API, the builder path adds extra latency before the SSV Node receives the proposal data to sign. In our current setup, that latency can overlap with SSV pre-consensus and signing deadlines, which increases the risk of late proposals.
This PR adds an in-node Builder API component ("SSV-MEV") inside the SSV Node. SSV-MEV prefetches builder bids ahead of the critical path and caches the best bid seen so far. Later, when the beacon node asks for a builder header, SSV-MEV can respond immediately with the best prefetched bid instead of waiting on external requests at proposal time.
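As a rough illustration of "caches the best bid seen so far", here is a minimal sketch of a best-bid cache keyed by slot, parent hash, and proposer pubkey. All type and function names are hypothetical and not the PR's implementation. Bid values are simplified to `uint64` gwei, whereas real builder bids carry 256-bit wei values and a signature that must be verified.

```go
package main

import (
	"fmt"
	"sync"
)

// bidKey identifies one getHeader request: slot, parent hash, proposer pubkey.
type bidKey struct {
	Slot       uint64
	ParentHash string
	Pubkey     string
}

// cachedBid is a simplified, hypothetical stand-in for a signed builder bid.
type cachedBid struct {
	Relay     string
	ValueGwei uint64 // simplification: real bid values are uint256 wei
}

// bidCache keeps the highest-value bid seen so far for each key.
type bidCache struct {
	mu   sync.RWMutex
	bids map[bidKey]cachedBid
}

func newBidCache() *bidCache {
	return &bidCache{bids: make(map[bidKey]cachedBid)}
}

// Offer stores bid if it is strictly better than the current best for key.
// It reports whether the bid was stored.
func (c *bidCache) Offer(key bidKey, bid cachedBid) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cur, ok := c.bids[key]; ok && cur.ValueGwei >= bid.ValueGwei {
		return false
	}
	c.bids[key] = bid
	return true
}

// Best returns the best cached bid for key, if one exists.
func (c *bidCache) Best(key bidKey) (cachedBid, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	bid, ok := c.bids[key]
	return bid, ok
}

func main() {
	cache := newBidCache()
	key := bidKey{Slot: 100, ParentHash: "0xabc", Pubkey: "0xdef"}
	cache.Offer(key, cachedBid{Relay: "relay-a", ValueGwei: 8})
	cache.Offer(key, cachedBid{Relay: "relay-b", ValueGwei: 10})
	cache.Offer(key, cachedBid{Relay: "relay-c", ValueGwei: 9})
	best, _ := cache.Best(key)
	fmt.Println(best.Relay, best.ValueGwei) // prints "relay-b 10"
}
```

Keying on the full (slot, parent hash, pubkey) tuple means a bid prefetched against one parent is never served for a different parent. A production cache would also need eviction of past slots, which this sketch omits.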
This PR does not change the beacon node's local-vs-builder selection logic. The beacon node still decides whether to use the local execution payload or the builder path. This PR only reduces latency on the builder side.
The diagrams below focus only on the latency-sensitive getHeader stage. Validator registration and post-sign publication still follow the standard Builder API / beacon-node flow.
Current flow

    sequenceDiagram
        participant SSV as SSV Node
        participant Ops as SSV operator set
        participant BN as Beacon node
        participant Builder as Current Builder API endpoint
        participant EL as Execution client
        SSV->>Ops: Pre-consensus / proposal preparation
        SSV->>BN: GET /eth/v3/validator/blocks/{slot}
        BN->>Builder: GET /eth/v1/builder/header/{slot}/{parent_hash}/{pubkey}
        Builder-->>BN: Signed builder bid/header
        BN->>EL: Build local execution payload
        BN-->>SSV: Return blinded or unblinded block
        SSV->>Ops: Consensus + threshold signing

In the current flow, the builder-header fetch is on the critical proposer path.
New flow

    sequenceDiagram
        participant SSV as SSV Node
        participant Ops as SSV operator set
        participant BN as Beacon node
        participant SSVMEV as SSV-MEV (in-node Builder API proxy/cache)
        participant Relays as External relays
        participant EL as Execution client
        SSV->>Ops: Pre-consensus / proposal preparation
        SSV->>SSVMEV: Start prefetch once slot/parent_hash/pubkey are known
        SSVMEV->>Relays: Query relays early
        Relays-->>SSVMEV: Signed builder bids
        SSV->>BN: GET /eth/v3/validator/blocks/{slot}
        BN->>SSVMEV: GET /eth/v1/builder/header/{slot}/{parent_hash}/{pubkey}
        SSVMEV-->>BN: Best cached bid/header (if available)
        BN->>EL: Build local execution payload
        BN-->>SSV: Return blinded or unblinded block
        SSV->>Ops: Consensus + threshold signing

This moves external builder communication earlier, so the beacon node can be served from local cache instead of waiting on external requests during the most time-sensitive part of the proposer path.
Why this helps
The key idea is:
This should reduce proposal latency and lower the chance of missing proposer duties due to late builder responses.
Relation to Vouch
This follows the same general pattern as Vouch: the beacon node talks to a local Builder API endpoint, and that endpoint talks to external relays. The difference is that we are moving that functionality into the SSV Node and adding bid prefetching.
Current rollout
At the moment, this PR runs both paths in parallel. The existing builder path remains authoritative, while SSV-MEV fetches the same bids ahead of time inside the SSV Node.
The in-node path is currently dry-run only. It is used for latency and bid comparison, but it does not yet replace the existing proposal path.
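The dry-run arrangement (both paths run side by side, only the existing one is used) can be sketched as below. This is a simplified model under stated assumptions, not the PR's dry-run service: `fetchFunc`, `compare`, `sleeper`, and `simulate` are hypothetical names, and the two paths are simulated with sleeps rather than real beacon or relay calls.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// fetchFunc is a hypothetical stand-in for either path: the legacy
// GetBeaconBlock call (baseline) or the in-node get_header call (shadow).
type fetchFunc func(ctx context.Context) error

// comparison records how long each path took and whether it failed.
type comparison struct {
	Baseline, Shadow       time.Duration
	BaselineErr, ShadowErr error
}

// timed runs f and returns its elapsed time and error.
func timed(ctx context.Context, f fetchFunc) (time.Duration, error) {
	start := time.Now()
	err := f(ctx)
	return time.Since(start), err
}

// compare runs both paths concurrently and records their timings.
// Each goroutine writes only its own pair of fields, so no lock is needed.
func compare(ctx context.Context, baseline, shadow fetchFunc) comparison {
	var c comparison
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); c.Baseline, c.BaselineErr = timed(ctx, baseline) }()
	go func() { defer wg.Done(); c.Shadow, c.ShadowErr = timed(ctx, shadow) }()
	wg.Wait()
	return c
}

// sleeper simulates a path that takes d to complete unless ctx is cancelled.
func sleeper(d time.Duration) fetchFunc {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// simulate runs one comparison with simulated path latencies in milliseconds.
func simulate(baselineMs, shadowMs int) comparison {
	return compare(context.Background(),
		sleeper(time.Duration(baselineMs)*time.Millisecond),
		sleeper(time.Duration(shadowMs)*time.Millisecond))
}

func main() {
	c := simulate(50, 1)
	fmt.Printf("baseline=%v shadow=%v gain=%v\n", c.Baseline, c.Shadow, c.Baseline-c.Shadow)
}
```

For brevity `compare` waits for both paths before returning. In a real dry-run the shadow call should be bounded by its own timeout or detached from the caller, so that it can never delay the authoritative baseline result.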
Stage results and statistics
See: Feat mev with duties prefetching (T-ssv-057) #2700 (comment)
EDIT New stats collected on May 5th 2026 on stage: Feat mev with duties prefetching (T-ssv-057) #2700 (comment)
EDIT New stats collected on May 6th 2026 on prod: Feat mev with duties prefetching (T-ssv-057) #2700 (comment)
Known limitation
This feature requires the beacon node to point its Builder API configuration at our SSV-MEV proxy/cache.
That does not compose well when a single beacon node is shared by multiple SSV clusters. Each cluster-local SSV-MEV instance only has visibility into its own validators, so no single in-node instance can serve builder bids for every validator known to the beacon node.
A separate external prefetch/cache service would be more general. Multiple clusters could register with the same service, and the beacon node could use that single service as its Builder API endpoint for all validators.
Linked: T-ssv-057