switch ssv-node storage to Pebble DB by nkryuchkov · Pull Request #2709 · ssvlabs/ssv

nkryuchkov · 2026-03-02T16:56:28Z

The PR applies Pebble DB usage for both ssv-node and exporter instead of Badger DB for ssv-node and Pebble DB for exporter.

nkryuchkov · 2026-03-02T16:56:48Z

@greptileai please review this PR

greptile-apps · 2026-03-02T17:06:19Z

Greptile Summary

This PR switches SSV node storage from BadgerDB to PebbleDB for both the node and exporter, introducing a new storage/migration package that handles zero-downtime Badger→Pebble data migration with support for context-aware interruption and resumption. The code is significantly cleaner than previous iterations — context handling, truncation recovery, temp-file atomicity, test cleanup, and t.Parallel() gaps have all been addressed.

The one remaining finding worth noting is a minor resource leak in the truncation-recovery path of openBadgerReadOnly.

Confidence Score: 5/5

Safe to merge — all previously flagged P0/P1 issues are resolved; the single remaining finding is a minor resource-leak edge case in the truncation-recovery error path.

The code has gone through multiple review rounds and the critical issues (context propagation, Badger truncation recovery, temp-file atomicity, test cleanup, double-close) are all resolved. The one new finding (recoveredDB not closed on Close() failure) requires a disk-IO error during a crash-recovery code path to trigger and is self-healing on process restart. All remaining findings are P2 quality-of-life items that don't block correctness.

storage/migration/db_migration.go — openBadgerReadOnly truncation-recovery close path (line 553)

Important Files Changed

Filename	Overview
storage/migration/db_migration.go	New file implementing Badger→Pebble migration with context-aware interruption, atomic marker files, resume support, and Badger truncation recovery. Logic is well-structured; one minor resource leak in the truncation-recovery close path.
storage/migration/db_migration_test.go	Comprehensive test suite covering all migration scenarios including resume, context cancellation, marker state, and the fast-path that skips Badger open after done marker. All tests use t.Parallel() and proper cleanup.
storage/pebble/pebble.go	Stripped-down Pebble wrapper; QuickGC is a no-op (Pebble handles compaction internally), FullGC triggers Compact. Close correctly propagates errors; cleanupPath removal is handled in TempDB.
storage/pebble/state.go	DirState correctly checks iter.Error() after First() and handles the empty-vs-error distinction. Used only transiently during startup for path resolution.
storage/pebble/testutils.go	NewTempDB creates a uniquely-named temp directory and removes it on Close(). Clean and consistent with the rest of the PR.
cli/operator/node.go	setupPebbleDB correctly closes the DB via closeOnSetupError on migration or migration-step failure, and passes cmd.Context() for graceful shutdown. Clean integration of the new migration plan/execute flow.
storage/basedb/storage.go	Removed GCInterval and Reporting fields that were Badger-specific; Options now contains only Ctx and Path. GarbageCollector interface retained as QuickGC/FullGC for Pebble's compaction hooks.
protocol/v2/ssv/testing/storage.go	newDB now accepts *testing.T and uses t.Cleanup for guaranteed resource release, fixing the goroutine/resource leak flagged in prior review.
network/p2p/testutils.go	NewTestP2pNetwork now uses t.Cleanup for database teardown (replacing the fragile goroutine-based ctx-cancel pattern). The p2pNetworkWithDB type assertion pattern is correct via interface embedding.
ssvsigner/ekm/slashing_protector_test.go	Both pebble.New calls now use &cockroachdb.Options{} (not nil) and the Phase 2 DB uses t.Cleanup. Previously flagged concerns resolved.
ssvsigner/e2e/testenv/key_managers.go	createLocalKeyManager and createRemoteKeyManager now use &cockroachdb.Options{} (not nil) with proper error-path DB cleanup.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Node Startup] --> B[ResolvePebbleDBPlan]
    B --> C{probePebbleLayout}
    C --> D[DirState canonical path]
    C --> E[DirState legacy path]
    D & E --> F{hasBadgerFiles basePath}
    F --> G{Single non-empty Pebble?}
    G -- yes --> H[resolveSingleNonEmptyPebblePathPlan]
    H --> I{badgerImportMarkerState}
    I -- done marker exists --> J[Return plan, no BadgerImportPath]
    I -- inProgress or badger files --> K[Return plan with BadgerImportPath]
    G -- no --> L{Both Pebble paths empty}
    L --> M[resolveEmptyLayoutPlan]
    M --> N[Return plan with BadgerImportPath]
    J --> O[pebble.New open DB]
    K --> O
    N --> O
    O --> P{BadgerImportPath set?}
    P -- no --> Q[applyMigrations]
    P -- yes --> R[MigrateBadgerToPebbleIfNeeded]
    R --> S{doneMarkerExists?}
    S -- yes --> T[Verify pebble non-empty, return]
    S -- no --> U[badgerDirState open Badger once]
    U --> V{Badger non-empty?}
    V -- no --> W[Skip migration]
    V -- yes --> X[runBadgerImport]
    X --> Y[Write inProgress marker]
    Y --> Z[copyBadgerToPebble batched with ctx check]
    Z --> AA[Write done marker]
    AA --> BB[Remove inProgress marker]
    BB --> Q
    T --> Q
    W --> Q
    Q --> CC[DB ready]

_{Reviews (23): Last reviewed commit: "simplify migration code" | Re-trigger Greptile}

greptile-apps

_{62 files reviewed, 15 comments}

_{Edit Code Review Agent Settings | Greptile}

codecov · 2026-03-02T17:11:44Z

Codecov Report

❌ Patch coverage is 56.99831% with 255 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.5%. Comparing base (b1ee6d1) to head (2b77861).

Files with missing lines	Patch %	Lines
storage/migration/db_migration.go	62.2%	115 Missing and 61 partials ⚠️
cli/operator/node.go	0.0%	38 Missing ⚠️
storage/pebble/state.go	0.0%	26 Missing ⚠️
network/p2p/testutils.go	70.5%	9 Missing and 1 partial ⚠️
storage/pebble/testutils.go	73.6%	3 Missing and 2 partials ⚠️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

nkryuchkov · 2026-03-03T10:44:24Z

@greptileai please review it again

nkryuchkov · 2026-03-03T11:04:53Z

@greptileai review it again

greptile-apps · 2026-03-03T11:12:17Z

Additional Comments (2)

storage/pebble/pebble.go
Batch leaked on fn error

If fn(txn) returns an error, the IndexedBatch created on line 166 is never committed or closed, leaking its memory. Pebble batches must always be explicitly closed after use. Add a defer to ensure cleanup:

func (pdb *DB) Update(fn func(basedb.Txn) error) error {
	batch := pdb.NewIndexedBatch()
	defer func() { _ = batch.Close() }()
	txn := newTxn(pdb.logger, batch)
	if err := fn(txn); err != nil {
		return err
	}
	return batch.Commit(pebble.Sync)
}

storage/pebble/pebble.go
Batch leaked on SetMany error

If txn.SetMany returns an error, the Batch created on line 175 is neither committed nor closed. Same resource leak as Update. Fix:

func (pdb *DB) SetMany(prefix []byte, n int, next func(int) (basedb.Obj, error)) error {
	batch := pdb.NewBatch()
	defer func() { _ = batch.Close() }()
	txn := newTxn(pdb.logger, batch)
	if err := txn.SetMany(prefix, n, next); err != nil {
		return err
	}
	return batch.Commit(pebble.Sync)
}

nkryuchkov · 2026-03-03T13:26:55Z

@greptileai review this again, considering all comments on this PR

nkryuchkov · 2026-03-03T17:43:27Z

@greptileai review this again, considering all comments on this PR

nkryuchkov · 2026-03-03T18:12:28Z

@greptileai review this again, considering all comments on this PR

nkryuchkov · 2026-03-03T19:08:10Z

@greptileai review this again, considering all comments on this PR

nkryuchkov · 2026-03-03T19:39:56Z

@greptileai review this again, considering all comments on this PR

y0sher · 2026-03-27T12:53:29Z

Staging Test Report (operators 93-96, BadgerDB → PebbleDB migration)

Branch: pebbledb-all
Baseline: operators 29-32 running stage (BadgerDB)

Migration: PASS

All 4 nodes migrated from BadgerDB to PebbleDB on startup:

Node	Keys Migrated	Status
ssv-node-93	93,618	Success
ssv-node-94	93,618	Success
ssv-node-95	93,624	Success
ssv-node-96	100,359	Success

All 9 schema migrations preserved. Logs confirm "using pebble db" and "migrated legacy badger db to pebble".

Memory: 55% REDUCTION

Node Type	Memory (RSS)
PebbleDB avg	~550 MB
BadgerDB avg	~1,280 MB

Node	Memory
PebbleDB node-93	530 MB
PebbleDB node-94	537 MB
PebbleDB node-95	562 MB
PebbleDB node-96	570 MB
BadgerDB node-29	1,186 MB
BadgerDB node-30	1,360 MB
BadgerDB node-31	1,269 MB
BadgerDB node-32	1,299 MB

⚠️ PebbleDB nodes were freshly restarted (~20 min uptime). Will re-check after 1h+ to confirm memory holds.

Goroutines: ~100 fewer per node

PebbleDB (93-96)	BadgerDB (29-32)
18,530 – 18,581	18,654 – 18,657

Likely from eliminated BadgerDB background GC goroutines.

Functional Correctness: PASS

Duties executing: AGGREGATOR, SYNC_COMMITTEE_CONTRIBUTION, VALIDATOR_REGISTRATION
Consensus completing in round 1-2
Event syncer tracking blocks at 100% progress
Zero storage errors
E2M epoch 84379 effectiveness: 99.20% (network-wide, no regression)

Errors: PASS (all pre-existing)

"could not publish p2p message: validation ignored" — P2P, not storage
"failed to find event by ID" — appeared on both PebbleDB AND BadgerDB nodes at same time (network-wide event)

P2P Peers: Converging

PebbleDB nodes at 3-9 peers (just restarted), baseline at 10-11. Expected to converge.

Note on path

Nodes log a WARN: "using legacy pebble database path" — configured path is db-holesky-stage-new but actual pebble path is db-holesky-stage-new-pebble. Not an error, just informational.

Next Steps

Re-check memory after 1h+ uptime to confirm 55% reduction holds
Re-check P2P peers have converged to 10+
Consider longer soak (24h+) before mainnet

Tested via SSV Scout

olegshmuelov

LGTM!

One thing for release notes @y0sher: the migration requires roughly 2x the current database size in free disk space (both Badger and Pebble coexist during and after migration). Operators with large DBs should be warned before upgrading. The code handles ENOSPC gracefully (clear error, resumable), but better to warn proactively than have operators discover it at runtime.

iurii-ssv

Looks quite comprehensive, left some minor suggestions.

Although I'm struggling a bit to get the full picture, it maybe feels like the intermediary PebbleDBPlan concept complicates things more than helps (or maybe it just could use some restructuring) ... I get the idea to separate the "parsing of the current state/plan" (represented by PebbleDBPlan) from the "data-migration execution" ... it's just a bit hard to map it onto a simple list of scenarios I would expect us to have to implement such a migration, which would be something like:

pebble exists, badger does not -> just use pebble
pebble exists, badger exists -> resolve which one is the source of truth and finish the migration if needed
pebble doesn't exist, badger exists -> run the migration
...

maybe the code in migrateBadgerToPebbleIfNeeded func could be restructured a bit & supplied with step-by-step comments to clarify where/when each of those cases (from my list above) takes place - currently it kinda mixed a lot of things together making it hard to see what's going on

* pebble-db: define temporary DB * remane

nkryuchkov · 2026-04-10T17:10:32Z

@greptileai please review the latest changes

# Conflicts: # cli/operator/node_test.go # identity/store_test.go

# Conflicts: # network/p2p/testutils.go

# Conflicts: # ssvsigner/e2e/testenv/environment.go # ssvsigner/e2e/testenv/key_managers.go # ssvsigner/ekm/local_key_manager_test.go # ssvsigner/ekm/signer_storage_test.go # ssvsigner/ekm/slashing_protector_test.go # ssvsigner/go.mod # ssvsigner/go.sum

iurii-ssv

Looks much cleaner! Left some minor suggestions.

Also, we should probably create a task to "remove the Badger directory once the done marker has been present for, say, 24 hours" ?

iurii-ssv · 2026-04-21T16:31:11Z

+func ResolveDBLayout(basePath string) (DBLayout, error) {
+	layout, err := probePebbleLayout(basePath)
+	if err != nil {
+		return DBLayout{}, err
+	}
+	if err := validateEmptyPebbleDoneMarkers(basePath, layout); err != nil {
+		return DBLayout{}, err
+	}
+	badgerFilesPresent, err := hasBadgerFiles(basePath)
+	if err != nil {
+		return DBLayout{}, fmt.Errorf("check badger path %q: %w", basePath, err)
+	}


Badger opened on every post-migration startup. Even after the done marker is written, badgerDirState still stats the legacy directory and (depending on branch) opens it read-only. For already-migrated nodes this is pure IO waste — the done marker alone should short-circuit the layout resolution. Consider returning early in ResolveDBLayout the moment the done marker is observed.

@iurii-ssv I'm not sure I understood the part about badgerDirState, can you please clarify it?

iurii-ssv · 2026-04-21T16:32:31Z

+	}
+
+	if layout.BadgerImportPath != "" {
+		migrated, migratedKeys, err := storagemigration.MigrateBadgerToPebbleIfNeeded(


During import, both Badger and Pebble directories coexist. This is mentioned in the PR description but not in user-facing startup logs; operators on tight disks will discover this at the worst moment. Suggest a preflight statfs check at copyBadgerToPebble entry with a clear error if free space < badger size × 1.2.

Probably a good idea to add a check like this.

@iurii-ssv I had this check, but I decided to remove it because it's not reliable and the code is already complicated. Anyway, if migration fails due to a lack of space, it should interrupt the migration

Why do you think it's not reliable ?

I think it makes sense to check the available space not only to make sure the migration succeeds, but also to ensure the SSV node is able to keep functioning after the migration too (e.g if we have 20 GB of space on disk with Badger occupying 10GB, and newly created Pebble occupying 9.5GB after migration is done ... we have 0.5GB left to work with - which might result in a failure shortly after, even before the Operator can spot & resolve this manually), right ?

I don't think we need to do the "space check" on every migration as a habit (although we maybe could add this as a generic sanity-check on startup? So we fail to start the node cleanly, instead of letting it run out of disk-space), but for this one specifically seems like it might prevent some pain for Operators.

I don't think it's reliable because we don't know how much space the Pebble DB will take. If we check for 1.2x of Badger, but Pebble uses 1.5x, then the check will be useless. If we check for 1.2x but Pebble uses 0.5x, the check could prevent migration even with enough space. Even if the size is roughly the same, but a disk has 15% extra space, not 20%, it might be fine, because we don't need Pebble DB after successful migration.

We could probably estimate the relative size (Badger vs Pebble) by running this migration once on some operator from our stage env (so we'll know what *.*x value is reasonable to use).

I mean, I agree that it's a heuristic/judgement that might result in false-positives ... I just think it's worthwhile even so.

iurii-ssv · 2026-04-21T16:34:30Z

+	tmpPath := path + ".tmp"
+	// #nosec G304 -- marker path is derived from configured DB path plus fixed marker filenames.
+	tempFile, err := os.OpenFile(tmpPath, os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0o600)
+	if err != nil {
+		return fmt.Errorf("open marker temp file: %w", err)
+	}


Nit: writeMarker does not os.Remove the temp file if fsync or rename fails partway, leaving a stray .tmp next to the marker. defer os.Remove(tmp) after the initial create, cleared on success, would be robust.

What paths do you mean where it's not removed?

iurii-ssv · 2026-04-22T12:47:00Z

+	if !inProgressMarkerExists {
+		if err := writeBadgerImportInProgressMarker(pebblePath, badgerPath); err != nil {
+			return false, 0, wrapNoSpaceImportError(err, badgerPath, pebblePath)
+		}
+	}


I'm not sure I understand what the purpose of InProgressMarker is:

if it is it to prevent concurrent migrations from running (multiple SSV nodes accessing the same DB) ... it doesn't seem like it works for that

if it is to "continue from where we left off", it doesn't track where we stopped (and it's probably an overkill for this anyway)

so it kinda seems like we could just use the DoneMarker alone and get the same results ?

I'm not sure DoneMarker alone would be enough here. Historically we could already have both DBs populated without any migration marker at all, because Pebble was used under cfg.DBOptions.Path + "-pebble" while Badger stayed at cfg.DBOptions.Path. InProgressMarker is what lets us distinguish that kind of overlap from an actual interrupted migration that is safe to resume.

momosh-ssv

One more that didn't fit inline (DropPrefix sits outside the diff hunks):

storage/pebble/pebble.go DropPrefix — could this lead to keys persisting after the call returns? Seems that the Badger version ran inside a transaction, while the Pebble version here uses a non-indexed batch plus a live-DB iterator, so writes that race with the loop can persist after DropPrefix returns. Worth considering db.DeleteRange(prefix, upperBound, pebble.Sync) — single atomic operation, and faster too.

momosh-ssv · 2026-04-29T12:40:21Z


-	if err := applyMigrations(logger, beaconConfig, operatorPrivKey, db, dbPath); err != nil {
-		return nil, fmt.Errorf("apply migrations: %w", err)
+	if err := applyMigrations(logger, beaconConfig, operatorPrivKey, db, layout.PebblePath); err != nil {


What do you think about deferring the done-marker write until after applyMigrations succeeds? I fear that if applyMigrations crashes mid-run on a fresh import, the next start sees the done marker present plus a half-transformed Pebble. The resolver then trusts Pebble as the source of truth, and recovery depends on each schema migration being individually idempotent — which is a per-author guarantee rather than a structural one. Either move the done-marker write after applyMigrations, or add a separate schema-applied sentinel and refuse to start when the done marker exists but the schema sentinel doesn't.

momosh-ssv · 2026-04-29T12:40:21Z

+	return marker.BadgerDirPath == "" || samePath(marker.BadgerDirPath, badgerPath), nil
+}
+
+func samePath(left string, right string) bool {


Could this lead to spurious re-imports on operator setups that bind-mount or symlink the data dir? Seems that samePath only does Clean+Abs but not symlink resolution, so e.g. /var/lib/ssv -> /mnt/data/ssv will produce a different absolute path on subsequent runs, markerExistsForSource returns false, and the migration tries to re-import on top of an already-imported Pebble. Maybe worth using filepath.EvalSymlinks with a fallback to Clean+Abs when EvalSymlinks fails (e.g. path doesn't exist yet).

momosh-ssv · 2026-04-29T12:40:21Z

-	dbPath := cfg.DBOptions.Path + "-pebble" // opinionated approach to avoid corrupting old db location
-
-	db, err := pebble.New(logger, dbPath, &cockroachdb.Options{})
+	db, err := pebble.New(logger, layout.PebblePath, &cockroachdb.Options{})


Any reason we're not wiring zap into Pebble's Options here? I fear this is an observability regression vs Badger — Pebble's WriteStallBegin/End, DiskSlow, compaction stalls and manifest recovery events would otherwise go nowhere, and migrating the operator-facing data layer makes those signals more important, not less. Could we add a pebble.Logger adapter over zap on Options.Logger, an EventListener for stall/slow events, and expose db.Metrics() via the existing metrics handler?

switch ssv-node storage to Pebble DB

48e58c2

greptile-apps Bot reviewed Mar 2, 2026

View reviewed changes

nkryuchkov added 4 commits March 2, 2026 20:21

fix remaining linter issues

4bd6464

attempt to fix migration error

08e0b6d

minor refactoring

fb2bead

delete redundant Badger DB usages

b89c70e