From dc44135015335d2c67a55c06888dd0eeb42ddc1a Mon Sep 17 00:00:00 2001 From: nksazonov Date: Fri, 5 Jun 2026 16:09:31 +0200 Subject: [PATCH 01/23] docs(nitronode): reorg fix spec --- nitronode/reorg-fix-spec.md | 285 ++++++++++++++++++++++++++++++++++++ 1 file changed, 285 insertions(+) create mode 100644 nitronode/reorg-fix-spec.md diff --git a/nitronode/reorg-fix-spec.md b/nitronode/reorg-fix-spec.md new file mode 100644 index 000000000..5dcf6e864 --- /dev/null +++ b/nitronode/reorg-fix-spec.md @@ -0,0 +1,285 @@ +# Reorg Attack Fix — Confirmation Window Specification + +## 1. Risk + +The Nitronode event listener credits a user's off-chain balance the moment it observes a deposit event on-chain. If the block containing that deposit is subsequently removed from the canonical chain (a "reorganisation"), the off-chain credit persists while the on-chain deposit no longer exists. Because the credited balance can be transferred to a receiver before the node has any way to detect the reorg, the node ends up honoring an off-chain state transition that is permanently unbacked. + +The worst-case outcome is a net loss of node liquidity equal to the sum of all deposit amounts that were credited during a reorg window and successfully drained to attacker-controlled receivers before the reorg was detected. There is no recovery path for the node once a signed receiver state exists. + +This risk is meaningful on any chain where head-level reorgs occur naturally or can be induced. On modern fast-finality chains (BNB, Polygon post-Rio, Avalanche) the residual probability is very low. On Ethereum L1, depth-1 reorgs are routine and cryptoeconomic finality takes ~12.8 minutes. + +--- + +## 2. Solution Overview + +A **per-chain confirmation window** is introduced between raw event delivery and handler invocation. When the listener observes any event on chain C: + +- It does **not** invoke the handler immediately. 
+- It waits for `confirmation_delay_sec` seconds (configured per chain in `blockchains.yaml`). +- If no reorg of the event's block occurs during that window, the handler is invoked normally. +- If the event's block is reorged out (`removed: true` log arrives), the pending invocation is cancelled with no side effects. +- If the reorged transaction is re-included (the same event appears again), the confirmation window restarts from zero. + +The delay applies uniformly to **all** events, not only deposit-class ones. Selective gating would require the component to understand event semantics and introduce ordering hazards when events for different channels arrive interleaved — for example, a deposit event and a challenge event on separate channels could fire their handlers out of original arrival order if only the deposit is delayed. Uniform delay preserves the relative order of all events as they arrived from the chain while adding a single, predictable latency layer. + +### 2.1 Residual risk and the finality trade-off + +The confirmation window eliminates the reorg risk only when `confirmation_delay_sec` is set to or above the chain's cryptoeconomic finality time. For the representative values in §3: + +- **Ethereum at 780s (~13 min):** matches Casper FFG hard finality. Reorging past this point requires ≥1/3 of total stake to be slashed. No residual risk. +- **Polygon at 10s, BNB at 5s:** exceeds the empirical reorg tail depth. Residual risk is negligible but not cryptoeconomically eliminated. +- **Ethereum at 36s (3 blocks, "quick" finality):** P(reorg depth ≥ 4) ≈ 10⁻⁵–10⁻⁶ per event. Residual risk is real. + +When `confirmation_delay_sec` is set *below* the chain's finality time, **this specification acknowledges a residual risk**: it is possible — with low but non-zero probability — that an event passes the gate, the reactor commits it to the database, and the block containing that event is subsequently reorged out by a reorg deeper than the gate window. 
+ +When this occurs, the committed state (balance credit, channel open) has no corresponding on-chain event in the canonical chain. If the transaction is re-mined in the new canonical block, the reactor's idempotency guard (§6.6) handles the re-delivery cleanly. If it is not re-mined, the DB retains stale state that can only be partially corrected on the next node restart via the reconciliation walk (§4.4). There is no automated rollback; the exposure scales with the deposit value and is bounded by the probability of deep reorgs on the target chain. + +Operators who cannot accept this residual exposure should set `confirmation_delay_sec` to the chain's hard-finality time (Ethereum: 780s; Polygon: `finalized` tag resolves to ~5s; L2s: `finalized` maps to L1 Casper FFG at ~13 min). The gate's detection mechanisms (§6.5, §6.6) provide observability when the residual-risk scenario occurs. + +--- + +## 3. Configuration + +A new `confirmation_delay_sec` field is added per chain in `blockchains.yaml`. Representative values: + +```yaml +chains: + - id: 1 # Ethereum mainnet + confirmation_delay_sec: 780 # ~13 min — Casper FFG hard finality + - id: 137 # Polygon PoS (post-Heimdall v2 / Rio) + confirmation_delay_sec: 10 # 5 blocks × ~2s; empirical reorg tail is sub-10s + - id: 56 # BNB Smart Chain + confirmation_delay_sec: 5 # fast-finality, ~3-4 blocks + - id: 42161 # Arbitrum One + confirmation_delay_sec: 120 # L2 `safe` tag (L1-posted batch), ~1-2 min + - id: 8453 # Base + confirmation_delay_sec: 120 # same L2 `safe` semantics +``` + +`confirmation_delay_sec: 0` disables the gate — events are processed immediately. Appropriate for BFT single-slot chains where the node operator accepts the negligible residual risk, or for chains using a finality-tag subscription rather than a block-count gate. + +--- + +## 4. Confirmation Window Behavior + +### 4.1 Normal path + +When a log `E` arrives (without `Removed: true`): + +1. 
Record the event under a key of `(txHash, blockHash, logIndex)`. +2. Start an in-memory timer for the chain's `confirmation_delay_sec`. +3. When the timer fires, invoke the event handler. + +### 4.2 Reorg path + +If a log with `Removed: true` arrives for the same `(txHash, blockHash, logIndex)` before the timer fires: + +- Cancel the pending timer. +- Do not invoke the handler — no state change occurs. +- The listener remains active. When the same transaction is re-included, its event will be delivered again (without `Removed: true`) and the gate starts a fresh window under the new block's key. + +### 4.3 Out-of-order delivery + +The re-added event (no `Removed: true`, new block) may arrive at the listener before the corresponding `Removed: true` log for the old block. Because the re-added event is in a different block, it carries a different `blockHash` and therefore a different key. The two events are handled independently: + +- The re-added event starts a fresh timer under its own key. +- The `Removed: true` log, when it arrives, looks up the OLD block's key — which has no pending timer (it was never created, or it already expired) — and performs a no-op. + +This means out-of-order delivery requires no special case beyond the normal path: the block-scoped key prevents the remove from accidentally cancelling the re-added event's timer. + +- On a `Removed: true` log for a key that **has no pending timer**: no-op. The event either confirmed and was already processed (reorg arrived after the window), or belongs to a different block whose timer was never started (re-add arrived first, has its own key). + +> Repeated reorgs of the same transaction are theoretically possible but imply a chain-level consensus failure. The gate's cancel/restart cycle handles each naturally; no special cap is needed. 
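The block-scoped key behaviour described in §4.1 to §4.3 can be sketched as follows. This is a minimal illustration and not the gate's actual implementation: `pendingKey`, `gateSketch`, and `outOfOrderDemo` are hypothetical names, plain strings stand in for `common.Hash`, and a boolean stands in for the per-key timer.

```go
package main

import "fmt"

// pendingKey mirrors the (txHash, blockHash, logIndex) key from §4.1.
// Strings replace common.Hash so the sketch has no external dependencies.
type pendingKey struct {
	txHash    string
	blockHash string
	logIndex  uint
}

// gateSketch records which keys currently have a pending confirmation timer.
// A real gate would store a timer or an arrival timestamp per key.
type gateSketch struct {
	pending map[pendingKey]bool
}

func newGateSketch() *gateSketch {
	return &gateSketch{pending: make(map[pendingKey]bool)}
}

// onLog applies the normal, reorg, and no-op rules and reports the outcome.
func (g *gateSketch) onLog(k pendingKey, removed bool) string {
	if !removed {
		g.pending[k] = true
		return "timer started"
	}
	if g.pending[k] {
		delete(g.pending, k)
		return "timer cancelled"
	}
	return "no-op"
}

// outOfOrderDemo replays the §4.3 scenario: the re-added event (block B)
// arrives before the removal signal for the original event (block A).
func outOfOrderDemo() ([]string, int) {
	g := newGateSketch()
	oldKey := pendingKey{txHash: "0xtx", blockHash: "0xblockA", logIndex: 0}
	newKey := pendingKey{txHash: "0xtx", blockHash: "0xblockB", logIndex: 0}
	outcomes := []string{
		g.onLog(oldKey, false), // original event in block A
		g.onLog(newKey, false), // re-add in block B arrives first
		g.onLog(oldKey, true),  // late removal for block A
		g.onLog(oldKey, true),  // duplicate removal: nothing left to cancel
	}
	return outcomes, len(g.pending)
}

func main() {
	outcomes, pending := outOfOrderDemo()
	fmt.Println(outcomes, pending)
}
```

Because the removal carries block A's hash, it can only ever touch block A's entry; the timer for block B survives regardless of delivery order.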
+ +### 4.4 Startup and reconciliation + +#### Prerequisites + +Before the reconciliation logic described below can function, `block_hash` must be added as a column to `contract_events` and to the `core.BlockchainEvent` struct. The value is available in `types.Log.BlockHash` at the time the gate calls the reactor. Without this column, reorg detection in steps 2–4 is not possible. + +#### Definition: latest processed block + +The **latest processed block** for a chain is the highest block number at which the reactor successfully committed at least one event to the database — identical to the listener's existing startup cursor (`MAX(block_number)` in `contract_events` for this `blockchain_id` and contract address, computed by `GetLatestContractEventBlockNumber`). This is distinct from the highest block the listener ever *saw*: the listener may have seen many blocks that contained no relevant events and therefore left no `contract_events` rows. + +#### Reconciliation steps + +On startup, for each chain, after the `block_hash` migration has been applied: + +1. Query `contract_events` for the latest committed event: `latestBlockNum = MAX(block_number)`, `latestBlockHash = block_hash` at that row. If no rows exist, start the scan from the chain's configured genesis / start block and skip to step 5. +2. Call `eth_getBlockByHash(latestBlockHash)` on the chain's RPC. + - If the response is non-null: `latestBlockHash` is still in the canonical chain — no reorg above this block. Proceed to step 4. + - If the response is null: the block has been reorged out. Proceed to step 3. +3. **Common-ancestor walk using stored block hashes:** query `contract_events` for the next-older distinct `block_hash` (the highest `block_number` strictly below the current candidate). Repeat step 2 with this hash. Continue until a block hash is found that is still in the canonical chain, or until no older stored hash exists (treat genesis as the fallback). This height is the **common ancestor**. 
+ + > **Why walk stored hashes, not block numbers?** In normal operation most blocks contain no `ChannelHub` events, so `contract_events` has no row for them. A block-number walk would find nothing to compare at event-gap heights and could miss a reorg that occurred entirely within such a gap. Walking by stored block hashes ensures every comparison is against a block the reactor actually processed. + +4. Set the scan start to `commonAncestorBlockNum`. Events between `commonAncestorBlockNum` and `latestBlockNum` that came from the reorged fork are still present in the DB. The reactor has no rollback mechanism for those rows — the re-scan below will re-apply canonical events over them where the transaction was re-mined (idempotent), and leave the orphaned DB state in place where the transaction was not re-mined (residual risk; see §2.1). State-setting operations (`UpdateChannel`, `RefreshUserEnforcedBalance`) will overwrite with canonical values for re-mined events; rows from dropped transactions remain as stale data with no automated cleanup. +5. Start the event scan from `commonAncestorBlockNum` (or genesis if step 1 found no rows). Feed all replayed events **directly to the reactor, bypassing the gate entirely**. Historical events come from `eth_getLogs` and are, by definition, already in the current canonical chain. The common-ancestor walk in steps 2–3 additionally confirms that the starting block is canonical. There is no incremental reorg risk to guard against for these events, and applying a full confirmation delay would only stall the node on restart for no safety benefit. The gate applies exclusively to live WebSocket events; any reorgs of very-recent blocks during replay are handled by the buffered live-subscription signals processed immediately after replay completes (step 7). +6. 
The reactor is idempotent for replayed events: `HandleHomeChannelCreated` has an explicit early-return guard when the channel is already open; `HandleHomeChannelCheckpointed` and `RefreshUserEnforcedBalance` use set-semantics (not accumulation) and recompute from the latest DB state. `StoreContractEvent` is called last inside the DB transaction and enforces a unique constraint on `(transaction_hash, log_index, blockchain_id)`. If a duplicate is inserted, Postgres returns a constraint-violation error, causing the entire transaction (including all state mutations in the same `useStoreInTx` call) to roll back. The reactor therefore cannot double-apply state changes for an event it has already committed. +7. Historical log queries (`eth_getLogs`) return only canonical chain events — there are no `Removed: true` signals during replay. The gate operates in timer-only mode during reconciliation. Removal signals from the live WebSocket subscription that arrive during the replay phase are buffered in the listener's `currentCh` and processed only after the historical replay phase completes. + +--- + +## 5. Scope + +The delay applies to **all** events emitted by the `ChannelHub` contract on a given chain. No filtering by event type is performed inside the gate. + +> **Note:** `ChannelCreated` (`handleHomeChannelCreated`) calls `RefreshUserEnforcedBalance`. Verify whether the initial channel state carries a non-zero deposit; if it does, the uniform delay already protects it — no special casing is needed. + +--- + +## 6. Implementation Notes + +### 6.1 Component placement and wiring + +The `ConfirmationGate` is a thin in-memory component that sits between the raw log stream (`listener.go`) and the `ChannelHubReactor`. + +**Existing wiring** (`nitronode/main.go:127-129`): + +```go +reactor := evm.NewChannelHubReactor(b.ID, ...) +l := evm.NewListener(..., reactor.HandleEvent, ...) 
+``` + +The listener accepts a handler of type `HandleEvent func(ctx context.Context, eventLog types.Log) error`. The gate exposes the same signature and is inserted between the two: + +```go +reactor := evm.NewChannelHubReactor(b.ID, ...) +gate := evm.NewConfirmationGate(confirmationDelay, reactor.HandleEvent) +l := evm.NewListener(..., gate.HandleEvent, ...) +``` + +The reactor itself does not change. All the listener's existing logic — subscription management, cursor tracking, reconnection, historical replay — is unaffected. + +**Handling `Removed: true` logs:** currently `listener.go:289-294` skips removed logs before they reach the handler. This skip must be moved: the listener should forward removed logs to `gate.HandleEvent` (they still carry the `Removed` flag on `types.Log`), and the gate alone decides whether to cancel a pending timer or ignore the signal. The reactor never sees a `Removed: true` log. + +### 6.2 Event identity for removal scanning + +The Listener delivers events in strict block order, so the queue is naturally ordered by arrival time. When a `Removed: true` log arrives in the Pusher, it scans the queue for the **first** entry matching `(txHash, logIndex)` and deletes it. + +`blockHash` is deliberately excluded from the removal scan key. Because the queue is FIFO and reorgs produce the re-add event *after* the original event, the original always sits earlier in the queue than any re-add. Scanning for `(txHash, logIndex)` and deleting the first match therefore always targets the original entry and leaves any re-add untouched. + +A single transaction can emit multiple events for the same `txHash` (e.g., two `ChannelDeposited` logs in a batch open). `logIndex` disambiguates these; it is unique per log within a block and is present in both the live event and its corresponding `Removed: true` log. 
+ +`blockHash` is still present in each `types.Log` stored in the queue and is used by: +- The `recentlyForwarded` detection map (§6.5) — keyed by `(txHash, blockHash, logIndex)` to identify which specific occurrence was forwarded. +- `StoreContractEvent` in the reactor — stored in `contract_events` for the reconciliation walk (§4.4). + +### 6.3 Two-goroutine design + +**Data structure:** a FIFO queue of `(types.Log, arrivedAt time.Time)`. Naturally ordered by arrival time because the Listener delivers events in strict block order. + +```go +type queueEntry struct { + log types.Log + arrivedAt time.Time +} + +type eventKey struct { // used for removal scan + txHash common.Hash + logIndex uint +} + +type forwardedKey struct { // used for post-gate reorg detection + txHash common.Hash + blockHash common.Hash + logIndex uint +} + +type ConfirmationGate struct { + delay time.Duration + chainID uint64 + handler HandleEvent + queue []queueEntry // protected by mu + recentlyForwarded map[forwardedKey]time.Time // protected by mu; TTL = 2× delay + mu sync.Mutex +} +``` + +--- + +**Goroutine 1 — Pusher** (driven by the existing Listener; implements the `HandleEvent` signature) + +Receives `types.Log` from the Listener. On each event: + +- If `Removed: true` — scan the queue for the first entry matching `(txHash, logIndex)` and delete it. If no match found, check `recentlyForwarded` for a post-gate reorg signal (see §6.5). +- Otherwise — append `(log, time.Now())` to the queue tail. + +No expiration check, no forwarding. Push only. + +--- + +**Goroutine 2 — Poller** + +Wakes every ~50 ms on a ticker. Each wake: + +- Inspect the queue front. +- While `front.arrivedAt + delay ≤ now`: pop the entry, record `forwardedKey{txHash, blockHash, logIndex}` in `recentlyForwarded` with the current timestamp, then forward the log to the Reactor outside the lock. +- Stop as soon as the front is not yet ready — everything behind it is newer. +- Sleep until next tick. 
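The Poller's drain rule above can be sketched in isolation. This is a simplified illustration under stated assumptions: `entry`, `drainReady`, and `drainDemo` are hypothetical names, the queue holds string IDs instead of `types.Log`, and locking and the handler call are left out.

```go
package main

import (
	"fmt"
	"time"
)

// entry is a stand-in for the gate's queue entry; id replaces types.Log.
type entry struct {
	id        string
	arrivedAt time.Time
}

// drainReady splits a FIFO queue into the prefix whose delay has elapsed
// (arrivedAt + delay <= now) and the remainder. It stops at the first entry
// that is not ready, since everything behind it is assumed to be newer.
// The returned slices share the input's backing array.
func drainReady(queue []entry, now time.Time, delay time.Duration) (ready, rest []entry) {
	i := 0
	for i < len(queue) && !queue[i].arrivedAt.Add(delay).After(now) {
		i++
	}
	return queue[:i], queue[i:]
}

// drainDemo runs one Poller wake over three entries with a 10 s delay.
func drainDemo() (int, int) {
	now := time.Unix(1000, 0)
	delay := 10 * time.Second
	queue := []entry{
		{id: "a", arrivedAt: now.Add(-15 * time.Second)}, // past the window
		{id: "b", arrivedAt: now.Add(-10 * time.Second)}, // exactly at the boundary: ready
		{id: "c", arrivedAt: now.Add(-3 * time.Second)},  // still waiting
	}
	ready, rest := drainReady(queue, now, delay)
	return len(ready), len(rest)
}

func main() {
	forwarded, waiting := drainDemo()
	fmt.Println(forwarded, waiting)
}
```

In a real gate the split would presumably happen under `mu`, with the ready entries handed to the Reactor only after the lock is released, as the Properties table requires.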
+ +No event handling, no Listener awareness. Drain-and-forward only. + +--- + +**Properties** + +| Property | Detail | +| --- | --- | +| Zero RPC calls in the gate | Delay is a pure `time.Duration`; no chain queries | +| Chain-agnostic | `confirmationDelay` is the only chain-specific input | +| Forward latency after window | At most one tick (~50 ms) | +| Reorg within window | Pusher's scan removes the entry; Reactor never sees the event | +| Reorg deeper than window | Rare; Reactor-level idempotency (§6.6) handles re-delivered events | +| Concurrency | Both goroutines share `mu`; Reactor is called outside the lock | +| Shutdown | Poller exits on `ctx.Done()`; entries still in queue are discarded (safe — they were never forwarded) | + +### 6.4 Exposing `confirmation_delay_secs` via API + +Clients need to know the confirmation delay for each chain so they can display the correct waiting time to users after submitting a deposit. The best existing candidate is **`node.v1.GetConfig`**, which already returns a per-chain `BlockchainInfoV1` object. + +Files to update: + +- `pkg/rpc/types.go` — add `ConfirmationDelaySecs uint64` to `BlockchainInfoV1`. +- `nitronode/api/node_v1/utils.go` — populate the new field in `mapBlockchainV1` from the chain's loaded config. +- `pkg/core/types.go` (or wherever `core.Blockchain` is defined) — add `ConfirmationDelaySec uint64` so the value flows from `blockchains.yaml` through config loading into the API handler. + +No new endpoint is needed. The field appears alongside existing per-chain fields (contract addresses, asset list, block time) and is read-only from the client's perspective. + +### 6.5 Post-gate reorg detection in the gate + +The `recentlyForwarded` map (already in the `ConfirmationGate` struct, §6.3) provides detection without any DB access. The **Poller** writes to it each time it forwards an event; the **Pusher** reads from it when a `Removed: true` log arrives and the queue scan finds no matching entry. 
+ +When `Removed: true` arrives in the Pusher: + +- **Match found in queue** → normal removal; no log. +- **No match in queue, but `forwardedKey{txHash, blockHash, logIndex}` found in `recentlyForwarded`** → the event was already forwarded to the Reactor and its block has now been reorged out. Log at **`WARN`** with `txHash`, `blockHash`, `logIndex`, `chainID`. Remove the entry. +- **Match in neither** → log at `DEBUG` ("removal for unknown/stale event" — predates the current run or arrived long after the TTL). + +`recentlyForwarded` entries are evicted lazily: when the Pusher reads an entry, it checks `time.Since(forwardedAt) > 2 × delay` and discards stale entries on access. The map stays small because post-gate reorgs are rare and `Removed: true` arrives within one or two block-times of the reorg. No separate cleanup goroutine is needed. + +### 6.6 Reactor defense-in-depth: skip re-delivered events + +When the gate lets a re-added event through (same tx re-mined in a new block after a reorg, confirmed by a fresh timer), the reactor would attempt to process an event it has already committed. Currently this surfaces as a DB constraint-violation error and a full transaction rollback — noisy and potentially confusing. + +Add a new method to `ChannelHubReactorStore`: + +```go +// IsContractEventProcessed reports whether an event identified by +// (txHash, logIndex, blockchainID) has already been committed, +// regardless of which block it appeared in. +IsContractEventProcessed(txHash string, logIndex uint, blockchainID uint64) (bool, error) +``` + +At the top of `HandleEvent`, before entering `useStoreInTx`, call this method. If the event is already committed, log at **`INFO`** ("skipping re-delivered event, already committed") and return `nil` immediately. No transaction is opened; no state is touched. + +The existing unique constraint on `(transaction_hash, log_index, blockchain_id)` in `contract_events` remains as the definitive safety net. 
This pre-check converts the constraint-violation rollback path into a clean, explicit, logged early exit that also serves as the idempotency guard for the reconciliation re-scan path. + +Together, §6.5 and §6.6 produce two complementary log signals: + +| Signal | Source | Level | Meaning | +| --- | --- | --- | --- | +| "post-gate reorg detected for event X" | Gate | WARN | Committed block was reorged; residual-risk scenario is active | +| "skipping re-delivered event X" | Reactor | INFO | Same tx re-mined; reactor correctly skips it | + +If the operator sees the WARN but never the INFO, the transaction was not re-mined — the stale DB state from §2.1 is in effect. From 7b1097633c377ed58ad8fbf01061df9fa26001f2 Mon Sep 17 00:00:00 2001 From: nksazonov Date: Mon, 8 Jun 2026 12:33:02 +0200 Subject: [PATCH 02/23] feat(nitronode): add ConfirmationGate reorg protection --- nitronode/main.go | 19 +- nitronode/reorg-fix-spec.md | 42 ++ nitronode/store/memory/blockchain_config.go | 4 + nitronode/store/memory/memory_store.go | 9 +- pkg/blockchain/evm/confirmation_gate.go | 278 ++++++++++ pkg/blockchain/evm/confirmation_gate_test.go | 529 +++++++++++++++++++ pkg/blockchain/evm/listener.go | 53 +- pkg/blockchain/evm/listener_test.go | 63 +++ pkg/core/types.go | 1 + 9 files changed, 962 insertions(+), 36 deletions(-) create mode 100644 pkg/blockchain/evm/confirmation_gate.go create mode 100644 pkg/blockchain/evm/confirmation_gate_test.go diff --git a/nitronode/main.go b/nitronode/main.go index 1735cbe0f..62d101c4f 100644 --- a/nitronode/main.go +++ b/nitronode/main.go @@ -25,6 +25,8 @@ import ( "github.com/layer-3/nitrolite/pkg/log" ) +const blockTimestampFetchTimeout = 10 * time.Second + func main() { if len(os.Args) > 1 && os.Args[1] == "stress-test" { os.Exit(stress.Run(os.Args[2:])) @@ -121,7 +123,22 @@ func main() { reactor := evm.NewChannelHubReactor(b.ID, bb.StateSigner.PublicKey().Address().String(), eventHandlerService, bb.MemoryStore, useCHRStoreInTx) 
reactor.SetOnEventProcessed(bb.RuntimeMetrics.IncBlockchainEvent) - l := evm.NewListener(common.HexToAddress(b.ChannelHubAddress), client, b.ID, b.BlockStep, logger, reactor.HandleEvent, bb.DbStore) + + blockTimestampFetcher := func(blockHash common.Hash) (time.Time, error) { + fetchCtx, cancel := context.WithTimeout(context.Background(), blockTimestampFetchTimeout) + defer cancel() + header, err := client.HeaderByHash(fetchCtx, blockHash) + if err != nil { + return time.Time{}, err + } + return time.Unix(int64(header.Time), 0), nil + } + + confirmationDelay := time.Duration(b.ConfirmationDelaySecs) * time.Second + gate := evm.NewConfirmationGate(confirmationDelay, b.ID, reactor.HandleEvent, blockTimestampFetcher, logger) + gate.Start(blockchainCtx) + + l := evm.NewListener(common.HexToAddress(b.ChannelHubAddress), client, b.ID, b.BlockStep, logger, gate.HandleEvent, bb.DbStore) l.Listen(blockchainCtx, func(err error) { if err != nil { logger.Fatal("blockchain listener stopped", "error", err, "blockchainID", b.ID) diff --git a/nitronode/reorg-fix-spec.md b/nitronode/reorg-fix-spec.md index 5dcf6e864..c5e13eb96 100644 --- a/nitronode/reorg-fix-spec.md +++ b/nitronode/reorg-fix-spec.md @@ -283,3 +283,45 @@ Together, §6.5 and §6.6 produce two complementary log signals: | "skipping re-delivered event X" | Reactor | INFO | Same tx re-mined; reactor correctly skips it | If the operator sees the WARN but never the INFO, the transaction was not re-mined — the stale DB state from §2.1 is in effect. + +### 6.7 Block timestamp cache + +#### Purpose + +The gate uses the **block timestamp** of each event as its `arrivedAt` reference rather than wall-clock time. This ensures that events replayed from historical blocks (whose timestamps are minutes or hours in the past) are forwarded immediately on the first Poller tick, without waiting for the full confirmation delay to elapse again. + +Fetching the block timestamp requires one `eth_getBlockByHash` RPC call per block. 
+A single block can produce multiple events (e.g. two `ChannelDeposited` logs in a batch open). The **block timestamp cache** avoids the redundant RPC calls: the first event from a block fetches and stores the timestamp; subsequent events from the same block read it from the cache. + +#### Data structure + +```go +blockTimestampCache map[common.Hash]time.Time // protected by mu; evicted by Poller +``` + +The cache is keyed by `blockHash`. Values are written once (on the first event from a block) and are never modified. + +#### Eviction + +Without eviction the cache would grow monotonically: every block that produces at least one relevant event would add a permanent entry. Over the lifetime of a long-running node, that would be an unbounded memory leak. + +To prevent this, entries are evicted by the Poller in the same sweep pass that cleans `recentlyForwarded`. An entry is safe to remove once: + +> `now − blockTimestamp > recentMultiplier × delay` + +At that age, every event from the block has either been forwarded (within `delay` of its `arrivedAt`) or cancelled by a `Removed: true` signal. No new event from the same block can arrive after that point (the listener delivers events in ascending block order). The cached timestamp therefore serves no further purpose. + +**Eviction is performed in `poll()`, under the mutex, after the `recentlyForwarded` sweep:** + +```go +for bh, ts := range g.blockTimestampCache { + if now.Sub(ts) > recentMultiplier*g.delay { + delete(g.blockTimestampCache, bh) + } +} +``` + +#### Bound after eviction + +With eviction, the cache holds at most one entry per block whose timestamp falls within the window `[now − recentMultiplier×delay, now]`. That is at most `recentMultiplier × delay × (blocks per second)` entries, a small constant for every supported chain. + +Each entry is 56 bytes (`common.Hash` 32 B + `time.Time` 24 B). Even the worst case would be under 100 KB. 
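The bound above can be worked through numerically. The sketch below is illustrative only: `maxCacheEntries` is a hypothetical helper, `recentMultiplier = 3` is taken from `confirmation_gate.go`, the delays come from §3, and the block times (12 s for Ethereum, ~2 s for Polygon, ~250 ms for Arbitrum One) are assumptions rather than values from this patch. The 56-byte figure counts key and value data only.

```go
package main

import "fmt"

// entryPayloadBytes is the per-entry size quoted in the text:
// 32 B common.Hash key + 24 B time.Time value.
const entryPayloadBytes = 56

// maxCacheEntries returns an upper bound on live cache entries: the number of
// blocks whose timestamps can fall inside a window of recentMultiplier × delay.
// Ceiling division keeps the result conservative.
func maxCacheEntries(recentMultiplier, delaySec, blockTimeMs int) int {
	windowMs := recentMultiplier * delaySec * 1000
	return (windowMs + blockTimeMs - 1) / blockTimeMs
}

func main() {
	// Block times are assumed for illustration; delays are the §3 values.
	chains := []struct {
		name        string
		delaySec    int
		blockTimeMs int
	}{
		{name: "Ethereum", delaySec: 780, blockTimeMs: 12000},
		{name: "Polygon PoS", delaySec: 10, blockTimeMs: 2000},
		{name: "Arbitrum One", delaySec: 120, blockTimeMs: 250},
	}
	for _, c := range chains {
		n := maxCacheEntries(3, c.delaySec, c.blockTimeMs)
		fmt.Printf("%s: at most %d entries, %d bytes\n", c.name, n, n*entryPayloadBytes)
	}
}
```

Under these assumptions Ethereum tops out at 195 entries (10,920 bytes) and Arbitrum One at 1,440 entries (80,640 bytes). The real figure is lower in practice, since only blocks that emit at least one relevant event add an entry.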
diff --git a/nitronode/store/memory/blockchain_config.go b/nitronode/store/memory/blockchain_config.go index 4b0a85035..7f6dd8c68 100644 --- a/nitronode/store/memory/blockchain_config.go +++ b/nitronode/store/memory/blockchain_config.go @@ -43,6 +43,10 @@ type BlockchainConfig struct { ChannelHubAddress string `yaml:"channel_hub_address"` // ChannelHubSigValidators maps validator IDs to the addresses of signature validators for the ChannelHub contract on this blockchain ChannelHubSigValidators map[uint8]string `yaml:"channel_hub_sig_validators"` + // ConfirmationDelaySecs is the number of seconds to wait before processing an event. + // Set to 0 to process events immediately (disables the confirmation gate). + // Maximum meaningful value is ~780s (Ethereum Casper FFG hard finality). + ConfirmationDelaySecs uint32 `yaml:"confirmation_delay_secs"` } // LoadEnabledBlockchains loads and validates blockchain configurations from a YAML file. diff --git a/nitronode/store/memory/memory_store.go b/nitronode/store/memory/memory_store.go index 02d813b42..1ee224beb 100644 --- a/nitronode/store/memory/memory_store.go +++ b/nitronode/store/memory/memory_store.go @@ -33,10 +33,11 @@ func NewMemoryStoreV1(assetsConfig AssetsConfig, blockchainsConfig map[uint64]Bl } blockchains = append(blockchains, core.Blockchain{ - ID: bc.ID, - Name: bc.Name, - ChannelHubAddress: bc.ChannelHubAddress, - BlockStep: bc.BlockStep, + ID: bc.ID, + Name: bc.Name, + ChannelHubAddress: bc.ChannelHubAddress, + BlockStep: bc.BlockStep, + ConfirmationDelaySecs: bc.ConfirmationDelaySecs, }) } slices.SortFunc(blockchains, func(a, b core.Blockchain) int { diff --git a/pkg/blockchain/evm/confirmation_gate.go b/pkg/blockchain/evm/confirmation_gate.go new file mode 100644 index 000000000..6690d8ed1 --- /dev/null +++ b/pkg/blockchain/evm/confirmation_gate.go @@ -0,0 +1,278 @@ +package evm + +import ( + "context" + "sync" + "time" + + "github.com/ethereum/go-ethereum/common" + 
"github.com/ethereum/go-ethereum/core/types" + "github.com/layer-3/nitrolite/pkg/log" +) + +const pollInterval = 50 * time.Millisecond +const recentMultiplier = 3 // recentlyForwarded entries are kept for (recentMultiplier × delay) to catch post-gate reorgs + +// queueEntry holds a pending event waiting for the confirmation delay to expire. +type queueEntry struct { + log types.Log + arrivedAt time.Time // block timestamp from fetcher; fallback time.Now() on error +} + +// eventKey identifies an event by tx and log index; blockHash is intentionally excluded +// so that a reorg-replacement event (same tx, same index, different block) can match +// and cancel the original pending entry. +type eventKey struct { + txHash common.Hash + logIndex uint +} + +// forwardedKey identifies an event that has already been forwarded to the downstream +// handler; blockHash is included so a Removed notification from a different block fork +// does NOT falsely trigger post-gate reorg logic. +type forwardedKey struct { + txHash common.Hash + blockHash common.Hash + logIndex uint +} + +// ConfirmationGate buffers incoming events for a configurable delay before forwarding +// them to a downstream handler, providing a window to cancel events that are reorged +// out before the delay expires. +type ConfirmationGate struct { + delay time.Duration + chainID uint64 + handler HandleEvent + blockTimestampFetcher func(blockHash common.Hash) (time.Time, error) + + mu sync.Mutex + queue []queueEntry + recentlyForwarded map[forwardedKey]time.Time // TTL = recentMultiplier × delay; protected by mu + // blockTimestampCache holds the timestamp for every block that has delivered at + // least one event to the gate. It avoids a redundant RPC call when the same block + // produces multiple events (e.g. a batch open with two ChannelDeposited logs). 
+ // Entries are evicted by the Poller once the block timestamp is older than + // recentMultiplier × delay; by that point every event from the block has either + // been forwarded or cancelled, so the entry will never be read again. + blockTimestampCache map[common.Hash]time.Time // protected by mu + logger log.Logger +} + +// NewConfirmationGate creates a ConfirmationGate that holds events for delay before +// forwarding them to handler. fetcher is called once per unique blockHash to obtain the +// block's timestamp, which is used as the event's arrivedAt reference. If fetcher fails, +// time.Now() is used as a fallback. +func NewConfirmationGate( + delay time.Duration, + chainID uint64, + handler HandleEvent, + fetcher func(blockHash common.Hash) (time.Time, error), + logger log.Logger, +) *ConfirmationGate { + return &ConfirmationGate{ + delay: delay, + chainID: chainID, + handler: handler, + blockTimestampFetcher: fetcher, + recentlyForwarded: make(map[forwardedKey]time.Time), + blockTimestampCache: make(map[common.Hash]time.Time), + logger: logger.WithName("confirmation-gate"), + } +} + +// Start begins the polling goroutine that forwards matured entries to the downstream +// handler. If delay is zero the gate queues nothing and no goroutine is started. +func (g *ConfirmationGate) Start(ctx context.Context) { + if g.delay == 0 { + return + } + go g.poll(ctx) +} + +// HandleEvent is the entry point called by the upstream Listener for each event. +// +// When delay == 0 the gate is a pass-through: non-removed events are forwarded to the +// downstream handler immediately, and Removed events are dropped without forwarding. +// +// When delay > 0: +// - A non-removed event is queued and will be forwarded after the confirmation delay. +// - A removed event cancels its pending queue entry (pre-gate reorg) or, if the +// entry was already forwarded, records a post-gate reorg warning. 
+func (g *ConfirmationGate) HandleEvent(ctx context.Context, eventLog types.Log) error { + if g.delay == 0 { + // Removed:true events are never forwarded to the reactor regardless of delay + // setting — the reactor was never designed to handle them and has no guard on + // Topics[0]. This preserves the pre-gate listener behavior of dropping reorged + // logs before they reach any downstream handler. + if eventLog.Removed { + return nil + } + return g.handler(ctx, eventLog) + } + + key := eventKey{txHash: eventLog.TxHash, logIndex: uint(eventLog.Index)} + + if !eventLog.Removed { + // Fetch block timestamp, using cache to avoid redundant RPC calls. + var ts time.Time + + g.mu.Lock() + cached, hit := g.blockTimestampCache[eventLog.BlockHash] + if hit { + ts = cached + } + g.mu.Unlock() + + if !hit { + fetched, err := g.blockTimestampFetcher(eventLog.BlockHash) + if err != nil { + g.logger.Warn("failed to fetch block timestamp, falling back to now", + "error", err, + "blockHash", eventLog.BlockHash.Hex(), + "chainID", g.chainID, + ) + // Use gate entry arrival time as a fallback to avoid blocking events indefinitely when the fetcher fails. + ts = time.Now() + } else { + ts = fetched + + // Update cache for future events from the same block. + g.mu.Lock() + g.blockTimestampCache[eventLog.BlockHash] = ts + g.mu.Unlock() + } + } + + g.mu.Lock() + // Remove any existing queue entry for the same (txHash, logIndex) so that + // a re-delivered event (after reorg, with different blockHash) replaces + // the original and resets the confirmation timer. + g.removeFromQueueByKey(key) + g.queue = append(g.queue, queueEntry{log: eventLog, arrivedAt: ts}) + g.mu.Unlock() + + return nil + } + + // eventLog.Removed == true: attempt pre-gate cancellation. + g.mu.Lock() + defer g.mu.Unlock() + + // Build the full key once; it is reused for both the queue scan and the + // recentlyForwarded lookup. 
blockHash is included so that a Removed notification for + // an old block does not accidentally cancel a re-mined entry with the same tx/logIndex + // in a new block. + fk := forwardedKey{txHash: eventLog.TxHash, blockHash: eventLog.BlockHash, logIndex: uint(eventLog.Index)} + if g.removeFromQueueByFullKey(fk) { + return nil + } + + // Not in queue — check whether it was already forwarded (post-gate reorg). + if _, ok := g.recentlyForwarded[fk]; ok { + g.logger.Warn("post-gate reorg detected", + "txHash", eventLog.TxHash.Hex(), + "blockHash", eventLog.BlockHash.Hex(), + "logIndex", eventLog.Index, + "chainID", g.chainID, + ) + delete(g.recentlyForwarded, fk) + return nil + } + + g.logger.Debug("removal for unknown/stale event", + "txHash", eventLog.TxHash.Hex(), + "blockHash", eventLog.BlockHash.Hex(), + "logIndex", eventLog.Index, + "chainID", g.chainID, + ) + return nil +} + +// removeFromQueueByKey removes the first queue entry matching key (ignores blockHash). +// Used when a non-removed re-delivery replaces an earlier entry for the same logical event. +// Caller must hold mu. +func (g *ConfirmationGate) removeFromQueueByKey(key eventKey) { + for i, e := range g.queue { + ek := eventKey{txHash: e.log.TxHash, logIndex: uint(e.log.Index)} + if ek == key { + g.queue = append(g.queue[:i], g.queue[i+1:]...) + return + } + } +} + +// removeFromQueueByFullKey removes the first queue entry matching txHash, blockHash, and +// logIndex. Used in the Removed handler so that a removal notification for an old block +// does not accidentally cancel a re-mined entry with the same tx/logIndex in a new block. +// Caller must hold mu. +func (g *ConfirmationGate) removeFromQueueByFullKey(fk forwardedKey) bool { + for i, e := range g.queue { + if e.log.TxHash == fk.txHash && e.log.BlockHash == fk.blockHash && uint(e.log.Index) == fk.logIndex { + g.queue = append(g.queue[:i], g.queue[i+1:]...) 
+ return true + } + } + return false +} + +// poll is the background goroutine that wakes on each pollInterval tick, forwards +// all matured queue entries to the downstream handler, and evicts stale recentlyForwarded +// entries whose TTL (recentMultiplier × delay) has elapsed. +func (g *ConfirmationGate) poll(ctx context.Context) { + ticker := time.NewTicker(pollInterval) + defer ticker.Stop() + + for { + select { + case <-ctx.Done(): + return + case <-ticker.C: + g.mu.Lock() + now := time.Now() + + // Forward all entries whose confirmation delay has elapsed. + for len(g.queue) > 0 && !g.queue[0].arrivedAt.Add(g.delay).After(now) { + entry := g.queue[0] + g.queue = g.queue[1:] + + fk := forwardedKey{ + txHash: entry.log.TxHash, + blockHash: entry.log.BlockHash, + logIndex: uint(entry.log.Index), + } + g.recentlyForwarded[fk] = now + + g.mu.Unlock() + + evCtx := log.SetContextLogger(context.Background(), g.logger) + if err := g.handler(evCtx, entry.log); err != nil { + g.logger.Error("handler error after confirmation delay", + "error", err, + "chainID", g.chainID, + ) + } + + g.mu.Lock() + } + + // Evict recentlyForwarded entries older than (recentMultiplier × delay). + for k, forwardedAt := range g.recentlyForwarded { + if now.Sub(forwardedAt) > recentMultiplier*g.delay { + delete(g.recentlyForwarded, k) + } + } + + // Evict blockTimestampCache entries whose block timestamp is older than + // (recentMultiplier × delay). The listener delivers events in block order, + // so once a block is old enough, all of its events have been forwarded or + // cancelled and the cached timestamp will never be read again. 
+ for bh, ts := range g.blockTimestampCache { + if now.Sub(ts) > recentMultiplier*g.delay { + delete(g.blockTimestampCache, bh) + } + } + + g.mu.Unlock() + } + } +} diff --git a/pkg/blockchain/evm/confirmation_gate_test.go b/pkg/blockchain/evm/confirmation_gate_test.go new file mode 100644 index 000000000..f296397c0 --- /dev/null +++ b/pkg/blockchain/evm/confirmation_gate_test.go @@ -0,0 +1,529 @@ +package evm + +import ( + "context" + "errors" + "sync" + "sync/atomic" + "testing" + "time" + + "github.com/ethereum/go-ethereum/common" + "github.com/ethereum/go-ethereum/core/types" + "github.com/layer-3/nitrolite/pkg/log" + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/require" +) + +// helpers + +func noopFetcher(_ common.Hash) (time.Time, error) { + return time.Now(), nil +} + +func makeLog(txHash common.Hash, blockHash common.Hash, logIndex uint, removed bool) types.Log { + return types.Log{ + TxHash: txHash, + BlockHash: blockHash, + Index: uint(logIndex), + Removed: removed, + } +} + +func newGate(t *testing.T, delay time.Duration, handler HandleEvent, fetcher func(common.Hash) (time.Time, error)) *ConfirmationGate { + t.Helper() + if fetcher == nil { + fetcher = noopFetcher + } + return NewConfirmationGate(delay, 1, handler, fetcher, log.NewNoopLogger()) +} + +// T1: delay==0 forwards non-removed events directly; Removed:true events are silently +// dropped to protect reactors that have no guard on l.Topics[0]. 
+func TestConfirmationGate_Delay0_DirectForward(t *testing.T) { + t.Parallel() + + var calls []types.Log + var mu sync.Mutex + + wantErr := errors.New("handler error") + handler := func(_ context.Context, l types.Log) error { + mu.Lock() + calls = append(calls, l) + mu.Unlock() + return wantErr + } + + g := newGate(t, 0, handler, nil) + g.Start(t.Context()) // should be a no-op for delay==0 + + tx := common.HexToHash("0x01") + bh := common.HexToHash("0xAA") + + // normal event — forwarded, handler error propagated + normalLog := makeLog(tx, bh, 0, false) + err := g.HandleEvent(context.Background(), normalLog) + require.Equal(t, wantErr, err) + + // removed event — silently dropped; handler NOT called, nil returned + removedLog := makeLog(tx, bh, 0, true) + err = g.HandleEvent(context.Background(), removedLog) + require.NoError(t, err) + + mu.Lock() + assert.Len(t, calls, 1, "handler must be called only for non-removed event") + mu.Unlock() +} + +// T2: normal event is queued and delivered after the delay. 
+func TestConfirmationGate_NormalPath(t *testing.T) { + t.Parallel() + + var callCount atomic.Int32 + var deliveredLog types.Log + var mu sync.Mutex + + handler := func(_ context.Context, l types.Log) error { + mu.Lock() + deliveredLog = l + mu.Unlock() + callCount.Add(1) + return nil + } + + g := newGate(t, 5*time.Millisecond, handler, nil) + g.Start(t.Context()) + + tx := common.HexToHash("0x02") + bh := common.HexToHash("0xBB") + ev := makeLog(tx, bh, 0, false) + + require.NoError(t, g.HandleEvent(context.Background(), ev)) + + // should NOT be called within 1 ms + time.Sleep(1 * time.Millisecond) + assert.Equal(t, int32(0), callCount.Load(), "handler must not be called before delay expires") + + // should be called before the 500 ms deadline (the first poll tick fires at pollInterval = 50 ms) + deadline := time.After(500 * time.Millisecond) + for callCount.Load() == 0 { + select { + case <-deadline: + t.Fatal("handler not called within timeout") + default: + time.Sleep(1 * time.Millisecond) + } + } + + assert.Equal(t, int32(1), callCount.Load()) + mu.Lock() + assert.Equal(t, ev.TxHash, deliveredLog.TxHash) + assert.Equal(t, ev.Index, deliveredLog.Index) + mu.Unlock() +} + +// T3: a Removed event for a queued entry cancels it before forwarding.
+func TestConfirmationGate_ReorgCancel(t *testing.T) { + t.Parallel() + + var callCount atomic.Int32 + handler := func(_ context.Context, _ types.Log) error { + callCount.Add(1) + return nil + } + + g := newGate(t, 10*time.Millisecond, handler, nil) + g.Start(t.Context()) + + tx := common.HexToHash("0x03") + bh := common.HexToHash("0xCC") + + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bh, 0, false))) + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bh, 0, true))) + + time.Sleep(3 * pollInterval) // outlast several poll ticks; a sleep shorter than pollInterval could never observe a forward + assert.Equal(t, int32(0), callCount.Load(), "handler must never be called after reorg cancel") +} + +// T4: a re-delivered event (same tx/logIndex, different blockHash) replaces the original; +// the Removed for the old blockHash is a no-op; the new event is forwarded once. +func TestConfirmationGate_OutOfOrder(t *testing.T) { + t.Parallel() + + var callCount atomic.Int32 + handler := func(_ context.Context, _ types.Log) error { + callCount.Add(1) + return nil + } + + g := newGate(t, 10*time.Millisecond, handler, nil) + g.Start(t.Context()) + + tx := common.HexToHash("0x04") + bhOld := common.HexToHash("0xAA") + bhNew := common.HexToHash("0xBB") + + // Event A: original block + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bhOld, 0, false))) + // Event B: re-mined in new block (same txHash/logIndex, different blockHash) + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bhNew, 0, false))) + // Removed for old block: A was already replaced by B above, so no queue entry has bh=0xAA + // and this removal is a no-op. B (bh=0xBB) is left untouched. + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bhOld, 0, true))) + + // Wait long enough for the poll goroutine to fire (pollInterval=50ms) and the delay to + // have elapsed (10ms). The 500ms deadline gives generous headroom.
+ deadline := time.After(500 * time.Millisecond) + for callCount.Load() == 0 { + select { + case <-deadline: + t.Fatal("handler not called within timeout — event B was not forwarded") + default: + time.Sleep(5 * time.Millisecond) + } + } + + // Only B should have been forwarded (A was cancelled). + assert.Equal(t, int32(1), callCount.Load()) +} + +// T5: post-gate reorg — Removed arrives after the event was already forwarded. +// Verify handler is called, Removed is handled gracefully (no panic/error). +func TestConfirmationGate_PostGateReorg(t *testing.T) { + t.Parallel() + + var callCount atomic.Int32 + handler := func(_ context.Context, _ types.Log) error { + callCount.Add(1) + return nil + } + + g := newGate(t, 2*time.Millisecond, handler, nil) + g.Start(t.Context()) + + tx := common.HexToHash("0x05") + bh := common.HexToHash("0xDD") + ev := makeLog(tx, bh, 0, false) + + require.NoError(t, g.HandleEvent(context.Background(), ev)) + + // Wait until forwarded. + deadline := time.After(500 * time.Millisecond) + for callCount.Load() == 0 { + select { + case <-deadline: + t.Fatal("handler not called within timeout") + default: + time.Sleep(1 * time.Millisecond) + } + } + assert.Equal(t, int32(1), callCount.Load()) + + // Post-gate Removed — should not panic or return error. + // WARN log "post-gate reorg detected" is emitted internally (manually observable). + err := g.HandleEvent(context.Background(), makeLog(tx, bh, 0, true)) + assert.NoError(t, err) + + // Handler should still have been called exactly once. + assert.Equal(t, int32(1), callCount.Load()) +} + +// T6: Removed for a completely unknown event — no error, no handler call. 
+func TestConfirmationGate_UnknownRemoval(t *testing.T) { + t.Parallel() + + var callCount atomic.Int32 + handler := func(_ context.Context, _ types.Log) error { + callCount.Add(1) + return nil + } + + g := newGate(t, 10*time.Millisecond, handler, nil) + g.Start(t.Context()) + + tx := common.HexToHash("0x06") + bh := common.HexToHash("0xEE") + + err := g.HandleEvent(context.Background(), makeLog(tx, bh, 0, true)) + assert.NoError(t, err) + + time.Sleep(20 * time.Millisecond) + assert.Equal(t, int32(0), callCount.Load()) +} + +// T7: fetcher returns an old timestamp → event is immediately mature and forwarded fast. +func TestConfirmationGate_BlockTimestampBypass(t *testing.T) { + t.Parallel() + + var callCount atomic.Int32 + handler := func(_ context.Context, _ types.Log) error { + callCount.Add(1) + return nil + } + + pastFetcher := func(_ common.Hash) (time.Time, error) { + return time.Now().Add(-30 * time.Second), nil + } + + g := newGate(t, 10*time.Millisecond, handler, pastFetcher) + g.Start(t.Context()) + + tx := common.HexToHash("0x07") + bh := common.HexToHash("0xFF") + + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bh, 0, false))) + + deadline := time.After(500 * time.Millisecond) + for callCount.Load() == 0 { + select { + case <-deadline: + t.Fatal("handler not called within timeout") + default: + time.Sleep(1 * time.Millisecond) + } + } + assert.Equal(t, int32(1), callCount.Load()) +} + +// T8: fetcher returns a timestamp 60ms in the past; delay=100ms; so ~40ms remain. 
+func TestConfirmationGate_BlockTimestampPartialDelay(t *testing.T) { + t.Parallel() + + var callCount atomic.Int32 + handler := func(_ context.Context, _ types.Log) error { + callCount.Add(1) + return nil + } + + fetcher := func(_ common.Hash) (time.Time, error) { + return time.Now().Add(-60 * time.Millisecond), nil + } + + g := newGate(t, 100*time.Millisecond, handler, fetcher) + g.Start(t.Context()) + + tx := common.HexToHash("0x08") + bh := common.HexToHash("0x08") + + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bh, 0, false))) + + // Not called after 20 ms (~40ms of the delay remained at enqueue time). + time.Sleep(20 * time.Millisecond) + assert.Equal(t, int32(0), callCount.Load(), "handler must not be called before remaining delay expires") + + // Called before the 500 ms deadline. + deadline := time.After(500 * time.Millisecond) + for callCount.Load() == 0 { + select { + case <-deadline: + t.Fatal("handler not called within timeout") + default: + time.Sleep(5 * time.Millisecond) + } + } + assert.Equal(t, int32(1), callCount.Load()) +} + +// T9: fetcher returns error → fallback to time.Now() → full delay must still elapse. +func TestConfirmationGate_BlockTimestampFetchError(t *testing.T) { + t.Parallel() + + var callCount atomic.Int32 + handler := func(_ context.Context, _ types.Log) error { + callCount.Add(1) + return nil + } + + errFetcher := func(_ common.Hash) (time.Time, error) { + return time.Time{}, errors.New("rpc error") + } + + g := newGate(t, 5*time.Millisecond, handler, errFetcher) + g.Start(t.Context()) + + tx := common.HexToHash("0x09") + bh := common.HexToHash("0x09") + + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bh, 0, false))) + + // Not called immediately (fell back to current time, full delay required). + time.Sleep(1 * time.Millisecond) + assert.Equal(t, int32(0), callCount.Load(), "handler must not be called before delay expires") + + // Called before the 500 ms deadline (the first poll tick fires at pollInterval = 50 ms).
+ deadline := time.After(500 * time.Millisecond) + for callCount.Load() == 0 { + select { + case <-deadline: + t.Fatal("handler not called within timeout") + default: + time.Sleep(1 * time.Millisecond) + } + } + assert.Equal(t, int32(1), callCount.Load()) +} + +// T10: two events sharing the same blockHash should result in exactly one fetcher call. +func TestConfirmationGate_BlockTimestampCache(t *testing.T) { + t.Parallel() + + var fetchCount atomic.Int32 + var callCount atomic.Int32 + + fetcher := func(_ common.Hash) (time.Time, error) { + fetchCount.Add(1) + return time.Now(), nil + } + handler := func(_ context.Context, _ types.Log) error { + callCount.Add(1) + return nil + } + + g := newGate(t, 5*time.Millisecond, handler, fetcher) + g.Start(t.Context()) + + tx1 := common.HexToHash("0x10") + tx2 := common.HexToHash("0x11") + bh := common.HexToHash("0x5A") // block hash shared by both events + + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx1, bh, 0, false))) + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx2, bh, 1, false))) + + // Wait for both events to be forwarded. + deadline := time.After(500 * time.Millisecond) + for callCount.Load() < 2 { + select { + case <-deadline: + t.Fatal("not all events delivered within timeout") + default: + time.Sleep(1 * time.Millisecond) + } + } + + assert.Equal(t, int32(1), fetchCount.Load(), "fetcher must be called only once for a shared blockHash") +} + +// T11: cancelling the context prevents queued events from being forwarded.
+func TestConfirmationGate_Shutdown(t *testing.T) { + t.Parallel() + + var callCount atomic.Int32 + handler := func(_ context.Context, _ types.Log) error { + callCount.Add(1) + return nil + } + + g := newGate(t, 50*time.Millisecond, handler, nil) + ctx, cancel := context.WithCancel(t.Context()) + g.Start(ctx) + + for i := range 3 { + tx := common.BytesToHash([]byte{byte(0x20 + i)}) + bh := common.BytesToHash([]byte{byte(0x30 + i)}) + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bh, uint(i), false))) + } + + // Cancel before delay expires. + cancel() + + time.Sleep(100 * time.Millisecond) + assert.Equal(t, int32(0), callCount.Load(), "no events must be forwarded after context cancellation") +} + +// T12: a Removed that no longer matches recentlyForwarded (entry deleted or evicted) is a harmless no-op. +func TestConfirmationGate_RecentlyForwardedEviction(t *testing.T) { + t.Parallel() + + var callCount atomic.Int32 + handler := func(_ context.Context, _ types.Log) error { + callCount.Add(1) + return nil + } + + delay := 2 * time.Millisecond + g := newGate(t, delay, handler, nil) + g.Start(t.Context()) + + tx := common.HexToHash("0x12") + bh := common.HexToHash("0x12") + + // Enqueue and wait for forward. + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bh, 0, false))) + deadline := time.After(500 * time.Millisecond) + for callCount.Load() == 0 { + select { + case <-deadline: + t.Fatal("handler not called within timeout") + default: + time.Sleep(1 * time.Millisecond) + } + } + + // Immediately send a post-gate Removed: it should match recentlyForwarded (WARN path), which also deletes the entry. + err := g.HandleEvent(context.Background(), makeLog(tx, bh, 0, true)) + assert.NoError(t, err) + + // Sleep for recentMultiplier × delay. TTL eviction itself only runs on a poll tick (pollInterval = 50ms). + time.Sleep(time.Duration(recentMultiplier) * delay) + + // A second Removed for the same event should fall through to the DEBUG path (not found). + // The WARN path already deleted the entry, so this checks only that there is no panic or error.
+ err = g.HandleEvent(context.Background(), makeLog(tx, bh, 0, true)) + assert.NoError(t, err) + + // Handler still called exactly once. + assert.Equal(t, int32(1), callCount.Load()) +} + +// T13: multiple events are all delivered, preserving queue order. +func TestConfirmationGate_MultipleEvents_Ordering(t *testing.T) { + t.Parallel() + + var mu sync.Mutex + var delivered []common.Hash + + handler := func(_ context.Context, l types.Log) error { + mu.Lock() + delivered = append(delivered, l.TxHash) + mu.Unlock() + return nil + } + + g := newGate(t, 5*time.Millisecond, handler, nil) + g.Start(t.Context()) + + txHashes := []common.Hash{ + common.HexToHash("0xA1"), + common.HexToHash("0xA2"), + common.HexToHash("0xA3"), + } + bh := common.HexToHash("0xB10C") // block hash shared by all three events + + for i, tx := range txHashes { + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bh, uint(i), false))) + } + + // Wait for all 3 events to be delivered. + deadline := time.After(500 * time.Millisecond) + for { + mu.Lock() + n := len(delivered) + mu.Unlock() + if n >= 3 { + break + } + select { + case <-deadline: + t.Fatalf("only %d/3 events delivered within timeout", n) + default: + time.Sleep(1 * time.Millisecond) + } + } + + mu.Lock() + defer mu.Unlock() + require.Len(t, delivered, 3) + assert.Equal(t, txHashes[0], delivered[0]) + assert.Equal(t, txHashes[1], delivered[1]) + assert.Equal(t, txHashes[2], delivered[2]) +} diff --git a/pkg/blockchain/evm/listener.go b/pkg/blockchain/evm/listener.go index 1e2a013d7..2c07aee26 100644 --- a/pkg/blockchain/evm/listener.go +++ b/pkg/blockchain/evm/listener.go @@ -215,11 +215,14 @@ func (l *Listener) listenEvents(ctx context.Context) error { // failed event; the next Listen invocation re-fetches from the same // cursor. Transient handler failures retry instead of silently dropping. // -// 4. Reorged-out logs are discarded. Live deliveries with Removed=true are -// dropped.
A reorg that fully removes a ChannelChallenged log also -// removes the matching on-chain status transition to DISPUTED, so the -// contract's Path-1 (challenge-timeout) close cannot subsequently fire -// for the same channel. +// 4. Reorged-out logs are forwarded to the handler (ConfirmationGate). +// Live deliveries with Removed=true are passed to the handler so the +// gate can cancel any pending confirmation timer for that event. The +// reactor never sees Removed=true logs directly; the gate filters them +// before forwarding confirmed events. The lastBlock cursor and +// IsContractEventPresent dedup check are skipped for Removed=true events +// so neither the resume cursor nor the idempotency guard is corrupted +// by a reorg signal. // // A consequence used by the nitronode event handlers: for any channel that // closes via Path-1 (challenge-timeout, ChannelHub Closed-from-DISPUTED), @@ -287,26 +290,22 @@ func (l *Listener) processEvents( eventSubscription.Unsubscribe() return nil case eventLog := <-currentCh: - // During a chain reorganization geth re-delivers orphaned logs with - // Removed: true. Skip them to avoid applying phantom state changes. 
- if eventLog.Removed { - l.logger.Warn("skipping removed log from reorg", "blockchainID", l.blockchainID, "blockNumber", eventLog.BlockNumber, "logIndex", eventLog.Index, "txHash", eventLog.TxHash.Hex()) - continue - } - *lastBlock = eventLog.BlockNumber - if !currentCheckDone { - present, err := l.eventGetter.IsContractEventPresent(l.blockchainID, eventLog.BlockNumber, eventLog.TxHash.Hex(), uint32(eventLog.Index)) - if err != nil { - eventSubscription.Unsubscribe() - return fmt.Errorf("failed to check current event presence: %w", err) - } - if present { - l.logger.Debug("skipping already present current event", "blockchainID", l.blockchainID, "blockNumber", eventLog.BlockNumber, "logIndex", eventLog.Index) - continue + if !eventLog.Removed { + *lastBlock = eventLog.BlockNumber + if !currentCheckDone { + present, err := l.eventGetter.IsContractEventPresent(l.blockchainID, eventLog.BlockNumber, eventLog.TxHash.Hex(), uint32(eventLog.Index)) + if err != nil { + eventSubscription.Unsubscribe() + return fmt.Errorf("failed to check current event presence: %w", err) + } + if present { + l.logger.Debug("skipping already present current event", "blockchainID", l.blockchainID, "blockNumber", eventLog.BlockNumber, "logIndex", eventLog.Index) + continue + } + currentCheckDone = true } - currentCheckDone = true + l.logger.Debug("received current event", "blockchainID", l.blockchainID, "contractAddress", l.contractAddress.String(), "blockNumber", eventLog.BlockNumber, "logIndex", eventLog.Index) } - l.logger.Debug("received current event", "blockchainID", l.blockchainID, "contractAddress", l.contractAddress.String(), "blockNumber", eventLog.BlockNumber, "logIndex", eventLog.Index) evCtx := log.SetContextLogger(context.Background(), l.logger) if err := l.handleEvent(evCtx, eventLog); err != nil { eventSubscription.Unsubscribe() @@ -397,11 +396,3 @@ func (l *Listener) reconcileBlockRange( } } -// TODO: the current reorg handling (skipping Removed logs) prevents new damage but 
-// does not undo side effects from the original delivery if it was already processed. -// A more robust approach is a confirmation buffer: hold live logs in memory keyed by -// block number, only apply them after N confirmations (new blocks on top), and discard -// any log that arrives with Removed: true while still in the buffer. This adds N blocks -// of latency (~12s × N on mainnet) but guarantees that only finalized events reach the -// handler. On L2s where reorgs are near-zero, the latency trade-off may not be worth it, -// so this should be configurable per chain. diff --git a/pkg/blockchain/evm/listener_test.go b/pkg/blockchain/evm/listener_test.go index 8b339dc36..e65f133fd 100644 --- a/pkg/blockchain/evm/listener_test.go +++ b/pkg/blockchain/evm/listener_test.go @@ -303,6 +303,69 @@ func TestProcessEvents_SubscriptionErrorDuringPhase1(t *testing.T) { assert.Equal(t, []uint64{100}, handledBlocks) } +func TestListener_RemovedLog_ForwardedToHandler(t *testing.T) { + t.Parallel() + logger := log.NewNoopLogger() + addr := common.HexToAddress("0x123") + eventGetter := new(MockContractEventGetter) + + // Track which logs reached handleEvent. + var handledLogs []types.Log + handleEvent := func(ctx context.Context, eventLog types.Log) error { + handledLogs = append(handledLogs, eventLog) + return nil + } + + listener := NewListener(addr, new(MockEVMClient), 1, 10, logger, handleEvent, eventGetter) + + // No historical events. + historicalCh := make(chan types.Log) + close(historicalCh) + + currentCh := make(chan types.Log, 2) + + // Event 1: non-Removed at block 10 — triggers IsContractEventPresent check, + // advances lastBlock, sets currentCheckDone = true. 
+ normalLog := types.Log{BlockNumber: 10, Index: 0, TxHash: common.HexToHash("0xabc")} + eventGetter.On("IsContractEventPresent", uint64(1), uint64(10), mock.Anything, uint32(0)).Return(false, nil).Once() + + // Event 2: Removed=true at block 11 — must NOT advance lastBlock, must NOT call + // IsContractEventPresent, but MUST reach handleEvent. + removedLog := types.Log{BlockNumber: 11, Index: 0, TxHash: common.HexToHash("0xdef"), Removed: true} + + currentCh <- normalLog + currentCh <- removedLog + + sub := &MockSubscription{errChan: make(chan error, 1), unsub: func() {}} + + ctx, cancel := context.WithCancel(context.Background()) + go func() { + // Give processEvents enough time to drain both buffered events, then cancel. + time.Sleep(100 * time.Millisecond) + cancel() + }() + + var lastBlock uint64 + err := listener.processEvents(ctx, sub, historicalCh, currentCh, &lastBlock) + require.NoError(t, err) + + // Both events must have reached handleEvent. + require.Len(t, handledLogs, 2, "handleEvent must be called for both the normal and the Removed event") + + // Verify first call was the normal log and second was the removed log. + assert.Equal(t, uint64(10), handledLogs[0].BlockNumber) + assert.False(t, handledLogs[0].Removed) + assert.Equal(t, uint64(11), handledLogs[1].BlockNumber) + assert.True(t, handledLogs[1].Removed) + + // lastBlock must NOT have advanced past the normal event's block. + assert.Equal(t, uint64(10), lastBlock, "lastBlock must not be advanced by a Removed=true event") + + // IsContractEventPresent must have been called exactly once (for the normal log only). 
+ eventGetter.AssertNumberOfCalls(t, "IsContractEventPresent", 1) + eventGetter.AssertExpectations(t) +} + func TestReconcileBlockRange_ContextCancellation(t *testing.T) { t.Parallel() mockClient := new(MockEVMClient) diff --git a/pkg/core/types.go b/pkg/core/types.go index 57f949bae..0d4d68978 100644 --- a/pkg/core/types.go +++ b/pkg/core/types.go @@ -1094,6 +1094,7 @@ type Blockchain struct { ID uint64 `json:"id"` // Blockchain network ID ChannelHubAddress string `json:"channel_hub_address"` // Address of the ChannelHub contract on this blockchain BlockStep uint64 `json:"block_step"` // Number of blocks between each channel update + ConfirmationDelaySecs uint32 `json:"confirmation_delay_secs"` // Seconds to wait before processing an event (0 = immediate) } // Asset represents information about a supported asset From be1d73f62a04574af12a154314144f96acab36d7 Mon Sep 17 00:00:00 2001 From: nksazonov Date: Mon, 8 Jun 2026 13:07:15 +0200 Subject: [PATCH 03/23] feat(nitronode): expose confirmation_delay_secs via GetConfig API --- nitronode/api/node_v1/utils.go | 7 ++++--- pkg/rpc/types.go | 3 +++ sdk/go/utils.go | 1 + sdk/go/utils_test.go | 8 +++++--- sdk/ts/src/core/types.ts | 1 + sdk/ts/src/rpc/types.ts | 2 ++ sdk/ts/src/utils.ts | 1 + .../test/unit/__snapshots__/public-api-drift.test.ts.snap | 2 ++ 8 files changed, 19 insertions(+), 6 deletions(-) diff --git a/nitronode/api/node_v1/utils.go b/nitronode/api/node_v1/utils.go index 96209281c..c49919843 100644 --- a/nitronode/api/node_v1/utils.go +++ b/nitronode/api/node_v1/utils.go @@ -10,9 +10,10 @@ import ( func mapBlockchainV1(blockchain core.Blockchain) rpc.BlockchainInfoV1 { return rpc.BlockchainInfoV1{ - Name: blockchain.Name, - BlockchainID: strconv.FormatUint(blockchain.ID, 10), - ChannelHubAddress: blockchain.ChannelHubAddress, + Name: blockchain.Name, + BlockchainID: strconv.FormatUint(blockchain.ID, 10), + ChannelHubAddress: blockchain.ChannelHubAddress, + ConfirmationDelaySecs: 
blockchain.ConfirmationDelaySecs, } } diff --git a/pkg/rpc/types.go b/pkg/rpc/types.go index 9ad75a804..a1bb5ce7a 100644 --- a/pkg/rpc/types.go +++ b/pkg/rpc/types.go @@ -264,6 +264,9 @@ type BlockchainInfoV1 struct { BlockchainID string `json:"blockchain_id"` // ChannelHubAddress is the contract address on this network ChannelHubAddress string `json:"channel_hub_address"` + // ConfirmationDelaySecs is the number of seconds the node waits before processing any on-chain event (for example, crediting a deposit). + // Zero means the gate is disabled and events are processed immediately. + ConfirmationDelaySecs uint32 `json:"confirmation_delay_secs"` } // ============================================================================ diff --git a/sdk/go/utils.go b/sdk/go/utils.go index d3dbebaf3..8a5daba08 100644 --- a/sdk/go/utils.go +++ b/sdk/go/utils.go @@ -30,6 +30,7 @@ func transformNodeConfig(resp rpc.NodeV1GetConfigResponse) (*core.NodeConfig, er ID: blockchainID, ChannelHubAddress: info.ChannelHubAddress, BlockStep: 0, // Not provided in RPC response + ConfirmationDelaySecs: info.ConfirmationDelaySecs, }) } diff --git a/sdk/go/utils_test.go b/sdk/go/utils_test.go index 021a47d6c..1d650e95f 100644 --- a/sdk/go/utils_test.go +++ b/sdk/go/utils_test.go @@ -19,9 +19,10 @@ func TestTransformNodeConfig(t *testing.T) { SupportedSigValidators: []core.ChannelSignerType{core.ChannelSignerType_SessionKey}, Blockchains: []rpc.BlockchainInfoV1{ { - Name: "Polygon", - BlockchainID: "137", - ChannelHubAddress: "0xHubAddress", + Name: "Polygon", + BlockchainID: "137", + ChannelHubAddress: "0xHubAddress", + ConfirmationDelaySecs: 10, }, }, } @@ -35,6 +36,7 @@ func TestTransformNodeConfig(t *testing.T) { assert.Len(t, config.Blockchains, 1) assert.Equal(t, uint64(137), config.Blockchains[0].ID) assert.Equal(t, "Polygon", config.Blockchains[0].Name) + assert.Equal(t, uint32(10), config.Blockchains[0].ConfirmationDelaySecs) // Test error case rpcResp.Blockchains[0].BlockchainID = "invalid" diff --git
a/sdk/ts/src/core/types.ts b/sdk/ts/src/core/types.ts index 69a8b62a7..c7d18ad8d 100644 --- a/sdk/ts/src/core/types.ts +++ b/sdk/ts/src/core/types.ts @@ -166,6 +166,7 @@ export interface Blockchain { id: bigint; // uint64 channelHubAddress: Address; blockStep: bigint; // uint64 + confirmationDelaySecs: number; // seconds; 0 means gate is disabled } export interface Token { diff --git a/sdk/ts/src/rpc/types.ts b/sdk/ts/src/rpc/types.ts index 70890b5c7..8db60945d 100644 --- a/sdk/ts/src/rpc/types.ts +++ b/sdk/ts/src/rpc/types.ts @@ -195,6 +195,8 @@ export interface BlockchainInfoV1 { blockchain_id: string; // uint64 as string /** Channel hub contract address on this network */ channel_hub_address: Address; + /** Seconds the node waits before processing any on-chain event (e.g. crediting a deposit); 0 means gate is disabled */ + confirmation_delay_secs?: number; } // ============================================================================ diff --git a/sdk/ts/src/utils.ts b/sdk/ts/src/utils.ts index a836891f8..83df72f90 100644 --- a/sdk/ts/src/utils.ts +++ b/sdk/ts/src/utils.ts @@ -42,6 +42,7 @@ export function transformNodeConfig(resp: API.NodeV1GetConfigResponse): core.Nod id: BigInt(info.blockchain_id), channelHubAddress: info.channel_hub_address as Address, blockStep: 0n, // Not provided in RPC response + confirmationDelaySecs: info.confirmation_delay_secs ??
0, })); return { diff --git a/sdk/ts/test/unit/__snapshots__/public-api-drift.test.ts.snap b/sdk/ts/test/unit/__snapshots__/public-api-drift.test.ts.snap index 7f61abd57..3c41f70bd 100644 --- a/sdk/ts/test/unit/__snapshots__/public-api-drift.test.ts.snap +++ b/sdk/ts/test/unit/__snapshots__/public-api-drift.test.ts.snap @@ -506,6 +506,7 @@ exports[`SDK public runtime API drift guard keeps root TypeScript public API sig "properties": [ "blockStep: bigint", "channelHubAddress: Address", + "confirmationDelaySecs: number", "id: bigint", "name: string", ], @@ -548,6 +549,7 @@ exports[`SDK public runtime API drift guard keeps root TypeScript public API sig "properties": [ "blockchain_id: string", "channel_hub_address: Address", + "confirmation_delay_secs: number", "name: string", ], "signatures": [], From 1d1b8308422fd5bb5fd8af69f9a68947bc2b6dae Mon Sep 17 00:00:00 2001 From: nksazonov Date: Mon, 8 Jun 2026 14:37:35 +0200 Subject: [PATCH 04/23] feat(nitronode): implement IsContractEventProcessed pre-check in reactor (§6.6) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- nitronode/main.go | 2 +- nitronode/reorg-fix-spec.md | 42 +++++++++- nitronode/store/database/contract_event.go | 16 ++++ nitronode/store/database/interface.go | 6 ++ pkg/blockchain/evm/channel_hub_reactor.go | 30 ++++++- .../evm/channel_hub_reactor_test.go | 79 ++++++++++++++++++- 6 files changed, 167 insertions(+), 8 deletions(-) diff --git a/nitronode/main.go b/nitronode/main.go index 62d101c4f..8ffd51e9a 100644 --- a/nitronode/main.go +++ b/nitronode/main.go @@ -121,7 +121,7 @@ func main() { return wrapInTx(func(s database.DatabaseStore) error { return h(s) }) } - reactor := evm.NewChannelHubReactor(b.ID, bb.StateSigner.PublicKey().Address().String(), eventHandlerService, bb.MemoryStore, useCHRStoreInTx) + reactor := evm.NewChannelHubReactor(b.ID, bb.StateSigner.PublicKey().Address().String(), eventHandlerService, 
bb.MemoryStore, useCHRStoreInTx, bb.DbStore) reactor.SetOnEventProcessed(bb.RuntimeMetrics.IncBlockchainEvent) blockTimestampFetcher := func(blockHash common.Hash) (time.Time, error) { diff --git a/nitronode/reorg-fix-spec.md b/nitronode/reorg-fix-spec.md index c5e13eb96..0848b2e15 100644 --- a/nitronode/reorg-fix-spec.md +++ b/nitronode/reorg-fix-spec.md @@ -260,7 +260,9 @@ When `Removed: true` arrives in the Pusher: ### 6.6 Reactor defense-in-depth: skip re-delivered events -When the gate lets a re-added event through (same tx re-mined in a new block after a reorg, confirmed by a fresh timer), the reactor would attempt to process an event it has already committed. Currently this surfaces as a DB constraint-violation error and a full transaction rollback — noisy and potentially confusing. +When a re-added event reaches the reactor (same tx re-mined in a new block after a reorg, confirmed by a fresh gate timer), the reactor attempts to process an event it has already committed. This guard converts what is currently a DB constraint-violation error and a full transaction rollback into a clean, explicit logged exit. + +**Important limitation:** this guard identifies events by `(txHash, logIndex, blockchainID)`, where `log_index` is a **block-level** index in go-ethereum — the position of this log among all logs in the entire block, across all transactions. If a transaction is re-mined in a new block where different transactions precede it, its logs receive different block-level `log_index` values. The new `(txHash, newLogIndex, blockchainID)` tuple does not match any committed row, so `IsContractEventProcessed` returns `false` and **the reorged event passes through this check**. In that case the reactor's business-logic idempotency is the actual guard (see below). This guard therefore only catches exact re-deliveries — cases where `log_index` is unchanged. 
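The limitation described above can be illustrated with a small sketch. It is not the node's storage code: `eventKey` and `memStore` are hypothetical in-memory stand-ins for the `contract_events` unique key and its lookup, assumed here only to show which re-deliveries a `(txHash, logIndex, blockchainID)` key catches and which it misses.

```go
package main

import (
	"fmt"
	"strings"
)

// eventKey mirrors the (txHash, logIndex, blockchainID) tuple used by the pre-check.
// logIndex is the block-level index, as in go-ethereum's types.Log.Index.
type eventKey struct {
	txHash       string
	logIndex     uint32
	blockchainID uint64
}

// memStore is a hypothetical in-memory stand-in for the contract_events table.
type memStore struct {
	committed map[eventKey]struct{}
}

func newMemStore() *memStore {
	return &memStore{committed: make(map[eventKey]struct{})}
}

// store records an event as committed; the tx hash is lower-cased, as the DB layer does.
func (s *memStore) store(txHash string, logIndex uint32, chainID uint64) {
	s.committed[eventKey{strings.ToLower(txHash), logIndex, chainID}] = struct{}{}
}

// isProcessed reports whether the exact tuple has been committed before.
func (s *memStore) isProcessed(txHash string, logIndex uint32, chainID uint64) bool {
	_, ok := s.committed[eventKey{strings.ToLower(txHash), logIndex, chainID}]
	return ok
}

func main() {
	s := newMemStore()
	// Event committed from the original block, at block-level log position 7.
	s.store("0xAABBCC", 7, 137)

	// Exact re-delivery: same tx, same block-level position. The guard fires.
	fmt.Println(s.isProcessed("0xaabbcc", 7, 137)) // true

	// Same tx re-mined after a reorg behind different transactions: position shifts to 11.
	// The tuple no longer matches, so the guard does not fire.
	fmt.Println(s.isProcessed("0xaabbcc", 11, 137)) // false
}
```

The second lookup is the case that falls through to business-logic idempotency.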
Add a new method to `ChannelHubReactorStore`: @@ -268,21 +270,53 @@ Add a new method to `ChannelHubReactorStore`: // IsContractEventProcessed reports whether an event identified by // (txHash, logIndex, blockchainID) has already been committed, // regardless of which block it appeared in. +// NOTE: uses block-level logIndex — does not detect reorged events +// where the same tx re-mines with a different block-level log position. IsContractEventProcessed(txHash string, logIndex uint32, blockchainID uint64) (bool, error) ``` At the top of `HandleEvent`, before entering `useStoreInTx`, call this method. If the event is already committed, log at **`INFO`** ("skipping re-delivered event, already committed") and return `nil` immediately. No transaction is opened; no state is touched. -The existing unique constraint on `(transaction_hash, log_index, blockchain_id)` in `contract_events` remains as the definitive safety net. This pre-check converts the constraint-violation rollback path into a clean, explicit, logged early exit that also serves as the idempotency guard for the reconciliation re-scan path. +Reorged events that pass through this check are still neutralized by the reactor's **business-logic idempotency**: + +- `HandleHomeChannelCreated` has an explicit early-return when the channel is already open. +- `HandleHomeChannelCheckpointed` and `RefreshUserEnforcedBalance` use set-semantics (overwrite, not accumulate). +- The `StoreContractEvent` unique constraint on `(transaction_hash, log_index, blockchain_id)` remains as the final backstop for the case where `log_index` happens to be unchanged. + +The value of `IsContractEventProcessed` is therefore: + +1. **Noise reduction for exact re-deliveries** — converts a constraint-violation rollback (logged as an error by the gate poller) into a clean INFO exit with no DB transaction opened. +2. 
**Correctness for the reconciliation walk (§4.4)** — when the node replays already-processed historical events on startup, every re-delivered event would otherwise produce a constraint-violation error and potentially stall the walk. This pre-check makes the reconciliation path viable. Together, §6.5 and §6.6 produce two complementary log signals: | Signal | Source | Level | Meaning | | --- | --- | --- | --- | | "post-gate reorg detected for event X" | Gate | WARN | Committed block was reorged; residual-risk scenario is active | -| "skipping re-delivered event X" | Reactor | INFO | Same tx re-mined; reactor correctly skips it | +| "skipping re-delivered event X" | Reactor | INFO | Same tx re-mined at same block position; reactor correctly skips it | + +If the operator sees the WARN but never the INFO, either the transaction was not re-mined, or it was re-mined at a different block position (this check did not fire; business-logic idempotency handled it silently). + +#### Reorg-safe idempotency — separate task + +To make the idempotency check itself robust to reorged events regardless of block position, the idempotency key must be stable across re-mining. The block-level `log_index` is not stable; a **tx-relative log index** is. + +The tx-relative log index is the 0-based position of a log within its own transaction's emitted logs. It is invariant: the same transaction always emits the same logs in the same order, so its tx-relative indices never change across reorgs. The EVM guarantees that all logs of a transaction arrive consecutively in ascending block-level order, so the tx-relative index can be computed in-process as: + +``` +tx_log_index = l.Index - min(l.Index for all logs of l.TxHash in this block) +``` + +No RPC call is required — the minimum is established by the first log of each transaction seen in a block, which always arrives before subsequent logs of the same transaction. 
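The formula above can be sketched in Go as follows. This is a minimal illustration, not the reactor's implementation: `logRef` and `txLogIndexer` are hypothetical names, and the sketch assumes that the logs of one transaction are delivered in ascending block-level order. It also assumes the "first log seen" is the first log the listener delivers for that transaction, so the result is an offset from that log.

```go
package main

import "fmt"

// logRef is a hypothetical, trimmed-down stand-in for go-ethereum's types.Log,
// keeping only the fields this sketch needs.
type logRef struct {
	BlockHash string
	TxHash    string
	Index     uint // block-level log index
}

// txLogIndexer derives a tx-relative log index without any RPC call.
// Assumption: logs of one transaction arrive in ascending block-level order;
// if that were violated, the subtraction below would underflow.
type txLogIndexer struct {
	currentBlock string
	firstIndex   map[string]uint // txHash -> block-level index of the first log seen in currentBlock
}

func newTxLogIndexer() *txLogIndexer {
	return &txLogIndexer{firstIndex: make(map[string]uint)}
}

// TxLogIndex returns the offset of l from the first log seen for the same
// transaction in the same block.
func (x *txLogIndexer) TxLogIndex(l logRef) uint32 {
	// Evict all entries when a new block is first seen.
	if l.BlockHash != x.currentBlock {
		x.currentBlock = l.BlockHash
		x.firstIndex = make(map[string]uint)
	}
	base, seen := x.firstIndex[l.TxHash]
	if !seen {
		base = l.Index
		x.firstIndex[l.TxHash] = base
	}
	return uint32(l.Index - base)
}

func main() {
	idx := newTxLogIndexer()

	// Original inclusion: the tx emits two logs at block-level positions 5 and 6.
	fmt.Println(idx.TxLogIndex(logRef{BlockHash: "0xblockA", TxHash: "0xabc", Index: 5})) // 0
	fmt.Println(idx.TxLogIndex(logRef{BlockHash: "0xblockA", TxHash: "0xabc", Index: 6})) // 1

	// After a reorg the same tx is re-mined at positions 12 and 13.
	// The block-level indices changed; the tx-relative ones did not.
	fmt.Println(idx.TxLogIndex(logRef{BlockHash: "0xblockB", TxHash: "0xabc", Index: 12})) // 0
	fmt.Println(idx.TxLogIndex(logRef{BlockHash: "0xblockB", TxHash: "0xabc", Index: 13})) // 1
}
```

Keying the map by transaction hash and resetting it on each new block hash is one way to realise the `(blockHash, txHash)` map with eviction; a map keyed on both values would work equally well.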
+ +Implementing this requires: + +- **DB migration**: add `tx_log_index` column to `contract_events`; replace the unique index `(transaction_hash, log_index, blockchain_id)` with `(transaction_hash, tx_log_index, blockchain_id)`. +- **`BlockchainEvent` struct**: add `TxLogIndex uint32` field. +- **Reactor**: maintain a small in-memory map `(blockHash, txHash) → minBlockLogIndex` to compute `tx_log_index` for each incoming event; evict entries when a new block is first seen. +- **`IsContractEventProcessed` and `StoreContractEvent`**: operate on `tx_log_index` instead of `log_index`. -If the operator sees the WARN but never the INFO, the transaction was not re-mined — the stale DB state from §2.1 is in effect. +**This is a separate task.** It is not part of the current confirmation-gate scope. Until it is implemented, the reactor relies on business-logic idempotency for the reorged-different-position case, which is correct but not explicitly guarded at the storage layer. ### 6.7 Block timestamp cache diff --git a/nitronode/store/database/contract_event.go b/nitronode/store/database/contract_event.go index cdc5af2db..7f22ffa51 100644 --- a/nitronode/store/database/contract_event.go +++ b/nitronode/store/database/contract_event.go @@ -55,6 +55,22 @@ func (s *DBStore) GetLatestContractEventBlockNumber(contractAddress string, bloc return blockNumber, nil } +// IsContractEventProcessed reports whether an event identified by (txHash, logIndex, blockchainID) +// has already been committed, regardless of which block it appeared in. +func (s *DBStore) IsContractEventProcessed(txHash string, logIndex uint32, blockchainID uint64) (bool, error) { + var ev ContractEvent + err := s.db.Where("transaction_hash = ? AND log_index = ? AND blockchain_id = ?", + strings.ToLower(txHash), logIndex, blockchainID). 
+ Take(&ev).Error + if errors.Is(err, gorm.ErrRecordNotFound) { + return false, nil + } + if err != nil { + return false, err + } + return true, nil +} + // IsContractEventPresent checks whether a specific contract event has already been stored. func (s *DBStore) IsContractEventPresent(blockchainID, blockNumber uint64, txHash string, logIndex uint32) (bool, error) { var ev ContractEvent diff --git a/nitronode/store/database/interface.go b/nitronode/store/database/interface.go index 80099383d..f27281da7 100644 --- a/nitronode/store/database/interface.go +++ b/nitronode/store/database/interface.go @@ -296,4 +296,10 @@ type DatabaseStore interface { // IsContractEventPresent checks if a specific contract event has already been stored. IsContractEventPresent(blockchainID, blockNumber uint64, txHash string, logIndex uint32) (isPresent bool, err error) + + // IsContractEventProcessed reports whether an event identified by (txHash, logIndex, blockchainID) + // has already been committed, regardless of which block it appeared in. + // NOTE: uses block-level logIndex — does not detect reorged events where the same tx + // re-mines with a different block-level log position (see reorg-fix-spec.md §6.6). + IsContractEventProcessed(txHash string, logIndex uint32, blockchainID uint64) (bool, error) } diff --git a/pkg/blockchain/evm/channel_hub_reactor.go b/pkg/blockchain/evm/channel_hub_reactor.go index 1ec7aaa4d..d63831e59 100644 --- a/pkg/blockchain/evm/channel_hub_reactor.go +++ b/pkg/blockchain/evm/channel_hub_reactor.go @@ -112,6 +112,12 @@ type ChannelHubReactorStore interface { // StoreContractEvent persists a blockchain event to the database. StoreContractEvent(ev core.BlockchainEvent) error + + // IsContractEventProcessed reports whether an event identified by (txHash, logIndex, blockchainID) + // has already been committed, regardless of which block it appeared in. 
+ // NOTE: uses block-level logIndex — does not detect reorged events where the same tx + // re-mines with a different block-level log position (see reorg-fix-spec.md §6.6). + IsContractEventProcessed(txHash string, logIndex uint32, blockchainID uint64) (bool, error) } var channelHubAbi *abi.ABI @@ -148,16 +154,18 @@ type ChannelHubReactor struct { nodeAddress string eventHandler core.ChannelHubEventHandler assetStore AssetStore + store ChannelHubReactorStore // non-transactional; used for the pre-check in HandleEvent useStoreInTx ChannelHubReactorStoreTxProvider onEventProcessed func(blockchainID uint64, success bool) } -func NewChannelHubReactor(blockchainID uint64, nodeAddress string, eventHandler core.ChannelHubEventHandler, assetStore AssetStore, useStoreInTx ChannelHubReactorStoreTxProvider) *ChannelHubReactor { +func NewChannelHubReactor(blockchainID uint64, nodeAddress string, eventHandler core.ChannelHubEventHandler, assetStore AssetStore, useStoreInTx ChannelHubReactorStoreTxProvider, store ChannelHubReactorStore) *ChannelHubReactor { return &ChannelHubReactor{ blockchainID: blockchainID, nodeAddress: nodeAddress, eventHandler: eventHandler, assetStore: assetStore, + store: store, useStoreInTx: useStoreInTx, } } @@ -178,7 +186,25 @@ func (r *ChannelHubReactor) HandleEvent(ctx context.Context, l types.Log) error } logger.Debug("received event", "name", eventName, "blockNumber", l.BlockNumber, "txHash", l.TxHash.String(), "logIndex", l.Index) - err := r.useStoreInTx(func(store ChannelHubReactorStore) error { + // Pre-check: skip already-committed events without opening a transaction. + // This converts the constraint-violation rollback path into a clean early exit and + // is required for the reconciliation walk (§4.4) to replay events without errors. + // Reorged events with a changed block-level logIndex pass through this check; + // they are handled by the reactor's business-logic idempotency (see reorg-fix-spec.md §6.6). 
+ processed, err := r.store.IsContractEventProcessed(l.TxHash.String(), uint32(l.Index), r.blockchainID) + if err != nil { + logger.Warn("failed to check if contract event was already processed, proceeding", + "error", err, "txHash", l.TxHash.String(), "logIndex", l.Index, "chainID", r.blockchainID) + } else if processed { + logger.Info("skipping re-delivered event, already committed", + "event", eventName, "txHash", l.TxHash.String(), "logIndex", l.Index, "chainID", r.blockchainID) + if r.onEventProcessed != nil { + r.onEventProcessed(r.blockchainID, true) + } + return nil + } + + err = r.useStoreInTx(func(store ChannelHubReactorStore) error { var err error switch eventID { case channelHubAbi.Events["NodeBalanceUpdated"].ID: diff --git a/pkg/blockchain/evm/channel_hub_reactor_test.go b/pkg/blockchain/evm/channel_hub_reactor_test.go index 62b6abd16..1a5a26f31 100644 --- a/pkg/blockchain/evm/channel_hub_reactor_test.go +++ b/pkg/blockchain/evm/channel_hub_reactor_test.go @@ -136,6 +136,11 @@ func (m *mockChannelHubStore) RecordTransaction(tx core.Transaction, application return args.Error(0) } +func (m *mockChannelHubStore) IsContractEventProcessed(txHash string, logIndex uint32, blockchainID uint64) (bool, error) { + args := m.Called(txHash, logIndex, blockchainID) + return args.Bool(0), args.Error(1) +} + // mockChannelHubEventHandler captures events dispatched by the reactor. type mockChannelHubEventHandler struct { mock.Mock @@ -248,11 +253,14 @@ func packNonIndexed(t *testing.T, eventName string, args ...interface{}) []byte } // newReactor creates a ChannelHubReactor wired to the provided mocks. +// Sets up a default IsContractEventProcessed expectation that returns (false, nil) +// so existing tests don't need to set it up individually. 
func newReactor(blockchainID uint64, nodeAddress string, handler *mockChannelHubEventHandler, assetStore *MockAssetStore, store *mockChannelHubStore) *ChannelHubReactor { + store.On("IsContractEventProcessed", mock.Anything, mock.Anything, mock.Anything).Return(false, nil) useStoreInTx := func(fn ChannelHubReactorStoreTxHandler) error { return fn(store) } - return NewChannelHubReactor(blockchainID, nodeAddress, handler, assetStore, useStoreInTx) + return NewChannelHubReactor(blockchainID, nodeAddress, handler, assetStore, useStoreInTx, store) } // expectStoreContractEvent sets up the mock expectation for StoreContractEvent. @@ -1039,6 +1047,75 @@ func TestChannelHubReactor_HandleEscrowDepositsPurged(t *testing.T) { store.AssertExpectations(t) } +func TestChannelHubReactor_HandleEvent_PreCheckError(t *testing.T) { + blockchainID := uint64(1) + nodeAddr := "0x1111111111111111111111111111111111111111" + tokenAddr := common.HexToAddress("0xA0b86991c6218b36c1d19D4a2e9Eb0cE3606eB48") + amount := big.NewInt(1_000_000) + + logEntry := types.Log{ + Topics: []common.Hash{ + channelHubAbi.Events["NodeBalanceUpdated"].ID, + common.BytesToHash(tokenAddr.Bytes()), + }, + Data: packNonIndexed(t, "NodeBalanceUpdated", amount), + BlockNumber: 100, + TxHash: common.HexToHash("0xaabbcc"), + Index: 0, + } + + store := new(mockChannelHubStore) + handler := new(mockChannelHubEventHandler) + assetStore := new(MockAssetStore) + + // Pre-check returns an error — reactor must fall through and process normally. 
+ store.On("IsContractEventProcessed", mock.Anything, mock.Anything, mock.Anything).Return(false, assert.AnError) + assetStore.On("GetTokenAsset", blockchainID, tokenAddr.String()).Return("usdc", nil) + assetStore.On("GetTokenDecimals", blockchainID, tokenAddr.String()).Return(uint8(6), nil) + handler.On("HandleNodeBalanceUpdated", mock.Anything, mock.Anything, mock.Anything).Return(nil) + expectStoreContractEvent(store, "NodeBalanceUpdated", 100, blockchainID) + + useStoreInTx := func(fn ChannelHubReactorStoreTxHandler) error { return fn(store) } + reactor := NewChannelHubReactor(blockchainID, nodeAddr, handler, assetStore, useStoreInTx, store) + + err := reactor.HandleEvent(context.Background(), logEntry) + require.NoError(t, err) + + // Business logic and StoreContractEvent must still be called. + handler.AssertCalled(t, "HandleNodeBalanceUpdated", mock.Anything, mock.Anything, mock.Anything) + store.AssertExpectations(t) +} + +func TestChannelHubReactor_HandleEvent_AlreadyProcessed(t *testing.T) { + blockchainID := uint64(1) + nodeAddr := "0x1111111111111111111111111111111111111111" + txHash := common.HexToHash("0xaabbcc") + + logEntry := types.Log{ + Topics: []common.Hash{channelHubAbi.Events["NodeBalanceUpdated"].ID}, + BlockNumber: 100, + TxHash: txHash, + Index: 0, + } + + store := new(mockChannelHubStore) + handler := new(mockChannelHubEventHandler) + assetStore := new(MockAssetStore) + + // Pre-check returns true — event already committed. + store.On("IsContractEventProcessed", txHash.String(), uint32(0), blockchainID).Return(true, nil) + + useStoreInTx := func(fn ChannelHubReactorStoreTxHandler) error { return fn(store) } + reactor := NewChannelHubReactor(blockchainID, nodeAddr, handler, assetStore, useStoreInTx, store) + + err := reactor.HandleEvent(context.Background(), logEntry) + require.NoError(t, err) + + // Neither business logic nor StoreContractEvent should be called. 
+ handler.AssertNotCalled(t, "HandleNodeBalanceUpdated", mock.Anything, mock.Anything, mock.Anything) + store.AssertNotCalled(t, "StoreContractEvent", mock.Anything) +} + func TestChannelHubReactor_UnknownEvent(t *testing.T) { blockchainID := uint64(1) nodeAddr := "0x1111111111111111111111111111111111111111" From ccadb6c9c872e43e77bb71d6f2fe9cbbd03a566f Mon Sep 17 00:00:00 2001 From: nksazonov Date: Mon, 8 Jun 2026 15:19:59 +0200 Subject: [PATCH 05/23] feat(nitronode): implement startup reconciliation walk (§4.4) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ...0000_add_block_hash_to_contract_events.sql | 7 + nitronode/reorg-fix-spec.md | 10 + nitronode/store/database/contract_event.go | 36 ++++ nitronode/store/database/interface.go | 9 + pkg/blockchain/evm/channel_hub_reactor.go | 1 + pkg/blockchain/evm/interface.go | 13 +- pkg/blockchain/evm/listener.go | 9 +- pkg/blockchain/evm/listener_test.go | 12 +- pkg/blockchain/evm/mock_test.go | 18 ++ pkg/blockchain/evm/reconciler.go | 84 ++++++++ pkg/blockchain/evm/reconciler_test.go | 185 ++++++++++++++++++ pkg/core/event.go | 1 + 12 files changed, 376 insertions(+), 9 deletions(-) create mode 100644 nitronode/config/migrations/postgres/20260608000000_add_block_hash_to_contract_events.sql create mode 100644 pkg/blockchain/evm/reconciler.go create mode 100644 pkg/blockchain/evm/reconciler_test.go diff --git a/nitronode/config/migrations/postgres/20260608000000_add_block_hash_to_contract_events.sql b/nitronode/config/migrations/postgres/20260608000000_add_block_hash_to_contract_events.sql new file mode 100644 index 000000000..5dc48625f --- /dev/null +++ b/nitronode/config/migrations/postgres/20260608000000_add_block_hash_to_contract_events.sql @@ -0,0 +1,7 @@ +-- +goose Up + +ALTER TABLE contract_events ADD COLUMN block_hash CHAR(66) NOT NULL DEFAULT ''; + +-- +goose Down + +ALTER TABLE contract_events DROP COLUMN block_hash; diff 
--git a/nitronode/reorg-fix-spec.md b/nitronode/reorg-fix-spec.md index 0848b2e15..1d152f3c9 100644 --- a/nitronode/reorg-fix-spec.md +++ b/nitronode/reorg-fix-spec.md @@ -97,6 +97,16 @@ This means out-of-order delivery requires no special case beyond the normal path Before the reconciliation logic described below can function, `block_hash` must be added as a column to `contract_events` and to the `core.BlockchainEvent` struct. The value is available in `types.Log.BlockHash` at the time the gate calls the reactor. Without this column, reorg detection in steps 2–4 is not possible. +**Why `block_hash` is the minimal required addition — and why alternatives fail:** + +The reconciliation walk needs to answer one question per stored block: "is this specific block still in the canonical chain?" The only RPC call that answers it directly is `eth_getBlockByHash(hash)` — it returns `null` if the block is no longer canonical. Without the stored hash, two alternatives were evaluated and both fail: + +- **`block_number` alone is insufficient.** After a reorg, a *different* block can occupy the same height. Calling `eth_getBlockByNumber(storedBlockNumber)` always returns a block — but it may be a new block from the reorged fork. Without the original hash there is no way to tell whether the block returned is the one the reactor processed. + +- **`transaction_hash` via `eth_getTransactionReceipt` is insufficient.** A block can be reorged out even if every one of its transactions was re-mined in a new block at the same height. In that case all receipt lookups return `blockNumber` matching the stored value, but the original block is gone and the stored DB state no longer corresponds to the canonical chain. Additionally, the backward walk (step 3) must traverse every stored *block* in descending order; rows in `contract_events` only exist for blocks that contained a `ChannelHub` event. 
A reorg that diverged entirely within a gap — blocks with no relevant events — is invisible to a tx-receipt-based walk. + +`block_hash` is a single `CHAR(66)` column. Its addition enables exact, O(1)-per-step canonicality checks and is the only approach that handles all reorg scenarios correctly. + #### Definition: latest processed block The **latest processed block** for a chain is the highest block number at which the reactor successfully committed at least one event to the database — identical to the listener's existing startup cursor (`MAX(block_number)` in `contract_events` for this `blockchain_id` and contract address, computed by `GetLatestContractEventBlockNumber`). This is distinct from the highest block the listener ever *saw*: the listener may have seen many blocks that contained no relevant events and therefore left no `contract_events` rows. diff --git a/nitronode/store/database/contract_event.go b/nitronode/store/database/contract_event.go index 7f22ffa51..d0c634284 100644 --- a/nitronode/store/database/contract_event.go +++ b/nitronode/store/database/contract_event.go @@ -17,6 +17,7 @@ type ContractEvent struct { BlockchainID uint64 `gorm:"column:blockchain_id"` Name string `gorm:"column:name"` BlockNumber uint64 `gorm:"column:block_number"` + BlockHash string `gorm:"column:block_hash"` TransactionHash string `gorm:"column:transaction_hash"` LogIndex uint32 `gorm:"column:log_index"` CreatedAt time.Time `gorm:"column:created_at"` @@ -34,6 +35,7 @@ func (s *DBStore) StoreContractEvent(ev core.BlockchainEvent) error { BlockchainID: ev.BlockchainID, Name: ev.Name, BlockNumber: ev.BlockNumber, + BlockHash: ev.BlockHash, TransactionHash: strings.ToLower(ev.TransactionHash), LogIndex: ev.LogIndex, CreatedAt: time.Now(), @@ -71,6 +73,40 @@ func (s *DBStore) IsContractEventProcessed(txHash string, logIndex uint32, block return true, nil } +// GetLatestContractEventBlockHashAndNumber returns the block_number and block_hash of the +// highest stored event for 
the given contract. Returns (0, "", nil) when no rows exist. +func (s *DBStore) GetLatestContractEventBlockHashAndNumber(contractAddress string, blockchainID uint64) (uint64, string, error) { + var ev ContractEvent + err := s.db.Where("blockchain_id = ? AND contract_address = ?", blockchainID, strings.ToLower(contractAddress)). + Order("block_number DESC"). + First(&ev).Error + if errors.Is(err, gorm.ErrRecordNotFound) { + return 0, "", nil + } + if err != nil { + return 0, "", err + } + return ev.BlockNumber, ev.BlockHash, nil +} + +// GetPreviousDistinctBlockHash returns the block_number and block_hash of the highest +// stored event whose block_number is strictly below belowBlockNumber. Returns (0, "", nil) +// when no such row exists (signals genesis fallback). +func (s *DBStore) GetPreviousDistinctBlockHash(contractAddress string, blockchainID uint64, belowBlockNumber uint64) (uint64, string, error) { + var ev ContractEvent + err := s.db.Where("blockchain_id = ? AND contract_address = ? AND block_number < ?", + blockchainID, strings.ToLower(contractAddress), belowBlockNumber). + Order("block_number DESC"). + First(&ev).Error + if errors.Is(err, gorm.ErrRecordNotFound) { + return 0, "", nil + } + if err != nil { + return 0, "", err + } + return ev.BlockNumber, ev.BlockHash, nil +} + // IsContractEventPresent checks whether a specific contract event has already been stored. func (s *DBStore) IsContractEventPresent(blockchainID, blockNumber uint64, txHash string, logIndex uint32) (bool, error) { var ev ContractEvent diff --git a/nitronode/store/database/interface.go b/nitronode/store/database/interface.go index f27281da7..24fcbfef3 100644 --- a/nitronode/store/database/interface.go +++ b/nitronode/store/database/interface.go @@ -302,4 +302,13 @@ type DatabaseStore interface { // NOTE: uses block-level logIndex — does not detect reorged events where the same tx // re-mines with a different block-level log position (see reorg-fix-spec.md §6.6). 
IsContractEventProcessed(txHash string, logIndex uint32, blockchainID uint64) (bool, error) + + // GetLatestContractEventBlockHashAndNumber returns the block_number and block_hash of + // the highest stored event for the given contract. Returns (0, "", nil) when no rows exist. + GetLatestContractEventBlockHashAndNumber(contractAddress string, blockchainID uint64) (blockNumber uint64, blockHash string, err error) + + // GetPreviousDistinctBlockHash returns the block_number and block_hash of the highest + // stored event with block_number strictly below belowBlockNumber. Returns (0, "", nil) + // when no such row exists. + GetPreviousDistinctBlockHash(contractAddress string, blockchainID uint64, belowBlockNumber uint64) (blockNumber uint64, blockHash string, err error) } diff --git a/pkg/blockchain/evm/channel_hub_reactor.go b/pkg/blockchain/evm/channel_hub_reactor.go index d63831e59..da99a6d8e 100644 --- a/pkg/blockchain/evm/channel_hub_reactor.go +++ b/pkg/blockchain/evm/channel_hub_reactor.go @@ -271,6 +271,7 @@ func (r *ChannelHubReactor) HandleEvent(ctx context.Context, l types.Log) error ContractAddress: l.Address.Hex(), TransactionHash: l.TxHash.String(), LogIndex: uint32(l.Index), + BlockHash: l.BlockHash.Hex(), }); err != nil { logger.Warn("error storing contract event", "error", err, "event", eventName, "blockNumber", l.BlockNumber, "txHash", l.TxHash.String(), "logIndex", l.Index) return errors.Wrap(err, "error storing contract event") diff --git a/pkg/blockchain/evm/interface.go b/pkg/blockchain/evm/interface.go index 23e5bf639..a13477aab 100644 --- a/pkg/blockchain/evm/interface.go +++ b/pkg/blockchain/evm/interface.go @@ -5,17 +5,26 @@ import ( ethereum "github.com/ethereum/go-ethereum" "github.com/ethereum/go-ethereum/accounts/abi/bind" + "github.com/ethereum/go-ethereum/common" "github.com/ethereum/go-ethereum/core/types" ) type HandleEvent func(ctx context.Context, eventLog types.Log) error -// ContractEventGetter is used by Listener for resumption 
and deduplication. +// ContractEventGetter is used by Listener for resumption, deduplication, and +// reconciliation-walk queries. type ContractEventGetter interface { // GetLatestContractEventBlockNumber returns the block to resume from (0 = start fresh). GetLatestContractEventBlockNumber(contractAddress string, blockchainID uint64) (lastBlock uint64, err error) // IsContractEventPresent checks whether a specific event was already processed. IsContractEventPresent(blockchainID, blockNumber uint64, txHash string, logIndex uint32) (isPresent bool, err error) + // GetLatestContractEventBlockHashAndNumber returns the block_number and block_hash of + // the highest stored event. Returns (0, "", nil) when no rows exist. + GetLatestContractEventBlockHashAndNumber(contractAddress string, blockchainID uint64) (blockNumber uint64, blockHash string, err error) + // GetPreviousDistinctBlockHash returns the block_number and block_hash of the highest + // stored event with block_number strictly below belowBlockNumber. Returns (0, "", nil) + // when no such row exists (genesis fallback). + GetPreviousDistinctBlockHash(contractAddress string, blockchainID uint64, belowBlockNumber uint64) (blockNumber uint64, blockHash string, err error) } type AssetStore interface { @@ -35,4 +44,6 @@ type AssetStore interface { type EVMClient interface { ethereum.ChainStateReader bind.ContractBackend + // HeaderByHash is used by the reconciliation walk to verify block canonicality. 
+ HeaderByHash(ctx context.Context, hash common.Hash) (*types.Header, error) } diff --git a/pkg/blockchain/evm/listener.go b/pkg/blockchain/evm/listener.go index 2c07aee26..10a01eec5 100644 --- a/pkg/blockchain/evm/listener.go +++ b/pkg/blockchain/evm/listener.go @@ -9,7 +9,6 @@ import ( "time" "github.com/ethereum/go-ethereum" - "github.com/ethereum/go-ethereum/accounts/abi/bind" "github.com/ethereum/go-ethereum/common" "github.com/ethereum/go-ethereum/core/types" "github.com/layer-3/nitrolite/pkg/log" @@ -26,7 +25,7 @@ const ( // for graceful shutdown. type Listener struct { contractAddress common.Address - client bind.ContractBackend + client EVMClient blockchainID uint64 blockStep uint64 // max blocks per FilterLogs call during reconciliation logger log.Logger @@ -36,7 +35,7 @@ type Listener struct { // NewListener creates a Listener. blockStep controls how many blocks are fetched // per RPC call during historical reconciliation. -func NewListener(contractAddress common.Address, client bind.ContractBackend, blockchainID uint64, blockStep uint64, logger log.Logger, eventHandler HandleEvent, eventGetter ContractEventGetter) *Listener { +func NewListener(contractAddress common.Address, client EVMClient, blockchainID uint64, blockStep uint64, logger log.Logger, eventHandler HandleEvent, eventGetter ContractEventGetter) *Listener { return &Listener{ contractAddress: contractAddress, client: client, @@ -106,9 +105,9 @@ func (l *Listener) logBackOff(count uint64, originator string) (time.Duration, b // On subscription failure it retries with exponential backoff. Returns non-nil only // when the handler or the event-presence check fails. 
func (l *Listener) listenEvents(ctx context.Context) error { - lastBlock, err := l.eventGetter.GetLatestContractEventBlockNumber(l.contractAddress.String(), l.blockchainID) + lastBlock, err := findCommonAncestor(ctx, l.client, l.eventGetter, l.contractAddress.String(), l.blockchainID, l.logger) if err != nil { - return fmt.Errorf("failed to get latest processed block: %w", err) + return fmt.Errorf("failed to find common ancestor: %w", err) } var backOffCount atomic.Uint64 diff --git a/pkg/blockchain/evm/listener_test.go b/pkg/blockchain/evm/listener_test.go index e65f133fd..a948150f3 100644 --- a/pkg/blockchain/evm/listener_test.go +++ b/pkg/blockchain/evm/listener_test.go @@ -59,7 +59,8 @@ func TestListener_Listen_CurrentEvents(t *testing.T) { addr := common.HexToAddress("0x123") eventGetter := new(MockContractEventGetter) - eventGetter.On("GetLatestContractEventBlockNumber", addr.String(), uint64(1)).Return(uint64(0), nil) + // No stored events → findCommonAncestor returns 0 immediately (genesis). + eventGetter.On("GetLatestContractEventBlockHashAndNumber", addr.String(), uint64(1)).Return(uint64(0), "", nil) ctx, cancel := context.WithCancel(context.Background()) t.Cleanup(cancel) @@ -154,9 +155,10 @@ func TestListener_Listen_HistoricalAndCurrent(t *testing.T) { logger := log.NewNoopLogger() addr := common.HexToAddress("0x123") - // Start from block 100 + // Start from block 100 (canonical — HeaderByHash returns non-nil). 
eventGetter := new(MockContractEventGetter) - eventGetter.On("GetLatestContractEventBlockNumber", addr.String(), uint64(1)).Return(uint64(100), nil) + blockHash100 := common.HexToHash("0xdeadbeef") + eventGetter.On("GetLatestContractEventBlockHashAndNumber", addr.String(), uint64(1)).Return(uint64(100), blockHash100.Hex(), nil) // Historical event at block 105 is not present eventGetter.On("IsContractEventPresent", uint64(1), uint64(105), mock.Anything, uint32(0)).Return(false, nil) // Current event at block 111 — after historical is done, first current event triggers check @@ -184,6 +186,10 @@ func TestListener_Listen_HistoricalAndCurrent(t *testing.T) { listener := NewListener(addr, mockClient, 1, 10, logger, handleEvent, eventGetter) + // findCommonAncestor: block 100 is canonical. + canonicalHeader := &types.Header{Number: big.NewInt(100)} + mockClient.On("HeaderByHash", mock.Anything, blockHash100).Return(canonicalHeader, nil) + // Mock HeaderByNumber (current tip is 110) currentHeader := &types.Header{Number: big.NewInt(110)} mockClient.On("HeaderByNumber", mock.Anything, (*big.Int)(nil)).Return(currentHeader, nil) diff --git a/pkg/blockchain/evm/mock_test.go b/pkg/blockchain/evm/mock_test.go index b96784f80..e562a0b55 100644 --- a/pkg/blockchain/evm/mock_test.go +++ b/pkg/blockchain/evm/mock_test.go @@ -120,6 +120,14 @@ func (m *MockEVMClient) SubscribeFilterLogs(ctx context.Context, query ethereum. 
return args.Get(0).(ethereum.Subscription), args.Error(1) } +func (m *MockEVMClient) HeaderByHash(ctx context.Context, hash common.Hash) (*types.Header, error) { + args := m.Called(ctx, hash) + if args.Get(0) == nil { + return nil, args.Error(1) + } + return args.Get(0).(*types.Header), args.Error(1) +} + // MockContractEventGetter implements ContractEventGetter interface type MockContractEventGetter struct { mock.Mock @@ -135,6 +143,16 @@ func (m *MockContractEventGetter) IsContractEventPresent(blockchainID, blockNumb return args.Bool(0), args.Error(1) } +func (m *MockContractEventGetter) GetLatestContractEventBlockHashAndNumber(contractAddress string, blockchainID uint64) (uint64, string, error) { + args := m.Called(contractAddress, blockchainID) + return args.Get(0).(uint64), args.String(1), args.Error(2) +} + +func (m *MockContractEventGetter) GetPreviousDistinctBlockHash(contractAddress string, blockchainID uint64, belowBlockNumber uint64) (uint64, string, error) { + args := m.Called(contractAddress, blockchainID, belowBlockNumber) + return args.Get(0).(uint64), args.String(1), args.Error(2) +} + // MockAssetStore implements AssetStore interface type MockAssetStore struct { mock.Mock diff --git a/pkg/blockchain/evm/reconciler.go b/pkg/blockchain/evm/reconciler.go new file mode 100644 index 000000000..f70876451 --- /dev/null +++ b/pkg/blockchain/evm/reconciler.go @@ -0,0 +1,84 @@ +package evm + +import ( + "context" + "fmt" + + "github.com/ethereum/go-ethereum/common" + "github.com/layer-3/nitrolite/pkg/log" +) + +// findCommonAncestor determines the last block in the canonical chain that the +// node has already processed. It walks stored block hashes backward until it +// finds one that eth_getBlockByHash confirms is canonical, then returns that +// block number as the safe replay start point. +// +// Returns 0 when no stored events exist or when every stored block has been +// reorged out — in both cases the caller should replay from genesis/start-block. 
+func findCommonAncestor( + ctx context.Context, + client EVMClient, + getter ContractEventGetter, + contractAddress string, + blockchainID uint64, + logger log.Logger, +) (uint64, error) { + blockNum, blockHash, err := getter.GetLatestContractEventBlockHashAndNumber(contractAddress, blockchainID) + if err != nil { + return 0, fmt.Errorf("get latest contract event block hash: %w", err) + } + if blockHash == "" { + // No stored events (blockNum=0) or pre-migration row with no hash (blockNum>0). + // Either way, treat blockNum as the safe canonical resume point. + return blockNum, nil + } + + for { + if ctx.Err() != nil { + return 0, ctx.Err() + } + + hash := common.HexToHash(blockHash) + header, err := client.HeaderByHash(ctx, hash) + if err != nil { + return 0, fmt.Errorf("check canonicality of block %d (%s): %w", blockNum, blockHash, err) + } + + if header != nil { + // This block is still in the canonical chain. + if blockNum != header.Number.Uint64() { + // Sanity check: the block at this hash should have the number we stored. + return 0, fmt.Errorf("block hash %s has unexpected number: stored %d, chain %d", blockHash, blockNum, header.Number.Uint64()) + } + logger.Info("reconciliation: found common ancestor", + "blockchainID", blockchainID, + "blockNumber", blockNum, + "blockHash", blockHash, + ) + return blockNum, nil + } + + // Block was reorged out — walk to the next-older stored block. + logger.Info("reconciliation: block reorged, walking backward", + "blockchainID", blockchainID, + "blockNumber", blockNum, + "blockHash", blockHash, + ) + prevNum, prevHash, err := getter.GetPreviousDistinctBlockHash(contractAddress, blockchainID, blockNum) + if err != nil { + return 0, fmt.Errorf("get previous distinct block hash below %d: %w", blockNum, err) + } + if prevHash == "" { + // No older stored block (prevNum=0) or pre-migration row (prevNum>0). + // Use prevNum as the safe canonical resume point. 
+ logger.Info("reconciliation: reached pre-migration or genesis boundary", + "blockchainID", blockchainID, + "blockNumber", prevNum, + ) + return prevNum, nil + } + + blockNum = prevNum + blockHash = prevHash + } +} diff --git a/pkg/blockchain/evm/reconciler_test.go b/pkg/blockchain/evm/reconciler_test.go new file mode 100644 index 000000000..1fdf88e16 --- /dev/null +++ b/pkg/blockchain/evm/reconciler_test.go @@ -0,0 +1,185 @@ +package evm + +import ( + "context" + "errors" + "math/big" + "testing" + + "github.com/ethereum/go-ethereum/common" + "github.com/ethereum/go-ethereum/core/types" + "github.com/layer-3/nitrolite/pkg/log" + "github.com/stretchr/testify/assert" + "github.com/stretchr/testify/mock" + "github.com/stretchr/testify/require" +) + +const ( + testContract = "0x1234567890abcdef1234567890abcdef12345678" + testBlockchainID = uint64(1) +) + +func newTestLogger() log.Logger { + return log.NewNoopLogger() +} + +// TestFindCommonAncestor_NoStoredEvents verifies that when no contract events exist, +// findCommonAncestor returns 0 (genesis fallback). +func TestFindCommonAncestor_NoStoredEvents(t *testing.T) { + t.Parallel() + + client := new(MockEVMClient) + getter := new(MockContractEventGetter) + + getter.On("GetLatestContractEventBlockHashAndNumber", testContract, testBlockchainID). + Return(uint64(0), "", nil) + + result, err := findCommonAncestor(context.Background(), client, getter, testContract, testBlockchainID, newTestLogger()) + require.NoError(t, err) + assert.Equal(t, uint64(0), result) + client.AssertNotCalled(t, "HeaderByHash") +} + +// TestFindCommonAncestor_LatestBlockCanonical verifies that when the latest stored block +// is still canonical, findCommonAncestor returns that block number with no backward walk. 
+func TestFindCommonAncestor_LatestBlockCanonical(t *testing.T) { + t.Parallel() + + client := new(MockEVMClient) + getter := new(MockContractEventGetter) + + blockHash := common.HexToHash("0xaabbccdd") + getter.On("GetLatestContractEventBlockHashAndNumber", testContract, testBlockchainID). + Return(uint64(500), blockHash.Hex(), nil) + + canonicalHeader := &types.Header{Number: big.NewInt(500)} + client.On("HeaderByHash", mock.Anything, blockHash).Return(canonicalHeader, nil) + + result, err := findCommonAncestor(context.Background(), client, getter, testContract, testBlockchainID, newTestLogger()) + require.NoError(t, err) + assert.Equal(t, uint64(500), result) + getter.AssertNotCalled(t, "GetPreviousDistinctBlockHash") +} + +// TestFindCommonAncestor_SingleReorgDepth verifies that when the latest block is reorged out +// but the previous one is canonical, findCommonAncestor returns the previous block number. +func TestFindCommonAncestor_SingleReorgDepth(t *testing.T) { + t.Parallel() + + client := new(MockEVMClient) + getter := new(MockContractEventGetter) + + reorgedHash := common.HexToHash("0xreorged0") + canonicalHash := common.HexToHash("0xcanon000") + + getter.On("GetLatestContractEventBlockHashAndNumber", testContract, testBlockchainID). + Return(uint64(200), reorgedHash.Hex(), nil) + + // Latest block (200) reorged out. + client.On("HeaderByHash", mock.Anything, reorgedHash).Return(nil, nil) + + // Walk to previous block (190) which is canonical. + getter.On("GetPreviousDistinctBlockHash", testContract, testBlockchainID, uint64(200)). 
+ Return(uint64(190), canonicalHash.Hex(), nil) + + canonicalHeader := &types.Header{Number: big.NewInt(190)} + client.On("HeaderByHash", mock.Anything, canonicalHash).Return(canonicalHeader, nil) + + result, err := findCommonAncestor(context.Background(), client, getter, testContract, testBlockchainID, newTestLogger()) + require.NoError(t, err) + assert.Equal(t, uint64(190), result) +} + +// TestFindCommonAncestor_WalkToGenesis verifies that when all stored blocks are reorged out, +// findCommonAncestor returns 0 (genesis fallback). +func TestFindCommonAncestor_WalkToGenesis(t *testing.T) { + t.Parallel() + + client := new(MockEVMClient) + getter := new(MockContractEventGetter) + + hash300 := common.HexToHash("0x0000300") + hash200 := common.HexToHash("0x0000200") + + getter.On("GetLatestContractEventBlockHashAndNumber", testContract, testBlockchainID). + Return(uint64(300), hash300.Hex(), nil) + + // Block 300 reorged out. + client.On("HeaderByHash", mock.Anything, hash300).Return(nil, nil) + getter.On("GetPreviousDistinctBlockHash", testContract, testBlockchainID, uint64(300)). + Return(uint64(200), hash200.Hex(), nil) + + // Block 200 reorged out. + client.On("HeaderByHash", mock.Anything, common.HexToHash(hash200.Hex())).Return(nil, nil) + getter.On("GetPreviousDistinctBlockHash", testContract, testBlockchainID, uint64(200)). + Return(uint64(0), "", nil) + + result, err := findCommonAncestor(context.Background(), client, getter, testContract, testBlockchainID, newTestLogger()) + require.NoError(t, err) + assert.Equal(t, uint64(0), result) +} + +// TestFindCommonAncestor_PreMigrationLatestRow verifies that when the latest stored row has +// an empty block_hash (pre-migration row), findCommonAncestor returns that block number +// without making any RPC call, treating the row as canonical. 
+func TestFindCommonAncestor_PreMigrationLatestRow(t *testing.T) { + t.Parallel() + + client := new(MockEVMClient) + getter := new(MockContractEventGetter) + + // blockNum=450 but blockHash="" — pre-migration row. + getter.On("GetLatestContractEventBlockHashAndNumber", testContract, testBlockchainID). + Return(uint64(450), "", nil) + + result, err := findCommonAncestor(context.Background(), client, getter, testContract, testBlockchainID, newTestLogger()) + require.NoError(t, err) + assert.Equal(t, uint64(450), result) + client.AssertNotCalled(t, "HeaderByHash") +} + +// TestFindCommonAncestor_PreMigrationMidWalk verifies that when a pre-migration row (empty +// block_hash) is encountered during the backward walk, the walk stops and returns that +// block number rather than making an RPC call with a zero hash. +func TestFindCommonAncestor_PreMigrationMidWalk(t *testing.T) { + t.Parallel() + + client := new(MockEVMClient) + getter := new(MockContractEventGetter) + + reorgedHash := common.HexToHash("0xreorgedX") + + getter.On("GetLatestContractEventBlockHashAndNumber", testContract, testBlockchainID). + Return(uint64(300), reorgedHash.Hex(), nil) + + // Block 300 reorged out. + client.On("HeaderByHash", mock.Anything, reorgedHash).Return(nil, nil) + + // Walk backward hits a pre-migration row with empty hash at block 250. + getter.On("GetPreviousDistinctBlockHash", testContract, testBlockchainID, uint64(300)). + Return(uint64(250), "", nil) + + result, err := findCommonAncestor(context.Background(), client, getter, testContract, testBlockchainID, newTestLogger()) + require.NoError(t, err) + assert.Equal(t, uint64(250), result) + // HeaderByHash must NOT be called for the zero-hash pre-migration row. + client.AssertNumberOfCalls(t, "HeaderByHash", 1) +} + +// TestFindCommonAncestor_HeaderByHashError verifies that RPC errors are propagated. 
+func TestFindCommonAncestor_HeaderByHashError(t *testing.T) { + t.Parallel() + + client := new(MockEVMClient) + getter := new(MockContractEventGetter) + + blockHash := common.HexToHash("0xfailhash") + getter.On("GetLatestContractEventBlockHashAndNumber", testContract, testBlockchainID). + Return(uint64(100), blockHash.Hex(), nil) + + client.On("HeaderByHash", mock.Anything, blockHash).Return(nil, errors.New("rpc timeout")) + + _, err := findCommonAncestor(context.Background(), client, getter, testContract, testBlockchainID, newTestLogger()) + require.Error(t, err) + assert.Contains(t, err.Error(), "rpc timeout") +} diff --git a/pkg/core/event.go b/pkg/core/event.go index 2b6988a5e..c0e8aa131 100644 --- a/pkg/core/event.go +++ b/pkg/core/event.go @@ -87,4 +87,5 @@ type BlockchainEvent struct { BlockNumber uint64 `json:"block_number"` TransactionHash string `json:"transaction_hash"` LogIndex uint32 `json:"log_index"` + BlockHash string `json:"block_hash"` } From 43d0850279a74b98f26b6a78334a474574a94fde Mon Sep 17 00:00:00 2001 From: nksazonov Date: Mon, 8 Jun 2026 15:39:10 +0200 Subject: [PATCH 06/23] docs(nitronode): reconcile reorg-fix spec with implementation --- nitronode/reorg-fix-spec.md | 41 +++++++++++--------- pkg/blockchain/evm/confirmation_gate_test.go | 14 ++++--- 2 files changed, 30 insertions(+), 25 deletions(-) diff --git a/nitronode/reorg-fix-spec.md b/nitronode/reorg-fix-spec.md index 1d152f3c9..501d9eb50 100644 --- a/nitronode/reorg-fix-spec.md +++ b/nitronode/reorg-fix-spec.md @@ -80,16 +80,16 @@ If a log with `Removed: true` arrives for the same `(txHash, blockHash, logIndex ### 4.3 Out-of-order delivery -The re-added event (no `Removed: true`, new block) may arrive at the listener before the corresponding `Removed: true` log for the old block. Because the re-added event is in a different block, it carries a different `blockHash` and therefore a different key. 
The two events are handled independently: +The re-added event (no `Removed: true`, new block) may arrive at the listener before the corresponding `Removed: true` log for the old block. When this happens, the gate **replaces** the pending entry for `(txHash, logIndex)` with the new one and resets the confirmation timer under the new block's key: -- The re-added event starts a fresh timer under its own key. -- The `Removed: true` log, when it arrives, looks up the OLD block's key — which has no pending timer (it was never created, or it already expired) — and performs a no-op. +- On the non-removed re-add, scan the queue by `(txHash, logIndex)` — ignoring `blockHash` — and drop any existing entry. Append the new event with a fresh `arrivedAt`. +- The subsequent `Removed: true` log for the OLD block carries the old `blockHash` and therefore matches neither the queued (new-block) entry nor any `recentlyForwarded` record. It performs a no-op. -This means out-of-order delivery requires no special case beyond the normal path: the block-scoped key prevents the remove from accidentally cancelling the re-added event's timer. +This collapses the two-entry coexistence model into a single live entry per `(txHash, logIndex)`. The behavior is observationally equivalent — exactly one event is forwarded, and it is the latest re-mining — and it removes the only state-divergence path between the queue and `recentlyForwarded`. -- On a `Removed: true` log for a key that **has no pending timer**: no-op. The event either confirmed and was already processed (reorg arrived after the window), or belongs to a different block whose timer was never started (re-add arrived first, has its own key). +- On a `Removed: true` log for a key that **has no pending timer and no `recentlyForwarded` record**: no-op. The event either belongs to a block that was already replaced by a later re-add (handled above), or it is a stale removal from a fork the gate has no record of. 
-> Repeated reorgs of the same transaction are theoretically possible but imply a chain-level consensus failure. The gate's cancel/restart cycle handles each naturally; no special cap is needed. +> Repeated reorgs of the same transaction are theoretically possible but imply a chain-level consensus failure. The gate's replace/restart cycle handles each naturally; no special cap is needed. ### 4.4 Startup and reconciliation @@ -124,9 +124,9 @@ On startup, for each chain, after the `block_hash` migration has been applied: > **Why walk stored hashes, not block numbers?** In normal operation most blocks contain no `ChannelHub` events, so `contract_events` has no row for them. A block-number walk would find nothing to compare at event-gap heights and could miss a reorg that occurred entirely within such a gap. Walking by stored block hashes ensures every comparison is against a block the reactor actually processed. 4. Set the scan start to `commonAncestorBlockNum`. Events between `commonAncestorBlockNum` and `latestBlockNum` that came from the reorged fork are still present in the DB. The reactor has no rollback mechanism for those rows — the re-scan below will re-apply canonical events over them where the transaction was re-mined (idempotent), and leave the orphaned DB state in place where the transaction was not re-mined (residual risk; see §2.1). State-setting operations (`UpdateChannel`, `RefreshUserEnforcedBalance`) will overwrite with canonical values for re-mined events; rows from dropped transactions remain as stale data with no automated cleanup. -5. Start the event scan from `commonAncestorBlockNum` (or genesis if step 1 found no rows). Feed all replayed events **directly to the reactor, bypassing the gate entirely**. Historical events come from `eth_getLogs` and are, by definition, already in the current canonical chain. The common-ancestor walk in steps 2–3 additionally confirms that the starting block is canonical. 
There is no incremental reorg risk to guard against for these events, and applying a full confirmation delay would only stall the node on restart for no safety benefit. The gate applies exclusively to live WebSocket events; any reorgs of very-recent blocks during replay are handled by the buffered live-subscription signals processed immediately after replay completes (step 7). +5. Start the event scan from `commonAncestorBlockNum` (or genesis if step 1 found no rows). Replayed events may flow through the same path as live events (Listener → gate → reactor); the gate does not require a separate bypass. Historical events come from `eth_getLogs` and are, by definition, already in the current canonical chain — the common-ancestor walk in steps 2–3 additionally confirms that the starting block is canonical, so there is no incremental reorg risk to guard against. Provided that the gate uses each event's **block timestamp** as the `arrivedAt` reference (see §6.7), historical events are immediately mature on first poll and forward without per-event delay; the only added cost is one block-timestamp RPC per unique historical block and at most one poll-tick of latency. 6. The reactor is idempotent for replayed events: `HandleHomeChannelCreated` has an explicit early-return guard when the channel is already open; `HandleHomeChannelCheckpointed` and `RefreshUserEnforcedBalance` use set-semantics (not accumulation) and recompute from the latest DB state. `StoreContractEvent` is called last inside the DB transaction and enforces a unique constraint on `(transaction_hash, log_index, blockchain_id)`. If a duplicate is inserted, Postgres returns a constraint-violation error, causing the entire transaction (including all state mutations in the same `useStoreInTx` call) to roll back. The reactor therefore cannot double-apply state changes for an event it has already committed. -7. 
Historical log queries (`eth_getLogs`) return only canonical chain events — there are no `Removed: true` signals during replay. The gate operates in timer-only mode during reconciliation. Removal signals from the live WebSocket subscription that arrive during the replay phase are buffered in the listener's `currentCh` and processed only after the historical replay phase completes. +7. Historical log queries (`eth_getLogs`) return only canonical chain events — there are no `Removed: true` signals during replay. Removal signals from the live WebSocket subscription that arrive during the replay phase are buffered in the listener's `currentCh` and reach the gate only after the historical replay phase completes; if they cancel a re-mined event that has already been forwarded, the post-gate reorg detection in §6.5 logs them. --- @@ -163,15 +163,18 @@ The reactor itself does not change. All the listener's existing logic — subscr **Handling `Removed: true` logs:** currently `listener.go:289-294` skips removed logs before they reach the handler. This skip must be moved: the listener should forward removed logs to `gate.HandleEvent` (they still carry the `Removed` flag on `types.Log`), and the gate alone decides whether to cancel a pending timer or ignore the signal. The reactor never sees a `Removed: true` log. -### 6.2 Event identity for removal scanning +### 6.2 Event identity for queue keying -The Listener delivers events in strict block order, so the queue is naturally ordered by arrival time. When a `Removed: true` log arrives in the Pusher, it scans the queue for the **first** entry matching `(txHash, logIndex)` and deletes it. +The Listener delivers events in strict block order, so the queue is naturally ordered by arrival time. Two distinct scan keys are used against the queue: -`blockHash` is deliberately excluded from the removal scan key. 
Because the queue is FIFO and reorgs produce the re-add event *after* the original event, the original always sits earlier in the queue than any re-add. Scanning for `(txHash, logIndex)` and deleting the first match therefore always targets the original entry and leaves any re-add untouched. +- **`(txHash, logIndex)` — used by both Pusher paths (non-removed re-add and removed cancellation).** On a non-removed arrival, any existing entry with the same `(txHash, logIndex)` is dropped and the new event appended with a fresh `arrivedAt`. Because re-adds always replace the prior entry, the queue holds at most one entry per `(txHash, logIndex)` at any time. +- **`(txHash, blockHash, logIndex)` — used by the Removed-cancel scan against the queue.** A `Removed: true` log only cancels a queued entry when the full key matches. A Removed for an OLD block whose entry has already been replaced by a newer re-add will not match the queued (new-block) entry and will fall through to the `recentlyForwarded` lookup (§6.5). + +`blockHash` is excluded from the re-add scan key so that a re-mining of the same tx replaces the original regardless of which block it landed in. `blockHash` is included on the Removed scan so that a stale removal for an already-replaced fork cannot cancel a live entry. A single transaction can emit multiple events for the same `txHash` (e.g., two `ChannelDeposited` logs in a batch open). `logIndex` disambiguates these; it is unique per log within a block and is present in both the live event and its corresponding `Removed: true` log. -`blockHash` is still present in each `types.Log` stored in the queue and is used by: +`blockHash` is also used by: - The `recentlyForwarded` detection map (§6.5) — keyed by `(txHash, blockHash, logIndex)` to identify which specific occurrence was forwarded. - `StoreContractEvent` in the reactor — stored in `contract_events` for the reconciliation walk (§4.4). 
@@ -185,12 +188,12 @@ type queueEntry struct { arrivedAt time.Time } -type eventKey struct { // used for removal scan +type eventKey struct { // used for re-add scan (replace prior entry) txHash common.Hash logIndex uint } -type forwardedKey struct { // used for post-gate reorg detection +type forwardedKey struct { // used for Removed-cancel scan and post-gate reorg detection txHash common.Hash blockHash common.Hash logIndex uint @@ -200,8 +203,8 @@ type ConfirmationGate struct { delay time.Duration chainID uint64 handler HandleEvent - queue []queueEntry // protected by mu - recentlyForwarded map[forwardedKey]time.Time // protected by mu; TTL = 2× delay + queue []queueEntry // protected by mu + recentlyForwarded map[forwardedKey]time.Time // protected by mu; entries are kept for a small multiple of `delay` (see §6.5) mu sync.Mutex } ``` @@ -212,8 +215,8 @@ type ConfirmationGate struct { Receives `types.Log` from the Listener. On each event: -- If `Removed: true` — scan the queue for the first entry matching `(txHash, logIndex)` and delete it. If no match found, check `recentlyForwarded` for a post-gate reorg signal (see §6.5). -- Otherwise — append `(log, time.Now())` to the queue tail. +- If `Removed: true` — scan the queue for an entry matching the full `(txHash, blockHash, logIndex)` key and delete it. If no match is found, check `recentlyForwarded` for a post-gate reorg signal (see §6.5). +- Otherwise — drop any existing queue entry with the same `(txHash, logIndex)` (ignoring `blockHash`), then append `(log, arrivedAt)` to the queue tail. `arrivedAt` is the block timestamp (see §6.7), falling back to `time.Now()` only on fetch failure. No expiration check, no forwarding. Push only. @@ -266,7 +269,7 @@ When `Removed: true` arrives in the Pusher: - **No match in queue, but `forwardedKey{txHash, blockHash, logIndex}` found in `recentlyForwarded`** → the event was already forwarded to the Reactor and its block has now been reorged out. 
Log at **`WARN`** with `txHash`, `blockHash`, `logIndex`, `chainID`. Remove the entry. - **Match in neither** → log at `DEBUG` ("removal for unknown/stale event" — predates the current run or arrived long after the TTL). -`recentlyForwarded` entries are evicted lazily: when the Pusher reads an entry, it checks `time.Since(forwardedAt) > 2 × delay` and discards stale entries on access. The map stays small because post-gate reorgs are rare and `Removed: true` arrives within one or two block-times of the reorg. No separate cleanup goroutine is needed. +`recentlyForwarded` entries are evicted on a TTL that is a small multiple of `delay` — long enough that any `Removed: true` for a forwarded event arrives while the entry is still present, short enough that the map remains bounded. The exact multiplier is an implementation choice (current value: see `recentMultiplier` in `confirmation_gate.go`; e.g. 2 or 3 work in practice). Eviction may be performed lazily on Pusher access, in a periodic Poller sweep, or by any equivalent strategy; the post-gate detection contract above is what matters, not the eviction mechanism. The map stays small because post-gate reorgs are rare and `Removed: true` arrives within one or two block-times of the reorg. No separate cleanup goroutine is required. ### 6.6 Reactor defense-in-depth: skip re-delivered events diff --git a/pkg/blockchain/evm/confirmation_gate_test.go b/pkg/blockchain/evm/confirmation_gate_test.go index f296397c0..d2bfbba3a 100644 --- a/pkg/blockchain/evm/confirmation_gate_test.go +++ b/pkg/blockchain/evm/confirmation_gate_test.go @@ -145,8 +145,9 @@ func TestConfirmationGate_ReorgCancel(t *testing.T) { assert.Equal(t, int32(0), callCount.Load(), "handler must never be called after reorg cancel") } -// T4: a re-delivered event (same tx/logIndex, different blockHash) replaces the original; -// the Removed for the old blockHash is a no-op; the new event is forwarded once. 
+// T4: a re-delivered event (same tx/logIndex, different blockHash) replaces the original +// in the queue; the late-arriving Removed for the old blockHash is a no-op (no queue match, +// no recentlyForwarded match); the new event is forwarded once. func TestConfirmationGate_OutOfOrder(t *testing.T) { t.Parallel() @@ -163,12 +164,13 @@ func TestConfirmationGate_OutOfOrder(t *testing.T) { bhOld := common.HexToHash("0xAA") bhNew := common.HexToHash("0xBB") - // Event A: original block + // Event A: original block — queued under (tx, bhOld, 0). require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bhOld, 0, false))) - // Event B: re-mined in new block (same txHash/logIndex, different blockHash) + // Event B: re-mined in new block (same txHash/logIndex, different blockHash) — + // replaces A in the queue under (tx, bhNew, 0) and resets the confirmation timer. require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bhNew, 0, false))) - // Removed for old block: matches A's full key (bh=0xAA) and removes it from queue. - // B (bh=0xBB) is left untouched. + // Removed for old block (bhOld): the queued entry now has bhNew, so the full-key + // scan finds no match. recentlyForwarded is empty (nothing forwarded yet). No-op. 
 	require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bhOld, 0, true)))
 	// Wait long enough for the poll goroutine to fire (pollInterval=50ms) and the delay to
From 256463668d96c054e7a8301d7c1a9b4dfd6440b0 Mon Sep 17 00:00:00 2001
From: nksazonov
Date: Mon, 8 Jun 2026 15:48:41 +0200
Subject: [PATCH 07/23] feat(nitronode): bypass confirmation gate for historical event replay
---
 nitronode/main.go                   |  4 +-
 nitronode/reorg-fix-spec.md         |  4 +-
 pkg/blockchain/evm/listener.go      | 41 ++++++++++-------
 pkg/blockchain/evm/listener_test.go | 69 +++++++++++++++++++++++++----
 4 files changed, 91 insertions(+), 27 deletions(-)
diff --git a/nitronode/main.go b/nitronode/main.go
index 8ffd51e9a..855aba810 100644
--- a/nitronode/main.go
+++ b/nitronode/main.go
@@ -138,7 +138,9 @@ func main() {
 		gate := evm.NewConfirmationGate(confirmationDelay, b.ID, reactor.HandleEvent, blockTimestampFetcher, logger)
 		gate.Start(blockchainCtx)
-		l := evm.NewListener(common.HexToAddress(b.ChannelHubAddress), client, b.ID, b.BlockStep, logger, gate.HandleEvent, bb.DbStore)
+		// Live events flow through the confirmation gate; historical events from eth_getLogs
+		// are already canonical and go directly to the reactor (see reorg-fix-spec.md §4.4 step 5).
+		l := evm.NewListener(common.HexToAddress(b.ChannelHubAddress), client, b.ID, b.BlockStep, logger, gate.HandleEvent, reactor.HandleEvent, bb.DbStore)
 		l.Listen(blockchainCtx, func(err error) {
 			if err != nil {
 				logger.Fatal("blockchain listener stopped", "error", err, "blockchainID", b.ID)
diff --git a/nitronode/reorg-fix-spec.md b/nitronode/reorg-fix-spec.md
index 501d9eb50..1fcfc01fe 100644
--- a/nitronode/reorg-fix-spec.md
+++ b/nitronode/reorg-fix-spec.md
@@ -124,9 +124,9 @@ On startup, for each chain, after the `block_hash` migration has been applied:
 > **Why walk stored hashes, not block numbers?** In normal operation most blocks contain no `ChannelHub` events, so `contract_events` has no row for them. A block-number walk would find nothing to compare at event-gap heights and could miss a reorg that occurred entirely within such a gap. Walking by stored block hashes ensures every comparison is against a block the reactor actually processed.
 4. Set the scan start to `commonAncestorBlockNum`. Events between `commonAncestorBlockNum` and `latestBlockNum` that came from the reorged fork are still present in the DB. The reactor has no rollback mechanism for those rows — the re-scan below will re-apply canonical events over them where the transaction was re-mined (idempotent), and leave the orphaned DB state in place where the transaction was not re-mined (residual risk; see §2.1). State-setting operations (`UpdateChannel`, `RefreshUserEnforcedBalance`) will overwrite with canonical values for re-mined events; rows from dropped transactions remain as stale data with no automated cleanup.
-5. Start the event scan from `commonAncestorBlockNum` (or genesis if step 1 found no rows). Replayed events may flow through the same path as live events (Listener → gate → reactor); the gate does not require a separate bypass. Historical events come from `eth_getLogs` and are, by definition, already in the current canonical chain — the common-ancestor walk in steps 2–3 additionally confirms that the starting block is canonical, so there is no incremental reorg risk to guard against. Provided that the gate uses each event's **block timestamp** as the `arrivedAt` reference (see §6.7), historical events are immediately mature on first poll and forward without per-event delay; the only added cost is one block-timestamp RPC per unique historical block and at most one poll-tick of latency.
+5. Start the event scan from `commonAncestorBlockNum` (or genesis if step 1 found no rows). Replayed events are routed **directly to the reactor**, bypassing the gate. Historical events come from `eth_getLogs` and are, by definition, already in the current canonical chain — the common-ancestor walk in steps 2–3 additionally confirms that the starting block is canonical, so there is no incremental reorg risk to guard against. The `Listener` accepts two handlers (`eventHandler` for live events, `historicalEventHandler` for Phase 1) so the gate sits in the live path only. The bypass also avoids one block-timestamp RPC per unique historical block and one poll-tick of latency that the gate would otherwise add on every restart.
 6. The reactor is idempotent for replayed events: `HandleHomeChannelCreated` has an explicit early-return guard when the channel is already open; `HandleHomeChannelCheckpointed` and `RefreshUserEnforcedBalance` use set-semantics (not accumulation) and recompute from the latest DB state. `StoreContractEvent` is called last inside the DB transaction and enforces a unique constraint on `(transaction_hash, log_index, blockchain_id)`. If a duplicate is inserted, Postgres returns a constraint-violation error, causing the entire transaction (including all state mutations in the same `useStoreInTx` call) to roll back. The reactor therefore cannot double-apply state changes for an event it has already committed.
-7. Historical log queries (`eth_getLogs`) return only canonical chain events — there are no `Removed: true` signals during replay. Removal signals from the live WebSocket subscription that arrive during the replay phase are buffered in the listener's `currentCh` and reach the gate only after the historical replay phase completes; if they cancel a re-mined event that has already been forwarded, the post-gate reorg detection in §6.5 logs them.
+7. Historical log queries (`eth_getLogs`) return only canonical chain events — there are no `Removed: true` signals during replay, and replay does not flow through the gate (step 5). Removal signals from the live WebSocket subscription that arrive during the replay phase are buffered in the listener's `currentCh` and reach the gate only after the historical replay phase completes; if they cancel a re-mined event that has already been forwarded by the live path, the post-gate reorg detection in §6.5 logs them.
 ---
diff --git a/pkg/blockchain/evm/listener.go b/pkg/blockchain/evm/listener.go
index 10a01eec5..60f3ce494 100644
--- a/pkg/blockchain/evm/listener.go
+++ b/pkg/blockchain/evm/listener.go
@@ -24,26 +24,35 @@ const (
 // deduplicated delivery even across restarts. Cancel the context passed to Listen
 // for graceful shutdown.
 type Listener struct {
-	contractAddress common.Address
-	client          EVMClient
-	blockchainID    uint64
-	blockStep       uint64 // max blocks per FilterLogs call during reconciliation
-	logger          log.Logger
-	handleEvent     HandleEvent
-	eventGetter     ContractEventGetter
+	contractAddress       common.Address
+	client                EVMClient
+	blockchainID          uint64
+	blockStep             uint64 // max blocks per FilterLogs call during reconciliation
+	logger                log.Logger
+	handleEvent           HandleEvent // live events (Phase 2); typically the ConfirmationGate
+	handleHistoricalEvent HandleEvent // historical events (Phase 1); typically the reactor directly
+	eventGetter           ContractEventGetter
 }
 // NewListener creates a Listener. blockStep controls how many blocks are fetched
 // per RPC call during historical reconciliation.
-func NewListener(contractAddress common.Address, client EVMClient, blockchainID uint64, blockStep uint64, logger log.Logger, eventHandler HandleEvent, eventGetter ContractEventGetter) *Listener {
+//
+// eventHandler is invoked for live events (Phase 2); historicalEventHandler is invoked
+// for historical events (Phase 1). The two handlers may be the same function. The split
+// exists so callers can route live events through a ConfirmationGate while replaying
+// historical events directly to the reactor — historical events come from `eth_getLogs`
+// and are by definition canonical, so the gate adds no safety value for them (see
+// reorg-fix-spec.md §4.4 step 5).
+func NewListener(contractAddress common.Address, client EVMClient, blockchainID uint64, blockStep uint64, logger log.Logger, eventHandler HandleEvent, historicalEventHandler HandleEvent, eventGetter ContractEventGetter) *Listener {
 	return &Listener{
-		contractAddress: contractAddress,
-		client:          client,
-		blockchainID:    blockchainID,
-		blockStep:       blockStep,
-		logger:          logger.WithName("evm"),
-		handleEvent:     eventHandler,
-		eventGetter:     eventGetter,
+		contractAddress:       contractAddress,
+		client:                client,
+		blockchainID:          blockchainID,
+		blockStep:             blockStep,
+		logger:                logger.WithName("evm"),
+		handleEvent:           eventHandler,
+		handleHistoricalEvent: historicalEventHandler,
+		eventGetter:           eventGetter,
 	}
 }
@@ -265,7 +274,7 @@ func (l *Listener) processEvents(
 			}
 			l.logger.Debug("received historical event", "blockchainID", l.blockchainID, "contractAddress", l.contractAddress.String(), "blockNumber", eventLog.BlockNumber, "logIndex", eventLog.Index)
 			evCtx := log.SetContextLogger(context.Background(), l.logger)
-			if err := l.handleEvent(evCtx, eventLog); err != nil {
+			if err := l.handleHistoricalEvent(evCtx, eventLog); err != nil {
 				eventSubscription.Unsubscribe()
 				return err
 			}
diff --git a/pkg/blockchain/evm/listener_test.go b/pkg/blockchain/evm/listener_test.go
index a948150f3..b665f00fd 100644
--- a/pkg/blockchain/evm/listener_test.go
+++ b/pkg/blockchain/evm/listener_test.go
@@ -45,7 +45,7 @@ func TestNewListener(t *testing.T) {
 	addr := common.HexToAddress("0x123")
 	eventGetter := new(MockContractEventGetter)
-	l := NewListener(addr, mockClient, 1, 100, logger, nil, eventGetter)
+	l := NewListener(addr, mockClient, 1, 100, logger, nil, nil, eventGetter)
 	require.NotNil(t, l)
 	assert.Equal(t, addr, l.contractAddress)
 	assert.Equal(t, uint64(1), l.blockchainID)
@@ -73,7 +73,7 @@ func TestListener_Listen_CurrentEvents(t *testing.T) {
 		return nil
 	}
-	listener := NewListener(addr, mockClient, 1, 100, logger, handleEvent, eventGetter)
+	listener := NewListener(addr, mockClient, 1, 100, logger, handleEvent, handleEvent, eventGetter)
 	// Mock SubscribeFilterLogs
 	sub := &MockSubscription{
@@ -110,7 +110,7 @@ func TestListener_ReconcileBlockRange(t *testing.T) {
 	addr := common.HexToAddress("0x123")
 	eventGetter := new(MockContractEventGetter)
-	listener := NewListener(addr, mockClient, 1, 10, logger, nil, eventGetter)
+	listener := NewListener(addr, mockClient, 1, 10, logger, nil, nil, eventGetter)
 	// Setup FilterLogs mock
 	// We expect a range fetch. start=100, step=10 -> end=110. current=120.
@@ -184,7 +184,7 @@ func TestListener_Listen_HistoricalAndCurrent(t *testing.T) {
 		return nil
 	}
-	listener := NewListener(addr, mockClient, 1, 10, logger, handleEvent, eventGetter)
+	listener := NewListener(addr, mockClient, 1, 10, logger, handleEvent, handleEvent, eventGetter)
 	// findCommonAncestor: block 100 is canonical.
 	canonicalHeader := &types.Header{Number: big.NewInt(100)}
@@ -230,7 +230,7 @@ func TestProcessEvents_DedupSkipsPresent(t *testing.T) {
 		return nil
 	}
-	listener := NewListener(addr, new(MockEVMClient), 1, 10, logger, handleEvent, eventGetter)
+	listener := NewListener(addr, new(MockEVMClient), 1, 10, logger, handleEvent, handleEvent, eventGetter)
 	// Historical: 3 events. First 2 are present (skipped), 3rd is not (handled).
 	// After the 3rd, the check should stop — no IsContractEventPresent call for events 4+.
@@ -281,7 +281,7 @@ func TestProcessEvents_SubscriptionErrorDuringPhase1(t *testing.T) {
 		return nil
 	}
-	listener := NewListener(addr, new(MockEVMClient), 1, 10, logger, handleEvent, eventGetter)
+	listener := NewListener(addr, new(MockEVMClient), 1, 10, logger, handleEvent, handleEvent, eventGetter)
 	// Historical channel with events that will block (not closed yet)
 	historicalCh := make(chan types.Log, 2)
@@ -309,6 +309,59 @@ func TestProcessEvents_SubscriptionErrorDuringPhase1(t *testing.T) {
 	assert.Equal(t, []uint64{100}, handledBlocks)
 }
+// TestListener_PhaseHandlerRouting verifies that Phase 1 (historical) events are routed
+// to handleHistoricalEvent and Phase 2 (live) events are routed to handleEvent. This is
+// the gate-bypass for historical replay (reorg-fix-spec.md §4.4 step 5).
+func TestListener_PhaseHandlerRouting(t *testing.T) {
+	t.Parallel()
+	logger := log.NewNoopLogger()
+	addr := common.HexToAddress("0x123")
+	eventGetter := new(MockContractEventGetter)
+
+	var historicalLogs, liveLogs []types.Log
+	historicalHandler := func(_ context.Context, l types.Log) error {
+		historicalLogs = append(historicalLogs, l)
+		return nil
+	}
+	liveHandler := func(_ context.Context, l types.Log) error {
+		liveLogs = append(liveLogs, l)
+		return nil
+	}
+
+	listener := NewListener(addr, new(MockEVMClient), 1, 10, logger, liveHandler, historicalHandler, eventGetter)
+
+	// Historical: 1 event at block 100.
+	histLog := types.Log{BlockNumber: 100, Index: 0, TxHash: common.HexToHash("0xaaa")}
+	historicalCh := make(chan types.Log, 1)
+	historicalCh <- histLog
+	close(historicalCh)
+
+	// Live: 1 event at block 200.
+	currentLog := types.Log{BlockNumber: 200, Index: 0, TxHash: common.HexToHash("0xbbb")}
+	currentCh := make(chan types.Log, 1)
+	currentCh <- currentLog
+
+	eventGetter.On("IsContractEventPresent", uint64(1), uint64(100), mock.Anything, uint32(0)).Return(false, nil).Once()
+	eventGetter.On("IsContractEventPresent", uint64(1), uint64(200), mock.Anything, uint32(0)).Return(false, nil).Once()
+
+	sub := &MockSubscription{errChan: make(chan error, 1), unsub: func() {}}
+
+	ctx, cancel := context.WithCancel(context.Background())
+	go func() {
+		time.Sleep(100 * time.Millisecond)
+		cancel()
+	}()
+
+	var lastBlock uint64
+	err := listener.processEvents(ctx, sub, historicalCh, currentCh, &lastBlock)
+	require.NoError(t, err)
+
+	require.Len(t, historicalLogs, 1, "historical handler must see the Phase 1 event")
+	assert.Equal(t, uint64(100), historicalLogs[0].BlockNumber)
+	require.Len(t, liveLogs, 1, "live handler must see the Phase 2 event")
+	assert.Equal(t, uint64(200), liveLogs[0].BlockNumber)
+}
+
 func TestListener_RemovedLog_ForwardedToHandler(t *testing.T) {
 	t.Parallel()
 	logger := log.NewNoopLogger()
@@ -322,7 +375,7 @@ func TestListener_RemovedLog_ForwardedToHandler(t *testing.T) {
 		return nil
 	}
-	listener := NewListener(addr, new(MockEVMClient), 1, 10, logger, handleEvent, eventGetter)
+	listener := NewListener(addr, new(MockEVMClient), 1, 10, logger, handleEvent, handleEvent, eventGetter)
 	// No historical events.
 	historicalCh := make(chan types.Log)
@@ -379,7 +432,7 @@ func TestReconcileBlockRange_ContextCancellation(t *testing.T) {
 	addr := common.HexToAddress("0x123")
 	eventGetter := new(MockContractEventGetter)
-	listener := NewListener(addr, mockClient, 1, 10, logger, nil, eventGetter)
+	listener := NewListener(addr, mockClient, 1, 10, logger, nil, nil, eventGetter)
 	ctx, cancel := context.WithCancel(context.Background())
From b1fb759182412af5c2242d62b4b2d16a573d9bb1 Mon Sep 17 00:00:00 2001
From: nksazonov
Date: Wed, 10 Jun 2026 16:38:19 +0200
Subject: [PATCH 08/23] feat: address some docs comments
---
 nitronode/reorg-fix-spec.md                 | 1 -
 nitronode/store/memory/blockchain_config.go | 1 -
 2 files changed, 2 deletions(-)
diff --git a/nitronode/reorg-fix-spec.md b/nitronode/reorg-fix-spec.md
index 1fcfc01fe..54bf203e3 100644
--- a/nitronode/reorg-fix-spec.md
+++ b/nitronode/reorg-fix-spec.md
@@ -239,7 +239,6 @@ No event handling, no Listener awareness. Drain-and-forward only.
 | Property | Detail |
 | --- | --- |
-| Zero RPC calls in the gate | Delay is a pure `time.Duration`; no chain queries |
 | Chain-agnostic | `confirmationDelay` is the only chain-specific input |
 | Forward latency after window | At most one tick (~50 ms) |
 | Reorg within window | Pusher's scan removes the entry; Reactor never sees the event |
diff --git a/nitronode/store/memory/blockchain_config.go b/nitronode/store/memory/blockchain_config.go
index 7f6dd8c68..c968c9a9e 100644
--- a/nitronode/store/memory/blockchain_config.go
+++ b/nitronode/store/memory/blockchain_config.go
@@ -45,7 +45,6 @@ type BlockchainConfig struct {
 	ChannelHubSigValidators map[uint8]string `yaml:"channel_hub_sig_validators"`
 	// ConfirmationDelaySecs is the number of seconds to wait before processing an event.
 	// Set to 0 to process events immediately (disables the confirmation gate).
-	// Maximum meaningful value is ~780s (Ethereum Casper FFG hard finality).
 	ConfirmationDelaySecs uint32 `yaml:"confirmation_delay_secs"`
 }
From ce0198753b057a701ee46efe10b059f30e387b95 Mon Sep 17 00:00:00 2001
From: nksazonov
Date: Wed, 10 Jun 2026 16:53:30 +0200
Subject: [PATCH 09/23] fix(nitronode/listener): route fresh historical events to gate
---
 nitronode/main.go                   |   8 +-
 nitronode/reorg-fix-spec.md         |   6 +-
 pkg/blockchain/evm/listener.go      |  67 ++++++++++---
 pkg/blockchain/evm/listener_test.go | 143 +++++++++++++++++++++++-----
 4 files changed, 187 insertions(+), 37 deletions(-)
diff --git a/nitronode/main.go b/nitronode/main.go
index 855aba810..6151dbf36 100644
--- a/nitronode/main.go
+++ b/nitronode/main.go
@@ -138,9 +138,11 @@ func main() {
 		gate := evm.NewConfirmationGate(confirmationDelay, b.ID, reactor.HandleEvent, blockTimestampFetcher, logger)
 		gate.Start(blockchainCtx)
-		// Live events flow through the confirmation gate; historical events from eth_getLogs
-		// are already canonical and go directly to the reactor (see reorg-fix-spec.md §4.4 step 5).
-		l := evm.NewListener(common.HexToAddress(b.ChannelHubAddress), client, b.ID, b.BlockStep, logger, gate.HandleEvent, reactor.HandleEvent, bb.DbStore)
+		// Live events flow through the confirmation gate. Historical events from eth_getLogs
+		// are routed per-event based on block age: events older than confirmationDelay go
+		// directly to the reactor (past the reorg window); recent events still flow through
+		// the gate because their blocks may still be reorged.
+		l := evm.NewListener(common.HexToAddress(b.ChannelHubAddress), client, b.ID, b.BlockStep, confirmationDelay, logger, gate.HandleEvent, reactor.HandleEvent, bb.DbStore)
 		l.Listen(blockchainCtx, func(err error) {
 			if err != nil {
 				logger.Fatal("blockchain listener stopped", "error", err, "blockchainID", b.ID)
diff --git a/nitronode/reorg-fix-spec.md b/nitronode/reorg-fix-spec.md
index 54bf203e3..2f9941fb8 100644
--- a/nitronode/reorg-fix-spec.md
+++ b/nitronode/reorg-fix-spec.md
@@ -124,7 +124,11 @@ On startup, for each chain, after the `block_hash` migration has been applied:
 > **Why walk stored hashes, not block numbers?** In normal operation most blocks contain no `ChannelHub` events, so `contract_events` has no row for them. A block-number walk would find nothing to compare at event-gap heights and could miss a reorg that occurred entirely within such a gap. Walking by stored block hashes ensures every comparison is against a block the reactor actually processed.
 4. Set the scan start to `commonAncestorBlockNum`. Events between `commonAncestorBlockNum` and `latestBlockNum` that came from the reorged fork are still present in the DB. The reactor has no rollback mechanism for those rows — the re-scan below will re-apply canonical events over them where the transaction was re-mined (idempotent), and leave the orphaned DB state in place where the transaction was not re-mined (residual risk; see §2.1). State-setting operations (`UpdateChannel`, `RefreshUserEnforcedBalance`) will overwrite with canonical values for re-mined events; rows from dropped transactions remain as stale data with no automated cleanup.
-5. Start the event scan from `commonAncestorBlockNum` (or genesis if step 1 found no rows). Replayed events are routed **directly to the reactor**, bypassing the gate. Historical events come from `eth_getLogs` and are, by definition, already in the current canonical chain — the common-ancestor walk in steps 2–3 additionally confirms that the starting block is canonical, so there is no incremental reorg risk to guard against. The `Listener` accepts two handlers (`eventHandler` for live events, `historicalEventHandler` for Phase 1) so the gate sits in the live path only. The bypass also avoids one block-timestamp RPC per unique historical block and one poll-tick of latency that the gate would otherwise add on every restart.
+5. Start the event scan from `commonAncestorBlockNum` (or genesis if step 1 found no rows). Replayed events are routed **per-event by block age**:
+   - Events whose block timestamp is **older than `confirmation_delay_sec`** are routed directly to the reactor, bypassing the gate. Their block is past the reorg window — `eth_getLogs` returned them as canonical, and any reorg that could displace them would exceed the configured finality bound. There is no incremental reorg risk to guard against, and routing them through the gate would only add latency.
+   - Events whose block timestamp is **younger than `confirmation_delay_sec`** are routed through the gate, the same path live events take. The common-ancestor walk only confirms the *starting* block is canonical; replay can fetch logs from blocks all the way up to the current chain tip, some of which are still inside the reorg window. Forwarding those directly to the reactor would re-introduce the very double-spend window the gate was built to close.
+
+   The `Listener` accepts two handlers (`eventHandler` for live events and recent historical events, `historicalEventHandler` for mature historical events) and makes the per-event routing decision using one `HeaderByHash` RPC per historical event. When `confirmation_delay_sec` is `0` the gate is disabled and every historical event is routed to `historicalEventHandler` without a timestamp fetch. On a `HeaderByHash` failure the Listener falls back to `eventHandler` (the gate) — the conservative choice that preserves the reorg-protection invariant at the cost of a small delay.
 6. The reactor is idempotent for replayed events: `HandleHomeChannelCreated` has an explicit early-return guard when the channel is already open; `HandleHomeChannelCheckpointed` and `RefreshUserEnforcedBalance` use set-semantics (not accumulation) and recompute from the latest DB state. `StoreContractEvent` is called last inside the DB transaction and enforces a unique constraint on `(transaction_hash, log_index, blockchain_id)`. If a duplicate is inserted, Postgres returns a constraint-violation error, causing the entire transaction (including all state mutations in the same `useStoreInTx` call) to roll back. The reactor therefore cannot double-apply state changes for an event it has already committed.
 7. Historical log queries (`eth_getLogs`) return only canonical chain events — there are no `Removed: true` signals during replay, and replay does not flow through the gate (step 5). Removal signals from the live WebSocket subscription that arrive during the replay phase are buffered in the listener's `currentCh` and reach the gate only after the historical replay phase completes; if they cancel a re-mined event that has already been forwarded by the live path, the post-gate reorg detection in §6.5 logs them.
diff --git a/pkg/blockchain/evm/listener.go b/pkg/blockchain/evm/listener.go
index 60f3ce494..1144d5e59 100644
--- a/pkg/blockchain/evm/listener.go
+++ b/pkg/blockchain/evm/listener.go
@@ -27,28 +27,37 @@ type Listener struct {
 	contractAddress       common.Address
 	client                EVMClient
 	blockchainID          uint64
-	blockStep             uint64 // max blocks per FilterLogs call during reconciliation
+	blockStep             uint64        // max blocks per FilterLogs call during reconciliation
+	confirmationDelay     time.Duration // routing threshold for Phase 1 events; 0 disables age-based routing
 	logger                log.Logger
-	handleEvent           HandleEvent // live events (Phase 2); typically the ConfirmationGate
-	handleHistoricalEvent HandleEvent // historical events (Phase 1); typically the reactor directly
+	handleEvent           HandleEvent // live events and recent historical events; typically the ConfirmationGate
+	handleHistoricalEvent HandleEvent // historical events older than confirmationDelay; typically the reactor directly
 	eventGetter           ContractEventGetter
 }
 // NewListener creates a Listener. blockStep controls how many blocks are fetched
 // per RPC call during historical reconciliation.
 //
-// eventHandler is invoked for live events (Phase 2); historicalEventHandler is invoked
-// for historical events (Phase 1). The two handlers may be the same function. The split
-// exists so callers can route live events through a ConfirmationGate while replaying
-// historical events directly to the reactor — historical events come from `eth_getLogs`
-// and are by definition canonical, so the gate adds no safety value for them (see
-// reorg-fix-spec.md §4.4 step 5).
-func NewListener(contractAddress common.Address, client EVMClient, blockchainID uint64, blockStep uint64, logger log.Logger, eventHandler HandleEvent, historicalEventHandler HandleEvent, eventGetter ContractEventGetter) *Listener {
+// confirmationDelay controls per-event routing for Phase 1 (historical) events:
+//   - When 0: every historical event is routed to historicalEventHandler.
+//   - When > 0: each event's block timestamp is fetched via HeaderByHash. Events older
+//     than confirmationDelay are routed to historicalEventHandler (their block is past
+//     the reorg window, so they are safe to forward directly). Events younger than
+//     confirmationDelay are routed to eventHandler so they pass through the gate —
+//     historical replay reaching very recent blocks is no safer than live delivery
+//     and the gate must still protect against reorgs of those blocks.
+//
+// Live (Phase 2) events always flow to eventHandler.
+//
+// eventHandler is typically the ConfirmationGate; historicalEventHandler is typically
+// the reactor directly. The two handlers may be the same function when no gate is in use.
+func NewListener(contractAddress common.Address, client EVMClient, blockchainID uint64, blockStep uint64, confirmationDelay time.Duration, logger log.Logger, eventHandler HandleEvent, historicalEventHandler HandleEvent, eventGetter ContractEventGetter) *Listener {
 	return &Listener{
 		contractAddress:       contractAddress,
 		client:                client,
 		blockchainID:          blockchainID,
 		blockStep:             blockStep,
+		confirmationDelay:     confirmationDelay,
 		logger:                logger.WithName("evm"),
 		handleEvent:           eventHandler,
 		handleHistoricalEvent: historicalEventHandler,
@@ -274,7 +283,8 @@ func (l *Listener) processEvents(
 			}
 			l.logger.Debug("received historical event", "blockchainID", l.blockchainID, "contractAddress", l.contractAddress.String(), "blockNumber", eventLog.BlockNumber, "logIndex", eventLog.Index)
 			evCtx := log.SetContextLogger(context.Background(), l.logger)
-			if err := l.handleHistoricalEvent(evCtx, eventLog); err != nil {
+			handler := l.routeHistoricalEvent(ctx, eventLog)
+			if err := handler(evCtx, eventLog); err != nil {
 				eventSubscription.Unsubscribe()
 				return err
 			}
@@ -404,3 +414,38 @@ func (l *Listener) reconcileBlockRange(
 	}
 }
+// routeHistoricalEvent chooses the handler for a Phase 1 event based on the age of
+// its block. Events whose block timestamp is older than confirmationDelay are routed
+// to handleHistoricalEvent (they are past the reorg window and safe to forward
+// directly). Recent events — whose blocks may still be reorged — are routed to
+// handleEvent so they pass through the gate. When confirmationDelay is zero, every
+// event is routed to handleHistoricalEvent.
+//
+// On a HeaderByHash fetch error the function falls back to handleEvent: routing
+// through the gate is the conservative choice (it preserves the reorg-protection
+// invariant at the cost of a small delay), and the gate's own block-timestamp
+// fetcher will retry the lookup with its own fallback.
+func (l *Listener) routeHistoricalEvent(ctx context.Context, eventLog types.Log) HandleEvent {
+	if l.confirmationDelay == 0 {
+		return l.handleHistoricalEvent
+	}
+
+	headerCtx, cancel := context.WithTimeout(ctx, rpcRequestTimeout)
+	defer cancel()
+	header, err := l.client.HeaderByHash(headerCtx, eventLog.BlockHash)
+	if err != nil {
+		l.logger.Warn("failed to fetch block timestamp for historical event, routing through gate",
+			"error", err,
+			"blockchainID", l.blockchainID,
+			"blockNumber", eventLog.BlockNumber,
+			"blockHash", eventLog.BlockHash.Hex(),
+		)
+		return l.handleEvent
+	}
+
+	blockTime := time.Unix(int64(header.Time), 0)
+	if time.Since(blockTime) < l.confirmationDelay {
+		return l.handleEvent
+	}
+	return l.handleHistoricalEvent
+}
diff --git a/pkg/blockchain/evm/listener_test.go b/pkg/blockchain/evm/listener_test.go
index b665f00fd..1105e195a 100644
--- a/pkg/blockchain/evm/listener_test.go
+++ b/pkg/blockchain/evm/listener_test.go
@@ -45,7 +45,7 @@ func TestNewListener(t *testing.T) {
 	addr := common.HexToAddress("0x123")
 	eventGetter := new(MockContractEventGetter)
-	l := NewListener(addr, mockClient, 1, 100, logger, nil, nil, eventGetter)
+	l := NewListener(addr, mockClient, 1, 100, 0, logger, nil, nil, eventGetter)
 	require.NotNil(t, l)
 	assert.Equal(t, addr, l.contractAddress)
 	assert.Equal(t, uint64(1), l.blockchainID)
@@ -73,7 +73,7 @@ func TestListener_Listen_CurrentEvents(t *testing.T) {
 		return nil
 	}
-	listener := NewListener(addr, mockClient, 1, 100, logger, handleEvent, handleEvent, eventGetter)
+	listener := NewListener(addr, mockClient, 1, 100, 0, logger, handleEvent, handleEvent, eventGetter)
 	// Mock SubscribeFilterLogs
 	sub := &MockSubscription{
@@ -110,7 +110,7 @@ func TestListener_ReconcileBlockRange(t *testing.T) {
 	addr := common.HexToAddress("0x123")
 	eventGetter := new(MockContractEventGetter)
-	listener := NewListener(addr, mockClient, 1, 10, logger, nil, nil, eventGetter)
+	listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter)
 	// Setup FilterLogs mock
 	// We expect a range fetch. start=100, step=10 -> end=110. current=120.
@@ -184,7 +184,7 @@ func TestListener_Listen_HistoricalAndCurrent(t *testing.T) {
 		return nil
 	}
-	listener := NewListener(addr, mockClient, 1, 10, logger, handleEvent, handleEvent, eventGetter)
+	listener := NewListener(addr, mockClient, 1, 10, 0, logger, handleEvent, handleEvent, eventGetter)
 	// findCommonAncestor: block 100 is canonical.
 	canonicalHeader := &types.Header{Number: big.NewInt(100)}
@@ -230,7 +230,7 @@ func TestProcessEvents_DedupSkipsPresent(t *testing.T) {
 		return nil
 	}
-	listener := NewListener(addr, new(MockEVMClient), 1, 10, logger, handleEvent, handleEvent, eventGetter)
+	listener := NewListener(addr, new(MockEVMClient), 1, 10, 0, logger, handleEvent, handleEvent, eventGetter)
 	// Historical: 3 events. First 2 are present (skipped), 3rd is not (handled).
 	// After the 3rd, the check should stop — no IsContractEventPresent call for events 4+.
@@ -281,7 +281,7 @@ func TestProcessEvents_SubscriptionErrorDuringPhase1(t *testing.T) {
 		return nil
 	}
-	listener := NewListener(addr, new(MockEVMClient), 1, 10, logger, handleEvent, handleEvent, eventGetter)
+	listener := NewListener(addr, new(MockEVMClient), 1, 10, 0, logger, handleEvent, handleEvent, eventGetter)
 	// Historical channel with events that will block (not closed yet)
 	historicalCh := make(chan types.Log, 2)
@@ -309,38 +309,73 @@ func TestProcessEvents_SubscriptionErrorDuringPhase1(t *testing.T) {
 	assert.Equal(t, []uint64{100}, handledBlocks)
 }
-// TestListener_PhaseHandlerRouting verifies that Phase 1 (historical) events are routed
-// to handleHistoricalEvent and Phase 2 (live) events are routed to handleEvent. This is
-// the gate-bypass for historical replay (reorg-fix-spec.md §4.4 step 5).
+// TestListener_PhaseHandlerRouting verifies the age-based routing of Phase 1 events:
+//   - Historical events older than confirmationDelay → handleHistoricalEvent (direct, gate bypass)
+//   - Historical events younger than confirmationDelay → handleEvent (through gate; still in reorg window)
+//   - Live (Phase 2) events → handleEvent (always)
+//   - HeaderByHash fetch failures → handleEvent (conservative fallback)
+//
+// See reorg-fix-spec.md §4.4 step 5.
 func TestListener_PhaseHandlerRouting(t *testing.T) {
 	t.Parallel()
 	logger := log.NewNoopLogger()
 	addr := common.HexToAddress("0x123")
+	confirmationDelay := 60 * time.Second
+
+	mockClient := new(MockEVMClient)
 	eventGetter := new(MockContractEventGetter)
-	var historicalLogs, liveLogs []types.Log
+	var (
+		mu             sync.Mutex
+		historicalLogs []types.Log
+		liveLogs       []types.Log
+	)
 	historicalHandler := func(_ context.Context, l types.Log) error {
+		mu.Lock()
+		defer mu.Unlock()
 		historicalLogs = append(historicalLogs, l)
 		return nil
 	}
 	liveHandler := func(_ context.Context, l types.Log) error {
+		mu.Lock()
+		defer mu.Unlock()
 		liveLogs = append(liveLogs, l)
 		return nil
 	}
-	listener := NewListener(addr, new(MockEVMClient), 1, 10, logger, liveHandler, historicalHandler, eventGetter)
+	listener := NewListener(addr, mockClient, 1, 10, confirmationDelay, logger, liveHandler, historicalHandler, eventGetter)
-	// Historical: 1 event at block 100.
-	histLog := types.Log{BlockNumber: 100, Index: 0, TxHash: common.HexToHash("0xaaa")}
-	historicalCh := make(chan types.Log, 1)
-	historicalCh <- histLog
+	// Old historical event (block timestamp 10 minutes ago) — should bypass the gate.
+	oldHash := common.HexToHash("0xa1")
+	oldLog := types.Log{BlockNumber: 100, Index: 0, TxHash: common.HexToHash("0xaaa"), BlockHash: oldHash}
+	oldHeader := &types.Header{Number: big.NewInt(100), Time: uint64(time.Now().Add(-10 * time.Minute).Unix())}
+	mockClient.On("HeaderByHash", mock.Anything, oldHash).Return(oldHeader, nil).Once()
+
+	// Recent historical event (block timestamp 5 seconds ago) — should flow through the gate.
+	recentHash := common.HexToHash("0xa2")
+	recentLog := types.Log{BlockNumber: 101, Index: 0, TxHash: common.HexToHash("0xbbb"), BlockHash: recentHash}
+	recentHeader := &types.Header{Number: big.NewInt(101), Time: uint64(time.Now().Add(-5 * time.Second).Unix())}
+	mockClient.On("HeaderByHash", mock.Anything, recentHash).Return(recentHeader, nil).Once()
+
+	// Historical event whose HeaderByHash fetch fails — should fall back to the gate.
+	failHash := common.HexToHash("0xa3")
+	failLog := types.Log{BlockNumber: 102, Index: 0, TxHash: common.HexToHash("0xccc"), BlockHash: failHash}
+	mockClient.On("HeaderByHash", mock.Anything, failHash).Return(nil, fmt.Errorf("rpc failure")).Once()
+
+	// Live event — always to liveHandler regardless of age.
+	currentLog := types.Log{BlockNumber: 200, Index: 0, TxHash: common.HexToHash("0xddd"), BlockHash: common.HexToHash("0xb1")}
+
+	historicalCh := make(chan types.Log, 3)
+	historicalCh <- oldLog
+	historicalCh <- recentLog
+	historicalCh <- failLog
 	close(historicalCh)
-	// Live: 1 event at block 200.
-	currentLog := types.Log{BlockNumber: 200, Index: 0, TxHash: common.HexToHash("0xbbb")}
 	currentCh := make(chan types.Log, 1)
 	currentCh <- currentLog
+	// Only the first historical event triggers IsContractEventPresent (then the check is dropped for the phase);
+	// the first live event triggers it again for Phase 2.
 	eventGetter.On("IsContractEventPresent", uint64(1), uint64(100), mock.Anything, uint32(0)).Return(false, nil).Once()
 	eventGetter.On("IsContractEventPresent", uint64(1), uint64(200), mock.Anything, uint32(0)).Return(false, nil).Once()
@@ -356,10 +391,74 @@ func TestListener_PhaseHandlerRouting(t *testing.T) {
 	err := listener.processEvents(ctx, sub, historicalCh, currentCh, &lastBlock)
 	require.NoError(t, err)
-	require.Len(t, historicalLogs, 1, "historical handler must see the Phase 1 event")
+	mu.Lock()
+	defer mu.Unlock()
+	require.Len(t, historicalLogs, 1, "only the old historical event should bypass the gate")
+	assert.Equal(t, uint64(100), historicalLogs[0].BlockNumber)
+	require.Len(t, liveLogs, 3, "recent + fallback historical events plus the live event must reach the live handler")
+	assert.Equal(t, uint64(101), liveLogs[0].BlockNumber, "recent historical event routed through the gate")
+	assert.Equal(t, uint64(102), liveLogs[1].BlockNumber, "HeaderByHash-failed historical event routed through the gate (conservative fallback)")
+	assert.Equal(t, uint64(200), liveLogs[2].BlockNumber, "live event always routed to the gate")
+
+	mockClient.AssertExpectations(t)
+	eventGetter.AssertExpectations(t)
+}
+
+// TestListener_PhaseHandlerRouting_DelayZero verifies that when confirmationDelay is 0,
+// every historical event is routed to handleHistoricalEvent without any HeaderByHash
+// fetch — preserving the legacy bypass for gate-disabled chains.
+func TestListener_PhaseHandlerRouting_DelayZero(t *testing.T) { + t.Parallel() + logger := log.NewNoopLogger() + addr := common.HexToAddress("0x123") + + mockClient := new(MockEVMClient) + eventGetter := new(MockContractEventGetter) + + var ( + mu sync.Mutex + historicalLogs []types.Log + ) + historicalHandler := func(_ context.Context, l types.Log) error { + mu.Lock() + defer mu.Unlock() + historicalLogs = append(historicalLogs, l) + return nil + } + liveHandler := func(_ context.Context, _ types.Log) error { + t.Fatal("live handler must not be called when delay is 0 and only Phase 1 events are present") + return nil + } + + listener := NewListener(addr, mockClient, 1, 10, 0, logger, liveHandler, historicalHandler, eventGetter) + + histLog := types.Log{BlockNumber: 100, Index: 0, TxHash: common.HexToHash("0xaaa"), BlockHash: common.HexToHash("0xa1")} + historicalCh := make(chan types.Log, 1) + historicalCh <- histLog + close(historicalCh) + currentCh := make(chan types.Log) + + eventGetter.On("IsContractEventPresent", uint64(1), uint64(100), mock.Anything, uint32(0)).Return(false, nil).Once() + + sub := &MockSubscription{errChan: make(chan error, 1), unsub: func() {}} + + ctx, cancel := context.WithCancel(context.Background()) + go func() { + time.Sleep(50 * time.Millisecond) + cancel() + }() + + var lastBlock uint64 + err := listener.processEvents(ctx, sub, historicalCh, currentCh, &lastBlock) + require.NoError(t, err) + + mu.Lock() + defer mu.Unlock() + require.Len(t, historicalLogs, 1) assert.Equal(t, uint64(100), historicalLogs[0].BlockNumber) - require.Len(t, liveLogs, 1, "live handler must see the Phase 2 event") - assert.Equal(t, uint64(200), liveLogs[0].BlockNumber) + + // HeaderByHash must NOT have been called when delay is 0. 
+ mockClient.AssertNotCalled(t, "HeaderByHash") } func TestListener_RemovedLog_ForwardedToHandler(t *testing.T) { @@ -375,7 +474,7 @@ func TestListener_RemovedLog_ForwardedToHandler(t *testing.T) { return nil } - listener := NewListener(addr, new(MockEVMClient), 1, 10, logger, handleEvent, handleEvent, eventGetter) + listener := NewListener(addr, new(MockEVMClient), 1, 10, 0, logger, handleEvent, handleEvent, eventGetter) // No historical events. historicalCh := make(chan types.Log) @@ -432,7 +531,7 @@ func TestReconcileBlockRange_ContextCancellation(t *testing.T) { addr := common.HexToAddress("0x123") eventGetter := new(MockContractEventGetter) - listener := NewListener(addr, mockClient, 1, 10, logger, nil, nil, eventGetter) + listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter) ctx, cancel := context.WithCancel(context.Background()) From 5e234bb07d4916ba4eaadc7ba9c2a17d59c70428 Mon Sep 17 00:00:00 2001 From: nksazonov Date: Wed, 10 Jun 2026 17:08:29 +0200 Subject: [PATCH 10/23] fix(nitronode/reconciler): improve findCommonAncestor --- nitronode/reorg-fix-spec.md | 13 ++- pkg/blockchain/evm/interface.go | 5 +- pkg/blockchain/evm/listener_test.go | 17 ++-- pkg/blockchain/evm/reconciler.go | 50 ++++++++-- pkg/blockchain/evm/reconciler_test.go | 129 ++++++++++++++++++-------- 5 files changed, 151 insertions(+), 63 deletions(-) diff --git a/nitronode/reorg-fix-spec.md b/nitronode/reorg-fix-spec.md index 2f9941fb8..f85bd8639 100644 --- a/nitronode/reorg-fix-spec.md +++ b/nitronode/reorg-fix-spec.md @@ -99,12 +99,14 @@ Before the reconciliation logic described below can function, `block_hash` must **Why `block_hash` is the minimal required addition — and why alternatives fail:** -The reconciliation walk needs to answer one question per stored block: "is this specific block still in the canonical chain?" The only RPC call that answers it directly is `eth_getBlockByHash(hash)` — it returns `null` if the block is no longer canonical. 
Without the stored hash, two alternatives were evaluated and both fail: +The reconciliation walk needs to answer one question per stored block: "is this specific block still in the canonical chain?" The definitive answer combines the stored hash with an `eth_getBlockByNumber(storedBlockNumber)` lookup — the canonical chain has exactly one block at each height, and comparing its hash to the stored hash tells us whether the stored block is still canonical. Without the stored hash, two alternatives were evaluated and both fail: - **`block_number` alone is insufficient.** After a reorg, a *different* block can occupy the same height. Calling `eth_getBlockByNumber(storedBlockNumber)` always returns a block — but it may be a new block from the reorged fork. Without the original hash there is no way to tell whether the block returned is the one the reactor processed. - **`transaction_hash` via `eth_getTransactionReceipt` is insufficient.** A block can be reorged out even if every one of its transactions was re-mined in a new block at the same height. In that case all receipt lookups return `blockNumber` matching the stored value, but the original block is gone and the stored DB state no longer corresponds to the canonical chain. Additionally, the backward walk (step 3) must traverse every stored *block* in descending order; rows in `contract_events` only exist for blocks that contained a `ChannelHub` event. A reorg that diverged entirely within a gap — blocks with no relevant events — is invisible to a tx-receipt-based walk. +Note that `eth_getBlockByHash(storedHash)` alone is **not** suitable as the canonicality check: a node may still have the orphan side-chain header cached locally and return it successfully, so a non-null response does not prove the block is in the canonical chain. The check must use `eth_getBlockByNumber` so the response is by definition the current canonical block at that height. + `block_hash` is a single `CHAR(66)` column. 
Its addition enables exact, O(1)-per-step canonicality checks and is the only approach that handles all reorg scenarios correctly. #### Definition: latest processed block @@ -116,10 +118,11 @@ The **latest processed block** for a chain is the highest block number at which On startup, for each chain, after the `block_hash` migration has been applied: 1. Query `contract_events` for the latest committed event: `latestBlockNum = MAX(block_number)`, `latestBlockHash = block_hash` at that row. If no rows exist, start the scan from the chain's configured genesis / start block and skip to step 5. -2. Call `eth_getBlockByHash(latestBlockHash)` on the chain's RPC. - - If the response is non-null: `latestBlockHash` is still in the canonical chain — no reorg above this block. Proceed to step 4. - - If the response is null: the block has been reorged out. Proceed to step 3. -3. **Common-ancestor walk using stored block hashes:** query `contract_events` for the next-older distinct `block_hash` (the highest `block_number` strictly below the current candidate). Repeat step 2 with this hash. Continue until a block hash is found that is still in the canonical chain, or until no older stored hash exists (treat genesis as the fallback). This height is the **common ancestor**. +2. Call `eth_getBlockByNumber(latestBlockNum)` on the chain's RPC and compare the returned block's hash against `latestBlockHash`. + - **Hash matches** → the stored block is the current canonical block at that height; no reorg above it. Proceed to step 4. + - **Hash differs** → a different block now occupies that height; the stored block has been reorged out. Proceed to step 3. + - **`ethereum.NotFound`** (RPC has no canonical block at that number, e.g. the height was pruned) → treat as reorged-out and proceed to step 3 rather than failing startup. +3. 
**Common-ancestor walk using stored block hashes:** query `contract_events` for the next-older distinct `block_hash` (the highest `block_number` strictly below the current candidate). Repeat step 2 with this (number, hash) pair. Continue until a stored block is confirmed canonical, or until no older stored hash exists (treat genesis as the fallback). This height is the **common ancestor**. > **Why walk stored hashes, not block numbers?** In normal operation most blocks contain no `ChannelHub` events, so `contract_events` has no row for them. A block-number walk would find nothing to compare at event-gap heights and could miss a reorg that occurred entirely within such a gap. Walking by stored block hashes ensures every comparison is against a block the reactor actually processed. diff --git a/pkg/blockchain/evm/interface.go b/pkg/blockchain/evm/interface.go index a13477aab..ceadec511 100644 --- a/pkg/blockchain/evm/interface.go +++ b/pkg/blockchain/evm/interface.go @@ -44,6 +44,9 @@ type AssetStore interface { type EVMClient interface { ethereum.ChainStateReader bind.ContractBackend - // HeaderByHash is used by the reconciliation walk to verify block canonicality. + // HeaderByHash is used by the gate's block-timestamp fetcher and by the + // Listener's age-based routing of Phase 1 events. It returns whatever header + // the node has for the given hash (which may be a side-chain header) — it is + // NOT suitable for canonicality checks; use HeaderByNumber for that. HeaderByHash(ctx context.Context, hash common.Hash) (*types.Header, error) } diff --git a/pkg/blockchain/evm/listener_test.go b/pkg/blockchain/evm/listener_test.go index 1105e195a..e5158b978 100644 --- a/pkg/blockchain/evm/listener_test.go +++ b/pkg/blockchain/evm/listener_test.go @@ -155,9 +155,12 @@ func TestListener_Listen_HistoricalAndCurrent(t *testing.T) { logger := log.NewNoopLogger() addr := common.HexToAddress("0x123") - // Start from block 100 (canonical — HeaderByHash returns non-nil). 
+ // Start from block 100, canonical: the reconciler will compute its hash via HeaderByNumber(100) + // and compare against the stored hash. We construct a deterministic Header so we can pre-compute + // the hash and feed it back as the stored value. + canonicalAt100 := &types.Header{Number: big.NewInt(100), Difficulty: big.NewInt(1)} + blockHash100 := canonicalAt100.Hash() eventGetter := new(MockContractEventGetter) - blockHash100 := common.HexToHash("0xdeadbeef") eventGetter.On("GetLatestContractEventBlockHashAndNumber", addr.String(), uint64(1)).Return(uint64(100), blockHash100.Hex(), nil) // Historical event at block 105 is not present eventGetter.On("IsContractEventPresent", uint64(1), uint64(105), mock.Anything, uint32(0)).Return(false, nil) @@ -186,11 +189,13 @@ func TestListener_Listen_HistoricalAndCurrent(t *testing.T) { listener := NewListener(addr, mockClient, 1, 10, 0, logger, handleEvent, handleEvent, eventGetter) - // findCommonAncestor: block 100 is canonical. - canonicalHeader := &types.Header{Number: big.NewInt(100)} - mockClient.On("HeaderByHash", mock.Anything, blockHash100).Return(canonicalHeader, nil) + // findCommonAncestor: HeaderByNumber(100) returns the same header we hashed above, + // so the stored hash matches and block 100 is confirmed canonical. + mockClient.On("HeaderByNumber", mock.Anything, mock.MatchedBy(func(n *big.Int) bool { + return n != nil && n.Cmp(big.NewInt(100)) == 0 + })).Return(canonicalAt100, nil) - // Mock HeaderByNumber (current tip is 110) + // Mock HeaderByNumber(nil) for the chain-tip lookup (current tip is 110). 
currentHeader := &types.Header{Number: big.NewInt(110)} mockClient.On("HeaderByNumber", mock.Anything, (*big.Int)(nil)).Return(currentHeader, nil) diff --git a/pkg/blockchain/evm/reconciler.go b/pkg/blockchain/evm/reconciler.go index f70876451..ac88fe672 100644 --- a/pkg/blockchain/evm/reconciler.go +++ b/pkg/blockchain/evm/reconciler.go @@ -2,16 +2,19 @@ package evm import ( "context" + "errors" "fmt" + "math/big" + "github.com/ethereum/go-ethereum" "github.com/ethereum/go-ethereum/common" "github.com/layer-3/nitrolite/pkg/log" ) // findCommonAncestor determines the last block in the canonical chain that the // node has already processed. It walks stored block hashes backward until it -// finds one that eth_getBlockByHash confirms is canonical, then returns that -// block number as the safe replay start point. +// finds a stored hash that matches the canonical chain's hash at that height, +// then returns that block number as the safe replay start point. // // Returns 0 when no stored events exist or when every stored block has been // reorged out — in both cases the caller should replay from genesis/start-block. @@ -38,18 +41,12 @@ func findCommonAncestor( return 0, ctx.Err() } - hash := common.HexToHash(blockHash) - header, err := client.HeaderByHash(ctx, hash) + canonical, err := isStoredBlockCanonical(ctx, client, blockNum, common.HexToHash(blockHash)) if err != nil { return 0, fmt.Errorf("check canonicality of block %d (%s): %w", blockNum, blockHash, err) } - if header != nil { - // This block is still in the canonical chain. - if blockNum != header.Number.Uint64() { - // Sanity check: the block at this hash should have the number we stored. 
- return 0, fmt.Errorf("block hash %s has unexpected number: stored %d, chain %d", blockHash, blockNum, header.Number.Uint64()) - } + if canonical { logger.Info("reconciliation: found common ancestor", "blockchainID", blockchainID, "blockNumber", blockNum, @@ -82,3 +79,36 @@ func findCommonAncestor( blockHash = prevHash } } + +// isStoredBlockCanonical reports whether the block currently occupying blockNum +// in the canonical chain has the given storedHash. It uses HeaderByNumber rather +// than HeaderByHash because the two answer different questions: +// +// - HeaderByHash returns any header the node has indexed, including orphan +// side-chain headers still cached locally. A successful return does NOT prove +// the block is in the canonical chain. A reorged-out hash may also come back +// as ethereum.NotFound depending on the backend's pruning policy — +// conflating those two outcomes with a single boolean is unsafe. +// +// - HeaderByNumber returns the block currently occupying that height in the +// canonical chain. Comparing its hash to the stored hash is definitive: equal +// means the stored block is canonical, different means it has been reorged +// out. +// +// ethereum.NotFound from HeaderByNumber (e.g. the chain has pruned the height or +// has not yet produced a block at that height) is treated as "not canonical" +// rather than a fatal error, so the caller walks backward instead of crashing +// the listener on startup. 
+func isStoredBlockCanonical(ctx context.Context, client EVMClient, blockNum uint64, storedHash common.Hash) (bool, error) { + header, err := client.HeaderByNumber(ctx, new(big.Int).SetUint64(blockNum)) + if err != nil { + if errors.Is(err, ethereum.NotFound) { + return false, nil + } + return false, err + } + if header == nil { + return false, nil + } + return header.Hash() == storedHash, nil +} diff --git a/pkg/blockchain/evm/reconciler_test.go b/pkg/blockchain/evm/reconciler_test.go index 1fdf88e16..4037dfb95 100644 --- a/pkg/blockchain/evm/reconciler_test.go +++ b/pkg/blockchain/evm/reconciler_test.go @@ -6,6 +6,7 @@ import ( "math/big" "testing" + ethereum "github.com/ethereum/go-ethereum" "github.com/ethereum/go-ethereum/common" "github.com/ethereum/go-ethereum/core/types" "github.com/layer-3/nitrolite/pkg/log" @@ -15,7 +16,7 @@ import ( ) const ( - testContract = "0x1234567890abcdef1234567890abcdef12345678" + testContract = "0x1234567890abcdef1234567890abcdef12345678" testBlockchainID = uint64(1) ) @@ -23,6 +24,22 @@ func newTestLogger() log.Logger { return log.NewNoopLogger() } +// makeHeader builds a Header with a deterministic (and unique-per-seed) hash for +// the given block number. Two calls with different seeds produce headers whose +// Hash() values differ, which lets canonicality tests distinguish "this stored +// block is canonical" (same seed) from "this stored block was reorged out" +// (different seed at the same number). +func makeHeader(blockNum int64, seed int64) *types.Header { + return &types.Header{ + Number: big.NewInt(blockNum), + Difficulty: big.NewInt(seed), + } +} + +func bigEqual(want *big.Int) interface{} { + return mock.MatchedBy(func(got *big.Int) bool { return got != nil && got.Cmp(want) == 0 }) +} + // TestFindCommonAncestor_NoStoredEvents verifies that when no contract events exist, // findCommonAncestor returns 0 (genesis fallback). 
func TestFindCommonAncestor_NoStoredEvents(t *testing.T) { @@ -37,6 +54,7 @@ func TestFindCommonAncestor_NoStoredEvents(t *testing.T) { result, err := findCommonAncestor(context.Background(), client, getter, testContract, testBlockchainID, newTestLogger()) require.NoError(t, err) assert.Equal(t, uint64(0), result) + client.AssertNotCalled(t, "HeaderByNumber") client.AssertNotCalled(t, "HeaderByHash") } @@ -48,12 +66,12 @@ func TestFindCommonAncestor_LatestBlockCanonical(t *testing.T) { client := new(MockEVMClient) getter := new(MockContractEventGetter) - blockHash := common.HexToHash("0xaabbccdd") - getter.On("GetLatestContractEventBlockHashAndNumber", testContract, testBlockchainID). - Return(uint64(500), blockHash.Hex(), nil) + header := makeHeader(500, 1) + storedHash := header.Hash() - canonicalHeader := &types.Header{Number: big.NewInt(500)} - client.On("HeaderByHash", mock.Anything, blockHash).Return(canonicalHeader, nil) + getter.On("GetLatestContractEventBlockHashAndNumber", testContract, testBlockchainID). + Return(uint64(500), storedHash.Hex(), nil) + client.On("HeaderByNumber", mock.Anything, bigEqual(big.NewInt(500))).Return(header, nil) result, err := findCommonAncestor(context.Background(), client, getter, testContract, testBlockchainID, newTestLogger()) require.NoError(t, err) @@ -61,56 +79,85 @@ func TestFindCommonAncestor_LatestBlockCanonical(t *testing.T) { getter.AssertNotCalled(t, "GetPreviousDistinctBlockHash") } -// TestFindCommonAncestor_SingleReorgDepth verifies that when the latest block is reorged out -// but the previous one is canonical, findCommonAncestor returns the previous block number. +// TestFindCommonAncestor_SingleReorgDepth verifies that when the latest stored block has +// been reorged out (canonical chain has a different block at that height), findCommonAncestor +// walks back one step and returns the previous canonical block. 
func TestFindCommonAncestor_SingleReorgDepth(t *testing.T) { t.Parallel() client := new(MockEVMClient) getter := new(MockContractEventGetter) - reorgedHash := common.HexToHash("0xreorged0") - canonicalHash := common.HexToHash("0xcanon000") + // Block 200 was reorged: stored hash came from a now-orphan block; canonical chain + // has a different block at the same height. + storedAt200 := makeHeader(200, 1) + canonicalAt200 := makeHeader(200, 2) + require.NotEqual(t, storedAt200.Hash(), canonicalAt200.Hash()) + + // Block 190 is canonical. + headerAt190 := makeHeader(190, 1) + storedAt190 := headerAt190.Hash() getter.On("GetLatestContractEventBlockHashAndNumber", testContract, testBlockchainID). - Return(uint64(200), reorgedHash.Hex(), nil) + Return(uint64(200), storedAt200.Hash().Hex(), nil) + client.On("HeaderByNumber", mock.Anything, bigEqual(big.NewInt(200))).Return(canonicalAt200, nil) + getter.On("GetPreviousDistinctBlockHash", testContract, testBlockchainID, uint64(200)). + Return(uint64(190), storedAt190.Hex(), nil) + client.On("HeaderByNumber", mock.Anything, bigEqual(big.NewInt(190))).Return(headerAt190, nil) - // Latest block (200) reorged out. - client.On("HeaderByHash", mock.Anything, reorgedHash).Return(nil, nil) + result, err := findCommonAncestor(context.Background(), client, getter, testContract, testBlockchainID, newTestLogger()) + require.NoError(t, err) + assert.Equal(t, uint64(190), result) +} - // Walk to previous block (190) which is canonical. - getter.On("GetPreviousDistinctBlockHash", testContract, testBlockchainID, uint64(200)). - Return(uint64(190), canonicalHash.Hex(), nil) +// TestFindCommonAncestor_NotFoundTreatedAsReorg verifies that when HeaderByNumber returns +// ethereum.NotFound (e.g. the RPC backend has pruned that height, or no canonical block +// exists at that number yet), the walk continues backward instead of crashing the listener. 
+// This guards against a regression in the old HeaderByHash path, which treated NotFound +// as a fatal startup error. +func TestFindCommonAncestor_NotFoundTreatedAsReorg(t *testing.T) { + t.Parallel() + + client := new(MockEVMClient) + getter := new(MockContractEventGetter) - canonicalHeader := &types.Header{Number: big.NewInt(190)} - client.On("HeaderByHash", mock.Anything, canonicalHash).Return(canonicalHeader, nil) + storedAt200 := common.HexToHash("0xdead0200") + headerAt190 := makeHeader(190, 1) + + getter.On("GetLatestContractEventBlockHashAndNumber", testContract, testBlockchainID). + Return(uint64(200), storedAt200.Hex(), nil) + // HeaderByNumber(200) returns NotFound — must NOT be treated as fatal. + client.On("HeaderByNumber", mock.Anything, bigEqual(big.NewInt(200))).Return(nil, ethereum.NotFound) + + getter.On("GetPreviousDistinctBlockHash", testContract, testBlockchainID, uint64(200)). + Return(uint64(190), headerAt190.Hash().Hex(), nil) + client.On("HeaderByNumber", mock.Anything, bigEqual(big.NewInt(190))).Return(headerAt190, nil) result, err := findCommonAncestor(context.Background(), client, getter, testContract, testBlockchainID, newTestLogger()) require.NoError(t, err) assert.Equal(t, uint64(190), result) } -// TestFindCommonAncestor_WalkToGenesis verifies that when all stored blocks are reorged out, -// findCommonAncestor returns 0 (genesis fallback). +// TestFindCommonAncestor_WalkToGenesis verifies that when all stored blocks have been +// reorged out (canonical hashes differ at every stored height), findCommonAncestor returns +// 0 (genesis fallback).
func TestFindCommonAncestor_WalkToGenesis(t *testing.T) { t.Parallel() client := new(MockEVMClient) getter := new(MockContractEventGetter) - hash300 := common.HexToHash("0x0000300") - hash200 := common.HexToHash("0x0000200") + storedAt300 := makeHeader(300, 1).Hash() + storedAt200 := makeHeader(200, 1).Hash() + canonicalAt300 := makeHeader(300, 2) + canonicalAt200 := makeHeader(200, 2) getter.On("GetLatestContractEventBlockHashAndNumber", testContract, testBlockchainID). - Return(uint64(300), hash300.Hex(), nil) - - // Block 300 reorged out. - client.On("HeaderByHash", mock.Anything, hash300).Return(nil, nil) + Return(uint64(300), storedAt300.Hex(), nil) + client.On("HeaderByNumber", mock.Anything, bigEqual(big.NewInt(300))).Return(canonicalAt300, nil) getter.On("GetPreviousDistinctBlockHash", testContract, testBlockchainID, uint64(300)). - Return(uint64(200), hash200.Hex(), nil) - - // Block 200 reorged out. - client.On("HeaderByHash", mock.Anything, common.HexToHash(hash200.Hex())).Return(nil, nil) + Return(uint64(200), storedAt200.Hex(), nil) + client.On("HeaderByNumber", mock.Anything, bigEqual(big.NewInt(200))).Return(canonicalAt200, nil) getter.On("GetPreviousDistinctBlockHash", testContract, testBlockchainID, uint64(200)). 
Return(uint64(0), "", nil) @@ -135,7 +182,7 @@ func TestFindCommonAncestor_PreMigrationLatestRow(t *testing.T) { result, err := findCommonAncestor(context.Background(), client, getter, testContract, testBlockchainID, newTestLogger()) require.NoError(t, err) assert.Equal(t, uint64(450), result) - client.AssertNotCalled(t, "HeaderByHash") + client.AssertNotCalled(t, "HeaderByNumber") } // TestFindCommonAncestor_PreMigrationMidWalk verifies that when a pre-migration row (empty @@ -147,13 +194,12 @@ func TestFindCommonAncestor_PreMigrationMidWalk(t *testing.T) { client := new(MockEVMClient) getter := new(MockContractEventGetter) - reorgedHash := common.HexToHash("0xreorgedX") + storedAt300 := makeHeader(300, 1).Hash() + canonicalAt300 := makeHeader(300, 2) getter.On("GetLatestContractEventBlockHashAndNumber", testContract, testBlockchainID). - Return(uint64(300), reorgedHash.Hex(), nil) - - // Block 300 reorged out. - client.On("HeaderByHash", mock.Anything, reorgedHash).Return(nil, nil) + Return(uint64(300), storedAt300.Hex(), nil) + client.On("HeaderByNumber", mock.Anything, bigEqual(big.NewInt(300))).Return(canonicalAt300, nil) // Walk backward hits a pre-migration row with empty hash at block 250. getter.On("GetPreviousDistinctBlockHash", testContract, testBlockchainID, uint64(300)). @@ -162,12 +208,12 @@ func TestFindCommonAncestor_PreMigrationMidWalk(t *testing.T) { result, err := findCommonAncestor(context.Background(), client, getter, testContract, testBlockchainID, newTestLogger()) require.NoError(t, err) assert.Equal(t, uint64(250), result) - // HeaderByHash must NOT be called for the zero-hash pre-migration row. - client.AssertNumberOfCalls(t, "HeaderByHash", 1) + // HeaderByNumber must NOT be called for the zero-hash pre-migration row. + client.AssertNumberOfCalls(t, "HeaderByNumber", 1) } -// TestFindCommonAncestor_HeaderByHashError verifies that RPC errors are propagated. 
-func TestFindCommonAncestor_HeaderByHashError(t *testing.T) { +// TestFindCommonAncestor_RPCError verifies that non-NotFound RPC errors are propagated. +func TestFindCommonAncestor_RPCError(t *testing.T) { t.Parallel() client := new(MockEVMClient) @@ -177,7 +223,8 @@ func TestFindCommonAncestor_HeaderByHashError(t *testing.T) { getter.On("GetLatestContractEventBlockHashAndNumber", testContract, testBlockchainID). Return(uint64(100), blockHash.Hex(), nil) - client.On("HeaderByHash", mock.Anything, blockHash).Return(nil, errors.New("rpc timeout")) + client.On("HeaderByNumber", mock.Anything, bigEqual(big.NewInt(100))). + Return(nil, errors.New("rpc timeout")) _, err := findCommonAncestor(context.Background(), client, getter, testContract, testBlockchainID, newTestLogger()) require.Error(t, err) From 76fc4cb757245db201a221fb0682afa74ed91591 Mon Sep 17 00:00:00 2001 From: nksazonov Date: Wed, 10 Jun 2026 18:05:25 +0200 Subject: [PATCH 11/23] refactor(nitronode/listener): own timestamp via ensureBlockTimestamp --- pkg/blockchain/evm/listener.go | 103 +++++++++++++++++++++++++++------ 1 file changed, 86 insertions(+), 17 deletions(-) diff --git a/pkg/blockchain/evm/listener.go b/pkg/blockchain/evm/listener.go index 1144d5e59..621e880cc 100644 --- a/pkg/blockchain/evm/listener.go +++ b/pkg/blockchain/evm/listener.go @@ -33,6 +33,12 @@ type Listener struct { handleEvent HandleEvent // live events and recent historical events; typically the ConfirmationGate handleHistoricalEvent HandleEvent // historical events older than confirmationDelay; typically the reactor directly eventGetter ContractEventGetter + + // Single-entry block-timestamp cache for ensureBlockTimestamp. The listener's + // processEvents loop is strictly serial (Phase 1 drains before Phase 2, each + // phase processes one event at a time), so these fields require no mutex. + lastBlockHash common.Hash + lastBlockTimestamp time.Time } // NewListener creates a Listener. 
blockStep controls how many blocks are fetched @@ -283,7 +289,21 @@ func (l *Listener) processEvents( } l.logger.Debug("received historical event", "blockchainID", l.blockchainID, "contractAddress", l.contractAddress.String(), "blockNumber", eventLog.BlockNumber, "logIndex", eventLog.Index) evCtx := log.SetContextLogger(context.Background(), l.logger) - handler := l.routeHistoricalEvent(ctx, eventLog) + eventLog, err := l.ensureBlockTimestamp(ctx, eventLog) + if err != nil { + l.logger.Warn("failed to ensure block timestamp for historical event, routing through gate", + "error", err, + "blockchainID", l.blockchainID, + "blockNumber", eventLog.BlockNumber, + "blockHash", eventLog.BlockHash.Hex(), + ) + if err := l.handleEvent(evCtx, eventLog); err != nil { + eventSubscription.Unsubscribe() + return err + } + continue + } + handler := l.routeHistoricalEvent(eventLog) if err := handler(evCtx, eventLog); err != nil { eventSubscription.Unsubscribe() return err @@ -325,6 +345,19 @@ func (l *Listener) processEvents( l.logger.Debug("received current event", "blockchainID", l.blockchainID, "contractAddress", l.contractAddress.String(), "blockNumber", eventLog.BlockNumber, "logIndex", eventLog.Index) } evCtx := log.SetContextLogger(context.Background(), l.logger) + if !eventLog.Removed { + ensured, err := l.ensureBlockTimestamp(ctx, eventLog) + if err != nil { + l.logger.Warn("failed to ensure block timestamp for current event, routing through gate", + "error", err, + "blockchainID", l.blockchainID, + "blockNumber", eventLog.BlockNumber, + "blockHash", eventLog.BlockHash.Hex(), + ) + } else { + eventLog = ensured + } + } if err := l.handleEvent(evCtx, eventLog); err != nil { eventSubscription.Unsubscribe() return err @@ -414,6 +447,51 @@ func (l *Listener) reconcileBlockRange( } } +// ensureBlockTimestamp returns eventLog with BlockTimestamp guaranteed non-zero. 
+// +// Most EVM chains and providers populate BlockTimestamp in the JSON-RPC response, +// in which case eventLog is returned unchanged. For chains/providers that do NOT +// populate it (notably Avalanche C-Chain via ava-labs/libevm, and older BSC +// dataseed nodes), this method fetches the block header via HeaderByHash and +// populates the field on the local-stack copy of types.Log. +// +// Single-entry cache (lastBlockHash) elides repeat fetches for consecutive events +// from the same block — the only relevant case because the listener delivers events +// in block order. +// +// Single-threaded use only: relies on the Listener's serial processEvents loop +// (Phase 1 historical fully drains before Phase 2 live; each phase processes one +// event at a time). No mutex on the cache fields. A future refactor that +// parallelizes event handling must add synchronization or switch to a thread-safe +// cache. +// +// On HeaderByHash failure, returns the original eventLog and the error. Callers +// decide whether to fall back to the gate (which is the conservative behavior; +// see live-path and routeHistoricalEvent below). +func (l *Listener) ensureBlockTimestamp(ctx context.Context, eventLog types.Log) (types.Log, error) { + if eventLog.BlockTimestamp != 0 { + return eventLog, nil + } + + if eventLog.BlockHash == l.lastBlockHash && !l.lastBlockTimestamp.IsZero() { + eventLog.BlockTimestamp = uint64(l.lastBlockTimestamp.Unix()) + return eventLog, nil + } + + headerCtx, cancel := context.WithTimeout(ctx, rpcRequestTimeout) + defer cancel() + header, err := l.client.HeaderByHash(headerCtx, eventLog.BlockHash) + if err != nil { + return eventLog, err + } + + blockTime := time.Unix(int64(header.Time), 0) + l.lastBlockHash = eventLog.BlockHash + l.lastBlockTimestamp = blockTime + eventLog.BlockTimestamp = header.Time + return eventLog, nil +} + // routeHistoricalEvent chooses the handler for a Phase 1 event based on the age of // its block. 
Events whose block timestamp is older than confirmationDelay are routed // to handleHistoricalEvent (they are past the reorg window and safe to forward @@ -421,29 +499,20 @@ func (l *Listener) reconcileBlockRange( // handleEvent so they pass through the gate. When confirmationDelay is zero, every // event is routed to handleHistoricalEvent. // -// On a HeaderByHash fetch error the function falls back to handleEvent: routing -// through the gate is the conservative choice (it preserves the reorg-protection -// invariant at the cost of a small delay), and the gate's own block-timestamp -// fetcher will retry the lookup with its own fallback. -func (l *Listener) routeHistoricalEvent(ctx context.Context, eventLog types.Log) HandleEvent { +// Reads eventLog.BlockTimestamp directly — callers are expected to have invoked +// ensureBlockTimestamp first. Defense-in-depth: if BlockTimestamp is zero (caller +// failed to ensure it), route through handleEvent (the gate) as the conservative +// choice. 
+func (l *Listener) routeHistoricalEvent(eventLog types.Log) HandleEvent { if l.confirmationDelay == 0 { return l.handleHistoricalEvent } - headerCtx, cancel := context.WithTimeout(ctx, rpcRequestTimeout) - defer cancel() - header, err := l.client.HeaderByHash(headerCtx, eventLog.BlockHash) - if err != nil { - l.logger.Warn("failed to fetch block timestamp for historical event, routing through gate", - "error", err, - "blockchainID", l.blockchainID, - "blockNumber", eventLog.BlockNumber, - "blockHash", eventLog.BlockHash.Hex(), - ) + if eventLog.BlockTimestamp == 0 { return l.handleEvent } - blockTime := time.Unix(int64(header.Time), 0) + blockTime := time.Unix(int64(eventLog.BlockTimestamp), 0) if time.Since(blockTime) < l.confirmationDelay { return l.handleEvent } From 4ff2a73b2a9d41b3ccf4de7b45a6777134bd0a79 Mon Sep 17 00:00:00 2001 From: nksazonov Date: Wed, 10 Jun 2026 18:06:14 +0200 Subject: [PATCH 12/23] refactor(nitronode/gate): rewrite ConfirmationGate per simplification plan --- nitronode/main.go | 32 ++- pkg/blockchain/evm/confirmation_gate.go | 341 ++++++++++++------------ 2 files changed, 183 insertions(+), 190 deletions(-) diff --git a/nitronode/main.go b/nitronode/main.go index 6151dbf36..4c9956dff 100644 --- a/nitronode/main.go +++ b/nitronode/main.go @@ -25,8 +25,6 @@ import ( "github.com/layer-3/nitrolite/pkg/log" ) -const blockTimestampFetchTimeout = 10 * time.Second - func main() { if len(os.Args) > 1 && os.Args[1] == "stress-test" { os.Exit(stress.Run(os.Args[2:])) @@ -124,25 +122,25 @@ func main() { reactor := evm.NewChannelHubReactor(b.ID, bb.StateSigner.PublicKey().Address().String(), eventHandlerService, bb.MemoryStore, useCHRStoreInTx, bb.DbStore) reactor.SetOnEventProcessed(bb.RuntimeMetrics.IncBlockchainEvent) - blockTimestampFetcher := func(blockHash common.Hash) (time.Time, error) { - fetchCtx, cancel := context.WithTimeout(context.Background(), blockTimestampFetchTimeout) - defer cancel() - header, err := client.HeaderByHash(fetchCtx, 
blockHash) + confirmationDelay := time.Duration(b.ConfirmationDelaySecs) * time.Second + var liveHandler evm.HandleEvent + if confirmationDelay > 0 { + gate, err := evm.NewConfirmationGate(confirmationDelay, b.ID, reactor.HandleEvent, logger) if err != nil { - return time.Time{}, err + logger.Fatal("failed to create confirmation gate", "error", err, "blockchainID", b.ID) } - return time.Unix(int64(header.Time), 0), nil + gate.Start(blockchainCtx) + liveHandler = gate.HandleEvent + } else { + liveHandler = reactor.HandleEvent } - confirmationDelay := time.Duration(b.ConfirmationDelaySecs) * time.Second - gate := evm.NewConfirmationGate(confirmationDelay, b.ID, reactor.HandleEvent, blockTimestampFetcher, logger) - gate.Start(blockchainCtx) - - // Live events flow through the confirmation gate. Historical events from eth_getLogs - // are routed per-event based on block age: events older than confirmationDelay go - // directly to the reactor (past the reorg window); recent events still flow through - // the gate because their blocks may still be reorged. - l := evm.NewListener(common.HexToAddress(b.ChannelHubAddress), client, b.ID, b.BlockStep, confirmationDelay, logger, gate.HandleEvent, reactor.HandleEvent, bb.DbStore) + // Live events flow through the confirmation gate (when delay > 0) or directly to the + // reactor (when delay == 0). Historical events from eth_getLogs are routed per-event + // based on block age: events older than confirmationDelay go directly to the reactor + // (past the reorg window); recent events still flow through the live handler because + // their blocks may still be reorged. 
+ l := evm.NewListener(common.HexToAddress(b.ChannelHubAddress), client, b.ID, b.BlockStep, confirmationDelay, logger, liveHandler, reactor.HandleEvent, bb.DbStore) l.Listen(blockchainCtx, func(err error) { if err != nil { logger.Fatal("blockchain listener stopped", "error", err, "blockchainID", b.ID) diff --git a/pkg/blockchain/evm/confirmation_gate.go b/pkg/blockchain/evm/confirmation_gate.go index 6690d8ed1..6b7f3a268 100644 --- a/pkg/blockchain/evm/confirmation_gate.go +++ b/pkg/blockchain/evm/confirmation_gate.go @@ -2,6 +2,7 @@ package evm import ( "context" + "errors" "sync" "time" @@ -10,13 +11,16 @@ import ( "github.com/layer-3/nitrolite/pkg/log" ) -const pollInterval = 50 * time.Millisecond -const recentMultiplier = 3 // recentlyForwarded entries are kept for (recentMultiplier × delay) to catch post-gate reorgs +// recentMultiplier controls how long forwardedSet entries are retained: +// (recentMultiplier × delay). This is the window during which a post-gate +// Removed:true can be matched against a previously forwarded event and emit +// the post-gate reorg WARN. +const recentMultiplier = 3 // queueEntry holds a pending event waiting for the confirmation delay to expire. type queueEntry struct { log types.Log - arrivedAt time.Time // block timestamp from fetcher; fallback time.Now() on error + arrivedAt time.Time // derived from eventLog.BlockTimestamp; fallback time.Now() when zero } // eventKey identifies an event by tx and log index; blockHash is intentionally excluded @@ -36,146 +40,131 @@ type forwardedKey struct { logIndex uint } +// forwardedExpiry pairs a forwardedKey with the wall-clock time at which the event +// was forwarded, for O(1) FIFO eviction from forwardedSet. 
+type forwardedExpiry struct { + key forwardedKey + forwardedAt time.Time +} + // ConfirmationGate buffers incoming events for a configurable delay before forwarding // them to a downstream handler, providing a window to cancel events that are reorged // out before the delay expires. +// +// The gate is pure in-memory: it reads arrival time from eventLog.BlockTimestamp and +// performs no RPC. The caller (Listener) is responsible for ensuring BlockTimestamp +// is populated before invoking HandleEvent. type ConfirmationGate struct { - delay time.Duration - chainID uint64 - handler HandleEvent - blockTimestampFetcher func(blockHash common.Hash) (time.Time, error) - - mu sync.Mutex - queue []queueEntry - recentlyForwarded map[forwardedKey]time.Time // TTL = recentMultiplier × delay; protected by mu - // blockTimestampCache holds the timestamp for every block that has delivered at - // least one event to the gate. It avoids a redundant RPC call when the same block - // produces multiple events (e.g. a batch open with two ChannelDeposited logs). - // Entries are evicted by the Poller once the block timestamp is older than - // recentMultiplier × delay — by that point every event from the block has either - // been forwarded or cancelled, so the entry will never be read again. 
- blockTimestampCache map[common.Hash]time.Time // protected by mu - logger log.Logger + delay time.Duration + chainID uint64 + handler HandleEvent + logger log.Logger + + mu sync.Mutex + queue []queueEntry // append-tail, pop-head + pending map[eventKey]common.Hash // live (txHash, logIndex) -> blockHash; source of truth for live entries + forwardedSet map[forwardedKey]time.Time // key -> forwardedAt + forwardedQueue []forwardedExpiry // FIFO of (key, forwardedAt) for O(1) eviction + + kick chan struct{} // buffered 1; non-blocking sends + timer *time.Timer // created in Start(ctx) } // NewConfirmationGate creates a ConfirmationGate that holds events for delay before -// forwarding them to handler. fetcher is called once per unique blockHash to obtain the -// block's timestamp, which is used as the event's arrivedAt reference. If fetcher fails, -// time.Now() is used as a fallback. +// forwarding them to handler. delay must be > 0; delay <= 0 returns an error +// (the wiring layer is responsible for skipping gate construction when the operator +// configured delay == 0; see nitronode/main.go). 
func NewConfirmationGate( delay time.Duration, chainID uint64, handler HandleEvent, - fetcher func(blockHash common.Hash) (time.Time, error), logger log.Logger, -) *ConfirmationGate { - return &ConfirmationGate{ - delay: delay, - chainID: chainID, - handler: handler, - blockTimestampFetcher: fetcher, - recentlyForwarded: make(map[forwardedKey]time.Time), - blockTimestampCache: make(map[common.Hash]time.Time), - logger: logger.WithName("confirmation-gate"), +) (*ConfirmationGate, error) { + if delay <= 0 { + return nil, errors.New("confirmation gate requires delay > 0") } + return &ConfirmationGate{ + delay: delay, + chainID: chainID, + handler: handler, + logger: logger.WithName("confirmation-gate"), + pending: make(map[eventKey]common.Hash), + forwardedSet: make(map[forwardedKey]time.Time), + forwardedQueue: nil, + kick: make(chan struct{}, 1), + }, nil } -// Start begins the polling goroutine that forwards matured entries to the downstream -// handler. If delay is zero the gate is fully transparent and no goroutine is started. +// Start begins the background goroutine that forwards matured entries to the +// downstream handler. The timer is created here (tied to the goroutine's lifecycle) +// and stopped on shutdown. The goroutine exits when ctx is cancelled. func (g *ConfirmationGate) Start(ctx context.Context) { - if g.delay == 0 { - return + g.timer = time.NewTimer(time.Hour) // arbitrary long initial; will be reset on first drain + if !g.timer.Stop() { + <-g.timer.C } - go g.poll(ctx) + go g.run(ctx) } // HandleEvent is the entry point called by the upstream Listener for each event. // -// When delay == 0 the gate is fully transparent: every event (including Removed ones) -// is forwarded to the downstream handler immediately. -// -// When delay > 0: -// - A non-removed event is queued and will be forwarded after the confirmation delay. 
-// - A removed event cancels its pending queue entry (pre-gate reorg), or — if the -// entry was already forwarded — records a post-gate reorg warning. -func (g *ConfirmationGate) HandleEvent(ctx context.Context, eventLog types.Log) error { - if g.delay == 0 { - // Removed:true events are never forwarded to the reactor regardless of delay - // setting — the reactor was never designed to handle them and has no guard on - // Topics[0]. This preserves the pre-gate listener behavior of dropping reorged - // logs before they reach any downstream handler. - if eventLog.Removed { - return nil - } - return g.handler(ctx, eventLog) - } - - key := eventKey{txHash: eventLog.TxHash, logIndex: uint(eventLog.Index)} +// A non-removed event is queued and will be forwarded after the confirmation delay. +// A removed event cancels its pending queue entry (pre-gate reorg) or — if the entry +// was already forwarded — records a post-gate reorg warning. +func (g *ConfirmationGate) HandleEvent(_ context.Context, eventLog types.Log) error { + ek := eventKey{txHash: eventLog.TxHash, logIndex: uint(eventLog.Index)} if !eventLog.Removed { - // Fetch block timestamp, using cache to avoid redundant RPC calls. + // Derive arrival time from the event's block timestamp. The listener + // guarantees this is non-zero in steady state; the fallback is + // defense-in-depth for tests/edge cases. No log here — the listener + // owns the warning when it cannot ensure the timestamp. var ts time.Time - - g.mu.Lock() - cached, hit := g.blockTimestampCache[eventLog.BlockHash] - if hit { - ts = cached - } - g.mu.Unlock() - - if !hit { - fetched, err := g.blockTimestampFetcher(eventLog.BlockHash) - if err != nil { - g.logger.Warn("failed to fetch block timestamp, falling back to now", - "error", err, - "blockHash", eventLog.BlockHash.Hex(), - "chainID", g.chainID, - ) - // Use gate entry arrival time as a fallback to avoid blocking events indefinitely when the fetcher fails. 
- ts = time.Now() - } else { - ts = fetched - - // Update cache for future events from the same block. - g.mu.Lock() - g.blockTimestampCache[eventLog.BlockHash] = ts - g.mu.Unlock() - } + if eventLog.BlockTimestamp != 0 { + ts = time.Unix(int64(eventLog.BlockTimestamp), 0) + } else { + ts = time.Now() } g.mu.Lock() - // Remove any existing queue entry for the same (txHash, logIndex) so that - // a re-delivered event (after reorg, with different blockHash) replaces - // the original and resets the confirmation timer. - g.removeFromQueueByKey(key) + g.pending[ek] = eventLog.BlockHash g.queue = append(g.queue, queueEntry{log: eventLog, arrivedAt: ts}) g.mu.Unlock() + // Non-blocking kick so the poller wakes up to (re)compute the timer + // even when it is currently sleeping on a far-future deadline. + select { + case g.kick <- struct{}{}: + default: + } return nil } - // eventLog.Removed == true: attempt pre-gate cancellation. + // eventLog.Removed == true: attempt pre-gate or post-gate cancellation. + fk := forwardedKey{txHash: eventLog.TxHash, blockHash: eventLog.BlockHash, logIndex: uint(eventLog.Index)} + g.mu.Lock() defer g.mu.Unlock() - // Build the full key once; it is reused for both the queue scan and the - // recentlyForwarded lookup. blockHash is included so that a Removed notification for - // an old block does not accidentally cancel a re-mined entry with the same tx/logIndex - // in a new block. - fk := forwardedKey{txHash: eventLog.TxHash, blockHash: eventLog.BlockHash, logIndex: uint(eventLog.Index)} - if g.removeFromQueueByFullKey(fk) { + // Pre-gate cancel: the live pending entry corresponds to this block. + // Delete from pending; the tombstoned queue entry is skipped on pop. + if liveBlockHash, ok := g.pending[ek]; ok && liveBlockHash == eventLog.BlockHash { + delete(g.pending, ek) return nil } - // Not in queue — check whether it was already forwarded (post-gate reorg). 
- if _, ok := g.recentlyForwarded[fk]; ok { + // Post-gate: the event has already been forwarded. + if _, ok := g.forwardedSet[fk]; ok { g.logger.Warn("post-gate reorg detected", "txHash", eventLog.TxHash.Hex(), "blockHash", eventLog.BlockHash.Hex(), "logIndex", eventLog.Index, "chainID", g.chainID, ) - delete(g.recentlyForwarded, fk) + // Delete from the membership map; leave the forwardedQueue entry in + // place — it expires on its own. The eviction loop's value-check makes + // the later delete safe even if the same key is forwarded again. + delete(g.forwardedSet, fk) return nil } @@ -188,91 +177,97 @@ func (g *ConfirmationGate) HandleEvent(ctx context.Context, eventLog types.Log) return nil } -// removeFromQueueByKey removes the first queue entry matching key (ignores blockHash). -// Used when a non-removed re-delivery replaces an earlier entry for the same logical event. -// Caller must hold mu. -func (g *ConfirmationGate) removeFromQueueByKey(key eventKey) { - for i, e := range g.queue { - ek := eventKey{txHash: e.log.TxHash, logIndex: uint(e.log.Index)} - if ek == key { - g.queue = append(g.queue[:i], g.queue[i+1:]...) +// run is the background goroutine that wakes on a kick, on the timer firing, or on +// ctx cancellation. It forwards matured entries, evicts stale forwardedSet entries, +// and reschedules the timer for the next head deadline. +func (g *ConfirmationGate) run(ctx context.Context) { + defer g.timer.Stop() + for { + select { + case <-ctx.Done(): return + case <-g.kick: + case <-g.timer.C: } + g.drainAndReschedule() } } -// removeFromQueueByFullKey removes the first queue entry matching txHash, blockHash, and -// logIndex. Used in the Removed handler so that a removal notification for an old block -// does not accidentally cancel a re-mined entry with the same tx/logIndex in a new block. -// Caller must hold mu. 
-func (g *ConfirmationGate) removeFromQueueByFullKey(fk forwardedKey) bool { - for i, e := range g.queue { - if e.log.TxHash == fk.txHash && e.log.BlockHash == fk.blockHash && uint(e.log.Index) == fk.logIndex { - g.queue = append(g.queue[:i], g.queue[i+1:]...) - return true +// drainAndReschedule forwards all queue entries whose confirmation delay has +// elapsed, evicts forwardedSet entries older than (recentMultiplier × delay), +// and resets the timer to the next head deadline. +func (g *ConfirmationGate) drainAndReschedule() { + g.mu.Lock() + now := time.Now() + + // Step 1: drain matured head entries. + for len(g.queue) > 0 && !g.queue[0].arrivedAt.Add(g.delay).After(now) { + entry := g.queue[0] + g.queue = g.queue[1:] + + ek := eventKey{txHash: entry.log.TxHash, logIndex: uint(entry.log.Index)} + + // Tombstone check: if the live pending entry no longer points at this + // blockHash, a reorg-replacement event has superseded it. Drop silently. + // Do NOT touch pending[ek] — it refers to the new live event (still in + // the queue) and deleting it would break the next tombstone check or the + // next Removed cancel. + liveBlockHash, ok := g.pending[ek] + if !ok || liveBlockHash != entry.log.BlockHash { + continue + } + + // Forward: clear pending, insert into forwardedSet + forwardedQueue + // BEFORE releasing mu so that a fast Removed:true arriving immediately + // after the handler call still sees the entry and emits the post-gate WARN. 
+ delete(g.pending, ek) + fk := forwardedKey{ + txHash: entry.log.TxHash, + blockHash: entry.log.BlockHash, + logIndex: uint(entry.log.Index), } + g.forwardedSet[fk] = now + g.forwardedQueue = append(g.forwardedQueue, forwardedExpiry{key: fk, forwardedAt: now}) + + g.mu.Unlock() + + evCtx := log.SetContextLogger(context.Background(), g.logger) + if err := g.handler(evCtx, entry.log); err != nil { + g.logger.Error("handler error after confirmation delay", + "error", err, + "chainID", g.chainID, + ) + } + + g.mu.Lock() } - return false -} -// poll is the background goroutine that wakes on each pollInterval tick, forwards -// all matured queue entries to the downstream handler, and evicts stale recentlyForwarded -// entries whose TTL (recentMultiplier × delay) has elapsed. -func (g *ConfirmationGate) poll(ctx context.Context) { - ticker := time.NewTicker(pollInterval) - defer ticker.Stop() + // Step 2: FIFO eviction of forwardedSet entries older than recentMultiplier × delay. + for len(g.forwardedQueue) > 0 && now.Sub(g.forwardedQueue[0].forwardedAt) > recentMultiplier*g.delay { + popped := g.forwardedQueue[0] + g.forwardedQueue = g.forwardedQueue[1:] + + // Only delete from forwardedSet if the stored timestamp still equals + // the popped entry's timestamp. This guards the rare re-forward case + // (same key forwarded again after a chain un-reorg) so the older FIFO + // entry does not evict newer set membership. Tolerates the §2.4 Removed + // path having already deleted the entry (no-op). + if storedAt, ok := g.forwardedSet[popped.key]; ok && storedAt.Equal(popped.forwardedAt) { + delete(g.forwardedSet, popped.key) + } + } - for { + // Step 3: reset timer to next head deadline using the standard drain pattern. + if !g.timer.Stop() { select { - case <-ctx.Done(): - return - case <-ticker.C: - g.mu.Lock() - now := time.Now() - - // Forward all entries whose confirmation delay has elapsed. 
- for len(g.queue) > 0 && !g.queue[0].arrivedAt.Add(g.delay).After(now) { - entry := g.queue[0] - g.queue = g.queue[1:] - - fk := forwardedKey{ - txHash: entry.log.TxHash, - blockHash: entry.log.BlockHash, - logIndex: uint(entry.log.Index), - } - g.recentlyForwarded[fk] = now - - g.mu.Unlock() - - evCtx := log.SetContextLogger(context.Background(), g.logger) - if err := g.handler(evCtx, entry.log); err != nil { - g.logger.Error("handler error after confirmation delay", - "error", err, - "chainID", g.chainID, - ) - } - - g.mu.Lock() - } - - // Evict recentlyForwarded entries older than (recentMultiplier × delay). - for k, forwardedAt := range g.recentlyForwarded { - if now.Sub(forwardedAt) > recentMultiplier*g.delay { - delete(g.recentlyForwarded, k) - } - } - - // Evict blockTimestampCache entries whose block timestamp is older than - // (recentMultiplier × delay). The listener delivers events in block order, - // so once a block is old enough, all of its events have been forwarded or - // cancelled and the cached timestamp will never be read again. - for bh, ts := range g.blockTimestampCache { - if now.Sub(ts) > recentMultiplier*g.delay { - delete(g.blockTimestampCache, bh) - } - } - - g.mu.Unlock() + case <-g.timer.C: + default: } } + if len(g.queue) > 0 { + g.timer.Reset(time.Until(g.queue[0].arrivedAt.Add(g.delay))) + } + // else: leave the timer stopped; the next kick recomputes. + + g.mu.Unlock() } From 7e0996b10b3e69b46802567084d2425596e2631d Mon Sep 17 00:00:00 2001 From: nksazonov Date: Wed, 10 Jun 2026 18:42:24 +0200 Subject: [PATCH 13/23] chore(nitronode/gate): rework tests and update spec for simplified gate Reworks the gate test suite for the new ConfirmationGate API (no fetcher arg, error-returning constructor, BlockTimestamp-driven arrivedAt) and adds coverage for the new behaviors: tombstone-skip, FIFO-eviction with early-delete tolerance, single-timer reschedule, kick-during-pending-timer, shutdown-with-non-empty-queue. 
Adds listener_test cases for the new ensureBlockTimestamp helper (populated path, fetch path, single-entry cache hit, fetch error) and updates existing listener tests that previously fed events with BlockTimestamp == 0. Updates reorg-fix-spec.md sections 4.1, 4.3, 6.1, 6.2, 6.3, 6.5, 6.6, 6.7 to describe the timer+kick design, tombstone-map queue, forwardedSet/forwardedQueue + FIFO eviction with value-check, and the listener-owned ensureBlockTimestamp fallback (replacing the gate's removed block-timestamp fetcher/cache). --- nitronode/reorg-fix-spec.md | 177 +++--- pkg/blockchain/evm/confirmation_gate_test.go | 539 ++++++++++++++----- pkg/blockchain/evm/listener_test.go | 158 +++++- 3 files changed, 634 insertions(+), 240 deletions(-) diff --git a/nitronode/reorg-fix-spec.md b/nitronode/reorg-fix-spec.md index f85bd8639..8a52faf93 100644 --- a/nitronode/reorg-fix-spec.md +++ b/nitronode/reorg-fix-spec.md @@ -66,9 +66,9 @@ chains: When a log `E` arrives (without `Removed: true`): -1. Record the event under a key of `(txHash, blockHash, logIndex)`. -2. Start an in-memory timer for the chain's `confirmation_delay_sec`. -3. When the timer fires, invoke the event handler. +1. Record the event in the live-entry map under `(txHash, logIndex)` with its `blockHash` as the tombstone discriminator, and append it to the FIFO drain queue with its block timestamp as `arrivedAt`. +2. The gate's drain goroutine (single shared timer per gate; see §6.3) treats the entry as eligible once `arrivedAt + confirmation_delay_sec` has elapsed. +3. When the entry matures, invoke the event handler. ### 4.2 Reorg path @@ -82,12 +82,12 @@ If a log with `Removed: true` arrives for the same `(txHash, blockHash, logIndex The re-added event (no `Removed: true`, new block) may arrive at the listener before the corresponding `Removed: true` log for the old block. 
When this happens, the gate **replaces** the pending entry for `(txHash, logIndex)` with the new one and resets the confirmation timer under the new block's key: -- On the non-removed re-add, scan the queue by `(txHash, logIndex)` — ignoring `blockHash` — and drop any existing entry. Append the new event with a fresh `arrivedAt`. -- The subsequent `Removed: true` log for the OLD block carries the old `blockHash` and therefore matches neither the queued (new-block) entry nor any `recentlyForwarded` record. It performs a no-op. +- On the non-removed re-add, overwrite `pending[(txHash, logIndex)]` with the new `blockHash` and append the new event to the queue tail with a fresh `arrivedAt`. The earlier queue entry remains in place as a tombstone — its `blockHash` no longer matches `pending`, so the drain goroutine silently skips it when it reaches the head. +- The subsequent `Removed: true` log for the OLD block carries the old `blockHash` and therefore matches neither `pending` (whose value is now the new block's hash) nor any `forwardedSet` record. It performs a no-op. -This collapses the two-entry coexistence model into a single live entry per `(txHash, logIndex)`. The behavior is observationally equivalent — exactly one event is forwarded, and it is the latest re-mining — and it removes the only state-divergence path between the queue and `recentlyForwarded`. +The tombstone-map design replaces the prior slice-scan approach: every live operation is O(1), and exactly one event per `(txHash, logIndex)` is forwarded — the latest re-mining. -- On a `Removed: true` log for a key that **has no pending timer and no `recentlyForwarded` record**: no-op. The event either belongs to a block that was already replaced by a later re-add (handled above), or it is a stale removal from a fork the gate has no record of. +- On a `Removed: true` log for a key that **has no live `pending` entry and no `forwardedSet` record**: no-op. 
The event either belongs to a block that was already replaced by a later re-add (handled above), or it is a stale removal from a fork the gate has no record of. > Repeated reorgs of the same transaction are theoretically possible but imply a chain-level consensus failure. The gate's replace/restart cycle handles each naturally; no special cap is needed. @@ -131,7 +131,7 @@ On startup, for each chain, after the `block_hash` migration has been applied: - Events whose block timestamp is **older than `confirmation_delay_sec`** are routed directly to the reactor, bypassing the gate. Their block is past the reorg window — `eth_getLogs` returned them as canonical, and any reorg that could displace them would exceed the configured finality bound. There is no incremental reorg risk to guard against, and routing them through the gate would only add latency. - Events whose block timestamp is **younger than `confirmation_delay_sec`** are routed through the gate, the same path live events take. The common-ancestor walk only confirms the *starting* block is canonical; replay can fetch logs from blocks all the way up to the current chain tip, some of which are still inside the reorg window. Forwarding those directly to the reactor would re-introduce the very double-spend window the gate was built to close. - The `Listener` accepts two handlers (`eventHandler` for live events and recent historical events, `historicalEventHandler` for mature historical events) and makes the per-event routing decision using one `HeaderByHash` RPC per historical event. When `confirmation_delay_sec` is `0` the gate is disabled and every historical event is routed to `historicalEventHandler` without a timestamp fetch. On a `HeaderByHash` failure the Listener falls back to `eventHandler` (the gate) — the conservative choice that preserves the reorg-protection invariant at the cost of a small delay. 
+ The `Listener` accepts two handlers (`eventHandler` for live events and recent historical events, `historicalEventHandler` for mature historical events) and makes the per-event routing decision from `eventLog.BlockTimestamp`. To guarantee that field is populated regardless of the RPC provider's behavior, the listener calls `ensureBlockTimestamp` once per event, which uses `eventLog.BlockTimestamp` when present and falls back to `HeaderByHash` otherwise (at most one fetch per block regardless of event count). When `confirmation_delay_sec` is `0` the gate is disabled and every historical event is routed to `historicalEventHandler`. On an `ensureBlockTimestamp` failure the Listener falls back to `eventHandler` (the gate) — the conservative choice that preserves the reorg-protection invariant at the cost of a small delay. 6. The reactor is idempotent for replayed events: `HandleHomeChannelCreated` has an explicit early-return guard when the channel is already open; `HandleHomeChannelCheckpointed` and `RefreshUserEnforcedBalance` use set-semantics (not accumulation) and recompute from the latest DB state. `StoreContractEvent` is called last inside the DB transaction and enforces a unique constraint on `(transaction_hash, log_index, blockchain_id)`. If a duplicate is inserted, Postgres returns a constraint-violation error, causing the entire transaction (including all state mutations in the same `useStoreInTx` call) to roll back. The reactor therefore cannot double-apply state changes for an event it has already committed. 7. Historical log queries (`eth_getLogs`) return only canonical chain events — there are no `Removed: true` signals during replay, and replay does not flow through the gate (step 5). 
Removal signals from the live WebSocket subscription that arrive during the replay phase are buffered in the listener's `currentCh` and reach the gate only after the historical replay phase completes; if they cancel a re-mined event that has already been forwarded by the live path, the post-gate reorg detection in §6.5 logs them. @@ -162,32 +162,42 @@ The listener accepts a handler of type `HandleEvent func(ctx context.Context, ev ```go reactor := evm.NewChannelHubReactor(b.ID, ...) -gate := evm.NewConfirmationGate(confirmationDelay, reactor.HandleEvent) -l := evm.NewListener(..., gate.HandleEvent, ...) +var liveHandler evm.HandleEvent +if confirmationDelay > 0 { + gate, err := evm.NewConfirmationGate(confirmationDelay, b.ID, reactor.HandleEvent, logger) + if err != nil { /* fatal */ } + gate.Start(ctx) + liveHandler = gate.HandleEvent +} else { + liveHandler = reactor.HandleEvent +} +l := evm.NewListener(..., liveHandler, reactor.HandleEvent, ...) ``` +The constructor returns an error for `delay <= 0`; the wiring layer is responsible for skipping gate construction when the operator configured `confirmation_delay_sec: 0` and routing live events straight to the reactor. + The reactor itself does not change. All the listener's existing logic — subscription management, cursor tracking, reconnection, historical replay — is unaffected. **Handling `Removed: true` logs:** currently `listener.go:289-294` skips removed logs before they reach the handler. This skip must be moved: the listener should forward removed logs to `gate.HandleEvent` (they still carry the `Removed` flag on `types.Log`), and the gate alone decides whether to cancel a pending timer or ignore the signal. The reactor never sees a `Removed: true` log. ### 6.2 Event identity for queue keying -The Listener delivers events in strict block order, so the queue is naturally ordered by arrival time. 
Two distinct scan keys are used against the queue: +The Listener delivers events in strict block order, so the FIFO queue is naturally ordered by arrival time. Two distinct keys identify events at different layers of the design: -- **`(txHash, logIndex)` — used by both Pusher paths (non-removed re-add and removed cancellation).** On a non-removed arrival, any existing entry with the same `(txHash, logIndex)` is dropped and the new event appended with a fresh `arrivedAt`. Because re-adds always replace the prior entry, the queue holds at most one entry per `(txHash, logIndex)` at any time. -- **`(txHash, blockHash, logIndex)` — used by the Removed-cancel scan against the queue.** A `Removed: true` log only cancels a queued entry when the full key matches. A Removed for an OLD block whose entry has already been replaced by a newer re-add will not match the queued (new-block) entry and will fall through to the `recentlyForwarded` lookup (§6.5). +- **`(txHash, logIndex)` — the live-entry key, used as the tombstone-map (`pending`) key.** On a non-removed arrival, the Pusher sets `pending[ek] = eventLog.BlockHash` (overwriting any prior value) and appends to the queue tail. On a `Removed: true` arrival, the Pusher checks `pending[ek]` and cancels (deletes from `pending`) iff the stored `blockHash` matches the removed log's. A stale removal for an OLD block whose `pending` value has already been overwritten by a newer re-add will not match and falls through to the `forwardedSet` lookup (§6.5). Both operations are O(1) map lookups; the queue body is never scanned. +- **`(txHash, blockHash, logIndex)` — the post-gate detection key (`forwardedKey`), used to index `forwardedSet`.** When the drain goroutine forwards an event, it inserts this triple into `forwardedSet` so a later `Removed: true` for the same exact occurrence can be matched and the post-gate reorg WARN emitted. 
Including `blockHash` ensures a stale removal for an already-replaced fork cannot cause a spurious WARN against a different re-mining. -`blockHash` is excluded from the re-add scan key so that a re-mining of the same tx replaces the original regardless of which block it landed in. `blockHash` is included on the Removed scan so that a stale removal for an already-replaced fork cannot cancel a live entry. +`blockHash` is excluded from the live-entry key so that a re-mining of the same tx overwrites the original `pending` value regardless of which block it landed in. `blockHash` is included in the post-gate detection key so that the WARN matches the specific occurrence that was forwarded. A single transaction can emit multiple events for the same `txHash` (e.g., two `ChannelDeposited` logs in a batch open). `logIndex` disambiguates these; it is unique per log within a block and is present in both the live event and its corresponding `Removed: true` log. `blockHash` is also used by: -- The `recentlyForwarded` detection map (§6.5) — keyed by `(txHash, blockHash, logIndex)` to identify which specific occurrence was forwarded. +- The post-gate reorg detection map (`forwardedSet`, §6.5) — keyed by `(txHash, blockHash, logIndex)` to identify which specific occurrence was forwarded, with the FIFO `forwardedQueue` driving O(1) eviction. - `StoreContractEvent` in the reactor — stored in `contract_events` for the reconciliation walk (§4.4). -### 6.3 Two-goroutine design +### 6.3 Timer-and-kick design -**Data structure:** a FIFO queue of `(types.Log, arrivedAt time.Time)`. Naturally ordered by arrival time because the Listener delivers events in strict block order. +**Data structure:** a FIFO queue of `(types.Log, arrivedAt time.Time)` paired with a `pending` tombstone map that is the source of truth for which queue entries are live. The queue is append-tail and pop-head only; stale entries are skipped at the head by comparing `pending[ek]` to the popped entry's `BlockHash`. 
Removal scans of the queue body are eliminated. ```go type queueEntry struct { @@ -195,48 +205,77 @@ type queueEntry struct { arrivedAt time.Time } -type eventKey struct { // used for re-add scan (replace prior entry) +type eventKey struct { // used as the tombstone-map key (re-add replaces prior entry) txHash common.Hash logIndex uint } -type forwardedKey struct { // used for Removed-cancel scan and post-gate reorg detection +type forwardedKey struct { // post-gate detection key (full triple, written by drain goroutine, read on Removed) txHash common.Hash blockHash common.Hash logIndex uint } +type forwardedExpiry struct { + key forwardedKey + forwardedAt time.Time +} + type ConfirmationGate struct { - delay time.Duration - chainID uint64 - handler HandleEvent - queue []queueEntry // protected by mu - recentlyForwarded map[forwardedKey]time.Time // protected by mu; entries are kept for a small multiple of `delay` (see §6.5) - mu sync.Mutex + delay time.Duration + chainID uint64 + handler HandleEvent + logger log.Logger + + mu sync.Mutex + queue []queueEntry // protected by mu + pending map[eventKey]common.Hash // live (txHash, logIndex) -> blockHash; protected by mu + forwardedSet map[forwardedKey]time.Time // protected by mu; entries are kept for a small multiple of `delay` (see §6.5) + forwardedQueue []forwardedExpiry // FIFO of (key, forwardedAt) driving O(1) eviction; protected by mu + + kick chan struct{} // buffered 1, non-blocking sender + timer *time.Timer // created in Start(ctx); reset to the head entry's deadline } ``` --- -**Goroutine 1 — Pusher** (driven by the existing Listener; implements the `HandleEvent` signature) +**Pusher path** (driven by the existing Listener; implements the `HandleEvent` signature) Receives `types.Log` from the Listener. On each event: -- If `Removed: true` — scan the queue for an entry matching the full `(txHash, blockHash, logIndex)` key and delete it. 
If no match is found, check `recentlyForwarded` for a post-gate reorg signal (see §6.5). -- Otherwise — drop any existing queue entry with the same `(txHash, logIndex)` (ignoring `blockHash`), then append `(log, arrivedAt)` to the queue tail. `arrivedAt` is the block timestamp (see §6.7), falling back to `time.Now()` only on fetch failure. +- If `Removed: true` — under `mu`: if `pending[ek] == eventLog.BlockHash`, `delete(pending, ek)` (pre-gate cancel; the tombstoned queue entry is silently skipped when it reaches the head). Otherwise, if `forwardedSet[fk]` is set, emit the post-gate WARN (§6.5) and `delete(forwardedSet, fk)`; leave the corresponding `forwardedQueue` entry in place — it expires on its own and the eviction loop's value-check makes the early delete safe. Otherwise, emit a DEBUG "removal for unknown/stale event". +- Otherwise — under `mu`: set `pending[ek] = eventLog.BlockHash` (replacing any prior value for the same `(txHash, logIndex)`) and append `(log, arrivedAt)` to the queue tail. `arrivedAt` is the block timestamp (see §6.7). Release `mu` and send a non-blocking `kick` (`select { case g.kick <- struct{}{}: default: }`). No expiration check, no forwarding. Push only. --- -**Goroutine 2 — Poller** +**Drain goroutine** (single, started by `Start(ctx)`) -Wakes every ~50 ms on a ticker. Each wake: +A single timer drives forwarding; no idle wakeups. The timer is reset to the head entry's deadline; a 1-buffered `kick` channel coalesces wakeups from the Pusher when a new head deadline is sooner than the currently-armed timer (or when the queue was empty). -- Inspect the queue front. -- While `front.arrivedAt + delay ≤ now`: pop the entry, record `forwardedKey{txHash, blockHash, logIndex}` in `recentlyForwarded` with the current timestamp, then forward the log to the Reactor outside the lock. -- Stop as soon as the front is not yet ready — everything behind it is newer. -- Sleep until next tick. 
+```go +for { + select { + case <-ctx.Done(): + return + case <-g.kick: + case <-g.timer.C: + } + g.drainAndReschedule() +} +``` + +`drainAndReschedule`: + +1. Under `mu`: `now := time.Now()`. While the head entry is mature (`queue[0].arrivedAt + delay <= now`): + - Pop it. + - **Tombstone check:** if `pending[ek] != entry.log.BlockHash`, the live entry for that `(txHash, logIndex)` has been replaced by a re-add. Drop silently. Do **not** touch `pending[ek]` — it refers to the *new* live entry still in the queue. + - Otherwise: `delete(pending, ek)`; `forwardedSet[fk] = now`; `forwardedQueue = append(forwardedQueue, forwardedExpiry{fk, now})`. **These three writes happen before releasing `mu`** around the handler call, so a fast `Removed: true` arriving immediately after forwarding always sees the entry and emits the post-gate WARN. + - Release `mu`, call `handler`, re-acquire. +2. Evict aged-out `forwardedSet` entries (see §6.5). +3. Reset the timer to the new head's deadline, or leave it stopped if the queue is empty (the next `kick` will recompute). No event handling, no Listener awareness. Drain-and-forward only. @@ -247,11 +286,12 @@ No event handling, no Listener awareness. Drain-and-forward only. 
| Property | Detail | | --- | --- | | Chain-agnostic | `confirmationDelay` is the only chain-specific input | -| Forward latency after window | At most one tick (~50 ms) | -| Reorg within window | Pusher's scan removes the entry; Reactor never sees the event | +| Forward latency after window | Bounded by timer scheduling jitter; no fixed polling tick | +| Idle cost | None — no ticker; the goroutine blocks on `ctx.Done()`/`kick`/`timer.C` | +| Reorg within window | Pusher's tombstone delete cancels the entry; Reactor never sees the event | | Reorg deeper than window | Rare; Reactor-level idempotency (§6.6) handles re-delivered events | -| Concurrency | Both goroutines share `mu`; Reactor is called outside the lock | -| Shutdown | Poller exits on `ctx.Done()`; entries still in queue are discarded (safe — they were never forwarded) | +| Concurrency | Pusher and drain goroutine share `mu`; Reactor is called outside the lock | +| Shutdown | Drain goroutine exits on `ctx.Done()`; `defer g.timer.Stop()` cleans up the timer; entries still in queue are discarded (safe — they were never forwarded). `kick` is **not** closed — the Pusher may still be invoked by an in-flight listener event during shutdown, and the non-blocking send is safe whether the receiver is alive or gone. | ### 6.4 Exposing `confirmation_delay_secs` via API @@ -267,15 +307,22 @@ No new endpoint is needed. The field appears alongside existing per-chain fields ### 6.5 Post-gate reorg detection in the gate -The `recentlyForwarded` map (already in the `ConfirmationGate` struct, §6.3) provides detection without any DB access. The **Poller** writes to it each time it forwards an event; the **Pusher** reads from it when a `Removed: true` log arrives and the queue scan finds no matching entry. +The `forwardedSet` membership map (paired with the `forwardedQueue` FIFO; both in the `ConfirmationGate` struct, §6.3) provides detection without any DB access. 
The **drain goroutine** writes to both each time it forwards an event; the **Pusher** reads `forwardedSet` when a `Removed: true` log arrives and finds no live entry in `pending`. When `Removed: true` arrives in the Pusher: -- **Match found in queue** → normal removal; no log. -- **No match in queue, but `forwardedKey{txHash, blockHash, logIndex}` found in `recentlyForwarded`** → the event was already forwarded to the Reactor and its block has now been reorged out. Log at **`WARN`** with `txHash`, `blockHash`, `logIndex`, `chainID`. Remove the entry. -- **Match in neither** → log at `DEBUG` ("removal for unknown/stale event" — predates the current run or arrived long after the TTL). +- **`pending[ek] == eventLog.BlockHash`** → normal pre-gate removal; delete from `pending` and return. No log. +- **No pre-gate match, but `forwardedKey{txHash, blockHash, logIndex}` is in `forwardedSet`** → the event was already forwarded to the Reactor and its block has now been reorged out. Log at **`WARN`** with `txHash`, `blockHash`, `logIndex`, `chainID`. `delete(forwardedSet, fk)`. The corresponding `forwardedQueue` entry is left in place — it ages out on its own; the eviction loop's value-check (below) tolerates the early delete. +- **Match in neither** → log at `DEBUG` ("removal for unknown/stale event" — predates the current run or arrived after FIFO eviction). + +`forwardedSet` entries are kept for a small multiple of `delay` — long enough that any `Removed: true` for a forwarded event arrives while the entry is still present, short enough that the map remains bounded. The exact multiplier is an implementation choice (current value: see `recentMultiplier` in `confirmation_gate.go`; e.g. 2 or 3 work in practice). 
+ +Eviction is performed in `drainAndReschedule` (the timer/kick goroutine), not in a separate sweep: -`recentlyForwarded` entries are evicted on a TTL that is a small multiple of `delay` — long enough that any `Removed: true` for a forwarded event arrives while the entry is still present, short enough that the map remains bounded. The exact multiplier is an implementation choice (current value: see `recentMultiplier` in `confirmation_gate.go`; e.g. 2 or 3 work in practice). Eviction may be performed lazily on Pusher access, in a periodic Poller sweep, or by any equivalent strategy; the post-gate detection contract above is what matters, not the eviction mechanism. The map stays small because post-gate reorgs are rare and `Removed: true` arrives within one or two block-times of the reorg. No separate cleanup goroutine is required. +- Pop the front of `forwardedQueue` while `now − forwardedAt > recentMultiplier × delay`. +- For each popped `forwardedExpiry{key, forwardedAt}`, **delete from `forwardedSet` only if `forwardedSet[key] == forwardedAt`**. The value check guards the rare re-forward case (same key forwarded a second time after the chain un-reorgs back to the original block and a fresh delay elapses): the older FIFO entry must not evict the newer set membership. It also makes the §6.5 early delete (post-gate WARN path) a safe no-op when the eviction loop later visits its `forwardedQueue` sibling. + +`forwardedAt` is the gate's wall-clock at forward time — not `BlockTimestamp` — so FIFO ordering is monotonic regardless of how `arrivedAt` was sourced. The map stays small because post-gate reorgs are rare and `Removed: true` arrives within one or two block-times of the reorg. No separate cleanup goroutine is required. ### 6.6 Reactor defense-in-depth: skip re-delivered events @@ -304,7 +351,7 @@ Reorged events that pass through this check are still neutralized by the reactor The value of `IsContractEventProcessed` is therefore: -1. 
**Noise reduction for exact re-deliveries** — converts a constraint-violation rollback (logged as an error by the gate poller) into a clean INFO exit with no DB transaction opened. +1. **Noise reduction for exact re-deliveries** — converts a constraint-violation rollback (logged as an error by the gate's drain goroutine) into a clean INFO exit with no DB transaction opened. 2. **Correctness for the reconciliation walk (§4.4)** — when the node replays already-processed historical events on startup, every re-delivered event would otherwise produce a constraint-violation error and potentially stall the walk. This pre-check makes the reconciliation path viable. Together, §6.5 and §6.6 produce two complementary log signals: @@ -337,44 +384,16 @@ Implementing this requires: **This is a separate task.** It is not part of the current confirmation-gate scope. Until it is implemented, the reactor relies on business-logic idempotency for the reorged-different-position case, which is correct but not explicitly guarded at the storage layer. -### 6.7 Block timestamp cache - -#### Purpose - -The gate uses the **block timestamp** of each event as its `arrivedAt` reference rather than wall-clock time. This ensures that events replayed from historical blocks (whose timestamps are minutes or hours in the past) are forwarded immediately on the first Poller tick, without waiting for the full confirmation delay to elapse again. +### 6.7 Source of `arrivedAt` and the listener's timestamp fallback -Fetching the block timestamp requires one `eth_getBlockByHash` RPC call per block. A single block can produce multiple events (e.g. two `ChannelDeposited` logs in a batch open). The **block timestamp cache** avoids the redundant RPC calls: the first event from a block fetches and stores the timestamp; subsequent events from the same block read it from the cache. +The gate uses the **block timestamp** of each event as its `arrivedAt` reference rather than wall-clock time. 
This ensures that events replayed from historical blocks (whose timestamps are minutes or hours in the past) are forwarded immediately on the first drain, without waiting for the full confirmation delay to elapse again. -#### Data structure +#### Source of `arrivedAt` -```go -blockTimestampCache map[common.Hash]time.Time // protected by mu; evicted by Poller -``` - -The cache is keyed by `blockHash`. Values are written once (on the first event from a block) and are never modified. - -#### Eviction - -The cache grows monotonically without eviction: every block that produces at least one relevant event adds a permanent entry. Over the lifetime of a long-running node, this is an unbounded memory leak. - -Entries are evicted by the Poller in the same sweep pass that cleans `recentlyForwarded`. An entry is safe to remove once: - -> `now − blockTimestamp > recentMultiplier × delay` - -At that age, every event from the block has either been forwarded (within `delay` of its `arrivedAt`) or cancelled by a `Removed: true` signal. No new event from the same block can arrive after it (the listener delivers events in ascending block order). The cached timestamp therefore serves no further purpose. - -**Eviction is performed in `poll()`, under the mutex, after the `recentlyForwarded` sweep:** - -```go -for bh, ts := range g.blockTimestampCache { - if now.Sub(ts) > recentMultiplier*g.delay { - delete(g.blockTimestampCache, bh) - } -} -``` +The gate reads `eventLog.BlockTimestamp` directly from the `types.Log` it receives. It performs no RPC, holds no timestamp cache, and depends on nothing other than the in-memory value on the log struct. The listener guarantees `BlockTimestamp` is non-zero before forwarding a non-removed event to the gate. If the gate ever observes a zero value (defense-in-depth for tests and edge cases), it falls back to `time.Now()` for that single event; the listener owns any operational warning. 
-#### Bound after eviction +#### Reliability and fallback -With eviction, the cache holds at most one entry per block whose timestamp falls within the window `[now − recentMultiplier×delay, now]`. That is at most `recentMultiplier × delay × (blocks per second)` entries — a small constant for every supported chain. +`blockTimestamp` is part of the Ethereum execution JSON-RPC spec (execution-apis `receipt.yaml`, 2024) and is populated by current Geth (≥1.13.10), Erigon, Nethermind, Reth, Besu, recent `bnb-chain/bsc`, Bor, Arbitrum Nitro, and op-geth (Base, Optimism). It is **not** populated by Avalanche C-Chain (`ava-labs/libevm` does not define the field) and is unreliable on older `bsc-dataseed` nodes still in production rotation. -Each entry is 56 bytes (`common.Hash` 32 B + `time.Time` 24 B). Even the worst case would be under 100 KB. +Therefore the **listener** — not the gate — owns the fallback. Before forwarding a non-removed event to the gate (or to the reactor on the historical bypass), the listener calls `ensureBlockTimestamp`, which uses `eventLog.BlockTimestamp` when present and falls back to one `HeaderByHash(blockHash)` RPC otherwise. A single-entry cache keyed on `lastBlockHash` elides repeat fetches for consecutive events from the same block, which — because the listener delivers events in block order — is the only relevant case. `Removed: true` logs skip `ensureBlockTimestamp` entirely; the gate's cancel path never reads `BlockTimestamp`. On `HeaderByHash` failure the listener logs a WARN and forwards the event through the gate anyway, where the zero-defense fallback above degrades the entry to a wall-clock delay rather than dropping it silently. 
diff --git a/pkg/blockchain/evm/confirmation_gate_test.go b/pkg/blockchain/evm/confirmation_gate_test.go index d2bfbba3a..cefffc830 100644 --- a/pkg/blockchain/evm/confirmation_gate_test.go +++ b/pkg/blockchain/evm/confirmation_gate_test.go @@ -2,7 +2,6 @@ package evm import ( "context" - "errors" "sync" "sync/atomic" "testing" @@ -17,10 +16,16 @@ import ( // helpers -func noopFetcher(_ common.Hash) (time.Time, error) { - return time.Now(), nil -} - +// makeLog builds a types.Log with BlockTimestamp == 0. The gate then derives +// arrivedAt from time.Now() at HandleEvent time, which gives sub-second +// resolution. This is the appropriate helper for tests that use millisecond-scale +// delays — BlockTimestamp itself is unix-seconds and would round-trip-truncate +// any timestamp set by the test, causing arrivedAt to land up to 1s in the past +// and the entry to mature immediately. +// +// Tests that explicitly exercise the BlockTimestamp-driven arrival path use +// makeLogAt instead and pick durations large enough to tolerate second-resolution +// truncation. func makeLog(txHash common.Hash, blockHash common.Hash, logIndex uint, removed bool) types.Log { return types.Log{ TxHash: txHash, @@ -30,49 +35,42 @@ func makeLog(txHash common.Hash, blockHash common.Hash, logIndex uint, removed b } } -func newGate(t *testing.T, delay time.Duration, handler HandleEvent, fetcher func(common.Hash) (time.Time, error)) *ConfirmationGate { - t.Helper() - if fetcher == nil { - fetcher = noopFetcher +// makeLogAt builds a non-removed types.Log whose BlockTimestamp is set to the +// supplied wall-clock time. Used for tests that want the gate to derive +// arrivedAt from a specific moment in the past — must be paired with delays +// large enough (≥1s recommended) to tolerate seconds-resolution truncation of +// BlockTimestamp. 
+func makeLogAt(txHash common.Hash, blockHash common.Hash, logIndex uint, removed bool, ts time.Time) types.Log { + return types.Log{ + TxHash: txHash, + BlockHash: blockHash, + Index: logIndex, + Removed: removed, + BlockTimestamp: uint64(ts.Unix()), } - return NewConfirmationGate(delay, 1, handler, fetcher, log.NewNoopLogger()) } -// T1: delay==0 forwards non-removed events directly; Removed:true events are silently -// dropped to protect reactors that have no guard on l.Topics[0]. -func TestConfirmationGate_Delay0_DirectForward(t *testing.T) { - t.Parallel() - - var calls []types.Log - var mu sync.Mutex - - wantErr := errors.New("handler error") - handler := func(_ context.Context, l types.Log) error { - mu.Lock() - calls = append(calls, l) - mu.Unlock() - return wantErr - } - - g := newGate(t, 0, handler, nil) - g.Start(t.Context()) // should be a no-op for delay==0 +func newGate(t *testing.T, delay time.Duration, handler HandleEvent) *ConfirmationGate { + t.Helper() + g, err := NewConfirmationGate(delay, 1, handler, log.NewNoopLogger()) + require.NoError(t, err) + return g +} - tx := common.HexToHash("0x01") - bh := common.HexToHash("0xAA") +// T0: constructor rejects non-positive delay (operator-facing delay==0 is handled +// by wiring in main.go which skips constructing the gate).
+func TestConfirmationGate_Constructor_RejectsNonPositiveDelay(t *testing.T) { + t.Parallel() - // normal event — forwarded, handler error propagated - normalLog := makeLog(tx, bh, 0, false) - err := g.HandleEvent(context.Background(), normalLog) - require.Equal(t, wantErr, err) + handler := func(_ context.Context, _ types.Log) error { return nil } - // removed event — silently dropped; handler NOT called, nil returned - removedLog := makeLog(tx, bh, 0, true) - err = g.HandleEvent(context.Background(), removedLog) - require.NoError(t, err) + g, err := NewConfirmationGate(0, 1, handler, log.NewNoopLogger()) + require.Error(t, err) + assert.Nil(t, g) - mu.Lock() - assert.Len(t, calls, 1, "handler must be called only for non-removed event") - mu.Unlock() + g, err = NewConfirmationGate(-1*time.Second, 1, handler, log.NewNoopLogger()) + require.Error(t, err) + assert.Nil(t, g) } // T2: normal event is queued and delivered after the delay. @@ -91,7 +89,7 @@ func TestConfirmationGate_NormalPath(t *testing.T) { return nil } - g := newGate(t, 5*time.Millisecond, handler, nil) + g := newGate(t, 5*time.Millisecond, handler) g.Start(t.Context()) tx := common.HexToHash("0x02") @@ -104,7 +102,7 @@ func TestConfirmationGate_NormalPath(t *testing.T) { time.Sleep(1 * time.Millisecond) assert.Equal(t, int32(0), callCount.Load(), "handler must not be called before delay expires") - // should be called within 10 ms total + // should be called within 500 ms total deadline := time.After(500 * time.Millisecond) for callCount.Load() == 0 { select { @@ -132,7 +130,7 @@ func TestConfirmationGate_ReorgCancel(t *testing.T) { return nil } - g := newGate(t, 10*time.Millisecond, handler, nil) + g := newGate(t, 10*time.Millisecond, handler) g.Start(t.Context()) tx := common.HexToHash("0x03") @@ -146,8 +144,8 @@ func TestConfirmationGate_ReorgCancel(t *testing.T) { } // T4: a re-delivered event (same tx/logIndex, different blockHash) replaces the original -// in the queue; the late-arriving 
Removed for the old blockHash is a no-op (no queue match, -// no recentlyForwarded match); the new event is forwarded once. +// in the pending map; the late-arriving Removed for the old blockHash is a no-op (live +// pending hash no longer matches); the new event is forwarded once. func TestConfirmationGate_OutOfOrder(t *testing.T) { t.Parallel() @@ -157,24 +155,22 @@ func TestConfirmationGate_OutOfOrder(t *testing.T) { return nil } - g := newGate(t, 10*time.Millisecond, handler, nil) + g := newGate(t, 10*time.Millisecond, handler) g.Start(t.Context()) tx := common.HexToHash("0x04") bhOld := common.HexToHash("0xAA") bhNew := common.HexToHash("0xBB") - // Event A: original block — queued under (tx, bhOld, 0). + // Event A: original block — queued under (tx, 0). require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bhOld, 0, false))) - // Event B: re-mined in new block (same txHash/logIndex, different blockHash) — - // replaces A in the queue under (tx, bhNew, 0) and resets the confirmation timer. + // Event B: re-mined in new block — replaces pending[ek] = bhNew. The queued A entry + // becomes a tombstone (its blockHash no longer matches pending[ek]). require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bhNew, 0, false))) - // Removed for old block (bhOld): the queued entry now has bhNew, so the full-key - // scan finds no match. recentlyForwarded is empty (nothing forwarded yet). No-op. + // Removed for old block: pending[ek] is bhNew, not bhOld; no forwarded entry yet; + // no-op (debug log). require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bhOld, 0, true))) - // Wait long enough for the poll goroutine to fire (pollInterval=50ms) and the delay to - // have elapsed (10ms). 200ms gives generous headroom. deadline := time.After(500 * time.Millisecond) for callCount.Load() == 0 { select { @@ -185,7 +181,7 @@ func TestConfirmationGate_OutOfOrder(t *testing.T) { } } - // Only B should have been forwarded (A was cancelled). 
+ // Only B should have been forwarded (A was tombstoned and silently dropped). assert.Equal(t, int32(1), callCount.Load()) } @@ -200,7 +196,7 @@ func TestConfirmationGate_PostGateReorg(t *testing.T) { return nil } - g := newGate(t, 2*time.Millisecond, handler, nil) + g := newGate(t, 2*time.Millisecond, handler) g.Start(t.Context()) tx := common.HexToHash("0x05") @@ -240,7 +236,7 @@ func TestConfirmationGate_UnknownRemoval(t *testing.T) { return nil } - g := newGate(t, 10*time.Millisecond, handler, nil) + g := newGate(t, 10*time.Millisecond, handler) g.Start(t.Context()) tx := common.HexToHash("0x06") @@ -253,7 +249,7 @@ func TestConfirmationGate_UnknownRemoval(t *testing.T) { assert.Equal(t, int32(0), callCount.Load()) } -// T7: fetcher returns an old timestamp → event is immediately mature and forwarded fast. +// T7: BlockTimestamp far in the past → event is immediately mature and forwarded fast. func TestConfirmationGate_BlockTimestampBypass(t *testing.T) { t.Parallel() @@ -263,17 +259,15 @@ func TestConfirmationGate_BlockTimestampBypass(t *testing.T) { return nil } - pastFetcher := func(_ common.Hash) (time.Time, error) { - return time.Now().Add(-30 * time.Second), nil - } - - g := newGate(t, 10*time.Millisecond, handler, pastFetcher) + g := newGate(t, 10*time.Millisecond, handler) g.Start(t.Context()) tx := common.HexToHash("0x07") bh := common.HexToHash("0xFF") - require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bh, 0, false))) + // Block timestamp 30 seconds ago — arrivedAt + 10ms is far in the past, so the + // entry is matured the moment the drain loop runs. 
+ require.NoError(t, g.HandleEvent(context.Background(), makeLogAt(tx, bh, 0, false, time.Now().Add(-30*time.Second)))) deadline := time.After(500 * time.Millisecond) for callCount.Load() == 0 { @@ -287,7 +281,13 @@ func TestConfirmationGate_BlockTimestampBypass(t *testing.T) { assert.Equal(t, int32(1), callCount.Load()) } -// T8: fetcher returns a timestamp 60ms in the past; delay=100ms; so ~40ms remain. +// T8: partial elapsed delay: BlockTimestamp 2 seconds in the past with delay=5s. +// +// Because BlockTimestamp is unix-seconds, the .Unix() conversion truncates to the +// whole second. In the worst case the gate sees arrivedAt up to 1s +// further in the past than the wall-clock target, so the actual remaining +// delay is in (2s, 3s]. Sleeping 500ms is safely inside that "not yet" window +// regardless of where the subsecond boundary landed. func TestConfirmationGate_BlockTimestampPartialDelay(t *testing.T) { t.Parallel() @@ -297,37 +297,35 @@ func TestConfirmationGate_BlockTimestampPartialDelay(t *testing.T) { return nil } - fetcher := func(_ common.Hash) (time.Time, error) { - return time.Now().Add(-60 * time.Millisecond), nil - } - - g := newGate(t, 100*time.Millisecond, handler, fetcher) + g := newGate(t, 5*time.Second, handler) g.Start(t.Context()) tx := common.HexToHash("0x08") bh := common.HexToHash("0x08") - require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bh, 0, false))) + require.NoError(t, g.HandleEvent(context.Background(), makeLogAt(tx, bh, 0, false, time.Now().Add(-2*time.Second)))) - // Not called after 20 ms (need ~40ms more). - time.Sleep(20 * time.Millisecond) + // Not called after 500 ms (worst-case remaining is >2s). + time.Sleep(500 * time.Millisecond) assert.Equal(t, int32(0), callCount.Load(), "handler must not be called before remaining delay expires") - // Called within 200 ms total. - deadline := time.After(500 * time.Millisecond) + // Called within 7s total.
+ deadline := time.After(7 * time.Second) for callCount.Load() == 0 { select { case <-deadline: t.Fatal("handler not called within timeout") default: - time.Sleep(5 * time.Millisecond) + time.Sleep(50 * time.Millisecond) } } assert.Equal(t, int32(1), callCount.Load()) } -// T9: fetcher returns error → fallback to time.Now() → full delay must still elapse. -func TestConfirmationGate_BlockTimestampFetchError(t *testing.T) { +// T9 (reframed): BlockTimestamp == 0 falls back to time.Now() — the full delay +// must elapse. No log is emitted from the gate side (the listener owns any WARN +// for a missing timestamp). +func TestConfirmationGate_BlockTimestampZeroFallback(t *testing.T) { t.Parallel() var callCount atomic.Int32 @@ -336,23 +334,20 @@ func TestConfirmationGate_BlockTimestampFetchError(t *testing.T) { return nil } - errFetcher := func(_ common.Hash) (time.Time, error) { - return time.Time{}, errors.New("rpc error") - } - - g := newGate(t, 5*time.Millisecond, handler, errFetcher) + g := newGate(t, 10*time.Millisecond, handler) g.Start(t.Context()) tx := common.HexToHash("0x09") bh := common.HexToHash("0x09") + // makeLog produces BlockTimestamp == 0 → gate falls back to time.Now(). require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bh, 0, false))) // Not called immediately (fell back to current time, full delay required). time.Sleep(1 * time.Millisecond) assert.Equal(t, int32(0), callCount.Load(), "handler must not be called before delay expires") - // Called after 10 ms. + // Called within 500 ms. deadline := time.After(500 * time.Millisecond) for callCount.Load() == 0 { select { @@ -365,46 +360,6 @@ func TestConfirmationGate_BlockTimestampFetchError(t *testing.T) { assert.Equal(t, int32(1), callCount.Load()) } -// T10: two events sharing the same blockHash should result in exactly one fetcher call. 
-func TestConfirmationGate_BlockTimestampCache(t *testing.T) { - t.Parallel() - - var fetchCount atomic.Int32 - var callCount atomic.Int32 - - fetcher := func(_ common.Hash) (time.Time, error) { - fetchCount.Add(1) - return time.Now(), nil - } - handler := func(_ context.Context, _ types.Log) error { - callCount.Add(1) - return nil - } - - g := newGate(t, 5*time.Millisecond, handler, fetcher) - g.Start(t.Context()) - - tx1 := common.HexToHash("0x10") - tx2 := common.HexToHash("0x11") - bh := common.HexToHash("0xSHARED") - - require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx1, bh, 0, false))) - require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx2, bh, 1, false))) - - // Wait for both events to be forwarded. - deadline := time.After(500 * time.Millisecond) - for callCount.Load() < 2 { - select { - case <-deadline: - t.Fatal("not all events delivered within timeout") - default: - time.Sleep(1 * time.Millisecond) - } - } - - assert.Equal(t, int32(1), fetchCount.Load(), "fetcher must be called only once for a shared blockHash") -} - // T11: cancelling the context prevents queued events from being forwarded. func TestConfirmationGate_Shutdown(t *testing.T) { t.Parallel() @@ -415,7 +370,7 @@ func TestConfirmationGate_Shutdown(t *testing.T) { return nil } - g := newGate(t, 50*time.Millisecond, handler, nil) + g := newGate(t, 50*time.Millisecond, handler) ctx, cancel := context.WithCancel(t.Context()) g.Start(ctx) @@ -432,8 +387,10 @@ func TestConfirmationGate_Shutdown(t *testing.T) { assert.Equal(t, int32(0), callCount.Load(), "no events must be forwarded after context cancellation") } -// T12: recentlyForwarded entries are evicted after recentMultiplier × delay. -func TestConfirmationGate_RecentlyForwardedEviction(t *testing.T) { +// T12: forwardedSet entries are evicted after recentMultiplier × delay. 
+// Behavior under test: after eviction, a Removed for the same (tx, blockHash, idx) +// must fall through to the DEBUG path — no panic, no error. +func TestConfirmationGate_ForwardedSetEviction(t *testing.T) { t.Parallel() var callCount atomic.Int32 @@ -442,8 +399,8 @@ func TestConfirmationGate_RecentlyForwardedEviction(t *testing.T) { return nil } - delay := 2 * time.Millisecond - g := newGate(t, delay, handler, nil) + delay := 5 * time.Millisecond + g := newGate(t, delay, handler) g.Start(t.Context()) tx := common.HexToHash("0x12") @@ -461,20 +418,40 @@ func TestConfirmationGate_RecentlyForwardedEviction(t *testing.T) { } } - // Immediately send a post-gate Removed — should match recentlyForwarded (WARN path). - err := g.HandleEvent(context.Background(), makeLog(tx, bh, 0, true)) - assert.NoError(t, err) + // At this point forwardedSet contains the entry. + g.mu.Lock() + _, present := g.forwardedSet[forwardedKey{txHash: tx, blockHash: bh, logIndex: 0}] + g.mu.Unlock() + assert.True(t, present, "forwardedSet must contain the entry immediately after forwarding") - // Wait well past recentMultiplier × delay so the entry is evicted. - time.Sleep(time.Duration(recentMultiplier) * delay) + // Wait well past recentMultiplier × delay, then enqueue another event to trigger + // the eviction path inside drainAndReschedule. + time.Sleep(time.Duration(recentMultiplier+1) * delay) - // A second Removed for the same event — should fall through to DEBUG path (not found). - // Verifies the eviction happened. No panic, no error. - err = g.HandleEvent(context.Background(), makeLog(tx, bh, 0, true)) - assert.NoError(t, err) + tx2 := common.HexToHash("0x13") + bh2 := common.HexToHash("0x13") + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx2, bh2, 0, false))) - // Handler still called exactly once. - assert.Equal(t, int32(1), callCount.Load()) + // Wait for tx2 to forward; the eviction loop also runs. 
+ deadline = time.After(500 * time.Millisecond) + for callCount.Load() < 2 { + select { + case <-deadline: + t.Fatal("second handler invocation timed out") + default: + time.Sleep(1 * time.Millisecond) + } + } + + g.mu.Lock() + _, presentAfter := g.forwardedSet[forwardedKey{txHash: tx, blockHash: bh, logIndex: 0}] + g.mu.Unlock() + assert.False(t, presentAfter, "old forwardedSet entry must be evicted after recentMultiplier × delay") + + // A second Removed for the original event — falls through to DEBUG (not found). + // No panic, no error. + err := g.HandleEvent(context.Background(), makeLog(tx, bh, 0, true)) + assert.NoError(t, err) } // T13: multiple events are all delivered, preserving queue order. @@ -491,7 +468,7 @@ func TestConfirmationGate_MultipleEvents_Ordering(t *testing.T) { return nil } - g := newGate(t, 5*time.Millisecond, handler, nil) + g := newGate(t, 5*time.Millisecond, handler) g.Start(t.Context()) txHashes := []common.Hash{ @@ -529,3 +506,283 @@ func TestConfirmationGate_MultipleEvents_Ordering(t *testing.T) { assert.Equal(t, txHashes[1], delivered[1]) assert.Equal(t, txHashes[2], delivered[2]) } + +// New: tombstone-skip — a non-removed re-add with a different blockHash supersedes +// the queued entry. When the original entry's deadline arrives, the gate notices +// the tombstone (pending[ek] != entry.log.BlockHash) and silently drops it. +// Only the new entry's forward is observed. +func TestConfirmationGate_TombstoneSkip(t *testing.T) { + t.Parallel() + + var mu sync.Mutex + var delivered []common.Hash // blockHashes seen by handler + + handler := func(_ context.Context, l types.Log) error { + mu.Lock() + delivered = append(delivered, l.BlockHash) + mu.Unlock() + return nil + } + + delay := 30 * time.Millisecond + g := newGate(t, delay, handler) + g.Start(t.Context()) + + tx := common.HexToHash("0x20") + bhA := common.HexToHash("0xAAA") + bhB := common.HexToHash("0xBBB") + + // Enqueue event for blockHashA. 
+ require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bhA, 0, false))) + // Before the delay elapses, send a non-removed re-add with blockHashB — same (tx, idx). + // The gate replaces pending[ek] = bhB and appends a new queue entry; the bhA entry + // becomes a tombstone. + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bhB, 0, false))) + + // Wait past the delay. + deadline := time.After(500 * time.Millisecond) + for { + mu.Lock() + n := len(delivered) + mu.Unlock() + if n >= 1 { + break + } + select { + case <-deadline: + t.Fatal("handler not called within timeout — event B was not forwarded") + default: + time.Sleep(2 * time.Millisecond) + } + } + + // Allow extra time to ensure bhA does not slip through later (it shouldn't — + // it's tombstoned and dropped silently on pop). + time.Sleep(50 * time.Millisecond) + + mu.Lock() + defer mu.Unlock() + require.Len(t, delivered, 1, "exactly one forward expected (the bhB entry)") + assert.Equal(t, bhB, delivered[0], "the bhB entry must be the one forwarded") +} + +// New: FIFO eviction with early-delete tolerance. After forwarding, a Removed:true +// arrives and removes the forwardedSet entry while emitting the post-gate WARN. +// Later, the FIFO eviction loop pops the corresponding forwardedQueue entry — the +// set entry is already gone. The eviction must not panic and must not double-invoke +// the handler. +func TestConfirmationGate_FIFOEviction_ToleratesEarlyDelete(t *testing.T) { + t.Parallel() + + var callCount atomic.Int32 + handler := func(_ context.Context, _ types.Log) error { + callCount.Add(1) + return nil + } + + delay := 5 * time.Millisecond + g := newGate(t, delay, handler) + g.Start(t.Context()) + + tx := common.HexToHash("0x30") + bh := common.HexToHash("0x30") + + // Forward an event. 
+ require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bh, 0, false))) + deadline := time.After(500 * time.Millisecond) + for callCount.Load() == 0 { + select { + case <-deadline: + t.Fatal("handler not called within timeout") + default: + time.Sleep(1 * time.Millisecond) + } + } + + // Confirm the forwardedSet entry exists. + fk := forwardedKey{txHash: tx, blockHash: bh, logIndex: 0} + g.mu.Lock() + _, presentBefore := g.forwardedSet[fk] + queueLen := len(g.forwardedQueue) + g.mu.Unlock() + require.True(t, presentBefore, "forwardedSet must contain the entry immediately after forwarding") + require.Equal(t, 1, queueLen, "forwardedQueue must contain one entry") + + // Send Removed:true — gate emits post-gate WARN and deletes the entry from forwardedSet + // (but leaves the forwardedQueue entry in place; it will expire on its own). + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bh, 0, true))) + + g.mu.Lock() + _, presentAfterRemoved := g.forwardedSet[fk] + g.mu.Unlock() + require.False(t, presentAfterRemoved, "forwardedSet entry must be deleted by the post-gate WARN path") + + // Wait well past recentMultiplier × delay, then kick the drain loop with a new event + // so eviction runs and pops the orphaned forwardedQueue entry. + time.Sleep(time.Duration(recentMultiplier+1) * delay) + + tx2 := common.HexToHash("0x31") + bh2 := common.HexToHash("0x31") + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx2, bh2, 0, false))) + + // Wait for the second forward. + deadline = time.After(500 * time.Millisecond) + for callCount.Load() < 2 { + select { + case <-deadline: + t.Fatal("second handler invocation timed out") + default: + time.Sleep(1 * time.Millisecond) + } + } + + // Handler called exactly twice (once per forward; no double-action from eviction). + assert.Equal(t, int32(2), callCount.Load()) + + // The orphaned forwardedQueue entry must have been popped during eviction. 
+ g.mu.Lock() + // After tx2 is forwarded, the queue should have exactly one entry (tx2's). + finalQueueLen := len(g.forwardedQueue) + g.mu.Unlock() + assert.Equal(t, 1, finalQueueLen, "orphan forwardedQueue entry must have been evicted") +} + +// New: timer reschedule — enqueue a single event and do NOT send any further kicks. +// The handler must be invoked when the timer fires. +func TestConfirmationGate_TimerReschedule(t *testing.T) { + t.Parallel() + + var callCount atomic.Int32 + handler := func(_ context.Context, _ types.Log) error { + callCount.Add(1) + return nil + } + + g := newGate(t, 20*time.Millisecond, handler) + g.Start(t.Context()) + + tx := common.HexToHash("0x40") + bh := common.HexToHash("0x40") + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bh, 0, false))) + + // No further HandleEvent calls. Wait for the timer to fire. + deadline := time.After(500 * time.Millisecond) + for callCount.Load() == 0 { + select { + case <-deadline: + t.Fatal("handler not called via timer fire alone") + default: + time.Sleep(2 * time.Millisecond) + } + } + assert.Equal(t, int32(1), callCount.Load()) +} + +// New: kick during a pending timer must NOT extend the original timer's deadline. +// Event A is enqueued first (timer arms for A's deadline). Before A matures we +// enqueue B, whose arrival time (and therefore deadline) is LATER. A must still fire at its original +// deadline; the kick reschedules the timer to A's head deadline (unchanged).
+func TestConfirmationGate_KickDuringPendingTimer(t *testing.T) { + t.Parallel() + + var mu sync.Mutex + var deliveredOrder []common.Hash + firstFiredAt := make(chan time.Time, 1) + + handler := func(_ context.Context, l types.Log) error { + mu.Lock() + deliveredOrder = append(deliveredOrder, l.TxHash) + isFirst := len(deliveredOrder) == 1 + mu.Unlock() + if isFirst { + select { + case firstFiredAt <- time.Now(): + default: + } + } + return nil + } + + delay := 100 * time.Millisecond + g := newGate(t, delay, handler) + g.Start(t.Context()) + + txA := common.HexToHash("0x50") + bhA := common.HexToHash("0x50") + txB := common.HexToHash("0x51") + bhB := common.HexToHash("0x51") + + // Event A: BlockTimestamp == 0 → gate uses time.Now() at HandleEvent time as arrivedAt. + enqueueA := time.Now() + require.NoError(t, g.HandleEvent(context.Background(), makeLog(txA, bhA, 0, false))) + + // Brief sleep, then enqueue B. The kick wakes the drain loop; A is not yet + // mature; the timer must be reset to A's deadline. B's deadline is later than + // A's because its arrivedAt is later (HandleEvent uses time.Now() when + // BlockTimestamp == 0). + time.Sleep(20 * time.Millisecond) + require.NoError(t, g.HandleEvent(context.Background(), makeLog(txB, bhB, 0, false))) + + // Wait for A to fire. + select { + case firedAt := <-firstFiredAt: + // A's expected deadline was enqueueA + delay. Firing should occur no + // earlier than ~that moment and not be delayed by B's later deadline. + elapsed := firedAt.Sub(enqueueA) + // Allow generous slack for scheduling jitter. For reference, B's deadline is + // enqueueA + ~20ms + delay = enqueueA + ~120ms (the 20ms is the sleep above).
+ assert.GreaterOrEqual(t, elapsed, 90*time.Millisecond, "A fired before its deadline") + assert.Less(t, elapsed, 160*time.Millisecond, "A's deadline was extended by B's kick") + case <-time.After(1 * time.Second): + t.Fatal("A did not fire within timeout") + } + + mu.Lock() + defer mu.Unlock() + require.GreaterOrEqual(t, len(deliveredOrder), 1) + assert.Equal(t, txA, deliveredOrder[0], "A must fire first (queue order preserved)") +} + +// New: shutdown with non-empty queue — cancel the gate's context, assert the +// goroutine exits quickly and no handler is invoked. +func TestConfirmationGate_ShutdownWithNonEmptyQueue(t *testing.T) { + t.Parallel() + + var callCount atomic.Int32 + handlerEntered := make(chan struct{}, 4) + handler := func(_ context.Context, _ types.Log) error { + callCount.Add(1) + select { + case handlerEntered <- struct{}{}: + default: + } + return nil + } + + g := newGate(t, 200*time.Millisecond, handler) + ctx, cancel := context.WithCancel(t.Context()) + g.Start(ctx) + + // Enqueue multiple events far in the future. + for i := range 4 { + tx := common.BytesToHash([]byte{byte(0x60 + i)}) + bh := common.BytesToHash([]byte{byte(0x70 + i)}) + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bh, uint(i), false))) + } + + // Cancel and assert the gate's goroutine exits within a short window. + cancel() + + // Give the goroutine time to observe ctx.Done. + time.Sleep(50 * time.Millisecond) + + // Even if we wait far longer than the delay would otherwise require, no handler call.
+ time.Sleep(300 * time.Millisecond) + assert.Equal(t, int32(0), callCount.Load(), "no handler invocations expected after shutdown") + + select { + case <-handlerEntered: + t.Fatal("handler was invoked after shutdown") + default: + } +} diff --git a/pkg/blockchain/evm/listener_test.go b/pkg/blockchain/evm/listener_test.go index e5158b978..b80d78c39 100644 --- a/pkg/blockchain/evm/listener_test.go +++ b/pkg/blockchain/evm/listener_test.go @@ -81,12 +81,12 @@ func TestListener_Listen_CurrentEvents(t *testing.T) { unsub: func() {}, } - // Mock SubscribeFilterLogs: send a log immediately + // Mock SubscribeFilterLogs: send a log immediately. BlockTimestamp is set so + // the listener's ensureBlockTimestamp short-circuits and does not call HeaderByHash. mockClient.On("SubscribeFilterLogs", mock.Anything, mock.Anything, mock.Anything). Run(func(args mock.Arguments) { ch := args.Get(2).(chan<- types.Log) - // Send a log immediately - ch <- types.Log{BlockNumber: 10, Index: 1} + ch <- types.Log{BlockNumber: 10, Index: 1, BlockTimestamp: uint64(time.Now().Unix())} }). Return(sub, nil) @@ -199,8 +199,8 @@ func TestListener_Listen_HistoricalAndCurrent(t *testing.T) { currentHeader := &types.Header{Number: big.NewInt(110)} mockClient.On("HeaderByNumber", mock.Anything, (*big.Int)(nil)).Return(currentHeader, nil) - // Mock FilterLogs (100-110) - histLogs := []types.Log{{BlockNumber: 105, Index: 0}} + // Mock FilterLogs (100-110). BlockTimestamp is set so ensureBlockTimestamp short-circuits. + histLogs := []types.Log{{BlockNumber: 105, Index: 0, BlockTimestamp: uint64(time.Now().Unix())}} mockClient.On("FilterLogs", mock.Anything, mock.Anything).Return(histLogs, nil) // Mock SubscribeFilterLogs @@ -208,8 +208,7 @@ func TestListener_Listen_HistoricalAndCurrent(t *testing.T) { mockClient.On("SubscribeFilterLogs", mock.Anything, mock.Anything, mock.Anything). 
Run(func(args mock.Arguments) { ch := args.Get(2).(chan<- types.Log) - // Send a current log - ch <- types.Log{BlockNumber: 111, Index: 0} + ch <- types.Log{BlockNumber: 111, Index: 0, BlockTimestamp: uint64(time.Now().Unix())} }). Return(sub, nil) @@ -239,12 +238,14 @@ func TestProcessEvents_DedupSkipsPresent(t *testing.T) { // Historical: 3 events. First 2 are present (skipped), 3rd is not (handled). // After the 3rd, the check should stop — no IsContractEventPresent call for events 4+. + // BlockTimestamp is set so ensureBlockTimestamp short-circuits. + ts := uint64(time.Now().Unix()) historicalCh := make(chan types.Log, 5) - historicalCh <- types.Log{BlockNumber: 100, Index: 0, TxHash: common.HexToHash("0xaa")} - historicalCh <- types.Log{BlockNumber: 101, Index: 0, TxHash: common.HexToHash("0xbb")} - historicalCh <- types.Log{BlockNumber: 102, Index: 0, TxHash: common.HexToHash("0xcc")} - historicalCh <- types.Log{BlockNumber: 103, Index: 0, TxHash: common.HexToHash("0xdd")} - historicalCh <- types.Log{BlockNumber: 104, Index: 0, TxHash: common.HexToHash("0xee")} + historicalCh <- types.Log{BlockNumber: 100, Index: 0, TxHash: common.HexToHash("0xaa"), BlockTimestamp: ts} + historicalCh <- types.Log{BlockNumber: 101, Index: 0, TxHash: common.HexToHash("0xbb"), BlockTimestamp: ts} + historicalCh <- types.Log{BlockNumber: 102, Index: 0, TxHash: common.HexToHash("0xcc"), BlockTimestamp: ts} + historicalCh <- types.Log{BlockNumber: 103, Index: 0, TxHash: common.HexToHash("0xdd"), BlockTimestamp: ts} + historicalCh <- types.Log{BlockNumber: 104, Index: 0, TxHash: common.HexToHash("0xee"), BlockTimestamp: ts} close(historicalCh) // First two are present, third is not @@ -288,9 +289,10 @@ func TestProcessEvents_SubscriptionErrorDuringPhase1(t *testing.T) { listener := NewListener(addr, new(MockEVMClient), 1, 10, 0, logger, handleEvent, handleEvent, eventGetter) - // Historical channel with events that will block (not closed yet) + // Historical channel with events 
that will block (not closed yet). BlockTimestamp + // is set so ensureBlockTimestamp short-circuits. historicalCh := make(chan types.Log, 2) - historicalCh <- types.Log{BlockNumber: 100, Index: 0, TxHash: common.HexToHash("0xaa")} + historicalCh <- types.Log{BlockNumber: 100, Index: 0, TxHash: common.HexToHash("0xaa"), BlockTimestamp: uint64(time.Now().Unix())} eventGetter.On("IsContractEventPresent", uint64(1), uint64(100), mock.Anything, uint32(0)).Return(false, nil) @@ -367,8 +369,10 @@ func TestListener_PhaseHandlerRouting(t *testing.T) { failLog := types.Log{BlockNumber: 102, Index: 0, TxHash: common.HexToHash("0xccc"), BlockHash: failHash} mockClient.On("HeaderByHash", mock.Anything, failHash).Return(nil, fmt.Errorf("rpc failure")).Once() - // Live event — always to liveHandler regardless of age. - currentLog := types.Log{BlockNumber: 200, Index: 0, TxHash: common.HexToHash("0xddd"), BlockHash: common.HexToHash("0xb1")} + // Live event — always to liveHandler regardless of age. BlockTimestamp is set + // so ensureBlockTimestamp short-circuits on the Phase 2 path (avoiding a + // HeaderByHash call we'd otherwise have to mock). + currentLog := types.Log{BlockNumber: 200, Index: 0, TxHash: common.HexToHash("0xddd"), BlockHash: common.HexToHash("0xb1"), BlockTimestamp: uint64(time.Now().Unix())} historicalCh := make(chan types.Log, 3) historicalCh <- oldLog @@ -437,7 +441,9 @@ func TestListener_PhaseHandlerRouting_DelayZero(t *testing.T) { listener := NewListener(addr, mockClient, 1, 10, 0, logger, liveHandler, historicalHandler, eventGetter) - histLog := types.Log{BlockNumber: 100, Index: 0, TxHash: common.HexToHash("0xaaa"), BlockHash: common.HexToHash("0xa1")} + // BlockTimestamp populated by the upstream RPC — ensureBlockTimestamp short-circuits + // and routeHistoricalEvent routes directly to historicalHandler because delay == 0. 
+ histLog := types.Log{BlockNumber: 100, Index: 0, TxHash: common.HexToHash("0xaaa"), BlockHash: common.HexToHash("0xa1"), BlockTimestamp: uint64(time.Now().Unix())} historicalCh := make(chan types.Log, 1) historicalCh <- histLog close(historicalCh) @@ -462,7 +468,8 @@ func TestListener_PhaseHandlerRouting_DelayZero(t *testing.T) { require.Len(t, historicalLogs, 1) assert.Equal(t, uint64(100), historicalLogs[0].BlockNumber) - // HeaderByHash must NOT have been called when delay is 0. + // HeaderByHash must NOT have been called — the upstream RPC populated BlockTimestamp, + // so ensureBlockTimestamp short-circuits. mockClient.AssertNotCalled(t, "HeaderByHash") } @@ -488,8 +495,9 @@ func TestListener_RemovedLog_ForwardedToHandler(t *testing.T) { currentCh := make(chan types.Log, 2) // Event 1: non-Removed at block 10 — triggers IsContractEventPresent check, - // advances lastBlock, sets currentCheckDone = true. - normalLog := types.Log{BlockNumber: 10, Index: 0, TxHash: common.HexToHash("0xabc")} + // advances lastBlock, sets currentCheckDone = true. BlockTimestamp is set so + // ensureBlockTimestamp short-circuits. + normalLog := types.Log{BlockNumber: 10, Index: 0, TxHash: common.HexToHash("0xabc"), BlockTimestamp: uint64(time.Now().Unix())} eventGetter.On("IsContractEventPresent", uint64(1), uint64(10), mock.Anything, uint32(0)).Return(false, nil).Once() // Event 2: Removed=true at block 11 — must NOT advance lastBlock, must NOT call @@ -564,3 +572,113 @@ func TestReconcileBlockRange_ContextCancellation(t *testing.T) { assert.LessOrEqual(t, len(received), 1) mockClient.AssertNumberOfCalls(t, "FilterLogs", 1) } + +// TestEnsureBlockTimestamp_Populated: when BlockTimestamp is already set on the +// incoming log, ensureBlockTimestamp returns the log unchanged and does not call +// HeaderByHash. We prove the latter by leaving the mock unconfigured — any call +// would panic. 
+func TestEnsureBlockTimestamp_Populated(t *testing.T) { + t.Parallel() + mockClient := new(MockEVMClient) + logger := log.NewNoopLogger() + addr := common.HexToAddress("0x123") + eventGetter := new(MockContractEventGetter) + + listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter) + + originalTs := uint64(1700000000) + eventLog := types.Log{ + BlockNumber: 100, + BlockHash: common.HexToHash("0xabc"), + BlockTimestamp: originalTs, + } + + got, err := listener.ensureBlockTimestamp(context.Background(), eventLog) + require.NoError(t, err) + assert.Equal(t, originalTs, got.BlockTimestamp, "BlockTimestamp must be returned unchanged") + assert.Equal(t, eventLog.BlockHash, got.BlockHash) + mockClient.AssertNotCalled(t, "HeaderByHash") +} + +// TestEnsureBlockTimestamp_Fetch: when BlockTimestamp == 0, ensureBlockTimestamp +// calls HeaderByHash exactly once and populates BlockTimestamp from header.Time. +func TestEnsureBlockTimestamp_Fetch(t *testing.T) { + t.Parallel() + mockClient := new(MockEVMClient) + logger := log.NewNoopLogger() + addr := common.HexToAddress("0x123") + eventGetter := new(MockContractEventGetter) + + listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter) + + bh := common.HexToHash("0xabc") + headerTime := uint64(1700000000) + header := &types.Header{Number: big.NewInt(100), Time: headerTime} + mockClient.On("HeaderByHash", mock.Anything, bh).Return(header, nil).Once() + + eventLog := types.Log{BlockNumber: 100, BlockHash: bh} + + got, err := listener.ensureBlockTimestamp(context.Background(), eventLog) + require.NoError(t, err) + assert.Equal(t, headerTime, got.BlockTimestamp, "BlockTimestamp must be populated from header.Time") + mockClient.AssertExpectations(t) +} + +// TestEnsureBlockTimestamp_CacheHit: two consecutive events with the same BlockHash +// (both with BlockTimestamp == 0) must trigger exactly one HeaderByHash call. The +// second call reads from the single-entry cache. 
+func TestEnsureBlockTimestamp_CacheHit(t *testing.T) { + t.Parallel() + mockClient := new(MockEVMClient) + logger := log.NewNoopLogger() + addr := common.HexToAddress("0x123") + eventGetter := new(MockContractEventGetter) + + listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter) + + bh := common.HexToHash("0xabc") + headerTime := uint64(1700000000) + header := &types.Header{Number: big.NewInt(100), Time: headerTime} + // Set up exactly ONE HeaderByHash expectation; a second call would fail at + // call time because the .Once() expectation is already consumed. + mockClient.On("HeaderByHash", mock.Anything, bh).Return(header, nil).Once() + + first := types.Log{BlockNumber: 100, BlockHash: bh, Index: 0} + second := types.Log{BlockNumber: 100, BlockHash: bh, Index: 1} + + got1, err := listener.ensureBlockTimestamp(context.Background(), first) + require.NoError(t, err) + assert.Equal(t, headerTime, got1.BlockTimestamp) + + got2, err := listener.ensureBlockTimestamp(context.Background(), second) + require.NoError(t, err) + assert.Equal(t, headerTime, got2.BlockTimestamp) + + mockClient.AssertNumberOfCalls(t, "HeaderByHash", 1) + mockClient.AssertExpectations(t) +} + +// TestEnsureBlockTimestamp_FetchError: when HeaderByHash returns an error, +// ensureBlockTimestamp returns the original (unmutated) eventLog and the error. +// The caller decides whether to fall back to the gate.
+func TestEnsureBlockTimestamp_FetchError(t *testing.T) { + t.Parallel() + mockClient := new(MockEVMClient) + logger := log.NewNoopLogger() + addr := common.HexToAddress("0x123") + eventGetter := new(MockContractEventGetter) + + listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter) + + bh := common.HexToHash("0xabc") + mockClient.On("HeaderByHash", mock.Anything, bh).Return(nil, fmt.Errorf("rpc failure")).Once() + + eventLog := types.Log{BlockNumber: 100, BlockHash: bh} + + got, err := listener.ensureBlockTimestamp(context.Background(), eventLog) + require.Error(t, err) + // On error, BlockTimestamp remains at the input value (0). + assert.Equal(t, uint64(0), got.BlockTimestamp) + assert.Equal(t, bh, got.BlockHash) + mockClient.AssertExpectations(t) +} From c56163d792a896139ec3ec2d8648ef9ef282f42f Mon Sep 17 00:00:00 2001 From: nksazonov Date: Thu, 11 Jun 2026 12:29:36 +0200 Subject: [PATCH 14/23] fix(nitronode/confirmation_gate): small fixes --- nitronode/reorg-fix-spec.md | 9 ++++++--- pkg/blockchain/evm/confirmation_gate.go | 2 +- pkg/blockchain/evm/listener.go | 6 ------ 3 files changed, 7 insertions(+), 10 deletions(-) diff --git a/nitronode/reorg-fix-spec.md b/nitronode/reorg-fix-spec.md index 8a52faf93..788e1d860 100644 --- a/nitronode/reorg-fix-spec.md +++ b/nitronode/reorg-fix-spec.md @@ -131,7 +131,8 @@ On startup, for each chain, after the `block_hash` migration has been applied: - Events whose block timestamp is **older than `confirmation_delay_sec`** are routed directly to the reactor, bypassing the gate. Their block is past the reorg window — `eth_getLogs` returned them as canonical, and any reorg that could displace them would exceed the configured finality bound. There is no incremental reorg risk to guard against, and routing them through the gate would only add latency. - Events whose block timestamp is **younger than `confirmation_delay_sec`** are routed through the gate, the same path live events take. 
The common-ancestor walk only confirms the *starting* block is canonical; replay can fetch logs from blocks all the way up to the current chain tip, some of which are still inside the reorg window. Forwarding those directly to the reactor would re-introduce the very double-spend window the gate was built to close. - The `Listener` accepts two handlers (`eventHandler` for live events and recent historical events, `historicalEventHandler` for mature historical events) and makes the per-event routing decision from `eventLog.BlockTimestamp`. To guarantee that field is populated regardless of the RPC provider's behavior, the listener calls `ensureBlockTimestamp` once per event, which uses `eventLog.BlockTimestamp` when present and falls back to `HeaderByHash` otherwise (at most one fetch per block regardless of event count). When `confirmation_delay_sec` is `0` the gate is disabled and every historical event is routed to `historicalEventHandler`. On an `ensureBlockTimestamp` failure the Listener falls back to `eventHandler` (the gate) — the conservative choice that preserves the reorg-protection invariant at the cost of a small delay. + The `Listener` accepts two handlers (`eventHandler` for live events and recent historical events, `historicalEventHandler` for mature historical events) and makes the per-event routing decision from `eventLog.BlockTimestamp`. To guarantee that field is populated regardless of the RPC provider's behavior, the listener calls `ensureBlockTimestamp` once per event, which uses `eventLog.BlockTimestamp` when present and falls back to `HeaderByHash` otherwise (at most one fetch per block regardless of event count). + When `confirmation_delay_sec` is `0` the gate is disabled and every historical event is routed to `historicalEventHandler`. On an `ensureBlockTimestamp` failure the Listener falls back to `eventHandler` (the gate) — the conservative choice that preserves the reorg-protection invariant at the cost of a small delay. 6. 
The reactor is idempotent for replayed events: `HandleHomeChannelCreated` has an explicit early-return guard when the channel is already open; `HandleHomeChannelCheckpointed` and `RefreshUserEnforcedBalance` use set-semantics (not accumulation) and recompute from the latest DB state. `StoreContractEvent` is called last inside the DB transaction and enforces a unique constraint on `(transaction_hash, log_index, blockchain_id)`. If a duplicate is inserted, Postgres returns a constraint-violation error, causing the entire transaction (including all state mutations in the same `useStoreInTx` call) to roll back. The reactor therefore cannot double-apply state changes for an event it has already committed. 7. Historical log queries (`eth_getLogs`) return only canonical chain events — there are no `Removed: true` signals during replay, and replay does not flow through the gate (step 5). Removal signals from the live WebSocket subscription that arrive during the replay phase are buffered in the listener's `currentCh` and reach the gate only after the historical replay phase completes; if they cancel a re-mined event that has already been forwarded by the live path, the post-gate reorg detection in §6.5 logs them. @@ -184,7 +185,7 @@ The reactor itself does not change. All the listener's existing logic — subscr The Listener delivers events in strict block order, so the FIFO queue is naturally ordered by arrival time. Two distinct keys identify events at different layers of the design: -- **`(txHash, logIndex)` — the live-entry key, used as the tombstone-map (`pending`) key.** On a non-removed arrival, the Pusher sets `pending[ek] = eventLog.BlockHash` (overwriting any prior value) and appends to the queue tail. On a `Removed: true` arrival, the Pusher checks `pending[ek]` and cancels (deletes from `pending`) iff the stored `blockHash` matches the removed log's. 
A stale removal for an OLD block whose `pending` value has already been overwritten by a newer re-add will not match and falls through to the `forwardedSet` lookup (§6.5). Both operations are O(1) map lookups; the queue body is never scanned. +- **`(txHash, logIndex)` — the live-entry key, used as the tombstone-map (`pending`) key.** On a non-removed arrival, the Pusher sets `pending[ek] = eventLog.BlockHash` (overwriting any prior value) and appends to the queue tail. On a `Removed: true` arrival, the Pusher checks `pending[ek]` and cancels (deletes from `pending`) if the stored `blockHash` matches the removed log's. A stale removal for an OLD block whose `pending` value has already been overwritten by a newer re-add will not match and falls through to the `forwardedSet` lookup (§6.5). Both operations are O(1) map lookups; the queue body is never scanned. - **`(txHash, blockHash, logIndex)` — the post-gate detection key (`forwardedKey`), used to index `forwardedSet`.** When the drain goroutine forwards an event, it inserts this triple into `forwardedSet` so a later `Removed: true` for the same exact occurrence can be matched and the post-gate reorg WARN emitted. Including `blockHash` ensures a stale removal for an already-replaced fork cannot cause a spurious WARN against a different re-mining. `blockHash` is excluded from the live-entry key so that a re-mining of the same tx overwrites the original `pending` value regardless of which block it landed in. `blockHash` is included in the post-gate detection key so that the WARN matches the specific occurrence that was forwarded. @@ -192,6 +193,7 @@ The Listener delivers events in strict block order, so the FIFO queue is natural A single transaction can emit multiple events for the same `txHash` (e.g., two `ChannelDeposited` logs in a batch open). `logIndex` disambiguates these; it is unique per log within a block and is present in both the live event and its corresponding `Removed: true` log. 
`blockHash` is also used by: + - The post-gate reorg detection map (`forwardedSet`, §6.5) — keyed by `(txHash, blockHash, logIndex)` to identify which specific occurrence was forwarded, with the FIFO `forwardedQueue` driving O(1) eviction. - `StoreContractEvent` in the reactor — stored in `contract_events` for the reconciliation walk (§4.4). @@ -396,4 +398,5 @@ The gate reads `eventLog.BlockTimestamp` directly from the `types.Log` it receiv `blockTimestamp` is part of the Ethereum execution JSON-RPC spec (execution-apis `receipt.yaml`, 2024) and is populated by current Geth (≥1.13.10), Erigon, Nethermind, Reth, Besu, recent `bnb-chain/bsc`, Bor, Arbitrum Nitro, and op-geth (Base, Optimism). It is **not** populated by Avalanche C-Chain (`ava-labs/libevm` does not define the field) and is unreliable on older `bsc-dataseed` nodes still in production rotation. -Therefore the **listener** — not the gate — owns the fallback. Before forwarding a non-removed event to the gate (or to the reactor on the historical bypass), the listener calls `ensureBlockTimestamp`, which uses `eventLog.BlockTimestamp` when present and falls back to one `HeaderByHash(blockHash)` RPC otherwise. A single-entry cache keyed on `lastBlockHash` elides repeat fetches for consecutive events from the same block, which — because the listener delivers events in block order — is the only relevant case. `Removed: true` logs skip `ensureBlockTimestamp` entirely; the gate's cancel path never reads `BlockTimestamp`. On `HeaderByHash` failure the listener logs a WARN and forwards the event through the gate anyway, where the zero-defense fallback above degrades the entry to a wall-clock delay rather than dropping it silently. +Therefore the **listener** — not the gate — owns the fallback. 
Before forwarding a non-removed event to the gate (or to the reactor on the historical bypass), the listener calls `ensureBlockTimestamp`, which uses `eventLog.BlockTimestamp` when present and falls back to one `HeaderByHash(blockHash)` RPC otherwise. A single-entry cache keyed on `lastBlockHash` elides repeat fetches for consecutive events from the same block, which — because the listener delivers events in block order — is the only relevant case. `Removed: true` logs skip `ensureBlockTimestamp` entirely; the gate's cancel path never reads `BlockTimestamp`. +On `HeaderByHash` failure the listener logs a WARN and forwards the event through the gate anyway, where the zero-defense fallback above degrades the entry to a wall-clock delay rather than dropping it silently. diff --git a/pkg/blockchain/evm/confirmation_gate.go b/pkg/blockchain/evm/confirmation_gate.go index 6b7f3a268..166ef727e 100644 --- a/pkg/blockchain/evm/confirmation_gate.go +++ b/pkg/blockchain/evm/confirmation_gate.go @@ -73,7 +73,7 @@ type ConfirmationGate struct { // NewConfirmationGate creates a ConfirmationGate that holds events for delay before // forwarding them to handler. delay must be > 0; delay <= 0 returns an error // (the wiring layer is responsible for skipping gate construction when the operator -// configured delay == 0; see nitronode/main.go). +// configured delay == 0). func NewConfirmationGate( delay time.Duration, chainID uint64, diff --git a/pkg/blockchain/evm/listener.go b/pkg/blockchain/evm/listener.go index 621e880cc..613e5b13b 100644 --- a/pkg/blockchain/evm/listener.go +++ b/pkg/blockchain/evm/listener.go @@ -459,12 +459,6 @@ func (l *Listener) reconcileBlockRange( // from the same block — the only relevant case because the listener delivers events // in block order. // -// Single-threaded use only: relies on the Listener's serial processEvents loop -// (Phase 1 historical fully drains before Phase 2 live; each phase processes one -// event at a time). 
No mutex on the cache fields. A future refactor that -// parallelizes event handling must add synchronization or switch to a thread-safe -// cache. -// // On HeaderByHash failure, returns the original eventLog and the error. Callers // decide whether to fall back to the gate (which is the conservative behavior; // see live-path and routeHistoricalEvent below). From 22159a9b35e087a586366039ea6d608671be2190 Mon Sep 17 00:00:00 2001 From: nksazonov Date: Thu, 11 Jun 2026 16:12:55 +0200 Subject: [PATCH 15/23] docs(nitronode): move reorg-fix-spec.md to nitronode/docs/reorg-fix.md --- nitronode/{reorg-fix-spec.md => docs/reorg-fix.md} | 0 nitronode/store/database/interface.go | 2 +- pkg/blockchain/evm/channel_hub_reactor.go | 4 ++-- pkg/blockchain/evm/listener_test.go | 2 -- 4 files changed, 3 insertions(+), 5 deletions(-) rename nitronode/{reorg-fix-spec.md => docs/reorg-fix.md} (100%) diff --git a/nitronode/reorg-fix-spec.md b/nitronode/docs/reorg-fix.md similarity index 100% rename from nitronode/reorg-fix-spec.md rename to nitronode/docs/reorg-fix.md diff --git a/nitronode/store/database/interface.go b/nitronode/store/database/interface.go index 24fcbfef3..703dd112c 100644 --- a/nitronode/store/database/interface.go +++ b/nitronode/store/database/interface.go @@ -300,7 +300,7 @@ type DatabaseStore interface { // IsContractEventProcessed reports whether an event identified by (txHash, logIndex, blockchainID) // has already been committed, regardless of which block it appeared in. // NOTE: uses block-level logIndex — does not detect reorged events where the same tx - // re-mines with a different block-level log position (see reorg-fix-spec.md §6.6). + // re-mines with a different block-level log position (see nitronode/docs/reorg-fix.md §6.6). 
IsContractEventProcessed(txHash string, logIndex uint32, blockchainID uint64) (bool, error) // GetLatestContractEventBlockHashAndNumber returns the block_number and block_hash of diff --git a/pkg/blockchain/evm/channel_hub_reactor.go b/pkg/blockchain/evm/channel_hub_reactor.go index da99a6d8e..40f23ce69 100644 --- a/pkg/blockchain/evm/channel_hub_reactor.go +++ b/pkg/blockchain/evm/channel_hub_reactor.go @@ -116,7 +116,7 @@ type ChannelHubReactorStore interface { // IsContractEventProcessed reports whether an event identified by (txHash, logIndex, blockchainID) // has already been committed, regardless of which block it appeared in. // NOTE: uses block-level logIndex — does not detect reorged events where the same tx - // re-mines with a different block-level log position (see reorg-fix-spec.md §6.6). + // re-mines with a different block-level log position (see nitronode/docs/reorg-fix.md §6.6). IsContractEventProcessed(txHash string, logIndex uint32, blockchainID uint64) (bool, error) } @@ -190,7 +190,7 @@ func (r *ChannelHubReactor) HandleEvent(ctx context.Context, l types.Log) error // This converts the constraint-violation rollback path into a clean early exit and // is required for the reconciliation walk (§4.4) to replay events without errors. // Reorged events with a changed block-level logIndex pass through this check; - // they are handled by the reactor's business-logic idempotency (see reorg-fix-spec.md §6.6). + // they are handled by the reactor's business-logic idempotency (see nitronode/docs/reorg-fix.md §6.6). 
processed, err := r.store.IsContractEventProcessed(l.TxHash.String(), uint32(l.Index), r.blockchainID) if err != nil { logger.Warn("failed to check if contract event was already processed, proceeding", diff --git a/pkg/blockchain/evm/listener_test.go b/pkg/blockchain/evm/listener_test.go index b80d78c39..fb2da540a 100644 --- a/pkg/blockchain/evm/listener_test.go +++ b/pkg/blockchain/evm/listener_test.go @@ -321,8 +321,6 @@ func TestProcessEvents_SubscriptionErrorDuringPhase1(t *testing.T) { // - Historical events younger than confirmationDelay → handleEvent (through gate; still in reorg window) // - Live (Phase 2) events → handleEvent (always) // - HeaderByHash fetch failures → handleEvent (conservative fallback) -// -// See reorg-fix-spec.md §4.4 step 5. func TestListener_PhaseHandlerRouting(t *testing.T) { t.Parallel() logger := log.NewNoopLogger() From 2ead6119447719f6a4366fc180edc9953b2bc462 Mon Sep 17 00:00:00 2001 From: nksazonov Date: Thu, 11 Jun 2026 16:18:09 +0200 Subject: [PATCH 16/23] docs(nitronode/reconciler): note empty-store vs full-depth-reorg conflation --- nitronode/docs/reorg-fix.md | 2 ++ pkg/blockchain/evm/reconciler.go | 6 ++++++ 2 files changed, 8 insertions(+) diff --git a/nitronode/docs/reorg-fix.md b/nitronode/docs/reorg-fix.md index 788e1d860..7dfe79f8c 100644 --- a/nitronode/docs/reorg-fix.md +++ b/nitronode/docs/reorg-fix.md @@ -126,6 +126,8 @@ On startup, for each chain, after the `block_hash` migration has been applied: > **Why walk stored hashes, not block numbers?** In normal operation most blocks contain no `ChannelHub` events, so `contract_events` has no row for them. A block-number walk would find nothing to compare at event-gap heights and could miss a reorg that occurred entirely within such a gap. Walking by stored block hashes ensures every comparison is against a block the reactor actually processed. 
+ If the walk reaches genesis without finding a canonical stored block, this implies either an empty store or a full-depth reorg; the latter is treated as a chain-level incident outside the gate's scope, and the listener proceeds as if the store were empty. + 4. Set the scan start to `commonAncestorBlockNum`. Events between `commonAncestorBlockNum` and `latestBlockNum` that came from the reorged fork are still present in the DB. The reactor has no rollback mechanism for those rows — the re-scan below will re-apply canonical events over them where the transaction was re-mined (idempotent), and leave the orphaned DB state in place where the transaction was not re-mined (residual risk; see §2.1). State-setting operations (`UpdateChannel`, `RefreshUserEnforcedBalance`) will overwrite with canonical values for re-mined events; rows from dropped transactions remain as stale data with no automated cleanup. 5. Start the event scan from `commonAncestorBlockNum` (or genesis if step 1 found no rows). Replayed events are routed **per-event by block age**: - Events whose block timestamp is **older than `confirmation_delay_sec`** are routed directly to the reactor, bypassing the gate. Their block is past the reorg window — `eth_getLogs` returned them as canonical, and any reorg that could displace them would exceed the configured finality bound. There is no incremental reorg risk to guard against, and routing them through the gate would only add latency. diff --git a/pkg/blockchain/evm/reconciler.go b/pkg/blockchain/evm/reconciler.go index ac88fe672..08616799d 100644 --- a/pkg/blockchain/evm/reconciler.go +++ b/pkg/blockchain/evm/reconciler.go @@ -68,6 +68,12 @@ func findCommonAncestor( if prevHash == "" { // No older stored block (prevNum=0) or pre-migration row (prevNum>0). // Use prevNum as the safe canonical resume point. + // + // This branch conflates two distinct states: an empty store and a + // full-depth reorg where every stored block was reorged out. 
The latter + // requires a chain-level consensus failure, which is extremely unlikely and + // outside the confirmation gate's scope; + // both cases are treated identically here by proceeding as if the store were empty. logger.Info("reconciliation: reached pre-migration or genesis boundary", "blockchainID", blockchainID, "blockNumber", prevNum, From c398e28f3e90fe5bf9b1e5771d0fd314e071673c Mon Sep 17 00:00:00 2001 From: nksazonov Date: Thu, 11 Jun 2026 16:24:43 +0200 Subject: [PATCH 17/23] feat(nitronode/gate): propagate handler errors via fatal channel --- nitronode/docs/reorg-fix.md | 6 ++ nitronode/main.go | 5 +- pkg/blockchain/evm/confirmation_gate.go | 31 +++++++++- pkg/blockchain/evm/confirmation_gate_test.go | 62 ++++++++++++++++++++ pkg/blockchain/evm/listener.go | 22 ++++++- pkg/blockchain/evm/listener_test.go | 28 ++++----- 6 files changed, 135 insertions(+), 19 deletions(-) diff --git a/nitronode/docs/reorg-fix.md b/nitronode/docs/reorg-fix.md index 7dfe79f8c..600c14d44 100644 --- a/nitronode/docs/reorg-fix.md +++ b/nitronode/docs/reorg-fix.md @@ -402,3 +402,9 @@ The gate reads `eventLog.BlockTimestamp` directly from the `types.Log` it receiv Therefore the **listener** — not the gate — owns the fallback. Before forwarding a non-removed event to the gate (or to the reactor on the historical bypass), the listener calls `ensureBlockTimestamp`, which uses `eventLog.BlockTimestamp` when present and falls back to one `HeaderByHash(blockHash)` RPC otherwise. A single-entry cache keyed on `lastBlockHash` elides repeat fetches for consecutive events from the same block, which — because the listener delivers events in block order — is the only relevant case. `Removed: true` logs skip `ensureBlockTimestamp` entirely; the gate's cancel path never reads `BlockTimestamp`.
On `HeaderByHash` failure the listener logs a WARN and forwards the event through the gate anyway, where the zero-defense fallback above degrades the entry to a wall-clock delay rather than dropping it silently. + +--- + +### 6.8 Handler error semantics + +When a downstream handler invoked after the confirmation delay returns an error, the gate sends the error on a buffered (size 1) fatal channel exposed via `FatalErrors()` and the goroutine exits without draining further pending entries. The listener receives the error in its Phase 2 (and Phase 1) select, unsubscribes, and returns the error to `Listen`'s closure, which in `nitronode/main.go` calls `logger.Fatal` → process exit. The supervisor restarts the process; the next `Listen` invocation re-fetches the unstored event via the DB cursor in `findCommonAncestor` + Phase 1 reconciliation, restoring the pre-PR crash-restart-replay invariant. The gate does **not** retry handler errors in-process; this is intentional and matches pre-PR behavior. Events queued behind the failed event are dropped on teardown and re-fetched after restart. diff --git a/nitronode/main.go b/nitronode/main.go index 4c9956dff..fb933e306 100644 --- a/nitronode/main.go +++ b/nitronode/main.go @@ -124,6 +124,7 @@ func main() { confirmationDelay := time.Duration(b.ConfirmationDelaySecs) * time.Second var liveHandler evm.HandleEvent + var fatalCh <-chan error if confirmationDelay > 0 { gate, err := evm.NewConfirmationGate(confirmationDelay, b.ID, reactor.HandleEvent, logger) if err != nil { @@ -131,6 +132,7 @@ func main() { } gate.Start(blockchainCtx) liveHandler = gate.HandleEvent + fatalCh = gate.FatalErrors() } else { liveHandler = reactor.HandleEvent } @@ -140,7 +142,8 @@ func main() { // based on block age: events older than confirmationDelay go directly to the reactor // (past the reorg window); recent events still flow through the live handler because // their blocks may still be reorged. 
- l := evm.NewListener(common.HexToAddress(b.ChannelHubAddress), client, b.ID, b.BlockStep, confirmationDelay, logger, liveHandler, reactor.HandleEvent, bb.DbStore) + // fatalCh is nil when confirmationDelay == 0; a nil channel never selects. + l := evm.NewListener(common.HexToAddress(b.ChannelHubAddress), client, b.ID, b.BlockStep, confirmationDelay, logger, liveHandler, reactor.HandleEvent, bb.DbStore, fatalCh) l.Listen(blockchainCtx, func(err error) { if err != nil { logger.Fatal("blockchain listener stopped", "error", err, "blockchainID", b.ID) diff --git a/pkg/blockchain/evm/confirmation_gate.go b/pkg/blockchain/evm/confirmation_gate.go index 166ef727e..9226c7347 100644 --- a/pkg/blockchain/evm/confirmation_gate.go +++ b/pkg/blockchain/evm/confirmation_gate.go @@ -66,8 +66,10 @@ type ConfirmationGate struct { forwardedSet map[forwardedKey]time.Time // key -> forwardedAt forwardedQueue []forwardedExpiry // FIFO of (key, forwardedAt) for O(1) eviction - kick chan struct{} // buffered 1; non-blocking sends - timer *time.Timer // created in Start(ctx) + kick chan struct{} // buffered 1; non-blocking sends + timer *time.Timer // created in Start(ctx) + fatalCh chan error // buffered 1; first handler error wins; non-blocking send + done chan struct{} // closed once on fatal; gates run's select so the goroutine exits } // NewConfirmationGate creates a ConfirmationGate that holds events for delay before @@ -92,9 +94,20 @@ func NewConfirmationGate( forwardedSet: make(map[forwardedKey]time.Time), forwardedQueue: nil, kick: make(chan struct{}, 1), + fatalCh: make(chan error, 1), + done: make(chan struct{}), }, nil } +// FatalErrors returns a read-only channel that receives the first handler error +// encountered after the confirmation delay. The channel is buffered (size 1); +// only the first error is delivered. When the channel fires, the gate's drain +// goroutine has already stopped forwarding. 
The listener should unsubscribe and +// return the error to trigger process restart and DB-cursor replay. +func (g *ConfirmationGate) FatalErrors() <-chan error { + return g.fatalCh +} + // Start begins the background goroutine that forwards matured entries to the // downstream handler. The timer is created here (tied to the goroutine's lifecycle) // and stopped on shutdown. The goroutine exits when ctx is cancelled. @@ -186,6 +199,8 @@ func (g *ConfirmationGate) run(ctx context.Context) { select { case <-ctx.Done(): return + case <-g.done: + return case <-g.kick: case <-g.timer.C: } @@ -233,10 +248,20 @@ func (g *ConfirmationGate) drainAndReschedule() { evCtx := log.SetContextLogger(context.Background(), g.logger) if err := g.handler(evCtx, entry.log); err != nil { - g.logger.Error("handler error after confirmation delay", + g.logger.Error("handler error after confirmation delay, signalling fatal", "error", err, "chainID", g.chainID, ) + select { + case g.fatalCh <- err: + default: + } + // Close done to signal the run goroutine to exit immediately. + // This is safe: only this fatal branch closes done, and it runs at most + // once — once done is closed the run loop exits and drainAndReschedule + // is no longer called. + close(g.done) + return } g.mu.Lock() diff --git a/pkg/blockchain/evm/confirmation_gate_test.go b/pkg/blockchain/evm/confirmation_gate_test.go index cefffc830..bbaea8750 100644 --- a/pkg/blockchain/evm/confirmation_gate_test.go +++ b/pkg/blockchain/evm/confirmation_gate_test.go @@ -2,6 +2,7 @@ package evm import ( "context" + "errors" "sync" "sync/atomic" "testing" @@ -786,3 +787,64 @@ func TestConfirmationGate_ShutdownWithNonEmptyQueue(t *testing.T) { default: } } + +// TestConfirmationGate_HandlerErrorPropagatesFatal: when the downstream handler +// returns an error after the confirmation delay, the gate signals the fatal channel +// exactly once and the run goroutine exits. 
A second FatalErrors() receive must +// block (buffer size == 1; only the first error is delivered). After the goroutine +// exits, a second event that matures must NOT invoke the handler again. +func TestConfirmationGate_HandlerErrorPropagatesFatal(t *testing.T) { + t.Parallel() + + sentinelErr := errors.New("handler sentinel error") + var handlerCalls atomic.Int64 + handler := func(_ context.Context, _ types.Log) error { + handlerCalls.Add(1) + return sentinelErr + } + + delay := 50 * time.Millisecond + g := newGate(t, delay, handler) + g.Start(t.Context()) + + tx := common.HexToHash("0xF1") + bh := common.HexToHash("0xF1") + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bh, 0, false))) + + // The fatal channel must receive the sentinel error within delay + generous timeout. + select { + case err := <-g.FatalErrors(): + assert.Equal(t, sentinelErr, err, "fatal channel must carry the sentinel error") + case <-time.After(delay + 200*time.Millisecond): + t.Fatal("fatal channel did not receive an error within timeout") + } + + // A second receive must block immediately — only one error per gate-lifetime. + select { + case extra := <-g.FatalErrors(): + t.Fatalf("unexpected second value on fatal channel: %v", extra) + default: + // correct: channel is empty after the first drain + } + + // Give the run goroutine a moment to exit via <-g.done. + time.Sleep(50 * time.Millisecond) + + // Enqueue a second event after the failure. The goroutine has exited, so even + // if the kick is queued in the buffered channel it will never be drained. + tx2 := common.HexToHash("0xF2") + bh2 := common.HexToHash("0xF2") + require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx2, bh2, 0, false))) + + // Wait well past the delay; the handler must NOT be called a second time. 
+ time.Sleep(delay + 100*time.Millisecond) + + assert.Equal(t, int64(1), handlerCalls.Load(), "handler must be invoked exactly once across gate lifetime") + + select { + case <-g.FatalErrors(): + t.Fatal("unexpected second fatal send after goroutine exited") + default: + // correct: no second fatal + } +} diff --git a/pkg/blockchain/evm/listener.go b/pkg/blockchain/evm/listener.go index 613e5b13b..f20121ab9 100644 --- a/pkg/blockchain/evm/listener.go +++ b/pkg/blockchain/evm/listener.go @@ -33,6 +33,7 @@ type Listener struct { handleEvent HandleEvent // live events and recent historical events; typically the ConfirmationGate handleHistoricalEvent HandleEvent // historical events older than confirmationDelay; typically the reactor directly eventGetter ContractEventGetter + handleEventFatalCh <-chan error // gate fatal-error channel; nil when no gate is in use (nil channel never selects) // Single-entry block-timestamp cache for ensureBlockTimestamp. The listener's // processEvents loop is strictly serial (Phase 1 drains before Phase 2, each @@ -57,7 +58,11 @@ type Listener struct { // // eventHandler is typically the ConfirmationGate; historicalEventHandler is typically // the reactor directly. The two handlers may be the same function when no gate is in use. -func NewListener(contractAddress common.Address, client EVMClient, blockchainID uint64, blockStep uint64, confirmationDelay time.Duration, logger log.Logger, eventHandler HandleEvent, historicalEventHandler HandleEvent, eventGetter ContractEventGetter) *Listener { +// +// eventHandlerFatalCh is the read-only fatal-error channel from the ConfirmationGate +// (gate.FatalErrors()). Pass nil when no gate is in use (confirmationDelay == 0); +// a nil channel never selects, so the fatal-error case is a no-op on the no-gate path. 
+func NewListener(contractAddress common.Address, client EVMClient, blockchainID uint64, blockStep uint64, confirmationDelay time.Duration, logger log.Logger, eventHandler HandleEvent, historicalEventHandler HandleEvent, eventGetter ContractEventGetter, eventHandlerFatalCh <-chan error) *Listener { return &Listener{ contractAddress: contractAddress, client: client, @@ -68,6 +73,7 @@ func NewListener(contractAddress common.Address, client EVMClient, blockchainID handleEvent: eventHandler, handleHistoricalEvent: historicalEventHandler, eventGetter: eventGetter, + handleEventFatalCh: eventHandlerFatalCh, } } @@ -308,6 +314,13 @@ func (l *Listener) processEvents( eventSubscription.Unsubscribe() return err } + case err := <-l.handleEventFatalCh: + l.logger.Error("downstream gate signalled fatal error, unsubscribing", + "error", err, + "blockchainID", l.blockchainID, + "contractAddress", l.contractAddress.String()) + eventSubscription.Unsubscribe() + return err case err := <-eventSubscription.Err(): if err != nil { l.logger.Error("event subscription error", "error", err, "blockchainID", l.blockchainID, "contractAddress", l.contractAddress.String()) @@ -362,6 +375,13 @@ func (l *Listener) processEvents( eventSubscription.Unsubscribe() return err } + case err := <-l.handleEventFatalCh: + l.logger.Error("downstream gate signalled fatal error, unsubscribing", + "error", err, + "blockchainID", l.blockchainID, + "contractAddress", l.contractAddress.String()) + eventSubscription.Unsubscribe() + return err case err := <-eventSubscription.Err(): if err != nil { l.logger.Error("event subscription error", "error", err, "blockchainID", l.blockchainID, "contractAddress", l.contractAddress.String()) diff --git a/pkg/blockchain/evm/listener_test.go b/pkg/blockchain/evm/listener_test.go index fb2da540a..a5a8feaf5 100644 --- a/pkg/blockchain/evm/listener_test.go +++ b/pkg/blockchain/evm/listener_test.go @@ -45,7 +45,7 @@ func TestNewListener(t *testing.T) { addr := 
common.HexToAddress("0x123") eventGetter := new(MockContractEventGetter) - l := NewListener(addr, mockClient, 1, 100, 0, logger, nil, nil, eventGetter) + l := NewListener(addr, mockClient, 1, 100, 0, logger, nil, nil, eventGetter, nil) require.NotNil(t, l) assert.Equal(t, addr, l.contractAddress) assert.Equal(t, uint64(1), l.blockchainID) @@ -73,7 +73,7 @@ func TestListener_Listen_CurrentEvents(t *testing.T) { return nil } - listener := NewListener(addr, mockClient, 1, 100, 0, logger, handleEvent, handleEvent, eventGetter) + listener := NewListener(addr, mockClient, 1, 100, 0, logger, handleEvent, handleEvent, eventGetter, nil) // Mock SubscribeFilterLogs sub := &MockSubscription{ @@ -110,7 +110,7 @@ func TestListener_ReconcileBlockRange(t *testing.T) { addr := common.HexToAddress("0x123") eventGetter := new(MockContractEventGetter) - listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter) + listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter, nil) // Setup FilterLogs mock // We expect a range fetch. start=100, step=10 -> end=110. current=120. @@ -187,7 +187,7 @@ func TestListener_Listen_HistoricalAndCurrent(t *testing.T) { return nil } - listener := NewListener(addr, mockClient, 1, 10, 0, logger, handleEvent, handleEvent, eventGetter) + listener := NewListener(addr, mockClient, 1, 10, 0, logger, handleEvent, handleEvent, eventGetter, nil) // findCommonAncestor: HeaderByNumber(100) returns the same header we hashed above, // so the stored hash matches and block 100 is confirmed canonical. @@ -234,7 +234,7 @@ func TestProcessEvents_DedupSkipsPresent(t *testing.T) { return nil } - listener := NewListener(addr, new(MockEVMClient), 1, 10, 0, logger, handleEvent, handleEvent, eventGetter) + listener := NewListener(addr, new(MockEVMClient), 1, 10, 0, logger, handleEvent, handleEvent, eventGetter, nil) // Historical: 3 events. First 2 are present (skipped), 3rd is not (handled). 
// After the 3rd, the check should stop — no IsContractEventPresent call for events 4+. @@ -287,7 +287,7 @@ func TestProcessEvents_SubscriptionErrorDuringPhase1(t *testing.T) { return nil } - listener := NewListener(addr, new(MockEVMClient), 1, 10, 0, logger, handleEvent, handleEvent, eventGetter) + listener := NewListener(addr, new(MockEVMClient), 1, 10, 0, logger, handleEvent, handleEvent, eventGetter, nil) // Historical channel with events that will block (not closed yet). BlockTimestamp // is set so ensureBlockTimestamp short-circuits. @@ -348,7 +348,7 @@ func TestListener_PhaseHandlerRouting(t *testing.T) { return nil } - listener := NewListener(addr, mockClient, 1, 10, confirmationDelay, logger, liveHandler, historicalHandler, eventGetter) + listener := NewListener(addr, mockClient, 1, 10, confirmationDelay, logger, liveHandler, historicalHandler, eventGetter, nil) // Old historical event (block timestamp 10 minutes ago) — should bypass the gate. oldHash := common.HexToHash("0xa1") @@ -437,7 +437,7 @@ func TestListener_PhaseHandlerRouting_DelayZero(t *testing.T) { return nil } - listener := NewListener(addr, mockClient, 1, 10, 0, logger, liveHandler, historicalHandler, eventGetter) + listener := NewListener(addr, mockClient, 1, 10, 0, logger, liveHandler, historicalHandler, eventGetter, nil) // BlockTimestamp populated by the upstream RPC — ensureBlockTimestamp short-circuits // and routeHistoricalEvent routes directly to historicalHandler because delay == 0. @@ -484,7 +484,7 @@ func TestListener_RemovedLog_ForwardedToHandler(t *testing.T) { return nil } - listener := NewListener(addr, new(MockEVMClient), 1, 10, 0, logger, handleEvent, handleEvent, eventGetter) + listener := NewListener(addr, new(MockEVMClient), 1, 10, 0, logger, handleEvent, handleEvent, eventGetter, nil) // No historical events. 
historicalCh := make(chan types.Log) @@ -542,7 +542,7 @@ func TestReconcileBlockRange_ContextCancellation(t *testing.T) { addr := common.HexToAddress("0x123") eventGetter := new(MockContractEventGetter) - listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter) + listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter, nil) ctx, cancel := context.WithCancel(context.Background()) @@ -582,7 +582,7 @@ func TestEnsureBlockTimestamp_Populated(t *testing.T) { addr := common.HexToAddress("0x123") eventGetter := new(MockContractEventGetter) - listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter) + listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter, nil) originalTs := uint64(1700000000) eventLog := types.Log{ @@ -607,7 +607,7 @@ func TestEnsureBlockTimestamp_Fetch(t *testing.T) { addr := common.HexToAddress("0x123") eventGetter := new(MockContractEventGetter) - listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter) + listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter, nil) bh := common.HexToHash("0xabc") headerTime := uint64(1700000000) @@ -632,7 +632,7 @@ func TestEnsureBlockTimestamp_CacheHit(t *testing.T) { addr := common.HexToAddress("0x123") eventGetter := new(MockContractEventGetter) - listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter) + listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter, nil) bh := common.HexToHash("0xabc") headerTime := uint64(1700000000) @@ -666,7 +666,7 @@ func TestEnsureBlockTimestamp_FetchError(t *testing.T) { addr := common.HexToAddress("0x123") eventGetter := new(MockContractEventGetter) - listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter) + listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter, nil) bh := common.HexToHash("0xabc") 
mockClient.On("HeaderByHash", mock.Anything, bh).Return(nil, fmt.Errorf("rpc failure")).Once() From 4e7b954c74def326876e1a6d251e79c9a7b1f75a Mon Sep 17 00:00:00 2001 From: nksazonov Date: Thu, 11 Jun 2026 16:26:34 +0200 Subject: [PATCH 18/23] refactor(nitronode/store): collapse IsContractEventPresent into IsContractEventProcessed --- nitronode/store/database/contract_event.go | 15 -------- .../store/database/contract_event_test.go | 18 +++------ nitronode/store/database/interface.go | 3 -- pkg/blockchain/evm/interface.go | 5 ++- pkg/blockchain/evm/listener.go | 18 ++++++--- pkg/blockchain/evm/listener_test.go | 38 ++++++++++--------- pkg/blockchain/evm/mock_test.go | 4 +- 7 files changed, 44 insertions(+), 57 deletions(-) diff --git a/nitronode/store/database/contract_event.go b/nitronode/store/database/contract_event.go index d0c634284..5c6b8fa64 100644 --- a/nitronode/store/database/contract_event.go +++ b/nitronode/store/database/contract_event.go @@ -106,18 +106,3 @@ func (s *DBStore) GetPreviousDistinctBlockHash(contractAddress string, blockchai } return ev.BlockNumber, ev.BlockHash, nil } - -// IsContractEventPresent checks whether a specific contract event has already been stored. -func (s *DBStore) IsContractEventPresent(blockchainID, blockNumber uint64, txHash string, logIndex uint32) (bool, error) { - var ev ContractEvent - err := s.db.Where("blockchain_id = ? AND block_number = ? AND transaction_hash = ? AND log_index = ?", - blockchainID, blockNumber, strings.ToLower(txHash), logIndex). 
- Take(&ev).Error - if errors.Is(err, gorm.ErrRecordNotFound) { - return false, nil - } - if err != nil { - return false, err - } - return true, nil -} diff --git a/nitronode/store/database/contract_event_test.go b/nitronode/store/database/contract_event_test.go index 11a28ab02..1e14e5127 100644 --- a/nitronode/store/database/contract_event_test.go +++ b/nitronode/store/database/contract_event_test.go @@ -82,7 +82,7 @@ func TestGetLatestContractEventBlockNumber(t *testing.T) { }) } -func TestIsContractEventPresent(t *testing.T) { +func TestIsContractEventProcessed(t *testing.T) { db, cleanup := SetupTestDB(t) defer cleanup() @@ -100,38 +100,32 @@ func TestIsContractEventPresent(t *testing.T) { require.NoError(t, store.StoreContractEvent(ev)) t.Run("existing event returns true", func(t *testing.T) { - present, err := store.IsContractEventPresent(1, 500, ev.TransactionHash, 3) + present, err := store.IsContractEventProcessed(ev.TransactionHash, 3, 1) require.NoError(t, err) assert.True(t, present) }) t.Run("case-insensitive txHash match", func(t *testing.T) { // Query with uppercase — stored value was lowercased by StoreContractEvent - present, err := store.IsContractEventPresent(1, 500, "0xABCDEF1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF1234567890", 3) + present, err := store.IsContractEventProcessed("0xABCDEF1234567890ABCDEF1234567890ABCDEF1234567890ABCDEF1234567890", 3, 1) require.NoError(t, err) assert.True(t, present) }) - t.Run("wrong block number returns false", func(t *testing.T) { - present, err := store.IsContractEventPresent(1, 501, ev.TransactionHash, 3) - require.NoError(t, err) - assert.False(t, present) - }) - t.Run("wrong log index returns false", func(t *testing.T) { - present, err := store.IsContractEventPresent(1, 500, ev.TransactionHash, 4) + present, err := store.IsContractEventProcessed(ev.TransactionHash, 4, 1) require.NoError(t, err) assert.False(t, present) }) t.Run("wrong blockchain returns false", func(t *testing.T) { - present, err 
:= store.IsContractEventPresent(2, 500, ev.TransactionHash, 3) + present, err := store.IsContractEventProcessed(ev.TransactionHash, 3, 2) require.NoError(t, err) assert.False(t, present) }) t.Run("wrong txHash returns false", func(t *testing.T) { - present, err := store.IsContractEventPresent(1, 500, "0x0000000000000000000000000000000000000000000000000000000000000000", 3) + present, err := store.IsContractEventProcessed("0x0000000000000000000000000000000000000000000000000000000000000000", 3, 1) require.NoError(t, err) assert.False(t, present) }) diff --git a/nitronode/store/database/interface.go b/nitronode/store/database/interface.go index 703dd112c..b832e1c05 100644 --- a/nitronode/store/database/interface.go +++ b/nitronode/store/database/interface.go @@ -294,9 +294,6 @@ type DatabaseStore interface { // GetLatestContractEventBlockNumber returns the highest block number for a given contract. GetLatestContractEventBlockNumber(contractAddress string, blockchainID uint64) (lastBlock uint64, err error) - // IsContractEventPresent checks if a specific contract event has already been stored. - IsContractEventPresent(blockchainID, blockNumber uint64, txHash string, logIndex uint32) (isPresent bool, err error) - // IsContractEventProcessed reports whether an event identified by (txHash, logIndex, blockchainID) // has already been committed, regardless of which block it appeared in. // NOTE: uses block-level logIndex — does not detect reorged events where the same tx diff --git a/pkg/blockchain/evm/interface.go b/pkg/blockchain/evm/interface.go index ceadec511..936107393 100644 --- a/pkg/blockchain/evm/interface.go +++ b/pkg/blockchain/evm/interface.go @@ -16,8 +16,9 @@ type HandleEvent func(ctx context.Context, eventLog types.Log) error type ContractEventGetter interface { // GetLatestContractEventBlockNumber returns the block to resume from (0 = start fresh). 
GetLatestContractEventBlockNumber(contractAddress string, blockchainID uint64) (lastBlock uint64, err error) - // IsContractEventPresent checks whether a specific event was already processed. - IsContractEventPresent(blockchainID, blockNumber uint64, txHash string, logIndex uint32) (isPresent bool, err error) + // IsContractEventProcessed reports whether an event identified by (txHash, logIndex, blockchainID) + // has already been committed, regardless of which block it appeared in. + IsContractEventProcessed(txHash string, logIndex uint32, blockchainID uint64) (bool, error) // GetLatestContractEventBlockHashAndNumber returns the block_number and block_hash of // the highest stored event. Returns (0, "", nil) when no rows exist. GetLatestContractEventBlockHashAndNumber(contractAddress string, blockchainID uint64) (blockNumber uint64, blockHash string, err error) diff --git a/pkg/blockchain/evm/listener.go b/pkg/blockchain/evm/listener.go index f20121ab9..8fb77bc21 100644 --- a/pkg/blockchain/evm/listener.go +++ b/pkg/blockchain/evm/listener.go @@ -215,10 +215,15 @@ func (l *Listener) listenEvents(ctx context.Context) error { // processEvents runs two sequential phases: historical (historicalCh until closed), // then live (currentCh until ctx or subscription death). In each phase the first -// events are checked via IsContractEventPresent; once a non-present event is found +// events are checked via IsContractEventProcessed; once a non-present event is found // the check is skipped for the rest of that phase (events are strictly ordered). // Returns nil on subscription loss (reconnect), non-nil on handler/check failure. // +// Both the listener (here) and the reactor (channel_hub_reactor.go) call +// IsContractEventProcessed, so both share a dependency on DB availability. A +// transient Postgres hiccup at either call site surfaces the error, unsubscribes, +// and restarts the process — consistent behavior across the pipeline. 
+// // Listener ordering & idempotency invariant // ----------------------------------------- // Downstream handlers (and any code reasoning about the relative arrival order @@ -233,10 +238,13 @@ func (l *Listener) listenEvents(ctx context.Context) error { // reconcileBlockRange + live subscription preserve chain order within each // phase. // -// 2. Idempotent resume. On restart, IsContractEventPresent gates the first +// 2. Idempotent resume. On restart, IsContractEventProcessed gates the first // event of each phase: events already persisted in a prior run are skipped // rather than reprocessed. Once a non-present event is seen the check is // dropped for the remainder of the phase (safe because of guarantee 1). +// The dedup check identifies events by (txHash, logIndex, blockchainID); +// reorged events with a re-shuffled block-level log index are not detected +// here and rely on reactor business-logic idempotency. // // 3. Cursor advances only on handler success. lastBlock is updated on each // live event, but a non-nil return from handleEvent unsubscribes and @@ -249,7 +257,7 @@ func (l *Listener) listenEvents(ctx context.Context) error { // gate can cancel any pending confirmation timer for that event. The // reactor never sees Removed=true logs directly; the gate filters them // before forwarding confirmed events. The lastBlock cursor and -// IsContractEventPresent dedup check are skipped for Removed=true events +// IsContractEventProcessed dedup check are skipped for Removed=true events // so neither the resume cursor nor the idempotency guard is corrupted // by a reorg signal. 
// @@ -282,7 +290,7 @@ func (l *Listener) processEvents( break } if !historicalCheckDone { - present, err := l.eventGetter.IsContractEventPresent(l.blockchainID, eventLog.BlockNumber, eventLog.TxHash.Hex(), uint32(eventLog.Index)) + present, err := l.eventGetter.IsContractEventProcessed(eventLog.TxHash.Hex(), uint32(eventLog.Index), l.blockchainID) if err != nil { eventSubscription.Unsubscribe() return fmt.Errorf("failed to check historical event presence: %w", err) @@ -344,7 +352,7 @@ func (l *Listener) processEvents( if !eventLog.Removed { *lastBlock = eventLog.BlockNumber if !currentCheckDone { - present, err := l.eventGetter.IsContractEventPresent(l.blockchainID, eventLog.BlockNumber, eventLog.TxHash.Hex(), uint32(eventLog.Index)) + present, err := l.eventGetter.IsContractEventProcessed(eventLog.TxHash.Hex(), uint32(eventLog.Index), l.blockchainID) if err != nil { eventSubscription.Unsubscribe() return fmt.Errorf("failed to check current event presence: %w", err) diff --git a/pkg/blockchain/evm/listener_test.go b/pkg/blockchain/evm/listener_test.go index a5a8feaf5..3f3a1a3a8 100644 --- a/pkg/blockchain/evm/listener_test.go +++ b/pkg/blockchain/evm/listener_test.go @@ -90,8 +90,8 @@ func TestListener_Listen_CurrentEvents(t *testing.T) { }). 
Return(sub, nil) - // The first current event will trigger IsContractEventPresent check - eventGetter.On("IsContractEventPresent", uint64(1), uint64(10), mock.Anything, uint32(1)).Return(false, nil) + // The first current event will trigger IsContractEventProcessed check + eventGetter.On("IsContractEventProcessed", mock.Anything, uint32(1), uint64(1)).Return(false, nil) go listener.Listen(ctx, func(err error) {}) @@ -163,9 +163,9 @@ func TestListener_Listen_HistoricalAndCurrent(t *testing.T) { eventGetter := new(MockContractEventGetter) eventGetter.On("GetLatestContractEventBlockHashAndNumber", addr.String(), uint64(1)).Return(uint64(100), blockHash100.Hex(), nil) // Historical event at block 105 is not present - eventGetter.On("IsContractEventPresent", uint64(1), uint64(105), mock.Anything, uint32(0)).Return(false, nil) + eventGetter.On("IsContractEventProcessed", mock.Anything, uint32(0), uint64(1)).Return(false, nil) // Current event at block 111 — after historical is done, first current event triggers check - eventGetter.On("IsContractEventPresent", uint64(1), uint64(111), mock.Anything, uint32(0)).Return(false, nil) + eventGetter.On("IsContractEventProcessed", mock.Anything, uint32(0), uint64(1)).Return(false, nil) ctx, cancel := context.WithCancel(context.Background()) t.Cleanup(cancel) @@ -237,7 +237,7 @@ func TestProcessEvents_DedupSkipsPresent(t *testing.T) { listener := NewListener(addr, new(MockEVMClient), 1, 10, 0, logger, handleEvent, handleEvent, eventGetter, nil) // Historical: 3 events. First 2 are present (skipped), 3rd is not (handled). - // After the 3rd, the check should stop — no IsContractEventPresent call for events 4+. + // After the 3rd, the check should stop — no IsContractEventProcessed call for events 4+. // BlockTimestamp is set so ensureBlockTimestamp short-circuits. 
ts := uint64(time.Now().Unix()) historicalCh := make(chan types.Log, 5) @@ -249,9 +249,9 @@ func TestProcessEvents_DedupSkipsPresent(t *testing.T) { close(historicalCh) // First two are present, third is not - eventGetter.On("IsContractEventPresent", uint64(1), uint64(100), mock.Anything, uint32(0)).Return(true, nil).Once() - eventGetter.On("IsContractEventPresent", uint64(1), uint64(101), mock.Anything, uint32(0)).Return(true, nil).Once() - eventGetter.On("IsContractEventPresent", uint64(1), uint64(102), mock.Anything, uint32(0)).Return(false, nil).Once() + eventGetter.On("IsContractEventProcessed", mock.Anything, uint32(0), uint64(1)).Return(true, nil).Once() + eventGetter.On("IsContractEventProcessed", mock.Anything, uint32(0), uint64(1)).Return(true, nil).Once() + eventGetter.On("IsContractEventProcessed", mock.Anything, uint32(0), uint64(1)).Return(false, nil).Once() // No mock for 103, 104 — if called, mock will panic, proving the check stopped sub := &MockSubscription{errChan: make(chan error)} @@ -294,7 +294,7 @@ func TestProcessEvents_SubscriptionErrorDuringPhase1(t *testing.T) { historicalCh := make(chan types.Log, 2) historicalCh <- types.Log{BlockNumber: 100, Index: 0, TxHash: common.HexToHash("0xaa"), BlockTimestamp: uint64(time.Now().Unix())} - eventGetter.On("IsContractEventPresent", uint64(1), uint64(100), mock.Anything, uint32(0)).Return(false, nil) + eventGetter.On("IsContractEventProcessed", mock.Anything, uint32(0), uint64(1)).Return(false, nil) // Subscription that will error shortly subErrCh := make(chan error, 1) @@ -321,6 +321,8 @@ func TestProcessEvents_SubscriptionErrorDuringPhase1(t *testing.T) { // - Historical events younger than confirmationDelay → handleEvent (through gate; still in reorg window) // - Live (Phase 2) events → handleEvent (always) // - HeaderByHash fetch failures → handleEvent (conservative fallback) +// +// See nitronode/docs/reorg-fix.md §4.4 step 5. 
func TestListener_PhaseHandlerRouting(t *testing.T) { t.Parallel() logger := log.NewNoopLogger() @@ -381,10 +383,10 @@ func TestListener_PhaseHandlerRouting(t *testing.T) { currentCh := make(chan types.Log, 1) currentCh <- currentLog - // Only the first historical event triggers IsContractEventPresent (then the check is dropped for the phase); + // Only the first historical event triggers IsContractEventProcessed (then the check is dropped for the phase); // the first live event triggers it again for Phase 2. - eventGetter.On("IsContractEventPresent", uint64(1), uint64(100), mock.Anything, uint32(0)).Return(false, nil).Once() - eventGetter.On("IsContractEventPresent", uint64(1), uint64(200), mock.Anything, uint32(0)).Return(false, nil).Once() + eventGetter.On("IsContractEventProcessed", mock.Anything, uint32(0), uint64(1)).Return(false, nil).Once() + eventGetter.On("IsContractEventProcessed", mock.Anything, uint32(0), uint64(1)).Return(false, nil).Once() sub := &MockSubscription{errChan: make(chan error, 1), unsub: func() {}} @@ -447,7 +449,7 @@ func TestListener_PhaseHandlerRouting_DelayZero(t *testing.T) { close(historicalCh) currentCh := make(chan types.Log) - eventGetter.On("IsContractEventPresent", uint64(1), uint64(100), mock.Anything, uint32(0)).Return(false, nil).Once() + eventGetter.On("IsContractEventProcessed", mock.Anything, uint32(0), uint64(1)).Return(false, nil).Once() sub := &MockSubscription{errChan: make(chan error, 1), unsub: func() {}} @@ -492,14 +494,14 @@ func TestListener_RemovedLog_ForwardedToHandler(t *testing.T) { currentCh := make(chan types.Log, 2) - // Event 1: non-Removed at block 10 — triggers IsContractEventPresent check, + // Event 1: non-Removed at block 10 — triggers IsContractEventProcessed check, // advances lastBlock, sets currentCheckDone = true. BlockTimestamp is set so // ensureBlockTimestamp short-circuits. 
normalLog := types.Log{BlockNumber: 10, Index: 0, TxHash: common.HexToHash("0xabc"), BlockTimestamp: uint64(time.Now().Unix())} - eventGetter.On("IsContractEventPresent", uint64(1), uint64(10), mock.Anything, uint32(0)).Return(false, nil).Once() + eventGetter.On("IsContractEventProcessed", mock.Anything, uint32(0), uint64(1)).Return(false, nil).Once() // Event 2: Removed=true at block 11 — must NOT advance lastBlock, must NOT call - // IsContractEventPresent, but MUST reach handleEvent. + // IsContractEventProcessed, but MUST reach handleEvent. removedLog := types.Log{BlockNumber: 11, Index: 0, TxHash: common.HexToHash("0xdef"), Removed: true} currentCh <- normalLog @@ -530,8 +532,8 @@ func TestListener_RemovedLog_ForwardedToHandler(t *testing.T) { // lastBlock must NOT have advanced past the normal event's block. assert.Equal(t, uint64(10), lastBlock, "lastBlock must not be advanced by a Removed=true event") - // IsContractEventPresent must have been called exactly once (for the normal log only). - eventGetter.AssertNumberOfCalls(t, "IsContractEventPresent", 1) + // IsContractEventProcessed must have been called exactly once (for the normal log only). 
+ eventGetter.AssertNumberOfCalls(t, "IsContractEventProcessed", 1) eventGetter.AssertExpectations(t) } diff --git a/pkg/blockchain/evm/mock_test.go b/pkg/blockchain/evm/mock_test.go index e562a0b55..b81388a8f 100644 --- a/pkg/blockchain/evm/mock_test.go +++ b/pkg/blockchain/evm/mock_test.go @@ -138,8 +138,8 @@ func (m *MockContractEventGetter) GetLatestContractEventBlockNumber(contractAddr return args.Get(0).(uint64), args.Error(1) } -func (m *MockContractEventGetter) IsContractEventPresent(blockchainID, blockNumber uint64, txHash string, logIndex uint32) (bool, error) { - args := m.Called(blockchainID, blockNumber, txHash, logIndex) +func (m *MockContractEventGetter) IsContractEventProcessed(txHash string, logIndex uint32, blockchainID uint64) (bool, error) { + args := m.Called(txHash, logIndex, blockchainID) return args.Bool(0), args.Error(1) } From 3512884bad9d07946c89d8dc2768feb4cd36e401 Mon Sep 17 00:00:00 2001 From: nksazonov Date: Thu, 11 Jun 2026 16:35:28 +0200 Subject: [PATCH 19/23] fix(nitronode/reactor): return pre-check error and unify onEventProcessed via defer --- nitronode/api/rpc_router.go | 6 +-- nitronode/docs/reorg-fix.md | 4 +- pkg/blockchain/evm/channel_hub_reactor.go | 20 ++++----- .../evm/channel_hub_reactor_test.go | 41 ++++++++++++++----- 4 files changed, 46 insertions(+), 25 deletions(-) diff --git a/nitronode/api/rpc_router.go b/nitronode/api/rpc_router.go index be748051e..69822d9c5 100644 --- a/nitronode/api/rpc_router.go +++ b/nitronode/api/rpc_router.go @@ -23,9 +23,9 @@ type RPCRouter struct { } type RPCRouterConfig struct { - NodeVersion string - MinChallenge uint32 - MaxChallenge uint32 + NodeVersion string + MinChallenge uint32 + MaxChallenge uint32 MaxParticipants int MaxSessionDataLen int MaxSessionKeyIDs int diff --git a/nitronode/docs/reorg-fix.md b/nitronode/docs/reorg-fix.md index 600c14d44..b9937a3ef 100644 --- a/nitronode/docs/reorg-fix.md +++ b/nitronode/docs/reorg-fix.md @@ -135,7 +135,7 @@ On startup, for each chain, 
after the `block_hash` migration has been applied: The `Listener` accepts two handlers (`eventHandler` for live events and recent historical events, `historicalEventHandler` for mature historical events) and makes the per-event routing decision from `eventLog.BlockTimestamp`. To guarantee that field is populated regardless of the RPC provider's behavior, the listener calls `ensureBlockTimestamp` once per event, which uses `eventLog.BlockTimestamp` when present and falls back to `HeaderByHash` otherwise (at most one fetch per block regardless of event count). When `confirmation_delay_sec` is `0` the gate is disabled and every historical event is routed to `historicalEventHandler`. On an `ensureBlockTimestamp` failure the Listener falls back to `eventHandler` (the gate) — the conservative choice that preserves the reorg-protection invariant at the cost of a small delay. -6. The reactor is idempotent for replayed events: `HandleHomeChannelCreated` has an explicit early-return guard when the channel is already open; `HandleHomeChannelCheckpointed` and `RefreshUserEnforcedBalance` use set-semantics (not accumulation) and recompute from the latest DB state. `StoreContractEvent` is called last inside the DB transaction and enforces a unique constraint on `(transaction_hash, log_index, blockchain_id)`. If a duplicate is inserted, Postgres returns a constraint-violation error, causing the entire transaction (including all state mutations in the same `useStoreInTx` call) to roll back. The reactor therefore cannot double-apply state changes for an event it has already committed. +6. The reactor is idempotent for replayed events: `HandleHomeChannelCreated` has an explicit early-return guard when the channel is already open; `HandleHomeChannelCheckpointed` and `RefreshUserEnforcedBalance` use set-semantics (not accumulation) and recompute from the latest DB state. 
Before opening a transaction, `HandleEvent` calls `IsContractEventProcessed`; if the event is already committed, it returns `nil` immediately with no DB transaction opened. If `IsContractEventProcessed` returns an error, `HandleEvent` returns the wrapped error; the listener unsubscribes and the process restarts (per the fatal-channel in §6.8), re-fetching the same range via the DB cursor so the pre-check retries. For events that pass the pre-check, `StoreContractEvent` is called last inside the DB transaction and enforces a unique constraint on `(transaction_hash, log_index, blockchain_id)` as a final backstop. 7. Historical log queries (`eth_getLogs`) return only canonical chain events — there are no `Removed: true` signals during replay, and replay does not flow through the gate (step 5). Removal signals from the live WebSocket subscription that arrive during the replay phase are buffered in the listener's `currentCh` and reach the gate only after the historical replay phase completes; if they cancel a re-mined event that has already been forwarded by the live path, the post-gate reorg detection in §6.5 logs them. --- @@ -345,7 +345,7 @@ Add a new method to `ChannelHubReactorStore`: IsContractEventProcessed(txHash string, logIndex uint, blockchainID uint64) (bool, error) ``` -At the top of `HandleEvent`, before entering `useStoreInTx`, call this method. If the event is already committed, log at **`INFO`** ("skipping re-delivered event, already committed") and return `nil` immediately. No transaction is opened; no state is touched. +At the top of `HandleEvent`, before entering `useStoreInTx`, call this method. If the event is already committed, log at **`INFO`** ("skipping re-delivered event, already committed") and return `nil` immediately. No transaction is opened; no state is touched. 
If `IsContractEventProcessed` itself returns an error, `HandleEvent` returns the wrapped error immediately; the listener unsubscribes and the process restarts (per the fatal-channel in §6.8). On restart, the DB cursor re-fetches the same range and the pre-check retries. Reorged events that pass through this check are still neutralized by the reactor's **business-logic idempotency**: diff --git a/pkg/blockchain/evm/channel_hub_reactor.go b/pkg/blockchain/evm/channel_hub_reactor.go index 40f23ce69..2ff230a85 100644 --- a/pkg/blockchain/evm/channel_hub_reactor.go +++ b/pkg/blockchain/evm/channel_hub_reactor.go @@ -175,7 +175,13 @@ func (r *ChannelHubReactor) SetOnEventProcessed(fn func(blockchainID uint64, suc r.onEventProcessed = fn } -func (r *ChannelHubReactor) HandleEvent(ctx context.Context, l types.Log) error { +func (r *ChannelHubReactor) HandleEvent(ctx context.Context, l types.Log) (err error) { + defer func() { + if r.onEventProcessed != nil { + r.onEventProcessed(r.blockchainID, err == nil) + } + }() + logger := log.FromContext(ctx) eventID := l.Topics[0] @@ -193,14 +199,11 @@ func (r *ChannelHubReactor) HandleEvent(ctx context.Context, l types.Log) error // they are handled by the reactor's business-logic idempotency (see nitronode/docs/reorg-fix.md §6.6). 
processed, err := r.store.IsContractEventProcessed(l.TxHash.String(), uint32(l.Index), r.blockchainID) if err != nil { - logger.Warn("failed to check if contract event was already processed, proceeding", - "error", err, "txHash", l.TxHash.String(), "logIndex", l.Index, "chainID", r.blockchainID) - } else if processed { + return errors.Wrap(err, "pre-check IsContractEventProcessed failed") + } + if processed { logger.Info("skipping re-delivered event, already committed", "event", eventName, "txHash", l.TxHash.String(), "logIndex", l.Index, "chainID", r.blockchainID) - if r.onEventProcessed != nil { - r.onEventProcessed(r.blockchainID, true) - } return nil } @@ -280,9 +283,6 @@ func (r *ChannelHubReactor) HandleEvent(ctx context.Context, l types.Log) error logger.Info("processed event", "event", eventName, "blockNumber", l.BlockNumber, "txHash", l.TxHash.String(), "logIndex", l.Index) return nil }) - if r.onEventProcessed != nil { - r.onEventProcessed(r.blockchainID, err == nil) - } return err } diff --git a/pkg/blockchain/evm/channel_hub_reactor_test.go b/pkg/blockchain/evm/channel_hub_reactor_test.go index 1a5a26f31..6893bf232 100644 --- a/pkg/blockchain/evm/channel_hub_reactor_test.go +++ b/pkg/blockchain/evm/channel_hub_reactor_test.go @@ -1047,7 +1047,7 @@ func TestChannelHubReactor_HandleEscrowDepositsPurged(t *testing.T) { store.AssertExpectations(t) } -func TestChannelHubReactor_HandleEvent_PreCheckError(t *testing.T) { +func TestChannelHubReactor_HandleEvent_PreCheckError_ReturnsError(t *testing.T) { blockchainID := uint64(1) nodeAddr := "0x1111111111111111111111111111111111111111" tokenAddr := common.HexToAddress("0xA0b86991c6218b36c1d19D4a2e9Eb0cE3606eB48") @@ -1068,22 +1068,19 @@ func TestChannelHubReactor_HandleEvent_PreCheckError(t *testing.T) { handler := new(mockChannelHubEventHandler) assetStore := new(MockAssetStore) - // Pre-check returns an error — reactor must fall through and process normally. 
+ // Pre-check returns an error — reactor must return it immediately. store.On("IsContractEventProcessed", mock.Anything, mock.Anything, mock.Anything).Return(false, assert.AnError) - assetStore.On("GetTokenAsset", blockchainID, tokenAddr.String()).Return("usdc", nil) - assetStore.On("GetTokenDecimals", blockchainID, tokenAddr.String()).Return(uint8(6), nil) - handler.On("HandleNodeBalanceUpdated", mock.Anything, mock.Anything, mock.Anything).Return(nil) - expectStoreContractEvent(store, "NodeBalanceUpdated", 100, blockchainID) useStoreInTx := func(fn ChannelHubReactorStoreTxHandler) error { return fn(store) } reactor := NewChannelHubReactor(blockchainID, nodeAddr, handler, assetStore, useStoreInTx, store) err := reactor.HandleEvent(context.Background(), logEntry) - require.NoError(t, err) + require.Error(t, err) + require.ErrorContains(t, err, "pre-check IsContractEventProcessed failed") - // Business logic and StoreContractEvent must still be called. - handler.AssertCalled(t, "HandleNodeBalanceUpdated", mock.Anything, mock.Anything, mock.Anything) - store.AssertExpectations(t) + // Neither business logic nor StoreContractEvent should be called. + handler.AssertNotCalled(t, "HandleNodeBalanceUpdated", mock.Anything, mock.Anything, mock.Anything) + store.AssertNotCalled(t, "StoreContractEvent", mock.Anything) } func TestChannelHubReactor_HandleEvent_AlreadyProcessed(t *testing.T) { @@ -1196,4 +1193,28 @@ func TestChannelHubReactor_OnEventProcessedCallback(t *testing.T) { require.Error(t, err) assert.False(t, cbSuccess) }) + + t.Run("callback receives false on pre-check error", func(t *testing.T) { + store := new(mockChannelHubStore) + handler := new(mockChannelHubEventHandler) + assetStore := new(MockAssetStore) + + // Pre-check returns an error — deferred callback must still fire with success=false. 
+ store.On("IsContractEventProcessed", mock.Anything, mock.Anything, mock.Anything).Return(false, assert.AnError) + + useStoreInTx := func(fn ChannelHubReactorStoreTxHandler) error { return fn(store) } + reactor := NewChannelHubReactor(blockchainID, nodeAddr, handler, assetStore, useStoreInTx, store) + + var cbCalled bool + var cbSuccess bool + reactor.SetOnEventProcessed(func(_ uint64, success bool) { + cbCalled = true + cbSuccess = success + }) + + err := reactor.HandleEvent(context.Background(), logEntry) + require.Error(t, err) + assert.True(t, cbCalled, "callback must be invoked on pre-check error") + assert.False(t, cbSuccess) + }) } From fcc4f8e8cb4e079cf7f41c5c5ca083653abb500e Mon Sep 17 00:00:00 2001 From: nksazonov Date: Sun, 14 Jun 2026 12:47:10 +0200 Subject: [PATCH 20/23] refactor(nitronode/gate): replace fatal channel with closure-pattern lifecycle --- nitronode/docs/reorg-fix.md | 6 +- nitronode/main.go | 11 +-- pkg/blockchain/evm/confirmation_gate.go | 85 +++++++++++--------- pkg/blockchain/evm/confirmation_gate_test.go | 76 ++++++++--------- pkg/blockchain/evm/listener.go | 22 +---- pkg/blockchain/evm/listener_test.go | 28 +++---- 6 files changed, 103 insertions(+), 125 deletions(-) diff --git a/nitronode/docs/reorg-fix.md b/nitronode/docs/reorg-fix.md index b9937a3ef..6b069652f 100644 --- a/nitronode/docs/reorg-fix.md +++ b/nitronode/docs/reorg-fix.md @@ -135,7 +135,7 @@ On startup, for each chain, after the `block_hash` migration has been applied: The `Listener` accepts two handlers (`eventHandler` for live events and recent historical events, `historicalEventHandler` for mature historical events) and makes the per-event routing decision from `eventLog.BlockTimestamp`. 
To guarantee that field is populated regardless of the RPC provider's behavior, the listener calls `ensureBlockTimestamp` once per event, which uses `eventLog.BlockTimestamp` when present and falls back to `HeaderByHash` otherwise (at most one fetch per block regardless of event count). When `confirmation_delay_sec` is `0` the gate is disabled and every historical event is routed to `historicalEventHandler`. On an `ensureBlockTimestamp` failure the Listener falls back to `eventHandler` (the gate) — the conservative choice that preserves the reorg-protection invariant at the cost of a small delay. -6. The reactor is idempotent for replayed events: `HandleHomeChannelCreated` has an explicit early-return guard when the channel is already open; `HandleHomeChannelCheckpointed` and `RefreshUserEnforcedBalance` use set-semantics (not accumulation) and recompute from the latest DB state. Before opening a transaction, `HandleEvent` calls `IsContractEventProcessed`; if the event is already committed, it returns `nil` immediately with no DB transaction opened. If `IsContractEventProcessed` returns an error, `HandleEvent` returns the wrapped error; the listener unsubscribes and the process restarts (per the fatal-channel in §6.8), re-fetching the same range via the DB cursor so the pre-check retries. For events that pass the pre-check, `StoreContractEvent` is called last inside the DB transaction and enforces a unique constraint on `(transaction_hash, log_index, blockchain_id)` as a final backstop. +6. The reactor is idempotent for replayed events: `HandleHomeChannelCreated` has an explicit early-return guard when the channel is already open; `HandleHomeChannelCheckpointed` and `RefreshUserEnforcedBalance` use set-semantics (not accumulation) and recompute from the latest DB state. Before opening a transaction, `HandleEvent` calls `IsContractEventProcessed`; if the event is already committed, it returns `nil` immediately with no DB transaction opened. 
If `IsContractEventProcessed` returns an error, `HandleEvent` returns the wrapped error; the listener unsubscribes and the process restarts (per the lifecycle closure in §6.8), re-fetching the same range via the DB cursor so the pre-check retries. For events that pass the pre-check, `StoreContractEvent` is called last inside the DB transaction and enforces a unique constraint on `(transaction_hash, log_index, blockchain_id)` as a final backstop. 7. Historical log queries (`eth_getLogs`) return only canonical chain events — there are no `Removed: true` signals during replay, and replay does not flow through the gate (step 5). Removal signals from the live WebSocket subscription that arrive during the replay phase are buffered in the listener's `currentCh` and reach the gate only after the historical replay phase completes; if they cancel a re-mined event that has already been forwarded by the live path, the post-gate reorg detection in §6.5 logs them. --- @@ -345,7 +345,7 @@ Add a new method to `ChannelHubReactorStore`: IsContractEventProcessed(txHash string, logIndex uint, blockchainID uint64) (bool, error) ``` -At the top of `HandleEvent`, before entering `useStoreInTx`, call this method. If the event is already committed, log at **`INFO`** ("skipping re-delivered event, already committed") and return `nil` immediately. No transaction is opened; no state is touched. If `IsContractEventProcessed` itself returns an error, `HandleEvent` returns the wrapped error immediately; the listener unsubscribes and the process restarts (per the fatal-channel in §6.8). On restart, the DB cursor re-fetches the same range and the pre-check retries. +At the top of `HandleEvent`, before entering `useStoreInTx`, call this method. If the event is already committed, log at **`INFO`** ("skipping re-delivered event, already committed") and return `nil` immediately. No transaction is opened; no state is touched. 
If `IsContractEventProcessed` itself returns an error, `HandleEvent` returns the wrapped error immediately; the listener unsubscribes and the process restarts (per the lifecycle closure in §6.8). On restart, the DB cursor re-fetches the same range and the pre-check retries. Reorged events that pass through this check are still neutralized by the reactor's **business-logic idempotency**: @@ -407,4 +407,4 @@ On `HeaderByHash` failure the listener logs a WARN and forwards the event throug ### 6.8 Handler error semantics -When a downstream handler invoked after the confirmation delay returns an error, the gate sends the error on a buffered (size 1) fatal channel exposed via `FatalErrors()` and the goroutine exits without draining further pending entries. The listener receives the error in its Phase 2 (and Phase 1) select, unsubscribes, and returns the error to `Listen`'s closure, which in `nitronode/main.go` calls `logger.Fatal` → process exit. The supervisor restarts the process; the next `Listen` invocation re-fetches the unstored event via the DB cursor in `findCommonAncestor` + Phase 1 reconciliation, restoring the pre-PR crash-restart-replay invariant. The gate does **not** retry handler errors in-process; this is intentional and matches pre-PR behavior. Events queued behind the failed event are dropped on teardown and re-fetched after restart. +When a downstream handler invoked after the confirmation delay returns an error, the gate's `run` goroutine returns the error and the gate's lifecycle closure (passed to `Start`) is invoked with it. In `nitronode/main.go`, that closure calls `logger.Fatal` → process exit. The supervisor restarts the process; the next `Listen` invocation re-fetches the unstored event via the DB cursor in `findCommonAncestor` + Phase 1 reconciliation, restoring the pre-PR crash-restart-replay invariant. The gate does **not** retry handler errors in-process; this is intentional and matches pre-PR behavior. 
Events queued behind the failed event are dropped on teardown and re-fetched after restart. The gate's lifecycle (`Start(ctx, handleClosure)`) is identical to `Listener.Listen` and `BlockchainWorker.Start`; the listener does not know that its downstream handler may fail asynchronously — error propagation is handled by the supervisor (`main.go`), where it already is for the other two components. diff --git a/nitronode/main.go b/nitronode/main.go index fb933e306..64475f370 100644 --- a/nitronode/main.go +++ b/nitronode/main.go @@ -124,15 +124,17 @@ func main() { confirmationDelay := time.Duration(b.ConfirmationDelaySecs) * time.Second var liveHandler evm.HandleEvent - var fatalCh <-chan error if confirmationDelay > 0 { gate, err := evm.NewConfirmationGate(confirmationDelay, b.ID, reactor.HandleEvent, logger) if err != nil { logger.Fatal("failed to create confirmation gate", "error", err, "blockchainID", b.ID) } - gate.Start(blockchainCtx) + gate.Start(blockchainCtx, func(err error) { + if err != nil { + logger.Fatal("confirmation gate stopped", "error", err, "blockchainID", b.ID) + } + }) liveHandler = gate.HandleEvent - fatalCh = gate.FatalErrors() } else { liveHandler = reactor.HandleEvent } @@ -142,8 +144,7 @@ func main() { // based on block age: events older than confirmationDelay go directly to the reactor // (past the reorg window); recent events still flow through the live handler because // their blocks may still be reorged. - // fatalCh is nil when confirmationDelay == 0; a nil channel never selects. 
- l := evm.NewListener(common.HexToAddress(b.ChannelHubAddress), client, b.ID, b.BlockStep, confirmationDelay, logger, liveHandler, reactor.HandleEvent, bb.DbStore, fatalCh) + l := evm.NewListener(common.HexToAddress(b.ChannelHubAddress), client, b.ID, b.BlockStep, confirmationDelay, logger, liveHandler, reactor.HandleEvent, bb.DbStore) l.Listen(blockchainCtx, func(err error) { if err != nil { logger.Fatal("blockchain listener stopped", "error", err, "blockchainID", b.ID) diff --git a/pkg/blockchain/evm/confirmation_gate.go b/pkg/blockchain/evm/confirmation_gate.go index 9226c7347..1a829716a 100644 --- a/pkg/blockchain/evm/confirmation_gate.go +++ b/pkg/blockchain/evm/confirmation_gate.go @@ -66,10 +66,8 @@ type ConfirmationGate struct { forwardedSet map[forwardedKey]time.Time // key -> forwardedAt forwardedQueue []forwardedExpiry // FIFO of (key, forwardedAt) for O(1) eviction - kick chan struct{} // buffered 1; non-blocking sends - timer *time.Timer // created in Start(ctx) - fatalCh chan error // buffered 1; first handler error wins; non-blocking send - done chan struct{} // closed once on fatal; gates run's select so the goroutine exits + kick chan struct{} // buffered 1; non-blocking sends + timer *time.Timer // created in Start(ctx) } // NewConfirmationGate creates a ConfirmationGate that holds events for delay before @@ -94,29 +92,44 @@ func NewConfirmationGate( forwardedSet: make(map[forwardedKey]time.Time), forwardedQueue: nil, kick: make(chan struct{}, 1), - fatalCh: make(chan error, 1), - done: make(chan struct{}), }, nil } -// FatalErrors returns a read-only channel that receives the first handler error -// encountered after the confirmation delay. The channel is buffered (size 1); -// only the first error is delivered. When the channel fires, the gate's drain -// goroutine has already stopped forwarding. The listener should unsubscribe and -// return the error to trigger process restart and DB-cursor replay. 
-func (g *ConfirmationGate) FatalErrors() <-chan error { - return g.fatalCh -} - // Start begins the background goroutine that forwards matured entries to the -// downstream handler. The timer is created here (tied to the goroutine's lifecycle) -// and stopped on shutdown. The goroutine exits when ctx is cancelled. -func (g *ConfirmationGate) Start(ctx context.Context) { - g.timer = time.NewTimer(time.Hour) // arbitrary long initial; will be reset on first drain +// downstream handler. handleClosure is called exactly once after the goroutine +// exits; err is non-nil only when the downstream handler returned an error +// after the confirmation delay. The timer is created here (tied to the +// goroutine's lifecycle) and stopped on shutdown. +func (g *ConfirmationGate) Start(ctx context.Context, handleClosure func(err error)) { + g.timer = time.NewTimer(time.Hour) // arbitrary long initial; reset on first drain if !g.timer.Stop() { <-g.timer.C } - go g.run(ctx) + + childCtx, cancel := context.WithCancel(ctx) + wg := sync.WaitGroup{} + wg.Add(1) + + var closureErr error + var closureErrMu sync.Mutex + childHandleClosure := func(err error) { + closureErrMu.Lock() + defer closureErrMu.Unlock() + if err != nil && closureErr == nil { + closureErr = err + } + cancel() + wg.Done() + } + + go func() { childHandleClosure(g.run(childCtx)) }() + + go func() { + wg.Wait() + closureErrMu.Lock() + defer closureErrMu.Unlock() + handleClosure(closureErr) + }() } // HandleEvent is the entry point called by the upstream Listener for each event. @@ -192,26 +205,28 @@ func (g *ConfirmationGate) HandleEvent(_ context.Context, eventLog types.Log) er // run is the background goroutine that wakes on a kick, on the timer firing, or on // ctx cancellation. It forwards matured entries, evicts stale forwardedSet entries, -// and reschedules the timer for the next head deadline. -func (g *ConfirmationGate) run(ctx context.Context) { +// and reschedules the timer for the next head deadline. 
Returns a non-nil error if +// the downstream handler failed; returns nil on clean shutdown. +func (g *ConfirmationGate) run(ctx context.Context) error { defer g.timer.Stop() for { select { case <-ctx.Done(): - return - case <-g.done: - return + return nil case <-g.kick: case <-g.timer.C: } - g.drainAndReschedule() + if err := g.drainAndReschedule(); err != nil { + return err + } } } // drainAndReschedule forwards all queue entries whose confirmation delay has // elapsed, evicts forwardedSet entries older than (recentMultiplier × delay), -// and resets the timer to the next head deadline. -func (g *ConfirmationGate) drainAndReschedule() { +// and resets the timer to the next head deadline. Returns a non-nil error if the +// downstream handler failed; the caller (run) propagates it to the lifecycle closure. +func (g *ConfirmationGate) drainAndReschedule() error { g.mu.Lock() now := time.Now() @@ -248,20 +263,11 @@ func (g *ConfirmationGate) drainAndReschedule() { evCtx := log.SetContextLogger(context.Background(), g.logger) if err := g.handler(evCtx, entry.log); err != nil { - g.logger.Error("handler error after confirmation delay, signalling fatal", + g.logger.Error("handler error after confirmation delay, stopping gate", "error", err, "chainID", g.chainID, ) - select { - case g.fatalCh <- err: - default: - } - // Close done to signal the run goroutine to exit immediately. - // This is safe: only this fatal branch closes done, and it runs at most - // once — once done is closed the run loop exits and drainAndReschedule - // is no longer called. - close(g.done) - return + return err // mu already released before the handler call; no relock needed. } g.mu.Lock() @@ -295,4 +301,5 @@ func (g *ConfirmationGate) drainAndReschedule() { // else: leave the timer stopped; the next kick recomputes. 
g.mu.Unlock() + return nil } diff --git a/pkg/blockchain/evm/confirmation_gate_test.go b/pkg/blockchain/evm/confirmation_gate_test.go index bbaea8750..c95b7e6ce 100644 --- a/pkg/blockchain/evm/confirmation_gate_test.go +++ b/pkg/blockchain/evm/confirmation_gate_test.go @@ -91,7 +91,7 @@ func TestConfirmationGate_NormalPath(t *testing.T) { } g := newGate(t, 5*time.Millisecond, handler) - g.Start(t.Context()) + g.Start(t.Context(), func(error) {}) tx := common.HexToHash("0x02") bh := common.HexToHash("0xBB") @@ -132,7 +132,7 @@ func TestConfirmationGate_ReorgCancel(t *testing.T) { } g := newGate(t, 10*time.Millisecond, handler) - g.Start(t.Context()) + g.Start(t.Context(), func(error) {}) tx := common.HexToHash("0x03") bh := common.HexToHash("0xCC") @@ -157,7 +157,7 @@ func TestConfirmationGate_OutOfOrder(t *testing.T) { } g := newGate(t, 10*time.Millisecond, handler) - g.Start(t.Context()) + g.Start(t.Context(), func(error) {}) tx := common.HexToHash("0x04") bhOld := common.HexToHash("0xAA") @@ -198,7 +198,7 @@ func TestConfirmationGate_PostGateReorg(t *testing.T) { } g := newGate(t, 2*time.Millisecond, handler) - g.Start(t.Context()) + g.Start(t.Context(), func(error) {}) tx := common.HexToHash("0x05") bh := common.HexToHash("0xDD") @@ -238,7 +238,7 @@ func TestConfirmationGate_UnknownRemoval(t *testing.T) { } g := newGate(t, 10*time.Millisecond, handler) - g.Start(t.Context()) + g.Start(t.Context(), func(error) {}) tx := common.HexToHash("0x06") bh := common.HexToHash("0xEE") @@ -261,7 +261,7 @@ func TestConfirmationGate_BlockTimestampBypass(t *testing.T) { } g := newGate(t, 10*time.Millisecond, handler) - g.Start(t.Context()) + g.Start(t.Context(), func(error) {}) tx := common.HexToHash("0x07") bh := common.HexToHash("0xFF") @@ -299,7 +299,7 @@ func TestConfirmationGate_BlockTimestampPartialDelay(t *testing.T) { } g := newGate(t, 5*time.Second, handler) - g.Start(t.Context()) + g.Start(t.Context(), func(error) {}) tx := common.HexToHash("0x08") bh := 
common.HexToHash("0x08") @@ -336,7 +336,7 @@ func TestConfirmationGate_BlockTimestampZeroFallback(t *testing.T) { } g := newGate(t, 10*time.Millisecond, handler) - g.Start(t.Context()) + g.Start(t.Context(), func(error) {}) tx := common.HexToHash("0x09") bh := common.HexToHash("0x09") @@ -373,7 +373,7 @@ func TestConfirmationGate_Shutdown(t *testing.T) { g := newGate(t, 50*time.Millisecond, handler) ctx, cancel := context.WithCancel(t.Context()) - g.Start(ctx) + g.Start(ctx, func(error) {}) for i := range 3 { tx := common.HexToHash(string(rune(0x20 + i))) @@ -402,7 +402,7 @@ func TestConfirmationGate_ForwardedSetEviction(t *testing.T) { delay := 5 * time.Millisecond g := newGate(t, delay, handler) - g.Start(t.Context()) + g.Start(t.Context(), func(error) {}) tx := common.HexToHash("0x12") bh := common.HexToHash("0x12") @@ -470,7 +470,7 @@ func TestConfirmationGate_MultipleEvents_Ordering(t *testing.T) { } g := newGate(t, 5*time.Millisecond, handler) - g.Start(t.Context()) + g.Start(t.Context(), func(error) {}) txHashes := []common.Hash{ common.HexToHash("0xA1"), @@ -527,7 +527,7 @@ func TestConfirmationGate_TombstoneSkip(t *testing.T) { delay := 30 * time.Millisecond g := newGate(t, delay, handler) - g.Start(t.Context()) + g.Start(t.Context(), func(error) {}) tx := common.HexToHash("0x20") bhA := common.HexToHash("0xAAA") @@ -583,7 +583,7 @@ func TestConfirmationGate_FIFOEviction_ToleratesEarlyDelete(t *testing.T) { delay := 5 * time.Millisecond g := newGate(t, delay, handler) - g.Start(t.Context()) + g.Start(t.Context(), func(error) {}) tx := common.HexToHash("0x30") bh := common.HexToHash("0x30") @@ -660,7 +660,7 @@ func TestConfirmationGate_TimerReschedule(t *testing.T) { } g := newGate(t, 20*time.Millisecond, handler) - g.Start(t.Context()) + g.Start(t.Context(), func(error) {}) tx := common.HexToHash("0x40") bh := common.HexToHash("0x40") @@ -706,7 +706,7 @@ func TestConfirmationGate_KickDuringPendingTimer(t *testing.T) { delay := 100 * time.Millisecond g := 
newGate(t, delay, handler) - g.Start(t.Context()) + g.Start(t.Context(), func(error) {}) txA := common.HexToHash("0x50") bhA := common.HexToHash("0x50") @@ -762,7 +762,7 @@ func TestConfirmationGate_ShutdownWithNonEmptyQueue(t *testing.T) { g := newGate(t, 200*time.Millisecond, handler) ctx, cancel := context.WithCancel(t.Context()) - g.Start(ctx) + g.Start(ctx, func(error) {}) // Enqueue multiple events far in the future. for i := range 4 { @@ -789,10 +789,9 @@ func TestConfirmationGate_ShutdownWithNonEmptyQueue(t *testing.T) { } // TestConfirmationGate_HandlerErrorPropagatesFatal: when the downstream handler -// returns an error after the confirmation delay, the gate signals the fatal channel -// exactly once and the run goroutine exits. A second FatalErrors() receive must -// block (buffer size == 1; only the first error is delivered). After the goroutine -// exits, a second event that matures must NOT invoke the handler again. +// returns an error after the confirmation delay, the gate's lifecycle closure +// receives the sentinel error exactly once and the run goroutine exits. +// Subsequent events that mature must NOT invoke the handler again. func TestConfirmationGate_HandlerErrorPropagatesFatal(t *testing.T) { t.Parallel() @@ -805,46 +804,37 @@ func TestConfirmationGate_HandlerErrorPropagatesFatal(t *testing.T) { delay := 50 * time.Millisecond g := newGate(t, delay, handler) - g.Start(t.Context()) + + closureCh := make(chan error, 2) // size 2 to catch a buggy double-invocation + g.Start(t.Context(), func(err error) { closureCh <- err }) tx := common.HexToHash("0xF1") bh := common.HexToHash("0xF1") require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx, bh, 0, false))) - // The fatal channel must receive the sentinel error within delay + generous timeout. + // The closure must be invoked once with the sentinel error. 
select { - case err := <-g.FatalErrors(): - assert.Equal(t, sentinelErr, err, "fatal channel must carry the sentinel error") + case err := <-closureCh: + assert.Equal(t, sentinelErr, err, "closure must receive the sentinel error") case <-time.After(delay + 200*time.Millisecond): - t.Fatal("fatal channel did not receive an error within timeout") + t.Fatal("closure was not invoked within timeout") } - // A second receive must block immediately — only one error per gate-lifetime. + // A second invocation must not occur — the run goroutine has exited. select { - case extra := <-g.FatalErrors(): - t.Fatalf("unexpected second value on fatal channel: %v", extra) - default: - // correct: channel is empty after the first drain + case extra := <-closureCh: + t.Fatalf("unexpected second closure invocation: %v", extra) + case <-time.After(50 * time.Millisecond): + // correct: no second invocation } - // Give the run goroutine a moment to exit via <-g.done. - time.Sleep(50 * time.Millisecond) - - // Enqueue a second event after the failure. The goroutine has exited, so even + // Enqueue a second event after the failure. The goroutine has exited; even // if the kick is queued in the buffered channel it will never be drained. tx2 := common.HexToHash("0xF2") bh2 := common.HexToHash("0xF2") require.NoError(t, g.HandleEvent(context.Background(), makeLog(tx2, bh2, 0, false))) - // Wait well past the delay; the handler must NOT be called a second time. + // Wait past the delay; the handler must NOT be called a second time. 
time.Sleep(delay + 100*time.Millisecond) - assert.Equal(t, int64(1), handlerCalls.Load(), "handler must be invoked exactly once across gate lifetime") - - select { - case <-g.FatalErrors(): - t.Fatal("unexpected second fatal send after goroutine exited") - default: - // correct: no second fatal - } } diff --git a/pkg/blockchain/evm/listener.go b/pkg/blockchain/evm/listener.go index 8fb77bc21..580e11792 100644 --- a/pkg/blockchain/evm/listener.go +++ b/pkg/blockchain/evm/listener.go @@ -33,7 +33,6 @@ type Listener struct { handleEvent HandleEvent // live events and recent historical events; typically the ConfirmationGate handleHistoricalEvent HandleEvent // historical events older than confirmationDelay; typically the reactor directly eventGetter ContractEventGetter - handleEventFatalCh <-chan error // gate fatal-error channel; nil when no gate is in use (nil channel never selects) // Single-entry block-timestamp cache for ensureBlockTimestamp. The listener's // processEvents loop is strictly serial (Phase 1 drains before Phase 2, each @@ -58,11 +57,7 @@ type Listener struct { // // eventHandler is typically the ConfirmationGate; historicalEventHandler is typically // the reactor directly. The two handlers may be the same function when no gate is in use. -// -// eventHandlerFatalCh is the read-only fatal-error channel from the ConfirmationGate -// (gate.FatalErrors()). Pass nil when no gate is in use (confirmationDelay == 0); -// a nil channel never selects, so the fatal-error case is a no-op on the no-gate path. 
-func NewListener(contractAddress common.Address, client EVMClient, blockchainID uint64, blockStep uint64, confirmationDelay time.Duration, logger log.Logger, eventHandler HandleEvent, historicalEventHandler HandleEvent, eventGetter ContractEventGetter, eventHandlerFatalCh <-chan error) *Listener { +func NewListener(contractAddress common.Address, client EVMClient, blockchainID uint64, blockStep uint64, confirmationDelay time.Duration, logger log.Logger, eventHandler HandleEvent, historicalEventHandler HandleEvent, eventGetter ContractEventGetter) *Listener { return &Listener{ contractAddress: contractAddress, client: client, @@ -73,7 +68,6 @@ func NewListener(contractAddress common.Address, client EVMClient, blockchainID handleEvent: eventHandler, handleHistoricalEvent: historicalEventHandler, eventGetter: eventGetter, - handleEventFatalCh: eventHandlerFatalCh, } } @@ -322,13 +316,6 @@ func (l *Listener) processEvents( eventSubscription.Unsubscribe() return err } - case err := <-l.handleEventFatalCh: - l.logger.Error("downstream gate signalled fatal error, unsubscribing", - "error", err, - "blockchainID", l.blockchainID, - "contractAddress", l.contractAddress.String()) - eventSubscription.Unsubscribe() - return err case err := <-eventSubscription.Err(): if err != nil { l.logger.Error("event subscription error", "error", err, "blockchainID", l.blockchainID, "contractAddress", l.contractAddress.String()) @@ -383,13 +370,6 @@ func (l *Listener) processEvents( eventSubscription.Unsubscribe() return err } - case err := <-l.handleEventFatalCh: - l.logger.Error("downstream gate signalled fatal error, unsubscribing", - "error", err, - "blockchainID", l.blockchainID, - "contractAddress", l.contractAddress.String()) - eventSubscription.Unsubscribe() - return err case err := <-eventSubscription.Err(): if err != nil { l.logger.Error("event subscription error", "error", err, "blockchainID", l.blockchainID, "contractAddress", l.contractAddress.String()) diff --git 
a/pkg/blockchain/evm/listener_test.go b/pkg/blockchain/evm/listener_test.go index 3f3a1a3a8..fbd7ae002 100644 --- a/pkg/blockchain/evm/listener_test.go +++ b/pkg/blockchain/evm/listener_test.go @@ -45,7 +45,7 @@ func TestNewListener(t *testing.T) { addr := common.HexToAddress("0x123") eventGetter := new(MockContractEventGetter) - l := NewListener(addr, mockClient, 1, 100, 0, logger, nil, nil, eventGetter, nil) + l := NewListener(addr, mockClient, 1, 100, 0, logger, nil, nil, eventGetter) require.NotNil(t, l) assert.Equal(t, addr, l.contractAddress) assert.Equal(t, uint64(1), l.blockchainID) @@ -73,7 +73,7 @@ func TestListener_Listen_CurrentEvents(t *testing.T) { return nil } - listener := NewListener(addr, mockClient, 1, 100, 0, logger, handleEvent, handleEvent, eventGetter, nil) + listener := NewListener(addr, mockClient, 1, 100, 0, logger, handleEvent, handleEvent, eventGetter) // Mock SubscribeFilterLogs sub := &MockSubscription{ @@ -110,7 +110,7 @@ func TestListener_ReconcileBlockRange(t *testing.T) { addr := common.HexToAddress("0x123") eventGetter := new(MockContractEventGetter) - listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter, nil) + listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter) // Setup FilterLogs mock // We expect a range fetch. start=100, step=10 -> end=110. current=120. @@ -187,7 +187,7 @@ func TestListener_Listen_HistoricalAndCurrent(t *testing.T) { return nil } - listener := NewListener(addr, mockClient, 1, 10, 0, logger, handleEvent, handleEvent, eventGetter, nil) + listener := NewListener(addr, mockClient, 1, 10, 0, logger, handleEvent, handleEvent, eventGetter) // findCommonAncestor: HeaderByNumber(100) returns the same header we hashed above, // so the stored hash matches and block 100 is confirmed canonical. 
@@ -234,7 +234,7 @@ func TestProcessEvents_DedupSkipsPresent(t *testing.T) { return nil } - listener := NewListener(addr, new(MockEVMClient), 1, 10, 0, logger, handleEvent, handleEvent, eventGetter, nil) + listener := NewListener(addr, new(MockEVMClient), 1, 10, 0, logger, handleEvent, handleEvent, eventGetter) // Historical: 3 events. First 2 are present (skipped), 3rd is not (handled). // After the 3rd, the check should stop — no IsContractEventProcessed call for events 4+. @@ -287,7 +287,7 @@ func TestProcessEvents_SubscriptionErrorDuringPhase1(t *testing.T) { return nil } - listener := NewListener(addr, new(MockEVMClient), 1, 10, 0, logger, handleEvent, handleEvent, eventGetter, nil) + listener := NewListener(addr, new(MockEVMClient), 1, 10, 0, logger, handleEvent, handleEvent, eventGetter) // Historical channel with events that will block (not closed yet). BlockTimestamp // is set so ensureBlockTimestamp short-circuits. @@ -350,7 +350,7 @@ func TestListener_PhaseHandlerRouting(t *testing.T) { return nil } - listener := NewListener(addr, mockClient, 1, 10, confirmationDelay, logger, liveHandler, historicalHandler, eventGetter, nil) + listener := NewListener(addr, mockClient, 1, 10, confirmationDelay, logger, liveHandler, historicalHandler, eventGetter) // Old historical event (block timestamp 10 minutes ago) — should bypass the gate. oldHash := common.HexToHash("0xa1") @@ -439,7 +439,7 @@ func TestListener_PhaseHandlerRouting_DelayZero(t *testing.T) { return nil } - listener := NewListener(addr, mockClient, 1, 10, 0, logger, liveHandler, historicalHandler, eventGetter, nil) + listener := NewListener(addr, mockClient, 1, 10, 0, logger, liveHandler, historicalHandler, eventGetter) // BlockTimestamp populated by the upstream RPC — ensureBlockTimestamp short-circuits // and routeHistoricalEvent routes directly to historicalHandler because delay == 0. 
@@ -486,7 +486,7 @@ func TestListener_RemovedLog_ForwardedToHandler(t *testing.T) { return nil } - listener := NewListener(addr, new(MockEVMClient), 1, 10, 0, logger, handleEvent, handleEvent, eventGetter, nil) + listener := NewListener(addr, new(MockEVMClient), 1, 10, 0, logger, handleEvent, handleEvent, eventGetter) // No historical events. historicalCh := make(chan types.Log) @@ -544,7 +544,7 @@ func TestReconcileBlockRange_ContextCancellation(t *testing.T) { addr := common.HexToAddress("0x123") eventGetter := new(MockContractEventGetter) - listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter, nil) + listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter) ctx, cancel := context.WithCancel(context.Background()) @@ -584,7 +584,7 @@ func TestEnsureBlockTimestamp_Populated(t *testing.T) { addr := common.HexToAddress("0x123") eventGetter := new(MockContractEventGetter) - listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter, nil) + listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter) originalTs := uint64(1700000000) eventLog := types.Log{ @@ -609,7 +609,7 @@ func TestEnsureBlockTimestamp_Fetch(t *testing.T) { addr := common.HexToAddress("0x123") eventGetter := new(MockContractEventGetter) - listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter, nil) + listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter) bh := common.HexToHash("0xabc") headerTime := uint64(1700000000) @@ -634,7 +634,7 @@ func TestEnsureBlockTimestamp_CacheHit(t *testing.T) { addr := common.HexToAddress("0x123") eventGetter := new(MockContractEventGetter) - listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter, nil) + listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter) bh := common.HexToHash("0xabc") headerTime := uint64(1700000000) @@ -668,7 +668,7 @@ func 
TestEnsureBlockTimestamp_FetchError(t *testing.T) { addr := common.HexToAddress("0x123") eventGetter := new(MockContractEventGetter) - listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter, nil) + listener := NewListener(addr, mockClient, 1, 10, 0, logger, nil, nil, eventGetter) bh := common.HexToHash("0xabc") mockClient.On("HeaderByHash", mock.Anything, bh).Return(nil, fmt.Errorf("rpc failure")).Once() From 33c4b3a6c0766365dd7b5806a249b4a2d2a1f134 Mon Sep 17 00:00:00 2001 From: nksazonov Date: Sun, 14 Jun 2026 13:23:36 +0200 Subject: [PATCH 21/23] fix(nitronode/listener): drop Removed:true live logs on no-gate path --- nitronode/docs/reorg-fix.md | 1 + pkg/blockchain/evm/listener.go | 31 ++++-- pkg/blockchain/evm/listener_test.go | 166 ++++++++++++++++++++-------- 3 files changed, 144 insertions(+), 54 deletions(-) diff --git a/nitronode/docs/reorg-fix.md b/nitronode/docs/reorg-fix.md index 6b069652f..7cba82ed7 100644 --- a/nitronode/docs/reorg-fix.md +++ b/nitronode/docs/reorg-fix.md @@ -137,6 +137,7 @@ On startup, for each chain, after the `block_hash` migration has been applied: When `confirmation_delay_sec` is `0` the gate is disabled and every historical event is routed to `historicalEventHandler`. On an `ensureBlockTimestamp` failure the Listener falls back to `eventHandler` (the gate) — the conservative choice that preserves the reorg-protection invariant at the cost of a small delay. 6. The reactor is idempotent for replayed events: `HandleHomeChannelCreated` has an explicit early-return guard when the channel is already open; `HandleHomeChannelCheckpointed` and `RefreshUserEnforcedBalance` use set-semantics (not accumulation) and recompute from the latest DB state. Before opening a transaction, `HandleEvent` calls `IsContractEventProcessed`; if the event is already committed, it returns `nil` immediately with no DB transaction opened. 
If `IsContractEventProcessed` returns an error, `HandleEvent` returns the wrapped error; the listener unsubscribes and the process restarts (per the lifecycle closure in §6.8), re-fetching the same range via the DB cursor so the pre-check retries. For events that pass the pre-check, `StoreContractEvent` is called last inside the DB transaction and enforces a unique constraint on `(transaction_hash, log_index, blockchain_id)` as a final backstop. 7. Historical log queries (`eth_getLogs`) return only canonical chain events — there are no `Removed: true` signals during replay, and replay does not flow through the gate (step 5). Removal signals from the live WebSocket subscription that arrive during the replay phase are buffered in the listener's `currentCh` and reach the gate only after the historical replay phase completes; if they cancel a re-mined event that has already been forwarded by the live path, the post-gate reorg detection in §6.5 logs them. +8. When `confirmation_delay_sec == 0`, the listener drops `Removed:true` live logs at the Phase 2 boundary because there is no downstream gate to consume them; the reactor never receives `Removed:true` logs in either mode. --- diff --git a/pkg/blockchain/evm/listener.go b/pkg/blockchain/evm/listener.go index 580e11792..51c35f5ac 100644 --- a/pkg/blockchain/evm/listener.go +++ b/pkg/blockchain/evm/listener.go @@ -246,14 +246,18 @@ func (l *Listener) listenEvents(ctx context.Context) error { // failed event; the next Listen invocation re-fetches from the same // cursor. Transient handler failures retry instead of silently dropping. // -// 4. Reorged-out logs are forwarded to the handler (ConfirmationGate). -// Live deliveries with Removed=true are passed to the handler so the -// gate can cancel any pending confirmation timer for that event. The -// reactor never sees Removed=true logs directly; the gate filters them -// before forwarding confirmed events. 
The lastBlock cursor and -// IsContractEventProcessed dedup check are skipped for Removed=true events -// so neither the resume cursor nor the idempotency guard is corrupted -// by a reorg signal. +// 4. Reorged-out logs are routed by delay configuration. +// When confirmationDelay > 0, live deliveries with Removed=true are +// forwarded to the handler (ConfirmationGate) so the gate can cancel +// any pending confirmation timer for that event; the gate filters them +// before forwarding confirmed events to the reactor. When +// confirmationDelay == 0, there is no gate to consume the removal +// signal, so the listener drops Removed=true logs at the Phase 2 +// boundary — matching pre-PR behavior. In both modes the reactor +// never sees Removed=true logs directly. The lastBlock cursor and +// IsContractEventProcessed dedup check are skipped for Removed=true +// events so neither the resume cursor nor the idempotency guard is +// corrupted by a reorg signal. // // A consequence used by the nitronode event handlers: for any channel that // closes via Path-1 (challenge-timeout, ChannelHub Closed-from-DISPUTED), @@ -336,6 +340,17 @@ func (l *Listener) processEvents( eventSubscription.Unsubscribe() return nil case eventLog := <-currentCh: + if eventLog.Removed && l.confirmationDelay == 0 { + l.logger.Warn("dropping Removed=true live event on no-gate path", + "blockchainID", l.blockchainID, + "contractAddress", l.contractAddress.String(), + "blockNumber", eventLog.BlockNumber, + "blockHash", eventLog.BlockHash.Hex(), + "txHash", eventLog.TxHash.Hex(), + "logIndex", eventLog.Index, + ) + continue + } if !eventLog.Removed { *lastBlock = eventLog.BlockNumber if !currentCheckDone { diff --git a/pkg/blockchain/evm/listener_test.go b/pkg/blockchain/evm/listener_test.go index fbd7ae002..57305e67f 100644 --- a/pkg/blockchain/evm/listener_test.go +++ b/pkg/blockchain/evm/listener_test.go @@ -475,66 +475,140 @@ func TestListener_PhaseHandlerRouting_DelayZero(t *testing.T) { func 
TestListener_RemovedLog_ForwardedToHandler(t *testing.T) { t.Parallel() - logger := log.NewNoopLogger() - addr := common.HexToAddress("0x123") - eventGetter := new(MockContractEventGetter) - // Track which logs reached handleEvent. - var handledLogs []types.Log - handleEvent := func(ctx context.Context, eventLog types.Log) error { - handledLogs = append(handledLogs, eventLog) - return nil - } + t.Run("WithGate", func(t *testing.T) { + t.Parallel() + logger := log.NewNoopLogger() + addr := common.HexToAddress("0x123") + eventGetter := new(MockContractEventGetter) + + // Track which logs reached handleEvent. + var handledLogs []types.Log + handleEvent := func(ctx context.Context, eventLog types.Log) error { + handledLogs = append(handledLogs, eventLog) + return nil + } - listener := NewListener(addr, new(MockEVMClient), 1, 10, 0, logger, handleEvent, handleEvent, eventGetter) + // confirmationDelay > 0: the gate is active; Removed=true logs MUST be forwarded. + const delay = 30 * time.Second + listener := NewListener(addr, new(MockEVMClient), 1, 10, delay, logger, handleEvent, handleEvent, eventGetter) - // No historical events. - historicalCh := make(chan types.Log) - close(historicalCh) + // No historical events. + historicalCh := make(chan types.Log) + close(historicalCh) - currentCh := make(chan types.Log, 2) + currentCh := make(chan types.Log, 2) - // Event 1: non-Removed at block 10 — triggers IsContractEventProcessed check, - // advances lastBlock, sets currentCheckDone = true. BlockTimestamp is set so - // ensureBlockTimestamp short-circuits. - normalLog := types.Log{BlockNumber: 10, Index: 0, TxHash: common.HexToHash("0xabc"), BlockTimestamp: uint64(time.Now().Unix())} - eventGetter.On("IsContractEventProcessed", mock.Anything, uint32(0), uint64(1)).Return(false, nil).Once() + // Event 1: non-Removed at block 10 — triggers IsContractEventProcessed check, + // advances lastBlock, sets currentCheckDone = true. 
BlockTimestamp is set so + // ensureBlockTimestamp short-circuits. + normalLog := types.Log{BlockNumber: 10, Index: 0, TxHash: common.HexToHash("0xabc"), BlockTimestamp: uint64(time.Now().Unix())} + eventGetter.On("IsContractEventProcessed", mock.Anything, uint32(0), uint64(1)).Return(false, nil).Once() - // Event 2: Removed=true at block 11 — must NOT advance lastBlock, must NOT call - // IsContractEventProcessed, but MUST reach handleEvent. - removedLog := types.Log{BlockNumber: 11, Index: 0, TxHash: common.HexToHash("0xdef"), Removed: true} + // Event 2: Removed=true at block 11 — must NOT advance lastBlock, must NOT call + // IsContractEventProcessed, but MUST reach handleEvent (gate needs the removal signal). + removedLog := types.Log{BlockNumber: 11, Index: 0, TxHash: common.HexToHash("0xdef"), Removed: true} - currentCh <- normalLog - currentCh <- removedLog + currentCh <- normalLog + currentCh <- removedLog - sub := &MockSubscription{errChan: make(chan error, 1), unsub: func() {}} + sub := &MockSubscription{errChan: make(chan error, 1), unsub: func() {}} - ctx, cancel := context.WithCancel(context.Background()) - go func() { - // Give processEvents enough time to drain both buffered events, then cancel. - time.Sleep(100 * time.Millisecond) - cancel() - }() + ctx, cancel := context.WithCancel(context.Background()) + go func() { + // Give processEvents enough time to drain both buffered events, then cancel. + time.Sleep(100 * time.Millisecond) + cancel() + }() - var lastBlock uint64 - err := listener.processEvents(ctx, sub, historicalCh, currentCh, &lastBlock) - require.NoError(t, err) + var lastBlock uint64 + err := listener.processEvents(ctx, sub, historicalCh, currentCh, &lastBlock) + require.NoError(t, err) - // Both events must have reached handleEvent. - require.Len(t, handledLogs, 2, "handleEvent must be called for both the normal and the Removed event") + // Both events must have reached handleEvent. 
+ require.Len(t, handledLogs, 2, "handleEvent must be called for both the normal and the Removed event when gate is active") - // Verify first call was the normal log and second was the removed log. - assert.Equal(t, uint64(10), handledLogs[0].BlockNumber) - assert.False(t, handledLogs[0].Removed) - assert.Equal(t, uint64(11), handledLogs[1].BlockNumber) - assert.True(t, handledLogs[1].Removed) + // Verify first call was the normal log and second was the removed log. + assert.Equal(t, uint64(10), handledLogs[0].BlockNumber) + assert.False(t, handledLogs[0].Removed) + assert.Equal(t, uint64(11), handledLogs[1].BlockNumber) + assert.True(t, handledLogs[1].Removed) - // lastBlock must NOT have advanced past the normal event's block. - assert.Equal(t, uint64(10), lastBlock, "lastBlock must not be advanced by a Removed=true event") + // lastBlock must NOT have advanced past the normal event's block. + assert.Equal(t, uint64(10), lastBlock, "lastBlock must not be advanced by a Removed=true event") - // IsContractEventProcessed must have been called exactly once (for the normal log only). - eventGetter.AssertNumberOfCalls(t, "IsContractEventProcessed", 1) - eventGetter.AssertExpectations(t) + // IsContractEventProcessed must have been called exactly once (for the normal log only). + eventGetter.AssertNumberOfCalls(t, "IsContractEventProcessed", 1) + eventGetter.AssertExpectations(t) + }) + + t.Run("NoGate", func(t *testing.T) { + t.Parallel() + logger := log.NewNoopLogger() + addr := common.HexToAddress("0x123") + eventGetter := new(MockContractEventGetter) + + // Track which logs reached handleEvent. + var handledLogs []types.Log + handleEvent := func(ctx context.Context, eventLog types.Log) error { + handledLogs = append(handledLogs, eventLog) + return nil + } + + // confirmationDelay == 0: no gate; Removed=true logs must be dropped at Phase 2 boundary. 
+ listener := NewListener(addr, new(MockEVMClient), 1, 10, 0, logger, handleEvent, handleEvent, eventGetter) + + // No historical events. + historicalCh := make(chan types.Log) + close(historicalCh) + + currentCh := make(chan types.Log, 3) + + // Event 1: non-Removed at block 10 — advances lastBlock, triggers dedup check. + // BlockTimestamp is set so ensureBlockTimestamp short-circuits. + normalLog := types.Log{BlockNumber: 10, Index: 0, TxHash: common.HexToHash("0xabc"), BlockTimestamp: uint64(time.Now().Unix())} + eventGetter.On("IsContractEventProcessed", mock.Anything, uint32(0), uint64(1)).Return(false, nil).Once() + + // Event 2: Removed=true at block 11 — must be dropped; must NOT reach handleEvent, + // must NOT advance lastBlock. + removedLog := types.Log{BlockNumber: 11, Index: 0, TxHash: common.HexToHash("0xdef"), Removed: true} + + // Event 3: another non-Removed at block 12 — must flow normally after the dropped removal. + // BlockTimestamp is set so ensureBlockTimestamp short-circuits. + followupLog := types.Log{BlockNumber: 12, Index: 1, TxHash: common.HexToHash("0xghi"), BlockTimestamp: uint64(time.Now().Unix())} + + currentCh <- normalLog + currentCh <- removedLog + currentCh <- followupLog + + sub := &MockSubscription{errChan: make(chan error, 1), unsub: func() {}} + + ctx, cancel := context.WithCancel(context.Background()) + go func() { + // Give processEvents enough time to drain all three buffered events, then cancel. + time.Sleep(100 * time.Millisecond) + cancel() + }() + + var lastBlock uint64 + err := listener.processEvents(ctx, sub, historicalCh, currentCh, &lastBlock) + require.NoError(t, err) + + // Only the two non-Removed events must have reached handleEvent. 
+ require.Len(t, handledLogs, 2, "handleEvent must NOT be called for Removed=true when no gate is active") + assert.Equal(t, uint64(10), handledLogs[0].BlockNumber) + assert.False(t, handledLogs[0].Removed) + assert.Equal(t, uint64(12), handledLogs[1].BlockNumber) + assert.False(t, handledLogs[1].Removed) + + // lastBlock must reflect the last non-Removed event, not the removed one. + assert.Equal(t, uint64(12), lastBlock, "lastBlock must not be advanced by a Removed=true event") + + // IsContractEventProcessed must have been called exactly once (for the first normal log only; + // the follow-up log skips the check because currentCheckDone is already true). + eventGetter.AssertNumberOfCalls(t, "IsContractEventProcessed", 1) + eventGetter.AssertExpectations(t) + }) } func TestReconcileBlockRange_ContextCancellation(t *testing.T) { From 4709879978e4833504d4af4c7130b723319509d4 Mon Sep 17 00:00:00 2001 From: nksazonov Date: Sun, 14 Jun 2026 13:39:09 +0200 Subject: [PATCH 22/23] fix(nitronode/reconciler): resume from orphaned latest when all stored blocks reorged --- nitronode/docs/reorg-fix.md | 4 +-- pkg/blockchain/evm/reconciler.go | 40 ++++++++++++++++----------- pkg/blockchain/evm/reconciler_test.go | 13 +++++---- 3 files changed, 34 insertions(+), 23 deletions(-) diff --git a/nitronode/docs/reorg-fix.md b/nitronode/docs/reorg-fix.md index 7cba82ed7..2f028733f 100644 --- a/nitronode/docs/reorg-fix.md +++ b/nitronode/docs/reorg-fix.md @@ -122,11 +122,11 @@ On startup, for each chain, after the `block_hash` migration has been applied: - **Hash matches** → the stored block is the current canonical block at that height; no reorg above it. Proceed to step 4. - **Hash differs** → a different block now occupies that height; the stored block has been reorged out. Proceed to step 3. - **`ethereum.NotFound`** (RPC has no canonical block at that number, e.g. the height was pruned) → treat as reorged-out and proceed to step 3 rather than failing startup. -3. 
**Common-ancestor walk using stored block hashes:** query `contract_events` for the next-older distinct `block_hash` (the highest `block_number` strictly below the current candidate). Repeat step 2 with this (number, hash) pair. Continue until a stored block is confirmed canonical, or until no older stored hash exists (treat genesis as the fallback). This height is the **common ancestor**. +3. **Common-ancestor walk using stored block hashes:** query `contract_events` for the next-older distinct `block_hash` (the highest `block_number` strictly below the current candidate). Repeat step 2 with this (number, hash) pair. Continue until a stored block is confirmed canonical, or until no older stored hash exists. This height is the **common ancestor**. > **Why walk stored hashes, not block numbers?** In normal operation most blocks contain no `ChannelHub` events, so `contract_events` has no row for them. A block-number walk would find nothing to compare at event-gap heights and could miss a reorg that occurred entirely within such a gap. Walking by stored block hashes ensures every comparison is against a block the reactor actually processed. - If the walk reaches genesis without finding a canonical stored block, this implies either an empty store or a full-depth reorg; the latter is treated as a chain-level incident outside the gate's scope, and the listener proceeds as if the store were empty. + If the walk exhausts stored rows without finding a canonical one **and** no older row exists (`prevNum == 0` with `prevHash == ""`), the listener resumes from the *original* latest stored block number. The orphaned hash is discarded; `eth_getLogs` is a canonical-chain range query, so canonical-replacement logs between that height and the current tip are re-fetched normally. The empty-store case (`latestNum == 0`) continues to skip historical replay and tracks the chain from the live subscription. 4. Set the scan start to `commonAncestorBlockNum`. 
Events between `commonAncestorBlockNum` and `latestBlockNum` that came from the reorged fork are still present in the DB. The reactor has no rollback mechanism for those rows — the re-scan below will re-apply canonical events over them where the transaction was re-mined (idempotent), and leave the orphaned DB state in place where the transaction was not re-mined (residual risk; see §2.1). State-setting operations (`UpdateChannel`, `RefreshUserEnforcedBalance`) will overwrite with canonical values for re-mined events; rows from dropped transactions remain as stale data with no automated cleanup. 5. Start the event scan from `commonAncestorBlockNum` (or genesis if step 1 found no rows). Replayed events are routed **per-event by block age**: diff --git a/pkg/blockchain/evm/reconciler.go b/pkg/blockchain/evm/reconciler.go index 08616799d..aedbc80bf 100644 --- a/pkg/blockchain/evm/reconciler.go +++ b/pkg/blockchain/evm/reconciler.go @@ -16,8 +16,9 @@ import ( // finds a stored hash that matches the canonical chain's hash at that height, // then returns that block number as the safe replay start point. // -// Returns 0 when no stored events exist or when every stored block has been -// reorged out — in both cases the caller should replay from genesis/start-block. +// Returns 0 only when no stored events exist (empty store). When every stored +// block has been reorged out but a latest row exists, returns that row's block +// number so the caller can replay canonical logs from that height via eth_getLogs. 
func findCommonAncestor( ctx context.Context, client EVMClient, @@ -26,16 +27,18 @@ func findCommonAncestor( blockchainID uint64, logger log.Logger, ) (uint64, error) { - blockNum, blockHash, err := getter.GetLatestContractEventBlockHashAndNumber(contractAddress, blockchainID) + latestNum, latestHash, err := getter.GetLatestContractEventBlockHashAndNumber(contractAddress, blockchainID) if err != nil { return 0, fmt.Errorf("get latest contract event block hash: %w", err) } - if blockHash == "" { - // No stored events (blockNum=0) or pre-migration row with no hash (blockNum>0). - // Either way, treat blockNum as the safe canonical resume point. - return blockNum, nil + if latestHash == "" { + // No stored events (latestNum=0) or pre-migration row with no hash (latestNum>0). + // Either way, treat latestNum as the safe canonical resume point. + return latestNum, nil } + blockNum, blockHash := latestNum, latestHash + for { if ctx.Err() != nil { return 0, ctx.Err() @@ -66,15 +69,20 @@ func findCommonAncestor( return 0, fmt.Errorf("get previous distinct block hash below %d: %w", blockNum, err) } if prevHash == "" { - // No older stored block (prevNum=0) or pre-migration row (prevNum>0). - // Use prevNum as the safe canonical resume point. - // - // This branch conflates two distinct states: an empty store and a - // full-depth reorg where every stored block was reorged out. The latter - // requires a chain-level consensus failure that is outside the - // confirmation gate's scope, which is an incredibly unlikely scenario; - // both cases are treated identically here by proceeding as if the store were empty. - logger.Info("reconciliation: reached pre-migration or genesis boundary", + if prevNum == 0 { + // All stored event blocks have been reorged out and no older stored + // row exists. 
Resume from the orphaned latest stored block: eth_getLogs + // is a canonical-chain range query, so the canonical replacement logs + // between latestNum and the current tip will be re-fetched. The orphaned + // hash is irrelevant — only the height drives the range query. + logger.Info("reconciliation: all stored blocks reorged, resuming from orphaned latest", + "blockchainID", blockchainID, + "blockNumber", latestNum, + ) + return latestNum, nil + } + // Pre-migration row mid-walk (prevNum > 0, no hash recorded): trust it. + logger.Info("reconciliation: reached pre-migration boundary", "blockchainID", blockchainID, "blockNumber", prevNum, ) diff --git a/pkg/blockchain/evm/reconciler_test.go b/pkg/blockchain/evm/reconciler_test.go index 4037dfb95..f9d709be6 100644 --- a/pkg/blockchain/evm/reconciler_test.go +++ b/pkg/blockchain/evm/reconciler_test.go @@ -138,10 +138,11 @@ func TestFindCommonAncestor_NotFoundTreatedAsReorg(t *testing.T) { assert.Equal(t, uint64(190), result) } -// TestFindCommonAncestor_WalkToGenesis verifies that when all stored blocks have been -// reorged out (canonical hashes differ at every stored height), findCommonAncestor returns -// 0 (genesis fallback). -func TestFindCommonAncestor_WalkToGenesis(t *testing.T) { +// TestFindCommonAncestor_AllStoredReorged_ResumesFromOrphanedLatest verifies that when all +// stored blocks have been reorged out (canonical hashes differ at every stored height) and +// no older stored row exists, findCommonAncestor returns the original latestBlockNum so the +// caller can replay canonical logs from that height via eth_getLogs. 
+func TestFindCommonAncestor_AllStoredReorged_ResumesFromOrphanedLatest(t *testing.T) { t.Parallel() client := new(MockEVMClient) @@ -163,7 +164,9 @@ func TestFindCommonAncestor_WalkToGenesis(t *testing.T) { result, err := findCommonAncestor(context.Background(), client, getter, testContract, testBlockchainID, newTestLogger()) require.NoError(t, err) - assert.Equal(t, uint64(0), result) + // Returns the original latestBlockNum (300), not 0: the caller uses eth_getLogs from + // that height to re-fetch canonical replacement logs. + assert.Equal(t, uint64(300), result) } // TestFindCommonAncestor_PreMigrationLatestRow verifies that when the latest stored row has From 1bc0d5655f011851cf3c9bd6b48a2021a1597d9d Mon Sep 17 00:00:00 2001 From: nksazonov Date: Sun, 14 Jun 2026 14:00:37 +0200 Subject: [PATCH 23/23] docs: cross-link confirmation gate and normalize confirmation_delay_secs --- docs/api.yaml | 3 ++ docs/protocol/security-and-limitations.md | 1 + nitronode/README.md | 4 +++ nitronode/docs/reorg-fix.md | 34 +++++++++++------------ pkg/core/README.md | 10 ++++++- 5 files changed, 34 insertions(+), 18 deletions(-) diff --git a/docs/api.yaml b/docs/api.yaml index 782999a20..cbbbf824f 100644 --- a/docs/api.yaml +++ b/docs/api.yaml @@ -270,6 +270,9 @@ types: - name: contract_address type: string description: Address of the main contract on this blockchain + - name: confirmation_delay_secs + type: integer + description: Per-chain reorg-protection window in seconds; events are buffered for this duration before being committed. 0 disables the gate. See nitronode/docs/reorg-fix.md. 
- balance_entry: description: Balance for a specific asset diff --git a/docs/protocol/security-and-limitations.md b/docs/protocol/security-and-limitations.md index 36cbbad15..611a795a5 100644 --- a/docs/protocol/security-and-limitations.md +++ b/docs/protocol/security-and-limitations.md @@ -66,6 +66,7 @@ In the current protocol version, participants MUST trust nodes for: - **Off-chain transfer routing** — when a user sends funds off-chain to another party, the node must countersign both the sender's state (decreasing their allocation) and the receiver's credit state (increasing theirs); the on-chain contract cannot enforce atomicity between two independent channel updates. A malicious node could apply the sender's state while withholding the receiver's credit, capturing the transferred funds. Users must trust the node to faithfully execute both legs of every off-chain transfer. - **Asset-symbol equivalence** — the node operator controls which chain-specific tokens are configured under a single unified asset symbol. The protocol treats all tokens sharing a symbol as fully fungible 1:1 representations of the same asset, so off-chain credit denominated in that asset can be redeemed from any of those token inventories regardless of which one originally backed it (the validator binds unchanneled credit to the chain/token chosen at first channel creation, enforcing only that the asset symbol matches). This is intended behaviour that enables cross-chain redemption. Operators MUST therefore configure only economically equivalent (1:1 redeemable) tokens under one symbol; grouping non-equivalent tokens (e.g. a test token and production USDC) under the same symbol would let credit sourced from the cheap inventory be redeemed against the valuable one. Token equivalence cannot be verified programmatically and is an operator configuration responsibility. 
- **Signature validator registry** — the node operator controls which additional signature validators are registered on the ChannelHub contract. A malicious or compromised node could register a validator that approves forged user signatures, then use it to create channels or close them without the user's knowledge. A 1-day activation delay (`VALIDATOR_ACTIVATION_DELAY`) creates an observable window before any newly registered validator can be used. Users MUST monitor the `ValidatorRegistered` event on the ChannelHub contract and SHOULD revoke all ERC20 approvals granted to ChannelHub immediately upon detecting an unexpected registration. Once registered, a validator cannot be deactivated — the 1-day window is the entire response budget. Users SHOULD avoid granting large standing ERC20 approvals to ChannelHub to cap worst-case exposure. +- **Chain reorg depth** — the node credits off-chain balances after observing on-chain events. To bound reorg risk, each chain has a `confirmation_delay_secs` window before events are committed; events whose block is reorged out within that window are discarded. When the configured delay is set below the chain's hard finality time, a residual risk remains: a deeper reorg can leave the off-chain state with no on-chain backing. Operators MUST set `confirmation_delay_secs` to at least the chain's finality time when this residual exposure is unacceptable. See [Reorg-Protection Confirmation Gate](../../nitronode/docs/reorg-fix.md). Participants do not need to trust nodes for: diff --git a/nitronode/README.md b/nitronode/README.md index cb505b00b..632abad99 100644 --- a/nitronode/README.md +++ b/nitronode/README.md @@ -19,6 +19,7 @@ Nitronode is built with a modular architecture: - **RPC Server**: WebSocket-based JSON-RPC server handling client requests. - **Blockchain Listeners**: Monitors on-chain events from Nitrolite `ChannelHub` contracts across multiple chains. 
+- **Confirmation Gate**: Per-chain reorg-protection buffer between the listener and event handlers. Delays event delivery by `confirmation_delay_secs` so that events whose blocks are reorged out before the window elapses are dropped instead of committed. See [docs/reorg-fix.md](docs/reorg-fix.md). - **Event Handlers**: Processes blockchain events to update internal channel and user states. - **Storage Layer**: - **Database Store**: Persistent storage for channels, states, and transactions (supports SQLite and PostgreSQL). @@ -53,6 +54,7 @@ blockchains: id: 80002 contract_address: "0x9d1E88627884e066B81A02d69BCB2437a520534C" block_step: 1000 + confirmation_delay_secs: 10 # reorg-protection window; 0 disables. See docs/reorg-fix.md. - name: base_sepolia id: 84532 @@ -128,6 +130,7 @@ docker run -p 7824:7824 -e NITRONODE_SIGNER_KEY=... nitronode nitronode/ ├── api/ # JSON-RPC request handlers ├── config/ # Default configurations and migrations +├── docs/ # Component design notes (e.g. reorg-fix.md) ├── event_handlers/ # Logic for reacting to blockchain events ├── metrics/ # Prometheus telemetry implementation ├── store/ # Persistence layer (SQL and Memory) @@ -157,6 +160,7 @@ The following protocol operations are fully specified in [protocol-description.m - [Nitrolite Protocol Overview](../protocol-description.md) - [Communication Flows](../docs/communication_flows/) - [API Reference](../docs/api.yaml) +- [Reorg-Protection Confirmation Gate](docs/reorg-fix.md) ## License diff --git a/nitronode/docs/reorg-fix.md b/nitronode/docs/reorg-fix.md index 2f028733f..142710aca 100644 --- a/nitronode/docs/reorg-fix.md +++ b/nitronode/docs/reorg-fix.md @@ -15,7 +15,7 @@ This risk is meaningful on any chain where head-level reorgs occur naturally or A **per-chain confirmation window** is introduced between raw event delivery and handler invocation. When the listener observes any event on chain C: - It does **not** invoke the handler immediately. 
-- It waits for `confirmation_delay_sec` seconds (configured per chain in `blockchains.yaml`). +- It waits for `confirmation_delay_secs` seconds (configured per chain in `blockchains.yaml`). - If no reorg of the event's block occurs during that window, the handler is invoked normally. - If the event's block is reorged out (`removed: true` log arrives), the pending invocation is cancelled with no side effects. - If the reorged transaction is re-included (the same event appears again), the confirmation window restarts from zero. @@ -24,39 +24,39 @@ The delay applies uniformly to **all** events, not only deposit-class ones. Sele ### 2.1 Residual risk and the finality trade-off -The confirmation window eliminates the reorg risk only when `confirmation_delay_sec` is set to or above the chain's cryptoeconomic finality time. For the representative values in §3: +The confirmation window eliminates the reorg risk only when `confirmation_delay_secs` is set to or above the chain's cryptoeconomic finality time. For the representative values in §3: - **Ethereum at 780s (~13 min):** matches Casper FFG hard finality. Reorging past this point requires ≥1/3 of total stake to be slashed. No residual risk. - **Polygon at 10s, BNB at 5s:** exceeds the empirical reorg tail depth. Residual risk is negligible but not cryptoeconomically eliminated. - **Ethereum at 36s (3 blocks, "quick" finality):** P(reorg depth ≥ 4) ≈ 10⁻⁵–10⁻⁶ per event. Residual risk is real. -When `confirmation_delay_sec` is set *below* the chain's finality time, **this specification acknowledges a residual risk**: it is possible — with low but non-zero probability — that an event passes the gate, the reactor commits it to the database, and the block containing that event is subsequently reorged out by a reorg deeper than the gate window. 
+When `confirmation_delay_secs` is set *below* the chain's finality time, **this specification acknowledges a residual risk**: it is possible — with low but non-zero probability — that an event passes the gate, the reactor commits it to the database, and the block containing that event is subsequently reorged out by a reorg deeper than the gate window. When this occurs, the committed state (balance credit, channel open) has no corresponding on-chain event in the canonical chain. If the transaction is re-mined in the new canonical block, the reactor's idempotency guard (§6.6) handles the re-delivery cleanly. If it is not re-mined, the DB retains stale state that can only be partially corrected on the next node restart via the reconciliation walk (§4.4). There is no automated rollback; the exposure scales with the deposit value and is bounded by the probability of deep reorgs on the target chain. -Operators who cannot accept this residual exposure should set `confirmation_delay_sec` to the chain's hard-finality time (Ethereum: 780s; Polygon: `finalized` tag resolves to ~5s; L2s: `finalized` maps to L1 Casper FFG at ~13 min). The gate's detection mechanisms (§6.5, §6.6) provide observability when the residual-risk scenario occurs. +Operators who cannot accept this residual exposure should set `confirmation_delay_secs` to the chain's hard-finality time (Ethereum: 780s; Polygon: `finalized` tag resolves to ~5s; L2s: `finalized` maps to L1 Casper FFG at ~13 min). The gate's detection mechanisms (§6.5, §6.6) provide observability when the residual-risk scenario occurs. --- ## 3. Configuration -A new `confirmation_delay_sec` field is added per chain in `blockchains.yaml`. Representative values: +A new `confirmation_delay_secs` field is added per chain in `blockchains.yaml`. 
Representative values: ```yaml chains: - id: 1 # Ethereum mainnet - confirmation_delay_sec: 780 # ~13 min — Casper FFG hard finality + confirmation_delay_secs: 780 # ~13 min — Casper FFG hard finality - id: 137 # Polygon PoS (post-Heimdall v2 / Rio) - confirmation_delay_sec: 10 # 5 blocks × ~2s; empirical reorg tail is sub-10s + confirmation_delay_secs: 10 # 5 blocks × ~2s; empirical reorg tail is sub-10s - id: 56 # BNB Smart Chain - confirmation_delay_sec: 5 # fast-finality, ~3-4 blocks + confirmation_delay_secs: 5 # fast-finality, ~3-4 blocks - id: 42161 # Arbitrum One - confirmation_delay_sec: 120 # L2 `safe` tag (L1-posted batch), ~1-2 min + confirmation_delay_secs: 120 # L2 `safe` tag (L1-posted batch), ~1-2 min - id: 8453 # Base - confirmation_delay_sec: 120 # same L2 `safe` semantics + confirmation_delay_secs: 120 # same L2 `safe` semantics ``` -`confirmation_delay_sec: 0` disables the gate — events are processed immediately. Appropriate for BFT single-slot chains where the node operator accepts the negligible residual risk, or for chains using a finality-tag subscription rather than a block-count gate. +`confirmation_delay_secs: 0` disables the gate — events are processed immediately. Appropriate for BFT single-slot chains where the node operator accepts the negligible residual risk, or for chains using a finality-tag subscription rather than a block-count gate. --- @@ -67,7 +67,7 @@ chains: When a log `E` arrives (without `Removed: true`): 1. Record the event in the live-entry map under `(txHash, logIndex)` with its `blockHash` as the tombstone discriminator, and append it to the FIFO drain queue with its block timestamp as `arrivedAt`. -2. The gate's drain goroutine (single shared timer per gate; see §6.3) treats the entry as eligible once `arrivedAt + confirmation_delay_sec` has elapsed. +2. The gate's drain goroutine (single shared timer per gate; see §6.3) treats the entry as eligible once `arrivedAt + confirmation_delay_secs` has elapsed. 3. 
When the entry matures, invoke the event handler. ### 4.2 Reorg path @@ -130,11 +130,11 @@ On startup, for each chain, after the `block_hash` migration has been applied: 4. Set the scan start to `commonAncestorBlockNum`. Events between `commonAncestorBlockNum` and `latestBlockNum` that came from the reorged fork are still present in the DB. The reactor has no rollback mechanism for those rows — the re-scan below will re-apply canonical events over them where the transaction was re-mined (idempotent), and leave the orphaned DB state in place where the transaction was not re-mined (residual risk; see §2.1). State-setting operations (`UpdateChannel`, `RefreshUserEnforcedBalance`) will overwrite with canonical values for re-mined events; rows from dropped transactions remain as stale data with no automated cleanup. 5. Start the event scan from `commonAncestorBlockNum` (or genesis if step 1 found no rows). Replayed events are routed **per-event by block age**: - - Events whose block timestamp is **older than `confirmation_delay_sec`** are routed directly to the reactor, bypassing the gate. Their block is past the reorg window — `eth_getLogs` returned them as canonical, and any reorg that could displace them would exceed the configured finality bound. There is no incremental reorg risk to guard against, and routing them through the gate would only add latency. - - Events whose block timestamp is **younger than `confirmation_delay_sec`** are routed through the gate, the same path live events take. The common-ancestor walk only confirms the *starting* block is canonical; replay can fetch logs from blocks all the way up to the current chain tip, some of which are still inside the reorg window. Forwarding those directly to the reactor would re-introduce the very double-spend window the gate was built to close. + - Events whose block timestamp is **older than `confirmation_delay_secs`** are routed directly to the reactor, bypassing the gate. 
Their block is past the reorg window — `eth_getLogs` returned them as canonical, and any reorg that could displace them would exceed the configured finality bound. There is no incremental reorg risk to guard against, and routing them through the gate would only add latency. + - Events whose block timestamp is **younger than `confirmation_delay_secs`** are routed through the gate, the same path live events take. The common-ancestor walk only confirms the *starting* block is canonical; replay can fetch logs from blocks all the way up to the current chain tip, some of which are still inside the reorg window. Forwarding those directly to the reactor would re-introduce the very double-spend window the gate was built to close. The `Listener` accepts two handlers (`eventHandler` for live events and recent historical events, `historicalEventHandler` for mature historical events) and makes the per-event routing decision from `eventLog.BlockTimestamp`. To guarantee that field is populated regardless of the RPC provider's behavior, the listener calls `ensureBlockTimestamp` once per event, which uses `eventLog.BlockTimestamp` when present and falls back to `HeaderByHash` otherwise (at most one fetch per block regardless of event count). - When `confirmation_delay_sec` is `0` the gate is disabled and every historical event is routed to `historicalEventHandler`. On an `ensureBlockTimestamp` failure the Listener falls back to `eventHandler` (the gate) — the conservative choice that preserves the reorg-protection invariant at the cost of a small delay. + When `confirmation_delay_secs` is `0` the gate is disabled and every historical event is routed to `historicalEventHandler`. On an `ensureBlockTimestamp` failure the Listener falls back to `eventHandler` (the gate) — the conservative choice that preserves the reorg-protection invariant at the cost of a small delay. 6. 
The reactor is idempotent for replayed events: `HandleHomeChannelCreated` has an explicit early-return guard when the channel is already open; `HandleHomeChannelCheckpointed` and `RefreshUserEnforcedBalance` use set-semantics (not accumulation) and recompute from the latest DB state. Before opening a transaction, `HandleEvent` calls `IsContractEventProcessed`; if the event is already committed, it returns `nil` immediately with no DB transaction opened. If `IsContractEventProcessed` returns an error, `HandleEvent` returns the wrapped error; the listener unsubscribes and the process restarts (per the lifecycle closure in §6.8), re-fetching the same range via the DB cursor so the pre-check retries. For events that pass the pre-check, `StoreContractEvent` is called last inside the DB transaction and enforces a unique constraint on `(transaction_hash, log_index, blockchain_id)` as a final backstop. 7. Historical log queries (`eth_getLogs`) return only canonical chain events — there are no `Removed: true` signals during replay, and replay does not flow through the gate (step 5). Removal signals from the live WebSocket subscription that arrive during the replay phase are buffered in the listener's `currentCh` and reach the gate only after the historical replay phase completes; if they cancel a re-mined event that has already been forwarded by the live path, the post-gate reorg detection in §6.5 logs them. 8. When `confirmation_delay_secs == 0`, the listener drops `Removed:true` live logs at the Phase 2 boundary because there is no downstream gate to consume them; the reactor never receives `Removed:true` logs in either mode. @@ -178,7 +178,7 @@ if confirmationDelay > 0 { l := evm.NewListener(..., liveHandler, reactor.HandleEvent, ...) ``` -The constructor returns an error for `delay <= 0`; the wiring layer is responsible for skipping gate construction when the operator configured `confirmation_delay_sec: 0` and routing live events straight to the reactor. 
+The constructor returns an error for `delay <= 0`; the wiring layer is responsible for skipping gate construction when the operator configured `confirmation_delay_secs: 0` and routing live events straight to the reactor. The reactor itself does not change. All the listener's existing logic — subscription management, cursor tracking, reconnection, historical replay — is unaffected. @@ -306,7 +306,7 @@ Files to update: - `pkg/rpc/types.go` — add `ConfirmationDelaySecs uint64` to `BlockchainInfoV1`. - `nitronode/api/node_v1/utils.go` — populate the new field in `mapBlockchainV1` from the chain's loaded config. -- `pkg/core/types.go` (or wherever `core.Blockchain` is defined) — add `ConfirmationDelaySec uint64` so the value flows from `blockchains.yaml` through config loading into the API handler. +- `pkg/core/types.go` (or wherever `core.Blockchain` is defined) — add `ConfirmationDelaySecs uint64` so the value flows from `blockchains.yaml` through config loading into the API handler. No new endpoint is needed. The field appears alongside existing per-chain fields (contract addresses, asset list, block time) and is read-only from the client's perspective. diff --git a/pkg/core/README.md b/pkg/core/README.md index 466b885df..ceb12aff4 100644 --- a/pkg/core/README.md +++ b/pkg/core/README.md @@ -52,7 +52,15 @@ The `Client` interface abstracts the communication with the `ChannelsHub` smart ### Listener Interface -The `Listener` allows applications to react to on-chain state changes by registering handlers for events like `HomeChannelCreatedEvent` or `EscrowDepositFinalizedEvent`. +The `Listener` exposes events via a **two-handler model**. A `liveHandler` receives live events plus any historical events still within the reorg window, while a `historicalEventHandler` receives mature historical events past the configured `confirmationDelay`. 
Per-event routing is decided by the listener itself: it compares `eventLog.BlockTimestamp` against `confirmationDelay` to choose which handler an event flows into. This makes the listener delay-aware rather than pushing that decision down to consumers. + +The typical `liveHandler` is the **`ConfirmationGate`**, which implements the reorg-protection window. The gate buffers each event for `confirmation_delay_secs` before forwarding it to the reactor; if the event's block is reorged out within that window, the gate silently drops it instead of committing it downstream. With the gate in place, the reactor only ever sees events whose blocks have survived the configured confirmation window. + +To make this work, the listener owns timestamp population. **`ensureBlockTimestamp`** guarantees `BlockTimestamp` is set on every non-removed event before it is forwarded: it uses `eventLog.BlockTimestamp` directly when present, and otherwise falls back to a cached `HeaderByHash` lookup. The gate relies on this to compute each event's `arrivedAt` correctly. **`Removed: true`** logs are handled exclusively at the listener boundary: in the live (Phase 2) path with a gate, removed logs are forwarded so the gate can cancel a pending timer; with no gate configured (`confirmation_delay_secs == 0`), the listener drops removed logs at Phase 2 and the reactor never sees them. Historical (Phase 1) replays use `eth_getLogs`, which never emits removals, so that path is simpler by construction. + +On startup, the listener reconciles against possible reorgs that happened while the node was down. **`findCommonAncestor`** walks stored block hashes backward to locate a still-canonical resume point. If every stored block has been reorged out, it returns the orphaned-latest height so `eth_getLogs` re-fetches canonical replacements from that range; the orphan hash itself is discarded — only the height matters because `eth_getLogs` is a canonical-chain range query. 
+ +See [`nitronode/docs/reorg-fix.md`](../../nitronode/docs/reorg-fix.md) for the full design. ### State Advancer