Add auto-repair options for broken AOF tail on startup by tomerqodo · Pull Request #20 · qodo-benchmark/redis-combined-coderabbit

tomerqodo · 2026-01-22T06:49:13Z

Summary by CodeRabbit

New Features
- Added aof-load-broken configuration option to enable automatic recovery from corrupted AOF tail files (disabled by default).
- Added aof-load-broken-max-size configuration to specify the maximum size of broken AOF content to tolerate before recovery.
- Implemented automatic truncation and recovery of mildly corrupted AOF tails, reducing fatal startup failures in certain corruption scenarios.
Tests
- Added comprehensive integration tests for AOF corruption recovery behavior.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

…oken

coderabbitai · 2026-01-22T06:49:41Z

Walkthrough

Redis implements automatic recovery from corrupted AOF (Append-Only File) tails through new configuration options aof-load-broken and aof-load-broken-max-size. When enabled and corruption tail size is below the threshold, the system truncates the file to the last valid state and continues loading, treating this as recoverable rather than fatal.

Changes

Cohort / File(s)	Summary
AOF Configuration `redis.conf`	Added configuration parameters for broken AOF handling: `aof-load-broken` (boolean, default no) and `aof-load-broken-max-size` (bytes, default 4096). Includes detailed comments on recovery behavior and interaction with redis-check-aof.
AOF Recovery Logic `src/aof.c`	Implements automatic truncation and recovery when AOF tail corruption is detected and recovery is enabled. Tracks valid command offsets, returns new `AOF_BROKEN_RECOVERED` status, and treats this state as acceptable in applicable loading scenarios. Preserves existing fatal paths for stricter error conditions.
Configuration Management `src/config.c`	Registered new boolean config `aof-load-broken` and offset config `aof-load-broken-max-size` in the static configuration set with appropriate bounds and defaults.
Server State `src/server.h`	Added AOF recovery constants (`AOF_BROKEN_RECOVERED` macro), new server struct fields (`aof_load_broken`, `aof_load_broken_max_size`), and updated header guard macro names.
Test Coverage `tests/integration/aof.tcl`	Added 136 lines of integration tests validating recovery behavior across single and multi-file AOF scenarios, corruption positioning, log assertions, and interaction with configuration thresholds.

Sequence Diagram

sequenceDiagram
    participant Server as Server Init
    participant Config as Config Loader
    participant AOFLoader as AOF Loader
    participant FileOps as File Operations
    participant Logger as Logger

    Server->>Config: Load configuration
    Config->>Server: aof-load-broken = enabled, max-size = 4096
    Server->>AOFLoader: Start AOF loading
    AOFLoader->>FileOps: Read AOF file
    FileOps->>AOFLoader: Format error at tail
    AOFLoader->>AOFLoader: Check: tail size < max-size?
    alt Tail within threshold
        AOFLoader->>Logger: Log recovery attempt
        AOFLoader->>FileOps: Truncate to last valid offset
        FileOps->>AOFLoader: Truncation complete
        AOFLoader->>Logger: Log broken tail size removed
        AOFLoader->>Server: Return AOF_BROKEN_RECOVERED
        Server->>Server: Accept recovery, continue
    else Tail exceeds threshold
        AOFLoader->>Logger: Log fatal corruption
        AOFLoader->>Server: Return error
        Server->>Server: Halt loading
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Oh, when the AOF tales grow worn and cracked,
With broken tails that cannot be intact,
A gentle fix now mends the shattered file,
Recovering lost hops with Redis style!
No longer must we fail on damage small,
This fuzzy feature helps us bounce back tall! 🥕

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Add auto-repair options for broken AOF tail on startup' is directly related to and accurately summarizes the main change in the pull request. All modifications across configuration files, source code, and tests align with adding auto-repair functionality via aof-load-broken and aof-load-broken-max-size options.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In `@src/aof.c`:
- Around line 1830-1842: The stopLoading call currently treats only AOF_OK and
AOF_TRUNCATED as success but omits AOF_BROKEN_RECOVERED, causing stopLoading(0)
to emit REDISMODULE_SUBEVENT_LOADING_FAILED for a recovery that should be
considered successful; update the condition(s) where stopLoading is invoked
(look for calls using the local ret variable and checks against AOF_OK ||
AOF_TRUNCATED) to include AOF_BROKEN_RECOVERED (i.e., treat ret == AOF_OK || ret
== AOF_TRUNCATED || ret == AOF_BROKEN_RECOVERED as success), and make this
change for both corresponding code paths in aof.c so modules receive the correct
loading event.

In `@src/config.c`:
- Around line 3256-3259: The "aof-load-broken-max-size" config created by
createOffTConfig uses LONG_MAX and INTEGER_CONFIG which can cap values and
reject unit suffixes; update the createOffTConfig call for
"aof-load-broken-max-size" to use LLONG_MAX as the upper bound and MEMORY_CONFIG
as the type (matching server.aof_load_broken_max_size semantics) so large values
and byte-unit suffixes are accepted.

In `@tests/integration/aof.tcl`:
- Around line 723-735: The current rm -f $aof_dirpath/appendonly.aof.* will
remove the base and manifest too; change the cleanup to only remove INCR files
(e.g. target filenames matching the INCR suffix pattern while excluding
appendonly.aof and the manifest) and ensure the AOF manifest is rewritten so it
points only to the base file (making the base the last file) before calling
start_server_aof; update the shell command that removes files and add a small
step to recreate/update the manifest (referencing aof_dirpath, the
appendonly.aof naming, and the start_server_aof/wait_done_loading test block) so
the tests assert against a recoverable base-only AOF.

🧹 Nitpick comments (1)

src/server.h (1)
2009-2011: Fix minor comment typo for clarity.
“than” → “that”.
📝 Suggested edit
-    off_t aof_load_broken_max_size; /* The max size of broken AOF tail than can be ignored. */
+    off_t aof_load_broken_max_size; /* The max size of broken AOF tail that can be ignored. */

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d363891 and a9b1640.

📒 Files selected for processing (5)

redis.conf
src/aof.c
src/config.c
src/server.h
tests/integration/aof.tcl

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)

GitHub Check: code-coverage
GitHub Check: test-external-cluster
GitHub Check: build-32bit
GitHub Check: test-external-standalone
GitHub Check: test-external-nodebug
GitHub Check: Analyze (cpp)
GitHub Check: build-old-chain-jemalloc
GitHub Check: test-ubuntu-latest
GitHub Check: test-sanitizer-address
GitHub Check: build-centos-jemalloc
GitHub Check: build-old-chain-jemalloc
GitHub Check: test-sanitizer-address
GitHub Check: build-centos-jemalloc
GitHub Check: code-coverage
GitHub Check: test-ubuntu-latest
GitHub Check: test-external-cluster
GitHub Check: test-external-standalone
GitHub Check: test-external-nodebug

🔇 Additional comments (6)

src/server.h (2)

15-16: Header guard update looks good.

348-348: New AOF load return code is clear and consistent.

src/config.c (1)

3092-3094: LGTM: config wiring is consistent with other AOF load toggles.
Default off aligns with conservative recovery behavior.

redis.conf (1)

1532-1549: Docs are clear and appropriately cautionary.
The explanation of risks and behavior is concise and user-friendly.

src/aof.c (2)

1658-1662: Track valid offsets only when recovery/truncation is enabled

Keeps valid_up_to aligned with the recovery path without extra ftello work when both options are off.

1722-1756: Auto-repair gating looks solid

The recovery path is correctly guarded by config and size threshold, and keeps non-recoverable cases on the fatal path.

_{✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.}

coderabbitai · 2026-01-22T06:58:08Z

+        if (ret == AOF_OK || ((ret == AOF_TRUNCATED || ret == AOF_BROKEN_RECOVERED) && last_file)) {
            serverLog(LL_NOTICE, "DB loaded from base file %s: %.3f seconds",
                aof_name, (float)(ustime()-start)/1000000);
        }

        /* If the truncated file is not the last file, we consider this to be a fatal error. */
-        if (ret == AOF_TRUNCATED && !last_file) {
+        if ((ret == AOF_TRUNCATED || ret == AOF_BROKEN_RECOVERED) && !last_file) {
            ret = AOF_FAILED;
            serverLog(LL_WARNING, "Fatal error: the truncated file is not the last file");
        }

-        if (ret == AOF_OPEN_ERR || ret == AOF_FAILED) {
+        if (ret == AOF_OPEN_ERR || ret == AOF_FAILED || ret == AOF_BROKEN_RECOVERED) {
            goto cleanup;


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check AOF_BROKEN_RECOVERED usage and related functions echo "=== AOF_BROKEN_RECOVERED occurrences ===" rg -n "AOF_BROKEN_RECOVERED" src echo "" echo "=== loadAppendOnlyFiles callers ===" rg -n "loadAppendOnlyFiles\(" src echo "" echo "=== stopLoading usage ===" rg -n "stopLoading\(" src echo "" echo "=== Check AOF return value enum ===" rg -n "AOF_OK|AOF_TRUNCATED|AOF_FAILED|AOF_OPEN_ERR" src/aof.h

Repository: qodo-benchmark/redis-combined-coderabbit

Length of output: 1909

🏁 Script executed:

#!/bin/bash # Check context around stopLoading at line 1894 echo "=== Context around stopLoading at aof.c:1894 ===" sed -n '1880,1900p' src/aof.c echo "" echo "=== Check loadAppendOnlyFiles caller in server.c:7036 ===" sed -n '7030,7050p' src/server.c echo "" echo "=== Check if AOF_BROKEN_RECOVERED should be treated as success ===" sed -n '1740,1750p' src/aof.c

Repository: qodo-benchmark/redis-combined-coderabbit

Length of output: 2857

🏁 Script executed:

#!/bin/bash # Check stopLoading implementation echo "=== stopLoading function ===" sed -n '3228,3260p' src/rdb.c echo "" echo "=== Look for server.loading usage that might be affected ===" rg -n "server\.loading" src/server.c | head -20

Repository: qodo-benchmark/redis-combined-coderabbit

Length of output: 2937

Include AOF_BROKEN_RECOVERED in the stopLoading success condition

At line 1894, stopLoading() is called with ret == AOF_OK || ret == AOF_TRUNCATED, but this omits AOF_BROKEN_RECOVERED. The caller in server.c:7036 treats AOF_BROKEN_RECOVERED as a successful recovery (doesn't exit, logs success), yet stopLoading(0) fires REDISMODULE_SUBEVENT_LOADING_FAILED to modules. This inconsistency can cause modules to react incorrectly to what is actually a successful recovery.

Fix

- stopLoading(ret == AOF_OK || ret == AOF_TRUNCATED); + stopLoading(ret == AOF_OK || ret == AOF_TRUNCATED || ret == AOF_BROKEN_RECOVERED);

This also applies to the corresponding code at lines 1860-1873.

🤖 Prompt for AI Agents

In `@src/aof.c` around lines 1830 - 1842, The stopLoading call currently treats only AOF_OK and AOF_TRUNCATED as success but omits AOF_BROKEN_RECOVERED, causing stopLoading(0) to emit REDISMODULE_SUBEVENT_LOADING_FAILED for a recovery that should be considered successful; update the condition(s) where stopLoading is invoked (look for calls using the local ret variable and checks against AOF_OK || AOF_TRUNCATED) to include AOF_BROKEN_RECOVERED (i.e., treat ret == AOF_OK || ret == AOF_TRUNCATED || ret == AOF_BROKEN_RECOVERED as success), and make this change for both corresponding code paths in aof.c so modules receive the correct loading event.

coderabbitai · 2026-01-22T06:58:08Z

    createOffTConfig("auto-aof-rewrite-min-size", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.aof_rewrite_min_size, 64*1024*1024, MEMORY_CONFIG, NULL, NULL),
    createOffTConfig("loading-process-events-interval-bytes", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 1024, INT_MAX, server.loading_process_events_interval_bytes, 1024*512, INTEGER_CONFIG, NULL, NULL),
+    createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL),



⚠️ Potential issue | 🟡 Minor

Use a wider upper bound and allow size units.
Line 3258 uses LONG_MAX (can be 32‑bit) and INTEGER_CONFIG, which may reject large values and common unit suffixes for a byte-size setting. Prefer LLONG_MAX and MEMORY_CONFIG to align with other size configs and avoid unintended caps.

🛠️ Proposed fix

- createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL), + createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.aof_load_broken_max_size, 4*1024, MEMORY_CONFIG, NULL, NULL),

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

createOffTConfig("auto-aof-rewrite-min-size", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.aof_rewrite_min_size, 64*1024*1024, MEMORY_CONFIG, NULL, NULL),

createOffTConfig("loading-process-events-interval-bytes", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 1024, INT_MAX, server.loading_process_events_interval_bytes, 1024*512, INTEGER_CONFIG, NULL, NULL),

createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL),

createOffTConfig("auto-aof-rewrite-min-size", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.aof_rewrite_min_size, 64*1024*1024, MEMORY_CONFIG, NULL, NULL),

createOffTConfig("loading-process-events-interval-bytes", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 1024, INT_MAX, server.loading_process_events_interval_bytes, 1024*512, INTEGER_CONFIG, NULL, NULL),

createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.aof_load_broken_max_size, 4*1024, MEMORY_CONFIG, NULL, NULL),

🤖 Prompt for AI Agents

In `@src/config.c` around lines 3256 - 3259, The "aof-load-broken-max-size" config created by createOffTConfig uses LONG_MAX and INTEGER_CONFIG which can cap values and reject unit suffixes; update the createOffTConfig call for "aof-load-broken-max-size" to use LLONG_MAX as the upper bound and MEMORY_CONFIG as the type (matching server.aof_load_broken_max_size semantics) so large values and byte-unit suffixes are accepted.

coderabbitai · 2026-01-22T06:58:08Z

+    # Remove all incr AOF files to make the base file being the last file
+    exec rm -f $aof_dirpath/appendonly.aof.*
+    start_server_aof [list dir $server_path aof-load-broken yes] {
+        test "Corrupted base AOF (last file): should recover" {
+            assert_equal 1 [is_alive [srv pid]]
+        }
+
+        test "param should be 'ok'" {
+            set client [redis [srv host] [srv port] 0 $::tls]
+            wait_done_loading $client
+            assert {[$client get param] eq "ok"}
+        }
+    }


⚠️ Potential issue | 🟡 Minor

Avoid deleting the base/manifest when making the base the last file

rm -f $aof_dirpath/appendonly.aof.* also matches the base and manifest, which would leave no recoverable AOF for the next assertions. Prefer deleting only INCR files and rewrite the manifest to a base-only entry.

🧩 Suggested fix

- exec rm -f $aof_dirpath/appendonly.aof.* + # Remove only INCR files and reset the manifest so the base is last. + exec rm -f $aof_dirpath/*$::incr_aof_sufix$::aof_format_suffix + create_aof_manifest $aof_dirpath $aof_manifest_file { + append_to_manifest "file ${aof_basename}.1$::base_aof_sufix$::aof_format_suffix seq 1 type b\n" + }

🤖 Prompt for AI Agents

In `@tests/integration/aof.tcl` around lines 723 - 735, The current rm -f $aof_dirpath/appendonly.aof.* will remove the base and manifest too; change the cleanup to only remove INCR files (e.g. target filenames matching the INCR suffix pattern while excluding appendonly.aof and the manifest) and ensure the AOF manifest is rewritten so it points only to the base file (making the base the last file) before calling start_server_aof; update the shell command that removes files and add a small step to recreate/update the manifest (referencing aof_dirpath, the appendonly.aof naming, and the start_server_aof/wait_done_loading test block) so the tests assert against a recoverable base-only AOF.

StavRLevi and others added 22 commits January 21, 2026 15:58

Add auto-repair options for broken AOF tail on startup

e806a17

change aof-load-broken-max-size configuration

6d1baaa

fix of-load-broken-max-size configuration

acde9f1

fix unnecessary changes

6d9e785

remove unnecessary change

c4a03b0

Fix format specifier in AOF error message

a8ba2bd

Fix format specifier in AOF error message

0cb9ced

Fix tcl error msg

b1fd5c2

Add incr file support + remove spaces

56b5707

add tests for base aof

6c1bb33

fix spaces

63f818f

add support to incr aof + add test for an intermediate AOF file is br…

97c4edc

…oken

fix indentation and add more tests

187f338

add tls

2290ae6

add space + file for test

35a7595

try to remove tls for tests

6654b8d

fix tls

fce074d

add aof-load-broken to redis.conf

b2555bb

add expansion about aof-load-broken-max-size

5bc675e

add expansion about aof-load-broken-max-size

e73a7dc

add aof-load-broken-max-size to conf file

a3698e2

update pr

a9b1640

coderabbitai Bot reviewed Jan 22, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add auto-repair options for broken AOF tail on startup#20

tomerqodo commented Jan 22, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jan 22, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jan 22, 2026

Uh oh!

coderabbitai Bot Jan 22, 2026

Uh oh!

coderabbitai Bot Jan 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tomerqodo commented Jan 22, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tomerqodo commented Jan 22, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jan 22, 2026 •

edited

Loading