Add auto-repair options for broken AOF tail on startup#20

Open
tomerqodo wants to merge 22 commits into
coderabbit_combined_20260121_augment_sentry_coderabbit_1_base_add_auto-repair_options_for_broken_aof_tail_on_startup_pr102 from
coderabbit_combined_20260121_augment_sentry_coderabbit_1_head_add_auto-repair_options_for_broken_aof_tail_on_startup_pr102

tomerqodo commented Jan 22, 2026

Benchmark PR from qodo-benchmark#102

Summary by CodeRabbit

  • New Features

    • Added aof-load-broken configuration option to enable automatic recovery from corrupted AOF tail files (disabled by default).
    • Added aof-load-broken-max-size configuration to specify the maximum size of broken AOF content to tolerate before recovery.
    • Implemented automatic truncation and recovery of mildly corrupted AOF tails, reducing fatal startup failures in certain corruption scenarios.
  • Tests

    • Added comprehensive integration tests for AOF corruption recovery behavior.

coderabbitai Bot commented Jan 22, 2026

Walkthrough

This PR adds automatic recovery from a corrupted AOF (Append-Only File) tail to Redis through two new configuration options, aof-load-broken and aof-load-broken-max-size. When aof-load-broken is enabled and the corrupted tail is smaller than the aof-load-broken-max-size threshold, the server truncates the file to the last valid command offset and continues loading, treating the corruption as recoverable rather than fatal.
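
The gating described above can be sketched as a small predicate. This is a minimal illustration, not the code in src/aof.c: the struct, the function name, and the strict `<` comparison are assumptions made for the example, and only the two option names come from the PR.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical stand-ins for the two new options (illustrative names). */
typedef struct {
    bool load_broken;            /* mirrors aof-load-broken */
    off_t load_broken_max_size;  /* mirrors aof-load-broken-max-size, in bytes */
} aof_recovery_opts;

/* Returns true when the bytes after the last valid command offset may be
 * truncated away instead of aborting startup. Whether the real comparison
 * is strict is an assumption taken from the walkthrough wording. */
bool can_recover_broken_tail(const aof_recovery_opts *opts,
                             off_t file_size, off_t valid_up_to) {
    if (!opts->load_broken) return false;
    if (valid_up_to < 0 || valid_up_to > file_size) return false;
    off_t broken_tail = file_size - valid_up_to;
    return broken_tail < opts->load_broken_max_size;
}
```

With the default threshold of 4096 bytes, a 10000-byte file whose last valid command ends at offset 9000 would qualify (1000 broken bytes), while one whose last valid command ends at offset 1000 would not.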

Changes

| Cohort / File(s) | Summary |
|---|---|
| AOF Configuration: redis.conf | Added configuration parameters for broken AOF handling: aof-load-broken (boolean, default no) and aof-load-broken-max-size (bytes, default 4096). Includes detailed comments on recovery behavior and interaction with redis-check-aof. |
| AOF Recovery Logic: src/aof.c | Implements automatic truncation and recovery when AOF tail corruption is detected and recovery is enabled. Tracks valid command offsets, returns new AOF_BROKEN_RECOVERED status, and treats this state as acceptable in applicable loading scenarios. Preserves existing fatal paths for stricter error conditions. |
| Configuration Management: src/config.c | Registered new boolean config aof-load-broken and offset config aof-load-broken-max-size in the static configuration set with appropriate bounds and defaults. |
| Server State: src/server.h | Added AOF recovery constants (AOF_BROKEN_RECOVERED macro), new server struct fields (aof_load_broken, aof_load_broken_max_size), and updated header guard macro names. |
| Test Coverage: tests/integration/aof.tcl | Added 136 lines of integration tests validating recovery behavior across single and multi-file AOF scenarios, corruption positioning, log assertions, and interaction with configuration thresholds. |

Sequence Diagram

sequenceDiagram
    participant Server as Server Init
    participant Config as Config Loader
    participant AOFLoader as AOF Loader
    participant FileOps as File Operations
    participant Logger as Logger

    Server->>Config: Load configuration
    Config->>Server: aof-load-broken = enabled, max-size = 4096
    Server->>AOFLoader: Start AOF loading
    AOFLoader->>FileOps: Read AOF file
    FileOps->>AOFLoader: Format error at tail
    AOFLoader->>AOFLoader: Check: tail size < max-size?
    alt Tail within threshold
        AOFLoader->>Logger: Log recovery attempt
        AOFLoader->>FileOps: Truncate to last valid offset
        FileOps->>AOFLoader: Truncation complete
        AOFLoader->>Logger: Log broken tail size removed
        AOFLoader->>Server: Return AOF_BROKEN_RECOVERED
        Server->>Server: Accept recovery, continue
    else Tail exceeds threshold
        AOFLoader->>Logger: Log fatal corruption
        AOFLoader->>Server: Return error
        Server->>Server: Halt loading
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Oh, when the AOF tales grow worn and cracked,
With broken tails that cannot be intact,
A gentle fix now mends the shattered file,
Recovering lost hops with Redis style!
No longer must we fail on damage small,
This fuzzy feature helps us bounce back tall! 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add auto-repair options for broken AOF tail on startup' is directly related to and accurately summarizes the main change in the pull request. All modifications across configuration files, source code, and tests align with adding auto-repair functionality via aof-load-broken and aof-load-broken-max-size options. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/aof.c`:
- Around line 1830-1842: The stopLoading call currently treats only AOF_OK and
AOF_TRUNCATED as success but omits AOF_BROKEN_RECOVERED, causing stopLoading(0)
to emit REDISMODULE_SUBEVENT_LOADING_FAILED for a recovery that should be
considered successful; update the condition(s) where stopLoading is invoked
(look for calls using the local ret variable and checks against AOF_OK ||
AOF_TRUNCATED) to include AOF_BROKEN_RECOVERED (i.e., treat ret == AOF_OK || ret
== AOF_TRUNCATED || ret == AOF_BROKEN_RECOVERED as success), and make this
change for both corresponding code paths in aof.c so modules receive the correct
loading event.

In `@src/config.c`:
- Around line 3256-3259: The "aof-load-broken-max-size" config created by
createOffTConfig uses LONG_MAX and INTEGER_CONFIG which can cap values and
reject unit suffixes; update the createOffTConfig call for
"aof-load-broken-max-size" to use LLONG_MAX as the upper bound and MEMORY_CONFIG
as the type (matching server.aof_load_broken_max_size semantics) so large values
and byte-unit suffixes are accepted.

In `@tests/integration/aof.tcl`:
- Around line 723-735: The current rm -f $aof_dirpath/appendonly.aof.* will
remove the base and manifest too; change the cleanup to only remove INCR files
(e.g. target filenames matching the INCR suffix pattern while excluding
appendonly.aof and the manifest) and ensure the AOF manifest is rewritten so it
points only to the base file (making the base the last file) before calling
start_server_aof; update the shell command that removes files and add a small
step to recreate/update the manifest (referencing aof_dirpath, the
appendonly.aof naming, and the start_server_aof/wait_done_loading test block) so
the tests assert against a recoverable base-only AOF.
🧹 Nitpick comments (1)
src/server.h (1)

2009-2011: Fix minor comment typo for clarity.
“than” → “that”.

📝 Suggested edit
-    off_t aof_load_broken_max_size; /* The max size of broken AOF tail than can be ignored. */
+    off_t aof_load_broken_max_size; /* The max size of broken AOF tail that can be ignored. */
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d363891 and a9b1640.

📒 Files selected for processing (5)
  • redis.conf
  • src/aof.c
  • src/config.c
  • src/server.h
  • tests/integration/aof.tcl
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
  • GitHub Check: code-coverage
  • GitHub Check: test-external-cluster
  • GitHub Check: build-32bit
  • GitHub Check: test-external-standalone
  • GitHub Check: test-external-nodebug
  • GitHub Check: Analyze (cpp)
  • GitHub Check: build-old-chain-jemalloc
  • GitHub Check: test-ubuntu-latest
  • GitHub Check: test-sanitizer-address
  • GitHub Check: build-centos-jemalloc
  • GitHub Check: build-old-chain-jemalloc
  • GitHub Check: test-sanitizer-address
  • GitHub Check: build-centos-jemalloc
  • GitHub Check: code-coverage
  • GitHub Check: test-ubuntu-latest
  • GitHub Check: test-external-cluster
  • GitHub Check: test-external-standalone
  • GitHub Check: test-external-nodebug
🔇 Additional comments (6)
src/server.h (2)

15-16: Header guard update looks good.


348-348: New AOF load return code is clear and consistent.

src/config.c (1)

3092-3094: LGTM: config wiring is consistent with other AOF load toggles.
Default off aligns with conservative recovery behavior.

redis.conf (1)

1532-1549: Docs are clear and appropriately cautionary.
The explanation of risks and behavior is concise and user-friendly.

src/aof.c (2)

1658-1662: Track valid offsets only when recovery/truncation is enabled

Keeps valid_up_to aligned with the recovery path without extra ftello work when both options are off.


1722-1756: Auto-repair gating looks solid

The recovery path is correctly guarded by config and size threshold, and keeps non-recoverable cases on the fatal path.

Comment thread src/aof.c
Comment on lines +1830 to 1842
     if (ret == AOF_OK || ((ret == AOF_TRUNCATED || ret == AOF_BROKEN_RECOVERED) && last_file)) {
         serverLog(LL_NOTICE, "DB loaded from base file %s: %.3f seconds",
             aof_name, (float)(ustime()-start)/1000000);
     }

     /* If the truncated file is not the last file, we consider this to be a fatal error. */
-    if (ret == AOF_TRUNCATED && !last_file) {
+    if ((ret == AOF_TRUNCATED || ret == AOF_BROKEN_RECOVERED) && !last_file) {
         ret = AOF_FAILED;
         serverLog(LL_WARNING, "Fatal error: the truncated file is not the last file");
     }

-    if (ret == AOF_OPEN_ERR || ret == AOF_FAILED) {
+    if (ret == AOF_OPEN_ERR || ret == AOF_FAILED || ret == AOF_BROKEN_RECOVERED) {
         goto cleanup;

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check AOF_BROKEN_RECOVERED usage and related functions
echo "=== AOF_BROKEN_RECOVERED occurrences ==="
rg -n "AOF_BROKEN_RECOVERED" src

echo ""
echo "=== loadAppendOnlyFiles callers ==="
rg -n "loadAppendOnlyFiles\(" src

echo ""
echo "=== stopLoading usage ==="
rg -n "stopLoading\(" src

echo ""
echo "=== Check AOF return value enum ==="
rg -n "AOF_OK|AOF_TRUNCATED|AOF_FAILED|AOF_OPEN_ERR" src/aof.h

Repository: qodo-benchmark/redis-combined-coderabbit

Length of output: 1909


🏁 Script executed:

#!/bin/bash
# Check context around stopLoading at line 1894
echo "=== Context around stopLoading at aof.c:1894 ==="
sed -n '1880,1900p' src/aof.c

echo ""
echo "=== Check loadAppendOnlyFiles caller in server.c:7036 ==="
sed -n '7030,7050p' src/server.c

echo ""
echo "=== Check if AOF_BROKEN_RECOVERED should be treated as success ==="
sed -n '1740,1750p' src/aof.c

Repository: qodo-benchmark/redis-combined-coderabbit

Length of output: 2857


🏁 Script executed:

#!/bin/bash
# Check stopLoading implementation
echo "=== stopLoading function ==="
sed -n '3228,3260p' src/rdb.c

echo ""
echo "=== Look for server.loading usage that might be affected ==="
rg -n "server\.loading" src/server.c | head -20

Repository: qodo-benchmark/redis-combined-coderabbit

Length of output: 2937


Include AOF_BROKEN_RECOVERED in the stopLoading success condition

At line 1894, stopLoading() is called with ret == AOF_OK || ret == AOF_TRUNCATED, but this omits AOF_BROKEN_RECOVERED. The caller in server.c:7036 treats AOF_BROKEN_RECOVERED as a successful recovery (doesn't exit, logs success), yet stopLoading(0) fires REDISMODULE_SUBEVENT_LOADING_FAILED to modules. This inconsistency can cause modules to react incorrectly to what is actually a successful recovery.

Fix
-    stopLoading(ret == AOF_OK || ret == AOF_TRUNCATED);
+    stopLoading(ret == AOF_OK || ret == AOF_TRUNCATED || ret == AOF_BROKEN_RECOVERED);

This also applies to the corresponding code at lines 1860-1873.
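
One way to keep the two call sites from drifting apart is to route both through a single predicate. The sketch below is an assumption-laden illustration: the enum values and the helper name are hypothetical, and the real status macros are defined in src/server.h with values not shown in this review.

```c
#include <stdbool.h>

/* Hypothetical values for illustration only; the real AOF_* macros live in
 * src/server.h and may use different numbers. */
enum {
    SKETCH_AOF_OK = 0,
    SKETCH_AOF_FAILED,
    SKETCH_AOF_TRUNCATED,
    SKETCH_AOF_BROKEN_RECOVERED
};

/* Hypothetical helper: true for every status that should be reported to
 * modules as a successful load, including a recovered broken tail. */
bool aof_load_succeeded(int ret) {
    return ret == SKETCH_AOF_OK ||
           ret == SKETCH_AOF_TRUNCATED ||
           ret == SKETCH_AOF_BROKEN_RECOVERED;
}
```

Under that assumption, each call would read `stopLoading(aof_load_succeeded(ret))`, so a future status code only has to be classified in one place.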

Comment thread src/config.c
Comment on lines 3256 to 3259
createOffTConfig("auto-aof-rewrite-min-size", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.aof_rewrite_min_size, 64*1024*1024, MEMORY_CONFIG, NULL, NULL),
createOffTConfig("loading-process-events-interval-bytes", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 1024, INT_MAX, server.loading_process_events_interval_bytes, 1024*512, INTEGER_CONFIG, NULL, NULL),
createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL),

⚠️ Potential issue | 🟡 Minor

Use a wider upper bound and allow size units.
Line 3258 uses LONG_MAX (can be 32‑bit) and INTEGER_CONFIG, which may reject large values and common unit suffixes for a byte-size setting. Prefer LLONG_MAX and MEMORY_CONFIG to align with other size configs and avoid unintended caps.

🛠️ Proposed fix
-    createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL),
+    createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.aof_load_broken_max_size, 4*1024, MEMORY_CONFIG, NULL, NULL),
Suggested change
     createOffTConfig("auto-aof-rewrite-min-size", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.aof_rewrite_min_size, 64*1024*1024, MEMORY_CONFIG, NULL, NULL),
     createOffTConfig("loading-process-events-interval-bytes", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 1024, INT_MAX, server.loading_process_events_interval_bytes, 1024*512, INTEGER_CONFIG, NULL, NULL),
-    createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL),
+    createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.aof_load_broken_max_size, 4*1024, MEMORY_CONFIG, NULL, NULL),
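
To illustrate what accepting byte-unit suffixes means for a size setting, here is a simplified standalone parser. It is not Redis's own parsing code: the function name is hypothetical, and it handles only plain integers and the binary suffixes kb, mb, and gb, rejecting anything else.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <strings.h>

/* Hypothetical, simplified size parser: accepts "4096", "4kb", "1mb", "1gb"
 * (case-insensitive suffix). Returns the size in bytes, or -1 on error. */
long long parse_size_sketch(const char *s) {
    char *end;
    errno = 0;
    long long n = strtoll(s, &end, 10);
    if (end == s || errno == ERANGE || n < 0) return -1;

    long long mul;
    if (*end == '\0') mul = 1;
    else if (strcasecmp(end, "kb") == 0) mul = 1024LL;
    else if (strcasecmp(end, "mb") == 0) mul = 1024LL * 1024;
    else if (strcasecmp(end, "gb") == 0) mul = 1024LL * 1024 * 1024;
    else return -1;

    /* Guard against overflow before multiplying. */
    if (n > LLONG_MAX / mul) return -1;
    return n * mul;
}
```

The 64-bit return type matters for the same reason the review prefers LLONG_MAX: on platforms where long is 32 bits, a LONG_MAX bound would cap the setting at about 2 GiB.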
Comment thread tests/integration/aof.tcl
Comment on lines +723 to +735
# Remove all incr AOF files to make the base file being the last file
exec rm -f $aof_dirpath/appendonly.aof.*
start_server_aof [list dir $server_path aof-load-broken yes] {
    test "Corrupted base AOF (last file): should recover" {
        assert_equal 1 [is_alive [srv pid]]
    }

    test "param should be 'ok'" {
        set client [redis [srv host] [srv port] 0 $::tls]
        wait_done_loading $client
        assert {[$client get param] eq "ok"}
    }
}

⚠️ Potential issue | 🟡 Minor

Avoid deleting the base/manifest when making the base the last file

rm -f $aof_dirpath/appendonly.aof.* also matches the base and manifest, which would leave no recoverable AOF for the next assertions. Prefer deleting only INCR files and rewrite the manifest to a base-only entry.

🧩 Suggested fix
-    exec rm -f $aof_dirpath/appendonly.aof.*
+    # Remove only INCR files and reset the manifest so the base is last.
+    exec rm -f $aof_dirpath/*$::incr_aof_sufix$::aof_format_suffix
+    create_aof_manifest $aof_dirpath $aof_manifest_file {
+        append_to_manifest "file ${aof_basename}.1$::base_aof_sufix$::aof_format_suffix seq 1 type b\n"
+    }
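
The difference between the two patterns can be checked with POSIX fnmatch, assuming the pattern is expanded with ordinary glob semantics. The file names below are assumptions for illustration (a base file, an INCR file, and a manifest following the appendonly.aof prefix), and the `*.incr.aof` pattern assumes the suffix variables expand to `.incr` and `.aof`.

```c
#include <fnmatch.h>
#include <stdbool.h>

/* Thin wrapper around POSIX fnmatch(): true when name matches pattern. */
bool glob_matches(const char *pattern, const char *name) {
    return fnmatch(pattern, name, 0) == 0;
}
```

Under these assumptions, `appendonly.aof.*` matches `appendonly.aof.1.base.aof`, `appendonly.aof.1.incr.aof`, and `appendonly.aof.manifest` alike, whereas `*.incr.aof` matches only the INCR file.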