Add auto-repair options for broken AOF tail on startup#20
Conversation
WalkthroughRedis implements automatic recovery from corrupted AOF (Append-Only File) tails through new configuration options Changes
Sequence DiagramsequenceDiagram
participant Server as Server Init
participant Config as Config Loader
participant AOFLoader as AOF Loader
participant FileOps as File Operations
participant Logger as Logger
Server->>Config: Load configuration
Config->>Server: aof-load-broken = enabled, max-size = 4096
Server->>AOFLoader: Start AOF loading
AOFLoader->>FileOps: Read AOF file
FileOps->>AOFLoader: Format error at tail
AOFLoader->>AOFLoader: Check: tail size < max-size?
alt Tail within threshold
AOFLoader->>Logger: Log recovery attempt
AOFLoader->>FileOps: Truncate to last valid offset
FileOps->>AOFLoader: Truncation complete
AOFLoader->>Logger: Log broken tail size removed
AOFLoader->>Server: Return AOF_BROKEN_RECOVERED
Server->>Server: Accept recovery, continue
else Tail exceeds threshold
AOFLoader->>Logger: Log fatal corruption
AOFLoader->>Server: Return error
Server->>Server: Halt loading
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/aof.c`:
- Around line 1830-1842: The stopLoading call currently treats only AOF_OK and
AOF_TRUNCATED as success but omits AOF_BROKEN_RECOVERED, causing stopLoading(0)
to emit REDISMODULE_SUBEVENT_LOADING_FAILED for a recovery that should be
considered successful; update the condition(s) where stopLoading is invoked
(look for calls using the local ret variable and checks against AOF_OK ||
AOF_TRUNCATED) to include AOF_BROKEN_RECOVERED (i.e., treat ret == AOF_OK || ret
== AOF_TRUNCATED || ret == AOF_BROKEN_RECOVERED as success), and make this
change for both corresponding code paths in aof.c so modules receive the correct
loading event.
In `@src/config.c`:
- Around line 3256-3259: The "aof-load-broken-max-size" config created by
createOffTConfig uses LONG_MAX and INTEGER_CONFIG which can cap values and
reject unit suffixes; update the createOffTConfig call for
"aof-load-broken-max-size" to use LLONG_MAX as the upper bound and MEMORY_CONFIG
as the type (matching server.aof_load_broken_max_size semantics) so large values
and byte-unit suffixes are accepted.
In `@tests/integration/aof.tcl`:
- Around line 723-735: The current rm -f $aof_dirpath/appendonly.aof.* will
remove the base and manifest too; change the cleanup to only remove INCR files
(e.g. target filenames matching the INCR suffix pattern while excluding
appendonly.aof and the manifest) and ensure the AOF manifest is rewritten so it
points only to the base file (making the base the last file) before calling
start_server_aof; update the shell command that removes files and add a small
step to recreate/update the manifest (referencing aof_dirpath, the
appendonly.aof naming, and the start_server_aof/wait_done_loading test block) so
the tests assert against a recoverable base-only AOF.
🧹 Nitpick comments (1)
src/server.h (1)
2009-2011: Fix minor comment typo for clarity.
“than” → “that”.📝 Suggested edit
- off_t aof_load_broken_max_size; /* The max size of broken AOF tail than can be ignored. */ + off_t aof_load_broken_max_size; /* The max size of broken AOF tail that can be ignored. */
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
redis.confsrc/aof.csrc/config.csrc/server.htests/integration/aof.tcl
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (18)
- GitHub Check: code-coverage
- GitHub Check: test-external-cluster
- GitHub Check: build-32bit
- GitHub Check: test-external-standalone
- GitHub Check: test-external-nodebug
- GitHub Check: Analyze (cpp)
- GitHub Check: build-old-chain-jemalloc
- GitHub Check: test-ubuntu-latest
- GitHub Check: test-sanitizer-address
- GitHub Check: build-centos-jemalloc
- GitHub Check: build-old-chain-jemalloc
- GitHub Check: test-sanitizer-address
- GitHub Check: build-centos-jemalloc
- GitHub Check: code-coverage
- GitHub Check: test-ubuntu-latest
- GitHub Check: test-external-cluster
- GitHub Check: test-external-standalone
- GitHub Check: test-external-nodebug
🔇 Additional comments (6)
src/server.h (2)
15-16: Header guard update looks good.
348-348: New AOF load return code is clear and consistent.src/config.c (1)
3092-3094: LGTM: config wiring is consistent with other AOF load toggles.
Default off aligns with conservative recovery behavior.redis.conf (1)
1532-1549: Docs are clear and appropriately cautionary.
The explanation of risks and behavior is concise and user-friendly.src/aof.c (2)
1658-1662: Track valid offsets only when recovery/truncation is enabledKeeps
valid_up_toaligned with the recovery path without extraftellowork when both options are off.
1722-1756: Auto-repair gating looks solidThe recovery path is correctly guarded by config and size threshold, and keeps non-recoverable cases on the fatal path.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
| if (ret == AOF_OK || ((ret == AOF_TRUNCATED || ret == AOF_BROKEN_RECOVERED) && last_file)) { | ||
| serverLog(LL_NOTICE, "DB loaded from base file %s: %.3f seconds", | ||
| aof_name, (float)(ustime()-start)/1000000); | ||
| } | ||
|
|
||
| /* If the truncated file is not the last file, we consider this to be a fatal error. */ | ||
| if (ret == AOF_TRUNCATED && !last_file) { | ||
| if ((ret == AOF_TRUNCATED || ret == AOF_BROKEN_RECOVERED) && !last_file) { | ||
| ret = AOF_FAILED; | ||
| serverLog(LL_WARNING, "Fatal error: the truncated file is not the last file"); | ||
| } | ||
|
|
||
| if (ret == AOF_OPEN_ERR || ret == AOF_FAILED) { | ||
| if (ret == AOF_OPEN_ERR || ret == AOF_FAILED || ret == AOF_BROKEN_RECOVERED) { | ||
| goto cleanup; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check AOF_BROKEN_RECOVERED usage and related functions
echo "=== AOF_BROKEN_RECOVERED occurrences ==="
rg -n "AOF_BROKEN_RECOVERED" src
echo ""
echo "=== loadAppendOnlyFiles callers ==="
rg -n "loadAppendOnlyFiles\(" src
echo ""
echo "=== stopLoading usage ==="
rg -n "stopLoading\(" src
echo ""
echo "=== Check AOF return value enum ==="
rg -n "AOF_OK|AOF_TRUNCATED|AOF_FAILED|AOF_OPEN_ERR" src/aof.hRepository: qodo-benchmark/redis-combined-coderabbit
Length of output: 1909
🏁 Script executed:
#!/bin/bash
# Check context around stopLoading at line 1894
echo "=== Context around stopLoading at aof.c:1894 ==="
sed -n '1880,1900p' src/aof.c
echo ""
echo "=== Check loadAppendOnlyFiles caller in server.c:7036 ==="
sed -n '7030,7050p' src/server.c
echo ""
echo "=== Check if AOF_BROKEN_RECOVERED should be treated as success ==="
sed -n '1740,1750p' src/aof.cRepository: qodo-benchmark/redis-combined-coderabbit
Length of output: 2857
🏁 Script executed:
#!/bin/bash
# Check stopLoading implementation
echo "=== stopLoading function ==="
sed -n '3228,3260p' src/rdb.c
echo ""
echo "=== Look for server.loading usage that might be affected ==="
rg -n "server\.loading" src/server.c | head -20Repository: qodo-benchmark/redis-combined-coderabbit
Length of output: 2937
Include AOF_BROKEN_RECOVERED in the stopLoading success condition
At line 1894, stopLoading() is called with ret == AOF_OK || ret == AOF_TRUNCATED, but this omits AOF_BROKEN_RECOVERED. The caller in server.c:7036 treats AOF_BROKEN_RECOVERED as a successful recovery (doesn't exit, logs success), yet stopLoading(0) fires REDISMODULE_SUBEVENT_LOADING_FAILED to modules. This inconsistency can cause modules to react incorrectly to what is actually a successful recovery.
Fix
- stopLoading(ret == AOF_OK || ret == AOF_TRUNCATED);
+ stopLoading(ret == AOF_OK || ret == AOF_TRUNCATED || ret == AOF_BROKEN_RECOVERED);This also applies to the corresponding code at lines 1860-1873.
🤖 Prompt for AI Agents
In `@src/aof.c` around lines 1830 - 1842, The stopLoading call currently treats
only AOF_OK and AOF_TRUNCATED as success but omits AOF_BROKEN_RECOVERED, causing
stopLoading(0) to emit REDISMODULE_SUBEVENT_LOADING_FAILED for a recovery that
should be considered successful; update the condition(s) where stopLoading is
invoked (look for calls using the local ret variable and checks against AOF_OK
|| AOF_TRUNCATED) to include AOF_BROKEN_RECOVERED (i.e., treat ret == AOF_OK ||
ret == AOF_TRUNCATED || ret == AOF_BROKEN_RECOVERED as success), and make this
change for both corresponding code paths in aof.c so modules receive the correct
loading event.
| createOffTConfig("auto-aof-rewrite-min-size", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.aof_rewrite_min_size, 64*1024*1024, MEMORY_CONFIG, NULL, NULL), | ||
| createOffTConfig("loading-process-events-interval-bytes", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 1024, INT_MAX, server.loading_process_events_interval_bytes, 1024*512, INTEGER_CONFIG, NULL, NULL), | ||
| createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL), | ||
|
|
There was a problem hiding this comment.
Use a wider upper bound and allow size units.
Line 3258 uses LONG_MAX (can be 32‑bit) and INTEGER_CONFIG, which may reject large values and common unit suffixes for a byte-size setting. Prefer LLONG_MAX and MEMORY_CONFIG to align with other size configs and avoid unintended caps.
🛠️ Proposed fix
- createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL),
+ createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.aof_load_broken_max_size, 4*1024, MEMORY_CONFIG, NULL, NULL),📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| createOffTConfig("auto-aof-rewrite-min-size", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.aof_rewrite_min_size, 64*1024*1024, MEMORY_CONFIG, NULL, NULL), | |
| createOffTConfig("loading-process-events-interval-bytes", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 1024, INT_MAX, server.loading_process_events_interval_bytes, 1024*512, INTEGER_CONFIG, NULL, NULL), | |
| createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.aof_load_broken_max_size, 4*1024, INTEGER_CONFIG, NULL, NULL), | |
| createOffTConfig("auto-aof-rewrite-min-size", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.aof_rewrite_min_size, 64*1024*1024, MEMORY_CONFIG, NULL, NULL), | |
| createOffTConfig("loading-process-events-interval-bytes", NULL, MODIFIABLE_CONFIG | HIDDEN_CONFIG, 1024, INT_MAX, server.loading_process_events_interval_bytes, 1024*512, INTEGER_CONFIG, NULL, NULL), | |
| createOffTConfig("aof-load-broken-max-size", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.aof_load_broken_max_size, 4*1024, MEMORY_CONFIG, NULL, NULL), |
🤖 Prompt for AI Agents
In `@src/config.c` around lines 3256 - 3259, The "aof-load-broken-max-size" config
created by createOffTConfig uses LONG_MAX and INTEGER_CONFIG which can cap
values and reject unit suffixes; update the createOffTConfig call for
"aof-load-broken-max-size" to use LLONG_MAX as the upper bound and MEMORY_CONFIG
as the type (matching server.aof_load_broken_max_size semantics) so large values
and byte-unit suffixes are accepted.
| # Remove all incr AOF files to make the base file being the last file | ||
| exec rm -f $aof_dirpath/appendonly.aof.* | ||
| start_server_aof [list dir $server_path aof-load-broken yes] { | ||
| test "Corrupted base AOF (last file): should recover" { | ||
| assert_equal 1 [is_alive [srv pid]] | ||
| } | ||
|
|
||
| test "param should be 'ok'" { | ||
| set client [redis [srv host] [srv port] 0 $::tls] | ||
| wait_done_loading $client | ||
| assert {[$client get param] eq "ok"} | ||
| } | ||
| } |
There was a problem hiding this comment.
Avoid deleting the base/manifest when making the base the last file
rm -f $aof_dirpath/appendonly.aof.* also matches the base and manifest, which would leave no recoverable AOF for the next assertions. Prefer deleting only INCR files and rewrite the manifest to a base-only entry.
🧩 Suggested fix
- exec rm -f $aof_dirpath/appendonly.aof.*
+ # Remove only INCR files and reset the manifest so the base is last.
+ exec rm -f $aof_dirpath/*$::incr_aof_sufix$::aof_format_suffix
+ create_aof_manifest $aof_dirpath $aof_manifest_file {
+ append_to_manifest "file ${aof_basename}.1$::base_aof_sufix$::aof_format_suffix seq 1 type b\n"
+ }🤖 Prompt for AI Agents
In `@tests/integration/aof.tcl` around lines 723 - 735, The current rm -f
$aof_dirpath/appendonly.aof.* will remove the base and manifest too; change the
cleanup to only remove INCR files (e.g. target filenames matching the INCR
suffix pattern while excluding appendonly.aof and the manifest) and ensure the
AOF manifest is rewritten so it points only to the base file (making the base
the last file) before calling start_server_aof; update the shell command that
removes files and add a small step to recreate/update the manifest (referencing
aof_dirpath, the appendonly.aof naming, and the
start_server_aof/wait_done_loading test block) so the tests assert against a
recoverable base-only AOF.
Benchmark PR from qodo-benchmark#102
Summary by CodeRabbit
New Features
aof-load-brokenconfiguration option to enable automatic recovery from corrupted AOF tail files (disabled by default).aof-load-broken-max-sizeconfiguration to specify the maximum size of broken AOF content to tolerate before recovery.Tests
✏️ Tip: You can customize this high-level summary in your review settings.