fix: batch upserts in miner to prevent ChromaDB 1.5.x compaction crashes #796

Open
IzmanIzy wants to merge 1 commit into MemPalace:main from IzmanIzy:fix/batch-upsert-chromadb-compaction

Conversation

@IzmanIzy

Summary

  • Batch all chunks per file into a single collection.upsert() call instead of upserting each chunk individually. This reduces WAL write pressure that causes the Rust compactor in ChromaDB >= 1.5 to crash.
  • Add periodic checkpoints (every 200 files) that release and re-acquire the collection, giving the compactor time to flush background work.
  • Release collection reference at the end of mining for a clean shutdown.
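
A minimal sketch of the per-file batching described in the first bullet, not the PR's actual diff. The helper `upsert_file_chunks`, the ID scheme, and the metadata keys are hypothetical; the only real API assumed is `collection.upsert(ids=..., documents=..., metadatas=...)` from the ChromaDB Python client. A stand-in collection records the calls so the snippet runs without a database.

```python
from typing import Any


class RecordingCollection:
    """Stand-in for a ChromaDB collection; it only records upsert() calls."""

    def __init__(self) -> None:
        self.calls: list[dict[str, Any]] = []

    def upsert(self, ids, documents, metadatas) -> None:
        self.calls.append(
            {"ids": list(ids), "documents": list(documents), "metadatas": list(metadatas)}
        )


def upsert_file_chunks(collection, source_file: str, chunks: list[str]) -> int:
    """Write every chunk of one file with a single upsert() call."""
    if not chunks:
        return 0  # nothing to write, so skip the call entirely
    # Hypothetical ID scheme: "<file>:<chunk index>"
    ids = [f"{source_file}:{i}" for i in range(len(chunks))]
    # Hypothetical metadata keys, for illustration only
    metadatas = [{"source_file": source_file, "chunk_index": i} for i in range(len(chunks))]
    collection.upsert(ids=ids, documents=chunks, metadatas=metadatas)
    return len(chunks)


collection = RecordingCollection()
written = upsert_file_chunks(collection, "notes/a.md", ["first", "second", "third"])
```

Here three chunks result in one recorded `upsert()` call rather than three, which is the reduction in write calls the summary refers to.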

Problem

When mining projects with 100+ files, the miner issues thousands of individual upserts. On ChromaDB 1.5.x this causes:

  1. Segfault (exit code 139) — the Rust compactor corrupts the metadata segment during concurrent individual writes
  2. InternalError: Failed to apply logs to the metadata segment — WAL entries accumulate faster than compaction can process them

Both errors are intermittent on small projects, which makes them hard to reproduce in small test suites, but they occur consistently on real-world knowledge bases (300+ files).

Root Cause

ChromaDB's Rust compactor (introduced in 1.5.x) runs in a background thread. Individual upserts create one WAL entry each, and 2000+ entries in rapid succession overwhelm the compactor's ability to merge them atomically. The previous code already had a comment about hnswlib's thread-unsafe updatePoint path causing segfaults on macOS ARM — this is the same class of bug on the compaction side, now affecting all platforms.
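
To make the write-volume argument concrete, the sketch below counts `upsert()` calls under both strategies. It takes the premise above (one WAL entry per upsert) as given rather than verifying it, and the chunk counts are made up for illustration.

```python
def upsert_calls_per_chunk(chunks_per_file: list[int]) -> int:
    """One upsert() per chunk: the call count equals the total chunk count."""
    return sum(chunks_per_file)


def upsert_calls_per_file(chunks_per_file: list[int]) -> int:
    """One upsert() per file: files without chunks issue no call."""
    return sum(1 for n in chunks_per_file if n > 0)


# Hypothetical project: 300 files with 8 chunks each
chunks_per_file = [8] * 300

per_chunk = upsert_calls_per_chunk(chunks_per_file)  # 2400 calls
per_file = upsert_calls_per_file(chunks_per_file)    # 300 calls
```

Under these assumed numbers, batching cuts the number of write calls by a factor of eight; the real ratio depends on how many chunks each file produces.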

Testing

  • 37 existing miner tests pass (pytest tests/ -k "mine" — 37 passed, 0 failed)
  • Verified on a real 406-file knowledge base (3026 drawers) with ChromaDB 1.5.7 — zero crashes, clean completion with two checkpoint flushes at file 200 and 400
  • Lint (ruff check) and format (ruff format --check) pass
  • No changes to public API or CLI interface
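
The checkpoint schedule reported above (flushes at files 200 and 400 of 406) can be sketched as follows. This description does not show how the PR releases and re-acquires the collection, so that step is modeled with a hypothetical `open_collection` callable; with a real client it might be something like `lambda: client.get_or_create_collection(name)`. The function names and the loop structure are assumptions for illustration.

```python
CHECKPOINT_INTERVAL = 200  # interval stated in the PR description


def checkpoint_positions(total_files: int, interval: int = CHECKPOINT_INTERVAL) -> list[int]:
    """1-based file indices at which a checkpoint would fire."""
    return list(range(interval, total_files + 1, interval))


def mine_with_checkpoints(files, open_collection, process_file, interval=CHECKPOINT_INTERVAL):
    """Process files in order, re-opening the collection every `interval` files."""
    collection = open_collection()
    checkpoints = 0
    for index, path in enumerate(files, start=1):
        process_file(collection, path)
        if index % interval == 0:
            collection = None               # drop the current reference
            collection = open_collection()  # re-acquire before continuing
            checkpoints += 1
    collection = None  # release the reference at the end of mining
    return checkpoints


# Demo with stand-ins: 406 fake files and a no-op per-file step
opened = []


def open_collection():
    opened.append(object())
    return opened[-1]


def process_file(collection, path):
    pass  # the real miner would chunk and upsert here


files = [f"file_{i}.md" for i in range(406)]
checkpoints = mine_with_checkpoints(files, open_collection, process_file)
positions = checkpoint_positions(len(files))  # [200, 400]
```

With 406 files the loop performs two checkpoints and opens the collection three times in total, which matches the "two checkpoint flushes" reported in the test run.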

Test plan

  • ruff check . passes
  • ruff format --check . passes
  • pytest tests/ -v --ignore=tests/benchmarks -k "mine" — 37 passed
  • Manual test: mine 406-file project with ChromaDB 1.5.7 — 3026 drawers, 0 crashes
  • Verify re-mining (modified files) still works correctly
  • Test with ChromaDB 0.6.x to confirm backward compatibility

🤖 Generated with Claude Code

The project miner previously upserted each chunk individually, which
causes excessive WAL turnover in ChromaDB >= 1.5.  The Rust compactor
cannot keep up with the write rate, leading to either:

- `InternalError: Failed to apply logs to the metadata segment`
- Segfault (SIGSEGV, exit code 139) during background compaction

This change batches all chunks from a single file into one `upsert()`
call and introduces periodic checkpoints (every 200 files) that release
and re-acquire the collection, giving the compactor time to flush.

Tested on a 406-file knowledge base (3026 drawers) with ChromaDB 1.5.7
— zero crashes, clean completion with two checkpoint flushes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jphein
Contributor

jphein commented Apr 13, 2026

This addresses the same WAL pressure issue we tackled in #629 — worth noting that PR also adds bulk mtime pre-fetch (bulk_check_mined()) and optional concurrent mining via --workers. The single-file batch approach here is cleaner for a targeted fix though. Might be worth coordinating so the two PRs don't conflict — happy to rebase #629 on top of this if it merges first.


@igorls added the area/mining (File and conversation mining) and bug (Something isn't working) labels on Apr 14, 2026

3 participants