
Replace SQLite with JSONL for client-side fallback buffer#495

Draft
abidlabs wants to merge 2 commits into main from jsonl-client-buffer

@abidlabs
Member

Summary

  • Replace SQLite with append-only JSONL files for the client-side fallback buffer (used when the remote Trackio Space is unreachable during training)
  • SQLite's file locking and WAL journal mode break on networked filesystems like Lustre, causing sqlite3.DatabaseError: database disk image is malformed during distributed training on /fsx
  • JSONL needs no locking, no journal mode, and no pragma configuration; append-only writes do not depend on the file-locking and WAL support that networked filesystems often handle incorrectly
  • All _persist_*_locally methods are now wrapped in try/except so a storage failure can never kill the background logging thread
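A minimal sketch of the append-only pattern the bullets above describe, not the code in this PR. The function name `persist_locally`, the file name `logs.jsonl`, and the record shape are assumptions made for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def persist_locally(pending_dir: Path, record: dict) -> bool:
    """Append one record as a JSON line; return False instead of raising."""
    try:
        pending_dir.mkdir(parents=True, exist_ok=True)
        # Serialize first so a bad record never leaves a partial line behind.
        line = json.dumps(record) + "\n"
        # Mode "a" appends at the end of the file; no lock file, journal,
        # or pragma is involved.
        with open(pending_dir / "logs.jsonl", "a", encoding="utf-8") as f:
            f.write(line)
            f.flush()
            os.fsync(f.fileno())
        return True
    except (OSError, TypeError, ValueError) as exc:
        # Hypothetical policy: report the failure and drop the record so the
        # caller (e.g. a background logging thread) keeps running.
        print(f"could not buffer record locally: {exc}")
        return False


# Usage: buffer two records in a throwaway directory.
pending = Path(tempfile.mkdtemp()) / "pending"
persist_locally(pending, {"step": 1, "loss": 0.5})
persist_locally(pending, {"step": 2, "loss": 0.4})
```

Whether appends from several processes interleave cleanly on Lustre or NFS is not something this sketch settles; that is what the last item in the test plan is meant to check. Giving each process its own file is one way to avoid the question.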

Context

A distributed training run on 8 GPUs crashed. During that run, trackio's remote logging hit QueueError: Queue is full! and the local SQLite fallback hit a corrupted database on Lustre. The SQLite corruption did not cause the training crash itself (that was a separate SIGBUS from the filesystem), but it did kill the background logging thread, so all subsequent metrics were silently lost.
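The failure mode above, one storage error ending the logging thread, can be reproduced in a few lines. This is an illustrative sketch only: `worker_loop` and `flaky_persist` are hypothetical names, and the error is raised by hand rather than by a real corrupted database.

```python
import queue
import sqlite3
import threading


def worker_loop(q, persist, errors):
    """Drain q until a None sentinel arrives; record persist() failures and keep going."""
    while True:
        item = q.get()
        if item is None:
            break
        try:
            persist(item)
        except Exception as exc:
            # Without this guard the exception would end the thread,
            # and every later item would be silently dropped.
            errors.append(exc)


stored, errors = [], []


def flaky_persist(item):
    # Simulates the error from this PR on the second item only.
    if item["step"] == 2:
        raise sqlite3.DatabaseError("database disk image is malformed")
    stored.append(item)


q = queue.Queue()
for step in (1, 2, 3):
    q.put({"step": step})
q.put(None)  # sentinel: tells the worker to stop

t = threading.Thread(target=worker_loop, args=(q, flaky_persist, errors), daemon=True)
t.start()
t.join(timeout=5)
```

After the join, `stored` holds steps 1 and 3 and `errors` holds the one simulated failure, so the step after the storage error is still recorded.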

Test plan

  • All existing tests pass (102 passed, 16 skipped)
  • Test fallback path: start a run with space_id, make remote server unreachable, verify JSONL files are created in ~/.cache/huggingface/trackio/pending/
  • Test flush path: reconnect server, verify buffered metrics are sent and JSONL files are cleaned up
  • Test on networked filesystem (Lustre/NFS) with multiple concurrent writers
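The flush item in the plan above could be exercised against something like the following sketch. It assumes one JSON object per line in `*.jsonl` files and a caller-supplied `send` callable; `flush_pending` is a hypothetical name, not trackio's API.

```python
import json
import tempfile
from pathlib import Path


def flush_pending(pending_dir: Path, send) -> int:
    """Send buffered records via send(); delete a file only after all of its records were sent."""
    sent = 0
    for path in sorted(pending_dir.glob("*.jsonl")):
        records = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                try:
                    records.append(json.loads(line))
                except json.JSONDecodeError:
                    # A torn line (e.g. from an interrupted write) is skipped.
                    continue
        try:
            for record in records:
                send(record)
        except Exception:
            # Server still unreachable: keep the file for the next attempt.
            continue
        sent += len(records)
        path.unlink()
    return sent


# Usage: two complete records plus one torn final line.
pending = Path(tempfile.mkdtemp())
(pending / "run.jsonl").write_text('{"step": 1}\n{"step": 2}\n{"step": 3', encoding="utf-8")
received = []
flushed = flush_pending(pending, received.append)
```

One design consequence to keep in mind: if `send` fails partway through a file, the records already sent will be sent again on the next attempt, so the receiving side would need to tolerate duplicates.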

🤖 Generated with Claude Code

The client-side fallback buffer (used when the remote Trackio Space is
unreachable) previously wrote to SQLite, which requires file locking and
WAL journal mode -- both of which break on networked filesystems like
Lustre. During distributed training on /fsx, the SQLite DB became
corrupted and the resulting error killed the background logging thread.

Replace with append-only JSONL files that need no locking or journal
mode. Each persist/flush operation is also wrapped in try/except so a
storage failure can never kill the background logging thread.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gradio-pr-bot
Contributor

gradio-pr-bot commented Apr 11, 2026

🪼 branch checks and previews

Name Status URL
🦄 Changes detected! Details

@gradio-pr-bot
Contributor

🦄 change detected

This Pull Request includes changes to the following packages.

Package Version
trackio minor

  • Replace SQLite with JSONL for client-side fallback buffer

‼️ Changeset not approved. Ensure the version bump is appropriate for all packages before approving.

  • Maintainers can approve the changeset by checking this checkbox.

Something isn't right?

  • Maintainers can change the version label to modify the version bump.
  • If the bot has failed to detect any changes, or if this pull request needs to update multiple packages to different versions or requires a more comprehensive changelog entry, maintainers can update the changelog file directly.

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Apr 11, 2026

🪼 branch checks and previews

Name Status URL
Spaces ready! Spaces preview

Install Trackio from this PR (includes built frontend)

pip install "https://huggingface.co/buckets/trackio/trackio-wheels/resolve/bab9d8d0212bfdac65ae858e97527c155b070bac/trackio-0.22.0-py3-none-any.whl"

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

