Spec — Download X-post photos and render in notes (Phase A)
Problem
X posts often carry their actual content in attached images — screenshots of papers, charts, code, diagrams, capturas de hilos. Today XBrain extracts the URLs (Item.media is populated at extract time) but never downloads the bytes and never shows them in the generated Obsidian notes. The visual half of the corpus is invisible to anyone reading the wiki.
What gets delivered
A new command xbrain media that:
- Downloads every photo URL currently stored in
Item.media.
- Records the per-photo outcome (downloaded / failed / pending) on the item itself.
- Is idempotent: re-running skips already-downloaded photos.
- Triggers a pre-op snapshot.
After the next xbrain generate, every Obsidian note that had attached photos shows them inline in the post body.
Requirements
Functional
- The system MUST download photo media referenced in
Item.media and persist the bytes locally.
- Each photo MUST end up in exactly one of these states: downloaded (file exists on disk), failed (gave up with a categorized reason), pending (not yet attempted).
- Video URLs MUST be retained but NOT downloaded in this phase. They remain in a
video-pending state for a future phase.
- Failed downloads MUST record a categorized failure reason. Transient failures (server errors, timeouts, unknown errors) are eligible for retry on the next run; permanent failures (dead URL, format error) are not unless
--force is passed.
- The pipeline MUST be interruptible: a Ctrl-C mid-batch must leave the store in a coherent state, and resuming must pick up where it left off without losing progress.
- The generated note MUST render downloaded photos inline in the tweet section.
- Failed photos MUST render as a one-line warning showing the failure reason and original URL. Pending photos render silently (not an error, just "not yet processed"). Video-pending photos render as a "not downloaded" placeholder.
Non-functional
- The system MUST respect pbs.twimg.com: throttled requests, conservative concurrency, browser-style User-Agent.
- Existing data (the current
Item.media shape) MUST continue to load without a manual migration step.
- No silent data loss: every photo URL in the input leaves the run accounted for.
Scope
In
- X-attached photos via
pbs.twimg.com.
- Cascading quality fallback (highest available → next → next).
- CLI flags:
--force, --limit N, --items <ids>.
- Inline rendering of downloaded photos in Obsidian notes.
xbrain diff reports media state counts.
Out (deferred)
- Video download. HLS + ffmpeg is significantly different complexity; ship photos first.
- Article images. Trafilatura doesn't extract them; would need separate hero-image logic.
- LLM image description. That's Phase B.
Acceptance criteria
Success criteria (measurable)
- ≥95% of photo URLs in the existing corpus end in the "downloaded" state after one full run.
- Failure rate ≤5%, with every failure carrying a categorized reason.
- Total disk usage on
data/media/ ≤ 350 MB for the current ~1884-item corpus.
- Manual spot-check of 10 random notes: every photo renders inline in Obsidian.
Decisions taken
| Decision |
Choice |
Why |
| Phase A scope |
Photos only, no video |
Video is HLS + ffmpeg, ~3x complexity. Ship photos first. |
| Article images |
Deferred |
trafilatura doesn't extract them; separate work. |
| Image quality |
Highest available with cascading fallback |
Best signal preserved without extra round-trips. |
| Storage location |
data/media/ (gitignored), per-item subdirectory |
Keeps vault tree clean; recommended over vault-embedded media. |
| Data model |
Tagged union with explicit state per media entry |
Matches existing project direction (#20); no illegal states. |
| Migration |
Validator-based, no separate command |
Same pattern as #20. Zero downtime. |
| Failure categorization |
Mirrors the transient/permanent buckets from #19 |
Consistent retry semantics across the pipeline. |
| Render position |
Inline in the tweet section |
Natural read order in Obsidian. |
Open questions for Víctor
- Storage location final call:
data/media/ (gitignored, recommended) vs learnings/x-knowledge/_media/ (inside vault, notes self-contained).
- Photo cap per post: all inline vs cap at e.g. 4 + "+N more" link.
Dependencies
- None at the code level — Phase A builds on existing
Item.media extraction.
- Phase B depends on Phase A being merged.
Spec — Download X-post photos and render in notes (Phase A)
Problem
X posts often carry their actual content in attached images — screenshots of papers, charts, code, diagrams, capturas de hilos. Today XBrain extracts the URLs (
Item.mediais populated at extract time) but never downloads the bytes and never shows them in the generated Obsidian notes. The visual half of the corpus is invisible to anyone reading the wiki.What gets delivered
A new command
xbrain mediathat:Item.media.After the next
xbrain generate, every Obsidian note that had attached photos shows them inline in the post body.Requirements
Functional
Item.mediaand persist the bytes locally.video-pendingstate for a future phase.--forceis passed.Non-functional
Item.mediashape) MUST continue to load without a manual migration step.Scope
In
pbs.twimg.com.--force,--limit N,--items <ids>.xbrain diffreports media state counts.Out (deferred)
Acceptance criteria
xbrain mediaon the full corpus completes without unhandled exceptions.items.jsonis in a defined state (downloaded / failed / pending).xbrain mediais a no-op for already-downloaded photos (and reports them as skipped on the summary line).--forcere-downloads everything.xbrain generateproduces notes with photos visible in Obsidian — spot-check of 10 random notes confirms inline rendering.xbrain diff <a> <b>reports new-download counts between snapshots.items.jsonvalid; resuming completes the remaining work.Success criteria (measurable)
data/media/≤ 350 MB for the current ~1884-item corpus.Decisions taken
data/media/(gitignored), per-item subdirectoryOpen questions for Víctor
data/media/(gitignored, recommended) vslearnings/x-knowledge/_media/(inside vault, notes self-contained).Dependencies
Item.mediaextraction.