Phase A: Download X-post photos and render in Obsidian notes #33

# Spec — Download X-post photos and render in notes (Phase A)

## Problem

X posts often carry their actual content in attached images: screenshots of papers, charts, code, diagrams, thread screenshots. Today XBrain extracts the URLs (`Item.media` is populated at extract time) but never downloads the bytes and never shows them in the generated Obsidian notes. The visual half of the corpus is invisible to anyone reading the wiki.

## What gets delivered

A new command `xbrain media` that:

- Downloads every photo URL currently stored in `Item.media`.
- Records the per-photo outcome (downloaded / failed / pending) on the item itself.
- Is idempotent: re-running skips already-downloaded photos.
- Triggers a pre-op snapshot.

After the next `xbrain generate`, every Obsidian note that had attached photos shows them inline in the post body.
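The idempotency contract above can be illustrated with a minimal sketch. It assumes each photo entry is a dict with hypothetical `url` and `state` keys and that the downloader is an injected callable; none of these names come from XBrain itself, and the retry rules for failures and the snapshot step are left out.

```python
from typing import Callable

def run_media(entries: list[dict], fetch: Callable[[str], bool], force: bool = False) -> dict:
    """Process photo entries in place and return summary counts.

    `fetch` is a hypothetical downloader that returns True on success.
    """
    summary = {"downloaded": 0, "failed": 0, "skipped": 0}
    for entry in entries:
        # Idempotency: a photo that is already on disk is skipped unless forced.
        if entry["state"] == "downloaded" and not force:
            summary["skipped"] += 1
            continue
        ok = fetch(entry["url"])
        # Record the per-photo outcome on the entry itself.
        entry["state"] = "downloaded" if ok else "failed"
        summary["downloaded" if ok else "failed"] += 1
    return summary
```

Running it twice over the same entries downloads on the first pass and reports everything as skipped on the second, which is the behavior the summary line is expected to reflect.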

## Requirements

**Functional**

- The system MUST download photo media referenced in `Item.media` and persist the bytes locally.
- Each photo MUST end up in exactly one of these states: downloaded (file exists on disk), failed (gave up with a categorized reason), pending (not yet attempted).
- Video URLs MUST be retained but NOT downloaded in this phase. They remain in a `video-pending` state for a future phase.
- Failed downloads MUST record a categorized failure reason. Transient failures (server errors, timeouts, unknown errors) are eligible for retry on the next run; permanent failures (dead URL, format error) are not unless `--force` is passed.
- The pipeline MUST be interruptible: a Ctrl-C mid-batch must leave the store in a coherent state, and resuming must pick up where it left off without losing progress.
- The generated note MUST render downloaded photos inline in the tweet section.
- Failed photos MUST render as a one-line warning showing the failure reason and original URL. Pending photos render nothing (pending is not an error, just "not yet processed"). Video-pending entries render as a "not downloaded" placeholder.
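One way to read the functional requirements is as a small tagged union plus two pure functions: one deciding whether an entry should be attempted, one rendering it. The sketch below is illustrative only. The class names, the reason strings, and the Markdown output format are assumptions, not the project's actual model.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Pending:
    url: str

@dataclass(frozen=True)
class Downloaded:
    url: str
    path: str  # where the bytes live on disk

@dataclass(frozen=True)
class Failed:
    url: str
    reason: str  # categorized reason, e.g. "timeout" or "dead_url" (hypothetical labels)

@dataclass(frozen=True)
class VideoPending:
    url: str

MediaEntry = Union[Pending, Downloaded, Failed, VideoPending]

# Assumed labels for the transient bucket (server errors, timeouts, unknown errors).
TRANSIENT_REASONS = {"server_error", "timeout", "unknown"}

def should_attempt(entry: MediaEntry, force: bool = False) -> bool:
    """Decide whether this run should try to download the entry."""
    if isinstance(entry, VideoPending):
        return False  # videos are never downloaded in this phase
    if isinstance(entry, Pending):
        return True
    if isinstance(entry, Failed):
        # Transient failures retry automatically; permanent ones need --force.
        return force or entry.reason in TRANSIENT_REASONS
    return force  # Downloaded: only re-download when forced

def render_entry(entry: MediaEntry) -> str:
    """Return the Markdown fragment for one media entry."""
    if isinstance(entry, Downloaded):
        return f"![]({entry.path})"
    if isinstance(entry, Failed):
        return f"> Warning: photo not downloaded ({entry.reason}): {entry.url}"
    if isinstance(entry, VideoPending):
        return f"_Video not downloaded: {entry.url}_"
    return ""  # Pending renders nothing
```

Because each state is its own type, a "downloaded" entry without a path or a "failed" entry without a reason cannot be constructed.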

**Non-functional**

- The system MUST respect pbs.twimg.com: throttled requests, conservative concurrency, browser-style User-Agent.
- Existing data (the current `Item.media` shape) MUST continue to load without a manual migration step.
- No silent data loss: every photo URL in the input leaves the run accounted for.
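The politeness requirement toward pbs.twimg.com could look roughly like the following: a minimum-interval throttle in front of every request, plus a browser-style `User-Agent` header. The interval, the User-Agent string, and the class and function names are all placeholders chosen for illustration; the spec does not fix any of them.

```python
import time
import urllib.request

# Placeholder browser-style User-Agent; the real value is not specified in the spec.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36"
)

class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._last = None      # time of the previous request, if any

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

def build_request(url: str) -> urllib.request.Request:
    """Build a request carrying the browser-style User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

A sequential loop that calls `throttle.wait()` before each `urllib.request.urlopen(build_request(url))` is the most conservative form of concurrency; anything parallel would need a shared throttle.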

## Scope

**In**

- X-attached photos via `pbs.twimg.com`.
- Cascading quality fallback (highest available → next → next).
- CLI flags: `--force`, `--limit N`, `--items <ids>`.
- Inline rendering of downloaded photos in Obsidian notes.
- `xbrain diff` reports media state counts.
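The cascading quality fallback can be sketched as a list of candidate URLs tried in order. This assumes photo URLs accept a `name` query parameter with the size labels `orig`, `large`, `medium`, and `small`, which matches commonly observed pbs.twimg.com URLs but is not stated in this spec; `fetch` is a hypothetical callable that returns the bytes or `None`.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# Assumed size labels, highest quality first.
QUALITY_ORDER = ("orig", "large", "medium", "small")

def candidate_urls(url: str) -> list[str]:
    """Return the same photo URL at each quality level, best first."""
    parts = urlsplit(url)
    # Keep existing query parameters (e.g. format=jpg), one value per key.
    query = {k: v[0] for k, v in parse_qs(parts.query).items()}
    candidates = []
    for name in QUALITY_ORDER:
        query["name"] = name
        candidates.append(urlunsplit(parts._replace(query=urlencode(query))))
    return candidates

def first_available(url: str, fetch):
    """Try each candidate in order; return (url, data) for the first hit, else None."""
    for candidate in candidate_urls(url):
        data = fetch(candidate)
        if data is not None:
            return candidate, data
    return None
```

When the highest quality exists, the first request succeeds and no further round-trips are made; the lower qualities are only requested after a miss.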

**Out (deferred)**

- Video download. HLS + ffmpeg is a significantly more complex problem; ship photos first.
- Article images. Trafilatura doesn't extract them; would need separate hero-image logic.
- LLM image description. That's Phase B.

## Acceptance criteria

- [ ] Running `xbrain media` on the full corpus completes without unhandled exceptions.
- [ ] After the run, every photo URL in `items.json` is in a defined state (downloaded / failed / pending).
- [ ] Photos land on disk in a deterministic path that an Obsidian embed can reference.
- [ ] Re-running `xbrain media` is a no-op for already-downloaded photos (and reports them as skipped in the summary line).
- [ ] `--force` re-downloads everything.
- [ ] `xbrain generate` produces notes with photos visible in Obsidian — spot-check of 10 random notes confirms inline rendering.
- [ ] `xbrain diff <a> <b>` reports new-download counts between snapshots.
- [ ] Ctrl-C during a long run leaves `items.json` valid; resuming completes the remaining work.
- [ ] Existing data loads without manual migration.
- [ ] A snapshot is created automatically before the run.
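The Ctrl-C criterion is usually met by never writing `items.json` in place. A minimal sketch, assuming the store is a single JSON file and using a hypothetical helper name: write to a temporary file in the same directory, then swap it in with `os.replace`, which is atomic on the same filesystem.

```python
import json
import os
import tempfile

def save_items_atomically(items, path: str) -> None:
    """Write `items` as JSON so that `path` is always either the old or the new version."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so os.replace stays atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(items, fh, ensure_ascii=False, indent=2)
            fh.flush()
            os.fsync(fh.fileno())  # make sure the bytes reach disk before the swap
        os.replace(tmp_path, path)
    except BaseException:
        # Includes KeyboardInterrupt: drop the partial temp file, keep the old store.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Saving after each photo (or each small batch) means an interrupted run loses at most the work in flight, and the next run resumes from the states already recorded.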

## Success criteria (measurable)

- ≥95% of photo URLs in the existing corpus end in the "downloaded" state after one full run.
- Failure rate ≤5%, with every failure carrying a categorized reason.
- Total disk usage on `data/media/` ≤ 350 MB for the current ~1884-item corpus.
- Manual spot-check of 10 random notes: every photo renders inline in Obsidian.
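The first two thresholds can be checked mechanically from the per-photo states. The sketch below assumes states are available as plain strings and excludes `video-pending` entries from the denominator, since they are not photos; both the function names and that exclusion are assumptions.

```python
from collections import Counter

def photo_state_counts(states: list[str]) -> Counter:
    """Count photo states, ignoring video entries."""
    return Counter(s for s in states if s != "video-pending")

def meets_download_target(states: list[str], min_percent: int = 95) -> bool:
    """True when at least `min_percent` of photos are in the 'downloaded' state."""
    counts = photo_state_counts(states)
    total = sum(counts.values())
    if total == 0:
        return True  # nothing to download, nothing to fail
    # Integer arithmetic avoids floating-point edge cases at exactly 95%.
    return counts["downloaded"] * 100 >= min_percent * total
```

The same counts, taken on two snapshots and subtracted, would give the new-download numbers that `xbrain diff` is expected to report.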

## Decisions taken

| Decision | Choice | Why |
|---|---|---|
| Phase A scope | Photos only, no video | Video is HLS + ffmpeg, ~3x complexity. Ship photos first. |
| Article images | Deferred | trafilatura doesn't extract them; separate work. |
| Image quality | Highest available with cascading fallback | Best signal preserved without extra round-trips. |
| Storage location | `data/media/` (gitignored), per-item subdirectory | Keeps vault tree clean; recommended over vault-embedded media. |
| Data model | Tagged union with explicit state per media entry | Matches existing project direction (#20); no illegal states. |
| Migration | Validator-based, no separate command | Same pattern as #20. Zero downtime. |
| Failure categorization | Mirrors the transient/permanent buckets from #19 | Consistent retry semantics across the pipeline. |
| Render position | Inline in the tweet section | Natural read order in Obsidian. |

## Open questions for Víctor

- Storage location final call: `data/media/` (gitignored, recommended) vs `learnings/x-knowledge/_media/` (inside vault, notes self-contained).
- Photo cap per post: all inline vs cap at e.g. 4 + "+N more" link.

## Dependencies

- None at the code level — Phase A builds on existing `Item.media` extraction.
- Phase B depends on Phase A being merged.

