Phase A: Download X-post photos and render in Obsidian notes #33

@VGonPa

Spec — Download X-post photos and render in notes (Phase A)

Problem

X posts often carry their actual content in attached images: screenshots of papers, charts, code, diagrams, and threads. Today XBrain extracts the URLs (Item.media is populated at extract time) but never downloads the bytes and never shows them in the generated Obsidian notes. The visual half of the corpus is invisible to anyone reading the wiki.

What gets delivered

A new command xbrain media that:

  • Downloads every photo URL currently stored in Item.media.
  • Records the per-photo outcome (downloaded / failed / pending) on the item itself.
  • Is idempotent: re-running skips already-downloaded photos.
  • Triggers a pre-op snapshot.

After the next xbrain generate, every Obsidian note that had attached photos shows them inline in the post body.

Requirements

Functional

  • The system MUST download photo media referenced in Item.media and persist the bytes locally.
  • Each photo MUST end up in exactly one of these states: downloaded (file exists on disk), failed (gave up with a categorized reason), pending (not yet attempted).
  • Video URLs MUST be retained but NOT downloaded in this phase. They remain in a video-pending state for a future phase.
  • Failed downloads MUST record a categorized failure reason. Transient failures (server errors, timeouts, unknown errors) are eligible for retry on the next run; permanent failures (dead URL, format error) are not unless --force is passed.
  • The pipeline MUST be interruptible: a Ctrl-C mid-batch must leave the store in a coherent state, and resuming must pick up where it left off without losing progress.
  • The generated note MUST render downloaded photos inline in the tweet section.
  • Failed photos MUST render as a one-line warning showing the failure reason and original URL. Pending photos render silently (not an error, just "not yet processed"). Video-pending entries render as a "not downloaded" placeholder.
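
The states and retry rules above can be sketched as a small tagged union. This is a minimal illustration, not the project's actual model: the class names, the `should_attempt` helper, and the reason strings in `PERMANENT` are all hypothetical, since the spec only fixes the state names and the transient/permanent split.

```python
from dataclasses import dataclass
from typing import Literal, Union

# Hypothetical reason names for permanent failures. Any other reason
# (server error, timeout, unknown) is treated as transient.
PERMANENT = {"dead_url", "format_error"}


@dataclass(frozen=True)
class Pending:
    url: str
    state: Literal["pending"] = "pending"


@dataclass(frozen=True)
class Downloaded:
    url: str
    path: str
    state: Literal["downloaded"] = "downloaded"


@dataclass(frozen=True)
class Failed:
    url: str
    reason: str
    state: Literal["failed"] = "failed"


@dataclass(frozen=True)
class VideoPending:
    url: str
    state: Literal["video-pending"] = "video-pending"


MediaEntry = Union[Pending, Downloaded, Failed, VideoPending]


def should_attempt(entry: MediaEntry, force: bool = False) -> bool:
    """Decide whether this run should try to download the entry."""
    if isinstance(entry, VideoPending):
        return False  # videos are retained but never downloaded in Phase A
    if isinstance(entry, Pending):
        return True
    if isinstance(entry, Downloaded):
        return force  # idempotent unless --force is passed
    # Failed: transient reasons retry automatically, permanent ones need --force
    return force or entry.reason not in PERMANENT
```

Because each state is its own type, a `Downloaded` entry without a path or a `Failed` entry without a reason cannot be constructed.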

Non-functional

  • The system MUST respect pbs.twimg.com: throttled requests, conservative concurrency, browser-style User-Agent.
  • Existing data (the current Item.media shape) MUST continue to load without a manual migration step.
  • No silent data loss: every photo URL in the input leaves the run accounted for.
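
One way to meet the throttling requirement is a minimum-interval gate that every request passes through. The sketch below is an assumption-heavy illustration: the `Throttle` class, the one-second interval, and the exact User-Agent string are not specified anywhere in this issue. The clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable, Optional

# Assumed values: the spec asks for throttling and a browser-style
# User-Agent but fixes neither the interval nor the header string.
MIN_INTERVAL_S = 1.0
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(
        self,
        min_interval: float = MIN_INTERVAL_S,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the last request was allowed

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay
```

A downloader would call `throttle.wait()` immediately before each request and send `HEADERS` with it. With a single worker this also gives the conservative concurrency the requirement asks for.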

Scope

In

  • X-attached photos via pbs.twimg.com.
  • Cascading quality fallback (highest available → next → next).
  • CLI flags: --force, --limit N, --items <ids>.
  • Inline rendering of downloaded photos in Obsidian notes.
  • xbrain diff reports media state counts.
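
The cascading quality fallback could be expressed as an ordered list of candidate URLs, tried from best to worst until one succeeds. The following sketch assumes that the size variant is selected with a `name` query parameter and that the tiers are `orig`, `large`, `medium`, `small`. Those names are commonly seen on pbs.twimg.com URLs but are not listed in this spec, and `candidate_urls` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# Assumed quality ladder, largest first (not defined by the spec).
QUALITY_LADDER = ("orig", "large", "medium", "small")


def candidate_urls(url: str) -> list[str]:
    """Return the same photo URL once per quality tier, best first."""
    parts = urlsplit(url)
    # Keep existing parameters (such as format=jpg) and override only `name`.
    query = {key: values[-1] for key, values in parse_qs(parts.query).items()}
    return [
        urlunsplit(parts._replace(query=urlencode({**query, "name": name})))
        for name in QUALITY_LADDER
    ]
```

For example, with a placeholder media id `abc`, `candidate_urls("https://pbs.twimg.com/media/abc?format=jpg&name=small")` starts with `https://pbs.twimg.com/media/abc?format=jpg&name=orig`. The downloader would stop at the first candidate that returns image bytes and record a failure only after the last one fails.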

Out (deferred)

  • Video download. HLS + ffmpeg is significantly different complexity; ship photos first.
  • Article images. Trafilatura doesn't extract them; would need separate hero-image logic.
  • LLM image description. That's Phase B.

Acceptance criteria

  • Running xbrain media on the full corpus completes without unhandled exceptions.
  • After the run, every photo URL in items.json is in a defined state (downloaded / failed / pending).
  • Photos land on disk in a deterministic path that an Obsidian embed can reference.
  • Re-running xbrain media is a no-op for already-downloaded photos (and reports them as skipped on the summary line).
  • --force re-downloads everything.
  • xbrain generate produces notes with photos visible in Obsidian — spot-check of 10 random notes confirms inline rendering.
  • xbrain diff <a> <b> reports new-download counts between snapshots.
  • Ctrl-C during a long run leaves items.json valid; resuming completes the remaining work.
  • Existing data loads without manual migration.
  • A snapshot is created automatically before the run.
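
The Ctrl-C criterion is easiest to satisfy if items.json is never written in place. A minimal sketch, assuming the store is a JSON list of item dicts (the real schema is not shown in this issue) and using a hypothetical `save_items_atomically` helper: write to a temporary file in the same directory, then swap it in with `os.replace`.

```python
import json
import os
import tempfile
from pathlib import Path


def save_items_atomically(items: list[dict], path: Path) -> None:
    """Write items to a temp file, then swap it into place in one step."""
    path = Path(path)
    # The temp file must live on the same filesystem for the swap to be atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(items, fh, ensure_ascii=False, indent=2)
            fh.flush()
            os.fsync(fh.fileno())  # make sure bytes hit the disk before the swap
        os.replace(tmp_name, path)
    except BaseException:
        # Covers KeyboardInterrupt too: drop the partial temp file and
        # leave the previous items.json untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Saving after each item, or after each small batch, means an interrupted run loses at most the in-flight work, and the next run resumes from the recorded states.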

Success criteria (measurable)

  • ≥95% of photo URLs in the existing corpus end in the "downloaded" state after one full run.
  • Failure rate ≤5%, with every failure carrying a categorized reason.
  • Total disk usage on data/media/ ≤ 350 MB for the current ~1884-item corpus.
  • Manual spot-check of 10 random notes: every photo renders inline in Obsidian.
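
The rendering rules from the functional requirements (inline photo, one-line warning, silent pending, video placeholder) could look like the sketch below. Everything about the output format is an assumption: the spec does not fix the embed syntax or the warning wording, `render_media_line` is a hypothetical helper, and entries are modelled as plain dicts with a `state` key. Whether Obsidian can resolve the path also depends on the storage-location decision that is still open.

```python
def render_media_line(entry: dict) -> str:
    """Return the Markdown line for one media entry ('' for pending)."""
    state = entry["state"]
    if state == "downloaded":
        # Standard Markdown image syntax; the path is assumed to be
        # resolvable from the note's location.
        return f"![]({entry['path']})"
    if state == "failed":
        return f"> Warning: photo not downloaded ({entry['reason']}): {entry['url']}"
    if state == "video-pending":
        return f"> Video not downloaded: {entry['url']}"
    if state == "pending":
        return ""  # not an error, just not yet processed
    raise ValueError(f"unknown media state: {state!r}")
```

Raising on an unknown state keeps the "no silent data loss" requirement visible at render time instead of dropping the entry.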

Decisions taken

| Decision | Choice | Why |
| --- | --- | --- |
| Phase A scope | Photos only, no video | Video is HLS + ffmpeg, ~3x complexity. Ship photos first. |
| Article images | Deferred | trafilatura doesn't extract them; separate work. |
| Image quality | Highest available with cascading fallback | Best signal preserved without extra round-trips. |
| Storage location | data/media/ (gitignored), per-item subdirectory | Keeps vault tree clean; recommended over vault-embedded media. |
| Data model | Tagged union with explicit state per media entry | Matches existing project direction (#20); no illegal states. |
| Migration | Validator-based, no separate command | Same pattern as #20. Zero downtime. |
| Failure categorization | Mirrors the transient/permanent buckets from #19 | Consistent retry semantics across the pipeline. |
| Render position | Inline in the tweet section | Natural read order in Obsidian. |
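
The validator-based migration could amount to normalizing each media entry at load time. This sketch rests on assumptions that this issue does not confirm: that legacy entries are either bare URL strings or dicts with a `url` key, and that videos can be recognised by the `video.twimg.com` host. `normalize_media` is a hypothetical name, and the pattern used in #20 may differ.

```python
from urllib.parse import urlsplit


def normalize_media(raw: list) -> list[dict]:
    """Upgrade legacy entries on load; already-tagged entries pass through."""
    out = []
    for entry in raw:
        if isinstance(entry, dict) and "state" in entry:
            out.append(entry)  # already in the new shape
            continue
        # Assumed legacy shapes: a bare URL string or a dict with a "url" key.
        url = entry if isinstance(entry, str) else entry["url"]
        host = urlsplit(url).hostname or ""
        state = "video-pending" if host == "video.twimg.com" else "pending"
        out.append({"url": url, "state": state})
    return out
```

Because tagged entries pass through unchanged, running the validator on every load is idempotent, which is what lets it replace a separate migration command.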

Open questions for Víctor

  • Storage location final call: data/media/ (gitignored, recommended) vs learnings/x-knowledge/_media/ (inside vault, notes self-contained).
  • Photo cap per post: all inline vs cap at e.g. 4 + "+N more" link.

Dependencies

  • None at the code level — Phase A builds on existing Item.media extraction.
  • Phase B depends on Phase A being merged.
