Phase B: Describe images with vision LLM and feed into enrich #34

@VGonPa

Depends on: #33 (Phase A must be merged to main before this can start.)

Spec — Describe images with vision LLM and feed into enrich (Phase B)

Problem

After Phase A, XBrain has the bytes of every photo locally and notes render them in Obsidian. But the topic-enrich step still doesn't "see" the image content — a post that's mostly a screenshot of a paper or a chart still gets enriched based only on its (often empty) accompanying text. Image-only knowledge stays undiscoverable by topic search, and topic-page overviews miss the visual signal entirely.

What gets delivered

A new command xbrain describe that:

  • Sends each non-decorative downloaded photo to a Claude vision model.
  • Stores a short text description per image, plus a flag marking whether the image was content-bearing or purely decorative.
  • Causes subsequent xbrain enrich and xbrain topics runs to include the descriptions in the LLM input, so topic assignments and topic-page overviews reflect visual content.

Requirements

Functional

  • The system MUST classify each photo as decorative (avatar, reaction meme, pure aesthetic, abstract background) or content-bearing (screenshot of text/code/chart/diagram/paper/UI, photo with content).
  • The system MUST produce a short prose description (1-3 sentences) for each content-bearing photo, in the language configured by output_language.
  • Decorative photos MUST be classified as such, stored with an empty description, and excluded from the enrich prompt so they introduce no topic noise.
  • The system MUST be idempotent: re-running is a no-op for already-described photos unless a description-version bump invalidates them or --force is passed.
  • The item-enrich prompt MUST include the descriptions of content-bearing photos in an "Images in this post:" section.
  • The topic-synth prompt MUST include image descriptions of items belonging to the topic.
  • Refusals from the vision API (faces, NSFW, etc.) MUST be handled gracefully: mark the photo as decorative with an empty description and continue. A refusal is never a hard failure.
  • Per-batch failures MUST be isolated — a failing batch does not abort the whole run.
  • A total-failure run (every batch errored) MUST exit non-zero.
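A minimal sketch of the failure-isolation and exit-code requirements, assuming a hypothetical `describe_batch` callable that raises on a failed batch and returns one result per photo otherwise. `RunSummary` and `run_batches` are illustrative names, not existing XBrain code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Sequence


@dataclass
class RunSummary:
    total_batches: int = 0
    failed_batches: int = 0
    described: int = 0

    @property
    def exit_code(self) -> int:
        # Non-zero only when there was work to do and every batch errored.
        all_failed = (
            self.total_batches > 0
            and self.failed_batches == self.total_batches
        )
        return 1 if all_failed else 0


def run_batches(
    batches: Iterable[Sequence[str]],
    describe_batch: Callable[[Sequence[str]], list],  # hypothetical helper
) -> RunSummary:
    summary = RunSummary()
    for batch in batches:
        summary.total_batches += 1
        try:
            results = describe_batch(batch)
        except Exception:
            # Isolate the failure: count it and move on to the next batch.
            summary.failed_batches += 1
            continue
        summary.described += len(results)
    return summary
```

An empty run (nothing left to describe) exits zero in this sketch, which matches the idempotency requirement: a no-op re-run is not a failure.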

Non-functional

  • A full-corpus run MUST stay under $20 total API cost.
  • Description language follows the output_language configuration so the vault stays consistent.
  • A description_version field is tracked per entry so prompt evolution can trigger targeted re-describe (no full re-run needed when only the prompt changes).
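The version-based invalidation could be decided per entry roughly as follows. This is a sketch under assumptions: the stored version is an integer (or `None` when the photo has never been described), and a "bump" means the current version increases. `needs_describe` is a hypothetical helper name.

```python
from typing import Optional


def needs_describe(
    stored_version: Optional[int],
    current_version: int,
    force: bool = False,
) -> bool:
    """Return True when a photo should be (re-)described on this run."""
    if force:
        # --force re-describes everything, regardless of stored state.
        return True
    if stored_version is None:
        # Never described before.
        return True
    # Stale entry: written by an older prompt version.
    return stored_version < current_version
```

With this check, a prompt change only needs a version bump; entries already at the current version are skipped.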

Scope

In

  • Vision-describe every photo that Phase A successfully downloaded.
  • Decorative-vs-content classification baked into the same call.
  • Inject content-image descriptions into both the item-enrich prompt and the topic-synth prompt.
  • CLI flags: --force, --limit N, --items <ids>, --model, --batch-size.
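For illustration, the flag surface could be declared with `argparse` as below. The spec does not say which CLI framework XBrain uses, so this is only a sketch. Assumptions: `--items` takes a comma-separated list, `--model` defaults to `None` (meaning "use the configured default model"), and `--batch-size` defaults to 5, following the batching decision in this spec.

```python
import argparse


def _id_list(value: str) -> list:
    # Assumption: ids are passed as a comma-separated string, e.g. "a1,b2".
    return [part for part in value.split(",") if part]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="xbrain describe")
    parser.add_argument("--force", action="store_true",
                        help="re-describe photos even if already described")
    parser.add_argument("--limit", type=int, default=None, metavar="N",
                        help="describe at most N photos")
    parser.add_argument("--items", type=_id_list, default=None, metavar="IDS",
                        help="restrict the run to these item ids")
    parser.add_argument("--model", default=None,
                        help="vision model override (None = configured default)")
    parser.add_argument("--batch-size", type=int, default=5,
                        help="images per vision call")
    return parser
```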

Out (deferred)

  • Describing videos (no Phase A video download).
  • Describing article images (no Phase A article fetch).
  • Alt-text generation for accessibility (descriptions are stored but not exposed as Obsidian alt-text in this phase).

Acceptance criteria

  • Running xbrain describe on the full corpus completes within budget.
  • After the run, every Phase-A-downloaded photo has an is_decorative flag and (if not decorative) a description.
  • Re-running xbrain describe is a no-op for already-described photos (they are reported as skipped in the run summary).
  • Bumping the description-version triggers re-describe of stale entries on the next run.
  • --force re-describes everything.
  • xbrain enrich after xbrain describe produces user-prompt strings that include the "Images in this post:" section for items with content-bearing photos.
  • Decorative photos are absent from the enrich prompt.
  • xbrain topics after xbrain describe includes image descriptions when synthesizing topic-page overviews.
  • A batch that errors does not abort the run; the rest is still described.
  • A total-failure run exits non-zero.
  • Vision-API refusals are handled gracefully (marked decorative, empty description, no crash).
  • Descriptions are written in the language configured by output_language.
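The two prompt-related criteria (section present for content-bearing photos, decorative photos absent) can be checked against a helper shaped like the one below. It is a sketch only: the per-photo record is assumed to be a dict with `is_decorative` and `description` keys, and the bullet formatting inside the section is a guess, since the spec fixes only the "Images in this post:" heading.

```python
def images_section(photos: list) -> str:
    """Build the enrich-prompt section for one post; empty string if nothing to add."""
    lines = [
        f"- {photo['description']}"
        for photo in photos
        # Decorative photos and empty descriptions never reach the prompt.
        if not photo.get("is_decorative") and photo.get("description")
    ]
    if not lines:
        return ""
    return "Images in this post:\n" + "\n".join(lines)
```

Returning an empty string for posts without content-bearing photos keeps the enrich prompt unchanged for them.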

Success criteria (measurable)

  • Total cost of one full-corpus run ≤ $20 (expected $3-7 with Sonnet + Batch API).
  • ≥80% of Phase-A-downloaded photos end in the "described" state after one full run.
  • Manual evaluation: re-running xbrain enrich on 20 image-heavy items previously hard to classify produces measurably improved primary_topic assignments in ≥60% of cases (Víctor judges).

Decisions taken

| Decision | Choice | Why |
|---|---|---|
| Model | Sonnet 4.6 with Batch API by default | Best quality/cost ratio; ~$3-5 for full corpus; Haiku saves ~$2, which is not material. |
| Prompt shape | One call per batch returning JSON list (is_decorative + description per image) | One round-trip; decorative filter built into the same call. |
| Batching | 5 images per call | 12-15% token saving vs per-image; modest added complexity. |
| Description language | Follows output_language config | Consistency with #16. |
| Re-describe trigger | Description-version bump | Avoids re-describing the whole corpus on every prompt tweak. |
| Refusal handling | Mark decorative + empty description | Graceful: no special-case error handling needed downstream. |
| Decorative filter | Excluded from enrich prompt | Avoids topic noise from avatars / memes / reaction images. |
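To make the "Prompt shape" and "Refusal handling" decisions concrete, here is one way the per-batch JSON reply might be normalized. This is a sketch under assumptions: the model's reply text is a JSON list with one entry per image, in order; an entry that is `null` or has no usable description is treated like a refusal. How a refusal actually surfaces in the API response is not specified in this issue, and `parse_batch_response` is a hypothetical name.

```python
import json


def parse_batch_response(raw: str, expected: int) -> list:
    """Normalize one batch reply into [{'is_decorative', 'description'}, ...]."""
    data = json.loads(raw)  # non-JSON replies raise ValueError (batch failure)
    if not isinstance(data, list) or len(data) != expected:
        raise ValueError(f"expected a JSON list of {expected} entries")

    results = []
    for entry in data:
        description = ""
        decorative = True  # default covers null or malformed entries
        if isinstance(entry, dict):
            description = str(entry.get("description") or "").strip()
            # No usable description is handled like a refusal: decorative.
            decorative = bool(entry.get("is_decorative", False)) or not description
        if decorative:
            description = ""  # decorative photos are stored with no text
        results.append({"is_decorative": decorative, "description": description})
    return results
```

A malformed reply raises, so the caller can count it as a failed batch; a per-image gap degrades to "decorative, empty description" and the run continues.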

Open questions for Víctor

  • Should xbrain describe auto-run as part of xbrain media, or stay manual/opt-in? Recommendation: opt-in, since vision has a real cost.
  • Default to Batch API for all runs, or use the streaming API for small (--limit < 100) runs and Batch API for large? Recommendation: choose automatically based on --limit (streaming below the threshold, Batch API otherwise).
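If the second recommendation is accepted, the selection rule is small enough to sketch. The threshold of 100 comes from the question above; the function name and the `"streaming"` / `"batch"` return values are placeholders, and a run without `--limit` is assumed to be a large one.

```python
from typing import Optional


def choose_api(limit: Optional[int], threshold: int = 100) -> str:
    """Pick the API mode from --limit; no limit is treated as a large run."""
    if limit is not None and limit < threshold:
        return "streaming"
    return "batch"
```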

Dependencies

  • Phase A MUST be merged to main — depends on photos being downloaded and tagged with a "downloaded" state.
  • Requires ANTHROPIC_API_KEY (already a project assumption).
