Depends on: #33 (Phase A must be merged to main before this can start.)
Spec — Describe images with vision LLM and feed into enrich (Phase B)
Problem
After Phase A, XBrain has the bytes of every photo locally and notes render them in Obsidian. But the topic-enrich step still doesn't "see" the image content — a post that's mostly a screenshot of a paper or a chart still gets enriched based only on its (often empty) accompanying text. Image-only knowledge stays undiscoverable by topic search, and topic-page overviews miss the visual signal entirely.
What gets delivered
A new command xbrain describe that:
- Sends each non-decorative downloaded photo to a Claude vision model.
- Stores a short text description per image, plus a flag marking whether the image was content-bearing or purely decorative.
- Causes subsequent xbrain enrich and xbrain topics runs to include the descriptions in the LLM input, so topic assignments and topic-page overviews reflect visual content.
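A minimal sketch of how descriptions might be folded into the enrich input. The section heading "Images in this post:" is the one named in the functional requirements; the PhotoDescription record and both function names are hypothetical and only illustrate the shape, not the real XBrain schema.

```python
from dataclasses import dataclass


@dataclass
class PhotoDescription:
    # Hypothetical record shape; field names are assumptions.
    description: str
    is_decorative: bool


def images_section(photos: list[PhotoDescription]) -> str:
    """Return the "Images in this post:" block, or "" when nothing qualifies."""
    lines = [
        f"- {photo.description.strip()}"
        for photo in photos
        # Decorative photos and empty descriptions are left out of the prompt.
        if not photo.is_decorative and photo.description.strip()
    ]
    if not lines:
        return ""
    return "Images in this post:\n" + "\n".join(lines)


def build_enrich_input(post_text: str, photos: list[PhotoDescription]) -> str:
    """Append the image section to the post text only when there is one."""
    section = images_section(photos)
    return f"{post_text}\n\n{section}" if section else post_text
```

Returning an empty string for decorative-only posts keeps the prompt identical to what it would be without the describe step.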
Requirements
Functional
- The system MUST classify each photo as decorative (avatar, reaction meme, pure aesthetic, abstract background) or content-bearing (screenshot of text/code/chart/diagram/paper/UI, photo with content).
- The system MUST produce a short prose description (1-3 sentences) for each content-bearing photo, in the language configured by output_language.
- Decorative photos MUST be classified as such, stored with an empty description, and excluded from the enrich prompt so they introduce no topic noise.
- The system MUST be idempotent: re-running is a no-op for already-described photos unless a description-version bump invalidates them or --force is passed.
- The item-enrich prompt MUST include the descriptions of content-bearing photos in an "Images in this post:" section.
- The topic-synth prompt MUST include image descriptions of items belonging to the topic.
- Refusals from the vision API (faces, NSFW, etc.) MUST be handled gracefully: mark the photo as decorative with an empty description and continue, with no hard failure.
- Per-batch failures MUST be isolated — a failing batch does not abort the whole run.
- A total-failure run (every batch errored) MUST exit non-zero.
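The failure-isolation and exit-code requirements above could be sketched as follows. This is an illustration under assumptions: run_batches and the describe_batch callable are hypothetical names, and a batch is modeled simply as a sequence of photo ids.

```python
import sys
from typing import Callable, Sequence


def run_batches(
    batches: Sequence[Sequence[str]],
    describe_batch: Callable[[Sequence[str]], None],
) -> int:
    """Run every batch, isolating failures, and return a process exit code."""
    failed = 0
    for index, batch in enumerate(batches):
        try:
            describe_batch(batch)
        except Exception as exc:
            # One failing batch is recorded but must not abort the run.
            failed += 1
            print(f"batch {index} failed: {exc}", file=sys.stderr)
    # Non-zero only when there was work to do and every batch errored.
    return 1 if batches and failed == len(batches) else 0
```

A partially failed run still returns 0 here; whether partial failure deserves its own exit code is not stated in the requirements.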
Non-functional
- A full-corpus run MUST stay under $20 total API cost.
- Description language follows the output_language configuration so the vault stays consistent.
- A description_version field is tracked per entry so prompt evolution can trigger targeted re-describe (no full re-run needed when only the prompt changes).
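As a rough sketch, the idempotency rule and the description_version field might combine into a single check like the one below. The constant, its value, and the function name are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical value; the idea is that it is bumped whenever the prompt changes.
CURRENT_DESCRIPTION_VERSION = 2


def needs_describe(
    stored_version: Optional[int],
    force: bool = False,
    current_version: int = CURRENT_DESCRIPTION_VERSION,
) -> bool:
    """Decide whether a photo should be sent to the vision model again."""
    if force:
        return True  # --force re-describes everything
    if stored_version is None:
        return True  # never described before
    # Described with an older prompt version: invalidated by the bump.
    return stored_version < current_version
```

With this shape, a second run over an unchanged corpus selects nothing, which is the no-op behaviour the functional requirements ask for.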
Scope
In
- Vision-describe every photo that Phase A successfully downloaded.
- Decorative-vs-content classification baked into the same call.
- Inject content-image descriptions into both the item-enrich prompt and the topic-synth prompt.
- CLI flags: --force, --limit N, --items <ids>, --model, --batch-size.
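The flag set could be wired up with argparse roughly as below. Only the flag names come from this spec. The comma-separated format for --items, the help texts, and the defaults are assumptions (the batch-size default of 5 mirrors the Batching decision).

```python
import argparse


def parse_ids(value: str) -> list[str]:
    """Split a comma-separated id list (assumed format for --items)."""
    return [item for item in value.split(",") if item]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="xbrain describe")
    parser.add_argument("--force", action="store_true",
                        help="re-describe photos that already have a description")
    parser.add_argument("--limit", type=int, default=None, metavar="N",
                        help="describe at most N photos")
    parser.add_argument("--items", type=parse_ids, default=None, metavar="IDS",
                        help="only describe photos of these item ids")
    parser.add_argument("--model", default=None,
                        help="override the default vision model")
    parser.add_argument("--batch-size", type=int, default=5,
                        help="images per vision call")
    return parser
```

For example, `build_parser().parse_args(["--limit", "10", "--force"])` yields a namespace with limit set to 10 and force set to True.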
Out (deferred)
- Describing videos (no Phase A video download).
- Describing article images (no Phase A article fetch).
- Alt-text generation for accessibility (descriptions are stored but not exposed as Obsidian alt-text in this phase).
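Acceptance criteria
- xbrain describe on the full corpus completes within budget.
- Every processed photo has an is_decorative flag and (if not decorative) a description.
- Re-running xbrain describe is a no-op for already-described photos (reported as skipped in the summary).
- --force re-describes everything.
- xbrain enrich after xbrain describe produces user-prompt strings that include the "Images in this post:" section for items with content-bearing photos.
- xbrain topics after xbrain describe includes image descriptions when synthesizing topic-page overviews.
- Descriptions are written in the language configured by output_language.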
Success criteria (measurable)
- Total cost of one full-corpus run ≤ $20 (expected $3-7 with Sonnet + Batch API).
- ≥80% of Phase-A-downloaded photos end in the "described" state after one full run.
- Manual evaluation: re-running xbrain enrich on 20 image-heavy items previously hard to classify produces measurably improved primary_topic assignments in ≥60% of cases (Víctor judges).
Decisions taken
| Decision | Choice | Why |
| --- | --- | --- |
| Model | Sonnet 4.6 with Batch API by default | Best quality/cost ratio; ~$3-5 for full corpus; Haiku would save ~$2, which is not material. |
| Prompt shape | One call per batch returning JSON list (is_decorative + description per image) | One round-trip; decorative filter built into the same call. |
| Batching | 5 images per call | 12-15% token saving vs per-image; modest added complexity. |
| Description language | Follows output_language config | Consistency with #16. |
| Re-describe trigger | Description-version bump | Avoids re-describing the whole corpus on every prompt tweak. |
| Refusal handling | Mark decorative + empty description | Graceful: no special-case error handling needed downstream. |
| Decorative filter | Excluded from enrich prompt | Avoids topic noise from avatars / memes / reaction images. |
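To make the "Prompt shape", "Batching" and "Refusal handling" rows concrete, here is a hedged sketch of grouping photos into calls of 5 and turning the model's JSON list into per-photo records. The reply format is assumed: one object per image, in input order, with a missing or null entry standing in for a refusal. The function names are hypothetical.

```python
import json
from typing import Iterator, Sequence


def chunked(items: Sequence[str], size: int = 5) -> Iterator[Sequence[str]]:
    """Yield consecutive groups of at most `size` photo ids."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parse_batch_reply(reply_text: str, photo_ids: Sequence[str]) -> dict[str, dict]:
    """Map each photo id to {"is_decorative": bool, "description": str}."""
    entries = json.loads(reply_text)  # malformed JSON raises: a batch failure
    if not isinstance(entries, list):
        raise ValueError("expected a JSON list, one entry per image")
    results: dict[str, dict] = {}
    for position, photo_id in enumerate(photo_ids):
        entry = entries[position] if position < len(entries) else None
        if not isinstance(entry, dict):
            # Assumed refusal shape: treat as decorative with empty description.
            results[photo_id] = {"is_decorative": True, "description": ""}
            continue
        decorative = bool(entry.get("is_decorative", True))
        description = "" if decorative else str(entry.get("description", "")).strip()
        if not description:
            decorative = True  # nothing usable to feed into enrich
        results[photo_id] = {"is_decorative": decorative, "description": description}
    return results
```

Letting an unparseable reply raise, instead of silently marking five photos decorative, keeps it visible as a batch failure; that split between refusal and failure is a design assumption, not something the table prescribes.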
Open questions for Víctor
- Should xbrain describe auto-run as part of xbrain media, or stay manual/opt-in? Recommendation: opt-in, since vision has a real cost.
- Default to Batch API for all runs, or use the streaming API for small (--limit < 100) runs and Batch API for large? Recommendation: smart default based on --limit.
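If the recommended smart default were adopted, the selection could be as small as the sketch below. The threshold of 100 comes from the question above; the function name and the mode strings are hypothetical, and treating a run without --limit as a large run is an assumption.

```python
from typing import Optional

SMALL_RUN_THRESHOLD = 100  # from the "--limit < 100" wording above


def choose_api_mode(limit: Optional[int]) -> str:
    """Return "streaming" for small limited runs, "batch" otherwise."""
    if limit is not None and limit < SMALL_RUN_THRESHOLD:
        return "streaming"
    # No --limit means a full-corpus run, assumed to be large.
    return "batch"
```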
Dependencies
- Phase A MUST be merged to main, since this phase depends on photos being downloaded and tagged with a "downloaded" state.
- Requires ANTHROPIC_API_KEY (already a project assumption).