Phase B: Describe images with vision LLM and feed into enrich #34

> **Depends on:** #33 (Phase A must be merged to `main` before this can start.)

# Spec — Describe images with vision LLM and feed into enrich (Phase B)

## Problem

After Phase A, XBrain has the bytes of every photo locally and notes render them in Obsidian. But the topic-enrich step still doesn't "see" the image content — a post that's mostly a screenshot of a paper or a chart still gets enriched based only on its (often empty) accompanying text. Image-only knowledge stays undiscoverable by topic search, and topic-page overviews miss the visual signal entirely.

## What gets delivered

A new command `xbrain describe` that:

- Sends each non-decorative downloaded photo to a Claude vision model.
- Stores a short text description per image, plus a flag marking whether the image was content-bearing or purely decorative.
- Causes subsequent `xbrain enrich` and `xbrain topics` runs to include the descriptions in the LLM input, so topic assignments and topic-page overviews reflect visual content.

## Requirements

**Functional**

- The system MUST classify each photo as decorative (avatar, reaction meme, pure aesthetic, abstract background) or content-bearing (screenshot of text/code/chart/diagram/paper/UI, photo with content).
- The system MUST produce a short prose description (1-3 sentences) for each content-bearing photo, in the language configured by `output_language`.
- Decorative photos MUST be classified as such, stored with an empty description, and excluded from the enrich prompt so they introduce no topic noise.
- The system MUST be idempotent: re-running is a no-op for already-described photos unless a description-version bump invalidates them or `--force` is passed.
- The item-enrich prompt MUST include the descriptions of content-bearing photos in an "Images in this post:" section.
- The topic-synth prompt MUST include image descriptions of items belonging to the topic.
- Refusals from the vision API (faces, NSFW, etc.) MUST be handled gracefully: the photo is marked as decorative with an empty description and the run continues, with no hard failure.
- Per-batch failures MUST be isolated — a failing batch does not abort the whole run.
- A total-failure run (every batch errored) MUST exit non-zero.
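
As a rough illustration of the prompt requirements above, the sketch below builds the "Images in this post:" section from per-photo records. The `PhotoDescription` class and `build_images_section` function are hypothetical names, not part of XBrain; only the section header and the `is_decorative` / `description` fields come from this spec.

```python
from dataclasses import dataclass


@dataclass
class PhotoDescription:
    """Hypothetical per-photo record; field names follow the spec's wording."""
    is_decorative: bool
    description: str


def build_images_section(photos: list[PhotoDescription]) -> str:
    """Return the "Images in this post:" block, or "" if no photo qualifies."""
    # Decorative photos and empty descriptions (e.g. refusals) are left out
    # so they add no topic noise to the enrich prompt.
    lines = [
        f"- {p.description.strip()}"
        for p in photos
        if not p.is_decorative and p.description.strip()
    ]
    if not lines:
        return ""
    return "Images in this post:\n" + "\n".join(lines)
```

Returning an empty string for items without content-bearing photos lets the caller append the section unconditionally without changing prompts for text-only posts.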

**Non-functional**

- A full-corpus run MUST stay under $20 total API cost.
- Description language follows the `output_language` configuration so the vault stays consistent.
- A `description_version` field is tracked per entry so prompt evolution can trigger targeted re-describe (no full re-run needed when only the prompt changes).
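
One way the idempotency and version-bump rules could fit together is sketched below. It assumes `description_version` is stored as an integer and that a missing value means the photo was never described; `CURRENT_DESCRIPTION_VERSION` and `needs_describe` are illustrative names only.

```python
from typing import Optional

# Assumption: an integer that is bumped whenever the describe prompt changes.
CURRENT_DESCRIPTION_VERSION = 2


def needs_describe(
    stored_version: Optional[int],
    force: bool = False,
    current_version: int = CURRENT_DESCRIPTION_VERSION,
) -> bool:
    """Decide whether a photo should be sent to the vision model again."""
    if force:
        return True  # --force re-describes everything
    if stored_version is None:
        return True  # never described before
    # Entries written by an older prompt version are stale.
    return stored_version < current_version
```

With this check, a plain re-run skips every up-to-date entry, while a version bump selects only the stale ones.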

## Scope

**In**

- Vision-describe every photo that Phase A successfully downloaded.
- Decorative-vs-content classification baked into the same call.
- Inject content-image descriptions into both the item-enrich prompt and the topic-synth prompt.
- CLI flags: `--force`, `--limit N`, `--items <ids>`, `--model`, `--batch-size`.
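
The flag list above might map onto an argument parser roughly as follows. This is a sketch using `argparse`; the comma-separated format for `--items`, the help texts, and the fallback of `--model` to a configured default are assumptions, while the default batch size of 5 follows the decisions table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for `xbrain describe`; flag names come from the spec."""
    parser = argparse.ArgumentParser(prog="xbrain describe")
    parser.add_argument("--force", action="store_true",
                        help="re-describe photos that already have a description")
    parser.add_argument("--limit", type=int, default=None, metavar="N",
                        help="describe at most N photos")
    # Assumption: ids are passed as one comma-separated string.
    parser.add_argument("--items", default=None, metavar="IDS",
                        type=lambda s: [i for i in s.split(",") if i],
                        help="restrict the run to these item ids")
    # Assumption: None means "use the model from the project configuration".
    parser.add_argument("--model", default=None,
                        help="override the vision model")
    parser.add_argument("--batch-size", type=int, default=5,
                        help="images per vision call")
    return parser
```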

**Out (deferred)**

- Describing videos (no Phase A video download).
- Describing article images (no Phase A article fetch).
- Alt-text generation for accessibility (descriptions are stored but not exposed as Obsidian alt-text in this phase).

## Acceptance criteria

- [ ] Running `xbrain describe` on the full corpus completes within budget.
- [ ] After the run, every Phase-A-downloaded photo has an `is_decorative` flag and (if not decorative) a description.
- [ ] Re-running `xbrain describe` is a no-op for already-described photos (they are reported as skipped in the run summary).
- [ ] Bumping the description-version triggers re-describe of stale entries on the next run.
- [ ] `--force` re-describes everything.
- [ ] `xbrain enrich` after `xbrain describe` produces user-prompt strings that include the "Images in this post:" section for items with content-bearing photos.
- [ ] Decorative photos are absent from the enrich prompt.
- [ ] `xbrain topics` after `xbrain describe` includes image descriptions when synthesizing topic-page overviews.
- [ ] A batch that errors does not abort the run; the rest is still described.
- [ ] A total-failure run exits non-zero.
- [ ] Vision-API refusals are handled gracefully (marked decorative, empty description, no crash).
- [ ] Descriptions are written in the language configured by `output_language`.
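
The two failure-handling criteria (an erroring batch does not abort the run; a total failure exits non-zero) can be sketched as a small driver loop. `run_batches` and the `describe_batch` callable are hypothetical; the real command may structure this differently.

```python
from typing import Callable, Sequence


def run_batches(
    batches: Sequence[Sequence[str]],
    describe_batch: Callable[[Sequence[str]], None],
) -> int:
    """Process every batch and return a process exit code."""
    failed = 0
    for index, batch in enumerate(batches):
        try:
            describe_batch(batch)
        except Exception as exc:
            # Isolate the failure: log it and move on to the next batch.
            failed += 1
            print(f"batch {index} failed: {exc}")
    # Non-zero only when there was work to do and every batch errored.
    if batches and failed == len(batches):
        return 1
    return 0
```

A caller would pass the returned value to `sys.exit`, so a partially successful run still exits with 0.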

## Success criteria (measurable)

- Total cost of one full-corpus run ≤ $20 (expected $3-7 with Sonnet + Batch API).
- ≥80% of Phase-A-downloaded photos end in the "described" state after one full run.
- Manual evaluation: re-running `xbrain enrich` on 20 image-heavy items previously hard to classify produces measurably improved `primary_topic` assignments in ≥60% of cases (Víctor judges).

## Decisions taken

| Decision | Choice | Why |
|---|---|---|
| Model | Sonnet 4.6 with Batch API by default | Best quality/cost ratio; ~$3-5 for full corpus; Haiku saves ~$2 — not material. |
| Prompt shape | One call per batch returning JSON list (`is_decorative` + `description` per image) | One round-trip; decorative filter built into the same call. |
| Batching | 5 images per call | 12-15% token saving vs per-image; modest added complexity. |
| Description language | Follows `output_language` config | Consistency with #16. |
| Re-describe trigger | Description-version bump | Avoids re-describing the whole corpus on every prompt tweak. |
| Refusal handling | Mark decorative + empty description | Graceful — no special-case error handling needed downstream. |
| Decorative filter | Excluded from enrich prompt | Avoids topic noise from avatars / memes / reaction images. |
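
To make the batching and prompt-shape decisions concrete, here is a minimal sketch that splits photo ids into groups of 5 and parses one JSON-list reply per group. The function names are hypothetical, and treating any reply that is not a well-formed list of the expected length as a refusal is this sketch's assumption, not something the spec states.

```python
import json
from typing import Iterator, Sequence


def chunk(photo_ids: Sequence[str], size: int = 5) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `size` photo ids."""
    for start in range(0, len(photo_ids), size):
        yield list(photo_ids[start:start + size])


def parse_batch_reply(reply_text: str, batch: Sequence[str]) -> dict[str, dict]:
    """Map each photo id to its `is_decorative` flag and `description`."""
    fallback = {"is_decorative": True, "description": ""}
    try:
        entries = json.loads(reply_text)
    except json.JSONDecodeError:
        entries = None
    # Assumption: a non-JSON or mismatched reply is handled like a refusal.
    if not isinstance(entries, list) or len(entries) != len(batch):
        return {pid: dict(fallback) for pid in batch}
    results: dict[str, dict] = {}
    for pid, entry in zip(batch, entries):
        if not isinstance(entry, dict):
            results[pid] = dict(fallback)
            continue
        decorative = bool(entry.get("is_decorative", True))
        # Decorative photos are always stored with an empty description.
        text = "" if decorative else str(entry.get("description", "")).strip()
        results[pid] = {"is_decorative": decorative, "description": text}
    return results
```

Because refusals collapse into the same record shape as decorative photos, downstream code needs no separate error path, which matches the "Refusal handling" row.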

## Open questions for Víctor

- Should `xbrain describe` auto-run as part of `xbrain media`, or stay manual/opt-in? Recommendation: opt-in, since vision has a real cost.
- Default to Batch API for all runs, or use the streaming API for small (`--limit < 100`) runs and Batch API for large? Recommendation: smart default based on `--limit`.
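
If the "smart default" recommendation is adopted, the selection logic could be as small as the sketch below. The threshold of 100 is taken from the `--limit < 100` example in the question; the function name and the returned mode strings are placeholders.

```python
from typing import Optional

# Taken from the `--limit < 100` example; the final value is still open.
SMALL_RUN_THRESHOLD = 100


def choose_api_mode(limit: Optional[int]) -> str:
    """Pick the API mode for a run based on the `--limit` value."""
    if limit is not None and limit < SMALL_RUN_THRESHOLD:
        return "streaming"  # small run: favour quick feedback
    return "batch"  # full or large run: favour lower cost
```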

## Dependencies

- **Phase A MUST be merged to `main`**: this phase relies on photos being downloaded and tagged with a "downloaded" state.
- Requires `ANTHROPIC_API_KEY` (already a project assumption).
