Article extraction drops `content_state`, `cover_media`, `media_entities`, and 6 other fields ft already has access to

ft's article enrichment (PR #119, merged 2026-04-21) added the ability to fetch X Article body via X's GraphQL `content_state` parser. The extraction at `src/graphql-bookmarks.ts:1517` (`articleFromCandidate`) keeps `title`, derives a flat `text` from several candidate fields, and captures `siteName`: three fields in total. The other fields X returns (ten in the table below) are discarded after the text-flattening step. This issue proposes preserving eight of them.

The current `ArticleContent` interface (`src/types.ts:74-76`) is `{ title, text, siteName }`. After the text is extracted, the structured `content_state.blocks` and `entityMap` are dropped, along with cover image, inline images, summary, timestamps, and identifiers.

## The asymmetry (verified May 2026 across three live captures)

`result.article.article_results.result` has these keys. Three are kept by ft today; the ten marked ✗ are dropped:

| Field | ft keeps? | What's lost |
|---|---|---|
| `title` | ✓ | — |
| `plain_text` | ✓ (folded into `article_text`) | — |
| `siteName` / `site_name` | ✓ (`article_site`) | — (absent for X Articles anyway) |
| **`content_state`** (blocks + entityMap) | ✗ | Rich text structure (paragraphs/headers/lists/quotes/code), inline hyperlinks, image-position markers, image captions |
| **`cover_media`** | ✗ | Hero image: URL, dimensions, color palette |
| **`media_entities`** | ✗ | Inline article image URLs, dimensions, color palettes |
| **`preview_text`** | ✗ | Short excerpt (always 199 chars across our captures) |
| **`summary_text`** | ✗ | X-generated AI summary (optional — present on 2 of 3 captures) |
| **`metadata.first_published_at_secs`** | ✗ | Article publication timestamp |
| **`lifecycle_state.modified_at_secs`** | ✗ | Article last-edit timestamp |
| **`rest_id`** | ✗ | Stable article identifier (enables deduplication across multiple bookmarks referencing the same article) |
| **`id`** | ✗ | Base64-encoded global identifier (redundant with `rest_id`) |
| `is_grok_summary_eligible` | ✗ | Niche Grok flag — safe to skip |

## Concrete failing cases

Three articles captured via live `TweetResultByRestId` calls:

1. **Inline links lost.** Bookmark `2049874687069426008` quotes a tweet linking to an X Article (`/i/article/2049760065427574784`). The article's `content_state.entityMap` contains a `LINK` entity pointing to `https://truthsocial.com/@realDonaldTrump/posts/116100300268316472`, with the corresponding text span in `content_state.blocks[1]` ("the direction of President Donald J. Trump", chars 21-63). ft's flattened `article_text` retains the prose but drops the URL, so the anchor becomes dead text.

2. **Image captions lost.** Bookmark `2050950811006452187` (RasmusNielsen, "A note on the state of app icons") is an article centered on visual examples. Its `content_state.entityMap` contains four `MEDIA` entities, each with a `data.caption` field:
   - "Will the next bin icon be flat?"
   - "OS X Mountain Lion (10.8)"
   - "Lighting setup: viewport and render"
   - "Rendered in Blender (Eevee)"
   
   Image positions are marked by 4 `atomic` blocks in `content_state.blocks` (each with `text: " "` and an `entityRanges` pointer to the MEDIA entity). ft's flat-text extraction collapses these to empty space; captions and image positions both evaporate.

3. **Hero and inline images lost.** Both `cover_media` (the article's hero image, sized 1199x480) and the four `media_entities` items (inline images sized 1289x856, etc., with original twimg.com URLs and color palettes) are dropped entirely. For a visual article like RasmusNielsen's, this is the bulk of the content.

4. **Rich text formatting lost.** `content_state.blocks` use DraftJS conventions:
   - `block.type`: `"unstyled"` (paragraph), `"header-one"`, `"unordered-list-item"`, `"blockquote"`, `"code-block"`, `"atomic"` (image position)
   - `block.inlineStyleRanges[]`: `{length, offset, style: "Bold" | "Italic" | "Underline" | ...}`
   
   Two of our three captures had `Bold` spans (CENTCOM's lede "TAMPA, Fla. —" and UFO article's intro). ft flattens these to text without style markers.
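To make the loss concrete, here is a minimal sketch of what a consumer could do if `content_state` were preserved: render blocks to Markdown with links and captions intact. It assumes the DraftJS raw layout (an `entityMap` keyed by stringified entity key, `LINK` entities carrying `data.url`); `renderBlock` and `renderContentState` are hypothetical names, not ft functions, and the sample data is made up. Inline styles, `code-block`, and overlapping ranges are left out for brevity.

```typescript
// Assumed DraftJS-style raw shapes; only the fields used here are typed.
interface EntityRange { key: number; offset: number; length: number }
interface Block { type: string; text: string; entityRanges?: EntityRange[] }
interface Entity { type: string; data: { url?: string; caption?: string } }
interface ContentState { blocks: Block[]; entityMap: Record<string, Entity> }

// Markdown prefixes for some block types; anything else renders as a plain paragraph.
const BLOCK_PREFIX: Record<string, string> = {
  "header-one": "# ",
  "unordered-list-item": "- ",
  "blockquote": "> ",
};

function renderBlock(block: Block, entityMap: Record<string, Entity>): string {
  // Apply ranges right-to-left so earlier offsets stay valid after each splice.
  const ranges = [...(block.entityRanges ?? [])].sort((a, b) => b.offset - a.offset);

  if (block.type === "atomic") {
    // An atomic block marks an image position; surface its caption.
    const first = ranges[0];
    const entity = first ? entityMap[String(first.key)] : undefined;
    return `[image: ${entity?.data.caption ?? ""}]`;
  }

  let text = block.text;
  for (const range of ranges) {
    const entity = entityMap[String(range.key)];
    if (!entity || entity.type !== "LINK" || !entity.data.url) continue;
    const end = range.offset + range.length;
    const anchor = text.slice(range.offset, end);
    text = `${text.slice(0, range.offset)}[${anchor}](${entity.data.url})${text.slice(end)}`;
  }
  return (BLOCK_PREFIX[block.type] ?? "") + text;
}

function renderContentState(state: ContentState): string {
  return state.blocks.map((b) => renderBlock(b, state.entityMap)).join("\n\n");
}

// Made-up sample data, not taken from the captures.
const sample: ContentState = {
  blocks: [
    { type: "header-one", text: "Title" },
    {
      type: "unstyled",
      text: "See the docs for details.",
      entityRanges: [{ key: 0, offset: 4, length: 8 }],
    },
    { type: "atomic", text: " ", entityRanges: [{ key: 1, offset: 0, length: 1 }] },
  ],
  entityMap: {
    "0": { type: "LINK", data: { url: "https://example.com/docs" } },
    "1": { type: "MEDIA", data: { caption: "A caption" } },
  },
};

console.log(renderContentState(sample));
```

None of this is possible from the flattened `article_text` alone, since the offsets and entity pointers are gone by then.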

## Same fix applies to articles inside `quoted_status_result`

The article shape is identical regardless of nesting:

- **Outer**: `result.article.article_results.result`
- **Quoted**: `quoted_status_result.result.article.article_results.result`

A quote-tweet whose quoted tweet is itself an X Article (e.g., bookmark `2049874687069426008` → quoted tweet `2049779422291460576` → article `2049760065427574784`) carries the full article inline in the same shape. `parseTweetArticleByRestId` (`src/graphql-bookmarks.ts:1556`) currently relies on `collectArticleCandidates`, which descends recursively, so it already finds the quoted article and the extraction code path handles both contexts. The fix only needs to extend what is preserved.
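As an illustration of the two paths only, the sketch below looks up the article object at the outer and quoted locations of a tweet `result`. It is not ft's `collectArticleCandidates` (which recurses); `isRecord`, `getPath`, and `findArticleResults` are hypothetical helpers, and the sample objects are made up.

```typescript
type Json = Record<string, unknown>;

function isRecord(v: unknown): v is Json {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Follow a key path, returning undefined as soon as a step is missing.
function getPath(root: unknown, path: string[]): unknown {
  let cur: unknown = root;
  for (const key of path) {
    if (!isRecord(cur)) return undefined;
    cur = cur[key];
  }
  return cur;
}

const ARTICLE_PATH = ["article", "article_results", "result"];
const QUOTED_PATH = ["quoted_status_result", "result", ...ARTICLE_PATH];

// Returns the outer article (if any) followed by the quoted one (if any).
function findArticleResults(tweetResult: unknown): Json[] {
  return [getPath(tweetResult, ARTICLE_PATH), getPath(tweetResult, QUOTED_PATH)].filter(isRecord);
}

// Made-up tweet result whose quoted tweet carries an article.
const quoteTweet = {
  rest_id: "outer-tweet",
  quoted_status_result: {
    result: {
      rest_id: "quoted-tweet",
      article: { article_results: { result: { rest_id: "article-1", title: "Example" } } },
    },
  },
};

console.log(findArticleResults(quoteTweet).map((a) => a["rest_id"]));
```

Because both locations yield the same object shape, whatever richer extraction is added to `articleFromCandidate` applies to both without special-casing.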

## Side observation: articles are inline in the response

`parseTweetArticleByRestId` is currently called from the `ft sync --gaps` flow at `src/graphql-bookmarks.ts:1615`, which fetches each bookmark individually via `TweetResultByRestId`. But `result.article` is probably present in tweet results regardless of endpoint: it appears in all three live captures (taken via `TweetResultByRestId`), and the Tweet GraphQL type is shared across endpoints (Bookmarks, BookmarkSearchTimeline, TweetResultByRestId).

If the Bookmarks endpoint response also carries `result.article` inline (very likely, since same Tweet type), extracting articles during the initial `convertTweetToRecord` pass would eliminate the per-tweet `--gaps` fetch. For a user with 100 article bookmarks, that's 100 saved HTTP requests on every sync. Worth verifying on your end with a sample Bookmarks response; not a blocker for this issue.

## Suggested fix

Extend `articleFromCandidate` (`src/graphql-bookmarks.ts:1517`) to return a richer `ArticleContent` shape, and add columns/storage for the new fields.

### Fields to preserve (8)

| Field | Source path | Notes |
|---|---|---|
| `content_state` | `candidate.content_state` | JSON object: `{blocks, entityMap}`. Preserves rich text, inline links, captions, image positions |
| `cover_media` | `candidate.cover_media` | JSON object: `{media_id, media_key, media_info: {original_img_url, original_img_width, original_img_height, color_info.palette}}` |
| `media_entities` | `candidate.media_entities` | JSON array; each entry has the same shape as `cover_media` (the parent object of `media_info`). Preserve as-is so future media types (e.g., video) flow through without parser changes |
| `previewText` | `candidate.preview_text` | string |
| `summaryText` | `candidate.summary_text` | string, optional (absent on some articles) |
| `firstPublishedAt` | `new Date(candidate.metadata.first_published_at_secs * 1000).toISOString()` | ISO timestamp |
| `modifiedAt` | `new Date(candidate.lifecycle_state.modified_at_secs * 1000).toISOString()` | ISO timestamp |
| `articleRestId` | `candidate.rest_id` | string |
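A rough sketch of the mapping in the table, assuming the candidate is a plain JSON object. `ExtendedArticleFields`, `extendedFieldsFromCandidate`, and the small helpers are hypothetical names, the camelCase keys for the three JSON fields are my choice, and accepting numeric strings for the `*_secs` values is a defensive assumption rather than something observed in the captures.

```typescript
interface ExtendedArticleFields {
  contentState?: unknown;   // kept verbatim: { blocks, entityMap }
  coverMedia?: unknown;     // kept verbatim
  mediaEntities?: unknown[]; // kept verbatim
  previewText?: string;
  summaryText?: string;     // absent on some articles
  firstPublishedAt?: string; // ISO 8601
  modifiedAt?: string;       // ISO 8601
  articleRestId?: string;
}

type Obj = Record<string, unknown>;

const asObj = (v: unknown): Obj | undefined =>
  typeof v === "object" && v !== null && !Array.isArray(v) ? (v as Obj) : undefined;

const asString = (v: unknown): string | undefined =>
  typeof v === "string" && v.length > 0 ? v : undefined;

// Epoch seconds (number or numeric string) to ISO 8601; undefined if missing or invalid.
function secsToIso(v: unknown): string | undefined {
  if (typeof v !== "number" && typeof v !== "string") return undefined;
  if (typeof v === "string" && v.trim() === "") return undefined;
  const secs = Number(v);
  if (!Number.isFinite(secs)) return undefined;
  const date = new Date(secs * 1000);
  return Number.isNaN(date.getTime()) ? undefined : date.toISOString();
}

function extendedFieldsFromCandidate(candidate: Obj): ExtendedArticleFields {
  const media = candidate["media_entities"];
  return {
    contentState: asObj(candidate["content_state"]),
    coverMedia: asObj(candidate["cover_media"]),
    mediaEntities: Array.isArray(media) ? media : undefined,
    previewText: asString(candidate["preview_text"]),
    summaryText: asString(candidate["summary_text"]),
    firstPublishedAt: secsToIso(asObj(candidate["metadata"])?.["first_published_at_secs"]),
    modifiedAt: secsToIso(asObj(candidate["lifecycle_state"])?.["modified_at_secs"]),
    articleRestId: asString(candidate["rest_id"]),
  };
}

// Made-up candidate with no summary_text, to show the optional field staying undefined.
const exampleFields = extendedFieldsFromCandidate({
  rest_id: "article-1",
  preview_text: "Short excerpt",
  metadata: { first_published_at_secs: 86400 },
  media_entities: [],
});
console.log(exampleFields);
```

Guarding the timestamp conversion matters because `toISOString()` throws on an invalid date, which would otherwise turn one malformed field into a failed extraction.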

### Schema choice (three options for maintainer)

**A. New scalar columns + JSON columns for nested data.** Mirrors the existing `article_title / article_text / article_site` pattern at `src/bookmarks-db.ts:269-271`. New columns:
```sql
article_preview_text TEXT,
article_summary_text TEXT,
article_first_published_at TEXT,
article_modified_at TEXT,
article_rest_id TEXT,
article_content_state TEXT,    -- JSON
article_cover_media TEXT,       -- JSON
article_media_entities TEXT,    -- JSON
```
Eight new columns. Each scalar queryable directly; nested data stored as JSON in its own column. Standard ALTER TABLE migration via the existing `ensureColumn` pattern (`bookmarks-db.ts:341-343`).

**B. Single JSON blob.** One new column holding everything new:
```sql
article_extra_json TEXT,
```
Most flexible for future X-side additions (no migration needed when X adds new fields). Less queryable — would need `JSON_EXTRACT` for any new-field query. Follows the pattern of `quoted_tweet_json TEXT` (`bookmarks-db.ts:266`).

**C. Normalized `articles` table.** Separate table keyed by `rest_id`:
```sql
CREATE TABLE articles (
  rest_id TEXT PRIMARY KEY,
  title TEXT, plain_text TEXT, preview_text TEXT, summary_text TEXT,
  content_state_json TEXT, cover_media_json TEXT, media_entities_json TEXT,
  first_published_at TEXT, modified_at TEXT
);
-- bookmarks gets a foreign key to it
ALTER TABLE bookmarks ADD COLUMN article_rest_id TEXT REFERENCES articles(rest_id);
```
Cleanest long-term: same article quoted in multiple bookmarks dedupes naturally (relevant for the quoted-article case above where the same `2049760065427574784` article could appear via different bookmarks). Biggest migration.
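The deduplication behaviour option C buys can be sketched in memory, independent of SQLite: one row per `rest_id`, with bookmarks pointing at it. `ArticleIndex` and its fields are hypothetical, the ids and timestamps are made up, and "keep the most recently modified copy" is one possible upsert policy, not something ft does today.

```typescript
interface ArticleRow {
  restId: string;
  title: string;
  modifiedAt: string; // ISO 8601 in UTC, e.g. from toISOString()
}

class ArticleIndex {
  readonly articles = new Map<string, ArticleRow>();
  readonly bookmarkToArticle = new Map<string, string>();

  // Record that a bookmark references an article, storing the article once.
  link(bookmarkId: string, article: ArticleRow): void {
    const existing = this.articles.get(article.restId);
    // Same-format ISO 8601 UTC strings compare chronologically as plain strings.
    if (!existing || article.modifiedAt >= existing.modifiedAt) {
      this.articles.set(article.restId, article);
    }
    this.bookmarkToArticle.set(bookmarkId, article.restId);
  }
}

// Two made-up bookmarks that reach the same article, the second seeing a later edit.
const index = new ArticleIndex();
index.link("bookmark-a", { restId: "article-1", title: "Draft", modifiedAt: "2026-05-01T00:00:00.000Z" });
index.link("bookmark-b", { restId: "article-1", title: "Edited", modifiedAt: "2026-05-02T00:00:00.000Z" });

console.log(index.articles.size, index.bookmarkToArticle.size);
```

With options A or B the same article body would instead be stored once per bookmark row.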

I lean toward **C** because of the deduplication win, but each has tradeoffs and you know the migration appetite better.

## Things to flag

1. **`is_grok_summary_eligible` skipped** — present on all three captures (always `true` under current feature flags), but no clear consumer use case. Safe to leave dropped unless a Grok-related feature ever lands.

2. **No video example in captures** — all 4 of RasmusNielsen's inline media items have `__typename: "ApiImage"`. `media_entities` may also support `ApiVideo` (X has video articles); preserving the array as JSON handles this transparently — whatever shape X returns flows through.

3. **Voice-over not requested** — ft's `TWEET_RESULT_FIELD_TOGGLES` has `withArticleVoiceOver: false`. Out of scope for this issue.

4. **Local fetching of article images is a separate concern.** This issue proposes preserving the URLs (`cover_media` + `media_entities`). Actually downloading the image bytes to `~/.ft-bookmarks/media/` is a follow-up issue: `bookmark-media.ts` would need to recognize article image URLs as fetch targets. To be filed separately once this lands.
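For points 2 and 4, a small sketch of how preserved JSON could later be turned into fetch targets without the parser needing to know every media type. It reads only `media_info.original_img_url` (the path listed in the fields table) and skips entries that lack it; `collectArticleImageUrls` is a hypothetical helper, the sample data is made up, and the video-like entry is a guess at a shape not seen in the captures.

```typescript
type Rec = Record<string, unknown>;

const toRec = (v: unknown): Rec | undefined =>
  typeof v === "object" && v !== null && !Array.isArray(v) ? (v as Rec) : undefined;

// Pull media_info.original_img_url out of one media object, if present.
function originalImageUrl(media: unknown): string | undefined {
  const info = toRec(toRec(media)?.["media_info"]);
  const url = info?.["original_img_url"];
  return typeof url === "string" ? url : undefined;
}

// Cover image first, then inline images; unknown shapes are skipped, duplicates removed.
function collectArticleImageUrls(coverMedia: unknown, mediaEntities: unknown): string[] {
  const items: unknown[] = [coverMedia, ...(Array.isArray(mediaEntities) ? mediaEntities : [])];
  const urls = items.map(originalImageUrl).filter((u): u is string => u !== undefined);
  return [...new Set(urls)];
}

// Made-up data: one cover, one inline image, one entry of an unknown (video-like) shape.
const exampleUrls = collectArticleImageUrls(
  { media_info: { original_img_url: "https://example.com/cover.jpg" } },
  [
    { media_info: { original_img_url: "https://example.com/inline-1.jpg" } },
    { media_info: { some_video_field: "not an image" } },
  ],
);
console.log(exampleUrls);
```

The stored JSON is never filtered here; only the derived URL list is, so an unrecognized entry stays available for a later parser.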
