Article extraction drops content_state, cover_media, media_entities, and 6 other fields ft already has access to #148

@RealADemin

ft's article enrichment (PR #119, merged 2026-04-21) added the ability to fetch X Article body via X's GraphQL content_state parser. The extraction at src/graphql-bookmarks.ts:1517 (articleFromCandidate) keeps title, derives a flat text from several candidate fields, and captures siteName — three of the twelve fields X returns. The remaining nine are discarded after the text-flattening step. This issue proposes preserving them.

The current ArticleContent interface (src/types.ts:74-76) is { title, text, siteName }. After the text is extracted, the structured content_state.blocks and entityMap are dropped, along with cover image, inline images, summary, timestamps, and identifiers.

The asymmetry (verified May 2026 across three live captures)

result.article.article_results.result has these keys. Three are kept by ft today; nine are dropped:

| Field | ft keeps? | What's lost |
| --- | --- | --- |
| title | ✓ (article_title) | |
| plain_text | ✓ (folded into article_text) | |
| siteName / site_name | ✓ (article_site) | (absent for X Articles anyway) |
| content_state (blocks + entityMap) | ✗ | Rich text structure (paragraphs/headers/lists/quotes/code), inline hyperlinks, image-position markers, image captions |
| cover_media | ✗ | Hero image: URL, dimensions, color palette |
| media_entities | ✗ | Inline article image URLs, dimensions, color palettes |
| preview_text | ✗ | Short excerpt (always 199 chars across our captures) |
| summary_text | ✗ | X-generated AI summary (optional; present on 2 of 3 captures) |
| metadata.first_published_at_secs | ✗ | Article publication timestamp |
| lifecycle_state.modified_at_secs | ✗ | Article last-edit timestamp |
| rest_id | ✗ | Stable article identifier (enables deduplication across multiple bookmarks referencing the same article) |
| id | ✗ | Base64-encoded global identifier (redundant with rest_id) |
| is_grok_summary_eligible | ✗ | Niche Grok flag; safe to skip |

Concrete failing cases

The four cases below come from three articles captured via live TweetResultByRestId calls:

  1. Inline links lost. Bookmark 2049874687069426008 quotes a tweet linking to an X Article (/i/article/2049760065427574784). The article's content_state.entityMap contains a LINK entity pointing to https://truthsocial.com/@realDonaldTrump/posts/116100300268316472, with the corresponding text span in content_state.blocks[1] ("the direction of President Donald J. Trump", chars 21-63). ft's flattened article_text retains the prose but drops the URL, so the anchor becomes dead text.

  2. Image captions lost. Bookmark 2050950811006452187 (RasmusNielsen, "A note on the state of app icons") is an article centered on visual examples. Its content_state.entityMap contains four MEDIA entities, each with a data.caption field:

    • "Will the next bin icon be flat?"
    • "OS X Mountain Lion (10.8)"
    • "Lighting setup: viewport and render"
    • "Rendered in Blender (Eevee)"

    Image positions are marked by 4 atomic blocks in content_state.blocks (each with text: " " and an entityRanges pointer to the MEDIA entity). ft's flat-text extraction collapses these to empty space; captions and image positions both evaporate.

  3. Hero and inline images lost. Both cover_media (the article's hero image, sized 1199x480) and the four media_entities items (inline images sized 1289x856, etc., with original twimg.com URLs and color palettes) are dropped entirely. For a visual article like RasmusNielsen's, this is the bulk of the content.

  4. Rich text formatting lost. content_state.blocks use DraftJS conventions:

    • block.type: "unstyled" (paragraph), "header-one", "unordered-list-item", "blockquote", "code-block", "atomic" (image position)
    • block.inlineStyleRanges[]: {length, offset, style: "Bold" | "Italic" | "Underline" | ...}

    Two of our three captures had Bold spans (CENTCOM's lede "TAMPA, Fla. —" and the UFO article's intro). ft flattens these to plain text without style markers.
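To illustrate what a preserved content_state would make recoverable, here is a minimal sketch that renders blocks to Markdown-like text, restoring inline links and image captions. It assumes the standard DraftJS raw shape (entityRanges entries of {key, offset, length}, an entityMap object keyed by entity key, and a LINK entity carrying data.url); X's actual serialization may differ, and offsets are treated as JavaScript string indices. The names renderBlock and renderContentState are hypothetical, not existing ft functions. Inline styles such as Bold are left out to keep the sketch short.

```typescript
// Assumed DraftJS-style raw shapes; X's actual serialization may differ.
interface EntityRange { key: number | string; offset: number; length: number }
interface Block { type: string; text: string; entityRanges?: EntityRange[] }
interface Entity { type: string; data?: { url?: string; caption?: string } }
interface ContentState { blocks: Block[]; entityMap: Record<string, Entity> }

// Markdown prefixes for a few of the block types seen in the captures.
const BLOCK_PREFIX: Record<string, string> = {
  "header-one": "# ",
  "unordered-list-item": "- ",
  "blockquote": "> ",
};

function renderBlock(block: Block, entityMap: Record<string, Entity>): string {
  // Apply ranges right to left so earlier offsets stay valid after each splice.
  const ranges = [...(block.entityRanges ?? [])].sort((a, b) => b.offset - a.offset);
  let text = block.text;
  for (const range of ranges) {
    const entity = entityMap[String(range.key)];
    if (!entity) continue;
    if (block.type === "atomic" && entity.type === "MEDIA") {
      // An atomic block marks an image position; surface its caption instead of " ".
      return `[image: ${entity.data?.caption ?? "no caption"}]`;
    }
    const url = entity.data?.url;
    if (entity.type === "LINK" && url) {
      const end = range.offset + range.length;
      const anchor = text.slice(range.offset, end);
      text = `${text.slice(0, range.offset)}[${anchor}](${url})${text.slice(end)}`;
    }
  }
  return (BLOCK_PREFIX[block.type] ?? "") + text;
}

function renderContentState(state: ContentState): string {
  return state.blocks.map((b) => renderBlock(b, state.entityMap)).join("\n\n");
}
```

With the flat article_text alone, neither the link target nor the caption can be reconstructed, which is the argument for storing content_state as-is and rendering later.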

Same fix applies to articles inside quoted_status_result

The article shape is identical regardless of nesting:

  • Outer: result.article.article_results.result
  • Quoted: quoted_status_result.result.article.article_results.result

A quote-tweet whose quoted tweet is itself an X Article (e.g., bookmark 2049874687069426008 → quoted tweet 2049779422291460576 → article 2049760065427574784) carries the full article inline in the same shape. Currently parseTweetArticleByRestId (src/graphql-bookmarks.ts:1556) walks collectArticleCandidates, which descends recursively, so it will find the quoted article and the extraction code path already handles both contexts. The fix only needs to extend what is preserved.
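As a rough illustration of the two nestings, the sketch below looks up the article object at both paths relative to a tweet result. The helpers dig and findArticles are hypothetical and are not how collectArticleCandidates is implemented; they only show that the same path suffix applies in both contexts.

```typescript
type Obj = Record<string, unknown>;

function isObj(v: unknown): v is Obj {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Follow a key path through nested objects; undefined if any step is missing.
function dig(root: unknown, path: string[]): unknown {
  let cur: unknown = root;
  for (const key of path) {
    if (!isObj(cur)) return undefined;
    cur = cur[key];
  }
  return cur;
}

// Path from a tweet result to its article payload, per the captures.
const ARTICLE_PATH = ["article", "article_results", "result"];

function findArticles(tweet: unknown): { outer: Obj | undefined; quoted: Obj | undefined } {
  const outer = dig(tweet, ARTICLE_PATH);
  const quoted = dig(tweet, ["quoted_status_result", "result", ...ARTICLE_PATH]);
  return {
    outer: isObj(outer) ? outer : undefined,
    quoted: isObj(quoted) ? quoted : undefined,
  };
}
```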

Side observation: articles are inline in the response

parseTweetArticleByRestId is currently called from the ft sync --gaps flow at src/graphql-bookmarks.ts:1615, with each article fetched per bookmark via TweetResultByRestId. But result.article is present in tweet results regardless of endpoint: this was verified in all three live captures via TweetResultByRestId, and the Tweet GraphQL type is shared across endpoints (Bookmarks, BookmarkSearchTimeline, TweetResultByRestId).

If the Bookmarks endpoint response also carries result.article inline (very likely, since same Tweet type), extracting articles during the initial convertTweetToRecord pass would eliminate the per-tweet --gaps fetch. For a user with 100 article bookmarks, that's 100 saved HTTP requests on every sync. Worth verifying on your end with a sample Bookmarks response; not a blocker for this issue.

Suggested fix

Extend articleFromCandidate (src/graphql-bookmarks.ts:1517) to return a richer ArticleContent shape, and add columns/storage for the new fields.

Fields to preserve (8)

| Field | Source path | Notes |
| --- | --- | --- |
| content_state | candidate.content_state | JSON object: {blocks, entityMap}. Preserves rich text, inline links, captions, image positions |
| cover_media | candidate.cover_media | JSON object: {media_id, media_key, media_info: {original_img_url, original_img_width, original_img_height, color_info.palette}} |
| media_entities | candidate.media_entities | JSON array; each entry has the same shape as cover_media (the parent of media_info). Preserve as-is so future media types (e.g., video) flow through without parser changes |
| previewText | candidate.preview_text | string |
| summaryText | candidate.summary_text | string, optional (absent on some articles) |
| firstPublishedAt | new Date(candidate.metadata.first_published_at_secs * 1000).toISOString() | ISO timestamp |
| modifiedAt | new Date(candidate.lifecycle_state.modified_at_secs * 1000).toISOString() | ISO timestamp |
| articleRestId | candidate.rest_id | string |
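A minimal sketch of how the eight fields could be pulled off a candidate, assuming the field layout listed above. ArticleCandidate, RichArticleFields, secsToIso, and extractRichFields are hypothetical names for illustration; the real articleFromCandidate and ArticleContent may be structured differently. Whether X returns the timestamps as numbers or numeric strings is also an assumption, so the helper accepts both.

```typescript
// Hypothetical shapes; only the candidate's field names come from the captures.
type Json = Record<string, unknown>;

interface ArticleCandidate {
  rest_id?: string;
  preview_text?: string;
  summary_text?: string;
  content_state?: Json;
  cover_media?: Json;
  media_entities?: Json[];
  metadata?: { first_published_at_secs?: number | string };
  lifecycle_state?: { modified_at_secs?: number | string };
}

interface RichArticleFields {
  contentState?: Json;
  coverMedia?: Json;
  mediaEntities?: Json[];
  previewText?: string;
  summaryText?: string;
  firstPublishedAt?: string;
  modifiedAt?: string;
  articleRestId?: string;
}

// Convert epoch seconds to an ISO-8601 string; undefined for missing or invalid input.
function secsToIso(secs: number | string | undefined): string | undefined {
  if (secs === undefined) return undefined;
  const date = new Date(Number(secs) * 1000);
  return Number.isNaN(date.getTime()) ? undefined : date.toISOString();
}

function extractRichFields(candidate: ArticleCandidate): RichArticleFields {
  return {
    // Nested structures are passed through untouched so unknown shapes survive.
    contentState: candidate.content_state,
    coverMedia: candidate.cover_media,
    mediaEntities: candidate.media_entities,
    previewText: candidate.preview_text,
    summaryText: candidate.summary_text,
    firstPublishedAt: secsToIso(candidate.metadata?.first_published_at_secs),
    modifiedAt: secsToIso(candidate.lifecycle_state?.modified_at_secs),
    articleRestId: candidate.rest_id,
  };
}
```

Guarding the timestamp conversion matters because calling toISOString() on an invalid Date throws a RangeError, which would otherwise abort extraction for an article that lacks metadata or lifecycle_state.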

Schema choice (three options for maintainer)

A. New scalar columns + JSON columns for nested data. Mirrors the existing article_title / article_text / article_site pattern at src/bookmarks-db.ts:269-271. New columns:

article_preview_text TEXT,
article_summary_text TEXT,
article_first_published_at TEXT,
article_modified_at TEXT,
article_rest_id TEXT,
article_content_state TEXT,    -- JSON
article_cover_media TEXT,       -- JSON
article_media_entities TEXT,    -- JSON

Eight new columns. Each scalar is queryable directly; nested data is stored as JSON in its own column. Standard ALTER TABLE migration via the existing ensureColumn pattern (bookmarks-db.ts:341-343).
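For option A, the write side could look roughly like the sketch below, which maps the richer article fields onto the eight proposed columns and serializes nested data as JSON text. The RichArticleFields shape and the toArticleColumns helper are hypothetical; how ft actually binds values to its insert statement is not shown here.

```typescript
// Hypothetical input shape, using the camelCase names from the field table.
interface RichArticleFields {
  contentState?: unknown;
  coverMedia?: unknown;
  mediaEntities?: unknown[];
  previewText?: string;
  summaryText?: string;
  firstPublishedAt?: string;
  modifiedAt?: string;
  articleRestId?: string;
}

type ColumnValue = string | null;

// Map article fields to column values; missing fields become NULL.
function toArticleColumns(article: RichArticleFields): Record<string, ColumnValue> {
  const json = (value: unknown): ColumnValue =>
    value === undefined ? null : JSON.stringify(value);
  return {
    article_preview_text: article.previewText ?? null,
    article_summary_text: article.summaryText ?? null,
    article_first_published_at: article.firstPublishedAt ?? null,
    article_modified_at: article.modifiedAt ?? null,
    article_rest_id: article.articleRestId ?? null,
    article_content_state: json(article.contentState),
    article_cover_media: json(article.coverMedia),
    article_media_entities: json(article.mediaEntities),
  };
}
```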

B. Single JSON blob. One new column holding everything new:

article_extra_json TEXT,

Most flexible for future X-side additions (no migration needed when X adds new fields). Less queryable — would need JSON_EXTRACT for any new-field query. Follows the pattern of quoted_tweet_json TEXT (bookmarks-db.ts:266).

C. Normalized articles table. Separate table keyed by rest_id:

CREATE TABLE articles (
  rest_id TEXT PRIMARY KEY,
  title TEXT, plain_text TEXT, preview_text TEXT, summary_text TEXT,
  content_state_json TEXT, cover_media_json TEXT, media_entities_json TEXT,
  first_published_at TEXT, modified_at TEXT
);
-- bookmarks gets a foreign key to it
ALTER TABLE bookmarks ADD COLUMN article_rest_id TEXT REFERENCES articles(rest_id);

Cleanest long-term: same article quoted in multiple bookmarks dedupes naturally (relevant for the quoted-article case above where the same 2049760065427574784 article could appear via different bookmarks). Biggest migration.
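The deduplication that option C buys can be sketched in memory: articles are keyed by rest_id, and each bookmark keeps only a reference. The types and the normalizeArticles helper are hypothetical, and the rule of keeping the most recently modified copy when the same article appears twice is an assumption of this sketch, not something the captures dictate.

```typescript
// Hypothetical row shapes for illustration only.
interface ArticleRow { restId: string; title: string; modifiedAt?: string }
interface BookmarkWithArticle { bookmarkId: string; article: ArticleRow }
interface ArticleLink { bookmarkId: string; articleRestId: string }

function normalizeArticles(bookmarks: BookmarkWithArticle[]): {
  articles: Map<string, ArticleRow>;
  links: ArticleLink[];
} {
  const articles = new Map<string, ArticleRow>();
  const links: ArticleLink[] = [];
  for (const { bookmarkId, article } of bookmarks) {
    const existing = articles.get(article.restId);
    // Assumption: on a repeat sighting, keep the copy with the later modifiedAt.
    // ISO-8601 UTC strings in the same format compare correctly as plain strings.
    if (!existing || (article.modifiedAt ?? "") > (existing.modifiedAt ?? "")) {
      articles.set(article.restId, article);
    }
    links.push({ bookmarkId, articleRestId: article.restId });
  }
  return { articles, links };
}
```

In SQL terms this corresponds to an upsert into articles followed by setting bookmarks.article_rest_id, so two bookmarks that reach article 2049760065427574784 by different routes share one row.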

I lean toward C because of the deduplication win, but each has tradeoffs and you know the migration appetite better.

Things to flag

  1. is_grok_summary_eligible skipped — present on all three captures (always true under current feature flags), but no clear consumer use case. Safe to leave dropped unless a Grok-related feature ever lands.

  2. No video example in captures — all 4 of RasmusNielsen's inline media items have __typename: "ApiImage". media_entities may also support ApiVideo (X has video articles); preserving the array as JSON handles this transparently — whatever shape X returns flows through.

  3. Voice-over not requested — ft's TWEET_RESULT_FIELD_TOGGLES has withArticleVoiceOver: false. Out of scope for this issue.

  4. Local fetching of article images is a separate concern. This issue proposes preserving the URLs (cover_media + media_entities). Actually downloading the image bytes to ~/.ft-bookmarks/media/ is a follow-up issue: bookmark-media.ts would need to recognize article image URLs as fetch targets. I'll file that separately once this lands.
