ft's article enrichment (PR #119, merged 2026-04-21) added the ability to fetch X Article body via X's GraphQL content_state parser. The extraction at src/graphql-bookmarks.ts:1517 (articleFromCandidate) keeps title, derives a flat text from several candidate fields, and captures siteName — three of the twelve fields X returns. The remaining nine are discarded after the text-flattening step. This issue proposes preserving them.
The current ArticleContent interface (src/types.ts:74-76) is { title, text, siteName }. After the text is extracted, the structured content_state.blocks and entityMap are dropped, along with cover image, inline images, summary, timestamps, and identifiers.
The asymmetry (verified May 2026 across three live captures)
result.article.article_results.result has these keys. Three are kept by ft today; nine are dropped:
| Field | ft keeps? | What's lost |
| --- | --- | --- |
| title | ✓ | none |
| plain_text | ✓ (folded into article_text) | none |
| siteName / site_name | ✓ (article_site) | none (absent for X Articles anyway) |
| content_state (blocks + entityMap) | ✗ | Rich text structure (paragraphs/headers/lists/quotes/code), inline hyperlinks, image-position markers, image captions |
| cover_media | ✗ | Hero image: URL, dimensions, color palette |
| media_entities | ✗ | Inline article image URLs, dimensions, color palettes |
| preview_text | ✗ | Short excerpt (always 199 chars across our captures) |
| summary_text | ✗ | X-generated AI summary (optional; present on 2 of 3 captures) |
| metadata.first_published_at_secs | ✗ | Article publication timestamp |
| lifecycle_state.modified_at_secs | ✗ | Article last-edit timestamp |
| rest_id | ✗ | Stable article identifier (enables deduplication across multiple bookmarks referencing the same article) |
| id | ✗ | Base64-encoded global identifier (redundant with rest_id) |
| is_grok_summary_eligible | ✗ | Niche Grok flag, safe to skip |
Concrete failing cases
Three articles captured via live TweetResultByRestId calls:
- Inline links lost. Bookmark 2049874687069426008 quotes a tweet linking to an X Article (/i/article/2049760065427574784). The article's content_state.entityMap contains a LINK entity pointing to https://truthsocial.com/@realDonaldTrump/posts/116100300268316472, with the corresponding text span in content_state.blocks[1] ("the direction of President Donald J. Trump", chars 21-63). ft's flattened article_text retains the prose but drops the URL, so the anchor becomes dead text.
- Image captions lost. Bookmark 2050950811006452187 (RasmusNielsen, "A note on the state of app icons") is an article centered on visual examples. Its content_state.entityMap contains four MEDIA entities, each with a data.caption field:
  - "Will the next bin icon be flat?"
  - "OS X Mountain Lion (10.8)"
  - "Lighting setup: viewport and render"
  - "Rendered in Blender (Eevee)"

  Image positions are marked by 4 atomic blocks in content_state.blocks (each with text: " " and an entityRanges pointer to the MEDIA entity). ft's flat-text extraction collapses these to empty space, so captions and image positions both disappear.
- Hero and inline images lost. Both cover_media (the article's hero image, sized 1199x480) and the four media_entities items (inline images sized 1289x856, etc., with original twimg.com URLs and color palettes) are dropped entirely. For a visual article like RasmusNielsen's, this is the bulk of the content.
- Rich text formatting lost. content_state.blocks use DraftJS conventions:
  - block.type: "unstyled" (paragraph), "header-one", "unordered-list-item", "blockquote", "code-block", "atomic" (image position)
  - block.inlineStyleRanges[]: {length, offset, style: "Bold" | "Italic" | "Underline" | ...}

  Two of our three captures had Bold spans (CENTCOM's lede "TAMPA, Fla. —" and UFO article's intro). ft flattens these to text without style markers.
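To make concrete what the preserved content_state would allow downstream, here is a minimal sketch that renders blocks to Markdown while keeping links, Bold/Italic spans, and image captions. It is not ft code. The type names and both functions are hypothetical, the shapes are assumed to follow Draft.js raw content (keyed entityMap, entityRanges with a numeric key, LINK entities carrying data.url), and X's actual payload may differ.

```typescript
// Assumed shapes, modeled on Draft.js raw content. X's real payload may differ
// (for example, entityMap could be an array rather than a keyed object).
interface DraftStyleRange { offset: number; length: number; style: string }
interface DraftEntityRange { offset: number; length: number; key: number }
interface DraftBlock {
  type: string;
  text: string;
  inlineStyleRanges?: DraftStyleRange[];
  entityRanges?: DraftEntityRange[];
}
interface DraftEntity { type: string; data: { url?: string; caption?: string } }
interface DraftContentState { blocks: DraftBlock[]; entityMap: Record<string, DraftEntity> }
interface MarkSpan { offset: number; length: number; open: string; close: string }

// Wrap Bold/Italic spans and LINK entities in Markdown markers.
// Assumes spans do not overlap; a real renderer would need to handle nesting.
function renderInline(block: DraftBlock, entityMap: Record<string, DraftEntity>): string {
  const spans: MarkSpan[] = [];
  for (const r of block.inlineStyleRanges ?? []) {
    const marker = r.style === "Bold" ? "**" : r.style === "Italic" ? "_" : "";
    if (marker) spans.push({ offset: r.offset, length: r.length, open: marker, close: marker });
  }
  for (const r of block.entityRanges ?? []) {
    const entity = entityMap[String(r.key)];
    if (entity?.type === "LINK" && entity.data.url) {
      spans.push({ offset: r.offset, length: r.length, open: "[", close: `](${entity.data.url})` });
    }
  }
  // Apply right-to-left so earlier offsets stay valid after each insertion.
  spans.sort((a, b) => b.offset - a.offset);
  let out = block.text;
  for (const s of spans) {
    const end = s.offset + s.length;
    out = out.slice(0, s.offset) + s.open + out.slice(s.offset, end) + s.close + out.slice(end);
  }
  return out;
}

function renderContentState(state: DraftContentState): string {
  const rendered = state.blocks.map((block) => {
    switch (block.type) {
      case "header-one": return `# ${renderInline(block, state.entityMap)}`;
      case "unordered-list-item": return `- ${renderInline(block, state.entityMap)}`;
      case "blockquote": return `> ${renderInline(block, state.entityMap)}`;
      case "code-block": return `    ${block.text}`;
      case "atomic": {
        // Atomic blocks mark an image position; the caption lives on the MEDIA entity.
        const key = block.entityRanges?.[0]?.key;
        const entity = key === undefined ? undefined : state.entityMap[String(key)];
        return entity?.type === "MEDIA" ? `[image: ${entity.data.caption ?? ""}]` : "";
      }
      default: return renderInline(block, state.entityMap);
    }
  });
  return rendered.join("\n\n");
}
```

None of this is possible from the flattened article_text alone, which is the argument for storing content_state as-is and leaving rendering to consumers.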
Same fix applies to articles inside quoted_status_result
The article shape is identical regardless of nesting:
- Outer: result.article.article_results.result
- Quoted: quoted_status_result.result.article.article_results.result
A quote-tweet whose quoted tweet IS an X Article (e.g., bookmark 2049874687069426008 → quoted tweet 2049779422291460576 → article 2049760065427574784) carries the full article inline in the same shape. Currently parseTweetArticleByRestId (src/graphql-bookmarks.ts:1556) walks collectArticleCandidates which descends recursively — it WILL find the quoted article, so the extraction code path already handles both contexts. The fix just needs to extend what's preserved.
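For illustration only, a recursive walk of the kind described above might look like the sketch below. It is a hypothetical stand-in, not ft's collectArticleCandidates: the function names are invented here, and the only thing taken from the issue is the article.article_results.result path.

```typescript
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical helper: walk any JSON value and collect every object found at
// `article.article_results.result`, whatever the nesting depth.
function findArticleResults(
  node: unknown,
  found: Record<string, unknown>[] = [],
): Record<string, unknown>[] {
  if (Array.isArray(node)) {
    for (const item of node) findArticleResults(item, found);
  } else if (isRecord(node)) {
    const article = node["article"];
    if (isRecord(article)) {
      const results = article["article_results"];
      if (isRecord(results)) {
        const result = results["result"];
        if (isRecord(result)) found.push(result);
      }
    }
    // Keep descending so quoted_status_result (and anything deeper) is covered too.
    for (const value of Object.values(node)) findArticleResults(value, found);
  }
  return found;
}
```

Because the walk does not care where the article sits, the outer and quoted cases return objects of the same shape, and any richer extraction applies to both.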
Side observation: articles are inline in the response
parseTweetArticleByRestId is currently called from the ft sync --gaps flow at src/graphql-bookmarks.ts:1615, fetched per-bookmark via TweetResultByRestId. But result.article is present in tweet results regardless of endpoint — verified in all three live captures via TweetResultByRestId, and the Tweet GraphQL type is shared across endpoints (Bookmarks, BookmarkSearchTimeline, TweetResultByRestId).
If the Bookmarks endpoint response also carries result.article inline (very likely, since same Tweet type), extracting articles during the initial convertTweetToRecord pass would eliminate the per-tweet --gaps fetch. For a user with 100 article bookmarks, that's 100 saved HTTP requests on every sync. Worth verifying on your end with a sample Bookmarks response; not a blocker for this issue.
Suggested fix
Extend articleFromCandidate (src/graphql-bookmarks.ts:1517) to return a richer ArticleContent shape, and add columns/storage for the new fields.
Fields to preserve (8)
| Field | Source path | Notes |
| --- | --- | --- |
| content_state | candidate.content_state | JSON object: {blocks, entityMap}. Preserves rich text, inline links, captions, image positions |
| cover_media | candidate.cover_media | JSON object: {media_id, media_key, media_info: {original_img_url, original_img_width, original_img_height, color_info.palette}} |
| media_entities | candidate.media_entities | JSON array; each entry has the same shape as cover_media (the object that contains media_info). Preserve as-is so future media types (e.g., video) flow through without parser changes |
| previewText | candidate.preview_text | string |
| summaryText | candidate.summary_text | string, optional (absent on some articles) |
| firstPublishedAt | new Date(candidate.metadata.first_published_at_secs * 1000).toISOString() | ISO timestamp |
| modifiedAt | new Date(candidate.lifecycle_state.modified_at_secs * 1000).toISOString() | ISO timestamp |
| articleRestId | candidate.rest_id | string |
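The mapping in the table could be sketched roughly as follows. This is a hedged illustration rather than a patch against articleFromCandidate: ArticleExtras, extractArticleExtras, and the small helpers are hypothetical names, and the tolerance for string-valued timestamps is an assumption (the captures may only ever contain numbers).

```typescript
// Hypothetical shape for the eight proposed fields; names follow the table above.
interface ArticleExtras {
  content_state?: Record<string, unknown>;
  cover_media?: Record<string, unknown>;
  media_entities?: unknown[];
  previewText?: string;
  summaryText?: string;
  firstPublishedAt?: string;
  modifiedAt?: string;
  articleRestId?: string;
}

function asRecord(value: unknown): Record<string, unknown> | undefined {
  return typeof value === "object" && value !== null && !Array.isArray(value)
    ? (value as Record<string, unknown>)
    : undefined;
}

function asString(value: unknown): string | undefined {
  return typeof value === "string" && value.length > 0 ? value : undefined;
}

// Convert epoch seconds to an ISO timestamp. Accepting numeric strings is an
// assumption; missing or malformed values yield undefined instead of throwing.
function secsToIso(value: unknown): string | undefined {
  const secs = typeof value === "string" && value.trim() !== "" ? Number(value) : value;
  if (typeof secs !== "number" || !Number.isFinite(secs)) return undefined;
  const date = new Date(secs * 1000);
  return Number.isNaN(date.getTime()) ? undefined : date.toISOString();
}

function extractArticleExtras(candidate: Record<string, unknown>): ArticleExtras {
  const metadata = asRecord(candidate["metadata"]);
  const lifecycle = asRecord(candidate["lifecycle_state"]);
  const media = candidate["media_entities"];
  return {
    // Nested structures are passed through untouched so new X fields survive.
    content_state: asRecord(candidate["content_state"]),
    cover_media: asRecord(candidate["cover_media"]),
    media_entities: Array.isArray(media) ? media : undefined,
    previewText: asString(candidate["preview_text"]),
    summaryText: asString(candidate["summary_text"]),
    firstPublishedAt: secsToIso(metadata?.["first_published_at_secs"]),
    modifiedAt: secsToIso(lifecycle?.["modified_at_secs"]),
    articleRestId: asString(candidate["rest_id"]),
  };
}
```

Guarding each field separately means an article without summary_text or lifecycle_state still yields the remaining fields.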
Schema choice (three options for maintainer)
A. New scalar columns + JSON columns for nested data. Mirrors the existing article_title / article_text / article_site pattern at src/bookmarks-db.ts:269-271. New columns:
article_preview_text TEXT,
article_summary_text TEXT,
article_first_published_at TEXT,
article_modified_at TEXT,
article_rest_id TEXT,
article_content_state TEXT, -- JSON
article_cover_media TEXT, -- JSON
article_media_entities TEXT, -- JSON
Eight new columns. Each scalar queryable directly; nested data stored as JSON in its own column. Standard ALTER TABLE migration via the existing ensureColumn pattern (bookmarks-db.ts:341-343).
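As a rough sketch of what the option A migration amounts to, the helper below computes which ALTER TABLE statements are still needed, given the columns a table already has. It is a hypothetical stand-in: ft's real ensureColumn signature is not shown in this issue, and the bookmarks table name is taken from the option C snippet further down.

```typescript
// The eight proposed columns from option A. All are TEXT; nested data is stored as JSON.
const PROPOSED_ARTICLE_COLUMNS: readonly string[] = [
  "article_preview_text",
  "article_summary_text",
  "article_first_published_at",
  "article_modified_at",
  "article_rest_id",
  "article_content_state",
  "article_cover_media",
  "article_media_entities",
];

// Hypothetical helper (not ft's ensureColumn): return the ALTER TABLE statements
// needed to add whichever proposed columns are not present yet.
function missingColumnStatements(
  existingColumns: readonly string[],
  table: string = "bookmarks",
): string[] {
  const have = new Set(existingColumns);
  return PROPOSED_ARTICLE_COLUMNS
    .filter((column) => !have.has(column))
    .map((column) => `ALTER TABLE ${table} ADD COLUMN ${column} TEXT`);
}
```

Running it against a database that already has every column returns an empty list, so the migration is safe to repeat.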
B. Single JSON blob. One new column holding everything new:
article_extra_json TEXT,
Most flexible for future X-side additions (no migration needed when X adds new fields). Less queryable: any new-field query would need JSON_EXTRACT. Follows the pattern of quoted_tweet_json TEXT (bookmarks-db.ts:266).
C. Normalized articles table. Separate table keyed by rest_id:
CREATE TABLE articles (
rest_id TEXT PRIMARY KEY,
title TEXT, plain_text TEXT, preview_text TEXT, summary_text TEXT,
content_state_json TEXT, cover_media_json TEXT, media_entities_json TEXT,
first_published_at TEXT, modified_at TEXT
);
-- bookmarks gets a foreign key to it
ALTER TABLE bookmarks ADD COLUMN article_rest_id TEXT REFERENCES articles(rest_id);
Cleanest long-term: same article quoted in multiple bookmarks dedupes naturally (relevant for the quoted-article case above where the same 2049760065427574784 article could appear via different bookmarks). Biggest migration.
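The deduplication behaviour of option C can be sketched in memory as below. All names are hypothetical, and the rule for choosing between two copies of the same article (keep the one with the later modified_at) is an assumption made for the example, not something the captures establish.

```typescript
interface ArticleRow { rest_id: string; title?: string; modified_at?: string }
interface BookmarkWithArticle { bookmarkId: string; article?: ArticleRow }
interface ArticleLink { bookmarkId: string; articleRestId: string }

// Split bookmark records into unique article rows plus bookmark-to-article links,
// mirroring an articles table keyed by rest_id and a reference column on bookmarks.
function normalizeArticles(
  records: BookmarkWithArticle[],
): { articles: ArticleRow[]; links: ArticleLink[] } {
  const byRestId = new Map<string, ArticleRow>();
  const links: ArticleLink[] = [];
  for (const record of records) {
    const article = record.article;
    if (!article) continue;
    const previous = byRestId.get(article.rest_id);
    // Assumed policy: keep the most recently modified copy. ISO 8601 UTC strings
    // produced by toISOString() order correctly under plain string comparison.
    if (!previous || (article.modified_at ?? "") > (previous.modified_at ?? "")) {
      byRestId.set(article.rest_id, article);
    }
    links.push({ bookmarkId: record.bookmarkId, articleRestId: article.rest_id });
  }
  return { articles: Array.from(byRestId.values()), links };
}
```

Two bookmarks that reach article 2049760065427574784 by different routes would then produce one article row and two links.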
I lean toward C because of the deduplication win, but each has tradeoffs and you know the migration appetite better.
Things to flag
- is_grok_summary_eligible skipped: present on all three captures (always true under current feature flags), but with no clear consumer use case. Safe to leave dropped unless a Grok-related feature ever lands.
- No video example in captures: all 4 of RasmusNielsen's inline media items have __typename: "ApiImage". media_entities may also support ApiVideo (X has video articles); preserving the array as JSON handles this transparently, since whatever shape X returns flows through.
- Voice-over not requested: ft's TWEET_RESULT_FIELD_TOGGLES has withArticleVoiceOver: false. Out of scope for this issue.
- Local fetching of article images is a separate concern. This issue proposes preserving the URLs (cover_media + media_entities). Actually downloading the image bytes to ~/.ft-bookmarks/media/ is a follow-up issue, because bookmark-media.ts would need to recognize article image URLs as fetch targets. It will be filed separately when this lands.