Summary
The /alternative_sentences/<user_word_id> endpoint returns bookmarks where t_token_i does not always point to bookmark.from (i.e. the target word) inside the context_tokenized that's shipped alongside it. Any frontend that uses the position to highlight or restore the bookmark ends up attaching it to the wrong word in the sentence.
How it surfaces
The web frontend recently moved exercise highlighting from regex-based string matching to the same previousBookmarks / updateTokensWithBookmarks pathway ArticleReader uses for past_bookmarks. That path looks up the target token by (sent_i, token_i) and attaches the bookmark there.
For the current exercise context, positions are accurate and the new path works great: the cloze word renders with the chip-above + dotted-orange highlight + tap-to-pronounce, consistent with reading view.
For alternative-sentence contexts (reached via the left/right chevron navigation, which calls /alternative_sentences/<user_word_id> under the hood), some bookmarks come back with t_token_i pointing to a different word than bookmark.from. The frontend then either:
- attaches the bookmark to the wrong word (e.g. user sees "mené" highlighted instead of "poussière"), or
- silently fails the restoration and the target word renders plain.
Concrete examples observed
- bookmark.from = poussière, t_total_token: 1, t_token_i: 12, but tokens[12] in the returned context_tokenized is mené.
- bookmark.from = Geheimnis, t_total_token: 1, t_token_i (some value), but the target token at that position isn't Geheimnis.
Front-end console diagnostic confirmed: after bookmark-restoration, the Word matching findClozeWordIds(...) ends up with translation: null and the chip never appears.
Expected behavior
For every bookmark b returned by /alternative_sentences:
context_tokenized[para][b.t_sentence_i][b.t_token_i ... b.t_token_i + b.t_total_token - 1]
should concatenate (case-insensitively, modulo punctuation) to b.from.
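The invariant above could be checked with something like the following sketch. It assumes each token in context_tokenized is a dict with a "text" key and that the bookmark is a plain dict carrying the fields named in this issue; the function names and the para_i parameter are hypothetical, not taken from the codebase.

```python
import unicodedata


def normalize(text: str) -> str:
    """Casefold and drop whitespace and Unicode punctuation."""
    text = unicodedata.normalize("NFC", text).casefold()
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def bookmark_matches_tokens(bookmark: dict, context_tokenized: list, para_i: int = 0) -> bool:
    """Return True if the tokens at the bookmark's position spell bookmark["from"].

    Assumed shapes (not confirmed by the source):
    context_tokenized[para][sentence][token] with token["text"] holding the word.
    """
    sent_i = bookmark["t_sentence_i"]
    start = bookmark["t_token_i"]
    end = start + bookmark["t_total_token"]  # exclusive upper bound

    if not (0 <= para_i < len(context_tokenized)):
        return False
    paragraph = context_tokenized[para_i]
    if not (0 <= sent_i < len(paragraph)):
        return False
    sentence = paragraph[sent_i]
    if start < 0 or end > len(sentence) or start >= end:
        return False

    span = "".join(token["text"] for token in sentence[start:end])
    return normalize(span) == normalize(bookmark["from"])
```

Run against every bookmark in an /alternative_sentences response (in a backend test, for instance), a check like this would flag cases such as the poussière one before they reach the frontend.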
Suggested fix
In generated_examples.py, after generating/fetching the alternative example, recompute t_token_i (and t_total_token) against the exact same tokenization that gets serialized into context_tokenized for the response. The position computed against one tokenizer/run shouldn't be served alongside context_tokenized from another.
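A minimal sketch of what the recomputation might look like, assuming the same token shape as above (a dict with a "text" key) and that it runs on the exact context_tokenized structure about to be serialized. find_token_span is a hypothetical helper; nothing here reflects how generated_examples.py is actually written.

```python
import unicodedata


def normalize(text: str) -> str:
    """Casefold and drop whitespace and Unicode punctuation."""
    text = unicodedata.normalize("NFC", text).casefold()
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def find_token_span(target: str, context_tokenized: list):
    """Locate target in the tokens that will actually be shipped.

    Returns (para_i, sent_i, token_i, total_token) for the first match,
    or None if the target does not appear. Assumed token shape: {"text": ...}.
    """
    wanted = normalize(target)
    if not wanted:
        return None
    for para_i, paragraph in enumerate(context_tokenized):
        for sent_i, sentence in enumerate(paragraph):
            for start in range(len(sentence)):
                # Never start a span on a punctuation-only token.
                if not normalize(sentence[start]["text"]):
                    continue
                joined = ""
                for end in range(start, len(sentence)):
                    joined += normalize(sentence[end]["text"])
                    if joined == wanted:
                        return para_i, sent_i, start, end - start + 1
                    if not wanted.startswith(joined):
                        break  # this start cannot lead to a match
    return None
```

The returned values would then overwrite t_sentence_i, t_token_i and t_total_token before the response is built. This takes the first occurrence, so a sentence containing the target word twice would still need a tie-break rule.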
Workaround in frontend (not great)
Detect failed restoration and fall back to a string-search lookup against bookmark.from. This papers over the data issue but makes every consumer reinvent the same fallback. Better to fix the data at the source.
Impact
User-visible: when navigating with the chevrons in an exercise, some alternative contexts either lose the highlight + chip or show them on the wrong word, which looks inconsistent with the original context.