/alternative_sentences: bookmark t_token_i can be stale against returned context_tokenized #618

@mircealungu

Summary

The /alternative_sentences/<user_word_id> endpoint returns bookmarks where t_token_i does not always point to bookmark.from (i.e. the target word) inside the context_tokenized that's shipped alongside it. Any frontend that uses the position to highlight or restore the bookmark ends up attaching it to the wrong word in the sentence.

How it surfaces

The web frontend recently moved exercise highlighting from regex-based string matching to the same previousBookmarks / updateTokensWithBookmarks pathway ArticleReader uses for past_bookmarks. That path looks up the target token by (sent_i, token_i) and attaches the bookmark there.

For the current exercise context, positions are accurate and the new path works great: the cloze word renders with the chip-above + dotted-orange highlight + tap-to-pronounce, consistent with reading view.

For alternative-sentence contexts (reached via the left/right chevron navigation, which calls /alternative_sentences/<user_word_id> under the hood), some bookmarks come back with t_token_i pointing to a different word than bookmark.from. The frontend then either:

  • attaches the bookmark to the wrong word (e.g. user sees "mené" highlighted instead of "poussière"), or
  • silently fails the restoration and the target word renders plain.

Concrete examples observed

  • bookmark.from = poussière, t_total_token: 1, t_token_i: 12 — but tokens[12] in the returned context_tokenized is mené.
  • bookmark.from = Geheimnis, t_total_token: 1, t_token_i (some value) — but the target token at that position isn't Geheimnis.

A front-end console diagnostic confirmed this: after bookmark restoration, the Word matching findClozeWordIds(...) ends up with translation: null, so the chip never appears.

Expected behavior

For every bookmark b returned by /alternative_sentences:

context_tokenized[para][b.t_sentence_i][b.t_token_i ... b.t_token_i + b.t_total_token - 1]

should concatenate (case-insensitively, modulo punctuation) to b.from.
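
A minimal Python sketch of that check, e.g. for a regression test on the endpoint's response. The data shapes are assumptions, not confirmed from the codebase: context_tokenized is taken to be a list of paragraphs, each a list of sentences, each a list of token dicts with a "text" key, and the paragraph index is passed in explicitly. normalize and bookmark_matches_context are hypothetical helper names.

```python
import unicodedata


def normalize(text):
    """Casefold and drop Unicode punctuation so 'Poussière.' == 'poussière'."""
    text = unicodedata.normalize("NFC", text)
    kept = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    return kept.strip().casefold()


def bookmark_matches_context(bookmark, context_tokenized, para_i=0):
    """Return True if the bookmark's token span spells out bookmark['from'].

    Assumed shapes (hypothetical): bookmark is a dict with 'from',
    't_sentence_i', 't_token_i' and 't_total_token'; each token is a dict
    with a 'text' key.
    """
    try:
        sentence = context_tokenized[para_i][bookmark["t_sentence_i"]]
    except IndexError:
        return False

    start = bookmark["t_token_i"]
    end = start + bookmark["t_total_token"]
    if start < 0 or end > len(sentence):
        return False

    # Compare word by word, ignoring tokens that are pure punctuation.
    span_words = [normalize(tok["text"]) for tok in sentence[start:end]]
    target_words = [normalize(w) for w in bookmark["from"].split()]
    return [w for w in span_words if w] == [w for w in target_words if w]
```

Running this over every bookmark in an /alternative_sentences response would flag the stale cases above: the position either falls outside the sentence or lands on a different word.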

Suggested fix

In generated_examples.py, after generating/fetching the alternative example, recompute t_token_i (and t_total_token) against the exact same tokenization that gets serialized into context_tokenized for the response. A position computed against one tokenizer or tokenization run shouldn't be served alongside a context_tokenized produced by another.
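
One way the recomputation could look, as a sketch only. It does not reflect the actual code in generated_examples.py. It assumes the same hypothetical shape for context_tokenized (paragraphs, then sentences, then token dicts with a "text" key), and locate_bookmark is an invented helper name. It returns the first match, so a word that occurs more than once in the context would need an extra disambiguation rule.

```python
import unicodedata


def _normalize(text):
    """Casefold and drop Unicode punctuation before comparing words."""
    text = unicodedata.normalize("NFC", text)
    kept = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    return kept.strip().casefold()


def locate_bookmark(word, context_tokenized):
    """Find `word` in the tokenization that will actually be serialized.

    Returns (paragraph_i, sentence_i, token_i, total_tokens) for the first
    match, or None if the word does not occur in the context.
    """
    target = [w for w in (_normalize(part) for part in word.split()) if w]
    if not target:
        return None
    n = len(target)

    for para_i, paragraph in enumerate(context_tokenized):
        for sent_i, sentence in enumerate(paragraph):
            texts = [_normalize(tok["text"]) for tok in sentence]
            # Slide a window of n tokens over the sentence.
            for start in range(len(texts) - n + 1):
                if texts[start:start + n] == target:
                    return (para_i, sent_i, start, n)
    return None
```

The key point is that the search runs over the same token list that is returned to the client, so the position and the tokens cannot drift apart.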

Workaround in frontend (not great)

Detect failed restoration and fall back to a string-search lookup against bookmark.from. This papers over the data issue but makes every consumer reinvent the same fallback. Better to fix the data at the source.

Impact

User-visible: when navigating with the chevrons in an exercise, some alternative contexts don't get the highlight + chip, which looks inconsistent with the original context.
