chore(deps): bump tokenizers from 0.20.4 to 0.23.1 #97

Merged
Pushkinist merged 2 commits into main from dependabot/cargo/tokenizers-0.23.1 on Jun 17, 2026
dependabot Bot commented on behalf of github on Jun 15, 2026

Bumps tokenizers from 0.20.4 to 0.23.1.

Release notes

Sourced from tokenizers's releases.

Release v0.23.1

TL;DR

tokenizers 0.23.1 is the first proper stable release in the 0.23 line — 0.23.0 only ever shipped as rc0 because the release pipeline itself was broken (Node side hadn't shipped multi-platform binaries since 2023, Python side was on pyo3 0.27 without free-threaded support). 0.23.1 is the version where everything actually goes out the door together: full Node multi-platform wheels for the first time in years, Python 3.14 (regular and free-threaded 3.14t), full type hints for every Python class, and a stack of measurable perf wins on the BPE / added-vocab hot paths.

There is no functional 0.23.0 published — we tag 0.23.1 directly so users don't accidentally pull a never-shipped version.


🚨 Breaking changes

  • Drop Python 3.9 (#1952): requires-python = ">=3.10"; 3.9 users stay on 0.22.x.
  • add_tokens normalizes content at insertion (#1995): a re-saved tokenizer.json may differ in the added_tokens block. Existing files load unchanged.
  • Type stubs are precise (#1928, #1997): methods that returned Any now return real types, so mypy --strict may surface previously hidden errors. The stub layout also moved from tokenizers/<sub>/__init__.pyi to tokenizers/<sub>.pyi. This breaks the surface of some processors, such as RobertaProcessing's __init__.
  • 3.14t-only: setters/getters return PyResult<T> because of Arc<RwLock<Tokenizer>>; a poisoned lock surfaces as PyException instead of a panic.
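
The last bullet describes a general Rust pattern: state shared behind Arc<RwLock<...>> can report a poisoned lock as an error value instead of panicking. The sketch below illustrates that pattern with the standard library only. It is not the bindings' actual code: `Tokenizer`, `SharedTokenizer`, `LockPoisoned`, and `poison_for_demo` are hypothetical stand-ins, and the real bindings return PyResult<T> rather than this plain Result.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

/// Hypothetical stand-in for the wrapped tokenizer state.
#[derive(Debug, Default)]
pub struct Tokenizer {
    pub padding: Option<usize>,
}

/// Hypothetical error value returned when the lock is poisoned.
#[derive(Debug, PartialEq)]
pub struct LockPoisoned;

/// Hypothetical wrapper that shares one tokenizer between threads.
pub struct SharedTokenizer {
    inner: Arc<RwLock<Tokenizer>>,
}

impl SharedTokenizer {
    pub fn new() -> Self {
        SharedTokenizer {
            inner: Arc::new(RwLock::new(Tokenizer::default())),
        }
    }

    /// Getter: a poisoned lock becomes `Err`, not a panic.
    pub fn padding(&self) -> Result<Option<usize>, LockPoisoned> {
        let guard = self.inner.read().map_err(|_| LockPoisoned)?;
        Ok(guard.padding)
    }

    /// Setter: same mapping on the write path.
    pub fn set_padding(&self, value: Option<usize>) -> Result<(), LockPoisoned> {
        let mut guard = self.inner.write().map_err(|_| LockPoisoned)?;
        guard.padding = value;
        Ok(())
    }

    /// Demo helper: poisons the lock by panicking in another thread
    /// while that thread holds the write guard.
    pub fn poison_for_demo(&self) {
        let inner = Arc::clone(&self.inner);
        let _ = thread::spawn(move || {
            let _guard = inner.write().unwrap();
            panic!("simulated panic while holding the write lock");
        })
        .join();
    }
}
```

Callers of such a wrapper handle the `Err` case like any other failure, which is why the getters and setters gain a fallible return type.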

⚡ Performance — measured locally on this Mac, not lifted from PRs

Run with cargo bench --bench <name> -- --save-baseline v0_22_2 on v0.22.2, then --baseline v0_22_2 on v0.23.1. Numbers are point-in-time wall clock on a single laptop; relative deltas are what matters, absolute numbers will differ on CI hardware.

Added-vocabulary deserialize — the headline win (#1995, #1999)

The added_vocab_deserialize benchmark, reworked in "bench: improve added_vocab_deserialize to reflect real-world workloads" (#2000), is now representative of how transformers actually loads tokenizer.json files. The combined effect of daachorse for the matching automaton plus the normalize-on-insert refactor is enormous on this workload:

| benchmark | v0.22.2 | v0.23.1 | change |
|---|---|---|---|
| 100k tokens, special, no norm | ~410 ms | 248 ms | −40% |
| 100k tokens, non-special, no norm | ~7.1 s | 273 ms | −96% |
| 100k tokens, special, NFKC | ~395 ms | 235 ms | −40% |
| 100k tokens, non-special, NFKC | ~7.4 s | 290 ms | −96% |
| 400k tokens, special, no norm | ~15 s | 980 ms | −94% |

Real-world impact: loading a Llama-3-style tokenizer with a large set of added tokens dropped from "noticeable pause" to "instant".
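
For readers who want to check the change column, a minimal sketch of the arithmetic follows. The function names are illustrative and not part of the benchmark suite. Baseline figures marked with ~ are rounded, so recomputing from the printed values can land a point away from the printed change.

```rust
/// Relative change from `before_ms` to `after_ms`, in percent.
/// Negative values mean the newer version is faster.
pub fn percent_change(before_ms: f64, after_ms: f64) -> f64 {
    (after_ms - before_ms) / before_ms * 100.0
}

/// Same value rounded to the nearest whole percent, as the tables print it.
pub fn rounded_change(before_ms: f64, after_ms: f64) -> i64 {
    percent_change(before_ms, after_ms).round() as i64
}
```

With the table's values, `rounded_change(7100.0, 273.0)` gives -96 for the 100k non-special, no-norm row, and `rounded_change(410.0, 248.0)` gives -40 for the 100k special, no-norm row.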

BPE encode

| benchmark | v0.22.2 | v0.23.1 | change |
|---|---|---|---|
| BPE GPT2 encode batch, no cache | 530 ms | 446 ms | −16% |
| BPE GPT2 encode batch (cached) | 690 ms | 685 ms | noise |
| BPE GPT2 encode (single) | 1.95 s | 1.94 s | noise |
| BPE Train (small) | 32.6 ms | 31.5 ms | −3% |
| BPE Train (big) | 1.01 s | 988 ms | −2% |

The BPE per-thread cache PR (#2028) shows much larger wins on highly-parallel workloads (+47–62% at 88+ threads on a server box, per the PR's own measurements on Vera). Single-thread batch numbers above are flat or slightly improved because cache-hit overhead was already low without contention.

Llama-3 encode

... (truncated)

Commits

dependabot Bot added the dependencies (Pull requests that update a dependency file) and rust (Pull requests that update rust code) labels on Jun 15, 2026
Bumps [tokenizers](https://github.com/huggingface/tokenizers) from 0.20.4 to 0.23.1.
- [Release notes](https://github.com/huggingface/tokenizers/releases)
- [Changelog](https://github.com/huggingface/tokenizers/blob/main/RELEASE.md)
- [Commits](huggingface/tokenizers@v0.20.4...v0.23.1)

---
updated-dependencies:
- dependency-name: tokenizers
  dependency-version: 0.23.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
dependabot Bot force-pushed the dependabot/cargo/tokenizers-0.23.1 branch from b6ed9fc to 54d0d45 on June 17, 2026 03:21

tokenizers 0.21+ changed two APIs the source touched:

- Tokenizer::add_special_tokens now takes the tokens by value
  (impl IntoIterator<Item = AddedToken>) and returns Result<usize>.
  The TTS loader now consumes the local Vec and propagates the error
  via TtsError::Tokenizer instead of passing &Vec and discarding the
  return.
- WordLevelBuilder::vocab now requires ahash::AHashMap, not std HashMap.
  The decode-loop test fixture builds its tiny WordLevel tokenizer from
  a HuggingFace tokenizer JSON string via Tokenizer::from_str instead,
  which is behaviour-identical and adds no new dependency.

encode(input, add_special_tokens) and decode(ids, skip_special_tokens)
signatures and defaults are unchanged across 0.20->0.23, so all
encode/decode call sites keep their existing add-special / skip-special
semantics.
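
The first change above (tokens passed by value, a Result returned and propagated) can be sketched as follows. This is a self-contained illustration of the call-site shape only, not the project's loader and not the tokenizers crate itself: `AddedToken`, `Tokenizer`, and `TokenizerError` are simplified stand-ins modelled on the signature quoted in the commit message, the empty-content error is invented for the demo, and `register_special_tokens` is a hypothetical helper.

```rust
/// Simplified stand-in for the crate's `AddedToken`.
#[derive(Debug, Clone, PartialEq)]
pub struct AddedToken {
    pub content: String,
    pub special: bool,
}

/// Stand-in error type for the tokenizer.
#[derive(Debug, PartialEq)]
pub struct TokenizerError(pub String);

/// Stand-in tokenizer that only models the signature shape.
#[derive(Debug, Default)]
pub struct Tokenizer {
    special: Vec<String>,
}

impl Tokenizer {
    /// Takes the tokens by value and returns how many were newly added.
    pub fn add_special_tokens<I>(&mut self, tokens: I) -> Result<usize, TokenizerError>
    where
        I: IntoIterator<Item = AddedToken>,
    {
        let mut added = 0;
        for token in tokens {
            // Invented failure mode, only so the demo has an error path.
            if token.content.is_empty() {
                return Err(TokenizerError("empty token content".to_string()));
            }
            if !self.special.contains(&token.content) {
                self.special.push(token.content);
                added += 1;
            }
        }
        Ok(added)
    }
}

/// Loader-side error, mirroring the `TtsError::Tokenizer` variant
/// named in the commit message.
#[derive(Debug, PartialEq)]
pub enum TtsError {
    Tokenizer(TokenizerError),
}

impl From<TokenizerError> for TtsError {
    fn from(err: TokenizerError) -> Self {
        TtsError::Tokenizer(err)
    }
}

/// Hypothetical helper: builds a local Vec, moves it into the call,
/// and propagates any failure with `?` instead of discarding the result.
pub fn register_special_tokens(
    tokenizer: &mut Tokenizer,
    names: &[&str],
) -> Result<usize, TtsError> {
    let tokens: Vec<AddedToken> = names
        .iter()
        .map(|name| AddedToken {
            content: name.to_string(),
            special: true,
        })
        .collect();
    // The Vec is consumed here; the old call shape borrowed it (`&tokens`).
    let added = tokenizer.add_special_tokens(tokens)?;
    Ok(added)
}
```

The `From` impl is what lets `?` convert the tokenizer's error into the loader's own error type at the call site.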
Pushkinist merged commit 6066a49 into main on Jun 17, 2026
2 checks passed
Pushkinist deleted the dependabot/cargo/tokenizers-0.23.1 branch on June 17, 2026 03:50