A blazing fast Byte Pair Encoding (BPE) tokenizer library written in Rust with Python bindings.
📦 16,000+ downloads on PyPI
To validate efficiency, FIBpeTokenizer was benchmarked against Hugging Face’s tokenizer across vocabulary sizes from 4,000 to 90,000 tokens.
FIBpeTokenizer shows highly stable near-linear scaling due to Rust-level memory control and parallel processing.
At a vocabulary size of 4,000, training completes in 5.15 seconds. When the vocabulary is scaled 22.5x to 90,000 tokens, training time rises only to 14.53 seconds (about 2.8x), demonstrating strong computational efficiency under load.
⚔️ Comparison vs Hugging Face Tokenizer
- Standard range (4k – 20k vocab): Performance is extremely close. At 10,000 vocab, FIBpeTokenizer runs in 5.89s vs HF’s 5.62s (about 5% slower), effectively matching the mature Rust backend of Hugging Face tokenizers.
- Large scale (30k – 90k vocab): Hugging Face benefits from highly optimized memory management and scales better (9.32s at 90k). FIBpeTokenizer reaches 14.53s, roughly 1.56x slower, but remains competitive and production-viable, especially for a lightweight Rust-first implementation.
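As a quick sanity check, the ratios behind the figures quoted above can be recomputed directly. The timings below are copied from this section; the variable names are illustrative only.

```python
# Timings in seconds, copied from the benchmark text above.
fib_times = {4_000: 5.15, 10_000: 5.89, 90_000: 14.53}
hf_times = {10_000: 5.62, 90_000: 9.32}

# Growth of vocabulary size versus growth of FIBpeTokenizer training time.
vocab_growth = 90_000 / 4_000
time_growth = fib_times[90_000] / fib_times[4_000]

# Relative slowdown versus Hugging Face at the two quoted comparison points.
slowdown_10k = fib_times[10_000] / hf_times[10_000]
slowdown_90k = fib_times[90_000] / hf_times[90_000]

print(f"vocab grew {vocab_growth:.1f}x, training time grew {time_growth:.2f}x")
print(f"vs HF at 10k: {slowdown_10k:.2f}x, at 90k: {slowdown_90k:.2f}x")
```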
- 🔥 Blazing Fast: Written in Rust with parallel processing support
- 🐍 Python Support: Use it in Python via PyO3 bindings
- 🎯 Flexible Pre-tokenization: Choose between whitespace or punctuation-based splitting
- 🔖 Special Token Handling: Built-in support for special tokens like <pad>, <mask>, etc.
- 💾 Save/Load Models: Train once, reuse anywhere
- 🔧 Customizable: Configure vocabulary size, special tokens, and more
Add this to your Cargo.toml:
[dependencies]
fibpetokenizer = "0.1.0"
For Python, install from PyPI:
pip install fibpetokenizer
Rust usage:
use fibpetokenizer::{BpeTokenizer, PreTokenization, SpecialTokenRemovalMethod};
fn main() {
    // Define special tokens
    let special_tokens = vec![
        "<pad>".to_string(),
        "<mask>".to_string(),
        "<unk>".to_string(),
    ];
    // Create and train tokenizer
    let mut tokenizer = BpeTokenizer::new(
        "corpus.txt",                           // Input text file
        10000,                                  // Target vocabulary size
        PreTokenization::Punctuation,           // Pre-tokenization strategy
        special_tokens,                         // Special tokens
        SpecialTokenRemovalMethod::AhoCorasick, // Special token removal method
        true,                                   // Save model after training
        Some("output_dir"),                     // Output directory
    );
    // Train the tokenizer
    tokenizer.train().unwrap();
    // Encode text
    let text = "Hello, world! This is a test.";
    let encoder = tokenizer.encode(text).unwrap();
    println!("Tokens: {:?}", encoder.tokens);
    println!("Token IDs: {:?}", encoder.ids);
    println!("Token Types: {:?}", encoder.token_types);
    // Decode back to text
    let decoded = tokenizer.decode(&encoder.ids).unwrap();
    println!("Decoded: {}", decoded);
    // Load a pretrained tokenizer
    let loaded_tokenizer = BpeTokenizer::new_from_pretrained("output_dir");
}
Python usage:
from fibpetokenizer import (
    BpeTokenizer,
    PreTokenization,
    SpecialTokenRemovalMethod,
)
# Define special tokens
special_tokens = ["<pad>", "<mask>", "<unk>"]
# Create tokenizer
tokenizer = BpeTokenizer(
    input_path="corpus.txt",
    target_vocab_size=10000,
    pretokenization_type=PreTokenization.punctuation(),
    special_tokens=special_tokens,
    special_token_removal_method=SpecialTokenRemovalMethod.aho_corasick(),
    save_model=True,
    output_dir="output_dir",
)
# Train the tokenizer
tokenizer.train()
# Encode text
text = "Hello, world! This is a test."
encoder = tokenizer.encode(text)
print("Tokens:", encoder.tokens)
print("Token IDs:", encoder.ids)
print("Token Types:", encoder.token_types)
# Decode back to text
decoded = tokenizer.decode(encoder.ids)
print("Decoded:", decoded)
# Load a pretrained tokenizer
loaded_tokenizer = BpeTokenizer.from_pretrained("output_dir")
BpeTokenizer: The main tokenizer class.
BpeTokenizer::new(
    input_path: &str,
    target_vocab_size: usize,
    pretokenization_type: PreTokenization,
    special_tokens: Vec<String>,
    special_token_removal_method: SpecialTokenRemovalMethod,
    save_model: bool,
    output_dir: Option<&str>
) -> Self
- train(&mut self) -> Result<(), TokenizerError>: Train the tokenizer on the corpus
- encode(&self, text: &str) -> Result<Encoder, TokenizerError>: Encode text into tokens and IDs
- decode(&self, ids: &Vec<u32>) -> Result<String, TokenizerError>: Decode token IDs back to text
- new_from_pretrained(files_path: &str) -> Self: Load a pretrained tokenizer
- get_id_by_token(&self, token: String) -> Result<u32, TokenizerError>: Get ID for a token
- get_token_by_id(&self, id: u32) -> Result<String, TokenizerError>: Get token for an ID
Pre-tokenization strategies:
- PreTokenization::Whitespace: Split on whitespace
- PreTokenization::Punctuation: Split on whitespace and punctuation
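To illustrate the difference between the two strategies, here is a minimal Python sketch using regular expressions. It is an assumption about how this kind of splitting typically behaves, not the library's actual implementation; `split_whitespace` and `split_punctuation` are hypothetical helper names.

```python
import re

def split_whitespace(text: str) -> list[str]:
    """Split on runs of whitespace only; punctuation stays attached to words."""
    return text.split()

def split_punctuation(text: str) -> list[str]:
    """Split on whitespace and emit each punctuation character as its own piece."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

text = "Hello, world! This is a test."
print(split_whitespace(text))
print(split_punctuation(text))
```

With whitespace splitting, "Hello," stays a single piece, so the comma becomes part of the units that BPE merges; punctuation-based splitting separates it out before training.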
Methods for removing special tokens from the training corpus:
- SpecialTokenRemovalMethod::Simple: Simple string replacement
- SpecialTokenRemovalMethod::AhoCorasick: Fast multi-pattern search using Aho-Corasick algorithm
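The trade-off between the two methods can be sketched in Python. The first function makes one pass over the text per special token; the second makes a single pass that matches all tokens at once. Note that the second uses a compiled regular-expression alternation as a stand-in for multi-pattern search: it is not Aho-Corasick itself, and neither function is the library's code. Both function names are hypothetical.

```python
import re

def remove_simple(text: str, special_tokens: list[str]) -> str:
    """One pass over the text per special token, via plain string replacement."""
    for token in special_tokens:
        text = text.replace(token, "")
    return text

def remove_multi_pattern(text: str, special_tokens: list[str]) -> str:
    """Single pass over the text that matches all special tokens at once."""
    # Longest tokens first so a longer token wins over any token that is its prefix.
    ordered = sorted(special_tokens, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered))
    return pattern.sub("", text)

specials = ["<pad>", "<mask>", "<unk>"]
line = "the <mask> sat on the mat<pad><pad>"
print(repr(remove_simple(line, specials)))
print(repr(remove_multi_pattern(line, specials)))
```

The per-token approach costs roughly one scan of the corpus for every special token, which is why a multi-pattern matcher tends to pay off as the token list or the corpus grows.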
Encoder: The result of encoding text.
- original_text: String: The original input text
- tokens: Vec<String>: The tokenized representation
- ids: Vec<u32>: Token IDs
- token_types: Vec<TokenType>: Type of each token (WORD, SUBWORD, or SPECIALTOKEN)
get_token_type(&self, token: &str) -> Result<TokenType, TokenizerError>: Get the type of a specific token
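To make the relationship between `tokens` and `ids` concrete, here is a small Python sketch of encoding one word with a fixed list of merges and decoding it again. The merge list, the vocabulary, and the `apply_merges` helper are hypothetical and chosen only for illustration; token types are left out because the classification rule is not described here.

```python
def apply_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split a word into characters, then apply merges in the order they were learned."""
    symbols = list(word)
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)  # Replace the adjacent pair with the merged symbol.
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge rules and vocabulary (token -> ID).
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
vocab = {tok: i for i, tok in enumerate(["e", "l", "o", "r", "w", "lo", "low", "er"])}
id_to_token = {i: tok for tok, i in vocab.items()}

tokens = apply_merges("lower", merges)               # parallel to `ids` below
ids = [vocab[tok] for tok in tokens]
decoded = "".join(id_to_token[i] for i in ids)       # decoding concatenates the tokens

print(tokens, ids, decoded)
```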
BPE (Byte Pair Encoding) is a data compression technique that iteratively merges the most frequent pair of bytes (or characters) in a sequence. This library:
- Pre-tokenizes the input text based on the selected strategy
- Builds an initial vocabulary from individual characters
- Iteratively merges the most frequent adjacent token pairs
- Stops when the target vocabulary size is reached
- Saves the vocabulary, merge rules, and configuration for later use
Performance features:
- Parallel processing using Rayon for fast training
- Efficient special token removal using Aho-Corasick algorithm
- Optimized data structures for merge operations
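The training steps listed earlier (build a character vocabulary, repeatedly merge the most frequent adjacent pair, stop at the target size) can be sketched in a few lines of Python. This is a deliberately naive, single-threaded illustration under stated assumptions, not the library's implementation: `train_bpe` is a hypothetical name, the input is assumed to be already pre-tokenized into words, and ties between equally frequent pairs are broken lexicographically, which the library may handle differently.

```python
from collections import Counter

def train_bpe(words: list[str], target_vocab_size: int):
    """Learn BPE merges from pre-tokenized words until the vocabulary is large enough."""
    # Each distinct word becomes a tuple of single-character symbols with a count.
    word_counts = Counter(tuple(w) for w in words)
    vocab = sorted({ch for w in word_counts for ch in w})
    merges = []

    while len(vocab) < target_vocab_size:
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in word_counts.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # Every word is a single symbol; nothing is left to merge.

        # Most frequent pair wins; ties go to the lexicographically smallest pair.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merged = best[0] + best[1]
        merges.append(best)
        vocab.append(merged)

        # Rewrite every word, replacing occurrences of the pair with the merged symbol.
        new_counts = Counter()
        for symbols, count in word_counts.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_counts[tuple(out)] += count
        word_counts = new_counts

    return vocab, merges

corpus = ["low", "low", "lower", "lowest", "new", "newer"]
vocab, merges = train_bpe(corpus, target_vocab_size=12)
print(merges)
```

Recounting all pairs on every iteration is the expensive part of this naive loop, which is where parallel counting and specialized merge data structures make a difference on a real corpus.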
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT License (LICENSE-MIT)
at your option.
Developed with ❤️ using Rust and PyO3.