Describe the bug
If you run tokenize_and_concatenate on a tokenizer with a known model_max_length, it warns that you are tokenizing a sequence longer than the model's maximum, even though the function handles this correctly by wrapping the tokens into max_length-sized chunks.
Code example
# %%
from datasets import Dataset
from transformer_lens.utils import tokenize_and_concatenate
from transformers import AutoTokenizer
model_name = "google/flan-t5-xl"
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(tokenizer.model_max_length)
# 512
ds = Dataset.from_dict({"text": ["x" * 100000]})
res = tokenize_and_concatenate(ds, tokenizer, max_length=tokenizer.model_max_length)
# Warns:
# Token indices sequence length is longer than the specified maximum sequence length for this model (2502 > 512). Running this sequence through the model will result in indexing errors
print(res["tokens"].shape)
# torch.Size([97, 512])
print(tokenizer.deprecation_warnings)
# {'sequence-length-is-longer-than-the-specified-maximum': True}

My recommendation is that you set tokenizer.deprecation_warnings to suppress the warning.
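For reference, a minimal sketch of that workaround applied on the caller side, assuming (based on the deprecation_warnings dict printed above) that the tokenizer skips the warning once the 'sequence-length-is-longer-than-the-specified-maximum' key is already set to True:

# Sketch only: pre-mark the length warning as already issued so the tokenizer
# does not emit it. The key name is taken from the printed dict above; whether
# tokenize_and_concatenate should do this internally is what this issue proposes.
from datasets import Dataset
from transformers import AutoTokenizer
from transformer_lens.utils import tokenize_and_concatenate

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
tokenizer.deprecation_warnings["sequence-length-is-longer-than-the-specified-maximum"] = True

ds = Dataset.from_dict({"text": ["x" * 100000]})
res = tokenize_and_concatenate(ds, tokenizer, max_length=tokenizer.model_max_length)
print(res["tokens"].shape)  # still torch.Size([97, 512]), but without the length warning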
System Info
Describe the characteristics of your environment:
- Describe how transformer_lens was installed: uv
- What OS are you using? Linux
- Python version (We support 3.7--3.10 currently)
- transformers v4.45.2
- transformer-lens v2.15.4
Checklist
- I have checked that there is no similar issue in the repo (required)