[Bug Report] tokenize_and_concatenate issues spurious warnings #1134

@BorisTheBrave

Description

Describe the bug

If you run tokenize_and_concatenate with a tokenizer that has a known model length, the tokenizer warns that you are tokenizing too long a sequence, even though the function handles this correctly by wrapping the tokens into chunks of max_length.

Code example

# %%
from datasets import Dataset
from transformer_lens.utils import tokenize_and_concatenate
from transformers import AutoTokenizer

model_name = "google/flan-t5-xl"

tokenizer = AutoTokenizer.from_pretrained(model_name)

print(tokenizer.model_max_length)
# 512

ds = Dataset.from_dict({"text": ["x" * 100000]})

res = tokenize_and_concatenate(ds, tokenizer, max_length=tokenizer.model_max_length)
# Warns:
# Token indices sequence length is longer than the specified maximum sequence length for this model (2502 > 512). Running this sequence through the model will result in indexing errors


print(res["tokens"].shape)
# torch.Size([97, 512])

print(tokenizer.deprecation_warnings)
# {'sequence-length-is-longer-than-the-specified-maximum': True}

My recommendation is that you set tokenizer.deprecation_warnings to suppress the warning.
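For illustration, a minimal sketch of that suppression, assuming transformers records this warning once per tokenizer under the "sequence-length-is-longer-than-the-specified-maximum" key in tokenizer.deprecation_warnings (as the output above suggests) and skips the log message when the flag is already set:

# Assumption: transformers checks this flag before emitting the warning,
# so pre-setting it prevents the spurious log message.
tokenizer.deprecation_warnings["sequence-length-is-longer-than-the-specified-maximum"] = True

res = tokenize_and_concatenate(ds, tokenizer, max_length=tokenizer.model_max_length)
# No warning is printed; the resulting tokens are unchanged.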

System Info
Describe the characteristic of your environment:

  • Describe how transformer_lens was installed: uv
  • What OS are you using? Linux
  • Python version (We support 3.7--3.10 currently)
  • transformers v4.45.2
  • transformer-lens v2.15.4

Checklist

  • I have checked that there is no similar issue in the repo (required)
