Conversation
Signed-off-by: Bruno Alvisio <balvisio@nvidia.com>
Important: Review skipped. Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI.
Actionable comments posted: 13
Note
Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
🤖 Fix all issues with AI agents
In `@bionemo-recipes/models/esm2/pyproject.toml`:
- Line 17: The git-based peft dependency string in pyproject.toml currently pins
to a branch ("peft @
git+https://github.com/balvisio/peft.git@dev/ba/support-te-lora"); replace the
branch ref with an immutable identifier (a specific commit SHA or an official
release tag) so the requirement becomes pinned to that commit/tag, update the
dependency entry accordingly, and verify the chosen SHA/tag exists in the peft
repo and builds correctly (look for the dependency line containing "peft @
git+https://github.com/balvisio/peft.git@...").
In `@bionemo-recipes/recipes/esm2_peft_te/checkpoint.py`:
- Line 1: The current checkpoint.py is a cross-recipe symlink to another
recipe's checkpoint logic; replace it by copying the checkpoint implementation
into this recipe (or extract the shared logic into a new common module and
import from there), add the proper per-file license header to the copied file,
remove any imports that reference other recipes, and update all references to
use the local checkpoint implementation (i.e., the functions/classes originally
provided by the external checkpoint module) so the recipe is fully
self-contained.
In `@bionemo-recipes/recipes/esm2_peft_te/dataset.py`:
- Around line 20-28: The file imports DataCollatorWithFlattening from
transformers but this recipe expects the local, recipe-specific implementation
in collator.py; change the import to pull DataCollatorWithFlattening from the
local collator module (the same place TokenPackingDataset is imported from) so
that the DataCollatorWithFlattening used by Dataset code matches the
recipe-specific signature and Flash Attention / THD-format parameters; update
the import list to reference collator.DataCollatorWithFlattening instead of
transformers.DataCollatorWithFlattening.
In `@bionemo-recipes/recipes/esm2_peft_te/distributed_config.py`:
- Line 1: The file distributed_config.py currently points to
../esm2_native_te/distributed_config.py (cross-recipe symlink); replace this
pointer by inlining the distributed_config implementation into this recipe (copy
the code from esm2_native_te/distributed_config.py into this recipe's
distributed_config.py) and add the required per-file license header at the top,
or alternatively extract any truly shared utilities into a new common module
outside recipes (e.g., a shared package) and import that instead; ensure the new
distributed_config.py in this recipe contains no imports from other recipes and
includes the proper license header.
In `@bionemo-recipes/recipes/esm2_peft_te/Dockerfile`:
- Around line 1-12: The Dockerfile is copying files from esm2_native_te
(checkpoint.py, collator.py, distributed_config.py, scheduler.py) which violates
the self-contained recipe rule; remove those COPY lines and either vendor those
helper modules into esm2_peft_te (add them under esm2_peft_te/ and update any
imports) or move them into a shared non-recipe package, then update the
Dockerfile to only COPY esm2_peft_te/ and its requirements; ensure any import
paths in the code reference the vendored modules (e.g., esm2_peft_te.checkpoint,
esm2_peft_te.collator, etc.) so the image no longer depends on esm2_native_te.
In `@bionemo-recipes/recipes/esm2_peft_te/infer.py`:
- Around line 27-36: Add a Google-style docstring to the _batched_inference
function: document the function purpose in one line, then an Args section
listing model, tokenizer, records, batch_size (int), max_seq_length (int),
stride (int), infer_overflowing_aas (bool), and device (str) with short
descriptions and types, and a Returns section describing the tuple of list[str]
(predicted sequences) and list[int] (corresponding lengths/ids) following
pydocstyle conventions; place the docstring immediately below the def
_batched_inference(...) signature and ensure proper triple-quote formatting and
punctuation. (A sketch of such a docstring follows after this list of items.)
- Around line 46-76: The code assumes inputs contains
"overflow_to_sample_mapping" but when tokenizer was called with
return_overflowing_tokens/infer_overflowing_aas=False that key is missing;
before using overflow_map (variable overflow_map) in the inner loop (inside the
block that constructs sub_inputs and iterates preds) guard access by checking if
"overflow_to_sample_mapping" is in inputs and set overflow_map =
inputs.pop("overflow_to_sample_mapping", None); when overflow_map is None,
compute original_idx using i (outer sample index) directly (or j+k) without
indexing overflow_map, and keep appending to sequences_to_sample_mapping and
predictions as before to avoid KeyError; update references in the loop where
original_idx is assigned (original_idx = i + overflow_map[j + k].item()) to
handle both cases.
- Line 108: Replace the hardcoded tokenizer checkpoint with the model-derived
tokenizer: modify the AutoTokenizer.from_pretrained call (where tokenizer =
AutoTokenizer.from_pretrained("nvidia/esm2_t48_15B_UR50D")) to load from the
runtime/config value (e.g., args.model_tag or a config field tokenizer_name) so
the tokenizer matches the model loaded elsewhere in infer.py; also add
tokenizer_name: ${model_tag} to hydra_config/defaults_infer.yaml so the
tokenizer checkpoint can be overridden via config. Ensure you reference the same
symbol used to load the model (args.model_tag or the config object) when calling
AutoTokenizer.from_pretrained.
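Regarding the docstring item above (around lines 27-36), a sketch of what a Google-style docstring could look like. The parameter list follows the review comment; the descriptions and exact signature ordering are assumptions, not text taken from the implementation:

def _batched_inference(model, tokenizer, records, batch_size, max_seq_length, stride, infer_overflowing_aas, device):
    """Run batched secondary-structure inference over the input records.

    Args:
        model: Token-classification model used to produce per-residue logits.
        tokenizer: Tokenizer matching the model checkpoint.
        records: Input records (sequences, optionally with identifiers) to predict on.
        batch_size (int): Number of tokenized windows per forward pass.
        max_seq_length (int): Maximum tokenized sequence length.
        stride (int): Overlap between overflowing windows.
        infer_overflowing_aas (bool): Whether to also run inference on overflowing residues.
        device (str): Device on which inference runs, e.g. "cuda" or "cpu".

    Returns:
        tuple[list[str], list[int]]: Predicted label strings and, for each prediction,
        the index of the original record it maps to.
    """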
In `@bionemo-recipes/recipes/esm2_peft_te/requirements.txt`:
- Around line 2-6: The git-based peft dependency is pinned to a mutable branch;
update the requirement line "peft @
git+https://github.com/balvisio/peft.git@dev/ba/support-te-lora" to reference an
immutable identifier by replacing the branch name with a specific commit SHA or
released tag (e.g., @<commit-sha> or `@vX.Y.Z`) so builds are reproducible and
supply-chain safe; ensure the updated string remains in requirements.txt as
"peft @ git+https://github.com/balvisio/peft.git@<commit-or-tag>".
In `@bionemo-recipes/recipes/esm2_peft_te/scheduler.py`:
- Line 1: This file currently redirects to or imports from
esm2_native_te.scheduler which violates recipe isolation; replace that
cross-recipe dependency by copying the required scheduler logic into this recipe
or implementing a local equivalent. Identify the exported symbols you depend on
(e.g., Scheduler class, create_scheduler or get_scheduler factory function, and
any helper functions like schedule_task or init_scheduler) from the
esm2_native_te implementation, reproduce their behavior locally inside
bionemo-recipes/recipes/esm2_peft_te/scheduler.py, update any local imports to
use the new local implementations, and remove any import lines that reference
esm2_native_te.scheduler so the recipe is fully self-contained.
- Line 1: The file currently contains a bare path string which raises a
SyntaxError; replace it with a valid Python implementation or a re-export
import: remove the literal path and add an import that re-exports the native
scheduler (e.g., import or from import of esm2_native_te.scheduler) or implement
a minimal wrapper function/class matching this package's expected API (e.g.,
functions/classes used elsewhere that reference scheduler.py) so the module can
be imported without error; update symbols to match callers if you implement a
wrapper.
In `@bionemo-recipes/recipes/esm2_peft_te/train_lora_convnet.py`:
- Line 25: The recipe imports NVEsmForConvTokenClassification from
modeling_esm_te in train_lora_convnet.py which pulls code from another recipe
and breaks self-containment; either vendor the required module into this recipe
(add modeling_esm_te.py with NVEsmForConvTokenClassification implementation
alongside train_lora_convnet.py and update the package/module path) or declare
and install the external esm package from models/esm2 (add it to
requirements.txt and update the Dockerfile to copy/install that package) so that
the import in train_lora_convnet.py resolves without referencing code outside
this recipe.
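As one possible shape for the second option, a hedged sketch of the Dockerfile change; the source path and install location are assumptions (they depend on the Docker build context), not verified against this repository:

# Sketch only: vendor the ESM-2 model package instead of copying from another recipe.
# Assumes the Docker build context is the repository root.
COPY bionemo-recipes/models/esm2 /opt/esm2
RUN pip install --no-cache-dir /opt/esm2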
🟡 Minor comments (15)
bionemo-recipes/recipes/esm2_peft_te/scheduler.py-1-1 (1)
1-1: ⚠️ Potential issue | 🟡 Minor

Add license header and Google-style module docstring.
The file is missing the required license header and module docstring. Pre-commit hooks will likely fail.
As per coding guidelines: “Ensure license headers are present in all files…” and “Use Google-style docstrings (pydocstyle).”
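A minimal sketch of what the top of scheduler.py could look like; the header wording and license identifier below are assumptions, so copy the exact header text from a neighboring recipe file:

# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Learning-rate scheduler utilities for the esm2_peft_te recipe."""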
.github/workflows/unit-tests-recipes.yml-158-160 (1)
158-160: ⚠️ Potential issue | 🟡 Minor

Avoid a hard-coded safe.directory path to prevent CI breakage in forks.

Using a fixed /__w/bionemo-framework/bionemo-framework path can fail when the repo name changes (forks or mirrors), triggering "dubious ownership" errors. Prefer $GITHUB_WORKSPACE so the command is resilient.

🛠️ Proposed fix

- run: git -c safe.directory=/__w/bionemo-framework/bionemo-framework sparse-checkout add bionemo-recipes/recipes/esm2_native_te
+ run: git -c safe.directory="$GITHUB_WORKSPACE" sparse-checkout add bionemo-recipes/recipes/esm2_native_te

bionemo-recipes/recipes/esm2_peft_te/example_nv_esm2_t6_8M_UR50D_peft_checkpoint/config.json-31-42 (1)
31-42: ⚠️ Potential issue | 🟡 Minor

Add a comment explaining the "L" label in label2id.

The label2id mapping includes "L": 2, which is not a standard DSSP secondary structure code. While this label is used consistently throughout the recipe (mapped to the coil class), it's undocumented what "L" represents. Add an inline comment clarifying whether "L" represents "Loop" or another designation, and explain why it's included in the label scheme alongside standard DSSP codes.
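Since JSON cannot hold comments, one hedged option is to document the mapping next to the label constants in utils.py instead (the constant name follows the recipe's imports; the values shown are illustrative, keep the recipe's existing mapping):

# DSSP-style 3-class mapping. "L" (loop/irregular) is emitted by some secondary-structure
# datasets for residues without an assigned H/E state; this recipe treats it as coil.
SS3_LABEL2ID = {"H": 0, "E": 1, "L": 2}  # illustrative values only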
bionemo-recipes/recipes/esm2_peft_te/example_nv_esm2_t6_8M_UR50D_peft_checkpoint/README.md-1-200 (1)

1-200: ⚠️ Potential issue | 🟡 Minor

Replace placeholder fields before shipping the model card.

The model card is entirely "[More Information Needed]" placeholders. If this accompanies a published checkpoint, please fill in at least core metadata (license, training data, intended use, evaluation) or clearly mark it as a template to avoid shipping incomplete documentation.

bionemo-recipes/recipes/esm2_peft_te/README.md-101-107 (1)
101-107: ⚠️ Potential issue | 🟡 Minor

Use descriptive link text for the esm2_native_te README link.

Line 107 uses "here", which is not descriptive and triggers MD059. Suggest updating the anchor text.

Proposed fix

-For more information see [here](../esm2_native_te/README.md).
+For more information see the [esm2_native_te README](../esm2_native_te/README.md).

bionemo-recipes/models/esm2/src/esm/modeling_esm_te.py-684-707 (1)
684-707: ⚠️ Potential issue | 🟡 Minor

Add Google-style Args/Returns to the new conv head docstrings.

NVConvNetHead and NVEsmForConvTokenClassification docstrings are minimal and missing Args/Returns sections, and the __init__ docstring name is off. Please update them to Google-style to satisfy pydocstyle.

Example docstring updates

 class NVConvNetHead(nn.Module):
-    """Convolution based head for token classification."""
+    """Convolution-based head for token classification.
+
+    Args:
+        config (NVEsmConfig): Model configuration.
+    """
@@
-    def forward(self, features, **kwargs):
-        """Forward pass for the convolutional token classification head."""
+    def forward(self, features, **kwargs):
+        """Forward pass for the convolutional token classification head.
+
+        Args:
+            features (torch.Tensor): Input features of shape (batch, hidden, seq_len).
+            **kwargs: Unused keyword arguments.
+
+        Returns:
+            torch.Tensor: Logits of shape (batch, seq_len, num_labels).
+        """
@@
 class NVEsmForConvTokenClassification(NVEsmPreTrainedModel):
@@
-    def __init__(self, config):
-        """Initialize NVEsmForTokenClassification."""
+    def __init__(self, config):
+        """Initialize NVEsmForConvTokenClassification.
+
+        Args:
+            config (NVEsmConfig): Model configuration.
+        """

As per coding guidelines: Use Google-style docstrings following pydocstyle conventions.
bionemo-recipes/recipes/esm2_peft_te/example_8m_checkpoint/esm_nv.py-705-714 (1)
705-714: ⚠️ Potential issue | 🟡 Minor

Fix docstring mismatch and redundant init_weights() call.

Two issues in the constructor:

- Line 706: Docstring says "Initialize NVEsmForTokenClassification" but this is NVEsmForConvTokenClassification.
- Lines 713-714: Calling both init_weights() and post_init() is redundant: post_init() already invokes init_weights() internally (see Hugging Face PreTrainedModel.post_init()). Other classes in this file (e.g., NVEsmForTokenClassification at line 640) only call post_init().

Proposed fix

     def __init__(self, config):
-        """Initialize NVEsmForTokenClassification."""
+        """Initialize NVEsmForConvTokenClassification."""
         super().__init__(config)
         self.num_labels = config.num_labels
         self.esm = NVEsmModel(config, add_pooling_layer=False)
         self.classifier = NVConvNetHead(config)
-        self.init_weights()
         self.post_init()

bionemo-recipes/recipes/esm2_accelerate_te/example_8m_checkpoint/esm_nv.py-705-714 (1)
705-714: ⚠️ Potential issue | 🟡 Minor

Fix docstring mismatch and redundant init_weights() call.

Same issues as in other esm_nv.py files:

- Docstring says "Initialize NVEsmForTokenClassification" instead of "Initialize NVEsmForConvTokenClassification".
- Redundant init_weights() call before post_init().

Proposed fix

     def __init__(self, config):
-        """Initialize NVEsmForTokenClassification."""
+        """Initialize NVEsmForConvTokenClassification."""
         super().__init__(config)
         self.num_labels = config.num_labels
         self.esm = NVEsmModel(config, add_pooling_layer=False)
         self.classifier = NVConvNetHead(config)
-        self.init_weights()
         self.post_init()

bionemo-recipes/recipes/esm2_peft_te/example_nv_esm2_t6_8M_UR50D_peft_checkpoint/esm_nv.py-705-714 (1)
705-714: ⚠️ Potential issue | 🟡 Minor

Fix docstring mismatch and redundant init_weights() call.

Same issues as in esm2_peft_te/example_8m_checkpoint/esm_nv.py:

- Docstring says "Initialize NVEsmForTokenClassification" instead of "Initialize NVEsmForConvTokenClassification".
- Redundant init_weights() call before post_init().

Proposed fix

     def __init__(self, config):
-        """Initialize NVEsmForTokenClassification."""
+        """Initialize NVEsmForConvTokenClassification."""
         super().__init__(config)
         self.num_labels = config.num_labels
         self.esm = NVEsmModel(config, add_pooling_layer=False)
         self.classifier = NVConvNetHead(config)
-        self.init_weights()
         self.post_init()

bionemo-recipes/recipes/esm2_native_te/example_8m_checkpoint/esm_nv.py-705-714 (1)
705-714: ⚠️ Potential issue | 🟡 Minor

Fix docstring mismatch and redundant init_weights() call.

Same issues as in other esm_nv.py files:

- Docstring says "Initialize NVEsmForTokenClassification" instead of "Initialize NVEsmForConvTokenClassification".
- Redundant init_weights() call before post_init().

Proposed fix

     def __init__(self, config):
-        """Initialize NVEsmForTokenClassification."""
+        """Initialize NVEsmForConvTokenClassification."""
         super().__init__(config)
         self.num_labels = config.num_labels
         self.esm = NVEsmModel(config, add_pooling_layer=False)
         self.classifier = NVConvNetHead(config)
-        self.init_weights()
         self.post_init()

bionemo-recipes/recipes/esm2_peft_te/infer.py-104-106 (1)
104-106: ⚠️ Potential issue | 🟡 Minor

Handle CPU-only environments or allow a device override.

The script hardcodes CUDA at line 106, which crashes on CPU-only systems. Additionally, the _batched_inference() call (lines 112-117) doesn't pass the device parameter despite the function supporting it.

🛠️ Suggested device handling

     # Load PEFT adapters on top
     peft_model = PeftModel.from_pretrained(base_model, args.peft_model_config_dir)
-    peft_model = peft_model.to("cuda").eval()
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    peft_model = peft_model.to(device).eval()
@@
-    predictions, sequences_to_sample_mapping = _batched_inference(
-        peft_model,
-        tokenizer,
-        records,
-        **args.inference,
-    )
+    inference_kwargs = dict(args.inference)
+    inference_kwargs.setdefault("device", device)
+    predictions, sequences_to_sample_mapping = _batched_inference(
+        peft_model,
+        tokenizer,
+        records,
+        **inference_kwargs,
+    )

bionemo-recipes/recipes/esm2_peft_te/train_lora_convnet.py-170-196 (1)
170-196: ⚠️ Potential issue | 🟡 Minor

Guard against empty validation dataloader.

If validation yields zero batches, val_steps stays 0 and the averaging will raise. Add a guard to handle empty validation splits.

💡 Suggested fix

-    avg_val_loss = val_loss_total / val_steps
-    avg_val_acc = val_correct_total / val_tokens_total if val_tokens_total > 0 else 0.0
+    if val_steps == 0:
+        avg_val_loss = 0.0
+        avg_val_acc = 0.0
+    else:
+        avg_val_loss = val_loss_total / val_steps
+        avg_val_acc = val_correct_total / val_tokens_total if val_tokens_total > 0 else 0.0

bionemo-recipes/recipes/esm2_peft_te/train_lora_ddp.py-171-197 (1)
171-197: ⚠️ Potential issue | 🟡 Minor

Guard against empty validation dataloader.

If the validation dataloader yields zero batches, val_steps remains 0 and this division will raise. Add a zero-batch guard.

💡 Suggested fix

-    avg_val_loss = val_loss_total / val_steps
-    avg_val_acc = val_correct_total / val_tokens_total if val_tokens_total > 0 else 0.0
+    if val_steps == 0:
+        avg_val_loss = 0.0
+        avg_val_acc = 0.0
+    else:
+        avg_val_loss = val_loss_total / val_steps
+        avg_val_acc = val_correct_total / val_tokens_total if val_tokens_total > 0 else 0.0

bionemo-recipes/recipes/esm2_peft_te/utils.py-106-123 (1)
106-123: ⚠️ Potential issue | 🟡 Minor

Validate CSV headers and handle empty files.

The current code fails on empty CSV files (line 115: "pdb_id" in reader.fieldnames raises TypeError when fieldnames is None), and missing the sequence column causes an unhandled KeyError on line 120. Add upfront validation to provide clear error messages.

Suggested fix

 def load_csv(path: Path) -> list[dict]:
     """Read input CSV file for inference.
@@
-    with open(path) as f:
-        reader = csv.DictReader(f)
-        has_pdb_id = "pdb_id" in reader.fieldnames
+    with open(path, newline="") as f:
+        reader = csv.DictReader(f)
+        if reader.fieldnames is None:
+            raise ValueError("CSV must include a header with a 'sequence' column.")
+        if "sequence" not in reader.fieldnames:
+            raise ValueError("CSV header must include a 'sequence' column.")
+        has_pdb_id = "pdb_id" in reader.fieldnames

bionemo-recipes/recipes/esm2_peft_te/tests/test_train_lora.py-16-18 (1)
16-18: ⚠️ Potential issue | 🟡 Minor

Guard test_sanity_ddp_thd with a CUDA availability check to prevent failures in CPU-only environments.

torch.cuda.get_device_capability() raises if CUDA is unavailable. Add the skip guard at the start of the function to avoid test failures in CPU-only runs.

💡 Suggested fix

 import torch
+import pytest
 from hydra import compose, initialize_config_dir

 from train_lora_ddp import main as main_ddp


 def test_sanity_ddp_thd(tmp_path, monkeypatch, recipe_path):
+    if not torch.cuda.is_available():
+        pytest.skip("CUDA is required for DDP THD sanity test")
     if torch.cuda.get_device_capability() == (12, 0):
         # TODO(BIONEMO-2840): On sm120, we need to set NVTE_FUSED_ATTN to 0 since TE will choose fused attn by default,
🧹 Nitpick comments (7)
bionemo-recipes/recipes/esm2_peft_te/perf_logger.py (2)
139-142: Avoid mutating the caller's outputs object.

Directly modifying outputs.logits with unsqueeze(0) mutates the object passed by the caller, which could cause unexpected side effects in the training loop if the outputs are used elsewhere after this call.

Proposed fix: use a local variable

     # Handle sequence packing for torchmetrics calculation.
+    logits_for_perplexity = outputs.logits
     if outputs.logits.dim() < 3:
-        outputs.logits = outputs.logits.unsqueeze(0)
+        logits_for_perplexity = outputs.logits.unsqueeze(0)

-    self.metrics["train/perplexity"].update(outputs.logits, batch["labels"])
+    self.metrics["train/perplexity"].update(logits_for_perplexity, batch["labels"])
153-166: Inconsistent rank checks for logging.

Line 153 uses is_main_process() for wandb logging, but line 165 uses local_rank == 0 for logger output. In multi-node distributed setups, these may not be equivalent (e.g., local_rank == 0 is true on every node, while is_main_process() is typically true only on global rank 0).

Consider using consistent rank checks throughout.

Proposed fix

-        if self._dist_config.local_rank == 0:
+        if self._dist_config.is_main_process():
             logger.info(", ".join([f"{k.split('/')[1]}: {v:.3g}" for k, v in metrics.items()]))

bionemo-recipes/recipes/esm2_peft_te/tests/test_train_lora_two_gpus.py (1)
49-52: Add a Google-style docstring to the test.

Keeps docstring linting consistent even under relaxed test rules.

♻️ Suggested update

 @requires_multi_gpu
 def test_multi_gpu_train_te_ddp(tmp_path, recipe_path):
+    """Smoke-test multi-GPU DDP training for the recipe."""
     # Run 'accelerate launch train.py' as a subprocess

As per coding guidelines: Use Google-style docstrings following pydocstyle conventions.
bionemo-recipes/recipes/esm2_peft_te/train_lora_ddp.py (1)
111-112: Gate verbose prints to the main process.

These prints will fire on every rank and can spam logs; consider restricting them to the main process.

💡 Suggested fix

-    print("----- PEFT Model --------")
-    peft_model.print_trainable_parameters()
+    if dist_config.is_main_process():
+        print("----- PEFT Model --------")
+        peft_model.print_trainable_parameters()
@@
-    print(f"\nStep: {step}: Validation Loss = {avg_val_loss:.4f}, Accuracy: {avg_val_acc:.4f}\n")
+    if dist_config.is_main_process():
+        print(f"\nStep: {step}: Validation Loss = {avg_val_loss:.4f}, Accuracy: {avg_val_acc:.4f}\n")

Also applies to: 198-198
bionemo-recipes/recipes/esm2_peft_te/train_lora_convnet.py (1)
110-111: Gate verbose prints to the main process.

These prints execute on every rank; restricting to the main process avoids noisy multi-rank output.

💡 Suggested fix

-    print("----- PEFT Model --------")
-    peft_model.print_trainable_parameters()
+    if dist_config.is_main_process():
+        print("----- PEFT Model --------")
+        peft_model.print_trainable_parameters()
@@
-    print(f"\nStep: {step}: Validation Loss = {avg_val_loss:.4f}, Accuracy: {avg_val_acc:.4f}\n")
+    if dist_config.is_main_process():
+        print(f"\nStep: {step}: Validation Loss = {avg_val_loss:.4f}, Accuracy: {avg_val_acc:.4f}\n")

Also applies to: 197-197
bionemo-recipes/recipes/esm2_peft_te/dataset.py (1)
31-44: Use a Google-style docstring for create_dataloader.

The current one-line docstring doesn't meet the Google-style requirement. Add Args:/Returns: sections.

As per coding guidelines, ensure all Python files follow Google-style docstrings (pydocstyle convention).

💡 Suggested fix

-    """Create a dataloader for the secondary structure dataset."""
+    """Create dataloaders for the secondary structure dataset.
+
+    Args:
+        distributed_config: Distributed training configuration.
+        use_sequence_packing: Whether to enable sequence packing.
+        tokenizer_name: Tokenizer identifier.
+        micro_batch_size: Training micro-batch size.
+        val_micro_batch_size: Validation micro-batch size.
+        num_workers: DataLoader worker count.
+        max_seq_length: Maximum sequence length.
+        stride: Tokenizer stride for overflow.
+        seed: RNG seed for shuffling.
+        ss3_classification: Whether to use SS3 labels (else SS8).
+        load_dataset_kwargs: Keyword arguments for datasets.load_dataset.
+
+    Returns:
+        Tuple of (train_dataloader, val_dataloader, train_dataset_or_sampler).
+    """

bionemo-recipes/recipes/esm2_peft_te/utils.py (1)
55-61: Use Google-style docstrings for public helpers.

These public functions should include Args:/Returns: sections to meet the docstring standard. Consider updating all helper docstrings similarly.

As per coding guidelines, ensure all Python files follow Google-style docstrings (pydocstyle convention).

💡 Example for one function

 def compute_accuracy(preds, labels, ignore_index=-100) -> tuple[int, int]:
-    """Calculate the accuracy."""
+    """Calculate the accuracy.
+
+    Args:
+        preds: Model logits or scores per token.
+        labels: Ground-truth label tensor.
+        ignore_index: Label value to exclude from accuracy.
+
+    Returns:
+        Tuple of (correct_count, total_count).
+    """

Also applies to: 64-82, 84-104, 106-135, 138-161, 163-170
| "megatron-fsdp", | ||
| "omegaconf", | ||
| "peft", | ||
| "peft @ git+https://github.com/balvisio/peft.git@dev/ba/support-te-lora", |
🧩 Analysis chain
🏁 Script executed:
find . -name "pyproject.toml" -type f | grep -E "(esm2|bionemo)" | head -20Repository: NVIDIA/bionemo-framework
Length of output: 840
🏁 Script executed:
cat -n bionemo-recipes/models/esm2/pyproject.toml | head -30
Repository: NVIDIA/bionemo-framework
Length of output: 909
🏁 Script executed:
# Check for requirements.txt files mentioned in the review
find . -name "requirements*.txt" -type f | grep -E "(esm2|recipe)" | head -20Repository: NVIDIA/bionemo-framework
Length of output: 628
🏁 Script executed:
cat -n bionemo-recipes/recipes/esm2_peft_te/requirements.txt | grep -i peft -A 2 -B 2
Repository: NVIDIA/bionemo-framework
Length of output: 215
🏁 Script executed:
# Check other esm2 requirements files too
cat -n bionemo-recipes/recipes/esm2_accelerate_te/requirements.txt | grep -i peft -A 2 -B 2
Repository: NVIDIA/bionemo-framework
Length of output: 50
🏁 Script executed:
cat -n bionemo-recipes/recipes/esm2_native_te/requirements.txt | grep -i peft -A 2 -B 2
Repository: NVIDIA/bionemo-framework
Length of output: 50
Pin the git-based peft dependency to an immutable commit/tag.
The peft dependency uses a branch reference (@dev/ba/support-te-lora) instead of a commit SHA or released tag. Branch references are non-immutable and can change unpredictably, creating reproducibility and supply-chain risks. Use a pinned commit SHA or released tag to ensure deterministic builds.
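For example, the pyproject.toml entry could become the following; the placeholder stands in for a real commit SHA or release tag and is not filled in here:

-"peft @ git+https://github.com/balvisio/peft.git@dev/ba/support-te-lora",
+"peft @ git+https://github.com/balvisio/peft.git@<commit-sha-or-tag>",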
🤖 Prompt for AI Agents
In `@bionemo-recipes/models/esm2/pyproject.toml` at line 17, The git-based peft
dependency string in pyproject.toml currently pins to a branch ("peft @
git+https://github.com/balvisio/peft.git@dev/ba/support-te-lora"); replace the
branch ref with an immutable identifier (a specific commit SHA or an official
release tag) so the requirement becomes pinned to that commit/tag, update the
dependency entry accordingly, and verify the chosen SHA/tag exists in the peft
repo and builds correctly (look for the dependency line containing "peft @
git+https://github.com/balvisio/peft.git@...").
    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        **kwargs: Unpack[TransformersKwargs],
    ) -> TokenClassifierOutput:
        """Forward pass for the token classification head.

        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
        """
        outputs = self.esm(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            inputs_embeds=inputs_embeds,
            **kwargs,
        )

        if outputs[0].dim() == 3:
            sequence_output = outputs[0]
        else:
            sequence_output = outputs[0].unsqueeze(0)

        sequence_output = sequence_output.transpose(1, 2)

        logits = self.classifier(sequence_output)

        loss = None
        if labels is not None:
            loss_fct = CrossEntropyLoss()

            labels = labels.to(logits.device)
            loss = loss_fct(logits.reshape(-1, self.num_labels), labels.view(-1))
Packed/THD sequences will be convolved across boundaries.
When sequence packing is enabled (THD), outputs[0] becomes a flattened token stream. The Conv1d head will mix tokens from adjacent sequences, corrupting logits/loss. Consider rejecting packed inputs for this head or inserting explicit padding between sequences and handling it accordingly.
Possible guard to prevent incorrect packed-input usage
    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        **kwargs: Unpack[TransformersKwargs],
    ) -> TokenClassifierOutput:
+       if kwargs.get("cu_seq_lens_q") is not None or kwargs.get("cu_seq_lens_k") is not None:
+           raise ValueError(
+               "NVEsmForConvTokenClassification does not support sequence-packed (THD) inputs; "
+               "disable use_sequence_packing."
+           )
        outputs = self.esm(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,

@@ -0,0 +1 @@
../esm2_native_te/checkpoint.py (no newline at end of file)
Replace cross-recipe symlink with a local/shared implementation.
Pointing this file at ../esm2_native_te/checkpoint.py makes the recipe depend on another recipe and bypasses per-file license headers. Please copy the checkpoint logic into this recipe, or move shared logic into a non-recipe common module and import from there.
As per coding guidelines, recipes in bionemo-recipes/recipes/ must be completely self-contained with no shared dependencies between recipes and no imports from other recipes, and ensure license headers are present in all **/*.{py,js,ts,jsx,tsx} files.
🤖 Prompt for AI Agents
In `@bionemo-recipes/recipes/esm2_peft_te/checkpoint.py` at line 1, The current
checkpoint.py is a cross-recipe symlink to another recipe's checkpoint logic;
replace it by copying the checkpoint implementation into this recipe (or extract
the shared logic into a new common module and import from there), add the proper
per-file license header to the copied file, remove any imports that reference
other recipes, and update all references to use the local checkpoint
implementation (i.e., the functions/classes originally provided by the external
checkpoint module) so the recipe is fully self-contained.
from transformers import (
    AutoTokenizer,
    DataCollatorForTokenClassification,
    DataCollatorWithFlattening,
)

from collator import TokenPackingDataset
from distributed_config import DistributedConfig
from utils import SS3_LABEL2ID, SS8_LABEL2ID
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find the DataCollatorWithFlattening class definition
rg -n "class DataCollatorWithFlattening" -g '*.py' --type pyRepository: NVIDIA/bionemo-framework
Length of output: 409
🏁 Script executed:
#!/bin/bash
# Check the structure of the recipes/esm2_peft_te directory
find bionemo-recipes/recipes/esm2_peft_te -type f -name "*.py" | head -20
Repository: NVIDIA/bionemo-framework
Length of output: 778
🏁 Script executed:
#!/bin/bash
# Read the collator.py file to see what's defined there
cat -n bionemo-recipes/recipes/esm2_peft_te/collator.py 2>/dev/null | head -100
Repository: NVIDIA/bionemo-framework
Length of output: 5676
🏁 Script executed:
#!/bin/bash
# Check the dataset.py file around the import section
cat -n bionemo-recipes/recipes/esm2_peft_te/dataset.py 2>/dev/null | head -50
Repository: NVIDIA/bionemo-framework
Length of output: 2298
🏁 Script executed:
#!/bin/bash
# Check if collator.py exists in esm2_peft_te directory
ls -la bionemo-recipes/recipes/esm2_peft_te/ | grep collator
Repository: NVIDIA/bionemo-framework
Length of output: 162
🌐 Web query:
DataCollatorWithFlattening transformers library
💡 Result:
DataCollatorWithFlattening is a Hugging Face Transformers data collator for padding-free training. Instead of padding each example to a common length, it concatenates the whole mini-batch into one long sequence of shape [1, total_tokens], and (by default) returns input_ids, labels, and position_ids. It separates each original example inside the concatenated labels using separator_id (default -100, i.e., ignore index for loss). It can also optionally return seq_idx (which original sequence each token came from) and FlashAttention kwargs. [1]
Key params (from the docs): return_position_ids=True, separator_id=-100, return_flash_attn_kwargs=False, return_seq_idx=False. [1]
Caveat: because the batch becomes one long sequence, your attention implementation must be able to handle this “flattened batch” layout (it’s not the usual [batch, seq] padding/masking pattern). [1]
Example:
from transformers import DataCollatorWithFlattening
collator = DataCollatorWithFlattening(separator_id=-100, return_seq_idx=True)
# pass `collator` as the data_collator to Trainer / DataLoader

Source: Transformers "Data collator" docs (main branch). [1]
References:
[1] https://huggingface.co/docs/transformers/main/en/main_classes/data_collator
🏁 Script executed:
#!/bin/bash
# Check if there's a collator.py file in esm2_peft_te by searching more thoroughly
find bionemo-recipes/recipes/esm2_peft_te -name "collator.py"
Repository: NVIDIA/bionemo-framework
Length of output: 116
🏁 Script executed:
#!/bin/bash
# Check the imports in dataset.py more carefully and see if TokenPackingDataset is also from collator
cat -n bionemo-recipes/recipes/esm2_peft_te/dataset.py | head -35
Repository: NVIDIA/bionemo-framework
Length of output: 1567
Import DataCollatorWithFlattening from the local collator module.
A custom DataCollatorWithFlattening is defined in collator.py (via symlink to esm2_native_te/collator.py) with a recipe-specific implementation for THD format and Flash Attention support. The current import from transformers retrieves the official library version instead, which has a different signature and will cause runtime errors when instantiated with the custom parameters expected by this recipe. Import it from the local module to ensure the recipe uses its own self-contained implementation.
Suggested fix
-from transformers import (
- AutoTokenizer,
- DataCollatorForTokenClassification,
- DataCollatorWithFlattening,
-)
+from transformers import AutoTokenizer, DataCollatorForTokenClassification
@@
-from collator import TokenPackingDataset
+from collator import DataCollatorWithFlattening, TokenPackingDataset

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-from transformers import (
-    AutoTokenizer,
-    DataCollatorForTokenClassification,
-    DataCollatorWithFlattening,
-)
-from collator import TokenPackingDataset
+from transformers import AutoTokenizer, DataCollatorForTokenClassification
+from collator import DataCollatorWithFlattening, TokenPackingDataset
 from distributed_config import DistributedConfig
 from utils import SS3_LABEL2ID, SS8_LABEL2ID
🤖 Prompt for AI Agents
In `@bionemo-recipes/recipes/esm2_peft_te/dataset.py` around lines 20 - 28, The
file imports DataCollatorWithFlattening from transformers but this recipe
expects the local, recipe-specific implementation in collator.py; change the
import to pull DataCollatorWithFlattening from the local collator module (the
same place TokenPackingDataset is imported from) so that the
DataCollatorWithFlattening used by Dataset code matches the recipe-specific
signature and Flash Attention / THD-format parameters; update the import list to
reference collator.DataCollatorWithFlattening instead of
transformers.DataCollatorWithFlattening.
@@ -0,0 +1 @@
../esm2_native_te/distributed_config.py (no newline at end of file)
Avoid cross-recipe symlink for distributed config.
This pointer ties the recipe to another recipe’s config and skips the required per-file license header. Please inline the config here or move shared logic into a non-recipe common module.
As per coding guidelines, recipes in bionemo-recipes/recipes/ must be completely self-contained with no shared dependencies between recipes and no imports from other recipes, and ensure license headers are present in all **/*.{py,js,ts,jsx,tsx} files.
🤖 Prompt for AI Agents
In `@bionemo-recipes/recipes/esm2_peft_te/distributed_config.py` at line 1, The
file distributed_config.py currently points to
../esm2_native_te/distributed_config.py (cross-recipe symlink); replace this
pointer by inlining the distributed_config implementation into this recipe (copy
the code from esm2_native_te/distributed_config.py into this recipe's
distributed_config.py) and add the required per-file license header at the top,
or alternatively extract any truly shared utilities into a new common module
outside recipes (e.g., a shared package) and import that instead; ensure the new
distributed_config.py in this recipe contains no imports from other recipes and
includes the proper license header.
    inputs = tokenizer(
        sequences,
        max_length=max_seq_length,
        truncation=True,
        stride=stride,
        return_overflowing_tokens=infer_overflowing_aas,
        return_tensors="pt",
        padding=True,
    )

    overflow_map = inputs.pop("overflow_to_sample_mapping")
    num_samples = len(inputs["input_ids"])

    # inner batching over tokenizer outputs
    for j in range(0, num_samples, batch_size):
        sub_inputs = {k: v[j : j + batch_size].to(device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = model(**sub_inputs)

        preds = outputs.logits.argmax(dim=-1)

        for k, (pred, input_ids) in enumerate(zip(preds, sub_inputs["input_ids"])):
            length = (input_ids != tokenizer.pad_token_id).sum().item()
            labels = "".join(id2label[i.item()] for i in pred[:length])

            predictions.append(labels)

            # map back to original record index
            original_idx = i + overflow_map[j + k].item()
            sequences_to_sample_mapping.append(original_idx)
🧩 Analysis chain
🏁 Script executed:
find . -name "infer.py" -path "*esm2_peft_te*" | head -5Repository: NVIDIA/bionemo-framework
Length of output: 115
🏁 Script executed:
cat -n bionemo-recipes/recipes/esm2_peft_te/infer.py | head -100
Repository: NVIDIA/bionemo-framework
Length of output: 4496
🌐 Web query:
Hugging Face Transformers tokenizer return_overflowing_tokens overflow_to_sample_mapping behavior
💡 Result:
- return_overflowing_tokens=True tells the tokenizer to split inputs longer than max_length into multiple "windows" (optionally overlapping by stride) and return them as additional rows in the output. (huggingface.co)
- When that happens, the returned batch is flattened from roughly (batch, overflows, seq_len) into (batch * overflows, seq_len), and overflow_to_sample_mapping is added so you can map each returned window back to the original example index within the input batch. Specifically, it's built by repeating each original index i exactly (# of windows produced for sample i) times. (huggingface.co)
- Important edge case: for sequence pairs with truncation strategy 'longest_first', Transformers warns that overflowing tokens aren't returned (so overflow-related outputs may be empty/missing even if truncation occurred). (discuss.huggingface.co)
- Practical gotcha with 🤗 Datasets: if you tokenize inside Dataset.map(..., num_proc>1), overflow_to_sample_mapping can be wrong because each worker computes indices relative to its shard (not the global dataset) and then results are concatenated. (github.com)
Citations:
- 1: https://huggingface.co/transformers/v4.3.3/_modules/transformers/tokenization_utils_fast.html?utm_source=openai
- 2: https://huggingface.co/transformers/v4.3.3/_modules/transformers/tokenization_utils_fast.html?utm_source=openai
- 3: https://discuss.huggingface.co/t/what-does-this-warning-mean-overflowing-tokens-are-not-returned-for-the-setting-you-have-chosen/11594?utm_source=openai
- 4: Datasets.map causes incorrect overflow_to_sample_mapping when used with tokenizers and small batch size huggingface/datasets#4755
Guard overflow_to_sample_mapping when overflow is disabled.
When return_overflowing_tokens=False, the tokenizer does not include overflow_to_sample_mapping in its output, causing a KeyError on line 56.
Suggested fix
- overflow_map = inputs.pop("overflow_to_sample_mapping")
+ if infer_overflowing_aas:
+ overflow_map = inputs.pop("overflow_to_sample_mapping")
+ else:
+ overflow_map = torch.arange(len(inputs["input_ids"]))

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
     inputs = tokenizer(
         sequences,
         max_length=max_seq_length,
         truncation=True,
         stride=stride,
         return_overflowing_tokens=infer_overflowing_aas,
         return_tensors="pt",
         padding=True,
     )

-    overflow_map = inputs.pop("overflow_to_sample_mapping")
+    if infer_overflowing_aas:
+        overflow_map = inputs.pop("overflow_to_sample_mapping")
+    else:
+        overflow_map = torch.arange(len(inputs["input_ids"]))
     num_samples = len(inputs["input_ids"])

     # inner batching over tokenizer outputs
     for j in range(0, num_samples, batch_size):
         sub_inputs = {k: v[j : j + batch_size].to(device) for k, v in inputs.items()}

         with torch.no_grad():
             outputs = model(**sub_inputs)

         preds = outputs.logits.argmax(dim=-1)

         for k, (pred, input_ids) in enumerate(zip(preds, sub_inputs["input_ids"])):
             length = (input_ids != tokenizer.pad_token_id).sum().item()
             labels = "".join(id2label[i.item()] for i in pred[:length])

             predictions.append(labels)

             # map back to original record index
             original_idx = i + overflow_map[j + k].item()
             sequences_to_sample_mapping.append(original_idx)
🤖 Prompt for AI Agents
In `@bionemo-recipes/recipes/esm2_peft_te/infer.py` around lines 46 - 76, The code
assumes inputs contains "overflow_to_sample_mapping" but when tokenizer was
called with return_overflowing_tokens/infer_overflowing_aas=False that key is
missing; before using overflow_map (variable overflow_map) in the inner loop
(inside the block that constructs sub_inputs and iterates preds) guard access by
checking if "overflow_to_sample_mapping" is in inputs and set overflow_map =
inputs.pop("overflow_to_sample_mapping", None); when overflow_map is None,
compute original_idx using i (outer sample index) directly (or j+k) without
indexing overflow_map, and keep appending to sequences_to_sample_mapping and
predictions as before to avoid KeyError; update references in the loop where
original_idx is assigned (original_idx = i + overflow_map[j + k].item()) to
handle both cases.
    peft_model = PeftModel.from_pretrained(base_model, args.peft_model_config_dir)
    peft_model = peft_model.to("cuda").eval()

    tokenizer = AutoTokenizer.from_pretrained("nvidia/esm2_t48_15B_UR50D")
🧩 Analysis chain
🏁 Script executed:
# First, let's locate and read the infer.py file
find . -name "infer.py" -path "*/esm2_peft_te/*" | head -20Repository: NVIDIA/bionemo-framework
Length of output: 115
🏁 Script executed:
# Get the size to determine how much we can read
wc -l bionemo-recipes/recipes/esm2_peft_te/infer.py
Repository: NVIDIA/bionemo-framework
Length of output: 117
🌐 Web query:
Transformers documentation AutoModelForTokenClassification tokenizer selection best practices
💡 Result:
Tokenizer selection best practices for AutoModelForTokenClassification (Transformers)
- Always load the tokenizer from the same checkpoint as the model
  - Use AutoTokenizer.from_pretrained(model_name_or_path) that matches the checkpoint you pass to AutoModelForTokenClassification.from_pretrained(...) so the vocab, normalization, and special tokens align with the model. (This is the default pattern shown throughout the token classification task docs.) [1]
- Prefer a fast tokenizer (use_fast=True) for token classification
  - Token classification typically requires word-to-subword alignment, done via is_split_into_words=True plus word_ids(). The official token classification guide uses word_ids() for label alignment. [1]
  - If you accidentally end up with a Python ("slow") tokenizer, word_ids() may be unavailable (common failure mode). [5]
- Use the recommended label-alignment strategy
  - Realign labels after subword tokenization by:
    - mapping tokens to word indices with word_ids(),
    - setting labels for special tokens to -100,
    - labeling only the first subtoken of each word (others -100). [1]
- Pad/truncate correctly (usually: dynamic padding + truncation)
  - For training/inference batches, rely on the standard tokenizer args padding/truncation/max_length as needed. [2]
  - For token classification training, prefer dynamic padding per batch via DataCollatorForTokenClassification(tokenizer=tokenizer) (recommended in the task guide). [1]
- If you need spans (start/end char offsets), request offsets from the tokenizer
  - Fast tokenizers can return return_offsets_mapping=True to map tokens back to character spans (special tokens typically map to (0, 0)). [3]
- RoBERTa / byte-level BPE gotcha when using pre-tokenized words
  - If you call the tokenizer with is_split_into_words=True on RoBERTa-like tokenizers, instantiate with add_prefix_space=True (explicitly noted in the RoBERTa docs). [4]

Minimal, "safe default" loading snippet

from transformers import AutoTokenizer, AutoModelForTokenClassification

ckpt = "your-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(ckpt, use_fast=True)
model = AutoModelForTokenClassification.from_pretrained(ckpt)

Key takeaway: for token classification, the "best" tokenizer is almost always the matching checkpoint's fast tokenizer, because alignment utilities like word_ids()/offsets are central to correct training and evaluation. [1] [5]
🏁 Script executed:
cat -n bionemo-recipes/recipes/esm2_peft_te/infer.py
Repository: NVIDIA/bionemo-framework
Length of output: 5556
🏁 Script executed:
# Check the Hydra config structure
find bionemo-recipes/recipes/esm2_peft_te -name "*.yaml" | head -10Repository: NVIDIA/bionemo-framework
Length of output: 467
🏁 Script executed:
# Read the default Hydra config
cat bionemo-recipes/recipes/esm2_peft_te/hydra_config/L0_sanity_infer.yaml
Repository: NVIDIA/bionemo-framework
Length of output: 384
🏁 Script executed:
cat bionemo-recipes/recipes/esm2_peft_te/hydra_config/defaults_infer.yaml
Repository: NVIDIA/bionemo-framework
Length of output: 334
🏁 Script executed:
# Also check other training configs to see if tokenizer_name pattern is used elsewhere
grep -r "tokenizer_name" bionemo-recipes/recipes/esm2_peft_te/ 2>/dev/null || echo "No tokenizer_name found"Repository: NVIDIA/bionemo-framework
Length of output: 646
🏁 Script executed:
# Check if there's a pattern in other recipes
find bionemo-recipes/recipes -name "*.py" -type f -exec grep -l "tokenizer_name\|AutoTokenizer.from_pretrained" {} \; | head -10Repository: NVIDIA/bionemo-framework
Length of output: 767
🏁 Script executed:
# Check how dataset.py uses tokenizer_name to understand the pattern
sed -n '1,50p' bionemo-recipes/recipes/esm2_peft_te/dataset.py
Repository: NVIDIA/bionemo-framework
Length of output: 1948
🏁 Script executed:
# Also check if there's any guidance on tokenizer selection in training configs
grep -A5 -B5 "tokenizer_name" bionemo-recipes/recipes/esm2_peft_te/hydra_config/defaults.yamlRepository: NVIDIA/bionemo-framework
Length of output: 266
Derive tokenizer from the model checkpoint, not hardcoded.
Line 108 hardcodes "nvidia/esm2_t48_15B_UR50D" while the model is loaded from args.model_tag (e.g., "nvidia/esm2_t6_8M_UR50D" in L0_sanity_infer.yaml). Per Transformers documentation, the tokenizer must match the model checkpoint for vocabulary and special tokens to align. This mismatch can silently corrupt results.
The training scripts in this recipe correctly use tokenizer_name: ${model_tag} in the config. Update infer.py to follow the same pattern:
Suggested update
- tokenizer = AutoTokenizer.from_pretrained("nvidia/esm2_t48_15B_UR50D")
+ tokenizer_name = getattr(args, "tokenizer_name", None) or args.model_tag
+ tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)

Also add tokenizer_name: ${model_tag} to hydra_config/defaults_infer.yaml to allow overriding when needed (as done in training configs like L1_fb_15B.yaml).
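For the config side, a one-line sketch of what defaults_infer.yaml could gain, assuming model_tag is already defined in that file as it is in the training configs:

tokenizer_name: ${model_tag}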
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| tokenizer = AutoTokenizer.from_pretrained("nvidia/esm2_t48_15B_UR50D") | |
| tokenizer_name = getattr(args, "tokenizer_name", None) or args.model_tag | |
| tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True) |
🤖 Prompt for AI Agents
In `@bionemo-recipes/recipes/esm2_peft_te/infer.py` at line 108, Replace the
hardcoded tokenizer checkpoint with the model-derived tokenizer: modify the
AutoTokenizer.from_pretrained call (where tokenizer =
AutoTokenizer.from_pretrained("nvidia/esm2_t48_15B_UR50D")) to load from the
runtime/config value (e.g., args.model_tag or a config field tokenizer_name) so
the tokenizer matches the model loaded elsewhere in infer.py; also add
tokenizer_name: ${model_tag} to hydra_config/defaults_infer.yaml so the
tokenizer checkpoint can be overridden via config. Ensure you reference the same
symbol used to load the model (args.model_tag or the config object) when calling
AutoTokenizer.from_pretrained.
hydra-core
peft @ git+https://github.com/balvisio/peft.git@dev/ba/support-te-lora
torch
torchao!=0.14.0
torchmetrics
🧩 Analysis chain
🏁 Script executed:
# Check if the file exists and read its contents
cat bionemo-recipes/recipes/esm2_peft_te/requirements.txt
Repository: NVIDIA/bionemo-framework
Length of output: 245
🏁 Script executed:
# Search for any references to this peft dependency in the codebase
rg "balvisio/peft" --type txt --type py --type mdRepository: NVIDIA/bionemo-framework
Length of output: 192
🏁 Script executed:
# Check if there's any documentation about this branch or dependency
rg "dev/ba/support-te-lora"Repository: NVIDIA/bionemo-framework
Length of output: 313
🏁 Script executed:
# Look for similar git-based dependencies in recipes
fd "requirements.txt" bionemo-recipes/recipes/ -exec grep -l "git+" {} \;Repository: NVIDIA/bionemo-framework
Length of output: 266
🏁 Script executed:
# Search for documentation or comments about this branch
rg "support-te-lora" --type md --type txt --type pyRepository: NVIDIA/bionemo-framework
Length of output: 192
🏁 Script executed:
# Look for README or documentation files in the recipe directory
find bionemo-recipes/recipes/esm2_peft_te -type f \( -name "README*" -o -name "*.md" -o -name "SETUP*" \)
Repository: NVIDIA/bionemo-framework
Length of output: 274
🏁 Script executed:
# Check if there are any inline comments or documentation about peft dependency
rg -B 3 -A 3 "balvisio/peft" bionemo-recipes/Repository: NVIDIA/bionemo-framework
Length of output: 1001
🏁 Script executed:
# Check the recipe directory structure to understand what's included
ls -la bionemo-recipes/recipes/esm2_peft_te/
Repository: NVIDIA/bionemo-framework
Length of output: 1560
🏁 Script executed:
# Read the README to understand why the custom branch is needed
cat bionemo-recipes/recipes/esm2_peft_te/README.md
Repository: NVIDIA/bionemo-framework
Length of output: 4322
🏁 Script executed:
# Check if there are any comments in the training scripts about this dependency
grep -n "peft\|TE\|LoRA" bionemo-recipes/recipes/esm2_peft_te/train_lora_*.py | head -20Repository: NVIDIA/bionemo-framework
Length of output: 2534
🏁 Script executed:
# Look for any setup/install documentation
find bionemo-recipes -name "SETUP*" -o -name "INSTALL*" | head -5Repository: NVIDIA/bionemo-framework
Length of output: 50
Pin the git-based peft dependency to an immutable commit or tag.
The dependency on branch dev/ba/support-te-lora is non-reproducible and increases supply-chain risk. Branches can be force-pushed, rebased, or deleted without notice. Pin to a specific commit SHA (or a released tag) to ensure deterministic builds and maintain reproducibility required for production recipes.
🔧 Example fix
-peft @ git+https://github.com/balvisio/peft.git@dev/ba/support-te-lora
+peft @ git+https://github.com/balvisio/peft.git@<COMMIT_SHA>

🤖 Prompt for AI Agents
In `@bionemo-recipes/recipes/esm2_peft_te/requirements.txt` around lines 2 - 6,
The git-based peft dependency is pinned to a mutable branch; update the
requirement line "peft @
git+https://github.com/balvisio/peft.git@dev/ba/support-te-lora" to reference an
immutable identifier by replacing the branch name with a specific commit SHA or
released tag (e.g., @<commit-sha> or `@vX.Y.Z`) so builds are reproducible and
supply-chain safe; ensure the updated string remains in requirements.txt as
"peft @ git+https://github.com/balvisio/peft.git@<commit-or-tag>".
@@ -0,0 +1 @@
../esm2_native_te/scheduler.py (no newline at end of file)
Cross-recipe dependency violates recipe isolation rules.
This file effectively redirects to bionemo-recipes/recipes/esm2_native_te/scheduler.py, which breaks the rule that recipes must be self-contained and must not import or depend on other recipes. Please copy the needed logic into this recipe or provide a local implementation.
As per coding guidelines: “bionemo-recipes/recipes/**/*.py must be completely self-contained with no shared dependencies between recipes and no imports from other recipes.”
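If the recipe only needs a simple warmup schedule for fine-tuning, a minimal self-contained sketch could look like this; the function name and warmup/decay behavior are assumptions, so mirror whatever esm2_native_te/scheduler.py actually exports:

import torch


def get_linear_warmup_scheduler(optimizer: torch.optim.Optimizer, num_warmup_steps: int, num_training_steps: int):
    """Linear warmup followed by linear decay to zero (illustrative placeholder)."""

    def lr_lambda(step: int) -> float:
        if step < num_warmup_steps:
            return step / max(1, num_warmup_steps)  # ramp up from 0 to the base LR
        remaining = num_training_steps - step
        return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))  # decay back to 0

    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)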
🤖 Prompt for AI Agents
In `@bionemo-recipes/recipes/esm2_peft_te/scheduler.py` at line 1, This file
currently redirects to or imports from esm2_native_te.scheduler which violates
recipe isolation; replace that cross-recipe dependency by copying the required
scheduler logic into this recipe or implementing a local equivalent. Identify
the exported symbols you depend on (e.g., Scheduler class, create_scheduler or
get_scheduler factory function, and any helper functions like schedule_task or
init_scheduler) from the esm2_native_te implementation, reproduce their behavior
locally inside bionemo-recipes/recipes/esm2_peft_te/scheduler.py, update any
local imports to use the new local implementations, and remove any import lines
that reference esm2_native_te.scheduler so the recipe is fully self-contained.
Fix invalid Python file (current content will raise SyntaxError).
A bare path line is not valid Python code and will fail on import/execution. Replace this with a real implementation or a valid import/alias.
✅ Example fix (inline implementation placeholder)
-../esm2_native_te/scheduler.py
+"""Scheduler utilities for esm2_peft_te."""
+
+# TODO: Implement scheduler logic here (keep this recipe self-contained).

🤖 Prompt for AI Agents
In `@bionemo-recipes/recipes/esm2_peft_te/scheduler.py` at line 1, The file
currently contains a bare path string which raises a SyntaxError; replace it
with a valid Python implementation or a re-export import: remove the literal
path and add an import that re-exports the native scheduler (e.g., import or
from import of esm2_native_te.scheduler) or implement a minimal wrapper
function/class matching this package's expected API (e.g., functions/classes
used elsewhere that reference scheduler.py) so the module can be imported
without error; update symbols to match callers if you implement a wrapper.
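A minimal sketch of the vendoring option, assuming the native recipe's scheduler.py sits one directory over as the diff suggests (the copied file would still need this repo's per-file license header):
cd bionemo-recipes/recipes/esm2_peft_te
rm scheduler.py                                   # drop the bare-path/symlink file
cp ../esm2_native_te/scheduler.py scheduler.py    # vendor a real copy into this recipe
After that, the recipe imports its own scheduler module and no longer depends on esm2_native_te.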
import hydra
import peft
import torch
from modeling_esm_te import NVEsmForConvTokenClassification
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find all instances of modeling_esm_te.py
echo "=== Searching for modeling_esm_te.py ==="
fd -a "modeling_esm_te.py"
# Search for NVEsmForConvTokenClassification definition
echo -e "\n=== Searching for NVEsmForConvTokenClassification definition ==="
rg "class NVEsmForConvTokenClassification" -g '*.py'
# Check what's in esm2_peft_te recipe directory
echo -e "\n=== Contents of esm2_peft_te recipe ==="
fd -a . "bionemo-recipes/recipes/esm2_peft_te" --type f | head -30
# Check imports from other esm2 recipes
echo -e "\n=== Checking esm2_native_te recipe structure ==="
fd -a . "bionemo-recipes/recipes/esm2_native_te" --type f | grep "modeling_esm" | head -20Repository: NVIDIA/bionemo-framework
Length of output: 3711
🏁 Script executed:
#!/bin/bash
# View the train_lora_convnet.py file around the import
echo "=== train_lora_convnet.py imports ==="
head -40 "bionemo-recipes/recipes/esm2_peft_te/train_lora_convnet.py"
# Check if there's a setup.py or pyproject.toml that adds models to path
echo -e "\n=== Check esm2_peft_te for path configuration ==="
find "bionemo-recipes/recipes/esm2_peft_te" -name "setup.py" -o -name "pyproject.toml" -o -name "__init__.py" | head -10
# Check if recipe root has sys.path manipulation
echo -e "\n=== Check for PYTHONPATH or sys.path in train files ==="
grep -r "sys.path" "bionemo-recipes/recipes/esm2_peft_te" --include="*.py" | head -10
# Check parent recipe esm2_native_te for comparison
echo -e "\n=== Compare with esm2_native_te imports ==="
head -40 "bionemo-recipes/recipes/esm2_native_te/train_lora_ddp.py" 2>/dev/null || echo "File not found"
# Check if models/esm2 is meant to be a shared package
echo -e "\n=== Check models/esm2 structure ==="
ls -la "bionemo-recipes/models/esm2/src/" 2>/dev/null || echo "Directory structure:"
find "bionemo-recipes/models/esm2" -name "__init__.py" | head -5Repository: NVIDIA/bionemo-framework
Length of output: 2095
🏁 Script executed:
#!/bin/bash
# View the Dockerfile
echo "=== Dockerfile ==="
cat "bionemo-recipes/recipes/esm2_peft_te/Dockerfile"
# Check if there's a docker-compose or similar
echo -e "\n=== Docker/run configuration ==="
find "bionemo-recipes/recipes/esm2_peft_te" -name "docker-compose*" -o -name "*.sh" | head -10
# Check if there's any entrypoint script
echo -e "\n=== Entrypoint/wrapper scripts ==="
find "bionemo-recipes/recipes/esm2_peft_te" -type f -name "*.sh" -exec head -5 {} \;
# Check other recipes for comparison on how they handle modeling imports
echo -e "\n=== Check esm2_accelerate_te imports ==="
grep "from modeling_esm_te" "bionemo-recipes/recipes/esm2_accelerate_te/train_lora_ddp.py" 2>/dev/null || echo "Not found"
# Check README for setup instructions
echo -e "\n=== README setup instructions ==="
head -100 "bionemo-recipes/recipes/esm2_peft_te/README.md" | grep -A 10 -B 5 "PYTHONPATH\|sys.path\|modeling\|import" || echo "No specific setup found"Repository: NVIDIA/bionemo-framework
Length of output: 1020
🏁 Script executed:
#!/bin/bash
# Check requirements.txt
echo "=== esm2_peft_te requirements.txt ==="
cat "bionemo-recipes/recipes/esm2_peft_te/requirements.txt"
# Check if models/esm2 has setup.py or pyproject.toml
echo -e "\n=== models/esm2 setup configuration ==="
ls -la "bionemo-recipes/models/esm2/" | grep -E "setup|pyproject"
# Check if models/esm2 is installable
echo -e "\n=== models/esm2 pyproject.toml ==="
cat "bionemo-recipes/models/esm2/pyproject.toml" 2>/dev/null || echo "Not found"
# Check the parent bionemo-recipes for any workspace or setup
echo -e "\n=== Root pyproject.toml or setup ==="
head -50 "bionemo-recipes/pyproject.toml" 2>/dev/null || echo "Not found"
# Check esm2_native_te to see how it handles this
echo -e "\n=== esm2_native_te Dockerfile ==="
cat "bionemo-recipes/recipes/esm2_native_te/Dockerfile" 2>/dev/null | head -30 || echo "File not found"
# Check if there's a setup.py in models/esm2
echo -e "\n=== models/esm2 structure ==="
ls -la "bionemo-recipes/models/esm2/"Repository: NVIDIA/bionemo-framework
Length of output: 2186
🏁 Script executed:
#!/bin/bash
# Check if models/esm2 is meant to be installed as a dependency
echo "=== Check if esm package should be in requirements ==="
grep -r "esm" "bionemo-recipes/recipes/esm2_peft_te/requirements.txt"
# Check if there's a higher-level requirements or setup
echo -e "\n=== Check for parent-level requirements ==="
find "bionemo-recipes" -maxdepth 2 -name "requirements.txt" -o -name "pyproject.toml" | head -10
# Verify how esm2_native_te actually handles this
echo -e "\n=== esm2_native_te requirements ==="
cat "bionemo-recipes/recipes/esm2_native_te/requirements.txt" 2>/dev/null || echo "Not found"
# Check if there's any documentation about how recipes should be run
echo -e "\n=== Top-level README ==="
head -100 "bionemo-recipes/README.md" 2>/dev/null | grep -A 5 -B 5 "setup\|install\|PYTHONPATH" || echo "Not found"
# Look for any CI/runner configuration
echo -e "\n=== GitHub workflows or CI config ==="
find . -path "./.git" -prune -o -name "*.yml" -o -name "*.yaml" | grep -E "github|gitlab|ci" | head -5Repository: NVIDIA/bionemo-framework
Length of output: 1125
🏁 Script executed:
#!/bin/bash
# Check if there's an install script or setup docs
echo "=== Check esm2_peft_te directory for setup docs ==="
find "bionemo-recipes/recipes/esm2_peft_te" -type f \( -name "*.md" -o -name "*.txt" -o -name "*.sh" \) | xargs grep -l "PYTHONPATH\|sys.path\|modeling" 2>/dev/null || echo "None found"
# Check if the import works with current setup by looking at test runs
echo -e "\n=== Check test configuration for PYTHONPATH ==="
cat "bionemo-recipes/recipes/esm2_peft_te/tests/conftest.py" 2>/dev/null | head -30
# Check what the actual test file imports
echo -e "\n=== Check test imports ==="
head -40 "bionemo-recipes/recipes/esm2_peft_te/tests/"*.py 2>/dev/null | grep -A 5 -B 5 "import"
# Verify: does the Dockerfile actually work as-is?
echo -e "\n=== Double check Dockerfile details ==="
cat "bionemo-recipes/recipes/esm2_peft_te/Dockerfile"Repository: NVIDIA/bionemo-framework
Length of output: 3749
🏁 Script executed:
#!/bin/bash
# Check how training is supposed to be launched in the container
echo "=== Check for entrypoint or launch instructions ==="
grep -r "ENTRYPOINT\|CMD" "bionemo-recipes/recipes/esm2_peft_te/" --include="Dockerfile" --include="*.md"
# Check if there's an actual way this import resolves
echo -e "\n=== Verify: does train_lora_convnet.py have any sys.path manipulation? ==="
head -100 "bionemo-recipes/recipes/esm2_peft_te/train_lora_convnet.py" | grep -E "sys.path|PYTHONPATH"
# Check if the models/esm2 package is supposed to be installed during Dockerfile build
echo -e "\n=== Check if pip installs from models/esm2 ==="
grep -r "pip install.*models\|pip install.*esm" "bionemo-recipes/recipes/esm2_peft_te/Dockerfile"
# Check esm2_native_te for comparison - how does it work?
echo -e "\n=== esm2_native_te Dockerfile full ==="
cat "bionemo-recipes/recipes/esm2_native_te/Dockerfile"
# Final check: any way to resolve modeling_esm_te locally in peft_te recipe?
echo -e "\n=== Is there a modeling_esm_te.py in esm2_peft_te? ==="
find "bionemo-recipes/recipes/esm2_peft_te" -name "modeling_esm_te.py" -o -name "esm_nv.py" | head -5Repository: NVIDIA/bionemo-framework
Length of output: 781
🏁 Script executed:
#!/bin/bash
# Let me verify the exact directory structure to confirm the violation
echo "=== Complete verification of import resolution ==="
echo "1. Location of modeling_esm_te.py:"
find "bionemo-recipes" -name "modeling_esm_te.py" -type f
echo -e "\n2. Is it copied into esm2_peft_te Dockerfile?"
echo "Checking Dockerfile COPY commands:"
grep "COPY.*modeling\|COPY.*models" "bionemo-recipes/recipes/esm2_peft_te/Dockerfile" || echo "No models/ copy found"
echo -e "\n3. Is models/esm2 in esm2_peft_te's requirements?"
cat "bionemo-recipes/recipes/esm2_peft_te/requirements.txt"
echo -e "\n4. Check if any other esm2_* recipe handles this correctly"
grep -r "from modeling_esm_te\|from esm.modeling" "bionemo-recipes/recipes/" --include="*.py" | grep -v "example_" | head -10
echo -e "\n5. Verify the import statement one more time:"
grep -n "from modeling_esm_te" "bionemo-recipes/recipes/esm2_peft_te/train_lora_convnet.py"Repository: NVIDIA/bionemo-framework
Length of output: 830
Fix self-contained recipe dependency: modeling_esm_te import.
The import from modeling_esm_te import NVEsmForConvTokenClassification (line 25) depends on code in bionemo-recipes/models/esm2/src/esm/modeling_esm_te.py, which is outside this recipe. The Dockerfile does not copy or install this module, and requirements.txt does not include the esm package. This violates the self-containment rule: recipes must not import from other recipes or shared code directories.
Either include modeling_esm_te.py locally in this recipe or install the esm package from models/esm2/ as a documented dependency.
🤖 Prompt for AI Agents
In `@bionemo-recipes/recipes/esm2_peft_te/train_lora_convnet.py` at line 25: The
recipe imports NVEsmForConvTokenClassification from modeling_esm_te in
train_lora_convnet.py which pulls code from another recipe and breaks
self-containment; either vendor the required module into this recipe (add
modeling_esm_te.py with NVEsmForConvTokenClassification implementation alongside
train_lora_convnet.py and update the package/module path) or declare and install
the external esm package from models/esm2 (add it to requirements.txt and update
the Dockerfile to copy/install that package) so that the import in
train_lora_convnet.py resolves without referencing code outside this recipe.
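If the package route is preferred over vendoring, a rough sketch (assuming models/esm2 remains pip-installable via its pyproject.toml, as the analysis above indicates) would be:
# Install the esm package into the recipe image, e.g. from the repo root or a COPY'd build context
pip install ./bionemo-recipes/models/esm2
# The import may then need to use the packaged path, e.g.:
# from esm.modeling_esm_te import NVEsmForConvTokenClassification
Whichever option is chosen, it should be reflected in requirements.txt and the Dockerfile so the import resolves inside the container.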
Description
This PR adds a recipe to perform LoRA fine-tuning of the Evo2 model. It provides support for DDP and sequence packing. It also contains a file, infer.py, that shows how to do inference from a fine-tuned checkpoint. The PR contains the datasets that were used to train the models; eventually we can convert them into HF datasets.
It does not add support for FSDP or FP8 yet.
Usage
cd bionemo-recipes/recipes/evo2_peft_te/
python train_lora_ddp.py
For more information on usage, see the README.
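For the DDP path, a hypothetical multi-GPU launch (the torchrun flags below are placeholders that assume the usual single-node setup; the README remains the authoritative reference):
torchrun --nproc_per_node=8 train_lora_ddp.py   # single node, 8 GPUs; adjust to your hardware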
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline. For more details, see CONTRIBUTING.
Note
By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources.
If the pull request is authorized, it will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123).
Otherwise, an /ok to test comment on the pull request is needed to trigger CI. This will need to be done for each new commit.
Triggering Code Rabbit AI Review
To trigger a code review from CodeRabbit, comment on the pull request with one of its review commands.
See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.
Pre-submit Checklist
Summary by CodeRabbit
Release Notes
New Features
Documentation
Chores