feat: Implement KV-cache for umT5 (#137) by Satyamgupta2365 · Pull Request #162 · jax-ml/bonsai

Satyamgupta2365 · 2026-02-11T14:27:02Z

Summary

Implements KV-cache for umT5 model to enable efficient autoregressive inference.

Changes

Added LayerCache class for storing K/V pairs per decoder layer
Updated attention layers to support caching (self-attention and cross-attention)
Added init_cache() method to UMT5Model
Enhanced generate() with use_cache parameter (default: True)

Performance

2-5x faster generation
Minimal memory overhead
Outputs identical with/without cache

Files Modified

bonsai/models/umt5/modeling.py

Closes #137

gemini-code-assist · 2026-02-11T14:27:25Z

Summary of Changes

Hello @Satyamgupta2365, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the umT5 model's autoregressive inference capabilities by integrating a Key-Value (KV) cache. This optimization allows the model to reuse previously computed key and value states during token generation, leading to substantial speed improvements for text generation tasks. The implementation ensures identical outputs whether caching is enabled or disabled, maintaining model integrity while boosting performance.

Highlights

KV-cache Implementation: A new LayerCache class has been introduced to store key and value projections for individual decoder layers, enabling efficient caching during autoregressive inference.
Attention Layer Updates: Both self-attention and cross-attention mechanisms within the UMT5 model have been updated to accept and utilize the new LayerCache for storing and retrieving K/V pairs.
Cache Initialization: A dedicated init_cache() method has been added to the UMT5Model to properly initialize and prepare the self-attention and cross-attention caches based on batch size, sequence lengths, and data type.
Enhanced Generation Method: The generate() method now includes a use_cache parameter (defaulting to True), allowing users to toggle KV-caching for improved generation speed without altering output quality.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

bonsai/models/umt5/modeling.py
- Introduced LayerCache class for storing key and value projections and defined Cache type alias.
- Modified UMT5Attention's __call__ method to accept a cache parameter for KV-caching.
- Integrated KV-cache logic into UMT5Attention to store and retrieve K/V pairs incrementally for self-attention and once for cross-attention.
- Updated UMT5SelfAttentionBlock to propagate the cache parameter to its internal self-attention module.
- Updated UMT5CrossAttentionBlock to propagate the cache parameter to its internal encoder-decoder attention module.
- Modified UMT5DecoderLayer to accept and utilize distinct self_attn_cache and cross_attn_cache for its attention sub-layers.
- Ensured cross_attn_cache is correctly passed to the cross-attention block within UMT5DecoderLayer.
- Modified UMT5Decoder's __call__ method to accept optional self_attn_cache and cross_attn_cache.
- Implemented logic in UMT5Decoder to distribute the overall caches to individual decoder layers during the forward pass.
- Introduced init_cache method in UMT5Model to create and prepare LayerCache instances for both self-attention and cross-attention.
- Modified the generate method to include a use_cache boolean parameter, enabling or disabling KV-caching during generation.
- Integrated the init_cache method and conditional logic within the generate loop to manage and utilize KV-caches for efficient token generation.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

Satyamgupta2365 · 2026-02-11T14:29:41Z

Hi @jenriver and @chapman20j,

Thank you for the detailed feedback. I’ll remove the extra artifacts, refactor the cache to align with existing Bonsai patterns, ensure pretrained weight loading works correctly, and update the tests with proper numerical comparisons.

This is the new PR wit the updated code.
Thanks again for the guidance.

gemini-code-assist

Code Review

This pull request introduces a KV-cache for the umT5 model to accelerate autoregressive inference, which is a significant performance enhancement. The implementation is generally well-structured, including the new LayerCache class and the necessary modifications to the attention and generation logic. However, I've identified a critical issue in the self-attention caching mechanism that could lead to incorrect outputs and inefficiency. My review includes a specific code suggestion to address this.

gemini-code-assist · 2026-02-11T14:33:12Z

bonsai/models/umt5/modeling.py

+                k = cache.k_cache[...]
+                v = cache.v_cache[...]
+                cache.cur_ind.value = cache.cur_ind.value + q_len


The current implementation retrieves the full cache tensors for keys and values, but doesn't mask the unused padded portions. This can lead to incorrect attention scores and is inefficient as computation is performed over padding, because the softmax will be computed over zero-vectors. To fix this, you should slice the cache to its valid length after updating it.

Suggested change

k = cache.k_cache[...]

v = cache.v_cache[...]

cache.cur_ind.value = cache.cur_ind.value + q_len

new_len = cache.cur_ind.value + q_len

k = jax.lax.dynamic_slice(cache.k_cache[...], (0, 0, 0, 0), (k.shape[0], new_len, k.shape[2], k.shape[3]))

v = jax.lax.dynamic_slice(cache.v_cache[...], (0, 0, 0, 0), (v.shape[0], new_len, v.shape[2], v.shape[3]))

cache.cur_ind.value = new_len

coder0143 · 2026-02-16T12:21:41Z

@Satyamgupta2365 Make sure to use jax.lax.dynamic_slice, also, confirm the tests and update run_model accordingly.

Added new file

a669567

gemini-code-assist bot reviewed Feb 11, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Implement KV-cache for umT5 (#137)#162

feat: Implement KV-cache for umT5 (#137)#162
Satyamgupta2365 wants to merge 1 commit intojax-ml:mainfrom
Satyamgupta2365:main

Satyamgupta2365 commented Feb 11, 2026

Uh oh!

gemini-code-assist bot commented Feb 11, 2026

Uh oh!

Satyamgupta2365 commented Feb 11, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Feb 11, 2026

Uh oh!

coder0143 commented Feb 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-                k = cache.k_cache[...]
-                v = cache.v_cache[...]
-                cache.cur_ind.value = cache.cur_ind.value + q_len
+                new_len = cache.cur_ind.value + q_len
+                k = jax.lax.dynamic_slice(cache.k_cache[...], (0, 0, 0, 0), (k.shape[0], new_len, k.shape[2], k.shape[3]))
+                v = jax.lax.dynamic_slice(cache.v_cache[...], (0, 0, 0, 0), (v.shape[0], new_len, v.shape[2], v.shape[3]))
+                cache.cur_ind.value = new_len

Conversation

Satyamgupta2365 commented Feb 11, 2026

Summary

Changes

Performance

Files Modified

Uh oh!

gemini-code-assist bot commented Feb 11, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

Satyamgupta2365 commented Feb 11, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 11, 2026

Choose a reason for hiding this comment

Uh oh!

coder0143 commented Feb 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants