Conversation
Summary of Changes

Hello @Aatman09, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a new GPT-OSS model architecture.
Code Review
This pull request introduces a new GptOss model, which is a GPT-style model with a Mixture-of-Experts (MoE) component. The implementation includes the model definition and a script for loading pretrained weights. My review has identified several significant issues in the model definition (modeling.py) that would prevent it from being trained correctly from scratch. These include incorrect parameter initialization and a missing implementation for the MoE auxiliary loss, which is critical for training such models. I have also pointed out some unused parameters and potentially confusing code. The weight loading script (params.py) appears to be well-implemented. Addressing the feedback on the model definition is crucial for the correctness and usability of this new model.
```python
self.gate_up_proj = nnx.Param(jnp.zeros((self.num_experts, self.hidden_size, 2 * self.expert_dim)))
self.gate_up_proj_bias = nnx.Param(jnp.zeros((self.num_experts, 2 * self.expert_dim)))
self.down_proj = nnx.Param(jnp.zeros((self.num_experts, self.expert_dim, self.hidden_size)))
self.down_proj_bias = nnx.Param(jnp.zeros((self.num_experts, self.hidden_size)))
```
Parameters are initialized with zeros, which will prevent the model from learning correctly if trained from scratch. These should be initialized using a proper random initializer (e.g., from nnx.initializers).
Additionally, the GptOssExperts module's __init__ method does not accept rngs, which is necessary for random initialization. It should be added to the signature and passed from GptOssMLP.
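The suggested fix amounts to replacing `jnp.zeros` with a fan-in-scaled random initializer (in NNX this would be e.g. `nnx.initializers.lecun_normal()` driven by the `rngs` argument). A minimal NumPy sketch of the same math, with illustrative placeholder dimensions rather than the real config values:

```python
import numpy as np

# Illustrative sketch only: the NNX version would thread `rngs` into
# GptOssExperts and use an initializer such as nnx.initializers.lecun_normal().
num_experts, hidden_size, expert_dim = 4, 8, 16
rng = np.random.default_rng(0)

# LeCun-normal: std = 1/sqrt(fan_in), drawn independently per expert.
def lecun_normal(rng, shape, fan_in):
    return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=shape)

gate_up_proj = lecun_normal(rng, (num_experts, hidden_size, 2 * expert_dim), hidden_size)
down_proj = lecun_normal(rng, (num_experts, expert_dim, hidden_size), expert_dim)

# Biases can remain zero-initialized; only the weight matrices must be random
# so that gradient descent can break symmetry between units.
gate_up_proj_bias = np.zeros((num_experts, 2 * expert_dim))
down_proj_bias = np.zeros((num_experts, hidden_size))
```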
```python
hidden_states, _ = self.mlp(hidden_states)
hidden_states = residual + hidden_states
return hidden_states
```
The router_scores (the second element of the tuple returned by self.mlp) are discarded here. For training a Mixture-of-Experts model, these scores are crucial for calculating the auxiliary load-balancing loss. They should be returned by this layer and propagated up to the final model output. This will require changing the return signature of this method and updating the call sites in GptOssModel.
```diff
- hidden_states, _ = self.mlp(hidden_states)
- hidden_states = residual + hidden_states
- return hidden_states
+ hidden_states, router_scores = self.mlp(hidden_states)
+ hidden_states = residual + hidden_states
+ return hidden_states, router_scores
```
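For context on why the scores matter: the standard auxiliary loss (in the style of Switch Transformer) combines the fraction of tokens routed to each expert with the mean router probability per expert, and is minimized when load is balanced. A hedged NumPy sketch with toy shapes, where `router_scores` stands in for the per-token softmax probabilities GptOssMLP would return:

```python
import numpy as np

# Toy shapes; real values come from the model config.
num_tokens, num_experts = 6, 4
rng = np.random.default_rng(0)
logits = rng.normal(size=(num_tokens, num_experts))
router_scores = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

# f_i: fraction of tokens whose top-1 expert is i; P_i: mean router prob for i.
top1 = router_scores.argmax(-1)
f = np.bincount(top1, minlength=num_experts) / num_tokens
P = router_scores.mean(axis=0)

# Scaled so the loss equals 1.0 under perfectly uniform routing.
aux_loss = num_experts * np.sum(f * P)
```

In training, this term would be scaled by a small coefficient and added to the language-modeling loss, which is why the scores must survive to the top of the model.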
```python
self.down_proj = nnx.Param(jnp.zeros((self.num_experts, self.expert_dim, self.hidden_size)))
self.down_proj_bias = nnx.Param(jnp.zeros((self.num_experts, self.hidden_size)))

def __call__(self, hidden_states, router_indices=None, routing_weights=None):
```
The router_indices parameter is unused within this function. This is confusing and suggests a potential disconnect between the router and the expert layer. If the implementation using the dense routing_weights is correct, please remove the unused router_indices parameter from the signature. If a sparse implementation was intended, this function needs to be updated to use the indices.
```diff
- def __call__(self, hidden_states, router_indices=None, routing_weights=None):
+ def __call__(self, hidden_states, routing_weights=None):
```
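To make the dense-vs-sparse distinction concrete: in the dense path, every expert processes every token and the outputs are mixed by `routing_weights`, so per-token expert indices are never consulted. A NumPy sketch with illustrative toy shapes (names and dimensions are placeholders, not the repository's actual code):

```python
import numpy as np

num_tokens, hidden, expert_dim, num_experts = 5, 8, 16, 4
rng = np.random.default_rng(0)
x = rng.normal(size=(num_tokens, hidden))
w_up = rng.normal(size=(num_experts, hidden, expert_dim)) * 0.1
w_down = rng.normal(size=(num_experts, expert_dim, hidden)) * 0.1
routing_weights = np.full((num_tokens, num_experts), 1.0 / num_experts)

h = np.einsum('th,ehd->etd', x, w_up)       # every expert sees every token
h = np.maximum(h, 0.0)                      # stand-in for the real activation
expert_out = np.einsum('etd,edh->eth', h, w_down)
out = np.einsum('te,eth->th', routing_weights, expert_out)  # weighted mix
```

A sparse implementation would instead gather only the tokens assigned to each expert via the indices, trading this simple einsum for lower FLOPs at large expert counts.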
```python
self.num_heads * self.head_dim, self.hidden_size, use_bias=config.attention_bias, rngs=rngs
)

self.sinks = nnx.Param(jnp.zeros((self.num_heads,)))
```
```python
if mask is not None:
    scores = scores + mask

probs = nnx.softmax(scores, axis=-1)
```
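One thing to double-check here: in the reference GPT-OSS attention, the learned `sinks` logit is typically appended as an extra per-head "key" before the softmax and dropped afterwards, so the retained probabilities sum to less than one. A hedged NumPy illustration with toy shapes (this mirrors the commonly published sink mechanism, not necessarily this PR's exact code):

```python
import numpy as np

num_heads, q_len, k_len = 2, 3, 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(num_heads, q_len, k_len))
sinks = np.zeros((num_heads,))  # learned parameter; zeros at init

# Append the sink logit as an extra key column, normalize, then discard it.
sink_col = np.broadcast_to(sinks[:, None, None], (num_heads, q_len, 1))
combined = np.concatenate([scores, sink_col], axis=-1)
e = np.exp(combined - combined.max(-1, keepdims=True))
probs = (e / e.sum(-1, keepdims=True))[..., :-1]  # sink mass is discarded
```

If the PR's `nnx.softmax(scores, axis=-1)` is applied without the sink column, the `sinks` parameter would be dead weight, so this is worth verifying against the reference weights.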
@Aatman09 Good to have GPT-OSS here. Do add tests for layers, attention, and final logits, plus a run_model.py file. Since this is an MoE, do add sharding if possible.
Resolves #143
Reference
I have implemented modeling.py and params.py, and have yet to commit test_outputs.py and run_model.py.
Checklist
run_model.py for model usage, test_outputs.py and/or model_validation_colab.ipynb for quality.