Agreed. We have ongoing discussions within the team about this, but it may take some time for us to settle on a better structure. You can check the discussion in the closed PR.
Hello World!
Upfront disclaimer: I'm no LLM researcher.
Looking at the additional heads, I'm wondering whether the model could benefit from a residual connection from head N to head N+1. Since token N+1 strongly depends on token N, I would expect accuracy to improve, especially as the number of heads grows.
In its simplest form:
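One way to picture the idea is the NumPy sketch below. It is an illustration under assumptions, not code from the project: it assumes each extra head is a small residual block (`x + silu(W @ x)`) applied to the backbone's final hidden state `h` and followed by a shared unembedding matrix, which the post does not state. The function names `independent_heads` and `chained_heads`, the SiLU activation, and all sizes are made up for the example.

```python
import numpy as np


def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))


def independent_heads(h, head_weights, unembed):
    """Baseline (assumed): every head reads the same hidden state h."""
    return [unembed @ (h + silu(W @ h)) for W in head_weights]


def chained_heads(h, head_weights, unembed):
    """Proposed variant: head n+1 starts from head n's output, not from h."""
    logits = []
    x = h
    for W in head_weights:
        # The residual stream is carried over from the previous head.
        x = x + silu(W @ x)
        logits.append(unembed @ x)
    return logits


# Toy sizes and random weights, purely hypothetical.
rng = np.random.default_rng(0)
hidden, vocab, num_heads = 8, 16, 3
h = rng.standard_normal(hidden)
head_weights = [0.1 * rng.standard_normal((hidden, hidden)) for _ in range(num_heads)]
unembed = rng.standard_normal((vocab, hidden))

base = independent_heads(h, head_weights, unembed)
chained = chained_heads(h, head_weights, unembed)
```

In this sketch the first head is identical in both variants; only from the second head on does the chained version see what the previous head computed. Whether that actually improves accuracy would have to be measured by training both variants.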