Discussion about this post

Ronan McGovern

I think the issue with multi-query attention is more about quality than parallelization, because every query head shares the same single set of K and V values. Grouped-query attention takes an approach between multi-query and multi-head: the query heads are split into groups, and each group shares its own K and V.
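To make the comparison concrete, here is a minimal sketch of how query heads could map onto K/V heads in the three schemes. The function names (`kv_head_index`, `kv_assignment`) and the head counts are hypothetical, and assigning contiguous query heads to the same group is an assumption; it is a common convention, not the only possible layout.

```python
def kv_head_index(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return the index of the K/V head that a given query head reads from.

    Assumes contiguous grouping: query heads 0..g-1 share K/V head 0, etc.
    """
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be divisible by n_kv_heads")
    group_size = n_q_heads // n_kv_heads  # query heads per shared K/V head
    return q_head // group_size


def kv_assignment(n_q_heads: int, n_kv_heads: int) -> list[int]:
    """List the K/V head used by every query head."""
    return [kv_head_index(h, n_q_heads, n_kv_heads) for h in range(n_q_heads)]


# Illustrative head counts only (not taken from any particular model).
multi_head = kv_assignment(8, 8)   # one K/V head per query head
grouped = kv_assignment(8, 2)      # two groups of four query heads
multi_query = kv_assignment(8, 1)  # every query head shares one K/V head

print("multi-head: ", multi_head)
print("grouped:    ", grouped)
print("multi-query:", multi_query)
```

With one K/V head this reduces to multi-query, and with as many K/V heads as query heads it reduces to multi-head, so grouped-query attention covers the range between the two.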

