5 Comments
Ronan McGovern

I think the issue with multi-query attention is more about quality than parallelization, because you're basically using the same K and V values for every query head. Grouped-query attention takes an approach that sits between multi-query and multi-head.
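
A minimal sketch of that sharing scheme, assuming NumPy, no causal mask, and no learned projections. The function name and tensor shapes are illustrative and are not taken from the paper or from any particular library:

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d); k and v: (n_kv_heads, seq, d).

    n_kv_heads == n_q_heads behaves like multi-head attention,
    n_kv_heads == 1 behaves like multi-query attention,
    and anything in between is grouped-query attention.
    """
    n_q_heads, _, d = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group = n_q_heads // n_kv_heads

    # Each K/V head is shared by `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)

    # Scaled dot-product attention with a numerically stable softmax.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With 8 query heads, passing 8, 2, or 1 K/V heads gives the multi-head, grouped-query, and multi-query cases respectively; only the size of the K/V tensors that must be stored changes.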

Benjamin Marie

Yes, that's what previous work mainly pointed out. But in their paper, they show that quality with multi-query is actually almost comparable to the vanilla Transformer (Table 9). If I understood correctly, this is why they use the "parallelization" angle to motivate GQA.

Ronan McGovern

I think they use the parallel approach, so that would be the right-hand diagram in the figure.

Benjamin Marie

Thanks for pointing this out. I misread the figure. It's corrected now.

Ronan McGovern

Great piece btw, thanks
