Hi!
I think something is wrong when calculating load_act of qk_matmul in the prefill stage.
From my understanding, the load_act of qk_matmul should be calculated as load_act = seqlen * head_size * batchsize * num_attention_heads * a_byte. However, in the code at model_analyzer.py#L359, it is written as load_act = seqlen * head_size * batchsize * num_key_value_heads * a_byte.
Could it be that I'm misunderstanding some fundamental concepts, or is there a potential issue with the code?
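For a concrete sense of how much the two formulas differ, here is a minimal sketch (not the repository's code) comparing them under an assumed GQA configuration; the dimensions are made up for illustration:

```python
# Illustrative only: assumed GQA config (e.g. 64 attention heads, 8 KV heads,
# head_size 128, batchsize 1, seqlen 2048, a_byte 2 for fp16 activations).
seqlen, head_size, batchsize, a_byte = 2048, 128, 1, 2
num_attention_heads, num_key_value_heads = 64, 8

load_act_with_q_heads = seqlen * head_size * batchsize * num_attention_heads * a_byte
load_act_with_kv_heads = seqlen * head_size * batchsize * num_key_value_heads * a_byte
print(load_act_with_q_heads, load_act_with_kv_heads)  # differ by a factor of 64 / 8 = 8
```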
Thanks!
I think load_act = seqlen * head_size * batchsize * num_key_value_heads * a_byte is correct. If num_key_value_heads = num_attention_heads, the model uses Multi-Head Attention (MHA); if num_key_value_heads = 1, it uses Multi-Query Attention (MQA); otherwise it uses Grouped-Query Attention (GQA). Using num_key_value_heads covers the general case.
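For reference, a hypothetical helper (not from the repository) showing how num_key_value_heads selects the attention variant:

```python
# Hypothetical helper, for illustration only.
def attention_variant(num_attention_heads: int, num_key_value_heads: int) -> str:
    if num_key_value_heads == num_attention_heads:
        return "MHA"  # every query head has its own K/V head
    if num_key_value_heads == 1:
        return "MQA"  # all query heads share a single K/V head
    return "GQA"      # query heads are grouped over num_key_value_heads K/V heads

print(attention_variant(32, 32))  # MHA
print(attention_variant(32, 1))   # MQA
print(attention_variant(64, 8))   # GQA
```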
In qk_matmul, we need to load the Q matrix and the K matrix. Based on my understanding, the shape of the Q matrix is [batchsize, num_attention_heads, seqlen, head_size], and the shape of the K matrix is [batchsize, num_key_value_heads, seqlen, head_size]. Therefore, load_act = seqlen * head_size * batchsize * num_attention_heads * a_byte and load_kv_cache = seqlen * head_size * batchsize * num_key_value_heads * kv_byte, just as in the formula for the decode stage at model_analyzer.py#L264.
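A small sketch of the formulas I have in mind for the prefill qk_matmul (illustrative only, not the analyzer's actual implementation):

```python
# Sketch of the proposed prefill qk_matmul load calculation (not the repo's code).
def qk_matmul_prefill_loads(batchsize, seqlen, head_size,
                            num_attention_heads, num_key_value_heads,
                            a_byte, kv_byte):
    # Q is [batchsize, num_attention_heads, seqlen, head_size] -> activation load
    load_act = seqlen * head_size * batchsize * num_attention_heads * a_byte
    # K is [batchsize, num_key_value_heads, seqlen, head_size] -> KV-cache load
    load_kv_cache = seqlen * head_size * batchsize * num_key_value_heads * kv_byte
    return load_act, load_kv_cache
```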