Make MultiHeadAttention op return attention probabilities #23125
base: main
Conversation
/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline
/azp run Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-linux-gpu-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline,Android CI Pipeline
/azp run iOS CI Pipeline,ONNX Runtime React Native CI Pipeline,CoreML CI Pipeline,Linux DNNL CI Pipeline,Linux MIGraphX CI Pipeline,Linux ROCm CI Pipeline
Azure Pipelines successfully started running 6 pipeline(s).
Azure Pipelines successfully started running 10 pipeline(s).
Azure Pipelines successfully started running 9 pipeline(s).
T* attn_probs_data = nullptr;
if (attn_probs == nullptr) {
  size_t bytes = SafeInt<size_t>(batch_size) * num_heads_ * sequence_length * total_sequence_length * sizeof(T);
  attention_probs = allocator->Alloc(bytes);
There is no need to allocate extra space if we do not output it. You can follow the handling of output_qk (the temporary result of q*k before softmax) in this function. If we never output both q*k and softmax(q*k), we can consolidate them by using a single boolean flag that indicates whether the output is taken before or after softmax.
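A minimal sketch of that consolidation in plain C++, with illustrative names (QkOutput, SoftmaxWithOptionalOutput, and the buffer handling are assumptions, not the actual ONNX Runtime helpers): one scratch buffer holds the q*k scores, and a single flag records which intermediate, if either, is copied out.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Which intermediate the caller asked for: q*k before softmax, the
// probabilities after softmax, or neither. Because at most one is
// requested, a single flag and a single output pointer suffice.
enum class QkOutput { kNone, kPreSoftmax, kPostSoftmax };

// Applies softmax in place over `rows` rows of `cols` logits, copying the
// requested intermediate into `out` (when non-null) along the way.
void SoftmaxWithOptionalOutput(std::vector<float>& scores,
                               std::size_t rows, std::size_t cols,
                               QkOutput requested, float* out) {
  if (requested == QkOutput::kPreSoftmax && out != nullptr) {
    std::copy(scores.begin(), scores.end(), out);  // capture raw q*k scores
  }
  for (std::size_t r = 0; r < rows; ++r) {
    float* row = scores.data() + r * cols;
    const float mx = *std::max_element(row, row + cols);  // numerical stability
    float sum = 0.0f;
    for (std::size_t c = 0; c < cols; ++c) {
      row[c] = std::exp(row[c] - mx);
      sum += row[c];
    }
    for (std::size_t c = 0; c < cols; ++c) row[c] /= sum;
  }
  if (requested == QkOutput::kPostSoftmax && out != nullptr) {
    std::copy(scores.begin(), scores.end(), out);  // capture probabilities
  }
}
```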
@@ -1034,6 +1058,11 @@ ONNX_MS_OPERATOR_SET_SCHEMA(
        "or present state for self attention value with shape (batch_size, num_heads, total_sequence_length, head_size)",
        "T",
        OpSchema::Optional)
.Output(3,
You will need to update the documents (you can find the updated documents in the artifacts of the Windows GPU Doc Gen CI Pipeline for this PR).
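For reference, a hypothetical completion of the truncated .Output(3, ...) declaration above, following the pattern of the existing optional outputs; the output name, description text, and shape are assumptions based on the PR description, not necessarily what the PR commits:

```cpp
.Output(3,
        "attention_probs",  // assumed name for the new optional output
        "Attention probabilities after softmax with shape "
        "(batch_size, num_heads, sequence_length, total_sequence_length)",
        "T",
        OpSchema::Optional)
```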
auto& key_shape = getInputShape(ctx, 1);
auto& key_seqlen_dim = key_shape.dim()[1];
auto& past_seqlen_dim = getInputShape(ctx, past_key_index).dim()[2];
if (key_seqlen_dim.has_dim_value() && past_seqlen_dim.has_dim_value()) {
Add a condition of !past_present_share_buffer here.
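A minimal sketch of the suggested guard, assuming a boolean past_present_share_buffer is already in scope (how the PR actually reads the attribute may differ). When past and present share a preallocated buffer, the past dim holds the buffer's maximum length rather than the actual past length, so adding it to the key sequence length would overstate the total:

```cpp
// Only fold past length into the total when the buffer is not shared;
// otherwise dim[2] is the preallocated max length, not the real past length.
if (!past_present_share_buffer &&
    key_seqlen_dim.has_dim_value() && past_seqlen_dim.has_dim_value()) {
  const int64_t total_sequence_length =
      key_seqlen_dim.dim_value() + past_seqlen_dim.dim_value();
  // ... propagate total_sequence_length to the attention-probs output shape ...
}
```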
Description
Add an additional optional output to the MultiHeadAttention op, allowing it to return the attention probabilities.
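For context, a minimal sketch of how a consumer could opt into the new output through the ONNX Runtime C++ API, assuming the model exposes it as a graph output named attention_probs (the output name and model path here are placeholders):

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "mha-demo");
  Ort::SessionOptions opts;
  // On Windows the model path is a wide string; char* shown for brevity.
  Ort::Session session(env, "model_with_mha.onnx", opts);

  // ... build input tensors and input_names as usual ...

  // Optional outputs are requested simply by naming them at Run time.
  const char* output_names[] = {"output", "attention_probs"};
  // std::vector<Ort::Value> results =
  //     session.Run(Ort::RunOptions{nullptr},
  //                 input_names, input_tensors.data(), num_inputs,
  //                 output_names, 2);
  // results[1] would hold probabilities shaped
  // (batch_size, num_heads, sequence_length, total_sequence_length).
  return 0;
}
```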
Motivation and Context