Make MultiHeadAttention op return attention probabilities #23125

amancini-N · 2024-12-16T16:24:39Z

Description

Add an optional output to the MultiHeadAttention op that returns the attention probabilities.

Motivation and Context

Fixes MultiHeadAttention op shall return attention probabilities #23124

tianleiwu · 2024-12-17T17:51:30Z

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline

tianleiwu · 2024-12-17T17:51:31Z

/azp run Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-linux-gpu-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline,Android CI Pipeline

tianleiwu · 2024-12-17T17:51:33Z

/azp run iOS CI Pipeline,ONNX Runtime React Native CI Pipeline,CoreML CI Pipeline,Linux DNNL CI Pipeline,Linux MIGraphX CI Pipeline,Linux ROCm CI Pipeline

azure-pipelines · 2024-12-17T17:52:01Z

Azure Pipelines successfully started running 6 pipeline(s).

azure-pipelines · 2024-12-17T17:52:09Z

Azure Pipelines successfully started running 10 pipeline(s).

azure-pipelines · 2024-12-17T17:52:10Z

Azure Pipelines successfully started running 9 pipeline(s).

tianleiwu · 2024-12-17T17:54:49Z

onnxruntime/contrib_ops/cpu/bert/attention_cpu_base.h

+    T* attn_probs_data = nullptr;
+    if (attn_probs == nullptr) {
+      size_t bytes = SafeInt<size_t>(batch_size) * num_heads_ * sequence_length * total_sequence_length * sizeof(T);
+      attention_probs = allocator->Alloc(bytes);


There is no need to allocate extra space if we do not output it. You can follow the handling of output_qk (the temporary result of q*k before softmax) in this function.

If we never output both q*k and softmax(q*k) at the same time, we can consolidate them by using a boolean flag that indicates whether we need to output the result before or after softmax.
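A minimal sketch of the suggestion above, not the actual ONNX Runtime code: the scores are normalized in place, and a copy is made only when the caller supplies an output pointer. The function name `SoftmaxWithOptionalOutput` and the flag `output_after_softmax` are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (assumption, not from the PR): row-wise softmax over a
// [rows x cols] score matrix, computed in place in `scores`.
// If `optional_out` is null, nothing extra is allocated or written.
// If it is non-null, `output_after_softmax` selects which stage is copied:
//   false -> raw q*k scores, true -> softmax(q*k) probabilities.
void SoftmaxWithOptionalOutput(std::vector<float>& scores, std::size_t rows,
                               std::size_t cols, float* optional_out,
                               bool output_after_softmax) {
  assert(scores.size() == rows * cols);
  if (rows == 0 || cols == 0) return;

  if (optional_out != nullptr && !output_after_softmax) {
    std::copy(scores.begin(), scores.end(), optional_out);
  }

  for (std::size_t r = 0; r < rows; ++r) {
    float* row = scores.data() + r * cols;
    // Subtract the row maximum for numerical stability.
    const float max_val = *std::max_element(row, row + cols);
    float sum = 0.0f;
    for (std::size_t c = 0; c < cols; ++c) {
      row[c] = std::exp(row[c] - max_val);
      sum += row[c];
    }
    for (std::size_t c = 0; c < cols; ++c) {
      row[c] /= sum;
    }
  }

  if (optional_out != nullptr && output_after_softmax) {
    std::copy(scores.begin(), scores.end(), optional_out);
  }
}
```

With a single optional pointer and one flag, the caller that requests no extra output pays for no extra buffer, and the pre-softmax and post-softmax cases share one code path.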

tianleiwu · 2024-12-17T18:51:07Z

onnxruntime/core/graph/contrib_ops/bert_defs.cc

@@ -1034,6 +1058,11 @@ ONNX_MS_OPERATOR_SET_SCHEMA(
                "or present state for self attention value with shape (batch_size, num_heads, total_sequence_length, head_size)",
                "T",
                OpSchema::Optional)
+        .Output(3,


You will need to update the documents (you can find the updated documents in the artifacts of the Windows GPU Doc Gen CI Pipeline for this PR).

tianleiwu · 2024-12-17T21:18:45Z

onnxruntime/core/graph/contrib_ops/bert_defs.cc

+      auto& key_shape = getInputShape(ctx, 1);
+      auto& key_seqlen_dim = key_shape.dim()[1];
+      auto& past_seqlen_dim = getInputShape(ctx, past_key_index).dim()[2];
+      if (key_seqlen_dim.has_dim_value() && past_seqlen_dim.has_dim_value()) {


Add a condition of !past_present_share_buffer here.
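A rough sketch of the guarded inference, under the assumption that with a shared past/present buffer the third dimension of the past tensor is the buffer capacity rather than the past sequence length, so adding it to the key length would give a wrong total. The `Dim` alias and `InferTotalSequenceLength` are hypothetical stand-ins, not the ONNX shape-inference API.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for a symbolic dimension: nullopt means "unknown".
using Dim = std::optional<std::int64_t>;

// Sketch: infer total_sequence_length only when both dimensions are known
// and past/present do not share one buffer; otherwise leave it unknown.
Dim InferTotalSequenceLength(Dim key_seqlen, Dim past_seqlen,
                             bool past_present_share_buffer) {
  if (key_seqlen.has_value() && past_seqlen.has_value() &&
      !past_present_share_buffer) {
    return *key_seqlen + *past_seqlen;
  }
  return std::nullopt;  // keep the output dimension symbolic
}
```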

amancini-N added 2 commits December 11, 2024 14:18

Allow returning attention probs from MultiHeadAttention

3147d51

Add CUDA implementation for attn_probs

239df8b

tianleiwu reviewed Dec 17, 2024
