[webgpu] Optimize matmulnbits with M > 1 #23102

Merged
merged 6 commits into from
Dec 17, 2024

qjia7
Contributor

@qjia7 qjia7 commented Dec 13, 2024

This is the webgpu native ep implementation of #23092.

I tested with https://github.com/fs-eire/ort-webgpu-nodejs-chatapp-prototype, with fs-eire/ort-webgpu-nodejs-chatapp-prototype#2 applied to print the first-token time.

The results are below.
Latest main branch:
Intel Arc Graphics

659 tokens in 24.8sec, 26.57 tokens/sec
    Decoding first token with input 449 tokens: 13.0 sec
    Decoding remaining 210 tokens:
        11.8 sec
        17.79 tokens/sec

NV RTX 2000

659 tokens in 14.4sec, 45.85 tokens/sec
    Decoding first token with input 449 tokens: 7.3 sec
    Decoding remaining 210 tokens:
        7.0 sec
        29.81 tokens/sec

With this PR:
Intel Arc Graphics

657 tokens in 20.6sec, 31.92 tokens/sec
    Decoding first token with input 449 tokens: 8.5 sec
    Decoding remaining 208 tokens:
        12.1 sec
        17.23 tokens/sec

NV RTX 2000

659 tokens in 11.4sec, 57.93 tokens/sec
    Decoding first token with input 449 tokens: 4.1 sec
    Decoding remaining 210 tokens:
        7.2 sec
        28.98 tokens/sec

From the data above, the first-token time improves with this PR on both the Intel (13.0s -> 8.5s) and NV (7.3s -> 4.1s) GPUs.
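As a rough illustration of what those numbers imply, the sketch below computes the first-token speedup and the prefill throughput from the timings quoted above. The variable and function names are made up for this example, and the 449-token prompt length is taken from the log lines.

```python
# First-token (prefill) times in seconds, copied from the results above.
PROMPT_TOKENS = 449
first_token_sec = {
    "Intel Arc Graphics": {"main": 13.0, "pr": 8.5},
    "NV RTX 2000": {"main": 7.3, "pr": 4.1},
}


def speedup(before_sec: float, after_sec: float) -> float:
    """Return how many times faster the new time is than the old one."""
    return before_sec / after_sec


summary = {}
for gpu, t in first_token_sec.items():
    summary[gpu] = {
        "speedup": speedup(t["main"], t["pr"]),
        # Prefill throughput: prompt tokens processed per second.
        "main_tokens_per_sec": PROMPT_TOKENS / t["main"],
        "pr_tokens_per_sec": PROMPT_TOKENS / t["pr"],
    }

for gpu, s in summary.items():
    print(
        f"{gpu}: {s['speedup']:.2f}x faster first token, "
        f"{s['main_tokens_per_sec']:.1f} -> {s['pr_tokens_per_sec']:.1f} prompt tokens/sec"
    )
```

On these numbers the first-token speedup is roughly 1.5x on the Intel GPU and 1.8x on the NV GPU, while the decode rate for the remaining tokens stays within about 3% of main.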

@qjia7
Contributor Author

qjia7 commented Dec 13, 2024

@sushraja-msft @sushanthr So far I have only tested this on my laptop with dual GPUs; the data is in the description above. Please help verify on your side whether you see similar results, since our GPUs and benchmarks are not the same.

cc @guschmue @fs-eire This PR still needs further refactoring to remove some duplicated code. For now it is just for verification.

@sushraja-msft
Contributor

Ran your change on my Intel Xe laptop; it is faster than mine 👏: 55 tokens/s vs 44 tokens/s with mine.
We should land yours, and it's okay to remove my implementation. @guschmue @qjia7

C:\model_benchmark>model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       9.10299e+06
        avg (tokens/s): 55.0369                                <<<<
        p50 (us):       9.09658e+06
        stddev (us):    13042.6
        n:              5 * 501 token(s)
Token generation:
        avg (us):       79482.3
        avg (tokens/s): 12.5814
        p50 (us):       79505.4
        stddev (us):    2280.06
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       18.0841
        avg (tokens/s): 55297.3
        p50 (us):       14.4
        stddev (us):    24.9088
        n:              640 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       19199.6
        p50 (ms):       19200.1
        stddev (ms):    20.0724
        n:              5
Peak working set size (bytes): 5470642176
WebGPU device lost (2): Device was destroyed.
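The "avg (tokens/s)" rows in this output follow from the "avg (us)" rows and the number of tokens per sample. A minimal sketch of that conversion, assuming the tool simply divides tokens by average latency (the helper name is hypothetical and not part of model_benchmark):

```python
def tokens_per_sec(avg_us: float, tokens_per_sample: int) -> float:
    """Convert an average latency in microseconds into tokens per second."""
    return tokens_per_sample * 1e6 / avg_us


# Values copied from the benchmark output above.
prompt_tps = tokens_per_sec(9.10299e06, 501)   # prompt processing, 501 tokens per sample
generation_tps = tokens_per_sec(79482.3, 1)    # token generation, 1 token per sample

print(f"prompt processing: {prompt_tps:.4f} tokens/s")
print(f"token generation:  {generation_tps:.4f} tokens/s")
```

Both results agree with the reported 55.0369 and 12.5814 tokens/s to within rounding of the printed averages.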

@guschmue
Contributor

very cool JiaJia, I can run it on a bunch of machines

@guschmue
Contributor

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

@guschmue
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

Azure Pipelines successfully started running 2 pipeline(s).

@guschmue
Contributor

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

@guschmue
Contributor

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).

Azure Pipelines successfully started running 3 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Dec 14, 2024
@qjia7
Contributor Author

qjia7 commented Dec 16, 2024

@guschmue @fs-eire This is ready for review. Please take a look, thanks.

@guschmue
Contributor

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

@guschmue
Contributor

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

@guschmue
Contributor

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

Azure Pipelines successfully started running 2 pipeline(s).

@guschmue
Contributor

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).

Azure Pipelines successfully started running 3 pipeline(s).

Azure Pipelines successfully started running 9 pipeline(s).

@guschmue
Contributor

/azp run ONNX Runtime Web CI Pipeline

Azure Pipelines successfully started running 1 pipeline(s).

@guschmue guschmue merged commit 0981bbf into microsoft:main Dec 17, 2024
77 checks passed
guschmue pushed a commit that referenced this pull request Dec 20, 2024
### Description
After the prefill-time optimization in #23102, it seems that always
using the tile matmulnbits path with block_size = 32 gives better
performance for the phi3 model, even on discrete GPUs.

Phi3 improves from 32.82 tokens/sec to 42.64 tokens/sec in easy mode on my
NV RTX 2000 GPU.
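
For readers unfamiliar with the operator, the following is a minimal pure-Python sketch of a block-quantized 4-bit matmul with block_size = 32 and M > 1 input rows. It is not the WebGPU shader from this PR and does not claim to reproduce ONNX Runtime's tensor layout: the packing order (two 4-bit values per byte, low nibble first), the fixed zero point of 8, and every function name here are assumptions made for illustration.

```python
BLOCK_SIZE = 32  # block size named in the commit message
ZERO_POINT = 8   # assumed fixed zero point for unsigned 4-bit values


def quantize_row(row):
    """Quantize one weight row (length a multiple of BLOCK_SIZE) to 4 bits.

    Returns (packed_bytes, scales) with one scale per block of 32 weights.
    """
    packed, scales = [], []
    for start in range(0, len(row), BLOCK_SIZE):
        block = row[start:start + BLOCK_SIZE]
        max_abs = max(abs(v) for v in block)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        q = [min(15, max(0, round(v / scale) + ZERO_POINT)) for v in block]
        # Pack two 4-bit values per byte, low nibble first (assumed layout).
        packed.extend(q[i] | (q[i + 1] << 4) for i in range(0, BLOCK_SIZE, 2))
        scales.append(scale)
    return packed, scales


def dequantize_row(packed, scales):
    """Expand packed 4-bit weights back to floats using per-block scales."""
    row = []
    for i, byte in enumerate(packed):
        scale = scales[(2 * i) // BLOCK_SIZE]
        row.append(((byte & 0x0F) - ZERO_POINT) * scale)
        row.append(((byte >> 4) - ZERO_POINT) * scale)
    return row


def matmul_nbits(a, packed_b, scales_b):
    """Compute Y[m][n] = sum_k A[m][k] * dequant(B)[n][k] for an M x K input A."""
    weights = [dequantize_row(p, s) for p, s in zip(packed_b, scales_b)]
    return [
        [sum(x * w for x, w in zip(a_row, w_row)) for w_row in weights]
        for a_row in a
    ]
```

With M > 1 (the prefill case), every row of A reuses the same dequantized weights, which is the reuse a tiled kernel can exploit; this sketch only shows the arithmetic, not the tiling.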