
[mlas] Speed up tanhf activation function #20612

Open · r-devulap wants to merge 13 commits into main

Conversation

@r-devulap (Author)

Description

A new, faster algorithm for the tanhf activation function, based on Intel SVML.

Motivation and Context

Improves the performance of tanhf by up to ~38%. The new algorithm also fixes a bug in the current tanhf implementation, whose output can fall outside the valid range [-1, 1]. Example: for x = +0x1.06417ep+003, the current code returns tanhf = +0x1.000002p+000.
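For reference, this kind of bound violation can be caught by exhaustively sweeping all finite floats; the harness below is a hypothetical sketch, with std::tanh standing in for the MLAS kernel under test:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  // Sweep every 32-bit pattern, reinterpret it as a float, and flag any
  // finite input whose tanh result falls outside [-1, 1].
  for (uint64_t bits = 0; bits <= 0xFFFFFFFFull; ++bits) {
    const uint32_t b = static_cast<uint32_t>(bits);
    float x;
    std::memcpy(&x, &b, sizeof(x));
    if (!std::isfinite(x)) continue;
    const float y = std::tanh(x);  // substitute the tanhf kernel under test
    if (y > 1.0f || y < -1.0f) {
      std::printf("out of bounds: tanhf(%a) = %a\n", x, y);
    }
  }
  return 0;
}
```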

```
Benchmark                                                 Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------
[BM_Tanh vs. BM_Tanh]/40000/real_time                  -0.3822         -0.3825         15059          9304         15035          9283
[BM_Tanh vs. BM_Tanh]/80000/real_time                  -0.3845         -0.3844         30055         18499         29998         18467
[BM_Tanh vs. BM_Tanh]/160000/real_time                 -0.3146         -0.3144         17803         12203         17762         12178
[BM_Tanh vs. BM_Tanh]/320000/real_time                 -0.3495         -0.3491         32840         21362         32724         21300
[BM_Tanh vs. BM_Tanh]/640000/real_time                 -0.3563         -0.3568         62902         40487         62754         40361
[BM_Tanh vs. BM_Tanh]/1280000/real_time                -0.3326         -0.3333        128536         85780        128102         85408
OVERALL_GEOMEAN                                        -0.3538         -0.3539             0             0             0             0
```

Use the Intel SVML tanhf implementation, which speeds up tanhf computation by up to ~38%. The algorithm has a max ULP error of 1536. The benchmark comparison against the main branch is provided above (generated on a Tiger Lake Dell XPS laptop using https://github.com/google/benchmark/blob/main/tools/compare.py).

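As context for the ULP figure: ULP error is typically measured by mapping floats onto a monotonic integer scale and taking the absolute difference. A minimal sketch (not the PR's actual measurement code; the reference here is double-precision std::tanh rounded to float):

```cpp
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Map a float to an integer scale on which adjacent floats differ by 1,
// so the ULP distance between two floats is just an absolute difference.
static int64_t OrderedInt(float f) {
  int32_t i;
  std::memcpy(&i, &f, sizeof(i));
  return i >= 0 ? static_cast<int64_t>(i)
                : static_cast<int64_t>(INT32_MIN) - i;
}

// ULP distance between an approximate tanhf result and the reference.
static int64_t UlpError(float approx, float x) {
  const float ref = static_cast<float>(std::tanh(static_cast<double>(x)));
  return std::llabs(OrderedInt(approx) - OrderedInt(ref));
}
```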
r-devulap requested a review from a team as a code owner on May 8, 2024.
@yufenglee (Member)

/azp run ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux GPU CI Pipeline,orttraining-amd-gpu-ci-pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

@yufenglee (Member)

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline


Azure Pipelines successfully started running 7 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@yihonglyu (Contributor) left a comment


Please add a benchmark for the tanh activation function in onnxruntime/test/mlas/bench/. Once you've done that, record the performance numbers both with and without your patch in the commit message.

@r-devulap (Author)

> Please add a benchmark for the tanh activation function in onnxruntime/test/mlas/bench/.

There is already a benchmark for tanhf, BM_Tanh. Is this not sufficient?

```cpp
// Existing benchmark: runs a single Tanh node over inputs in [-2, 2].
static void BM_Tanh(benchmark::State& state) {
  RunSingleNode<Tanh<float>>("Tanh", "", {}, state, -2.0f, 2.0f);
}
```

> Once you've done that, make sure to record the performance numbers both with and without your patch in the commit message.

The performance numbers of BM_Tanh before and after have already been included in the commit message: See c6c9309
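For completeness, a benchmark at the MLAS level (rather than through a full Tanh node) might look like the sketch below; it assumes MLAS's MlasComputeTanh(Input, Output, N) entry point and is not part of this PR:

```cpp
#include <cstddef>
#include <random>
#include <vector>

#include <benchmark/benchmark.h>

#include "mlas.h"  // assumed to declare MlasComputeTanh(Input, Output, N)

// Hypothetical MLAS-level variant of BM_Tanh: times MlasComputeTanh
// directly instead of going through a full Tanh node.
static void BM_MlasTanh(benchmark::State& state) {
  const size_t N = static_cast<size_t>(state.range(0));
  std::vector<float> input(N), output(N);
  std::mt19937 gen(42);
  std::uniform_real_distribution<float> dist(-2.0f, 2.0f);
  for (auto& v : input) v = dist(gen);

  for (auto _ : state) {
    MlasComputeTanh(input.data(), output.data(), N);
    benchmark::DoNotOptimize(output.data());
  }
}
BENCHMARK(BM_MlasTanh)->Arg(40000)->Arg(1280000)->UseRealTime();
```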

@yufenglee (Member)

/azp run ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux GPU CI Pipeline,orttraining-amd-gpu-ci-pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

@yufenglee (Member)

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline


Azure Pipelines successfully started running 7 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).


```cpp
size_t count = 0;
while (count < N) {
  if (N - count >= 4) {
```
Member:

This change requires a check on every iteration. If N is large, the previous version saves a significant number of instructions.

@r-devulap (Author) May 13, 2024

For large values of N, the CPU branch predictor should resolve this branch correctly almost every time. It will only miss on the final tail iterations, and when N is large a single branch miss hardly matters for performance. In return, the entire array is processed in a single loop.
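To illustrate the trade-off being discussed, here is a simplified sketch of the two loop shapes (ProcessVector4 and ProcessScalar are hypothetical stand-ins for the vectorized and scalar tanhf kernels):

```cpp
#include <cstddef>

// Hypothetical stand-ins for the vectorized (4-wide) and scalar kernels.
void ProcessVector4(const float* In, float* Out);
void ProcessScalar(const float* In, float* Out);

// Shape used by this PR: a single loop whose per-iteration size check is
// well predicted until the final tail iterations.
void SingleLoop(const float* In, float* Out, size_t N) {
  size_t count = 0;
  while (count < N) {
    if (N - count >= 4) {
      ProcessVector4(In + count, Out + count);
      count += 4;
    } else {
      ProcessScalar(In + count, Out + count);
      count += 1;
    }
  }
}

// Previous shape: a check-free main loop plus a separate scalar tail loop.
void SplitLoops(const float* In, float* Out, size_t N) {
  size_t count = 0;
  for (; count + 4 <= N; count += 4) {
    ProcessVector4(In + count, Out + count);
  }
  for (; count < N; ++count) {
    ProcessScalar(In + count, Out + count);
  }
}
```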

@yufenglee (Member)

/azp run ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux GPU CI Pipeline,orttraining-amd-gpu-ci-pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

@yufenglee (Member)

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline


Azure Pipelines successfully started running 7 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@yufenglee (Member)

/azp run Linux Android Emulator QNN CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU TensorRT CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 8 pipeline(s).

yufenglee previously approved these changes on May 17, 2024.
@yufenglee (Member)

You need to sign the license/CLA agreement to move on.

@r-devulap (Author)

  1. Removed if (provider_name == "cpu") for the fp16_coreml_FNS test filter.
  2. Added relative and absolute error tolerance for the LSTM.BackwardCompute test (sketched below).

Hoping this fixes it; I still can't reproduce the failure locally, though :/
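As a sketch of the combined tolerance check described in item 2 (a hypothetical helper, not the test's actual code):

```cpp
#include <cmath>

// Combined relative/absolute tolerance check: accept when the deviation
// is within atol + rtol * |expected|.
bool WithinTolerance(float actual, float expected, float rtol, float atol) {
  return std::fabs(actual - expected) <= atol + rtol * std::fabs(expected);
}
```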

@snnn (Member)

snnn commented May 23, 2024

/azp run ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux GPU CI Pipeline,orttraining-amd-gpu-ci-pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

@snnn (Member)

snnn commented May 23, 2024

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline


Azure Pipelines successfully started running 7 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@r-devulap (Author)

> You need to sign the license/CLA agreement to move on.

CLA shows up as signed now.

@snnn (Member)

snnn commented May 30, 2024

/azp run ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux GPU CI Pipeline,orttraining-amd-gpu-ci-pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline


Azure Pipelines successfully started running 7 pipeline(s).

@snnn (Member)

snnn commented May 30, 2024

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline

snnn requested review from yufenglee and yihonglyu on May 30, 2024.

Azure Pipelines successfully started running 10 pipeline(s).

@snnn (Member)

snnn commented May 31, 2024

/azp run ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux GPU CI Pipeline,orttraining-amd-gpu-ci-pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

@snnn (Member)

snnn commented May 31, 2024

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline


Azure Pipelines successfully started running 7 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@snnn (Member)

snnn commented Jun 1, 2024

@yufenglee, please help review.

@snnn (Member)

snnn commented Jun 1, 2024

/azp run Linux Android Emulator QNN CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU TensorRT CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 8 pipeline(s).

@yufenglee (Member)

/azp run ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux GPU CI Pipeline,orttraining-amd-gpu-ci-pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

@yufenglee (Member)

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Android CI Pipeline


Azure Pipelines successfully started running 7 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@yufenglee (Member)

/azp run Linux Android Emulator QNN CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU TensorRT CI Pipeline, Windows ARM64 QNN CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 8 pipeline(s).
