Flash attention recompute #20603

Merged
pengwa merged 20 commits into main from pengwa/flash_attn_recompute on May 21, 2024

Conversation

pengwa
Contributor

pengwa commented May 8, 2024

Flash attn recompute

  1. Allow the PythonOp(FlashAttn) node to be recomputed correctly. 45879ff (a sketch of where this PythonOp comes from follows the list)
  2. Use a JSON file to pass the selected-to-recompute subgraphs. 3c374da
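
For context, a minimal sketch of where a PythonOp(FlashAttn) node comes from: `flash_attn_func` from the flash-attn package is backed by a `torch.autograd.Function`, and ORTModule exports such autograd functions as PythonOp nodes in the training graph, which this change allows the memory optimizer to recompute. The module below is illustrative only, not the customer model from this PR.

```python
# Illustrative only: a tiny attention block whose flash-attn call shows up as a
# PythonOp node when the model is exported through ORTModule. Shapes and names
# are assumptions, not taken from the customer model in this PR.
import torch
from flash_attn import flash_attn_func  # backed by a torch.autograd.Function


class FlashAttnBlock(torch.nn.Module):
    def __init__(self, hidden: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden // num_heads
        self.qkv = torch.nn.Linear(hidden, 3 * hidden)
        self.out = torch.nn.Linear(hidden, hidden)

    def forward(self, x):  # x: (batch, seq_len, hidden), fp16/bf16 on CUDA
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, s, self.num_heads, self.head_dim)
        k = k.view(b, s, self.num_heads, self.head_dim)
        v = v.view(b, s, self.num_heads, self.head_dim)
        # The fused flash-attention kernel; ORTModule sees this call as a PythonOp.
        attn = flash_attn_func(q, k, v, causal=True)
        return self.out(attn.reshape(b, s, -1))
```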

Better Memory Efficiency

The customer model can run both the PyTorch SDPA path and the Flash Attention path; this PR makes it possible for the Flash Attention path to work with ORTModule layerwise recompute. Peak memory drops from 45.x GB to 32.x GB when comparing only the transformer layers (other parts are excluded; a few more optimizations targeting those parts will follow later).
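
As a sketch of how the recompute path might be enabled, the snippet below reuses the illustrative FlashAttnBlock above and wraps it with ORTModule. The ORTMODULE_MEMORY_OPT_LEVEL environment variable follows the ORTModule memory-optimizer documentation (level 1 enables transformer-layerwise recompute); treat the exact knob as an assumption rather than part of this PR's diff.

```python
# Sketch under assumptions: ORTMODULE_MEMORY_OPT_LEVEL=1 is the documented way
# to turn on transformer-layerwise recompute for ORTModule; set it before wrapping.
import os
import torch
from onnxruntime.training.ortmodule import ORTModule

os.environ["ORTMODULE_MEMORY_OPT_LEVEL"] = "1"  # layerwise recompute

model = FlashAttnBlock(hidden=4096, num_heads=32).cuda().to(torch.float16)
model = ORTModule(model)

x = torch.randn(2, 2048, 4096, device="cuda", dtype=torch.float16, requires_grad=True)
loss = model(x).sum()
loss.backward()  # selected activations are recomputed here, lowering peak memory
```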

Better Perf

Using Flash Attention brings an additional 16% end-to-end time reduction, with a highly aligned loss curve.


Use a JSON File to Pass Recompute Plans

The selected recompute plan is written to a JSON file and only the file path is handed to the session, which works around the maximum length allowed for strings defined in session options.
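
The sketch below shows the idea only: the plan entries mirror the "pattern:strategy:request-count" style of the memory-optimizer config strings, but the exact JSON schema and the variable used to point ORT at the file are assumptions, not taken from this PR's diff.

```python
# Hypothetical illustration of passing the recompute plan via a JSON file
# instead of a (length-limited) session-option string.
import json
import os

plan = {
    # Entries are illustrative; they follow the "pattern:strategy:count" style
    # of the memory-optimizer config strings.
    "memory_optimization_plans": [
        "FlashAttention+:1:-1",
        "BiasGelu+:1:-1",
    ]
}

with open("recompute_plan.json", "w") as f:
    json.dump(plan, f)

# Assumed knob: point ORTModule at the file rather than embedding the plan string.
os.environ["ORTMODULE_MEMORY_OPT_CONFIG"] = "recompute_plan.json"
```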

Motivation and Context

pengwa added the training label (issues related to ONNX Runtime training; typically submitted using template) on May 8, 2024
pengwa requested review from wschin and zhijxu-MS on May 8, 2024 08:24
pengwa changed the title from "Flash attn recompute" to "Flash attention recompute" on May 8, 2024
guyang3532 previously approved these changes on May 13, 2024
pengwa merged commit 8a98874 into main on May 21, 2024
98 checks passed
pengwa deleted the pengwa/flash_attn_recompute branch on May 21, 2024 05:38