CUDA_ERROR_OUT_OF_MEMORY issue when running test case on 4090 24G GPU machine locally #209

Closed
MaoSihong opened this issue Dec 12, 2024 · 14 comments
Labels
question Further information is requested

Comments

@MaoSihong

MaoSihong commented Dec 12, 2024

When I launched the 2PV7 test case with all default settings, I encountered the following error, apparently after the MSA pipeline:
image
image
I also tried the settings mentioned in the troubleshooting notes:

ENV XLA_PYTHON_CLIENT_PREALLOCATE=false
ENV TF_FORCE_UNIFIED_MEMORY=true
ENV XLA_CLIENT_MEM_FRACTION=3.2
and modified model_config.py:
  pair_transition_shard_spec: Sequence[_Shape2DType] = (
      (2048, None),
      (3072, 1024),
      (None, 512),
  )
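
For readers unfamiliar with this setting, here is a minimal sketch of how a spec like the one above could be read. It assumes each entry is a (max token count, shard size) pair checked in order, with None meaning "no upper limit" in the first position and "no sharding" in the second. The helper name shard_size_for is hypothetical and is not AlphaFold 3's own function.

```python
from typing import Optional, Sequence, Tuple

# Assumed reading of one spec entry: (max_num_tokens, shard_size).
ShardSpec = Sequence[Tuple[Optional[int], Optional[int]]]

# The modified spec quoted in this comment.
MODIFIED_SPEC: ShardSpec = (
    (2048, None),
    (3072, 1024),
    (None, 512),
)


def shard_size_for(num_tokens: int, spec: ShardSpec) -> Optional[int]:
    """Return the shard size of the first entry whose limit covers num_tokens."""
    for max_tokens, shard_size in spec:
        if max_tokens is None or num_tokens <= max_tokens:
            return shard_size
    raise ValueError(f"no spec entry covers {num_tokens} tokens")


for n in (1000, 2500, 5000):
    print(n, shard_size_for(n, MODIFIED_SPEC))
```

Under this reading, smaller shards only apply to large inputs, so the change would not be expected to affect an input that stays under the first limit.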

But this time I still got errors at the very beginning, whether or not --norun_data_pipeline was used:
image
Any advice is appreciated!
Here is the output of docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi:
image

@MaoSihong
Author

Additionally, I noticed that the container uses little RAM (usually less than 5 GB) while the test case is in its MSA search stage. I don't know whether that is normal for 2PV7.

I also get the following notice when running with the default settings:
image

@alchemistcai

I tested 2PV7 on a 4060 8G, and 768 tokens is the largest default compile bucket that fits.

By using --buckets='900' I can run inference on at most 900 tokens without an OutOfMemory error.

I use the default single-GPU settings, without unified memory:

export XLA_FLAGS="--xla_gpu_enable_triton_gemm=false" 
export XLA_PYTHON_CLIENT_PREALLOCATE=true
export XLA_CLIENT_MEM_FRACTION=0.95

Run nvidia-smi before you run python run_alphafold.py to find out which processes are using the GPU.

For MSA searching, my RAM usage in testing is about 4~5 GB with the default 8 CPU cores, so that's normal.
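
The nvidia-smi check can also be scripted. The sketch below assumes the CSV query form of nvidia-smi (`--query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits`) and parses its output. The helper names and the 50% threshold are hypothetical, and the sample rows are illustrative values rather than measurements of mine.

```python
import csv
import io
from typing import List, Tuple

# Command that would produce the CSV text parsed below.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> List[Tuple[int, int, int]]:
    """Parse 'index, used MiB, total MiB' rows into integer tuples."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        if row:
            index, used, total = (int(field.strip()) for field in row)
            rows.append((index, used, total))
    return rows


def busy_gpus(rows: List[Tuple[int, int, int]], threshold: float = 0.5) -> List[int]:
    """Return the indices of GPUs whose used/total memory ratio exceeds threshold."""
    return [index for index, used, total in rows if used / total > threshold]


# Illustrative sample: GPU 0 is almost full, GPUs 1 and 2 are nearly idle.
SAMPLE = "0, 23399, 24564\n1, 393, 24564\n2, 393, 24564\n"
print(busy_gpus(parse_gpu_memory(SAMPLE)))

# On a machine with the NVIDIA driver installed, live values could be read with:
#   import subprocess
#   text = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
```

A GPU that already shows up as busy before run_alphafold.py starts is a likely candidate for an OOM that has nothing to do with the model itself.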

@joshabramson joshabramson added the question Further information is requested label Dec 13, 2024
@joshabramson
Collaborator

The error is during the inference stage, after msa.

As well as @alchemistcai being able to run on 4060, other users have reported success for this example on RTX 4090: #59 (comment)

It should be fine to have two GPUs available, but perhaps that is causing an issue for some reason. Can you try with --gpus device=0 instead of --gpus all?

@MaoSihong
Author

MaoSihong commented Dec 13, 2024

@alchemistcai I tried compiling with the default profile but with --buckets=1280 this time. It seems the vanilla pipeline preallocates GPU resources, so I expected that limiting the bucket size would succeed. Unfortunately, I still got the same error. Thanks for your advice anyway.
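
For context on what --buckets changes, here is a minimal sketch of bucket selection. It assumes the input is padded up to the smallest bucket that is at least as large as its token count. The DEFAULT_BUCKETS tuple is my assumption about the defaults and should be checked against run_alphafold.py in your checkout, and pick_bucket is a hypothetical helper, not the project's own code.

```python
import bisect
from typing import Optional, Sequence

# Assumed default bucket sizes (verify against your copy of run_alphafold.py).
DEFAULT_BUCKETS = (
    256, 512, 768, 1024, 1280, 1536, 2048,
    2560, 3072, 3584, 4096, 4608, 5120,
)


def pick_bucket(num_tokens: int, buckets: Sequence[int]) -> Optional[int]:
    """Return the smallest bucket >= num_tokens, or None if no bucket is large enough."""
    ordered = sorted(buckets)
    i = bisect.bisect_left(ordered, num_tokens)
    return ordered[i] if i < len(ordered) else None


# An 800-token input (illustrative size) is padded more with the assumed
# defaults than with a single custom bucket of 900.
print(pick_bucket(800, DEFAULT_BUCKETS))
print(pick_bucket(800, (900,)))
```

Because memory use grows with the padded size, a custom bucket just above the real token count can make the difference on a small GPU. What happens when the input exceeds the largest bucket is not modeled here.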

@MaoSihong
Author

image
@joshabramson Sorry, it still failed with --gpus device=0.

@alchemistcai

alchemistcai commented Dec 14, 2024

ENV XLA_PYTHON_CLIENT_PREALLOCATE=false
ENV TF_FORCE_UNIFIED_MEMORY=true
ENV XLA_CLIENT_MEM_FRACTION=3.2
and modified model_config.py:
  pair_transition_shard_spec: Sequence[_Shape2DType] = (
      (2048, None),
      (3072, 1024),
      (None, 512),
  )

You may want to undo these modifications. I tried them before.

On a single GPU, the unified memory settings make even 256-token inference run out of memory on the 4060.
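
To make switching back less error-prone, here is a minimal sketch that treats the two sets of environment variables quoted in this thread as named profiles. The apply_profile helper is hypothetical. It clears every variable mentioned in either profile before applying one, so a leftover TF_FORCE_UNIFIED_MEMORY cannot survive the switch. I am assuming these variables must be in place before the Python process initializes JAX.

```python
import os
from typing import Dict, MutableMapping

# Single-GPU settings as posted earlier in this thread.
SINGLE_GPU_ENV: Dict[str, str] = {
    "XLA_FLAGS": "--xla_gpu_enable_triton_gemm=false",
    "XLA_PYTHON_CLIENT_PREALLOCATE": "true",
    "XLA_CLIENT_MEM_FRACTION": "0.95",
}

# Unified-memory settings as posted in the original report.
UNIFIED_MEMORY_ENV: Dict[str, str] = {
    "XLA_PYTHON_CLIENT_PREALLOCATE": "false",
    "TF_FORCE_UNIFIED_MEMORY": "true",
    "XLA_CLIENT_MEM_FRACTION": "3.2",
}

ALL_KEYS = set(SINGLE_GPU_ENV) | set(UNIFIED_MEMORY_ENV)


def apply_profile(
    profile: Dict[str, str], environ: MutableMapping[str, str]
) -> MutableMapping[str, str]:
    """Remove every known variable from environ, then set the chosen profile."""
    for key in ALL_KEYS:
        environ.pop(key, None)
    environ.update(profile)
    return environ


# Work on a copy so that running this sketch does not modify the current process.
env = apply_profile(SINGLE_GPU_ENV, dict(os.environ))
print({key: env[key] for key in sorted(SINGLE_GPU_ENV)})
```

The resulting mapping could be passed as env= to subprocess.run when launching run_alphafold.py, or the same values could be set with ENV lines in the Dockerfile.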

@XIANZHE-LI

XIANZHE-LI commented Dec 14, 2024

I have the same problem.

W external/xla/xla/service/hlo_rematerialization.cc:3005] Can't reduce memory use below 18.88GiB (20266891105 bytes) by rematerialization; only reduced to 41.43GiB (44490724824 bytes), down from 45.92GiB (49309847144 bytes) originally
2024-12-14 19:57:44.462633: W external/xla/xla/tsl/framework/bfc_allocator.cc:497] Allocator (GPU_0_bfc) ran out of memory trying to allocate 47.20GiB (rounded to 50677266944)requested by op 
2024-12-14 19:57:44.464119: W external/xla/xla/tsl/framework/bfc_allocator.cc:508] ************________________________________________________________________________________________
E1214 19:57:44.464390   17928 pjrt_stream_executor_client.cc:3084] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 50677266816 bytes.
Traceback (most recent call last):
  File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 690, in <module>
    app.run(main)
  File "/root/miniconda/envs/af3-2/lib/python3.11/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/root/miniconda/envs/af3-2/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
             ^^^^^^^^^^
  File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 674, in main
    process_fold_input(
  File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 542, in process_fold_input
    all_inference_results = predict_structure(
                            ^^^^^^^^^^^^^^^^^^
  File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 375, in predict_structure
    result = model_runner.run_inference(example, rng_key)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/lanyun-tmp/AF3/alphafold3/run_alphafold_exit.py", line 313, in run_inference
    result = self._model(rng_key, featurised_example)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 50677266816 bytes.
--------------------

For simplicity, JAX has removed its internal frames from the traceback of the following exception. Set JAX_TRACEBACK_FILTERING=off to include these.
image
I have three 4090s, but only one of them is being used:

| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:3D:00.0 Off |                  Off |
| 30%   28C    P8             15W /  450W |   23399MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  |   00000000:63:00.0 Off |                  Off |
| 30%   29C    P8             21W /  450W |     393MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        On  |   00000000:BD:00.0 Off |                  Off |
| 30%   27C    P8             19W /  450W |     393MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
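
As a quick sanity check on the numbers above, the following sketch converts the failed allocation from the RESOURCE_EXHAUSTED message into GiB and compares it with the per-GPU total reported by nvidia-smi. Both values are copied from the log and table in this comment, and the variable names are mine.

```python
GIB = 2**30
MIB = 2**20

requested_bytes = 50_677_266_816  # from the RESOURCE_EXHAUSTED message
gpu_total_mib = 24_564            # total memory of one RTX 4090 per nvidia-smi

requested_gib = requested_bytes / GIB
capacity_gib = gpu_total_mib * MIB / GIB

print(f"requested: {requested_gib:.2f} GiB")
print(f"capacity:  {capacity_gib:.2f} GiB")
print(f"ratio:     {requested_gib / capacity_gib:.2f}x")
```

The single allocation is roughly twice the size of one card's memory, so adjusting the memory fraction on one GPU would not be expected to make it fit.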

@MaoSihong
Author

@alchemistcai Yes, but in my first test with the default parameters, those environment variables were not set, and the OOM failure still happened.
Thank you very much for your close attention!

@MaoSihong
Author

@XIANZHE-LI Sure, I'll try to dig into the details with traceback filtering turned off.
As far as I know, only a single GPU was allocated while my container was running as well.

@MaoSihong
Author

@joshabramson Yes, I have already tried explicitly assigning device=0. In fact, whether the 'all' or the 'device=0' option is used, device 0 (the 4090 24 GB GPU) always shows significant resource usage, but the same OOM error occurs.

@Augustin-Zidek
Collaborator

Augustin-Zidek commented Dec 19, 2024

@MaoSihong from the nvidia-smi screenshots, it looks like you are on the 560 version of the NVIDIA drivers, i.e. on the beta channel. Could you try downgrading to the stable 550 version?

Or is this under Windows Subsystem for Linux? If so, I strongly recommend running AlphaFold 3 under Linux, which is the only supported operating system.
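
If it is unclear whether a shell or container is running under WSL, a common heuristic is to look for "microsoft" in the kernel version string. The sketch below assumes that heuristic and reads /proc/version. The function names are hypothetical, the sample strings in the usage note are illustrative, and a False result does not prove a native Linux host.

```python
from pathlib import Path


def looks_like_wsl(kernel_version: str) -> bool:
    """Heuristic: WSL kernels usually report 'microsoft' in their version string."""
    return "microsoft" in kernel_version.lower()


def read_kernel_version(path: str = "/proc/version") -> str:
    """Return the kernel version text, or an empty string if it cannot be read."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


print(looks_like_wsl(read_kernel_version()))
```

For example, looks_like_wsl("Linux version 5.15.153.1-microsoft-standard-WSL2") returns True, while a version string such as "Linux version 6.8.0-49-generic" returns False.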

@MaoSihong
Author

MaoSihong commented Dec 23, 2024

Yes, I ran this under WSL2 with Docker Desktop, which can integrate with the subsystem. I don't know whether the CUDA version or the subsystem causes the OOM. I should probably stop struggling with WSL2, since there seem to be complicated compatibility problems when launching AF3 in WSL.

@Augustin-Zidek
Collaborator

Agreed, I think running this natively under Linux will hopefully fix the issue. Feel free to comment or open a new issue if not.
