CUDA_ERROR_OUT_OF_MEMORY issue when running test case on 4090 24G GPU machine locally #209
Comments
Additionally, I noticed that the container uses little RAM while the test case is in its MSA search stage (usually less than 5 GB). I don't know whether that is normal for 2PV7. |
I tested 2PV7 on a 4060 8 GB; 768 tokens is the max default compile bucket. I use the default single-GPU settings, without unified memory:
export XLA_FLAGS="--xla_gpu_enable_triton_gemm=false"
export XLA_PYTHON_CLIENT_PREALLOCATE=true
export XLA_CLIENT_MEM_FRACTION=0.95
For MSA searching, my test used about 4-5 GB of RAM with the default 8 CPU cores, so that's normal. |
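To put the preallocation settings above in numbers, here is a minimal sketch of how much GPU memory the XLA client would try to reserve up front. It assumes the fraction is applied to the card's total memory, which is how the JAX GPU memory allocation documentation describes preallocation. `preallocated_gb` is a hypothetical helper, and the card sizes are the nominal 8 GB and 24 GB mentioned in this thread.

```python
def preallocated_gb(total_gb: float, mem_fraction: float) -> float:
    """Approximate memory reserved up front when preallocation is enabled.

    Assumption: the client reserves `mem_fraction` of the card's total memory.
    """
    if total_gb <= 0 or mem_fraction <= 0:
        raise ValueError("total_gb and mem_fraction must be positive")
    return total_gb * mem_fraction


if __name__ == "__main__":
    # Nominal card sizes from this thread; real usable memory is slightly lower.
    for name, total in [("RTX 4060 (8 GB)", 8.0), ("RTX 4090 (24 GB)", 24.0)]:
        print(f"{name}: ~{preallocated_gb(total, 0.95):.1f} GB reserved")
```

If another process already holds memory on the card, reserving 95% of it up front can fail even when the card is nominally large enough for the model.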
The error occurs during the inference stage, after MSA. Besides @alchemistcai being able to run on a 4060, other users have reported success with this example on an RTX 4090: #59 (comment). Having two GPUs available should be fine, but perhaps that is causing an issue for some reason. Can you try with |
I tried compiling with the default profile but with --buckets=1280 this time. It seems the vanilla pipeline pre-occupies the GPU resources, so I expected limiting the bucket size to succeed. Unfortunately, I still got the same error. Thanks for your advice anyway. |
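For readers unfamiliar with compile buckets, the sketch below illustrates the general idea of padding an input to the smallest bucket that fits. It is an assumption-laden illustration, not the project's implementation: `pick_bucket` is a hypothetical helper and `EXAMPLE_BUCKETS` is an illustrative list that may not match the real defaults.

```python
import bisect

# Illustrative bucket sizes only; the real defaults may differ between versions.
EXAMPLE_BUCKETS = (256, 512, 768, 1024, 1280)


def pick_bucket(num_tokens: int, buckets=EXAMPLE_BUCKETS) -> int:
    """Return the smallest bucket that can hold `num_tokens`."""
    ordered = sorted(buckets)
    # bisect_left finds the first bucket that is >= num_tokens.
    idx = bisect.bisect_left(ordered, num_tokens)
    if idx == len(ordered):
        raise ValueError(
            f"{num_tokens} tokens exceed the largest bucket {ordered[-1]}"
        )
    return ordered[idx]
```

Under this assumption, passing a single bucket such as 1280 pads every smaller input up to 1280 tokens, so it would not be expected to lower memory use for a small input.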
|
You may want to undo this modification. I tried it before: on a single GPU, the unified memory settings make even a 256-token inference run out of memory on a 4060. |
Yep, but in my first test with the default parameters, those environment variables were not set. |
Yes, I already tried that, explicitly assigning device=0. In fact, whether the 'all' or the 'device=0' option was used, device 0 (the 4090 24 GB GPU) always showed significant resource usage, but the same OOM error occurred. |
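One way to make the "significant resource usage" observation concrete is to record per-GPU free memory just before inference starts. The sketch below parses the text produced by `nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv,noheader,nounits`. `parse_gpu_memory` is a hypothetical helper, and the sample rows are made-up numbers, not output from the machine in this issue.

```python
import csv
import io


def parse_gpu_memory(csv_text: str) -> list[dict]:
    """Parse `index, name, memory.total, memory.used` rows (MiB, no header)."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        index, name, total, used = (field.strip() for field in row)
        gpus.append({
            "index": int(index),
            "name": name,
            "total_mib": int(total),
            "free_mib": int(total) - int(used),
        })
    return gpus


# Sample text in the assumed format; names and numbers are illustrative only.
SAMPLE = "0, Example GPU A, 24564, 1350\n1, Example GPU B, 24564, 12\n"

if __name__ == "__main__":
    for gpu in parse_gpu_memory(SAMPLE):
        print(f"GPU {gpu['index']} ({gpu['name']}): {gpu['free_mib']} MiB free")
```

Comparing these numbers before the run and at the moment of the failure shows whether memory was already taken by something else or was claimed by the inference process itself.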
@MaoSihong from the Or is this under Windows Subsystem for Linux? If so, I strongly recommend running AlphaFold 3 under Linux; this is the only supported operating system. |
Yep, I ran this under WSL2 with Docker Desktop, which can integrate with the subsystem. I don't know whether the CUDA version or the subsystem causes the OOM. I should probably stop struggling with WSL2; there seem to be complicated compatibility problems when launching AF3 under WSL. |
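When it is unclear whether a container is running on top of WSL2, the kernel release string is a quick hint, since containers share the host kernel. The sketch below is a heuristic under the assumption that WSL kernels carry "microsoft" in their release string; `looks_like_wsl` is a hypothetical helper, not part of AlphaFold 3.

```python
import platform


def looks_like_wsl(kernel_release: str) -> bool:
    """Heuristic: WSL kernels usually carry 'microsoft' in their release string."""
    return "microsoft" in kernel_release.lower()


if __name__ == "__main__":
    release = platform.uname().release
    print(f"kernel release: {release}")
    print("WSL marker found" if looks_like_wsl(release) else "no WSL marker found")
```

Running this inside the container and including the result in a bug report saves a round trip when the operating system is part of the question.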
Agreed, I think running this natively under Linux will hopefully fix the issue. Feel free to comment or open a new issue if not. |
When I launched the 2PV7 test case with the default profile, I encountered the following error, apparently after the MSA pipeline:
I also tried the suggested troubleshooting settings:
but this time I still got errors at the very beginning, whether or not --norun_data_pipeline was used:
Any advice is appreciated!
The output of my docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi is here: