Hello all, I am encountering inconsistent behavior during GPU inference. Sometimes the inference runs successfully, but other times it fails with either:
Segmentation fault
or
Non-OK-status: has_executable.status() status: INTERNAL: ptxas exited with non-zero error code 132, output: : If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.Failure occured when compiling fusion gemm_fusion_dot.65403 with config '{block_m:32,block_n:64,block_k:32,split_k:1,num_stages:4,num_warps:4,num_ctas:1}'
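A note on the ptxas exit codes (132 here, 139 in the log further down): assuming the usual convention that a process killed by signal N is reported as 128 + N, these would correspond to SIGILL and SIGSEGV, which suggests ptxas itself is crashing rather than running out of disk space. The helper name `describe_exit_code` below is hypothetical and used only for illustration.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Map an exit code to a signal name, assuming the 128 + N convention."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            # The remainder does not match any signal known on this platform.
            return f"unknown signal {code - 128}"
    return f"plain exit status {code}"


# The two codes that appear in the ptxas error messages of this report.
for code in (132, 139):
    print(code, describe_exit_code(code))
```

If this reading is right, the "sufficient filesystem space" hint in the message may be a generic suffix and not the actual cause.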
Environment:
- Docker image as provided, or local install
- GPU: RTX A6000 48GB
- Driver Version: 535.183.01, CUDA Version: 12.2
- Default run mode
A run may succeed, using only about 2 GB of VRAM, and the results look fine.
When I run inference again, I encounter one of the following:
Segmentation Fault:
I0802 00:55:45.666194 127875129517888 run_docker.py:262] Fatal Python error: Segmentation fault
I0802 00:55:45.666548 127875129517888 run_docker.py:262]
I0802 00:55:45.666624 127875129517888 run_docker.py:262] Thread 0x000070bc94f6b280 (most recent call first):
I0802 00:55:45.666688 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/compiler.py", line 238 in backend_compile
I0802 00:55:45.666738 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/profiler.py", line 335 in wrapper
I0802 00:55:45.667106 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/compiler.py", line 500 in _compile_and_write_cache
I0802 00:55:45.667136 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/compiler.py", line 333 in compile_or_get_cached
I0802 00:55:45.667161 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/interpreters/pxla.py", line 2718 in _cached_compilation
I0802 00:55:45.667187 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/interpreters/pxla.py", line 2908 in from_hlo
I0802 00:55:45.667212 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/interpreters/pxla.py", line 2369 in compile
I0802 00:55:45.667233 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 1406 in _pjit_call_impl_python
I0802 00:55:45.667258 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 1471 in call_impl_cache_miss
I0802 00:55:45.667283 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 1488 in _pjit_call_impl
I0802 00:55:45.667304 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 913 in process_primitive
I0802 00:55:45.667324 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 425 in bind_with_trace
I0802 00:55:45.667344 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 2788 in bind
I0802 00:55:45.667364 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 176 in _python_pjit_helper
I0802 00:55:45.667383 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 298 in cache_miss
I0802 00:55:45.667402 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/traceback_util.py", line 179 in reraise_with_filtered_traceback
I0802 00:55:45.667421 127875129517888 run_docker.py:262] File "/app/alphafold/alphafold/model/model.py", line 167 in predict
I0802 00:55:45.667440 127875129517888 run_docker.py:262] File "/app/alphafold/run_alphafold.py", line 284 in predict_structure
I0802 00:55:45.667459 127875129517888 run_docker.py:262] File "/app/alphafold/run_alphafold.py", line 543 in main
I0802 00:55:45.667478 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/absl/app.py", line 258 in _run_main
I0802 00:55:45.667497 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/absl/app.py", line 312 in run
I0802 00:55:45.667516 127875129517888 run_docker.py:262] File "/app/alphafold/run_alphafold.py", line 570 in <module>
Fatal Python error: Aborted: If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided
I0802 01:34:50.745555 132592778102592 run_docker.py:263] 2024-08-02 01:34:50.745063: F external/xla/xla/service/gpu/gemm_fusion_autotuner.cc:780] Non-OK-status: has_executable.status() status: INTERNAL: ptxas exited with non-zero error code 139, output: : If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.Failure occured when compiling fusion gemm_fusion_dot.52354 with config '{block_m:16,block_n:16,block_k:256,split_k:1,num_stages:1,num_warps:4,num_ctas:1}'
I0802 01:34:50.745778 132592778102592 run_docker.py:263] Fused HLO computation:
I0802 01:34:50.745835 132592778102592 run_docker.py:263] %gemm_fusion_dot.52354_computation (parameter_0.92: f32[17,384], parameter_1.92: f32[384], parameter_2.28: f32[384,384]) -> f32[17,384] {
I0802 01:34:50.745885 132592778102592 run_docker.py:263] %parameter_0.92 = f32[17,384]{1,0} parameter(0)
I0802 01:34:50.745933 132592778102592 run_docker.py:263] %parameter_1.92 = f32[384]{0} parameter(1)
I0802 01:34:50.745979 132592778102592 run_docker.py:263] %broadcast.15023 = f32[17,384]{1,0} broadcast(f32[384]{0} %parameter_1.92), dimensions={1}, metadata={op_name="jit(apply_fn)/jit(main)/alphafold/alphafold_iteration/structure_module/single_layer_norm/single_layer_norm/add" source_file="/app/alphafold/alphafold/model/common_modules.py" source_line=185}
I0802 01:34:50.746032 132592778102592 run_docker.py:263] %add.12065 = f32[17,384]{1,0} add(f32[17,384]{1,0} %parameter_0.92, f32[17,384]{1,0} %broadcast.15023), metadata={op_name="jit(apply_fn)/jit(main)/alphafold/alphafold_iteration/structure_module/single_layer_norm/single_layer_norm/add" source_file="/app/alphafold/alphafold/model/common_modules.py" source_line=185}
I0802 01:34:50.746080 132592778102592 run_docker.py:263] %parameter_2.28 = f32[384,384]{1,0} parameter(2)
I0802 01:34:50.746122 132592778102592 run_docker.py:263] ROOT %dot.3542 = f32[17,384]{1,0} dot(f32[17,384]{1,0} %add.12065, f32[384,384]{1,0} %parameter_2.28), lhs_contracting_dims={1}, rhs_contracting_dims={0}, metadata={op_name="jit(apply_fn)/jit(main)/alphafold/alphafold_iteration/structure_module/initial_projection/...a, ah->...h/dot_general[dimension_numbers=(((1,), (0,)), ((), ())) precision=None preferred_element_type=float32]" source_file="/app/alphafold/alphafold/model/common_modules.py" source_line=122}
I0802 01:34:50.746166 132592778102592 run_docker.py:263] }
I0802 01:34:50.746207 132592778102592 run_docker.py:263] Fatal Python error: Aborted
I0802 01:34:50.746250 132592778102592 run_docker.py:263]
I0802 01:34:50.746290 132592778102592 run_docker.py:263] Thread 0x00007874dcb35280 (most recent call first):
I0802 01:34:50.746330 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/compiler.py", line 238 in backend_compile
I0802 01:34:50.746370 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/profiler.py", line 335 in wrapper
I0802 01:34:50.746411 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/compiler.py", line 500 in _compile_and_write_cache
I0802 01:34:50.746460 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/compiler.py", line 333 in compile_or_get_cached
I0802 01:34:50.746501 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/interpreters/pxla.py", line 2718 in _cached_compilation
I0802 01:34:50.746541 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/interpreters/pxla.py", line 2908 in from_hlo
I0802 01:34:50.746581 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/interpreters/pxla.py", line 2369 in compile
I0802 01:34:50.746620 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 1406 in _pjit_call_impl_python
I0802 01:34:50.746675 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 1471 in call_impl_cache_miss
I0802 01:34:50.746716 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 1488 in _pjit_call_impl
I0802 01:34:50.746762 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 913 in process_primitive
I0802 01:34:50.747026 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 425 in bind_with_trace
I0802 01:34:50.747194 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 2788 in bind
I0802 01:34:50.747266 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 176 in _python_pjit_helper
I0802 01:34:50.747329 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 298 in cache_miss
I0802 01:34:50.747386 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/traceback_util.py", line 179 in reraise_with_filtered_traceback
I0802 01:34:50.747431 132592778102592 run_docker.py:263] File "/app/alphafold/alphafold/model/model.py", line 167 in predict
I0802 01:34:50.747478 132592778102592 run_docker.py:263] File "/app/alphafold/run_alphafold.py", line 284 in predict_structure
I0802 01:34:50.747540 132592778102592 run_docker.py:263] File "/app/alphafold/run_alphafold.py", line 543 in main
I0802 01:34:50.747584 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/absl/app.py", line 258 in _run_main
I0802 01:34:50.747641 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/absl/app.py", line 312 in run
I0802 01:34:50.747692 132592778102592 run_docker.py:263] File "/app/alphafold/run_alphafold.py", line 570 in <module>
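Because the error text asks to verify filesystem space, one quick check is how much room is left where temporary files are written. This is a minimal sketch that assumes the compiler's intermediate files go to the system temporary directory (the real location inside the container may differ); `free_gib` is a hypothetical helper name.

```python
import shutil
import tempfile


def free_gib(path: str) -> float:
    """Return the free space on the filesystem containing `path`, in GiB."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30


# Assumption: temporary compiler output lands in the default temp directory.
tmp_dir = tempfile.gettempdir()
print(f"{tmp_dir}: {free_gib(tmp_dir):.1f} GiB free")
```

To be meaningful for the Docker setup, this would have to run inside the container and not on the host.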
Any guidance or suggestions for resolving these issues would be greatly appreciated.
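One experiment that might help narrow this down: the abort is raised in `gemm_fusion_autotuner.cc` while compiling a Triton GEMM fusion, so turning that path off through `XLA_FLAGS` could show whether the crash is tied to it. The sketch below assumes the XLA debug option `--xla_gpu_enable_triton_gemm=false` is honored by the XLA build in the image, and that the variable is set in the process that imports JAX (inside the container, before the import). Whether this avoids the crash is untested; `add_xla_flag` is a hypothetical helper.

```python
import os


def add_xla_flag(current: str, flag: str) -> str:
    """Append `flag` to a space-separated XLA_FLAGS string if not already present."""
    flags = current.split()
    if flag not in flags:
        flags.append(flag)
    return " ".join(flags)


# Must run before `import jax`, since XLA reads XLA_FLAGS at backend start-up.
os.environ["XLA_FLAGS"] = add_xla_flag(
    os.environ.get("XLA_FLAGS", ""),
    "--xla_gpu_enable_triton_gemm=false",  # assumption: supported by this XLA version
)
print(os.environ["XLA_FLAGS"])
```

Merging into the existing value, instead of overwriting it, keeps any flags that are already set for the run.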
juliocesar-io changed the title from "Weird Inference Behavior with GPU: Segmentation Fault or Filesystem Space Error" to "Weird Inference Behavior with GPU: Sometimes it works or Segmentation Fault/Filesystem Space Error" on Aug 2, 2024
Steps to Reproduce:
Flags used:
Build the Docker image: build -f docker/Dockerfile -t alphafold
Run the Docker container:
Successful inference:
Expected Behavior:
Troubleshooting Steps Taken: