
"Cannot allocate pinned memory" error on a supercomputer #316

Open
jooj211 opened this issue Jul 2, 2024 · 5 comments

jooj211 commented Jul 2, 2024

I am encountering a "Cannot allocate pinned memory" error while running a program that uses AMGX solvers on a supercomputer that uses the SLURM Workload Manager. The program fails to allocate the necessary pinned memory for efficient GPU memory transfers.

Here's the full output file:

Nonlinear Elasticity FEM Solver
Updated Lagrangian Formulation
Setting material and elasticity type
Setting boundary conditions
Setup of non-linear elasticity problem
Reading parameters from XML file
Creating Incompressible Material
Bulk modulus: 300
 Hyperelastic material: Guccione
 Material properties: 10 1 1 1 0 300
 Number of nodal loads: 0
 Number of prescribed displ.: 249
 Number of traction (Neumann) loads: 0
 Number of Dirichlet boundary conds: 0
 Number of normal pressure loads: 1
 Number of spring boundary conds: 0
Solving problem
Reading XML mesh
Reading XML mesh file: ./prob2_12x27x2_k300.xml
Fiber model: fiber_isotropic
Mesh information
 Number of dimensions: 3
 Number of nodes: 975
 Number of elements: 648
 Number of boundary elements: 324
Output step: 1
Solving nonlinear problem (UL) using NonlinearSolver
Initial inner volume: 3189.27
Initial cavity volume: 0
Size of the problem 2925
Matrices and vectors creation done
 Load increment 1
 1 Newton-LS step
Caught amgx exception: Cannot allocate pinned memory
 at: /prj/hearttwins/jonatas.costa/source/amgx/src/global_thread_handle.cu:351
Stack trace:
 /scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::memory::PinnedMemoryPool::PinnedMemoryPool()+0x412
 /scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::allocate_resources(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long)+0x3b
 /scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::Resources::Resources(amgx::AMG_Configuration*, void*, int, int const*)+0xc24
 /scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : AMGX_resources_create_simple()+0x8b
 ./nonlinearelas : petsc::LinearSolver::solve(petsc::Matrix&, petsc::Vector&, petsc::Vector&, double)+0x25d
 ./nonlinearelas : NewtonLineSearch::solve()+0x222
 ./nonlinearelas : UpdatedLagrangian::solve()+0x3c4
 ./nonlinearelas : NonlinearElasticity::run()+0x6b
 ./nonlinearelas : main()+0x479
 /lib64/libc.so.6 : __libc_start_main()+0xf5
 ./nonlinearelas() [0x45336f]


Here's the SLURM output:

AMGX ERROR: file /prj/hearttwins/jonatas.costa/nodal_cardiax/src/linalg/petsc_linear_solver.cpp line    371
AMGX ERROR: Insufficient memory.
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1
:
system msg for write_line failure : Bad file descriptor

And these are the SLURM options used in this test:

#SBATCH --nodes=1             
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks=1
#SBATCH -p nvidia_dev
#SBATCH --time=0:01:00
#SBATCH -J cardiax_test
#SBATCH --gres=gpu:0
#SBATCH --mem-bind=local
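As an aside on the options above: --gres=gpu:0 requests zero GPUs for the job. A single-GPU run under SLURM is typically requested with something like the following sketch (partition and job names copied from the script above; the exact gres syntax depends on the cluster's configuration):

```shell
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --ntasks=1
#SBATCH -p nvidia_dev
#SBATCH --time=0:01:00
#SBATCH -J cardiax_test
#SBATCH --gres=gpu:1        # request one GPU (gpu:0 requests none)
#SBATCH --mem-bind=local
```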

System info

Operating System: Linux Red Hat 7.9
CUDA Version: 12.3
GCC Version: 9.3
MPI Version: 3.4
AMGX Version: 2.5.0
GPU Model: NVIDIA Tesla K40t
NVIDIA Driver Version: 470.82.01

Any guidance or suggestions on resolving this issue would be greatly appreciated. Thank you!

jooj211 added the bug label Jul 2, 2024
marsaev (Collaborator) commented Jul 3, 2024

@jooj211

GPU Model: NVIDIA Tesla K40t
CUDA Version: 12.3
NVIDIA Driver Version: 470.82.01

CUDA v12.3 does not support the Kepler architecture, and it also requires a more recent driver version:
https://docs.nvidia.com/cuda/archive/12.3.0/cuda-toolkit-release-notes/index.html#id4
The latest CUDA release that should work for Kepler GPUs is 11.4.

jooj211 (Author) commented Jul 3, 2024

CUDA v12.3 does not support the Kepler architecture, and it also requires a more recent driver version: https://docs.nvidia.com/cuda/archive/12.3.0/cuda-toolkit-release-notes/index.html#id4 The latest CUDA release that should work for Kepler GPUs is 11.4.

Thank you for the heads up! I've now changed the version of CUDA to 11.4. After recompiling both AMGX and my program, however, the same problem persists:

Caught amgx exception: Cannot allocate pinned memory
 at: /prj/hearttwins/jonatas.costa/source/amgx/src/global_thread_handle.cu:351
Stack trace:
 /scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::memory::PinnedMemoryPoo>
 /scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::allocate_resources(unsi>
 /scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::Resources::Resources(am>
 /scratch/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : AMGX_resources_create_simple(>
 ./nonlinearelas : petsc::LinearSolver::solve(petsc::Matrix&, petsc::Vector&, petsc::Vector&, d>
 ./nonlinearelas : NewtonLineSearch::solve()+0x222
 ./nonlinearelas : UpdatedLagrangian::solve()+0x3c4
 ./nonlinearelas : NonlinearElasticity::run()+0x6b
 ./nonlinearelas : main()+0x479
 /lib64/libc.so.6 : __libc_start_main()+0xf5
 ./nonlinearelas() [0x45336f]

And in the SLURM output file:

AMGX ERROR: file /prj/hearttwins/jonatas.costa/nodal_cardiax/src/linalg/petsc_linear_solver.cpp>
AMGX ERROR: Insufficient memory.
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=1
:
system msg for write_line failure : Bad file descriptor


marsaev (Collaborator) commented Jul 3, 2024

@jooj211

I wonder if it's somehow related to what another user reported in this issue: #313

First thing: does your environment support locked memory? You can try running an example that allocates the same amount of pinned memory to see whether it is an environment issue, something like this: https://godbolt.org/z/7ab86qc34
Next, I would suggest trying a simple AMGX example with the same config that you use in your application.

jooj211 (Author) commented Jul 4, 2024

@marsaev

First thing: does your environment support locked memory? You can try running an example that allocates the same amount of pinned memory to see whether it is an environment issue, something like this: https://godbolt.org/z/7ab86qc34

Interestingly, the example ran without any issues, so it doesn't seem to be an environment problem. When trying the AMGX examples, though, I ran into the same problem I had in my application:

AMGX version 2.5.0
Built on Jun 14 2024, 13:49:23
Compiled with CUDA Runtime 12.3, using CUDA driver 11.4
Warning: No mode specified, using dDDI by default.
Caught amgx exception: Cannot allocate pinned memory
 at: /prj/hearttwins/jonatas.costa/source/amgx/src/global_thread_handle.cu:351
Stack trace:
 /prj/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::memory::PinnedMemoryPool::PinnedMemoryPool()+0x412
 /prj/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::allocate_resources(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long)+0x3b
 /prj/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : amgx::Resources::Resources(amgx::AMG_Configuration*, void*, int, int const*)+0xc24
 /prj/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so : AMGX_resources_create_simple()+0x8b
 ./examples/amgx_capi() [0x40172c]
 /lib64/libc.so.6 : __libc_start_main()+0xf5
 ./examples/amgx_capi() [0x401de3]

Caught signal 11 - SIGSEGV (segmentation violation)

marsaev (Collaborator) commented Jul 8, 2024

One last small thing to check: the output still shows

Compiled with CUDA Runtime 12.3

Can you verify that CUDA 11.4 is actually being used at runtime?
(You can try running ldd /prj/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so)

Sorry for the misleading output; that message actually reports which version is in use at runtime ( https://github.com/NVIDIA/AMGX/blob/main/src/core.cu#L738-L751 )
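The ldd check above can be scripted; a small sketch (the library path is the one from this thread, and the helper name is made up):

```shell
# Print which CUDA runtime library a shared object is linked against.
check_cuda_runtime() {
    # $1: path to the shared library, e.g. libamgxsh.so on the cluster
    if [ -e "$1" ]; then
        ldd "$1" | grep -i cudart || echo "no libcudart entry found"
    else
        echo "library not found: $1"
    fi
}

# On the cluster this would be:
#   check_cuda_runtime /prj/hearttwins/jonatas.costa/source/amgx/lib/libamgxsh.so
# A line mentioning libcudart.so.12 would mean the 12.x runtime is still
# being picked up despite the rebuild against 11.4.
```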
