[Issue] Recovering from out of memory error #289

Open
Samev opened this issue Jan 11, 2024 · 0 comments

Describe the issue

When AMGX is run on a case that is too large for the GPU, it reports the following error:

Thrust failure: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
File and line number are not available for this exception.

This happens when calling AMGX_solver_setup. We then try to reset the AMGX solver, but the call to AMGX_solver_destroy crashes the application (even though it is made inside a try-catch block) with the following:

terminate called after throwing an instance of 'amgx::amgx_exception'
  what():  Cuda failure: 'an illegal memory access was encountered'

 /<censored>/lib/libamgxsh.so : amgx::handle_signals(int)+0xa2
 /lib/x86_64-linux-gnu/libc.so.6 : ()+0x42520
 /lib/x86_64-linux-gnu/libc.so.6 : pthread_kill()+0x12c
 /lib/x86_64-linux-gnu/libc.so.6 : raise()+0x16
 /lib/x86_64-linux-gnu/libc.so.6 : abort()+0xd3
 /lib/x86_64-linux-gnu/libstdc++.so.6 : ()+0xa2b9e
 /lib/x86_64-linux-gnu/libstdc++.so.6 : ()+0xae20c
 /lib/x86_64-linux-gnu/libstdc++.so.6 : ()+0xad1e9
 /lib/x86_64-linux-gnu/libstdc++.so.6 : __gxx_personality_v0()+0x99
 /lib/x86_64-linux-gnu/libgcc_s.so.1 : ()+0x16884
 /lib/x86_64-linux-gnu/libgcc_s.so.1 : _Unwind_RaiseException()+0x311
 /lib/x86_64-linux-gnu/libstdc++.so.6 : __cxa_throw()+0x3b
 /<censored>/lib/libamgxsh.so : amgx::dense_lu_solver::DenseLUSolver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~DenseLUSolver()+0x998
 /<censored>/lib/libamgxsh.so : amgx::dense_lu_solver::DenseLUSolver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~DenseLUSolver()+0xd
 /<censored>/lib/libamgxsh.so : amgx::AMG<(AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2>::~AMG()+0x42
 /<censored>/lib/libamgxsh.so : amgx::AlgebraicMultigrid_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~AlgebraicMultigrid_Solver()+0x26
 /<censored>/lib/libamgxsh.so : amgx::AlgebraicMultigrid_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~AlgebraicMultigrid_Solver()+0xd
 /<censored>/lib/libamgxsh.so : amgx::PBiCGStab_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~PBiCGStab_Solver()+0x35
 /<censored>/lib/libamgxsh.so : amgx::PBiCGStab_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~PBiCGStab_Solver()+0xd
 /<censored>/lib/libamgxsh.so : amgx::AMG_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >::~AMG_Solver()+0x180
 /<censored>/lib/libamgxsh.so : std::_Sp_counted_ptr<amgx::AMG_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> >*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x16
 /<censored>/lib/libamgxsh.so : std::_Sp_counted_ptr<amgx::CWrapHandle<AMGX_solver_handle_struct*, amgx::AMG_Solver<amgx::TemplateConfig<(AMGX_MemorySpace)1, (AMGX_VecPrecision)0, (AMGX_MatPrecision)0, (AMGX_IndPrecision)2> > >*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0x56
 /<censored>/lib/libamgxsh.so : ()+0x1394590
 /<censored>/lib/libamgxsh.so : AMGX_solver_destroy()+0xe24

Is it intended to be possible to recover from out-of-memory errors like this? I looked through the documentation and couldn't find anything indicating that a failure in AMGX_solver_setup needs special handling.
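
A note on why the try-catch block may not help: in the trace above, __cxa_throw is reached from inside ~DenseLUSolver() and the unwinder goes straight to abort(). That is consistent with an exception leaving a destructor, which C++11 and later treat as implicitly noexcept, so std::terminate is called before any outer catch block can run. This is a reading of the trace, not a confirmed diagnosis of the AMGX source. The sketch below uses two hypothetical stand-in types (not AMGX classes) to show the language rule in isolation:

```cpp
#include <cassert>
#include <stdexcept>
#include <type_traits>

// Hypothetical stand-in for an object whose cleanup can fail; not an AMGX type.
// The destructor explicitly opts out of the implicit noexcept.
struct ThrowingCleanup {
    ~ThrowingCleanup() noexcept(false) {
        throw std::runtime_error("cleanup failed");
    }
};

// Same shape with a default destructor, which is implicitly noexcept.
// A throw escaping this destructor would call std::terminate instead of
// reaching any catch block.
struct DefaultCleanup {
    ~DefaultCleanup() {}
};

// Returns true if the exception thrown by ~ThrowingCleanup reached the
// catch block below.
bool destructor_throw_is_catchable() {
    try {
        ThrowingCleanup t;
        (void)t;  // destroyed when the try block's scope ends
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}

// Compile-time check of which destructor is allowed to propagate exceptions.
static_assert(std::is_nothrow_destructible<DefaultCleanup>::value,
              "default destructors are implicitly noexcept");
static_assert(!std::is_nothrow_destructible<ThrowingCleanup>::value,
              "noexcept(false) lets the exception propagate");
```

Only the noexcept(false) variant can be caught by the caller. If the destructors in the trace are implicitly noexcept, wrapping AMGX_solver_destroy in try-catch on the application side cannot intercept the failure.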

Obviously AMGX can't handle this particular matrix and solver combination on this particular GPU. The problem is that the crash prevents us from destroying our AMGX solver object when we hit this limit, so the application crashes completely.

I tried skipping the call to AMGX_solver_destroy and proceeding with the rest of the *destroy and finalize calls, but then I run into the error "!!! detected some memory leaks in the code: trying to free non-empty temporary device pool !!!". That makes sense, since the solver object isn't destroyed in the intended order.

Environment information:

  • OS: Ubuntu 22.04 (through WSL on Windows 11)
  • CUDA runtime: CUDA 11.7.1
  • MPI version (if applicable): Not applicable
  • AMGX version or commit hash: v2.3.0 + cherry-picked 8bb693b42acc64c1893835d95858cad350c790c1
  • NVIDIA driver: 528.24 (probably the Windows driver version as nvidia-smi reports the same version in Windows + WSL)
  • NVIDIA GPU: RTX4080
  • Any related environment variables information: Not applicable

The same problem has also been reported with the same build on at least an RTX3090 card.

AMGX solver configuration

config_version=2,
determinism_flag=0,
solver(mainSolver)=PBICGSTAB,
mainSolver:preconditioner(precon)=AMG,
precon:cycle=V,
precon:max_levels=15,
precon:selector=PMIS,
precon:smoother(smooth)=BLOCK_JACOBI,
precon:presweeps=1,
precon:postsweeps=1,
precon:max_iters=1,
precon:interpolator=D2,
precon:interp_max_elements=6,
mainSolver:monitor_residual=1,
mainSolver:store_res_history=1,
mainSolver:norm=L2,
mainSolver:print_vis_data=1,
mainSolver:max_iters=10000,
mainSolver:tolerance=1e-09,
mainSolver:gmres_n_restart=30,
mainSolver:convergence=RELATIVE_INI_CORE
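
For anyone trying to reproduce this, the options above can be assembled into the single comma-separated string form shown. The helper below is a minimal sketch; join_amgx_options is a hypothetical name, not part of AMGX, and it assumes the resulting string is then passed as the options argument of AMGX_config_create:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not an AMGX function): joins key/value pairs into
// "key=value, key=value, ..." in the order given.
std::string join_amgx_options(
    const std::vector<std::pair<std::string, std::string>>& options) {
    std::string out;
    for (std::size_t i = 0; i < options.size(); ++i) {
        if (i > 0) {
            out += ", ";  // separator between entries, none after the last
        }
        out += options[i].first + "=" + options[i].second;
    }
    return out;
}
```

Keeping the options as pairs makes it easy to vary a single entry (for example precon:max_levels) between runs when checking whether the failure depends on the hierarchy size.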

Matrix Data

I'm not able to share the matrix I'm currently using. If needed, I can try to reproduce the crash with a matrix that isn't sensitive.

Reproduction steps

Call order:

  • Setup:
    • AMGX_solver_register_print_callback
    • AMGX_initialize
    • AMGX_initialize_plugins
    • AMGX_install_signal_handler
    • AMGX_config_create (global config)
    • AMGX_resources_create_simple
    • AMGX_config_create (for the specific solver)
    • AMGX_matrix_create
    • AMGX_vector_create (both rhs and solution)
    • AMGX_solver_create
    • AMGX_matrix_upload_all
    • AMGX_solver_setup
      • Crash due to insufficient memory, exception is caught
  • Try to tear down AMGX
    • AMGX_solver_destroy
      • Results in process crashing, can't catch the exception
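
The teardown half of the call order above can be sketched as a stack of cleanup steps that run in reverse creation order and continue when one step fails. TeardownStack is a hypothetical helper, not part of AMGX, and it only helps when a step reports failure through a return value or a catchable exception; it cannot intercept a std::terminate raised inside the library.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not an AMGX type): runs cleanup steps in reverse
// registration order and keeps going when a step fails.
class TeardownStack {
public:
    // Register a named cleanup step; the step returns true on success.
    void push(std::string name, std::function<bool()> step) {
        steps_.emplace_back(std::move(name), std::move(step));
    }

    // Run every step, last registered first.
    // Returns the names of the steps that failed.
    std::vector<std::string> run() {
        std::vector<std::string> failed;
        while (!steps_.empty()) {
            auto entry = std::move(steps_.back());
            steps_.pop_back();
            bool ok = false;
            try {
                ok = entry.second();
            } catch (const std::exception&) {
                ok = false;  // a catchable exception counts as a failed step
            }
            if (!ok) {
                failed.push_back(entry.first);
            }
        }
        return failed;
    }

private:
    std::vector<std::pair<std::string, std::function<bool()>>> steps_;
};
```

With AMGX, each step would presumably wrap one destroy call (solver, vectors, matrix, resources, configs) and compare its return code against AMGX_RC_OK, so that a reported failure in one step still lets the remaining handles be released.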

Additional context

-

@Samev Samev added the bug label Jan 11, 2024