[BUG] core dump when benchmarking with samples/dlrm on L20/H20 #464

Closed
samchugit opened this issue Dec 3, 2024 · 1 comment

@samchugit

Describe the bug
A core dump occurs when benchmarking with samples/dlrm on L20/H20 GPUs.

To Reproduce
Steps to reproduce the behavior:

  1. Clone the HugeCTR source code and check out branch v24.06.00.
  2. Prepare the datasets according to README.md. For convenience, only day0 of the Criteo 1TB dataset is used.
  3. Run the following commands to pull the image and start the container:
docker pull nvcr.io/nvidia/merlin/merlin-hugectr:24.06
docker run --gpus=all --rm -it --privileged --shm-size=8g \
  --ulimit memlock=-1 --ulimit stack=67108864 --cap-add SYS_NICE \
  -u $(id -u):$(id -g) -v <path_to_HugeCTR>:/workspace -v <path_to_dataset>/data \
  nvcr.io/nvidia/merlin/merlin-hugectr:24.06
  4. Run train.py with the following commands:
cd /workspace/samples/dlrm
pip install -r requirements.txt
python train.py
  5. An exception occurs and the process aborts with a core dump.

Expected behavior
Running train.py without any exception.

Environment:

  • OS: Ubuntu 22.04
  • Graphics card: 8 × NVIDIA L20/H20
  • CUDA version: 12.2
  • Docker image: nvcr.io/nvidia/merlin/merlin-hugectr:24.06

Additional context

  1. With a debug build of HugeCTR, the core dump provides more information:
(gdb)
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139851934344768) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=139851934344768) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=139851934344768, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f3280642476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f32806287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007f3262ead42a in __gnu_cxx::__verbose_terminate_handler() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f3262eab20c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f3262eaa1e9 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f3262eaa959 in __gxx_personality_v0 () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f32804f6884 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#10 0x00007f32804f6f41 in _Unwind_RaiseException () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#11 0x00007f3262eab4cb in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#12 0x00007f30c448aa91 in HugeCTR::GPUResource::set_stream (this=0x7f2fa009e020, name="default", priority=0) at /hugectr/HugeCTR/include/gpu_resource.hpp:80
#13 0x00007f30c4df1751 in HugeCTR::StreamContext::~StreamContext (this=0x7f31d0de1ce0, __in_chrg=<optimized out>) at /hugectr/HugeCTR/include/gpu_resource.hpp:116
#14 0x00007f30c4df05f1 in HugeCTR::StreamContextScheduleable::run (this=0x7ef9f3abe4f0, gpu=std::shared_ptr<HugeCTR::GPUResource> (use count 28, weak count 0) = {...}, use_graph=true) at /hugectr/HugeCTR/src/pipeline.cpp:102
#15 0x00007f30c4df0fdc in HugeCTR::Pipeline::run_graph (this=0x55ae7a79b6e8) at /hugectr/HugeCTR/src/pipeline.cpp:152
#16 0x00007f30c4f208ef in _ZN7HugeCTR5Model23train_pipeline_with_ebcEv._omp_fn.1(void) () at /hugectr/HugeCTR/src/pybind/model_pipeline.cpp:468
#17 0x00007f3263cf7c0e in gomp_thread_start (xdata=<optimized out>) at ../../../src/libgomp/team.c:129
#18 0x00007f3280694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#19 0x00007f3280726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
  2. With CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 and cuda-gdb, more information is available:
    [screenshot of cuda-gdb output]
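
Reading the gdb backtrace above from the top, frames #0 to #11 are libc, libgcc and libstdc++ internals (abort, terminate handler, unwinder, __cxa_throw). The first HugeCTR frame is #12, where GPUResource::set_stream throws, and its caller in frame #13 is the StreamContext destructor. An exception that escapes a destructor normally ends in std::terminate in C++11 and later, which would explain the abort. The following is a minimal sketch of how that first project frame can be picked out of a pasted backtrace; the helper names (Frame, parse_frame, first_frame_in) are hypothetical and not part of HugeCTR or gdb.

```python
import re
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    index: int
    function: str
    location: Optional[str]  # "file:line" when gdb printed one, else None


def parse_frame(line: str) -> Optional[Frame]:
    """Parse one gdb backtrace line such as '#12 0x... in func (...) at file:80'."""
    text = line.strip()
    head = re.match(r"#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(\S+)", text)
    if head is None:
        return None
    tail = re.search(r" at (\S+:\d+)$", text)
    return Frame(int(head.group(1)), head.group(2), tail.group(1) if tail else None)


def first_frame_in(backtrace: str, prefix: str = "HugeCTR::") -> Optional[Frame]:
    """Return the innermost frame whose function name starts with `prefix`."""
    for line in backtrace.splitlines():
        frame = parse_frame(line)
        if frame is not None and frame.function.startswith(prefix):
            return frame
    return None


# Frames #10 to #13 copied from the backtrace in this issue.
SAMPLE = """\
#10 0x00007f32804f6f41 in _Unwind_RaiseException () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#11 0x00007f3262eab4cb in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#12 0x00007f30c448aa91 in HugeCTR::GPUResource::set_stream (this=0x7f2fa009e020, name="default", priority=0) at /hugectr/HugeCTR/include/gpu_resource.hpp:80
#13 0x00007f30c4df1751 in HugeCTR::StreamContext::~StreamContext (this=0x7f31d0de1ce0, __in_chrg=<optimized out>) at /hugectr/HugeCTR/include/gpu_resource.hpp:116
"""

frame = first_frame_in(SAMPLE)
print(frame.index, frame.function, frame.location)
```

This only locates where the exception was raised. It does not show why set_stream failed, which is what the cuda-gdb output is meant to help with.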
@shijieliu
Collaborator

After discussing with the user offline, we found that the crash occurs because the dataset was not preprocessed properly.
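
The thread does not say what exactly was wrong with the preprocessing, so the following is only a generic sanity check one might run on a preprocessed binary file before training. It is a sketch under stated assumptions: the record layout (little-endian int32 fields ordered as label, dense features, categorical ids) and the function name check_records are hypothetical and are not taken from HugeCTR's documentation. Adjust both to the format that the sample's README actually specifies.

```python
import struct
from typing import List, Sequence


def check_records(raw: bytes, num_dense: int, slot_sizes: Sequence[int],
                  num_label: int = 1) -> List[str]:
    """Return a list of problems found in a flat buffer of int32 records.

    ASSUMED layout per record (hypothetical, not HugeCTR's documented format):
    num_label labels, then num_dense dense values, then one categorical id
    per entry in slot_sizes, all stored as little-endian int32.
    """
    fields = num_label + num_dense + len(slot_sizes)
    record_bytes = 4 * fields
    problems: List[str] = []

    # A truncated or differently laid out file will not divide evenly.
    if len(raw) % record_bytes != 0:
        problems.append(
            f"size {len(raw)} is not a multiple of record size {record_bytes}")
        return problems

    # Every categorical id must index into its slot's vocabulary.
    for i, record in enumerate(struct.iter_unpack(f"<{fields}i", raw)):
        ids = record[num_label + num_dense:]
        for slot, (value, size) in enumerate(zip(ids, slot_sizes)):
            if not 0 <= value < size:
                problems.append(
                    f"record {i}, slot {slot}: id {value} outside [0, {size})")
    return problems


# Tiny in-memory example: 1 label, 1 dense value, 2 categorical slots of size 3.
good = struct.pack("<4i", 1, 5, 0, 2)
bad = struct.pack("<4i", 0, 5, 3, -1)
print(check_records(good, num_dense=1, slot_sizes=[3, 3]))
print(check_records(bad, num_dense=1, slot_sizes=[3, 3]))
```

An out-of-range categorical id is one plausible way for malformed input to surface as a GPU-side error rather than a clear Python exception, but whether that was the cause here is not stated in the issue.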
