
question about torch 2.1.0 integration #22

Open · Pegessi opened this issue Apr 18, 2024 · 3 comments

Comments

Pegessi commented Apr 18, 2024

Thanks for sharing your work! I greatly appreciate the effort to reduce CUDA memory fragmentation. I recently integrated GMLake into torch 2.1.0 and it compiled without errors. I would like to know how to confirm that GMLake is working properly, since I did not see any reduction in peak reserved memory when using LoRA to train Llama2-7B. garbage_collect_fused_blocks() jumps to its error-handling section; does that prevent GMLake from working?

[screenshot: garbage_collect_fused_blocks() error-handling path]

Here are some running logs from a short run of only 6 training steps.

node-9658:4032281:4036703 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x25f46060, ptr 0x12a0000000 of size 512.000000MB
node-9658:4032281:4036703 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 256 physical blocks to ptr 0x12a0000000 of size 512.000000MB for allocate size 512.000000MB succeeded, takes 20.435480ms, total_fuse_size 32558.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x2a994d70, ptr 0x12c0000000 of size 954.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 477 physical blocks to ptr 0x12c0000000 of size 954.000000MB for allocate size 954.000000MB succeeded, takes 40.207251ms, total_fuse_size 33512.000000MB
node-9658:4032281:4036703 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x2ba9e650, ptr 0x1320000000 of size 512.000000MB
node-9658:4032281:4036703 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 256 physical blocks to ptr 0x1320000000 of size 512.000000MB for allocate size 512.000000MB succeeded, takes 20.692452ms, total_fuse_size 34024.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x2bab4010, ptr 0x1340000000 of size 954.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 477 physical blocks to ptr 0x1340000000 of size 954.000000MB for allocate size 954.000000MB succeeded, takes 51.173343ms, total_fuse_size 34978.000000MB
node-9658:4032281:4036703 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x2b83a6b0, ptr 0x13a0000000 of size 512.000000MB
node-9658:4032281:4036703 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 256 physical blocks to ptr 0x13a0000000 of size 512.000000MB for allocate size 512.000000MB succeeded, takes 30.265250ms, total_fuse_size 35490.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x26fa6af0, ptr 0x13c0000000 of size 954.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 477 physical blocks to ptr 0x13c0000000 of size 954.000000MB for allocate size 954.000000MB succeeded, takes 49.731019ms, total_fuse_size 36444.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4159 fused block 0x28e575f0, ptr 0x13fc000000 of size 954.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO get_fused_fragmented_blocks():4171 try 0: fuse 477 physical blocks to ptr 0x13fc000000 of size 954.000000MB for allocate size 954.000000MB succeeded, takes 40.066690ms, total_fuse_size 37398.000000MB
{'train_runtime': 25.9383, 'train_samples_per_second': 1.851, 'train_steps_per_second': 0.231, 'train_loss': 1.7313324610392253, 'epoch': 1.0}
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3824 gc from fragmented_free_fused_blocks: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3893 gc from free_fused_blocks_in_release_order: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO emptyCache():2425 garbage_collect_fused_blocks() return 0MB garbage memory
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3824 gc from fragmented_free_fused_blocks: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3893 gc from free_fused_blocks_in_release_order: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO emptyCache():2425 garbage_collect_fused_blocks() return 0MB garbage memory
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3824 gc from fragmented_free_fused_blocks: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3893 gc from free_fused_blocks_in_release_order: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO emptyCache():2425 garbage_collect_fused_blocks() return 0MB garbage memory
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3824 gc from fragmented_free_fused_blocks: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3893 gc from free_fused_blocks_in_release_order: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO emptyCache():2425 garbage_collect_fused_blocks() return 0MB garbage memory
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3824 gc from fragmented_free_fused_blocks: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3893 gc from free_fused_blocks_in_release_order: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO emptyCache():2425 garbage_collect_fused_blocks() return 0MB garbage memory
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3824 gc from fragmented_free_fused_blocks: blocks 0, size 0.000000MB
node-9658:4032281:4032281 [0] GMLAKE_INFO garbage_collect_fused_blocks():3893 gc from free_fused_blocks_in_release_order: blocks 0, size 0.000000MB
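
A minimal sketch of how the peak-memory question above can be checked from Python. train_one_step is a hypothetical placeholder for one LoRA training step; the torch.cuda calls are standard PyTorch APIs. The idea is to run the same workload with and without GMLake and compare peak reserved vs. peak allocated bytes, since the gap between them is a rough proxy for fragmentation.

import torch

def report_peak_memory(train_one_step, num_steps=6):
    # Reset the counters so the peaks reflect only this measurement window.
    torch.cuda.reset_peak_memory_stats()
    for _ in range(num_steps):
        train_one_step()
    torch.cuda.synchronize()
    peak_alloc = torch.cuda.max_memory_allocated() / 2**20     # MiB requested by live tensors
    peak_reserved = torch.cuda.max_memory_reserved() / 2**20   # MiB held by the caching allocator
    # If GMLake is engaged, peak_reserved (and the reserved-allocated gap)
    # should shrink versus the same run with the stock allocator.
    print(f"peak allocated {peak_alloc:.1f} MiB, peak reserved {peak_reserved:.1f} MiB, "
          f"gap {peak_reserved - peak_alloc:.1f} MiB")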
Pegessi changed the title from "torch 2.1.0 use problem" to "question about torch 2.1.0 integration" on Apr 18, 2024
uygnef commented Apr 30, 2024

Hi @Pegessi,
I am currently working on the master branch of the repository and have encountered a compilation error when attempting to build the project with PyTorch 2.1. Which branch do you use?

mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3499:27: error: ‘struct c10::cuda::CUDACachingAllocator::Native::{anonymous}::HistoryChain’ has no member named ‘h’
3499 | block->history->h.context);
| ^
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp: In member function ‘void c10::cuda::CUDACachingAllocator::Native::DeviceCachingAllocator::record_trace(c10::cuda::CUDACachingAllocator::TraceEntry::Action, int64_t, size_t, cudaStream_t, int)’:
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3669:37: error: operands to ‘?:’ have different types ‘std::remove_reference<int&>::type’ {aka ‘int’} and ‘std::nullptr_t’
3669 | alloc_trace_record_context_ ? std::move(context) : nullptr);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp: At global scope:
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3802:8: error: ‘void c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::recordHistory(bool, c10::cuda::CUDACachingAllocator::CreateContextFn, size_t, bool)’ marked ‘override’, but does not override
3802 | void recordHistory(
| ^~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3925:8: error: ‘void c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::notifyCaptureBegin(int, c10::cuda::CaptureId_t, c10::cuda::MempoolId_t)’ marked ‘override’, but does not override
3925 | void notifyCaptureBegin(
| ^~~~~~~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3934:8: error: ‘void c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::notifyCaptureAboutToEnd(int, c10::cuda::CaptureId_t)’ marked ‘override’, but does not override
3934 | void notifyCaptureAboutToEnd(int device, CaptureId_t graph_id) override {
| ^~~~~~~~~~~~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3939:8: error: ‘void c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::notifyCaptureEnded(int, c10::cuda::CaptureId_t)’ marked ‘override’, but does not override
3939 | void notifyCaptureEnded(int device, CaptureId_t graph_id) override {} // no-op
| ^~~~~~~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3941:8: error: ‘void c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::notifyCaptureDestroy(int, c10::cuda::MempoolId_t)’ marked ‘override’, but does not override
3941 | void notifyCaptureDestroy(int device, MempoolId_t mempool_id) override {
| ^~~~~~~~~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:3967:8: error: ‘bool c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator::needsPoolSpecificPeerAccess()’ marked ‘override’, but does not override
3967 | bool needsPoolSpecificPeerAccess() override {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
/mnt/pytorch/c10/cuda/CUDACachingAllocator.cpp:4031:24: error: cannot declare variable ‘c10::cuda::CUDACachingAllocator::Native::allocator’ to be of abstract type ‘c10::cuda::CUDACachingAllocator::Native::NativeCachingAllocator’
4031 | NativeCachingAllocator allocator;
| ^~~~~~~~~

Pegessi (Author) commented May 9, 2024

I integrated GMLake into torch 2.1.0 manually by myself. This code cannot be used directly to replace the corresponding files in PyTorch 2.1.0 because of interface changes in CUDACachingAllocator.h/.cpp. Although my manual version builds successfully and the logs show that virtual memory is being created, I am still not sure whether GMLake is working, because my version does not reduce peak memory during DNN training and the overhead is large for the first few iterations.
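
A sketch for separating the two observations above (no peak-memory reduction, and high overhead in the first few iterations): log step time and the reserved-allocated gap per iteration. train_one_step is again a hypothetical placeholder; the memory_stats keys are standard PyTorch. Early steps are expected to be slower while GMLake fuses physical blocks (the "takes ...ms" lines in the logs), while the gap is the quantity that should drop if defragmentation is effective.

import time
import torch

def profile_iterations(train_one_step, num_steps=6):
    for step in range(num_steps):
        t0 = time.perf_counter()
        train_one_step()
        torch.cuda.synchronize()
        stats = torch.cuda.memory_stats()
        reserved = stats["reserved_bytes.all.current"] / 2**20
        allocated = stats["allocated_bytes.all.current"] / 2**20
        # Step time captures the block-fusion overhead; the reserved-allocated
        # gap is the fragmentation proxy that GMLake is meant to reduce.
        print(f"step {step}: {time.perf_counter() - t0:.2f}s, "
              f"reserved {reserved:.0f} MiB, allocated {allocated:.0f} MiB, "
              f"gap {reserved - allocated:.0f} MiB")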

@dream110fly

@Pegessi Have you ever encountered the following problem? When I patched GMLake into torch 2.1.0, I found that sometimes when release_block is called to cudaFree a small block, CUDA raises an illegal memory access:

frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7f3e8791e1f2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x232c5 (0x7f3e878e52c5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x2ed2a (0x7f3e878f0d2a in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x1e4f5 (0x7f3e878e04f5 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #6: + 0x22a7e (0x7f3e878e4a7e in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #7: + 0x3a07a (0x7f3e878fc07a in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #8: + 0x3ac79 (0x7f3e878fcc79 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #9: + 0x3b143 (0x7f3e878fd143 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #10: THCPModule_emptyCache(_object*, _object*) + 0x37 (0x7f3e86a442b7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #11: + 0x157a3e (0x5648cf980a3e in /usr/bin/python)
frame #12: _PyEval_EvalFrameDefault + 0x614a (0x5648cf971cfa in /usr/bin/python)
frame #13: _PyFunction_Vectorcall + 0x7c (0x5648cf9839fc in /usr/bin/python)
frame #14: _PyEval_EvalFrameDefault + 0x614a (0x5648cf971cfa in /usr/bin/python)
frame #15: _PyFunction_Vectorcall + 0x7c (0x5648cf9839fc in /usr/bin/python)
frame #16: _PyEval_EvalFrameDefault + 0x6bd (0x5648cf96c26d in /usr/bin/python)
frame #17: _PyFunction_Vectorcall + 0x7c (0x5648cf9839fc in /usr/bin/python)
frame #18: _PyEval_EvalFrameDefault + 0x6bd (0x5648cf96c26d in /usr/bin/python)
frame #19: _PyFunction_Vectorcall + 0x7c (0x5648cf9839fc in /usr/bin/python)
frame #20: _PyEval_EvalFrameDefault + 0x198c (0x5648cf96d53c in /usr/bin/python)
frame #21: + 0x13f9c6 (0x5648cf9689c6 in /usr/bin/python)
frame #22: PyEval_EvalCode + 0x86 (0x5648cfa5e256 in /usr/bin/python)
frame #23: + 0x260108 (0x5648cfa89108 in /usr/bin/python)
frame #24: + 0x2599cb (0x5648cfa829cb in /usr/bin/python)
frame #25: + 0x25fe55 (0x5648cfa88e55 in /usr/bin/python)
frame #26: _PyRun_SimpleFileObject + 0x1a8 (0x5648cfa88338 in /usr/bin/python)
frame #27: _PyRun_AnyFileObject + 0x43 (0x5648cfa87f83 in /usr/bin/python)
frame #28: Py_RunMain + 0x2be (0x5648cfa7aa5e in /usr/bin/python)
frame #29: Py_BytesMain + 0x2d (0x5648cfa5102d in /usr/bin/python)
frame #30: + 0x29d90 (0x7f3e88356d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #31: __libc_start_main + 0x80 (0x7f3e88356e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #32: _start + 0x25 (0x5648cfa50f25 in /usr/bin/python)
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
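
Since the error surfaces in empty_cache(), one way to narrow it down (a minimal sketch, assuming the crash is reproducible from a single-process Python run) is to force synchronous launches and synchronize immediately before emptying the cache, so the failing call is the one actually reported rather than an earlier asynchronous kernel. Running the same script under NVIDIA's compute-sanitizer can further pinpoint the faulting free.

import os
# Must be set before CUDA is initialized, i.e. before the first torch.cuda call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

def checked_empty_cache():
    # If this synchronize() already raises, the illegal access comes from an
    # earlier kernel and empty_cache()/cudaFree is only where it surfaces.
    torch.cuda.synchronize()
    # If synchronize() passes but empty_cache() still fails, the fault is more
    # likely in the allocator's release path (e.g. the cudaFree of a small block).
    torch.cuda.empty_cache()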
