Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MultiGPU malloc_device fails for second GPU (Out of Memory) #16038

Open
kaanolgu opened this issue Nov 10, 2024 · 0 comments
Open

MultiGPU malloc_device fails for second GPU (Out of Memory) #16038

kaanolgu opened this issue Nov 10, 2024 · 0 comments
Labels
bug Something isn't working hip Issues related to execution on HIP backend.

Comments

@kaanolgu
Copy link

kaanolgu commented Nov 10, 2024

Describe the bug

I am trying to use multiGPU code on AMD MI210 and I can't allocate anything on second GPU. Even N=1 sized allocation fails with HIP Out of Memory error. I tried these prebuilt nightly builds:

llvm-nighltly-2024-10-26
llvm-nighltly-2024-11-09
llvm-nighltly-2024-11-07

The code works if I use a single GPU and only allocate memory on single GPU

To reproduce

#include <sycl/sycl.hpp>
#include <iostream>

#define NUM_GPU 2
int main() {

  auto Devs = sycl::device::get_devices(sycl::info::device_type::gpu);


  std::vector<sycl::queue> Queues;
  // Insert not all devices only the required ones for model


  std::transform(Devs.begin(), Devs.begin() + NUM_GPU, std::back_inserter(Queues),
                 [](const sycl::device &D) { return sycl::queue{D}; });

  ////////////////////////////////////////////////////////////////////////
  if (Devs.size() > 1){
  if (!Devs[0].ext_oneapi_can_access_peer(
          Devs[1], sycl::ext::oneapi::peer_access::access_supported)) {
    std::cout << "P2P access is not supported by devices, exiting."
              << std::endl;

  }
  }
    std::cout <<"\n----------------------------------------"<< std::endl;

    std::cout << "Running on devices:" << std::endl;
    for(int i =0; i < Queues.size(); i++){

    std::cout << i << ":\t" << Queues[i].get_device().get_info<sycl::info::device::name>()
              << std::endl;
    }
std::cout <<"----------------------------------------"<< std::endl;


const int N = 1;
std::vector<uint32_t> data_host(N);
std::vector<uint32_t> data_host_2(N);

uint32_t* data_dev     = malloc_device<uint32_t>(N, Queues[0]); 
uint32_t* data_dev_2     = malloc_device<uint32_t>(N, Queues[1]); 

Queues[0].memcpy(data_dev, data_host.data(), N * sizeof(uint32_t));
Queues[1].memcpy(data_dev_2, data_host_2.data(), N * sizeof(uint32_t));
}

This is the compile command:

export PATH=$DPCPP_HOME/llvm-2024-10-26/bin:$PATH;
export LD_LIBRARY_PATH=$DPCPP_HOME/llvm-2024-10-26/lib:$LD_LIBRARY_PATH
sycl-ls
rm -rf a.out;clang++ -O3 -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx90a atomic_test.cpp ; ./a.out 

This is the output :

$ ./build.sh
[hip:gpu][hip:0] AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60140.9]
[hip:gpu][hip:1] AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60140.9]
[hip:gpu][hip:2] AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60140.9]

----------------------------------------
Running on devices:
0:      AMD Instinct MI210
1:      AMD Instinct MI210
----------------------------------------
<HIP>[ERROR]: 
UR HIP ERROR:
        Value:           2
        Name:            hipErrorOutOfMemory
        Description:     out of memory
        Function:        getNextTransferStream
        Source Location: /__w/llvm/llvm/build/_deps/unified-runtime-src/source/adapters/hip/queue.cpp:106

terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  Native API failed. Native API returns: 38 (UR_RESULT_ERROR_OUT_OF_HOST_MEMORY)
./build.sh: line 16: 542041 Aborted                 (core dumped) ./a.out

Environment

Device  [Model : Revision]    Temp    Power  Partitions      SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
        Name (20 chars)       (Edge)  (Avg)  (Mem, Compute)                                                   
==============================================================================================================
0       [0x0c34 : 0x02]       39.0°C  41.0W  N/A, N/A        800Mhz  1600Mhz  0%   auto  300.0W    0%   0%    
        Instinct MI210                                                                                        
1       [0x0c34 : 0x02]       39.0°C  43.0W  N/A, N/A        800Mhz  1600Mhz  0%   auto  300.0W    0%   0%    
        Instinct MI210                                                                                        
2       [0x0c34 : 0x02]       34.0°C  42.0W  N/A, N/A        800Mhz  1600Mhz  0%   auto  300.0W    0%   0%    
        Instinct MI210                                                                                        

sycl-ls:

[hip:gpu][hip:0] AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60140.9]
[hip:gpu][hip:1] AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60140.9]
[hip:gpu][hip:2] AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60140.9]

Platforms: 1
Platform [#1]:
    Version  : HIP 60140.9
    Name     : AMD HIP BACKEND
    Vendor   : AMD Corporation
    Devices  : 3
        Device [#0]:
        Type              : gpu
        Version           : gfx90a:sramecc+:xnack-
        Name              : AMD Instinct MI210
        Vendor            : AMD Corporation
        Driver            : HIP 60140.9
        UUID              : 4950100981019857102575252102995310254
        DeviceID          : 0
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_native_assert ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_intel_legacy_image ext_oneapi_graph ext_oneapi_limited_graph ext_oneapi_queue_profiling_tag
        info::device::sub_group_sizes: 64
        Architecture: amd_gpu_gfx90a
        Device [#1]:
        Type              : gpu
        Version           : gfx90a:sramecc+:xnack-
        Name              : AMD Instinct MI210
        Vendor            : AMD Corporation
        Driver            : HIP 60140.9
        UUID              : 5548569810097102505099994910097100100
        DeviceID          : 0
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_native_assert ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_intel_legacy_image ext_oneapi_graph ext_oneapi_limited_graph ext_oneapi_queue_profiling_tag
        info::device::sub_group_sizes: 64
        Architecture: amd_gpu_gfx90a
        Device [#2]:
        Type              : gpu
        Version           : gfx90a:sramecc+:xnack-
        Name              : AMD Instinct MI210
        Vendor            : AMD Corporation
        Driver            : HIP 60140.9
        UUID              : 1014852485797999898535199551015355
        DeviceID          : 0
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_native_assert ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_intel_legacy_image ext_oneapi_graph ext_oneapi_limited_graph ext_oneapi_queue_profiling_tag
        info::device::sub_group_sizes: 64
        Architecture: amd_gpu_gfx90a
default_selector()      : gpu, AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60140.9]
accelerator_selector()  : No device of requested type available.
cpu_selector()          : No device of requested type available.
gpu_selector()          : gpu, AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60140.9]
custom_selector(gpu)    : gpu, AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60140.9]
custom_selector(cpu)    : No device of requested type available.
custom_selector(acc)    : No device of requested type available.

Additional context

No response

@kaanolgu kaanolgu added bug Something isn't working hip Issues related to execution on HIP backend. labels Nov 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working hip Issues related to execution on HIP backend.
Projects
None yet
Development

No branches or pull requests

1 participant