Add run time compilation #11225

oerling · 2024-10-10T21:21:15Z

Adds a CompiledModule abstraction on top of CUDA runtime compilation.
Adds a cache of runtime-compiled kernels. The cache returns a kernel immediately and leaves it compiling in the background. The kernel's methods wait for the compilation to finish.
Tests that runtime API and driver API streams are interchangeable when running a dynamically generated kernel.
Adds proper use of contexts, one per device. The contexts are needed because the driver API is used to handle runtime compilation.
Adds device properties to the Device* struct.
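
For readers who want a feel for the "return immediately, wait inside the methods" behavior of the kernel cache, here is a minimal sketch. It is not the code in this PR: `FakeModule`, `fakeCompile`, `LazyKernel` and `KernelCache` are hypothetical names, the compile step is a sleep standing in for NVRTC, and the use of `std::async` with `std::shared_future` is an assumption made for illustration only.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a compiled module. A real one would hold a
// driver API module handle rather than plain data.
struct FakeModule {
  std::string source;
  int numRegisters;
};

// Hypothetical stand-in for the slow compile step (not the NVRTC API).
inline FakeModule fakeCompile(const std::string& source) {
  assert(!source.empty());
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  return FakeModule{source, static_cast<int>(source.size())};
}

// Handle returned by the cache. Accessors block until compilation is done.
class LazyKernel {
 public:
  explicit LazyKernel(std::shared_future<FakeModule> module)
      : module_(std::move(module)) {}

  // Waits for the background compile, then reads from its result.
  int numRegisters() const {
    return module_.get().numRegisters;
  }

  // Non-blocking readiness probe.
  bool isReady() const {
    return module_.wait_for(std::chrono::seconds(0)) ==
        std::future_status::ready;
  }

 private:
  std::shared_future<FakeModule> module_;
};

class KernelCache {
 public:
  // Returns immediately. On a miss, 'makeSource' and the compile run on a
  // background thread. On a hit, the existing handle is shared and
  // 'makeSource' is never called.
  std::shared_ptr<LazyKernel> get(
      const std::string& key,
      std::function<std::string()> makeSource) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = kernels_.find(key);
    if (it != kernels_.end()) {
      return it->second;
    }
    auto future = std::async(std::launch::async, [makeSource]() {
                    return fakeCompile(makeSource());
                  }).share();
    auto kernel = std::make_shared<LazyKernel>(std::move(future));
    kernels_.emplace(key, kernel);
    return kernel;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, std::shared_ptr<LazyKernel>> kernels_;
};
```

In this sketch a caller can request a kernel early, continue with other setup, and only pay for the compile when it first calls an accessor such as `numRegisters()`. Two requests for the same key share one handle, so the source is compiled once.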

netlify · 2024-10-10T21:21:33Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`bf9260f`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/671811f26096ce000872d3d5

facebook-github-bot · 2024-10-10T21:28:08Z

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-10-10T21:33:45Z

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-10-10T23:45:48Z

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Summary:
- Adds a CompiledModule abstraction on top of Cuda run time compilation.
- Adds a cache of run time compiled kernels. The cache returns a kernel immediately and leaves the kernel compiling in the background. The kernel's methods wait for the compilation to be ready.
- tests that runtime API and driver API streams are interchangeable when running a dynamically generated kernel.
- Add proper use of contexts, one per device. The contexts are needed because of using the driver API to handle run time compilation.
- Add device properties to the Device* struct.

Differential Revision: D64205005
Pulled By: oerling

facebook-github-bot · 2024-10-11T19:18:31Z

This pull request was exported from Phabricator. Differential Revision: D64205005

facebook-github-bot · 2024-10-11T23:40:22Z

This pull request was exported from Phabricator. Differential Revision: D64205005

facebook-github-bot · 2024-10-16T21:22:37Z

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-10-16T22:18:05Z

This pull request was exported from Phabricator. Differential Revision: D64205005

facebook-github-bot · 2024-10-21T14:14:49Z

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-10-21T21:25:15Z

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-10-21T21:35:25Z

This pull request was exported from Phabricator. Differential Revision: D64205005

assignUser · 2024-10-21T21:35:32Z

CMakeLists.txt

@@ -381,6 +381,7 @@ if(${VELOX_ENABLE_GPU})
    add_compile_options("$<$<COMPILE_LANGUAGE:CUDA>:-G>")
  endif()
  include_directories("${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES}")
+  find_package(CUDAToolkit REQUIRED)


Suggested change

-  find_package(CUDAToolkit REQUIRED)
+  include(FindCUDAToolkit)

See https://cmake.org/cmake/help/v3.23/module/FindCUDAToolkit.html#imported-targets

Hm... the toolkit is found, which should then also create the targets.
After some local investigation (in the container), it seems we need to install the nvrtc library explicitly, but even then CMake fails to find the library location for some reason. In a standalone CMakeLists.txt containing nothing but the include, the target is not created because the lib is not found:

CUDA_nvrtc_LIBRARY-NOTFOUND

even if the lib is clearly on the system:

/usr/local/cuda-12.4/targets/x86_64-linux/lib/libnvrtc.so.12

and the path is correct in the find command:

find_library(CUDA_nvrtc_LIBRARY NAMES nvrtc HINTS /usr/local/cuda-12.4/targets/x86_64-linux/lib;/usr/local/cuda/targets/x86_64-linux/lib/stubs;/usr/local/cuda/targets/x86_64-linux/lib ENV CUDA_PATH PATH_SUFFIXES lib64/stubs lib/x64/stubs lib/stubs stubs )

On my own machine the same script works fine. I also used the same current CMake version in the container, but it still fails the same way.

assignUser · 2024-10-21T21:36:04Z

velox/experimental/wave/common/CMakeLists.txt

+  velox_exception
+  velox_common_base
+  velox_type
+  CUDA::nvrtc)


It seems to work without it, but it probably makes sense to explicitly link to the runtime as well, via CUDA::cudart.

assignUser · 2024-10-22T19:15:31Z

We were missing the nvrtc devel lib in the setup script. Here is a diff that should fix it:

diff --git a/scripts/setup-centos9.sh b/scripts/setup-centos9.sh
index d8de1b50c..6d233eb19 100755
--- a/scripts/setup-centos9.sh
+++ b/scripts/setup-centos9.sh
@@ -220,7 +220,8 @@ function install_arrow {
 function install_cuda {
   # See https://developer.nvidia.com/cuda-downloads
   dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
-  dnf install -y cuda-nvcc-$(echo $1 | tr '.' '-') cuda-cudart-devel-$(echo $1 | tr '.' '-')
+  local dashed="$(echo $1 | tr '.' '-')"
+  dnf install -y cuda-nvcc-$dashed cuda-cudart-devel-$dashed cuda-nvrtc-devel-$dashed
 }
 
 function install_velox_deps {

facebook-github-bot · 2024-10-22T19:36:08Z

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-10-22T19:46:08Z

This pull request was exported from Phabricator. Differential Revision: D64205005

Yuhta · 2024-10-22T20:56:11Z

@assignUser The build was still failing with https://github.com/facebookincubator/velox/actions/runs/11467400697/job/31912022139, so we disabled the GPU build temporarily. When you have time, can you take a look and see what it would take to re-enable it on GitHub?

Summary: - Adds a CompiledModule abstraction on top of Cuda run time compilation. - Adds a cache of run time compiled kernels. The cache returns a kernel immediately and leaves the kernel compiling in the background. The kernel's methods wait for the compilation to be ready. - tests that runtime API and driver API streams are interchangeable when running a dynamically generated kernel. - Add proper use of contexts, one per device. The contexts are needed because of using the driver API to handle run time compilation. - Add device properties to the Device* struct. Reviewed By: Yuhta Differential Revision: D64205005 Pulled By: oerling

facebook-github-bot · 2024-10-22T20:58:39Z

This pull request was exported from Phabricator. Differential Revision: D64205005

facebook-github-bot · 2024-10-22T23:41:06Z

@oerling merged this pull request in f8397bc.

conbench-facebook · 2024-10-23T00:09:29Z

Conbench analyzed the 1 benchmark run on commit f8397bce.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

assignUser · 2024-10-23T11:47:42Z

@Yuhta actually that error was unrelated, and the actual GPU build worked with the patch ^^ But no worries, I'll open a PR to turn it back on.

Summary: #11225 requires CUDA's nvrtc library and the CUDA driver stubs (on machines without a GPU) to be available.
* Install nvrtc and stubs in the CentOS and Ubuntu scripts.
* Turn the GPU build back on.
* Add missing links and remove some superfluous ones.
* Turn all targets that link directly or indirectly against CUDA::cuda_driver into standalone targets, as the stubbed symbols will throw a dload error on the GPU-less runner. This way we keep them out of the mono library and avoid throwing errors in non-GPU tests.
* Exclude tests that use the CUDA driver stubs via a label.

Pull Request resolved: #11335
Reviewed By: Yuhta
Differential Revision: D65067732
Pulled By: pedroerp
fbshipit-source-id: 4e33222659cf196ca0869ec98c5f35f7a27ee7da

oerling requested review from assignUser and majetideepak as code owners October 10, 2024 21:21

facebook-github-bot added the CLA Signed label Oct 10, 2024

oerling force-pushed the wavegen-pr branch from 0480a22 to a4e7b81 Compare October 10, 2024 21:32

oerling force-pushed the wavegen-pr branch from a4e7b81 to 779f21d Compare October 10, 2024 23:43

oerling force-pushed the wavegen-pr branch from 779f21d to 01d7aa8 Compare October 11, 2024 19:18

facebook-github-bot added the fb-exported label Oct 11, 2024

oerling force-pushed the wavegen-pr branch from 01d7aa8 to 89fbbd5 Compare October 11, 2024 23:40

oerling force-pushed the wavegen-pr branch from 89fbbd5 to 711a760 Compare October 16, 2024 20:52

oerling force-pushed the wavegen-pr branch from 711a760 to 921a23c Compare October 16, 2024 22:17

oerling force-pushed the wavegen-pr branch from 921a23c to fa17b47 Compare October 21, 2024 14:11

oerling force-pushed the wavegen-pr branch 2 times, most recently from 4fec6d6 to 6b94b1e Compare October 21, 2024 21:23

oerling force-pushed the wavegen-pr branch from 6b94b1e to 16bd50b Compare October 21, 2024 21:35

assignUser reviewed Oct 21, 2024

oerling force-pushed the wavegen-pr branch from 16bd50b to 62cf92c Compare October 22, 2024 18:03

oerling force-pushed the wavegen-pr branch from 62cf92c to bcb198d Compare October 22, 2024 19:27

oerling force-pushed the wavegen-pr branch from bcb198d to 55b69cc Compare October 22, 2024 19:45

Yuhta approved these changes Oct 22, 2024

oerling force-pushed the wavegen-pr branch from 55b69cc to bf9260f Compare October 22, 2024 20:58

facebook-github-bot closed this in f8397bc Oct 22, 2024

facebook-github-bot added the Merged label Oct 22, 2024

assignUser mentioned this pull request Oct 23, 2024

Add nvrtc to setup scripts. #11335

Closed
