-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add run time compilation #11225
Add run time compilation #11225
Conversation
✅ Deploy Preview for meta-velox canceled.
|
@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: - Adds a CompiledModule abstraction on top of Cuda run time compilation. - Adds a cache of run time compiled kernels. The cache returns a kernel immediately and leaves the kernel compiling in the background. The kernel's methods wait for the compilation to be ready. - tests that runtime API and driver API streams are interchangeable when running a dynamically generated kernel. - Add proper use of contexts, one per device. The contexts are needed because of using the driver API to handle run time compilation. - Add device properties to the Device* struct. Differential Revision: D64205005 Pulled By: oerling
This pull request was exported from Phabricator. Differential Revision: D64205005 |
Summary: - Adds a CompiledModule abstraction on top of Cuda run time compilation. - Adds a cache of run time compiled kernels. The cache returns a kernel immediately and leaves the kernel compiling in the background. The kernel's methods wait for the compilation to be ready. - tests that runtime API and driver API streams are interchangeable when running a dynamically generated kernel. - Add proper use of contexts, one per device. The contexts are needed because of using the driver API to handle run time compilation. - Add device properties to the Device* struct. Differential Revision: D64205005 Pulled By: oerling
This pull request was exported from Phabricator. Differential Revision: D64205005 |
@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: - Adds a CompiledModule abstraction on top of Cuda run time compilation. - Adds a cache of run time compiled kernels. The cache returns a kernel immediately and leaves the kernel compiling in the background. The kernel's methods wait for the compilation to be ready. - tests that runtime API and driver API streams are interchangeable when running a dynamically generated kernel. - Add proper use of contexts, one per device. The contexts are needed because of using the driver API to handle run time compilation. - Add device properties to the Device* struct. Differential Revision: D64205005 Pulled By: oerling
This pull request was exported from Phabricator. Differential Revision: D64205005 |
@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
4fec6d6
to
6b94b1e
Compare
@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: - Adds a CompiledModule abstraction on top of Cuda run time compilation. - Adds a cache of run time compiled kernels. The cache returns a kernel immediately and leaves the kernel compiling in the background. The kernel's methods wait for the compilation to be ready. - tests that runtime API and driver API streams are interchangeable when running a dynamically generated kernel. - Add proper use of contexts, one per device. The contexts are needed because of using the driver API to handle run time compilation. - Add device properties to the Device* struct. Differential Revision: D64205005 Pulled By: oerling
This pull request was exported from Phabricator. Differential Revision: D64205005 |
@@ -381,6 +381,7 @@ if(${VELOX_ENABLE_GPU}) | |||
add_compile_options("$<$<COMPILE_LANGUAGE:CUDA>:-G>") | |||
endif() | |||
include_directories("${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES}") | |||
find_package(CUDAToolkit REQUIRED) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
find_package(CUDAToolkit REQUIRED) | |
include(FindCUDAToolkit) |
See https://cmake.org/cmake/help/v3.23/module/FindCUDAToolkit.html#imported-targets
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm... it's found which should then also create the targets.
After some local investigation (in the container) it seems we need to install the nvrtc library explicitly but even then cmake fails to find the library location for some reason. In a standalone cml.txt without anything but the include the target is not created because the lib is not found:
CUDA_nvrtc_LIBRARY-NOTFOUND
even if the lib is clearly on the system:
/usr/local/cuda-12.4/targets/x86_64-linux/lib/libnvrtc.so.12
and the path is correct in the find command:
find_library(CUDA_nvrtc_LIBRARY NAMES nvrtc HINTS /usr/local/cuda-12.4/targets/x86_64-linux/lib;/usr/local/cuda/targets/x86_64-linux/lib/stubs;/usr/local/cuda/targets/x86_64-linux/lib ENV CUDA_PATH PATH_SUFFIXES lib64/stubs lib/x64/stubs lib/stubs stubs )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On my own machine the same script works fine, used the same current cmake in the container as well but it still fails the same way.
velox_exception | ||
velox_common_base | ||
velox_type | ||
CUDA::nvrtc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to work without it but it probably makes sense to explicitly link to the runtime aswell via CUDA::cudart
We were missing the nvrtc devel lib in the setup script, here is a diff that should fix it: diff --git a/scripts/setup-centos9.sh b/scripts/setup-centos9.sh
index d8de1b50c..6d233eb19 100755
--- a/scripts/setup-centos9.sh
+++ b/scripts/setup-centos9.sh
@@ -220,7 +220,8 @@ function install_arrow {
function install_cuda {
# See https://developer.nvidia.com/cuda-downloads
dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
- dnf install -y cuda-nvcc-$(echo $1 | tr '.' '-') cuda-cudart-devel-$(echo $1 | tr '.' '-')
+ local dashed="$(echo $1 | tr '.' '-')"
+ dnf install -y cuda-nvcc-$dashed cuda-cudart-devel-$dashed cuda-nvrtc-devel-$dashed
}
function install_velox_deps { |
@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: - Adds a CompiledModule abstraction on top of Cuda run time compilation. - Adds a cache of run time compiled kernels. The cache returns a kernel immediately and leaves the kernel compiling in the background. The kernel's methods wait for the compilation to be ready. - tests that runtime API and driver API streams are interchangeable when running a dynamically generated kernel. - Add proper use of contexts, one per device. The contexts are needed because of using the driver API to handle run time compilation. - Add device properties to the Device* struct. Differential Revision: D64205005 Pulled By: oerling
This pull request was exported from Phabricator. Differential Revision: D64205005 |
@assignUser The build was still failing with https://github.com/facebookincubator/velox/actions/runs/11467400697/job/31912022139, so we disabled the GPU build temporarily. Can you take a look at it to see what will make it to re-enable it in GitHub when you have time? |
Summary: - Adds a CompiledModule abstraction on top of Cuda run time compilation. - Adds a cache of run time compiled kernels. The cache returns a kernel immediately and leaves the kernel compiling in the background. The kernel's methods wait for the compilation to be ready. - tests that runtime API and driver API streams are interchangeable when running a dynamically generated kernel. - Add proper use of contexts, one per device. The contexts are needed because of using the driver API to handle run time compilation. - Add device properties to the Device* struct. Reviewed By: Yuhta Differential Revision: D64205005 Pulled By: oerling
This pull request was exported from Phabricator. Differential Revision: D64205005 |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
@Yuhta actually that error was unrelated and the actual gpu build worked with the patch ^^ but no worries I'll open a PR to turn it back on. |
Summary: #11225 requires CUDA's nvrtc library and the cuda driver stubs (on machines without a gpu) to be available. * Install nvrct and stubs in centos and ubuntu scripts * Turn GPU build back on. * Add missing links and remove some superflous ones * Turn all targets that link directly or indirectly against CUDA::cuda_driver into standalone targets as the stubbed symbols will throw a dload error on the gpu less runner. This way we keep them out of the mono library and avoid throwing errors in non-gpu tests. * Exclude tests that use the cuda driver stubs via label Pull Request resolved: #11335 Reviewed By: Yuhta Differential Revision: D65067732 Pulled By: pedroerp fbshipit-source-id: 4e33222659cf196ca0869ec98c5f35f7a27ee7da
Adds a CompiledModule abstraction on top of Cuda run time compilation.
Adds a cache of run time compiled kernels. The cache returns a kernel immediately and leaves the kernel compiling in the background. The kernel's methods wait for the compilation to be ready.
tests that runtime API and driver API streams are interchangeable when running a dynamically generated kernel.
Add proper use of contexts, one per device. The contexts are needed because of using the driver API to handle run time compilation.
Add device properties to the Device* struct.