Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add run time compilation #11225

Closed
wants to merge 1 commit into from
Closed

Conversation

oerling
Copy link
Contributor

@oerling oerling commented Oct 10, 2024

  • Adds a CompiledModule abstraction on top of Cuda run time compilation.

  • Adds a cache of run time compiled kernels. The cache returns a kernel immediately and leaves the kernel compiling in the background. The kernel's methods wait for the compilation to be ready.

  • tests that runtime API and driver API streams are interchangeable when running a dynamically generated kernel.

  • Add proper use of contexts, one per device. The contexts are needed because of using the driver API to handle run time compilation.

  • Add device properties to the Device* struct.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 10, 2024
Copy link

netlify bot commented Oct 10, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit bf9260f
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/671811f26096ce000872d3d5

@facebook-github-bot
Copy link
Contributor

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

oerling pushed a commit to oerling/velox-1 that referenced this pull request Oct 11, 2024
Summary:
- Adds a CompiledModule abstraction on top of Cuda run time compilation.

- Adds a cache of run time compiled kernels. The cache returns a kernel immediately and leaves the kernel compiling in the background. The kernel's methods wait for the compilation to be ready.

- tests that runtime API and driver API streams are interchangeable when running a dynamically generated kernel.

- Add proper use of contexts, one per device. The contexts are needed because of using the driver API to handle run time compilation.

- Add device properties to the Device* struct.


Differential Revision: D64205005

Pulled By: oerling
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64205005

oerling pushed a commit to oerling/velox-1 that referenced this pull request Oct 11, 2024
Summary:
- Adds a CompiledModule abstraction on top of Cuda run time compilation.

- Adds a cache of run time compiled kernels. The cache returns a kernel immediately and leaves the kernel compiling in the background. The kernel's methods wait for the compilation to be ready.

- tests that runtime API and driver API streams are interchangeable when running a dynamically generated kernel.

- Add proper use of contexts, one per device. The contexts are needed because of using the driver API to handle run time compilation.

- Add device properties to the Device* struct.


Differential Revision: D64205005

Pulled By: oerling
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64205005

@facebook-github-bot
Copy link
Contributor

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

oerling pushed a commit to oerling/velox-1 that referenced this pull request Oct 16, 2024
Summary:
- Adds a CompiledModule abstraction on top of Cuda run time compilation.

- Adds a cache of run time compiled kernels. The cache returns a kernel immediately and leaves the kernel compiling in the background. The kernel's methods wait for the compilation to be ready.

- tests that runtime API and driver API streams are interchangeable when running a dynamically generated kernel.

- Add proper use of contexts, one per device. The contexts are needed because of using the driver API to handle run time compilation.

- Add device properties to the Device* struct.


Differential Revision: D64205005

Pulled By: oerling
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64205005

@facebook-github-bot
Copy link
Contributor

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@oerling oerling force-pushed the wavegen-pr branch 2 times, most recently from 4fec6d6 to 6b94b1e Compare October 21, 2024 21:23
@facebook-github-bot
Copy link
Contributor

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

oerling pushed a commit to oerling/velox-1 that referenced this pull request Oct 21, 2024
Summary:
- Adds a CompiledModule abstraction on top of Cuda run time compilation.

- Adds a cache of run time compiled kernels. The cache returns a kernel immediately and leaves the kernel compiling in the background. The kernel's methods wait for the compilation to be ready.

- tests that runtime API and driver API streams are interchangeable when running a dynamically generated kernel.

- Add proper use of contexts, one per device. The contexts are needed because of using the driver API to handle run time compilation.

- Add device properties to the Device* struct.


Differential Revision: D64205005

Pulled By: oerling
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64205005

@@ -381,6 +381,7 @@ if(${VELOX_ENABLE_GPU})
add_compile_options("$<$<COMPILE_LANGUAGE:CUDA>:-G>")
endif()
include_directories("${CMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES}")
find_package(CUDAToolkit REQUIRED)
Copy link
Collaborator

@assignUser assignUser Oct 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
find_package(CUDAToolkit REQUIRED)
include(FindCUDAToolkit)

See https://cmake.org/cmake/help/v3.23/module/FindCUDAToolkit.html#imported-targets

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm... it's found which should then also create the targets.
After some local investigation (in the container) it seems we need to install the nvrtc library explicitly but even then cmake fails to find the library location for some reason. In a standalone cml.txt without anything but the include the target is not created because the lib is not found:

CUDA_nvrtc_LIBRARY-NOTFOUND

even if the lib is clearly on the system:

/usr/local/cuda-12.4/targets/x86_64-linux/lib/libnvrtc.so.12

and the path is correct in the find command:

find_library(CUDA_nvrtc_LIBRARY NAMES nvrtc HINTS /usr/local/cuda-12.4/targets/x86_64-linux/lib;/usr/local/cuda/targets/x86_64-linux/lib/stubs;/usr/local/cuda/targets/x86_64-linux/lib ENV CUDA_PATH PATH_SUFFIXES lib64/stubs lib/x64/stubs lib/stubs stubs )

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my own machine the same script works fine, used the same current cmake in the container as well but it still fails the same way.

velox_exception
velox_common_base
velox_type
CUDA::nvrtc)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to work without it but it probably makes sense to explicitly link to the runtime aswell via CUDA::cudart

@assignUser
Copy link
Collaborator

We were missing the nvrtc devel lib in the setup script, here is a diff that should fix it:

diff --git a/scripts/setup-centos9.sh b/scripts/setup-centos9.sh
index d8de1b50c..6d233eb19 100755
--- a/scripts/setup-centos9.sh
+++ b/scripts/setup-centos9.sh
@@ -220,7 +220,8 @@ function install_arrow {
 function install_cuda {
   # See https://developer.nvidia.com/cuda-downloads
   dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
-  dnf install -y cuda-nvcc-$(echo $1 | tr '.' '-') cuda-cudart-devel-$(echo $1 | tr '.' '-')
+  local dashed="$(echo $1 | tr '.' '-')"
+  dnf install -y cuda-nvcc-$dashed cuda-cudart-devel-$dashed cuda-nvrtc-devel-$dashed
 }
 
 function install_velox_deps {

@facebook-github-bot
Copy link
Contributor

@oerling has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

oerling pushed a commit to oerling/velox-1 that referenced this pull request Oct 22, 2024
Summary:
- Adds a CompiledModule abstraction on top of Cuda run time compilation.

- Adds a cache of run time compiled kernels. The cache returns a kernel immediately and leaves the kernel compiling in the background. The kernel's methods wait for the compilation to be ready.

- tests that runtime API and driver API streams are interchangeable when running a dynamically generated kernel.

- Add proper use of contexts, one per device. The contexts are needed because of using the driver API to handle run time compilation.

- Add device properties to the Device* struct.


Differential Revision: D64205005

Pulled By: oerling
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64205005

@Yuhta
Copy link
Contributor

Yuhta commented Oct 22, 2024

@assignUser The build was still failing with https://github.com/facebookincubator/velox/actions/runs/11467400697/job/31912022139, so we disabled the GPU build temporarily. Can you take a look at it to see what will make it to re-enable it in GitHub when you have time?

Summary:
- Adds a CompiledModule abstraction on top of Cuda run time compilation.

- Adds a cache of run time compiled kernels. The cache returns a kernel immediately and leaves the kernel compiling in the background. The kernel's methods wait for the compilation to be ready.

- tests that runtime API and driver API streams are interchangeable when running a dynamically generated kernel.

- Add proper use of contexts, one per device. The contexts are needed because of using the driver API to handle run time compilation.

- Add device properties to the Device* struct.


Reviewed By: Yuhta

Differential Revision: D64205005

Pulled By: oerling
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D64205005

@facebook-github-bot
Copy link
Contributor

@oerling merged this pull request in f8397bc.

Copy link

Conbench analyzed the 1 benchmark run on commit f8397bce.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

@assignUser
Copy link
Collaborator

@Yuhta actually that error was unrelated and the actual gpu build worked with the patch ^^ but no worries I'll open a PR to turn it back on.

facebook-github-bot pushed a commit that referenced this pull request Oct 28, 2024
Summary:
#11225 requires CUDA's nvrtc library  and the cuda driver stubs (on machines without a gpu) to be available.

* Install nvrct and stubs in centos and ubuntu scripts
* Turn GPU build back on.
* Add missing links and remove some superflous ones
* Turn all targets that link directly or indirectly against CUDA::cuda_driver into standalone targets as the stubbed symbols will throw a dload error on the gpu less runner. This way we keep them out of the mono library and avoid throwing errors in non-gpu tests.
* Exclude tests that use the cuda driver stubs via label

Pull Request resolved: #11335

Reviewed By: Yuhta

Differential Revision: D65067732

Pulled By: pedroerp

fbshipit-source-id: 4e33222659cf196ca0869ec98c5f35f7a27ee7da
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants