[core][experimental] Add multi-GPU CI tests for accelerated DAG #45259

Merged
merged 67 commits on May 15, 2024

Commits
2d22636  TorchTensor wrappers (stephanie-wang, Apr 17, 2024)
3a75b5a  test (stephanie-wang, Apr 17, 2024)
0a8a8da  copy (stephanie-wang, Apr 17, 2024)
32eaec9  update (stephanie-wang, Apr 17, 2024)
0f8d092  torch device (stephanie-wang, Apr 18, 2024)
e067f5f  errors (stephanie-wang, Apr 18, 2024)
e0774b8  test (stephanie-wang, Apr 18, 2024)
d065935  GPU (Apr 18, 2024)
f0813c4  temp benchmark (Apr 18, 2024)
5fb4166  with_type_hint (stephanie-wang, Apr 24, 2024)
84fe2c0  skip GPU tests (stephanie-wang, Apr 24, 2024)
6375fd0  Merge remote-tracking branch 'upstream/master' into dag-gpu-channels (stephanie-wang, Apr 24, 2024)
5bab410  clean (stephanie-wang, Apr 25, 2024)
611c6aa  init nccl group (Apr 25, 2024)
3f16871  NCCL channel (Apr 26, 2024)
b917eb8  NCCL group works (Apr 29, 2024)
fea789b  micro (Apr 30, 2024)
388621d  update (stephanie-wang, Apr 30, 2024)
c35b3bb  Merge remote-tracking branch 'upstream/master' into dag-gpu-channels (stephanie-wang, Apr 30, 2024)
3575431  torch (stephanie-wang, Apr 30, 2024)
e667dae  TODO (Apr 30, 2024)
fb93b06  TODO (Apr 30, 2024)
4fb6897  Merge branch 'master' into dag-gpu-channels (stephanie-wang, May 1, 2024)
7334e8a  typing (stephanie-wang, May 1, 2024)
fd59817  Merge branch 'dag-gpu-channels' of github.com:stephanie-wang/ray into… (stephanie-wang, May 1, 2024)
2788aa3  fix deadlock on shutdown (May 1, 2024)
cd8ae72  Merge remote-tracking branch 'origin/dag-gpu-channels' into dag-nccl (May 1, 2024)
601e600  files (May 2, 2024)
649cbea  missing files (May 2, 2024)
3d39945  move (stephanie-wang, May 2, 2024)
0b10d32  lint (May 2, 2024)
f074c29  x (May 2, 2024)
c5ae955  Merge remote-tracking branch 'upstream/master' into dag-nccl (stephanie-wang, May 3, 2024)
fdd277f  lint (stephanie-wang, May 3, 2024)
e21d36b  comment (stephanie-wang, May 3, 2024)
31b02ef  refactor (May 3, 2024)
1b01954  doc (May 3, 2024)
ef19f4e  call get_unique_id on nccl actor (May 8, 2024)
69342ca  avoid nccl util import (May 8, 2024)
f218a85  Update python/ray/dag/compiled_dag_node.py (stephanie-wang, May 8, 2024)
8489f9b  update (May 8, 2024)
3a9049d  Merge remote-tracking branch 'upstream/master' into dag-nccl (May 8, 2024)
450a228  Merge branch 'dag-nccl' of github.com:stephanie-wang/ray into dag-nccl (May 8, 2024)
1b7ac31  update (stephanie-wang, May 9, 2024)
4a37388  Multiple GPU test (stephanie-wang, May 9, 2024)
acff481  multi gpu (stephanie-wang, May 9, 2024)
1ca7107  CI (stephanie-wang, May 9, 2024)
75b4daf  lint (stephanie-wang, May 9, 2024)
b5afe24  microbenchmark (stephanie-wang, May 10, 2024)
c208108  x (stephanie-wang, May 10, 2024)
97064af  core GPU build (stephanie-wang, May 10, 2024)
ed5e907  remove multi-GPU CI tests (stephanie-wang, May 10, 2024)
ba38c92  test (May 10, 2024)
4e9df06  Merge branch 'dag-nccl' of github.com:stephanie-wang/ray into dag-nccl (stephanie-wang, May 10, 2024)
2f7c78e  doc (stephanie-wang, May 10, 2024)
c27d4db  Revert "remove multi-GPU CI tests" (stephanie-wang, May 11, 2024)
b1ce16d  Merge remote-tracking branch 'upstream/master' into dag-nccl (stephanie-wang, May 11, 2024)
c2017be  CI (stephanie-wang, May 11, 2024)
c8599cb  CUDA_VISIBLE_DEVICES? (stephanie-wang, May 13, 2024)
96082dc  debug (stephanie-wang, May 13, 2024)
8ddee51  fix (stephanie-wang, May 13, 2024)
e4979e4  debug (stephanie-wang, May 13, 2024)
d1806c1  fix (stephanie-wang, May 14, 2024)
997fd8e  fix build? (stephanie-wang, May 14, 2024)
5dab532  build (stephanie-wang, May 14, 2024)
7ec0dc6  fix? (stephanie-wang, May 14, 2024)
91e8a4b  remove debug (stephanie-wang, May 14, 2024)

Files changed
14 changes: 14 additions & 0 deletions .buildkite/core.rayci.yml
@@ -327,3 +327,17 @@ steps:
      - forge
      - raycpubase
      - corebuild

  - label: ":ray: core: multi gpu tests"
    tags:
      - accelerated_dag
      - gpu
    instance_type: gpu-large
    commands:
      # This machine has 4 GPUs, and we need 2 GPUs, so allow 2 tests to run in
      # parallel.
      - bazel run //ci/ray_ci:test_in_docker -- //python/ray/tests/... //python/ray/dag/... core
        --parallelism-per-worker 2 --gpus 2
        --build-name coregpubuild
        --only-tags multi_gpu
Review comment (collaborator): add '|| true', since if the test fails it won't reach that sleep statement.

    depends_on: coregpubuild
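
As a rough orientation for what this CI step runs, below is a minimal sketch of a multi_gpu-style test: two Ray actors that each claim one of the two reserved GPUs and exchange a torch tensor. The test name, tensor size, and structure are assumptions for illustration, not code from this PR.

# Hypothetical multi-GPU test sketch (names are assumptions, not from this diff).
import ray
import torch


@ray.remote(num_gpus=1)
class GPUWorker:
    def produce(self):
        # Allocate a tensor on this actor's GPU.
        return torch.ones(1000, device="cuda")

    def consume(self, tensor):
        # Move the received tensor onto this actor's GPU (wherever Ray placed it) and reduce.
        return float(tensor.to("cuda").sum())


def test_tensor_exchange_between_two_gpus():
    ray.init(num_gpus=2)
    try:
        sender, receiver = GPUWorker.remote(), GPUWorker.remote()
        assert ray.get(receiver.consume.remote(sender.produce.remote())) == 1000.0
    finally:
        ray.shutdown()

Each such test needs both GPUs at once, which is why the step above reserves 2 GPUs per test and caps parallelism at 2 on the 4-GPU machine.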
6 changes: 1 addition & 5 deletions ci/docker/core.build.Dockerfile
@@ -18,11 +18,7 @@ RUN <<EOF

 set -euo pipefail

-pip install -U --ignore-installed \
-  -c python/requirements_compiled.txt \
-  -r python/requirements.txt \
-  -r python/requirements/test-requirements.txt \
-  -r python/requirements/ml/dl-cpu-requirements.txt
+DL=1 ./ci/env/install-dependencies.sh

 if [[ "$RAYCI_IS_GPU_BUILD" == "true" ]]; then
   pip install -Ur ./python/requirements/ml/dl-gpu-requirements.txt
3 changes: 3 additions & 0 deletions ci/docker/core.build.wanda.yaml
@@ -2,11 +2,14 @@ name: "corebuild-py$PYTHON"
froms: ["cr.ray.io/rayproject/oss-ci-base_build-py$PYTHON"]
dockerfile: ci/docker/core.build.Dockerfile
srcs:
  - ci/env/install-dependencies.sh
  - python/requirements.txt
  - python/requirements_compiled.txt
  - python/requirements/test-requirements.txt
  - python/requirements/ml/dl-cpu-requirements.txt
  - python/requirements/ml/dl-gpu-requirements.txt
build_args:
  - DOCKER_IMAGE_BASE_BUILD=cr.ray.io/rayproject/oss-ci-base_build-py$PYTHON
  - RAYCI_IS_GPU_BUILD
tags:
  - cr.ray.io/rayproject/corebuild-py$PYTHON
13 changes: 13 additions & 0 deletions ci/pipeline/determine_tests_to_run.py
@@ -108,6 +108,7 @@ def get_commit_range():
    RAY_CI_WORKFLOW_AFFECTED = 0
    RAY_CI_RELEASE_TESTS_AFFECTED = 0
    RAY_CI_COMPILED_PYTHON_AFFECTED = 0
    RAY_CI_ACCELERATED_DAG_AFFECTED = 0

    if is_pull_request():
        commit_range = get_commit_range()
@@ -258,6 +259,14 @@ def get_commit_range():
                if changed_file.endswith(compiled_extension):
                    RAY_CI_COMPILED_PYTHON_AFFECTED = 1
                    break

            # Some accelerated DAG tests require GPUs so we only run them
            # if Ray DAGs or experimental.channels were affected.
            if changed_file.startswith("python/ray/dag") or changed_file.startswith(
                "python/ray/experimental/channel"
            ):
                RAY_CI_ACCELERATED_DAG_AFFECTED = 1

        elif changed_file == ".buildkite/core.rayci.yml":
            RAY_CI_PYTHON_AFFECTED = 1
            RAY_CI_CORE_CPP_AFFECTED = 1
@@ -377,6 +386,7 @@ def get_commit_range():
            RAY_CI_MACOS_WHEELS_AFFECTED = 1
            RAY_CI_DASHBOARD_AFFECTED = 1
            RAY_CI_RELEASE_TESTS_AFFECTED = 1
            RAY_CI_ACCELERATED_DAG_AFFECTED = 1
        else:
            print(
                "Unhandled source code change: {changed_file}".format(
@@ -458,6 +468,9 @@ def get_commit_range():
            "RAY_CI_COMPILED_PYTHON_AFFECTED={}".format(
                RAY_CI_COMPILED_PYTHON_AFFECTED
            ),
            "RAY_CI_ACCELERATED_DAG_AFFECTED={}".format(
                RAY_CI_ACCELERATED_DAG_AFFECTED
            ),
        ]
    )

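As a quick illustration of the new gating rule above (a standalone helper written for this summary, not part of the script), these are the kinds of changed paths that set the flag:

# Illustrative helper mirroring the prefix rule added above (not part of the script itself).
def affects_accelerated_dag(changed_file: str) -> bool:
    return changed_file.startswith("python/ray/dag") or changed_file.startswith(
        "python/ray/experimental/channel"
    )


assert affects_accelerated_dag("python/ray/dag/compiled_dag_node.py")
assert affects_accelerated_dag("python/ray/experimental/channel/torch_tensor_type.py")
assert not affects_accelerated_dag("python/ray/serve/handle.py")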
2 changes: 1 addition & 1 deletion python/ray/dag/BUILD
@@ -2,7 +2,7 @@ load("//bazel:python.bzl", "doctest")
 load("//bazel:python.bzl", "py_test_module_list")

 doctest(
-    files = glob(["**/*.py"]),
+    files = glob(["**/*.py"], exclude=["**/experimental/**/*.py"]),
     tags = ["team:core"],
     deps = [":dag_lib"]
 )
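The multi-GPU CI step earlier selects targets by Bazel tag (--only-tags multi_gpu). For illustration only, a BUILD entry that such a filter would match might look like the following; the file name, size, and companion tags here are assumptions, not lines from this diff.

py_test_module_list(
    files = [
        "tests/experimental/test_torch_tensor_dag.py",  # hypothetical test module
    ],
    size = "large",
    tags = ["exclusive", "multi_gpu", "team:core"],
    deps = [":dag_lib"],
)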
11 changes: 10 additions & 1 deletion python/ray/experimental/channel/torch_tensor_type.py
@@ -31,9 +31,18 @@ def __init__(
     def register_custom_serializer(outer: Any) -> None:
         # Helper method to run on the DAG driver and actors to register custom
         # serializers.
+        import torch
+
         from ray.air._internal import torch_utils

-        default_device = torch_utils.get_devices()[0]
+        if ray.get_gpu_ids():
+            default_device = torch_utils.get_devices()[0]
+        else:
+            # torch_utils defaults to returning GPU 0 if no
+            # GPU IDs were assigned by Ray. We instead want
+            # the default to be CPU.
+            default_device = torch.device("cpu")

         torch_tensor_serializer = _TorchTensorSerializer(default_device)

         CUSTOM_SERIALIZERS = (
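To make the device-selection rule above concrete, here is a rough standalone illustration (assumed setup, not the actual Ray serialization path): the default device follows Ray's GPU assignment for the current process and falls back to CPU when Ray assigned no GPUs.

# Rough illustration of the rule above (not the actual Ray code path).
import ray
import torch


def default_torch_device() -> torch.device:
    # Use a GPU only when Ray has assigned one to this process; otherwise prefer CPU,
    # rather than defaulting to GPU 0 as torch_utils.get_devices() would.
    return torch.device("cuda") if ray.get_gpu_ids() else torch.device("cpu")


@ray.remote(num_gpus=1)
def gpu_task():
    return str(default_torch_device())


ray.init()
print(default_torch_device())          # driver has no Ray-assigned GPUs -> cpu
print(ray.get(gpu_task.remote()))      # worker with one assigned GPU -> cuda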