WIP: test v6e TPU support #7

Draft: wants to merge 2 commits into base: master


filipedeo

I was just trying to validate some assumptions about how things are connected, so this draft is far from complete.

After modifying this repo, I'd go to generative-recommenders and install the local version by running:

SKYPILOT_SOURCE_FOLDER=~/src/github.com/Shopify/skypilot
SCRIPT_DIR=$(dirname "$(realpath "${BASH_SOURCE[0]}")")
WHEEL_DIR=~/.sky/wheels/
START_DIR=$(pwd)

echo "${SCRIPT_DIR}"
# Clean up the old wheels
rm -rf "${WHEEL_DIR}"*

# Build the wheel
cd "${SKYPILOT_SOURCE_FOLDER}"
pip wheel -w "${WHEEL_DIR}" -e '.[gcp,kubernetes]'

# Update the pipx-managed SkyPilot install
cd "${SCRIPT_DIR}/.pipx/home/venvs/skypilot-nightly/"
source bin/activate
pipx install --force --editable "${SKYPILOT_SOURCE_FOLDER}[gcp,kubernetes]"
deactivate

# Go back to the starting directory
cd "${START_DIR}"

I didn't get as far as I wanted: I still need to work through the node resources/labels and how SkyPilot checks them to determine whether the pod can be scheduled. Example log:

❯ sky launch  ./skypilot/jax_v6e.yaml
Task from YAML spec: ./skypilot/jax_v6e.yaml
I 10-16 20:26:05 resources.py:566] Accelerators: {'tpu-v6e-4': 1}
I 10-16 20:26:05 resources.py:566] Accelerators: {'tpu-v6e-4': 1}
I 10-16 20:26:06 resources.py:566] Accelerators: {'tpu-v6e-4': 1}
I 10-16 20:26:06 resources.py:582] Using instance type "2CPU--8GB--1tpu-v6e-4" for TPU on Kubernetes.
I 10-16 20:26:06 optimizer.py:719] == Optimizer ==
I 10-16 20:26:06 optimizer.py:742] Estimated cost: $0.0 / hour
I 10-16 20:26:06 optimizer.py:742]
I 10-16 20:26:06 optimizer.py:867] Considered resources (1 node):
I 10-16 20:26:06 optimizer.py:937] -------------------------------------------------------------------------------------------------------------------------------------------------------------------
I 10-16 20:26:06 optimizer.py:937]  CLOUD        INSTANCE                vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE                                                             COST ($)   CHOSEN
I 10-16 20:26:06 optimizer.py:937] -------------------------------------------------------------------------------------------------------------------------------------------------------------------
I 10-16 20:26:06 optimizer.py:937]  Kubernetes   2CPU--8GB--1tpu-v6e-4   2       8         tpu-v6e-4:1    gke_shopify-ml-offline-sandbox_us-east5_ml-offline-sandbox-us-ea5-ft9   0.00          ✔
I 10-16 20:26:06 optimizer.py:937] -------------------------------------------------------------------------------------------------------------------------------------------------------------------
I 10-16 20:26:06 optimizer.py:937]
Launching a new cluster 'sky-1717-fdeo'. Proceed? [Y/n]: y
I 10-16 20:26:20 cloud_vm_ray_backend.py:4462] Creating a new cluster: 'sky-1717-fdeo' [1x Kubernetes(2CPU--8GB--1tpu-v6e-4, {'tpu-v6e-4': 1}, accelerator_args={'runtime_version': 'v2-alpha-tpuv6e'})].
I 10-16 20:26:20 cloud_vm_ray_backend.py:4462] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 10-16 20:26:21 cloud_vm_ray_backend.py:1980] Attempting to provision: Kubernetes(2CPU--8GB--1tpu-v6e-4, {'tpu-v6e-4': 1}, accelerator_args={'runtime_version': 'v2-alpha-tpuv6e'})
I 10-16 20:26:21 cloud_vm_ray_backend.py:1315] To view detailed progress: tail -n100 -f /Users/fdeo/sky_logs/sky-2024-10-16-20-26-05-475351/provision.log
I 10-16 20:26:21 cloud_vm_ray_backend.py:1389] Attempting to provision in region: gke_shopify-ml-offline-sandbox_us-east5_ml-offline-sandbox-us-ea5-ft9, zones: None
I 10-16 20:26:28 resources.py:566] Accelerators: {'tpu-v6e-4': 1}
I 10-16 20:26:28 resources.py:582] Using instance type "2CPU--8GB--1tpu-v6e-4" for TPU on Kubernetes.
I 10-16 20:26:28 provisioner.py:62] Launching on Kubernetes 'sky-1717-fdeo'.
I 10-16 20:26:31 resources.py:566] Accelerators: {'tpu-v6e-4': 1}
I 10-16 20:26:31 resources.py:582] Using instance type "2CPU--8GB--1tpu-v6e-4" for TPU on Kubernetes.
E 10-16 20:26:31 cloud_vm_ray_backend.py:1639] All provisioning attempts failed. Raising ResourcesUnavailableError.
W 10-16 20:26:31 cloud_vm_ray_backend.py:2013] TPU resource detected. ResourcesUnavailableError details: Failed to acquire resources in all zones in gke_shopify-ml-offline-sandbox_us-east5_ml-offline-sandbox-us-ea5-ft9. Try changing resource requirements or use another region.
W 10-16 20:26:31 cloud_vm_ray_backend.py:2014] Continuing with provisioning despite the error.
W 10-16 20:26:31 cloud_vm_ray_backend.py:2821] TPU resource detected. Suppressing ResourcesUnavailableError and continuing with the current configuration.
Clusters
No existing clusters.

sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x Kubernetes({'tpu-v6e-4': 1}, accelerator_args={'runtime_version': 'v2-alpha-tpuv6e'})
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
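
To debug the node resources/labels question independently of SkyPilot, a small standalone check against the output of `kubectl get nodes -o json` may help. This is only a sketch and not how SkyPilot itself performs the check. It assumes GKE TPU node pools expose the labels `cloud.google.com/gke-tpu-accelerator` and `cloud.google.com/gke-tpu-topology` and the `google.com/tpu` allocatable resource. The function name `find_schedulable_tpu_nodes` and the example values `tpu-v6e-slice` and `2x2` are assumptions for illustration and should be verified against the actual cluster.

```python
from typing import Any, Dict, List

# Assumed GKE conventions; verify against the labels on your own nodes.
TPU_ACCELERATOR_LABEL = "cloud.google.com/gke-tpu-accelerator"
TPU_TOPOLOGY_LABEL = "cloud.google.com/gke-tpu-topology"
TPU_RESOURCE = "google.com/tpu"


def find_schedulable_tpu_nodes(
    nodes: List[Dict[str, Any]],
    accelerator: str,
    topology: str,
    chips: int,
) -> List[str]:
    """Return names of nodes whose labels and allocatable TPU count match.

    `nodes` is the "items" list from `kubectl get nodes -o json`.
    Only allocatable capacity is checked; chips already claimed by
    running pods are not subtracted.
    """
    matches = []
    for node in nodes:
        metadata = node.get("metadata", {})
        labels = metadata.get("labels", {})
        allocatable = node.get("status", {}).get("allocatable", {})
        if labels.get(TPU_ACCELERATOR_LABEL) != accelerator:
            continue
        if labels.get(TPU_TOPOLOGY_LABEL) != topology:
            continue
        # Kubernetes reports extended resource quantities as strings.
        if int(allocatable.get(TPU_RESOURCE, "0")) < chips:
            continue
        matches.append(metadata.get("name", "<unnamed>"))
    return matches
```

A possible usage: save the node list with `kubectl get nodes -o json > nodes.json`, load it with `json.load(...)["items"]`, and call `find_schedulable_tpu_nodes(items, "tpu-v6e-slice", "2x2", 4)`. An empty result would suggest that the label or resource values requested for `tpu-v6e-4` do not match what the nodes advertise.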

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

filipedeo changed the title from "Fdeo/tpu v6e" to "WIP: test v6e TPU support" on Oct 17, 2024