WIP: test v6e TPU support #7

filipedeo · 2024-10-17T05:44:08Z

At some point I was just trying to validate some assumptions about how things are connected, so this draft is far from complete.

After modifying this repo, I'd go to generative-recommenders and install the local version by running:

SKYPILOT_SOURCE_FOLDER=~/src/github.com/Shopify/skypilot
SCRIPT_DIR=$(dirname "$(realpath "${BASH_SOURCE[0]}")")
WHEEL_DIR=~/.sky/wheels/
START_DIR=$(pwd)

echo "${SCRIPT_DIR}"
# Clean up the old wheels (":?" aborts if WHEEL_DIR is ever empty)
rm -rf "${WHEEL_DIR:?}"*

# Build the wheel
cd "${SKYPILOT_SOURCE_FOLDER}"
pip wheel -w "${WHEEL_DIR}" -e '.[gcp,kubernetes]'

# Update skypilot inside the pipx venv
cd "${SCRIPT_DIR}/.pipx/home/venvs/skypilot-nightly/"
source bin/activate
pipx install --force --editable "${SKYPILOT_SOURCE_FOLDER}[gcp,kubernetes]"
deactivate

# Return to the starting directory
cd "${START_DIR}"

I didn't get as far as I wanted: I still need to work through the node resources/labels and how skypilot checks them to determine whether the pod can be scheduled. Example log:

❯ sky launch  ./skypilot/jax_v6e.yaml
Task from YAML spec: ./skypilot/jax_v6e.yaml
I 10-16 20:26:05 resources.py:566] Accelerators: {'tpu-v6e-4': 1}
I 10-16 20:26:05 resources.py:566] Accelerators: {'tpu-v6e-4': 1}
I 10-16 20:26:06 resources.py:566] Accelerators: {'tpu-v6e-4': 1}
I 10-16 20:26:06 resources.py:582] Using instance type "2CPU--8GB--1tpu-v6e-4" for TPU on Kubernetes.
I 10-16 20:26:06 optimizer.py:719] == Optimizer ==
I 10-16 20:26:06 optimizer.py:742] Estimated cost: $0.0 / hour
I 10-16 20:26:06 optimizer.py:742]
I 10-16 20:26:06 optimizer.py:867] Considered resources (1 node):
I 10-16 20:26:06 optimizer.py:937] -------------------------------------------------------------------------------------------------------------------------------------------------------------------
I 10-16 20:26:06 optimizer.py:937]  CLOUD        INSTANCE                vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE                                                             COST ($)   CHOSEN
I 10-16 20:26:06 optimizer.py:937] -------------------------------------------------------------------------------------------------------------------------------------------------------------------
I 10-16 20:26:06 optimizer.py:937]  Kubernetes   2CPU--8GB--1tpu-v6e-4   2       8         tpu-v6e-4:1    gke_shopify-ml-offline-sandbox_us-east5_ml-offline-sandbox-us-ea5-ft9   0.00          ✔
I 10-16 20:26:06 optimizer.py:937] -------------------------------------------------------------------------------------------------------------------------------------------------------------------
I 10-16 20:26:06 optimizer.py:937]
Launching a new cluster 'sky-1717-fdeo'. Proceed? [Y/n]: y
I 10-16 20:26:20 cloud_vm_ray_backend.py:4462] Creating a new cluster: 'sky-1717-fdeo' [1x Kubernetes(2CPU--8GB--1tpu-v6e-4, {'tpu-v6e-4': 1}, accelerator_args={'runtime_version': 'v2-alpha-tpuv6e'})].
I 10-16 20:26:20 cloud_vm_ray_backend.py:4462] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 10-16 20:26:21 cloud_vm_ray_backend.py:1980] Attempting to provision: Kubernetes(2CPU--8GB--1tpu-v6e-4, {'tpu-v6e-4': 1}, accelerator_args={'runtime_version': 'v2-alpha-tpuv6e'})
I 10-16 20:26:21 cloud_vm_ray_backend.py:1315] To view detailed progress: tail -n100 -f /Users/fdeo/sky_logs/sky-2024-10-16-20-26-05-475351/provision.log
I 10-16 20:26:21 cloud_vm_ray_backend.py:1389] Attempting to provision in region: gke_shopify-ml-offline-sandbox_us-east5_ml-offline-sandbox-us-ea5-ft9, zones: None
I 10-16 20:26:28 resources.py:566] Accelerators: {'tpu-v6e-4': 1}
I 10-16 20:26:28 resources.py:582] Using instance type "2CPU--8GB--1tpu-v6e-4" for TPU on Kubernetes.
I 10-16 20:26:28 provisioner.py:62] Launching on Kubernetes 'sky-1717-fdeo'.
I 10-16 20:26:31 resources.py:566] Accelerators: {'tpu-v6e-4': 1}
I 10-16 20:26:31 resources.py:582] Using instance type "2CPU--8GB--1tpu-v6e-4" for TPU on Kubernetes.
E 10-16 20:26:31 cloud_vm_ray_backend.py:1639] All provisioning attempts failed. Raising ResourcesUnavailableError.
W 10-16 20:26:31 cloud_vm_ray_backend.py:2013] TPU resource detected. ResourcesUnavailableError details: Failed to acquire resources in all zones in gke_shopify-ml-offline-sandbox_us-east5_ml-offline-sandbox-us-ea5-ft9. Try changing resource requirements or use another region.
W 10-16 20:26:31 cloud_vm_ray_backend.py:2014] Continuing with provisioning despite the error.
W 10-16 20:26:31 cloud_vm_ray_backend.py:2821] TPU resource detected. Suppressing ResourcesUnavailableError and continuing with the current configuration.
Clusters
No existing clusters.

sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x Kubernetes({'tpu-v6e-4': 1}, accelerator_args={'runtime_version': 'v2-alpha-tpuv6e'})
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
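
To dig into the node resources/labels question, a minimal sketch for inspecting what the cluster's nodes actually advertise. It assumes `kubectl` is pointed at the same context as in the log, and that the TPU nodes carry the GKE label keys `cloud.google.com/gke-tpu-accelerator` and `cloud.google.com/gke-tpu-topology` and expose the `google.com/tpu` resource. Whether skypilot's scheduling check keys off exactly these is an assumption I have not verified. The function names are made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of skypilot) for checking what TPU nodes advertise.

# Show each node with its TPU accelerator and topology labels as extra columns.
# Label keys are assumed to be the GKE TPU node-pool labels.
tpu_node_labels() {
  kubectl get nodes \
    -L cloud.google.com/gke-tpu-accelerator,cloud.google.com/gke-tpu-topology \
    "$@"
}

# Show how many google.com/tpu chips each node reports as allocatable.
# The dot in the resource name is escaped for the custom-columns path.
tpu_node_allocatable() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,TPU:.status.allocatable.google\.com/tpu' \
    "$@"
}
```

After sourcing the snippet, run `tpu_node_labels` and `tpu_node_allocatable`. Extra arguments such as `--context <name>` are passed through to `kubectl`. If the label columns are empty or the TPU column shows `<none>`, the nodes are not advertising what a `tpu-v6e-4` request would need to match.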

Tested (run the relevant ones):

Code formatting: bash format.sh
Any manual or new tests for this PR (please specify below)
All smoke tests: pytest tests/test_smoke.py
Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

filipedeo added 2 commits October 3, 2024 13:15

sanity check

cf58934

WIP

02831c0

filipedeo changed the title ~~Fdeo/tpu v6e~~ WIP: test v6e TPU support Oct 17, 2024
