[Azure] SkyPilot tries to create/update resource group instead of checking for existence #4520

Open
romilbhardwaj opened this issue Jan 2, 2025 · 0 comments

We should change our logic to create or update the resource group only when it does not already exist:

# TODO (skypilot): this takes a long time (> 40 seconds) to run.
outputs = create_or_update(
    resource_group_name=resource_group,
    deployment_name=deployment_name,
    parameters=parameters,
).result().properties.outputs

Otherwise, it fails for users who do not have full permissions.
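
A minimal sketch of the proposed check, not the actual SkyPilot implementation. The function name and the three callables are hypothetical. In practice they might wrap `deployments.check_existence`, `deployments.get` and `deployments.begin_create_or_update` from `azure-mgmt-resource`, but how `create_or_update` in the snippet above maps onto those calls is an assumption. In-memory stand-ins are used here so the example runs without Azure credentials.

```python
from typing import Any, Callable, Dict, List


def get_or_create_deployment_outputs(
    deployment_exists: Callable[[], bool],
    get_existing_outputs: Callable[[], Dict[str, Any]],
    create_or_update: Callable[[], Dict[str, Any]],
) -> Dict[str, Any]:
    """Return deployment outputs, submitting the template only if absent."""
    if deployment_exists():
        # Fast path: only read access is needed, and the slow
        # (> 40 seconds) create/update call is skipped entirely.
        return get_existing_outputs()
    # Slow path: submits the template, which may require permissions
    # such as Microsoft.Authorization/roleAssignments/write.
    return create_or_update()


# Usage with hypothetical stand-ins for the Azure calls.
calls: List[str] = []


def fake_get() -> Dict[str, Any]:
    calls.append('get')
    return {'subnet': {'value': 'existing'}}


def fake_create() -> Dict[str, Any]:
    calls.append('create_or_update')
    return {'subnet': {'value': 'created'}}


existing = get_or_create_deployment_outputs(lambda: True, fake_get, fake_create)
created = get_or_create_deployment_outputs(lambda: False, fake_get, fake_create)
```

Note that this only helps when the deployment already exists, for example because an administrator created it beforehand. A user without role-assignment permissions would still hit the slow path, and the same error, on the first launch.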

User reported:

so in azure i can spin up a new vm for example using the following

az vm create \
  --resource-group my-rg \
  --name myGPUvm \
  --image Ubuntu2204 \
  --admin-username azureuser \
  --generate-ssh-keys \
  --size Standard_NC24ads_A100_v4 \
  --location westus

but when I try to do the same thing with skypilot it tells me my user doesn't have role authorizations
I 01-02 03:21:31 common.py:292] --------------------Start: bootstrap_instances --------------------
I 01-02 03:21:31 config.py:73] Using subscription id: xxxxx
I 01-02 03:21:31 config.py:140] Using cluster name: mycluster-9672
I 01-02 03:21:31 config.py:150] Using subnet mask: xxxxx
I 01-02 03:21:31 config.py:204] Creating/Updating deployment: skypilot-bootstrap-mycluster-9672
I 01-02 03:21:33 common.py:296] --------------------End:   bootstrap_instances --------------------
I 01-02 03:21:33 common.py:296] 
D 01-02 03:21:33 provisioner.py:150] Failed to provision 'mycluster' on Azure (all zones).
D 01-02 03:21:33 provisioner.py:152] bulk_provision for 'mycluster' failed. Stacktrace:
D 01-02 03:21:33 provisioner.py:152] Traceback (most recent call last):
D 01-02 03:21:33 provisioner.py:152]   File "/home/azureuser/.conda/envs/skypilot/lib/python3.11/site-packages/sky/provision/provisioner.py", line 141, in bulk_provision
D 01-02 03:21:33 provisioner.py:152]     return _bulk_provision(cloud, region, cluster_name,
D 01-02 03:21:33 provisioner.py:152]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
D 01-02 03:21:33 provisioner.py:152]   File "/home/azureuser/.conda/envs/skypilot/lib/python3.11/site-packages/sky/provision/provisioner.py", line 59, in _bulk_provision
D 01-02 03:21:33 provisioner.py:152]     config = provision.bootstrap_instances(provider_name, region_name,
D 01-02 03:21:33 provisioner.py:152]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
D 01-02 03:21:33 provisioner.py:152]   File "/home/azureuser/.conda/envs/skypilot/lib/python3.11/site-packages/sky/provision/__init__.py", line 50, in _wrapper
D 01-02 03:21:33 provisioner.py:152]     return impl(*args, **kwargs)
D 01-02 03:21:33 provisioner.py:152]            ^^^^^^^^^^^^^^^^^^^^^
D 01-02 03:21:33 provisioner.py:152]   File "/home/azureuser/.conda/envs/skypilot/lib/python3.11/site-packages/sky/provision/common.py", line 294, in wrapper
D 01-02 03:21:33 provisioner.py:152]     return func(*args, **kwargs)
D 01-02 03:21:33 provisioner.py:152]            ^^^^^^^^^^^^^^^^^^^^^
D 01-02 03:21:33 provisioner.py:152]   File "/home/azureuser/.conda/envs/skypilot/lib/python3.11/site-packages/sky/provision/azure/config.py", line 209, in bootstrap_instances
D 01-02 03:21:33 provisioner.py:152]     outputs = create_or_update(
D 01-02 03:21:33 provisioner.py:152]               ^^^^^^^^^^^^^^^^^
D 01-02 03:21:33 provisioner.py:152]   File "/home/azureuser/.conda/envs/skypilot/lib/python3.11/site-packages/azure/core/tracing/decorator.py", line 105, in wrapper_use_tracer
D 01-02 03:21:33 provisioner.py:152]     return func(*args, **kwargs)
D 01-02 03:21:33 provisioner.py:152]            ^^^^^^^^^^^^^^^^^^^^^
D 01-02 03:21:33 provisioner.py:152]   File "/home/azureuser/.conda/envs/skypilot/lib/python3.11/site-packages/azure/mgmt/resource/resources/v2022_09_01/operations/_operations.py", line 7071, in begin_create_or_update
D 01-02 03:21:33 provisioner.py:152]     raw_result = self._create_or_update_initial(
D 01-02 03:21:33 provisioner.py:152]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
D 01-02 03:21:33 provisioner.py:152]   File "/home/azureuser/.conda/envs/skypilot/lib/python3.11/site-packages/azure/mgmt/resource/resources/v2022_09_01/operations/_operations.py", line 6960, in _create_or_update_initial
D 01-02 03:21:33 provisioner.py:152]     raise HttpResponseError(response=response, error_format=ARMErrorFormat)
D 01-02 03:21:33 provisioner.py:152] azure.core.exceptions.HttpResponseError: (InvalidTemplateDeployment) The template deployment failed with error: 'Authorization failed for template resource 'ff6d12bc-bf6c-5039-9cdf-6f2a67498e20' of type 'Microsoft.Authorization/roleAssignments'. The client 'xxx' with object id 'xxx' does not have permission to perform action 'Microsoft.Authorization/roleAssignments/write' at scope '/subscriptions/xxxx'.'.
D 01-02 03:21:33 provisioner.py:152] Code: InvalidTemplateDeployment
D 01-02 03:21:33 provisioner.py:152] 
D 01-02 03:21:33 provisioner.py:157] Terminating the failed cluster.

my hellosky.yaml is just

resources:
  # Optional; if left out, automatically pick the cheapest cloud.
  cloud: azure
  # 8x NVIDIA A100 GPU
  instance_type: Standard_NC24ads_A100_v4
  region: westus

and in the ~/.sky/config.yaml

azure:
  resource_group_vm: my-rg

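The user only has Contributor-level permissions, and the built-in Contributor role does not allow Microsoft.Authorization/*/Write actions, which is why the role assignment in the deployment template is rejected. The failure also happens when no resource group is specified in config.yaml.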
