Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tests] fix smoke test race condition on first run #4494

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

aylei
Copy link
Contributor

@aylei aylei commented Dec 20, 2024

Fix the following error when I ran pytest tests/test_smoke.py on my laptop, which cause test failure:

D 12-20 16:53:56 provisioner.py:153] Traceback (most recent call last):
D 12-20 16:53:56 provisioner.py:153]   File "/Users/aylei/repo/skypilot-org/skypilot/sky/provision/provisioner.py", line 142, in bulk_provision
D 12-20 16:53:56 provisioner.py:153]     return _bulk_provision(cloud, region, cluster_name,
D 12-20 16:53:56 provisioner.py:153]   File "/Users/aylei/repo/skypilot-org/skypilot/sky/provision/provisioner.py", line 60, in _bulk_provision
D 12-20 16:53:56 provisioner.py:153]     config = provision.bootstrap_instances(provider_name, region_name,
D 12-20 16:53:56 provisioner.py:153]   File "/Users/aylei/repo/skypilot-org/skypilot/sky/provision/__init__.py", line 52, in _wrapper
D 12-20 16:53:56 provisioner.py:153]     return impl(*args, **kwargs)
D 12-20 16:53:56 provisioner.py:153]   File "/Users/aylei/repo/skypilot-org/skypilot/sky/provision/aws/config.py", line 71, in bootstrap_instances
D 12-20 16:53:56 provisioner.py:153]     node_cfg['IamInstanceProfile'] = _configure_iam_role(iam)
D 12-20 16:53:56 provisioner.py:153]   File "/Users/aylei/repo/skypilot-org/skypilot/sky/provision/aws/config.py", line 161, in _configure_iam_role
D 12-20 16:53:56 provisioner.py:153]     iam.meta.client.create_instance_profile(
D 12-20 16:53:56 provisioner.py:153]   File "/opt/homebrew/anaconda3/envs/sky/lib/python3.10/site-packages/botocore/client.py", line 569, in _api_ca
ll
D 12-20 16:53:56 provisioner.py:153]     return self._make_api_call(operation_name, kwargs)
D 12-20 16:53:56 provisioner.py:153]   File "/opt/homebrew/anaconda3/envs/sky/lib/python3.10/site-packages/botocore/client.py", line 1023, in _make_
api_call
D 12-20 16:53:56 provisioner.py:153]     raise error_class(parsed_response, operation_name)
D 12-20 16:53:56 provisioner.py:153] botocore.errorfactory.EntityAlreadyExistsException: An error occurred (EntityAlreadyExists) when calling the Cr
eateInstanceProfile operation: Instance Profile skypilot-v1 already exists.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_aws_stale_job_manual_restart
$ profile_name=skypilot-v1
$ aws iam remove-role-from-instance-profile --instance-profile-name "$profile_name" --role-name "$profile_name"
$ aws iam delete-instance-profile --instance-profile-name "$profile_name"

$ pytest tests/test_smoke.py::test_aws_stale_job_manual_restart tests/test_smoke.py::test_minimal tests/test_smoke.py::test_launch_fast --aws
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@aylei aylei marked this pull request as ready for review December 20, 2024 10:12
@romilbhardwaj romilbhardwaj requested a review from zpoint December 23, 2024 21:46
Copy link
Collaborator

@zpoint zpoint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aylei , The code looks good

I’m trying to test it with the following commands:

profile_name=skypilot-v1
aws iam remove-role-from-instance-profile --instance-profile-name "$profile_name" --role-name "$profile_name"
aws iam delete-instance-profile --instance-profile-name "$profile_name"
# Double check its deleted
aws iam get-instance-profile --instance-profile-name "$profile_name"

# Additional tests can be added for easier reproduction
pytest tests/test_smoke.py::test_aws_stale_job_manual_restart tests/test_smoke.py::test_minimal --aws

This seems flaky; it might be because profile.add_role also needs to address the concurrency issue. Could u also help fix this? @aylei

D 12-27 11:51:18 provisioner.py:151] Failed to provision 't-aws-stale-job-manua-1d-86' on AWS (us-east-2a,us-east-2b,us-east-2c).
D 12-27 11:51:18 provisioner.py:153] bulk_provision for 't-aws-stale-job-manua-1d-86' failed. Stacktrace:
D 12-27 11:51:18 provisioner.py:153] Traceback (most recent call last):
D 12-27 11:51:18 provisioner.py:153]   File "/Users/zepingguo/Desktop/skypilot/sky/provision/provisioner.py", line 142, in bulk_provision
D 12-27 11:51:18 provisioner.py:153]     return _bulk_provision(cloud, region, cluster_name,
D 12-27 11:51:18 provisioner.py:153]   File "/Users/zepingguo/Desktop/skypilot/sky/provision/provisioner.py", line 60, in _bulk_provision
D 12-27 11:51:18 provisioner.py:153]     config = provision.bootstrap_instances(provider_name, region_name,
D 12-27 11:51:18 provisioner.py:153]   File "/Users/zepingguo/Desktop/skypilot/sky/provision/__init__.py", line 52, in _wrapper
D 12-27 11:51:18 provisioner.py:153]     return impl(*args, **kwargs)
D 12-27 11:51:18 provisioner.py:153]   File "/Users/zepingguo/Desktop/skypilot/sky/provision/aws/config.py", line 71, in bootstrap_instances
D 12-27 11:51:18 provisioner.py:153]     node_cfg['IamInstanceProfile'] = _configure_iam_role(iam)
D 12-27 11:51:18 provisioner.py:153]   File "/Users/zepingguo/Desktop/skypilot/sky/provision/aws/config.py", line 254, in _configure_iam_role
D 12-27 11:51:18 provisioner.py:153]     profile.add_role(RoleName=role.name)
D 12-27 11:51:18 provisioner.py:153]   File "/Users/zepingguo/miniconda3/envs/sky/lib/python3.10/site-packages/boto3/resources/factory.py", line 581, in do_action
D 12-27 11:51:18 provisioner.py:153]     response = action(self, *args, **kwargs)
D 12-27 11:51:18 provisioner.py:153]   File "/Users/zepingguo/miniconda3/envs/sky/lib/python3.10/site-packages/boto3/resources/action.py", line 88, in __call__
D 12-27 11:51:18 provisioner.py:153]     response = getattr(parent.meta.client, operation_name)(*args, **params)
D 12-27 11:51:18 provisioner.py:153]   File "/Users/zepingguo/miniconda3/envs/sky/lib/python3.10/site-packages/botocore/client.py", line 569, in _api_call
D 12-27 11:51:18 provisioner.py:153]     return self._make_api_call(operation_name, kwargs)
D 12-27 11:51:18 provisioner.py:153]   File "/Users/zepingguo/miniconda3/envs/sky/lib/python3.10/site-packages/botocore/client.py", line 1023, in _make_api_call
D 12-27 11:51:18 provisioner.py:153]     raise error_class(parsed_response, operation_name)
D 12-27 11:51:18 provisioner.py:153] botocore.errorfactory.LimitExceededException: An error occurred (LimitExceeded) when calling the AddRoleToInstanceProfile operation: Cannot exceed quota for InstanceSessionsPerInstanceProfile: 1
D 12-27 11:51:18 provisioner.py:153] 
D 12-27 11:51:18 provisioner.py:158] Terminating the failed cluster.

@aylei
Copy link
Contributor Author

aylei commented Dec 27, 2024

@zpoint Thanks! Looks like add-role-to-instance-profile is not idempotent:

$ aws iam get-instance-profile --instance-profile-name $profile_name
{
    "InstanceProfile": {
        "Path": "/",
        "InstanceProfileName": "skypilot-v1",
        "InstanceProfileId": "AIPAXAJL2PF3IMPCUTZW2",
        "Arn": "arn:aws:iam::481665120630:instance-profile/skypilot-v1",
        "CreateDate": "2024-12-27T06:00:25Z",
        "Roles": [
            {
                "Path": "/",
                "RoleName": "skypilot-v1",
                "RoleId": "AROAXAJL2PF3LTM5OYQZP",
                "Arn": "arn:aws:iam::481665120630:role/skypilot-v1",
                "CreateDate": "2024-12-20T10:09:04Z",
                "AssumeRolePolicyDocument": {
                    "Version": "2008-10-17",
                    "Statement": [
                        {
                            "Effect": "Allow",
                            "Principal": {
                                "Service": "ec2.amazonaws.com"
                            },
                            "Action": "sts:AssumeRole"
                        }
                    ]
                }
            }
        ],
        "Tags": []
    }
}
$ aws iam add-role-to-instance-profile --instance-profile-name $profile_name --role-name skypilot-v1

An error occurred (LimitExceeded) when calling the AddRoleToInstanceProfile operation: Cannot exceed quota for InstanceSessionsPerInstanceProfile: 1

Besides, this quota is not allowed to increase according to https://docs.aws.amazon.com/IAM/latest/APIReference/API_AddRoleToInstanceProfile.html

I added a quick fix in the PR, PTAL❤️

@aylei aylei requested a review from zpoint December 27, 2024 11:13
Copy link
Collaborator

@zpoint zpoint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aylei

LGTM

profile = _get_instance_profile(instance_profile_name)
time.sleep(15) # wait for propagation
assert profile is not None, 'Failed to create instance profile'

# TODO(aylei): check if the role associated is the one we expect.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You already checked in _ensure_instance_profile_role. Can we remove the TODO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check in _ensure_instance_profile_role only occurs when race happens, i.e. profile.roles is empty and the subsequent adding role action encounters LimitExceeded error. We still need to check the role in other cases.

I found this TODO is trivial, just added the missing part and removed the TODO.

Copy link
Collaborator

@zpoint zpoint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aylei

def _check_instance_profile_role(profile: Any, role_name: str):
if profile.roles and profile.roles[0].name != role_name:
# If the associated role is not the role we expect, error out
# to user instead of silently overriding the role.
Copy link
Collaborator

@zpoint zpoint Dec 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need the comment here? The logger.fatal statement below is quite clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, removed the log for simplicity~

Copy link
Collaborator

@zpoint zpoint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for fixing this @aylei! It looks good to me.

f'role {role_name} is not the same as the expected '
f'role. Please remove the existing role from the '
f'instance profile and try again.')
raise SystemExit(1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we use the same handle_boto_error here as well to keep the error handling consistent and raise the same exc error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, boto error is specific to the errors returned by AWS boto3 SDK, simulating a boto error here adding extra indirections. Besides, handle_boto_error will prefix error message with Boto3 error: , which looks misleading, wdyt?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I was referring to raising the error from the caller directly, but I am ok with a custom exception.

If we are raising SystemExit here, it might cause the whole system to exit as we don't catch the system exit explicitly.
https://github.com/skypilot-org/skypilot/blob/master/sky/provision/provisioner.py#L136-L197

Just a note, we decide the failover across clouds based on the error message here: https://github.com/skypilot-org/skypilot/blob/master/sky/backends/cloud_vm_ray_backend.py#L1119-L1137. I assume if we fail in this case, we should skip the whole AWS cloud.

sky/provision/aws/config.py Outdated Show resolved Hide resolved
aylei and others added 3 commits January 2, 2025 20:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants