Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix smoke tests bug #4492

Merged
merged 30 commits into from
Jan 14, 2025
Merged

Conversation

zpoint
Copy link
Collaborator

@zpoint zpoint commented Dec 20, 2024

Some fix for #4438

Refer to release comment here and some remaining problem after resolved

Change regoin for:

 pytest tests/test_smoke.py::test_managed_jobs_storage --azure
 pytest tests/test_smoke.py::test_azure_best_tier_failover --azure
 pytest tests/test_smoke.py::test_azure_disk_tier --azure

And robust backward compatibility

Add a gke mark for pytest and modify generate_pipeline.py. The basic idea is to separate tests marked withgke to run on the GKE cluster, while other Kubernetes tests can run on the local cluster, without relying on GKE for buildkite.

Remove the me-central2 region test since its restricted and our account don't have permission

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@zpoint
Copy link
Collaborator Author

zpoint commented Dec 20, 2024

/quicktest-core

@zpoint
Copy link
Collaborator Author

zpoint commented Dec 20, 2024

/quicktest-core

1 similar comment
@zpoint
Copy link
Collaborator Author

zpoint commented Dec 25, 2024

/quicktest-core

@zpoint
Copy link
Collaborator Author

zpoint commented Dec 25, 2024

/quicktest-core

@zpoint zpoint modified the milestones: 100 jobs, v0.7.1 Dec 27, 2024
@zpoint zpoint requested review from cg505 and Michaelvll December 27, 2024 03:44
sky/clouds/gcp.py Outdated Show resolved Hide resolved
sky/data/mounting_utils.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zpoint! Please see the comments below

sky/clouds/gcp.py Outdated Show resolved Hide resolved
sky/data/mounting_utils.py Outdated Show resolved Hide resolved
tests/backward_compatibility_tests.sh Outdated Show resolved Hide resolved
tests/smoke_tests/test_basic.py Show resolved Hide resolved
tests/smoke_tests/test_cluster_job.py Outdated Show resolved Hide resolved
@zpoint
Copy link
Collaborator Author

zpoint commented Jan 2, 2025

/quicktest-core

@zpoint
Copy link
Collaborator Author

zpoint commented Jan 2, 2025

/quicktest-core

tests/smoke_tests/test_images.py Outdated Show resolved Hide resolved
Comment on lines 129 to 133
# The line below ensures the cache directory is cleared before mounting to
# avoid "config error in file_cache [temp directory not empty]" error, which
# can occur after stopping and starting the same cluster on Azure.
# This helps ensure a clean state for blobfuse2 operations.
mount_cmd = (f'rm -rf {cache_path} && mkdir -p {cache_path} && '
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this cause sky launch again on an existing UP cluster to clean up the cache? Could this cause issue if there is a running process using the cache?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about generating a new cache_path on every sky launch?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comments offline : )

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now use the boot_time to generate the cache directory.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last login: Thu Jan  9 07:08:36 2025 from 163.125.113.120
(base) azureuser@t-azure-storage-mount-b8-09-94a6-2815-0:~$ ls ~/.sky/blobfuse2_cache/
azureopendatastorage_nyctlc_1736406391  sky635694a6141885a3_sky-test-1736406192_1736406391
(base) azureuser@t-azure-storage-mount-b8-09-94a6-2815-0:~$ 

sky/data/mounting_utils.py Outdated Show resolved Hide resolved
@zpoint zpoint requested a review from Michaelvll January 6, 2025 08:35
Comment on lines 126 to 130
# The line below ensures the cache directory is new before mounting to
# avoid "config error in file_cache [temp directory not empty]" error, which
# can occur after stopping and starting the same cluster on Azure.
# This helps ensure a clean state for blobfuse2 operations.
cache_path += f'_{int(time.time())}'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this cause the same issue for launch on existing UP cluster where we create a new folder everytime?

Comment on lines 129 to 133
# The line below ensures the cache directory is cleared before mounting to
# avoid "config error in file_cache [temp directory not empty]" error, which
# can occur after stopping and starting the same cluster on Azure.
# This helps ensure a clean state for blobfuse2 operations.
mount_cmd = (f'rm -rf {cache_path} && mkdir -p {cache_path} && '
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comments offline : )

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zpoint! Mostly looks good to me know.

tests/skyserve/update/new_autoscaler_before.yaml Outdated Show resolved Hide resolved
tests/smoke_tests/test_images.py Outdated Show resolved Hide resolved
@zpoint zpoint requested a review from Michaelvll January 9, 2025 13:20
@zpoint zpoint mentioned this pull request Jan 10, 2025
10 tasks
@zpoint
Copy link
Collaborator Author

zpoint commented Jan 10, 2025

@Michaelvll, could we get this PR reviewed and unblock the usage of the comment /smoke-tests? I will create a new PR to fix the new test issues and freeze this PR to get it closer to being merged.

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zpoint! LGTM.

Comment on lines +125 to +129
# The line below ensures the cache directory is new before mounting to
# avoid "config error in file_cache [temp directory not empty]" error, which
# can occur after stopping and starting the same cluster on Azure.
# This helps ensure a clean state for blobfuse2 operations.
remote_boot_time_cmd = 'date +%s -d "$(uptime -s)"'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we clean up the old cache folder that was created in a previous boot? We can add a TODO here first.

@zpoint zpoint merged commit 35f0cf4 into skypilot-org:master Jan 14, 2025
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants