[jobs] revamp scheduling for managed jobs #4485
Conversation
sky/jobs/scheduler.py (outdated)
os.makedirs(logs_dir, exist_ok=True)
log_path = os.path.join(logs_dir, f'{managed_job_id}.log')

pid = subprocess_utils.launch_new_process_tree(
If the scheduler is killed before this line (e.g. when running as part of a controller job), we will get stuck, since the job will be submitted but the controller will never start. TODO: figure out how to recover from this case.
We can have a skylet event to monitor the managed job table, like we do for normal unmanaged jobs.
We are already using the existing managed job skylet event for that, but the problem is that if the scheduler dies right here, there's no way to know whether it was just about to start the process or had already died. We need a way to check whether the scheduler died, or maybe a timestamp for the WAITING -> LAUNCHING transition.
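Not part of this PR, but as a rough illustration, such a recovery check could look like the sketch below. It assumes a hypothetical `submitted_at` timestamp recorded at the WAITING -> LAUNCHING transition and a hypothetical list of job records; nothing here is existing SkyPilot API.

```python
import time

import psutil

# Hypothetical grace period: if a job has been LAUNCHING this long with no
# live controller process, assume the scheduler died mid-submission.
_STALE_LAUNCHING_SECONDS = 300


def find_stale_launching_jobs(jobs):
    """Sketch: flag jobs whose scheduler likely died before spawning the controller.

    `jobs` is assumed to be a list of dicts with hypothetical fields
    'schedule_state', 'pid', and 'submitted_at' (the time of the
    WAITING -> LAUNCHING transition).
    """
    stale = []
    for job in jobs:
        if job['schedule_state'] != 'LAUNCHING':
            continue
        pid = job.get('pid')
        if pid is not None and psutil.pid_exists(pid):
            # The controller process was spawned and is still alive.
            continue
        if time.time() - job['submitted_at'] > _STALE_LAUNCHING_SECONDS:
            # No controller appeared within the grace period; the scheduler
            # probably died between updating the DB and spawning the process.
            stale.append(job)
    return stale
```

A skylet event could then reset such jobs back to WAITING (or mark them failed).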
Thanks @cg505 for making this significant change! This is awesome! I glanced over the code, and it mostly looks good. The main concern is the complexity and granularity we have for limiting the number of launches. Please see the comments below.
/quicktest-core
Thanks @cg505! This PR looks pretty good to me! We should do some thorough testing with managed jobs, especially for:
- scheduling speed for jobs
- special cases that may get the scheduling stuck
- many jobs
- cancellation of jobs
- scheduling jobs in parallel
@@ -191,6 +190,8 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool:
f'Submitted managed job {self._job_id} (task: {task_id}, name: '
f'{task.name!r}); {constants.TASK_ID_ENV_VAR}: {task_id_env_var}')

+scheduler.wait_until_launch_okay(self._job_id)
The new API looks much better than before. Maybe we can turn this into a context manager so as to combine the wait and finish calls.
self._strategy_executor.launch() may call scheduler.launch_finished and scheduler.wait_until_launch_okay in the recovery case, so I feel like the context manager wouldn't really be accurate.
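For illustration only, such a context manager might look like the sketch below (the `launch_slot` name is made up; I'm assuming launch_finished takes the job id, like wait_until_launch_okay does).

```python
import contextlib

from sky.jobs import scheduler


@contextlib.contextmanager
def launch_slot(job_id: int):
    """Sketch: bracket a launch with the scheduler's wait/finish calls."""
    scheduler.wait_until_launch_okay(job_id)
    try:
        yield
    finally:
        scheduler.launch_finished(job_id)
```

Usage would be roughly `with launch_slot(self._job_id): self._strategy_executor.launch()`, which is exactly where the objection above bites: launch() may itself call launch_finished()/wait_until_launch_okay() during recovery, so the with-block would no longer bracket a single wait/finish pair.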
db_utils.add_column_to_table(cursor, conn, 'job_info', 'schedule_state',
                             'TEXT')

db_utils.add_column_to_table(cursor, conn, 'job_info', 'pid',
Should it be called controller_pid?
There is no well-defined mapping from the managed job status to schedule
state or vice versa. (In fact, schedule state is defined on the job and
status on the task.)
I assume the mapping is more like: INACTIVE and WAITING only happen during the PENDING state, ALIVE_WAITING only happens during the RECOVERING state, etc. We can probably mention that.

A future TODO: reflect the schedule state in the displayed states or in the description in sky jobs queue, so that a user has a more detailed idea of what actual state the job is in.
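Purely to illustrate what "mention that" could look like, a sketch (only the PENDING and RECOVERING pairings come from this thread; anything else is a guess, not taken from the PR):

```python
# Illustrative sketch, not from the PR: typical managed job statuses that each
# schedule state coincides with. As the docstring notes, this is not a strict
# mapping, since schedule state is per-job while status is per-task.
TYPICAL_STATUSES_BY_SCHEDULE_STATE = {
    'INACTIVE': {'PENDING'},
    'WAITING': {'PENDING'},
    'ALIVE_WAITING': {'RECOVERING'},
    # LAUNCHING and ALIVE can coincide with several task statuses.
}
```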
except filelock.Timeout:
    # If we can't get the lock, just exit. The process holding the lock
    # should launch any pending jobs.
    pass
Since we skip any scheduling call when another process is already holding the lock to start jobs, would it be possible that the following happens:
- This function is called in job_done for job 1.
- This function schedules the next job; that job passes wait_until_launch_okay but takes forever to launch due to resource availability issues.
- The next pending job will not be scheduled until the next job-state transition, which may take a long time?

Also, there is a race between this function releasing the scheduling lock and the underlying jobs.controller calling the scheduler again. Should the underlying jobs.controller wait for this scheduler function to set some specific state before proceeding to trigger the scheduler?
> that job passes wait_until_launch_okay but takes forever to launch due to resource availability issues

If the launch is waiting for resources (in backoff), it will be ALIVE instead of LAUNCHING while sleeping.
https://github.com/skypilot-org/skypilot/pull/4485/files#diff-092d0b9d29f77447088331967c3a39d77095fb44328a6e3e89d614b1236f9ab3R391-R395

> there is a race between this function releasing the scheduling lock and the underlying jobs.controller calling the scheduler again

This race is not clear to me. I think whatever race you are thinking of is protected by holding the lock whenever a job transitions state.

Consider two processes that call maybe_start_waiting_jobs at the same time. Only one will get the lock, and the other will exit immediately, but no matter which process gets the lock, the function will do exactly the same thing. We guarantee this because, as soon as the lock is held by either process, the scheduling and job states can only be changed by whatever process is doing the scheduling. Any other job state changes will block until the scheduling finishes.
I guess we should clarify the following:
- This function may return before actually scheduling all possible jobs, e.g. if another process is already scheduling.
- However, if we call this function, we are guaranteed that all jobs that are currently possible to schedule will eventually be scheduled, either by this process or by another.
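For reference, a simplified sketch of the non-blocking locking pattern that gives those two guarantees (the lock path is a placeholder and the two callables stand in for the PR's actual DB query and process-spawning logic):

```python
import filelock

_SCHEDULER_LOCK_PATH = '/tmp/.sky_jobs_scheduler.lock'  # placeholder path


def maybe_start_waiting_jobs_sketch(get_launchable_jobs, start_controller):
    """Sketch: launch every currently launchable job, or return early."""
    try:
        # timeout=0 makes the acquisition non-blocking: if another process is
        # already scheduling, bail out immediately.
        with filelock.FileLock(_SCHEDULER_LOCK_PATH, timeout=0):
            for job in get_launchable_jobs():
                start_controller(job)
    except filelock.Timeout:
        # The process holding the lock will see any job that became
        # schedulable before it releases the lock, so nothing is lost.
        pass
```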
> If the launch is waiting for resources (in backoff), it will be ALIVE instead of LAUNCHING while sleeping.
> https://github.com/skypilot-org/skypilot/pull/4485/files#diff-092d0b9d29f77447088331967c3a39d77095fb44328a6e3e89d614b1236f9ab3R391-R395

I assume this may still take a while if the launch/failover takes a while, especially when it has to go through a large list of candidate regions/clouds/accelerator types, i.e. the time to schedule the next job may be non-deterministic and depend on the number of candidate locations to search.

> This race is not clear to me. I think whatever race you are thinking of is protected by holding the lock whenever a job transitions state.
> Consider two processes that call maybe_start_waiting_jobs at the same time. Only one will get the lock, and the other will exit immediately, but no matter which process gets the lock, the function will do exactly the same thing. We guarantee this because, as soon as the lock is held by either process, the scheduling and job states can only be changed by whatever process is doing the scheduling. Any other job state changes will block until the scheduling finishes.

By "race" here, I mean a case similar to the one mentioned above, where we expect that if a job controller gets scheduled, the next job should start scheduling immediately once the jobs.controller calls the scheduler. However, due to the "race" (the scheduling function may still be running for the current jobs.controller when it calls the scheduler for the next job), the scheduling of the next job will be skipped until we finish the failover for the launch.
Since maybe_start_waiting_jobs will directly launch all possible jobs that can be launched, I think this is okay. Will try to clarify in comments.
]
if show_all:
-    columns += ['STARTED', 'CLUSTER', 'REGION', 'FAILURE']
+    columns += ['STARTED', 'CLUSTER', 'REGION', 'FAILURE', 'SCHED. STATE']
nit: I would prefer not to have the SCHED. STATE column; instead, we may want to do something similar to kubectl describe pod, which shows a detailed description of what the pod is working on in its current state. For example, we could rename the FAILURE column to DESCRIPTION.
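As a rough sketch of that idea (the function and argument names are made up, not the PR's code):

```python
def format_description(failure_reason, schedule_state):
    """Sketch: fold the schedule state into a single DESCRIPTION cell."""
    if failure_reason:
        return failure_reason
    if schedule_state in ('WAITING', 'ALIVE_WAITING'):
        return f'Waiting to be scheduled ({schedule_state})'
    if schedule_state == 'LAUNCHING':
        return 'Launching resources'
    return '-'
```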
Don't want to spend too much time on this but I'll take a look.
Yep, it doesn't need to be a large change. Just add the state as a description in the FAILURE column (which should now be renamed to DESCRIPTION).
# Backwards compatibility: this job was submitted when ray was still
# used for managing the parallelism of job controllers. This code
# path can be removed before 0.11.0.
nit: add TODO in the comment to make it easier to search for.
# Warning: it's totally possible for the controller job to transition to
# a terminal state during the course of this function. We will see that
# as an abnormal failure. However, set_failed() will not update the
# state in this case.
Suggested change:
-# Warning: it's totally possible for the controller job to transition to
-# a terminal state during the course of this function. We will see that
-# as an abnormal failure. However, set_failed() will not update the
-# state in this case.
+# Warning: it's totally possible for the controller job to transition to
+# a terminal state during the course of this function. The set_failed() called below
+# will not update the state for jobs already in terminal state, so it should be fine.
Detaches the job controller from the Ray worker and the Ray driver program, and uses our own scheduling and parallelism control mechanism.
See the comments in sky/jobs/scheduler.py for more info.
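Roughly, the controller side of the handshake looks like the sketch below, assembled from the scheduler API names discussed in the review above (wait_until_launch_okay, launch_finished, job_done); the real call sites differ in detail, and the job_done module and signature are assumptions.

```python
from sky.jobs import scheduler


def controller_lifecycle_sketch(job_id: int, strategy_executor) -> None:
    """Sketch of one job controller's interaction with the new scheduler."""
    try:
        # Block until the scheduler allows this job to consume a launch slot.
        scheduler.wait_until_launch_okay(job_id)
        # Note: launch() may call launch_finished()/wait_until_launch_okay()
        # again internally when recovering the cluster.
        strategy_executor.launch()
        scheduler.launch_finished(job_id)
        # ... run and monitor the user's task ...
    finally:
        # Let the scheduler start the next waiting job.
        scheduler.job_done(job_id)
```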
Tested (run the relevant ones):
- bash format.sh
- pytest tests/test_smoke.py
- pytest tests/test_smoke.py::test_fill_in_the_name
- conda deactivate; bash -i tests/backward_compatibility_tests.sh