scheduler: optimise starting new workflows beyond "max concurrent" limit if job slots allow it #520
Comments
The complexity of this improvement depends on whether we set a maximum number of "job slots" manually (the same way as REANA_MAX_CONCURRENT_BATCH_WORKFLOWS) or detect it automatically. If manually, this improvement will require adding a new env variable for job slots and a further condition to the scheduler. The downside is that it will not be flexible at all. It will work pretty well for a dedicated cluster like ATLAS, where we can calculate job slots for the specific workflows they want to run. It might not be as good for general use cases, as different workflows may have different memory requirements, etc. If detected automatically, the scheduling will be more flexible for ATLAS and all other users. But we would need to devise steps to determine the maximum number of job slots or workflows the cluster can run.
I thought about automatic detection, because we know the total amount of memory in the cluster, and we know the default memory requirements if users don't specify any. (And if a cluster admin adds or removes some nodes, the system will notice the next time the cluster node health stats are gathered.)
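A minimal sketch of what such an automatic estimate could look like, assuming memory is the only constraint we care about; the helper name, the per-node memory figures and the 2200Mi default below are purely illustrative (the default mirrors the pMSSM observation quoted later in this issue):

```python
# Illustrative sketch only: estimate the number of "job slots" from the
# memory usable on each node and the default per-job memory requirement.
def estimate_job_slots(usable_memory_mib_per_node, default_job_memory_mib=2200):
    """Return how many default-sized jobs fit across the given nodes."""
    # Pack each node independently, since a single job cannot span nodes.
    return sum(mem // default_job_memory_mib for mem in usable_memory_mib_per_node)

# Example: assuming roughly 11 GiB usable per 16 GB node (the rest going to
# system components), we recover the ~5 jobs per node packing quoted below.
print(estimate_job_slots([11264, 11264, 11264]))  # -> 15 slots on 3 nodes
```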
One more related observation regarding loosening the REANA_MAX_CONCURRENT_BATCH_WORKFLOWS limit: we may want to take into account the compute backends that the incoming workflow jobs intend to use. If a workflow uses HTCondor or Slurm, then there may be very few (or no) reana-run-job-* pods on the Kubernetes cluster at all, since the actual jobs run on the external backend. Example of one such scenario:

```console
$ kubectl get pod | grep -E '(NAME|reana-run)'
NAME                                                            READY   STATUS    RESTARTS   AGE
reana-run-batch-17696032-c7a7-4247-985b-63d8b73ed079--1-gm77g   2/2     Running   0          3h1m
reana-run-batch-468ff766-03b8-48e7-9d53-647723109f0b--1-4b6l5   2/2     Running   0          3h7m
reana-run-batch-48b37df7-ded9-41cf-8e22-6651921a86dd--1-4vfjz   2/2     Running   0          3h1m
reana-run-batch-77b1fcc0-ade3-44a6-b604-60098880e91e--1-ts4p8   2/2     Running   0          3h19m
reana-run-batch-ef56e5af-4606-4e9e-963a-f7f089eb3ab2--1-rd2j8   2/2     Running   0          3h13m
```

These pods use very little resources:

```console
$ kubectl top pods | grep -E '(NAME|reana-run-)'
NAME                                                            CPU(cores)   MEMORY(bytes)
reana-run-batch-17696032-c7a7-4247-985b-63d8b73ed079--1-gm77g   24m          115Mi
reana-run-batch-468ff766-03b8-48e7-9d53-647723109f0b--1-4b6l5   6m           115Mi
reana-run-batch-48b37df7-ded9-41cf-8e22-6651921a86dd--1-4vfjz   1m           113Mi
reana-run-batch-77b1fcc0-ade3-44a6-b604-60098880e91e--1-ts4p8   3m           107Mi
reana-run-batch-ef56e5af-4606-4e9e-963a-f7f089eb3ab2--1-rd2j8   2m           172Mi
```

and they contribute only to the reana-run-batch-* slot occupation, not to the reana-run-job-* slot occupation. This means that we could start many more HTC/HPC workflows, or start a lot of new K8s workflows alongside these.
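If we wanted the scheduler to exploit this, the slot accounting could distinguish backends along these lines; the data structure and field names below are hypothetical, not the actual REANA model:

```python
# Illustrative sketch: HTCondor/Slurm workflows occupy a reana-run-batch-*
# slot but no reana-run-job-* slots, since their jobs run off-cluster.
def slot_occupation(running_workflows):
    batch_slots_used = len(running_workflows)  # every workflow has one batch pod
    job_slots_used = sum(
        w["running_jobs"]
        for w in running_workflows
        if w["compute_backend"] == "kubernetes"  # external backends excluded
    )
    return batch_slots_used, job_slots_used

workflows = [
    {"compute_backend": "kubernetes", "running_jobs": 3},
    {"compute_backend": "htcondorcern", "running_jobs": 2},
    {"compute_backend": "slurmcern", "running_jobs": 1},
]
print(slot_occupation(workflows))  # -> (3, 3): three batch slots, three job slots
```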
Current behaviour
Historically, REANA has been using a sole mechanism to limit the load on the cluster, namely the REANA_MAX_CONCURRENT_BATCH_WORKFLOWS setting, which allows e.g. a maximum of 30 concurrent workflows to run on the cluster in parallel at any given time. Its purpose was solely to prevent cluster overload.

Later, we added workflow scheduling based on a DAG complexity estimator, which looks at the incoming workflows, their init-stage jobs and their claimed memory consumption, and compares this to the amount of memory available on the cluster, in order to see whether a workflow can fit into the available number of "job slots", and in order to pick the most appropriate waiting workflow to start from the incoming queue. (Weighted by the number of workflows launched by each individual user, to balance across different users too.)
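In rough pseudocode, the fit check described above amounts to something like the following (function and field names are illustrative, not the actual reana-server scheduler code):

```python
# Rough sketch of the current fit check: a waiting workflow is considered
# only if the memory claimed by its initial jobs fits into the memory that
# is currently free on the cluster; among the fitting candidates, users
# with fewer running workflows are preferred.
def pick_next_workflow(waiting_workflows, free_cluster_memory_mib,
                       running_workflows_per_user):
    candidates = [
        w for w in waiting_workflows
        if w["initial_jobs_memory_mib"] <= free_cluster_memory_mib
    ]
    if not candidates:
        return None  # nothing fits right now; keep the queue waiting
    return min(candidates,
               key=lambda w: running_workflows_per_user.get(w["user"], 0))
```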
Such scheduling leads to a situation where, for a set of workflows of the following DAG complexity, we can estimate the "optimal" cluster settings as follows:
For example, allowing 80 such workflows (via the REANA_MAX_CONCURRENT_BATCH_WORKFLOWS setting) would need 10 runtime batch nodes to run the 80 workflow orchestration pods, and roughly 30 job nodes. However, to be more precise, we should count the memory needs of these jobs: if one job needs 12 GiB of memory and our job nodes have 16 GiB of memory in total, we wouldn't be able to run more than 1 job per node anyway, and instead of 30 job nodes we would need 30 x 8 = 240 job nodes! So, for a more precise estimate, it is necessary to take the amount of memory a job needs into account.

Assuming 2200Mi memory consumption for each job, which follows from observing typical pMSSM workflows in action, the cluster would need 80 x 3 = 240 "job slots"; at an average packing rate of 5 jobs per node (because 5 x 2200Mi is about the maximum we can occupy on 16 GB RAM nodes), this means about 240 / 5 = 48 job nodes.

In theory, such a cluster would run with 100% of its resources consumed optimally, seeing 80 concurrent workflows and 240 concurrent jobs running all the time.
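The sizing arithmetic above can be condensed into a small helper; the packing factors (5 jobs per 16 GB job node, 8 orchestration pods per batch node) are simply the figures used in this paragraph and are of course deployment-specific:

```python
import math

# Back-of-the-envelope sizing reproducing the numbers quoted above.
def cluster_size(max_workflows=80, jobs_per_workflow=3,
                 jobs_per_node=5, batch_pods_per_node=8):
    job_slots = max_workflows * jobs_per_workflow                 # 80 * 3 = 240
    job_nodes = math.ceil(job_slots / jobs_per_node)              # 240 / 5 = 48
    batch_nodes = math.ceil(max_workflows / batch_pods_per_node)  # 80 / 8 = 10
    return job_slots, job_nodes, batch_nodes

print(cluster_size())  # -> (240, 48, 10)
```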
In practice, we have tried this setup for a recent ATLAS pMSSM AllSys workflow run, and instead of seeing 80 concurrent workflows and 240 concurrent jobs, we saw 80 concurrent workflows but typically only 170-190 running jobs. This is fully natural, since some workflow jobs take more time to finish than others, be it during the n-tupling phase or later. (And, for the later jobs, we cannot predict anything about their runtimes or complexity!) When some of these n-tupling jobs finish, the workflow may still be slow to "conclude" due to its later jobs, so some "n-tupling job slots" become free, but new workflows are prevented from starting in them because the REANA_MAX_CONCURRENT_BATCH_WORKFLOWS limit is already at the maximum allowed 80 concurrently running workflows.

Here is the typical job runtime distribution of all jobs (i.e. not only the n-tupling ones):
Due to those few "outliers" where a job takes longer to run, new incoming workflows are prevented from starting.
Having 170-190 jobs practically running out of a theoretical maximum of 240 is not too bad! It means that we had about 75% CPU resource efficiency when running the full pMSSM workflow batch.
However, we may perhaps be able to do better.
Expected behaviour
Since we have a way to limit the consumption of incoming workflows by the "job slot" numbers, it would be a good idea to release the REANA_MAX_CONCURRENT_BATCH_WORKFLOWS constraint a bit. For example, 240 - 180 = 60 job slots were free, which means that we had space for 60 / 3 n-tupling jobs = 20 more workflows, so we could have run not with a maximum limit of 80, but at 100, in the "stabilised regime" after the initial upload.

Let's look at whether we can release the REANA_MAX_CONCURRENT_BATCH_WORKFLOWS constraint a bit:

- if we put it too low, then "job slots" are not being consumed, which is bad;
- if we put it too high, then all incoming workflows are scheduled for launch, and new jobs would sit in the Kubernetes "Pending" state, which is not desirable either.
We can design a scheduling improvement mechanism that would allow us to temporarily "overload" REANA_MAX_CONCURRENT_BATCH_WORKFLOWS, so to speak, if an incoming workflow can be fitted into the number of available free "job slots" in the steady-state regime.

Theoretically, REANA_MAX_CONCURRENT_BATCH_WORKFLOWS could even be removed, and we could rely on the number of free "job slots" only, provided that we don't requeue workflows unnecessarily often (which would be another inefficiency)!

Practically, we can still keep some kind of REANA_MAX_CONCURRENT_BATCH_WORKFLOWS limit, so that if somebody submits 10k workflows, we wouldn't attempt to analyse the DAG complexity etc. for all of them all the time.

P.S. We could also think of changing the requeue policy, but that might be going too far beyond the initial focus of this issue, which was to allow small overloads of the "max concurrent" limit if the available cluster job slots allow it.
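Coming back to the proposed mechanism, one possible shape for such a gate is sketched below; the function, its parameters and the hard ceiling value are hypothetical placeholders used for illustration only, not existing REANA settings or code:

```python
# Sketch of the proposed gate (all names are hypothetical placeholders).
def can_start_workflow(running_workflows, free_job_slots, needed_job_slots,
                       max_concurrent=80, hard_ceiling=120):
    if running_workflows >= hard_ceiling:
        # Absolute cap: avoid analysing DAG complexity for e.g. 10k queued
        # workflows on every scheduling round.
        return False
    if running_workflows < max_concurrent:
        # Below the configured limit: defer to the existing memory/DAG check.
        return True
    # "Overload" regime: exceed REANA_MAX_CONCURRENT_BATCH_WORKFLOWS only if
    # the incoming workflow fits entirely into the currently free job slots.
    return needed_job_slots <= free_job_slots

# With the pMSSM figures from above (80 workflows running at the limit,
# 240 - 180 = 60 free job slots, 3 slots needed per new workflow):
print(can_start_workflow(80, 60, 3))  # -> True
```

The exact value of the hard ceiling would be a policy choice; the point is only that the soft limit becomes permeable when job slots are demonstrably free.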