
scheduler: optimise starting new workflows beyond "max concurrent" limit if job slots allow it #520

Open
tiborsimko opened this issue Jul 8, 2022 · 3 comments

@tiborsimko (Member)

Current behaviour

Historically, REANA's sole mechanism to limit the load on the cluster has been the REANA_MAX_CONCURRENT_BATCH_WORKFLOWS setting, allowing e.g. a maximum of 30 workflows to run on the cluster in parallel at any given time. Its purpose was solely to prevent cluster overload.

Later, we added workflow scheduling based on a DAG complexity estimator. It looks at each incoming workflow, its init-stage jobs and their claimed memory consumption, and compares this to the amount of memory available on the cluster, both to see whether the workflow fits into the available number of "job slots" and to pick the most appropriate waiting workflow to start from the incoming queue. (The choice is weighted by the number of workflows launched by each individual, to balance the load across users too.)
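
A minimal sketch of this fit-and-pick logic, for illustration only: the Job/Workflow classes and the pick_next_workflow helper below are hypothetical simplifications, not the actual reana-server implementation.

from dataclasses import dataclass

@dataclass
class Job:
    memory_mib: int        # claimed memory of one init-stage job

@dataclass
class Workflow:
    user: str
    init_jobs: list        # init-stage jobs of the workflow DAG

def pick_next_workflow(queued, free_memory_mib, running_per_user):
    """Return the queued workflow whose init-stage jobs fit into the
    currently free cluster memory, preferring users who have fewer
    workflows already running (the fair-share weighting)."""
    candidates = [
        wf for wf in queued
        if sum(j.memory_mib for j in wf.init_jobs) <= free_memory_mib
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda wf: running_per_user.get(wf.user, 0))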

This leads to a situation where, for a set of workflows with the following DAG complexity:

[figure: DAG of atlas-pmssm-workflow-20121208]

we can estimate the "optimal" cluster settings as follows:

  • N1 batch pod nodes with 8 CPUs per node, so that we can efficiently run N1*8 concurrent workflows;
  • N2 job pod nodes, where N2 would typically be 3*N1, because each batch pod process starts three parallel n-tupling job processes during the init phase.

For example, allowing 80 such workflows (via the REANA_MAX_CONCURRENT_BATCH_WORKFLOWS setting) would need 10 runtime batch nodes to run the 80 workflow orchestration pods, and roughly 30 job nodes. However, to be more precise, we should count the memory needs of these jobs: if one job needs 12 GiB of memory and our job nodes have 16 GiB of memory in total, we would not be able to run more than 1 job per node anyway, and instead of 30 job nodes we would need 30 x 8 = 240 job nodes! So, for a more precise estimate, it is necessary to take the amount of memory a job needs into account. Assuming a memory consumption of 2200 Mi for each job, as observed for typical pMSSM workflows in action, the cluster would need 80 x 3 = 240 "job slots", which, at an average packing rate of 5 jobs per node (because 5 x 2200 Mi is about the maximum we can occupy on 16 GB RAM nodes), means we would need about 240/5 = 48 job nodes.
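
As a back-of-the-envelope check of the numbers above (all figures are taken from the text, none are measured here):

max_workflows = 80        # REANA_MAX_CONCURRENT_BATCH_WORKFLOWS
jobs_per_workflow = 3     # parallel n-tupling jobs started during the init phase
batch_pods_per_node = 8   # one workflow orchestration pod per CPU, 8 CPUs per node
jobs_per_node = 5         # ~5 x 2200 Mi is about the practical maximum on a 16 GB node

job_slots_needed = max_workflows * jobs_per_workflow    # 240 job slots
batch_nodes = -(-max_workflows // batch_pods_per_node)  # 10 batch nodes (ceiling division)
job_nodes = -(-job_slots_needed // jobs_per_node)       # 48 job nodes (ceiling division)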

In theory, such a cluster would run with 100% of its resources consumed optimally, with 80 concurrent workflows and 240 concurrent jobs running at all times.

In practice, we tried this setup for a recent ATLAS pMSSM AllSys workflow run, and instead of seeing 80 concurrent workflows and 240 concurrent jobs, we saw 80 concurrent workflows but typically only 170-190 running jobs. This is entirely natural, since some workflow jobs take more time to finish than others, be it during the n-tupling phase or later. (And for the later jobs, we cannot predict anything about their runtimes or complexity!) When some of the n-tupling jobs finish, their workflow may still be slow to conclude because of its later jobs, so some "n-tupling job slots" are freed, but new workflows are prevented from starting there because the REANA_MAX_CONCURRENT_BATCH_WORKFLOWS limit is already at the maximum allowed 80 concurrently running workflows.

Here is the typical runtime distribution of all jobs (i.e. not only the n-tupling ones):

[figure: job runtime distribution]

Due to those few "outliers" where a job takes much longer to run, new incoming workflows and their jobs are prevented from starting.

Having 170-190 jobs running in practice out of a theoretical maximum of 240 is not too bad! It means we achieved roughly 75% CPU resource efficiency when running the full pMSSM workflow batch.

However, we may perhaps be able to do better.

Expected behaviour

Since we have a way to limit the consumption of incoming workflows by the number of "job slots", it would be a good idea to relax the REANA_MAX_CONCURRENT_BATCH_WORKFLOWS constraint a bit. For example, 240 - 180 = 60 job slots were free, which means there was room for 60 / 3 = 20 more workflows, so in the "stabilised regime" after the initial upload we could have run with a maximum limit of 100 instead of 80.

Let's look at whether we can relax the REANA_MAX_CONCURRENT_BATCH_WORKFLOWS constraint a bit:

  • if we set it too low, then "job slots" are left unconsumed, which is bad;

  • if we set it too high, then all incoming workflows are scheduled for launch and new jobs end up in the Kubernetes "Pending" state, which is not desirable either.

We can design a scheduling improvement that would allow us to temporarily exceed REANA_MAX_CONCURRENT_BATCH_WORKFLOWS, so to speak, if an incoming workflow can be fitted into the number of free "job slots" available in the steady-state regime.
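
A minimal sketch of such a condition could look as follows; the names (can_start_workflow, workflow.init_jobs, free_job_slots) are illustrative placeholders, not existing reana-server code.

def can_start_workflow(workflow, running_workflows, max_concurrent, free_job_slots):
    """Start a workflow either within the usual "max concurrent" limit,
    or beyond it if its init-stage jobs fit into the free job slots."""
    if running_workflows < max_concurrent:
        return True
    # Temporarily exceed REANA_MAX_CONCURRENT_BATCH_WORKFLOWS when the
    # cluster still has enough free job slots for the init phase.
    return len(workflow.init_jobs) <= free_job_slots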

Theoretically, REANA_MAX_CONCURRENT_BATCH_WORKFLOWS could even be removed, and we could rely on the number of free "job slots" only, provided that we do not requeue workflows unnecessarily often (which would be another inefficiency).

Practically, we could still keep some kind of REANA_MAX_CONCURRENT_BATCH_WORKFLOWS limit, so that if somebody submits 10k workflows, we would not attempt to analyse the DAG complexity etc. of all of them all the time.

P.S. We could also think of changing the requeue policy, but that might go too far beyond the initial focus of this issue, which is to allow small overloads of the "max concurrent" limit if the available cluster job slots allow it.

@VMois commented Jul 14, 2022

The complexity of this improvement depends on whether we set a maximum number of "job slots" manually (the same way as REANA_MAX_CONCURRENT_BATCH_WORKFLOWS) or whether the number should be detected automatically.

If manually, this improvement will require adding a new env variable for job slots and a further condition in the scheduler. The downside is that it will not be flexible at all. It will work pretty well for a dedicated cluster like the ATLAS one, where we can calculate job slots for the specific workflows they want to run. It might not be as good for general use cases, as different workflows may have different memory requirements, etc.

If detected automatically, the scheduling will be more flexible for ATLAS and all other users. But we need to devise the steps to determine the maximum number of job slots or workflows the cluster can run.

@VMois VMois self-assigned this Jul 14, 2022
@tiborsimko (Member, Author)

> The complexity of this improvement depends on whether we set a maximum number of "job slots" manually (the same way as REANA_MAX_CONCURRENT_BATCH_WORKFLOWS) or whether the number should be detected automatically.

I thought about doing it automatically, because we know the total amount of memory in the cluster, and we know the default memory requirements if users don't specify any. (And if a cluster admin adds or removes some nodes, the system will notice the next time the cluster node health stats are gathered.)
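
A minimal sketch of such automatic detection, assuming in-cluster access to the Kubernetes API; the default job memory value and the absence of node-role filtering are simplifying assumptions.

from kubernetes import client, config

def total_job_slots(default_job_memory_mib=2200):
    """Estimate how many "job slots" the cluster offers, based on the
    allocatable memory reported by all nodes. A real implementation
    would likely filter on the runtime-job node labels."""
    config.load_incluster_config()
    nodes = client.CoreV1Api().list_node().items
    total_mib = sum(_to_mib(n.status.allocatable["memory"]) for n in nodes)
    return total_mib // default_job_memory_mib

def _to_mib(quantity):
    """Convert a Kubernetes memory quantity such as '16Gi' or '16276212Ki' to MiB."""
    units = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024, "Ti": 1024 * 1024}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity) // (1024 * 1024)  # plain bytes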

@tiborsimko (Member, Author)

One more related observation regarding loosening the REANA_MAX_CONCURRENT_BATCH_WORKFLOWS hard constraint.

We may want to take into account the compute backends that the incoming workflow's jobs intend to use. If a workflow uses HTCondor or Slurm, then there may be reana-run-batch-... pods, but no reana-run-job-... pods.

Example of one such scenario:

$ kubectl get pod | grep -E '(NAME|reana-run)'
NAME                                                            READY   STATUS      RESTARTS   AGE
reana-run-batch-17696032-c7a7-4247-985b-63d8b73ed079--1-gm77g   2/2     Running     0          3h1m
reana-run-batch-468ff766-03b8-48e7-9d53-647723109f0b--1-4b6l5   2/2     Running     0          3h7m
reana-run-batch-48b37df7-ded9-41cf-8e22-6651921a86dd--1-4vfjz   2/2     Running     0          3h1m
reana-run-batch-77b1fcc0-ade3-44a6-b604-60098880e91e--1-ts4p8   2/2     Running     0          3h19m
reana-run-batch-ef56e5af-4606-4e9e-963a-f7f089eb3ab2--1-rd2j8   2/2     Running     0          3h13m

These pods use very little resources:

$ kubectl top pods | grep -E '(NAME|reana-run-)'
NAME                                                            CPU(cores)   MEMORY(bytes)
reana-run-batch-17696032-c7a7-4247-985b-63d8b73ed079--1-gm77g   24m          115Mi
reana-run-batch-468ff766-03b8-48e7-9d53-647723109f0b--1-4b6l5   6m           115Mi
reana-run-batch-48b37df7-ded9-41cf-8e22-6651921a86dd--1-4vfjz   1m           113Mi
reana-run-batch-77b1fcc0-ade3-44a6-b604-60098880e91e--1-ts4p8   3m           107Mi
reana-run-batch-ef56e5af-4606-4e9e-963a-f7f089eb3ab2--1-rd2j8   2m           172Mi

and contribute only to the reana-run-batch-* slot occupation, not to reana-run-job-* slot occupation.

This means that we could start many more HTC/HPC workflows, or start a lot of new K8s workflows alongside these.
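
For example, when estimating the job slot demand of an incoming workflow, jobs targeting HTCondor or Slurm could simply be excluded; a hypothetical sketch (the compute_backend field name is illustrative):

def kubernetes_job_slot_demand(workflow):
    """Count only the init-stage jobs that will run on the Kubernetes
    backend; HTCondor/Slurm jobs occupy no reana-run-job-* slots."""
    return sum(
        1 for job in workflow.init_jobs
        if getattr(job, "compute_backend", "kubernetes") == "kubernetes"
    )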
