
performance: optimise "blue spread" runtime and NotReady pod termination statuses #411

tiborsimko opened this issue Oct 27, 2021 · 8 comments

The performance tests launching 200 simple workflows defined as:

inputs:
  files:
      - reana.yaml
workflow:
  type: serial
  specification:
    steps:
      - environment: 'python:3.8-slim-buster'
        commands:
          - echo "Done"

on a system with a 1000-concurrent-workflow limit, 14 batch nodes, and 12 job nodes give the following results:

[figure: fastc_execution_progress]

[figure: fastc_histogram_pending_time]

One can see that there is a lot of "pending time" before workflows start; see the sister issue reanahub/reana-server#410.
This issue focuses on optimising the "blue spread", where some workflow runs terminate quickly whilst others take a long time.

Here is the timing information about this test run:

$ head -3 fastc_original_results.csv
name,run_number,created,started,ended,status,submit_date,submit_number
fastc-199,1,2021-10-27T12:08:08,2021-10-27T12:10:42,2021-10-27T12:10:48,finished,2021-10-27T12:08:19,199
fastc-198,1,2021-10-27T12:08:07,2021-10-27T12:10:42,2021-10-27T12:10:49,finished,2021-10-27T12:08:19,198
$ tail -3 fastc_original_results.csv
fastc-2,1,2021-10-27T12:04:30,2021-10-27T12:08:30,2021-10-27T12:09:09,finished,2021-10-27T12:08:09,2
fastc-1,1,2021-10-27T12:04:29,2021-10-27T12:08:29,2021-10-27T12:09:50,finished,2021-10-27T12:08:09,1
fastc-0,1,2021-10-27T12:04:28,2021-10-27T12:08:37,2021-10-27T12:09:18,finished,2021-10-27T12:08:09,0
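
To quantify the "blue spread", one can compute the started→ended runtimes from this CSV; here is a minimal sketch (not part of the test harness; the column names are taken from the header line above):

import csv
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"

# Read the benchmark results and compute each workflow's runtime in seconds.
with open("fastc_original_results.csv") as f:
    runtimes = sorted(
        (datetime.strptime(row["ended"], FMT)
         - datetime.strptime(row["started"], FMT)).total_seconds()
        for row in csv.DictReader(f)
    )

# The difference between the fastest and slowest run is the "blue spread".
print(f"min={runtimes[0]:.0f}s median={runtimes[len(runtimes)//2]:.0f}s "
      f"max={runtimes[-1]:.0f}s spread={runtimes[-1]-runtimes[0]:.0f}s")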

Here is a snapshot of the cluster situation just a few seconds after the start of the test run:

2021-10-27T14:08:10 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:08:10 db workflow.status count:
 status  | count 
---------+-------
 pending |     3
 queued  |    12
 created |   185
(3 rows)

2021-10-27T14:08:10 run-b pods count:
        ContainerCreating     4 ############################################################ 
2021-10-27T14:08:10 run-j pods count:
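
The snapshots above come from a simple monitoring loop combining psql and kubectl; here is a minimal sketch of one iteration (the database deployment name, credentials, and schema are assumptions; the pod name prefixes follow the pod names seen later in this issue):

$ # Count workflows per status in the REANA database:
$ kubectl exec deployment/reana-db -- psql -U reana -c \
    "SELECT status, COUNT(*) FROM __reana.workflow GROUP BY status;"
$ # Histogram of run-batch pod statuses (same idea for run-job pods):
$ kubectl get pods --no-headers | grep reana-run-batch- | \
    awk '{print $3}' | sort | uniq -c | sort -rn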

Here is the situation some 30 seconds later, when some of the workflows start to finish:

2021-10-27T14:08:42 db pg_stat_activity count:
 count 
-------
    88
(1 row)

2021-10-27T14:08:42 db workflow.status count:
  status  | count 
----------+-------
 pending  |    40
 queued   |   103
 running  |    54
 finished |     3
(4 rows)

2021-10-27T14:08:42 run-b pods count:
                  Running    79 ############################################################ 
                 NotReady    11 ######### 
              Terminating     4 #### 
        ContainerCreating     4 #### 
2021-10-27T14:08:45 run-j pods count:
        ContainerCreating    36 ############################################################ 
                Completed     8 ############## 
                  Pending     3 ##### 

Here is the situation around the "shoulder" of the plot, some 100 seconds after the start of the test run:

2021-10-27T14:09:51 db pg_stat_activity count:
 count 
-------
   167
(1 row)

2021-10-27T14:09:51 db workflow.status count:
  status  | count 
----------+-------
 running  |    24
 finished |    73
 pending  |   103
(3 rows)

2021-10-27T14:09:51 run-b pods count:
                 NotReady    97 ############################################################ 
              Terminating    41 ########################## 
                  Running    28 ################## 
        ContainerCreating     3 ## 
2021-10-27T14:09:51 run-j pods count:
        ContainerCreating    10 ############################################################ 

Here is the situation near the completion of the test run:

2021-10-27T14:10:12 db pg_stat_activity count:
 count 
-------
   168
(1 row)

2021-10-27T14:10:12 db workflow.status count:
  status  | count 
----------+-------
 running  |    42
 finished |   109
 pending  |    49
(3 rows)

2021-10-27T14:10:12 run-b pods count:
                 NotReady    84 ############################################################ 
              Terminating    57 ######################################### 
                  Running     7 ##### 
2021-10-27T14:10:12 run-j pods count:

and ten seconds later:

2021-10-27T14:10:23 db pg_stat_activity count:
 count 
-------
   150
(1 row)

2021-10-27T14:10:23 db workflow.status count:
  status  | count 
----------+-------
 running  |    39
 finished |   131
 pending  |    30
(3 rows)

2021-10-27T14:10:23 run-b pods count:
                 NotReady    69 ############################################################ 
              Terminating    65 ######################################################### 
2021-10-27T14:10:23 run-j pods count:

We can see here that even though no workflow pods are in the "Running" state any more on the Kubernetes side, some 39 workflows are still being reported as "running" in the database.

This may indicate an issue with the run batch termination tasks. When a workflow finishes, its batch pod first flips into the "NotReady" state before passing to the "Terminating" state, and the DB status does not seem to fully correlate with these phases: the DB still sees some of these workflows as "running".

This means that if a concurrency limit is set on how many workflows can run in parallel, the scheduler may not start new workflows as fast as it could. This could contribute to the "orange hill" issue, although that is not the case for the test at hand, since the concurrency limit was set very high and all 200 workflows could have started in parallel. Issue reanahub/reana-server#410 is dedicated to investigating that.

In the issue at hand, we are concerned with the apparent difference between what the DB sees as running workflows and what Kubernetes has as running pods. The observations above suggest that even when a workflow has successfully finished and produced its outputs, its pod may be "stuck" in the NotReady -> Terminating stages, delaying the DB update and thus contributing to, and somewhat "artificially" inflating, the "blue spread".

We should investigate the run batch termination tasks again and, if possible, speed up the termination procedures so that workflows are marked as "finished" in the DB as soon as possible.
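
To investigate, one can watch how long batch pods linger in those stages; for example (a minimal sketch, reusing the run-batch pod name prefix seen above):

$ # Show each run-batch pod's phase and deletion timestamp, to measure
$ # how long it lingers after the workflow itself reported "finished":
$ kubectl get pods -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,DELETED:.metadata.deletionTimestamp | grep reana-run-batch-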

(Note that this test was run with REANA_WORKFLOW_TERMINATION_QUOTA_UPDATE_POLICY set to none, so that no termination time was spent in the CPU and disk quota update parts. This was verified via quota reporting before and after the test: both reported zero.)

P.S. On another note, there were only a few cases of "Pending" pods, meaning that jobs were mostly allocated well across the cluster, so pod scheduling should contribute little to the overall "blue" runtimes reported in the plot.

@tiborsimko commented:

For reference, here are the full DB and K8s status snapshots throughout the run:

2021-10-27T14:04:21 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:04:21 db workflow.status count:
 status | count 
--------+-------
(0 rows)

2021-10-27T14:04:21 run-b pods count:
2021-10-27T14:04:21 run-j pods count:

2021-10-27T14:04:32 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:04:32 db workflow.status count:
 status  | count 
---------+-------
 created |     4
(1 row)

2021-10-27T14:04:32 run-b pods count:
2021-10-27T14:04:32 run-j pods count:

2021-10-27T14:04:42 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:04:42 db workflow.status count:
 status  | count 
---------+-------
 created |    13
(1 row)

2021-10-27T14:04:42 run-b pods count:
2021-10-27T14:04:42 run-j pods count:

2021-10-27T14:04:52 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:04:52 db workflow.status count:
 status  | count 
---------+-------
 created |    22
(1 row)

2021-10-27T14:04:53 run-b pods count:
2021-10-27T14:04:53 run-j pods count:

2021-10-27T14:05:03 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:05:03 db workflow.status count:
 status  | count 
---------+-------
 created |    32
(1 row)

2021-10-27T14:05:03 run-b pods count:
2021-10-27T14:05:03 run-j pods count:

2021-10-27T14:05:13 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:05:13 db workflow.status count:
 status  | count 
---------+-------
 created |    41
(1 row)

2021-10-27T14:05:13 run-b pods count:
2021-10-27T14:05:13 run-j pods count:

2021-10-27T14:05:24 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:05:24 db workflow.status count:
 status  | count 
---------+-------
 created |    51
(1 row)

2021-10-27T14:05:24 run-b pods count:
2021-10-27T14:05:24 run-j pods count:

2021-10-27T14:05:34 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:05:34 db workflow.status count:
 status  | count 
---------+-------
 created |    60
(1 row)

2021-10-27T14:05:34 run-b pods count:
2021-10-27T14:05:34 run-j pods count:

2021-10-27T14:05:44 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:05:44 db workflow.status count:
 status  | count 
---------+-------
 created |    69
(1 row)

2021-10-27T14:05:44 run-b pods count:
2021-10-27T14:05:45 run-j pods count:

2021-10-27T14:05:55 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:05:55 db workflow.status count:
 status  | count 
---------+-------
 created |    78
(1 row)

2021-10-27T14:05:55 run-b pods count:
2021-10-27T14:05:55 run-j pods count:

2021-10-27T14:06:05 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:06:05 db workflow.status count:
 status  | count 
---------+-------
 created |    88
(1 row)

2021-10-27T14:06:05 run-b pods count:
2021-10-27T14:06:05 run-j pods count:

2021-10-27T14:06:15 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:06:16 db workflow.status count:
 status  | count 
---------+-------
 created |    97
(1 row)

2021-10-27T14:06:16 run-b pods count:
2021-10-27T14:06:16 run-j pods count:

2021-10-27T14:06:26 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:06:26 db workflow.status count:
 status  | count 
---------+-------
 created |   107
(1 row)

2021-10-27T14:06:26 run-b pods count:
2021-10-27T14:06:26 run-j pods count:

2021-10-27T14:06:36 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:06:36 db workflow.status count:
 status  | count 
---------+-------
 created |   116
(1 row)

2021-10-27T14:06:36 run-b pods count:
2021-10-27T14:06:37 run-j pods count:

2021-10-27T14:06:47 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:06:47 db workflow.status count:
 status  | count 
---------+-------
 created |   125
(1 row)

2021-10-27T14:06:47 run-b pods count:
2021-10-27T14:06:47 run-j pods count:

2021-10-27T14:06:57 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:06:57 db workflow.status count:
 status  | count 
---------+-------
 created |   134
(1 row)

2021-10-27T14:06:57 run-b pods count:
2021-10-27T14:06:57 run-j pods count:

2021-10-27T14:07:07 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:07:07 db workflow.status count:
 status  | count 
---------+-------
 created |   144
(1 row)

2021-10-27T14:07:08 run-b pods count:
2021-10-27T14:07:08 run-j pods count:

2021-10-27T14:07:18 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:07:18 db workflow.status count:
 status  | count 
---------+-------
 created |   154
(1 row)

2021-10-27T14:07:18 run-b pods count:
2021-10-27T14:07:18 run-j pods count:

2021-10-27T14:07:28 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:07:28 db workflow.status count:
 status  | count 
---------+-------
 created |   163
(1 row)

2021-10-27T14:07:28 run-b pods count:
2021-10-27T14:07:28 run-j pods count:

2021-10-27T14:07:39 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:07:39 db workflow.status count:
 status  | count 
---------+-------
 created |   173
(1 row)

2021-10-27T14:07:39 run-b pods count:
2021-10-27T14:07:39 run-j pods count:

2021-10-27T14:07:49 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:07:49 db workflow.status count:
 status  | count 
---------+-------
 created |   182
(1 row)

2021-10-27T14:07:49 run-b pods count:
2021-10-27T14:07:49 run-j pods count:

2021-10-27T14:07:59 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:07:59 db workflow.status count:
 status  | count 
---------+-------
 created |   192
(1 row)

2021-10-27T14:07:59 run-b pods count:
2021-10-27T14:08:00 run-j pods count:

2021-10-27T14:08:10 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:08:10 db workflow.status count:
 status  | count 
---------+-------
 pending |     3
 queued  |    12
 created |   185
(3 rows)

2021-10-27T14:08:10 run-b pods count:
        ContainerCreating     4 ############################################################ 
2021-10-27T14:08:10 run-j pods count:

2021-10-27T14:08:20 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:08:20 db workflow.status count:
 status  | count 
---------+-------
 pending |    53
 queued  |   147
(2 rows)

2021-10-27T14:08:20 run-b pods count:
        ContainerCreating    33 ############################################################ 
                  Running    19 ################################### 
                  Pending     2 #### 
2021-10-27T14:08:20 run-j pods count:

2021-10-27T14:08:31 db pg_stat_activity count:
 count 
-------
    45
(1 row)

2021-10-27T14:08:31 db workflow.status count:
 status  | count 
---------+-------
 pending |    59
 queued  |   114
 running |    27
(3 rows)

2021-10-27T14:08:31 run-b pods count:
                  Running    63 ############################################################ 
        ContainerCreating    23 ###################### 
                  Pending     1 # 
2021-10-27T14:08:31 run-j pods count:
        ContainerCreating    19 ############################################################ 
                  Pending     4 ############# 
                Completed     1 #### 

2021-10-27T14:08:42 db pg_stat_activity count:
 count 
-------
    88
(1 row)

2021-10-27T14:08:42 db workflow.status count:
  status  | count 
----------+-------
 pending  |    40
 queued   |   103
 running  |    54
 finished |     3
(4 rows)

2021-10-27T14:08:42 run-b pods count:
                  Running    79 ############################################################ 
                 NotReady    11 ######### 
              Terminating     4 #### 
        ContainerCreating     4 #### 
2021-10-27T14:08:45 run-j pods count:
        ContainerCreating    36 ############################################################ 
                Completed     8 ############## 
                  Pending     3 ##### 

2021-10-27T14:08:56 db pg_stat_activity count:
 count 
-------
   119
(1 row)

2021-10-27T14:08:56 db workflow.status count:
  status  | count 
----------+-------
 pending  |    38
 queued   |    94
 running  |    58
 finished |    10
(4 rows)

2021-10-27T14:08:56 run-b pods count:
                  Running    54 ############################################################ 
                 NotReady    39 ############################################ 
              Terminating     9 ########## 
        ContainerCreating     4 ##### 
2021-10-27T14:08:56 run-j pods count:
        ContainerCreating    15 ############################################################ 
                Completed     8 ################################ 
              Terminating     2 ######## 

2021-10-27T14:09:07 db pg_stat_activity count:
 count 
-------
   126
(1 row)

2021-10-27T14:09:07 db workflow.status count:
  status  | count 
----------+-------
 pending  |    48
 queued   |    65
 running  |    62
 finished |    25
(4 rows)

2021-10-27T14:09:07 run-b pods count:
                 NotReady    70 ############################################################ 
                  Running    32 ############################ 
              Terminating    24 ##################### 
        ContainerCreating     7 ###### 
                  Pending     2 ## 
2021-10-27T14:09:07 run-j pods count:
        ContainerCreating     7 ############################################################ 
                  Pending     1 ######### 

2021-10-27T14:09:18 db pg_stat_activity count:
 count 
-------
   147
(1 row)

2021-10-27T14:09:18 db workflow.status count:
  status  | count 
----------+-------
 pending  |    68
 queued   |    38
 running  |    55
 finished |    39
(4 rows)

2021-10-27T14:09:18 run-b pods count:
                 NotReady    60 ############################################################ 
                  Running    59 ########################################################### 
              Terminating    36 #################################### 
        ContainerCreating     5 ##### 
2021-10-27T14:09:18 run-j pods count:
        ContainerCreating    23 ############################################################ 
                Completed     1 ### 

2021-10-27T14:09:29 db pg_stat_activity count:
 count 
-------
   172
(1 row)

2021-10-27T14:09:29 db workflow.status count:
  status  | count 
----------+-------
 pending  |    86
 queued   |    20
 running  |    44
 finished |    50
(4 rows)

2021-10-27T14:09:29 run-b pods count:
                 NotReady    63 ############################################################ 
                  Running    60 ########################################################## 
              Terminating    42 ######################################## 
        ContainerCreating     7 ####### 
2021-10-27T14:09:30 run-j pods count:
        ContainerCreating    27 ############################################################ 
                Completed     3 ####### 
              Terminating     1 ### 
                  Pending     1 ### 

2021-10-27T14:09:40 db pg_stat_activity count:
 count 
-------
   171
(1 row)

2021-10-27T14:09:40 db workflow.status count:
  status  | count 
----------+-------
 pending  |    92
 queued   |    12
 running  |    40
 finished |    56
(4 rows)

2021-10-27T14:09:40 run-b pods count:
                 NotReady    87 ############################################################ 
                  Running    42 ############################# 
              Terminating    40 ############################ 
        ContainerCreating     4 ### 
2021-10-27T14:09:41 run-j pods count:
        ContainerCreating     8 ############################################################ 
                Completed     2 ############### 

2021-10-27T14:09:51 db pg_stat_activity count:
 count 
-------
   167
(1 row)

2021-10-27T14:09:51 db workflow.status count:
  status  | count 
----------+-------
 running  |    24
 finished |    73
 pending  |   103
(3 rows)

2021-10-27T14:09:51 run-b pods count:
                 NotReady    97 ############################################################ 
              Terminating    41 ########################## 
                  Running    28 ################## 
        ContainerCreating     3 ## 
2021-10-27T14:09:51 run-j pods count:
        ContainerCreating    10 ############################################################ 

2021-10-27T14:10:02 db pg_stat_activity count:
 count 
-------
   175
(1 row)

2021-10-27T14:10:02 db workflow.status count:
  status  | count 
----------+-------
 running  |     7
 finished |    92
 pending  |   101
(3 rows)

2021-10-27T14:10:02 run-b pods count:
                 NotReady    87 ############################################################ 
              Terminating    46 ################################ 
                  Running    22 ################ 
2021-10-27T14:10:02 run-j pods count:
        ContainerCreating    13 ############################################################ 

2021-10-27T14:10:12 db pg_stat_activity count:
 count 
-------
   168
(1 row)

2021-10-27T14:10:12 db workflow.status count:
  status  | count 
----------+-------
 running  |    42
 finished |   109
 pending  |    49
(3 rows)

2021-10-27T14:10:12 run-b pods count:
                 NotReady    84 ############################################################ 
              Terminating    57 ######################################### 
                  Running     7 ##### 
2021-10-27T14:10:12 run-j pods count:

2021-10-27T14:10:23 db pg_stat_activity count:
 count 
-------
   150
(1 row)

2021-10-27T14:10:23 db workflow.status count:
  status  | count 
----------+-------
 running  |    39
 finished |   131
 pending  |    30
(3 rows)

2021-10-27T14:10:23 run-b pods count:
                 NotReady    69 ############################################################ 
              Terminating    65 ######################################################### 
2021-10-27T14:10:23 run-j pods count:

2021-10-27T14:10:33 db pg_stat_activity count:
 count 
-------
   132
(1 row)

2021-10-27T14:10:33 db workflow.status count:
  status  | count 
----------+-------
 running  |    21
 finished |   157
 pending  |    22
(3 rows)

2021-10-27T14:10:33 run-b pods count:
              Terminating    73 ############################################################ 
                 NotReady    44 ##################################### 
2021-10-27T14:10:33 run-j pods count:

2021-10-27T14:10:43 db pg_stat_activity count:
 count 
-------
   114
(1 row)

2021-10-27T14:10:43 db workflow.status count:
  status  | count 
----------+-------
 finished |   183
 running  |    17
(2 rows)

2021-10-27T14:10:44 run-b pods count:
              Terminating    80 ############################################################ 
                 NotReady    18 ############## 
2021-10-27T14:10:44 run-j pods count:

2021-10-27T14:10:54 db pg_stat_activity count:
 count 
-------
    92
(1 row)

2021-10-27T14:10:54 db workflow.status count:
  status  | count 
----------+-------
 finished |   200
(1 row)

2021-10-27T14:10:54 run-b pods count:
              Terminating    78 ############################################################ 
2021-10-27T14:10:54 run-j pods count:

2021-10-27T14:11:04 db pg_stat_activity count:
 count 
-------
    66
(1 row)

2021-10-27T14:11:04 db workflow.status count:
  status  | count 
----------+-------
 finished |   200
(1 row)

2021-10-27T14:11:04 run-b pods count:
              Terminating    54 ############################################################ 
2021-10-27T14:11:04 run-j pods count:

2021-10-27T14:11:15 db pg_stat_activity count:
 count 
-------
    40
(1 row)

2021-10-27T14:11:15 db workflow.status count:
  status  | count 
----------+-------
 finished |   200
(1 row)

2021-10-27T14:11:15 run-b pods count:
              Terminating    33 ############################################################ 
2021-10-27T14:11:15 run-j pods count:

2021-10-27T14:11:25 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:11:25 db workflow.status count:
  status  | count 
----------+-------
 finished |   200
(1 row)

2021-10-27T14:11:25 run-b pods count:
              Terminating     9 ############################################################ 
2021-10-27T14:11:25 run-j pods count:

2021-10-27T14:11:35 db pg_stat_activity count:
 count 
-------
    25
(1 row)

2021-10-27T14:11:35 db workflow.status count:
  status  | count 
----------+-------
 finished |   200
(1 row)

2021-10-27T14:11:35 run-b pods count:
2021-10-27T14:11:36 run-j pods count:

VMois commented Oct 27, 2021

suggestion: One possible solution, which was already mentioned during internal meetings, is to give higher priority to "finished" messages (workflows that report themselves as finished) in the job-status queue.

One thing to keep in mind is that the progress tracking should also be sent with the "finished" message. This is not the case right now for the Yadage engine (and probably not for the other engines either).
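
A minimal sketch of how such prioritisation could look, using a plain pika client (the queue name "jobs-status" and the payload shape are assumptions; REANA's actual publishing goes through reana-commons):

import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Priorities must be enabled at queue declaration time via x-max-priority;
# an existing queue would have to be re-created with this argument.
channel.queue_declare(queue="jobs-status", durable=True,
                      arguments={"x-max-priority": 10})

def publish_status(workflow_uuid, status):
    # Give "finished" messages a higher priority than ordinary progress
    # updates, so they are consumed first when the queue is backed up.
    priority = 10 if status == "finished" else 0
    channel.basic_publish(
        exchange="",
        routing_key="jobs-status",
        body=json.dumps({"workflow_uuid": workflow_uuid, "status": status}),
        properties=pika.BasicProperties(priority=priority, delivery_mode=2),
    )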

@tiborsimko commented:

One example of a NotReady pod termination:

Whilst the pod logs were already showing "finished":

$ kubectl logs reana-run-batch-fb8113f6-e80d-4b5b-9f23-84d76e39c79b-ksgz7 workflow-engine
...
2021-11-01 14:23:57,952 | adage | MainThread | INFO | adage state loop done.
2021-11-01 14:23:57,952 | adage | MainThread | INFO | execution valid. (in terms of execution order)
2021-11-01 14:23:57,952 | adage | MainThread | INFO | workflow completed successfully.
2021-11-01 14:23:57,952 | yadage.steering_api | MainThread | INFO | done. dumping workflow to disk.
2021-11-01 14:23:57,960 | yadage.steering_api | MainThread | INFO | visualizing workflow.
2021-11-01 14:23:58,593 | reana-workflow-engine-yadage | MainThread | INFO | Workflow finished. Publishing...
2021-11-01 14:23:58,595 | root | MainThread | INFO | Workflow fb8113f6-e80d-4b5b-9f23-84d76e39c79b finished. Files available at /var/reana/users/00000000-0000-0000-0000-000000000000/workflows/fb8113f6-e80d-4b5b-9f23-84d76e39c79b.

The DB status was:

$ reana-client status -w fb8113f6-e80d-4b5b-9f23-84d76e39c79b
NAME                                                            RUN_NUMBER   CREATED               STARTED               STATUS    PROGRESS
pMSSM-NoSys-1200Mi-10nodeb-56nodej-250max-21k8s-mqprio-1k-235   1            2021-11-01T13:47:51   2021-11-01T14:16:00   running   3/6

Note that the workflow was reported as "finished" only much later:

$ reana-client status -w fb8113f6-e80d-4b5b-9f23-84d76e39c79b
NAME                                                            RUN_NUMBER   CREATED               STARTED               ENDED                 STATUS     PROGRESS
pMSSM-NoSys-1200Mi-10nodeb-56nodej-250max-21k8s-mqprio-1k-235   1            2021-11-01T13:47:51   2021-11-01T14:16:00   2021-11-01T14:28:35   finished   6/6 

Note the total time difference: the workflow pod logs indicate that Yadage finished at 14:23:58, whilst the DB indicates that the workflow finished only at 14:28:35. So the pod was "unnecessarily blue" for almost five minutes.

Note also the DB progress being 3/6 when the pod says it finished... Quite out of sync?

(This is with r-commons 314 and r-w-controller 413 PRs.)

VMois commented Nov 1, 2021

  1. What is the context (local test or cluster)? How many workflows were launched?

  2. This is related to the job-status-consumer speed, so it would be beneficial to see a small snapshot of the logs and how much time passes between each processed MQ message (between the "Received" log lines). I assume quotas are disabled, but there might still be something else slowing down the consumer.

note: load on the K8s cluster might play a role, because the job-status-consumer calls the K8s API to delete run-batch pods, which can take some time.

suggestion: disable the removal of run-batch pods when a workflow finishes and see how this affects the speed of reporting (will it be much less than the 5 minutes reported above?).

  3. The possibility of an out-of-sync situation was mentioned in the comment for r-commons 314. From the logs, the out-of-sync situation is not present: the last status states finished and 6/6. As for the earlier status of running 3/6, it makes sense; I expect the job-status-consumer had not yet processed the update messages, so I would call it a delay rather than a real out-of-sync situation.

VMois commented Nov 1, 2021

  4. In addition, prefetch_count (RabbitMQ-related) equals 1 in the current PR, which makes the job-status-consumer slower. Bigger values are needed; see the sketch below.
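
A minimal sketch with a plain pika consumer (the handler body and the value 10 are illustrative; REANA's consumer is built on kombu, where the analogous setting is the consumer's prefetch_count):

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# With prefetch_count=1 the broker delivers the next message only after the
# previous one was acknowledged, serialising slow handlers; a larger window
# lets the consumer keep several messages in flight.
channel.basic_qos(prefetch_count=10)

def on_message(ch, method, properties, body):
    # ... process the status update ...
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="jobs-status", on_message_callback=on_message)
channel.start_consuming()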

@tiborsimko commented:

WRT 1, this is with the ATLAS pMSSM 1k-workflow test on a 66-node cluster; see the NAME above.

WRT 2, we are doing a lot of work in reana_workflow_controller/consumer.py's on_message(), such as a DB query to check the current status of the workflow... Here is one excerpt for illustration:

$ kubectl logs deployment/reana-workflow-controller job-status-consumer | grep -c '2021-11-01 14:28'
293
$ kubectl logs deployment/reana-workflow-controller job-status-consumer | grep  '2021-11-01 14:28:52'
2021-11-01 14:28:52,004 | root | MainThread | INFO |  [x] Received workflow_uuid: 3e1ed0f7-5da7-4bd9-a8b5-21fa198f586a status: RunStatus.running
2021-11-01 14:28:52,037 | root | MainThread | INFO |  [x] Received workflow_uuid: 4a4db77a-6177-4ba8-b836-ea9aa882fdd5 status: RunStatus.running
2021-11-01 14:28:52,071 | root | MainThread | INFO |  [x] Received workflow_uuid: e0ee4593-2b88-4b45-b6ff-72c3f66042d1 status: RunStatus.running
2021-11-01 14:28:52,094 | root | MainThread | INFO |  [x] Received workflow_uuid: adb8a0a5-79e7-4c6d-aa2e-758d35e10084 status: RunStatus.running
2021-11-01 14:28:52,129 | root | MainThread | INFO |  [x] Received workflow_uuid: a46bae97-ebce-451d-8b2f-fbab2499e092 status: RunStatus.running
2021-11-01 14:28:52,150 | root | MainThread | INFO |  [x] Received workflow_uuid: e32b2682-de4f-47fe-847b-aa7aab9472ab status: RunStatus.running
2021-11-01 14:28:52,167 | root | MainThread | INFO |  [x] Received workflow_uuid: e0132e06-4bec-4385-9752-a6e74c1f74c4 status: RunStatus.running
2021-11-01 14:28:52,203 | root | MainThread | INFO |  [x] Received workflow_uuid: ef250fd8-49d9-4802-bb30-286a25bfacb5 status: RunStatus.running
2021-11-01 14:28:52,238 | root | MainThread | INFO |  [x] Received workflow_uuid: e2124a59-5c64-4b56-83d0-44e96ba3b6cf status: RunStatus.running
2021-11-01 14:28:52,260 | root | MainThread | INFO |  [x] Received workflow_uuid: 80702651-bc28-4eb2-8e66-c6c891e8f61e status: RunStatus.running
2021-11-01 14:28:52,282 | root | MainThread | INFO |  [x] Received workflow_uuid: 1b337af1-9e51-49cd-b8ee-7e4c8f6b8d50 status: RunStatus.running
2021-11-01 14:28:52,314 | root | MainThread | INFO |  [x] Received workflow_uuid: 68b0ad92-6bf9-4338-a11a-2d45bf8a3cbe status: RunStatus.running
2021-11-01 14:28:52,337 | root | MainThread | INFO |  [x] Received workflow_uuid: c516b730-6fea-46fb-888a-0e9b4c309959 status: RunStatus.running
2021-11-01 14:28:52,375 | root | MainThread | INFO |  [x] Received workflow_uuid: 8e482984-f63d-4b5f-9588-07c276f890b1 status: RunStatus.finished

It is much easier to profile "live" locally to see where the time is spent.
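
For such live profiling, one could wrap the handler in cProfile; a minimal sketch (assuming on_message(body, message) can be imported from reana_workflow_controller/consumer.py; the wrapper itself is illustrative):

import cProfile
import pstats

profiler = cProfile.Profile()

def profiled_on_message(body, message):
    # Profile one handler invocation and print the ten most expensive
    # calls (DB queries, K8s API calls, ...) by cumulative time.
    profiler.enable()
    try:
        on_message(body, message)  # the real handler from consumer.py
    finally:
        profiler.disable()
        pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)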

WRT 3, one could call it "severely delayed": I have seen workflows which were finished, i.e. the batch pod reported done and the output files were present, whilst the DB indicated not just running 3/6, but even running 1/6, as if the workflow were just starting!

@tiborsimko commented:

WRT 4, I ran with the exact versions of r-w-controller@pr-413 and r-commons@pr-314.

VMois commented Nov 30, 2021

@tiborsimko what should we do with this issue? Do we still have the same problem with benchmarks on a bigger cluster? If not, we can close this issue.

P.S. We had a PR that optimises the deletion of batch pods, and it looks like it solved some of the pain points (comment).
