
Re-use the last desired patch on empty batch #3453

Merged
Link- merged 3 commits into master from nikola-jokic/reuse-last-known-state on May 17, 2024

Conversation

nikola-jokic
Member

Fixes #3450

@nikola-jokic added the gha-runner-scale-set label (Related to the gha-runner-scale-set mode) on Apr 18, 2024
@verdel

verdel commented Apr 20, 2024

@nikola-jokic, a quick question. Do I understand correctly that the empty message occurs because, when the GetMessage function is called and there are no messages in the queue, the server returns http.StatusAccepted once the 50-second long-poll connection times out? Or does the server actually respond with http.StatusOk but with an empty body?

Do I also understand correctly that with each empty message, the PatchID in the EphemeralRunnerSet will be reset to 0?

I'm asking all these questions because it seems we've encountered a problem that arises when several factors coincide:

  • We have a GitHub Workflow that is waiting for a Runner.
  • We create an AutoscalingRunnerSet in the cluster, and a Pod with a listener is launched.
  • The listener receives TotalAssignedJob and patches the EphemeralRunnerSet.
  • On the same iteration, a request is made to retrieve a message from the queue, and a ScaleDown patch is created for the EphemeralRunnerSet due to a nil message

After this, the GitHub Workflow remains in a queued state until another job for this type of runner appears, because the runner didn't have time to process the job due to the ScaleDown patch. Once a new job appears, the process will repeat, and there will be two jobs in the queue waiting to be executed.

According to the code in this PR, it should correct the situation: when a nil message is received, the patch will contain the number of replicas from the previous patch.

@nikola-jokic
Member Author

Hey @verdel,

You are 100% right. It was a mistake on my end: on an empty batch, I forced the minimum required number of runners to avoid wasting resources. However, I did not account for the case where a runner can be pending for the duration of the empty batch. This change instead "forces" the last desired state the listener knows about.
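
For illustration, a minimal Go sketch of the behavioural change described above. The helper functions are hypothetical (not the actual listener code); the names lastPatch and MinRunners follow the terms used in this thread.

package main

import "fmt"

// Hypothetical helpers (not the actual listener code) contrasting the old and
// new behaviour on an empty batch.

// Before the fix: an empty batch forced the minimum number of runners, which
// could scale away a runner that was still pending for an already assigned job.
func targetOnEmptyBatchBefore(minRunners int) int {
	return minRunners
}

// After the fix: an empty batch re-uses the last desired state the listener
// knows about, so a pending runner survives until it picks up its job.
func targetOnEmptyBatchAfter(lastPatch, minRunners int) int {
	return max(lastPatch, minRunners) // max is the Go 1.21+ builtin
}

func main() {
	// One job assigned (last patch was 1), MinRunners is 0, an empty batch arrives:
	fmt.Println(targetOnEmptyBatchBefore(0))   // 0: the pending runner would be removed
	fmt.Println(targetOnEmptyBatchAfter(1, 0)) // 1: the pending runner is kept
}

With the old behaviour the pending runner's pod could be removed before it ever picked up the job, which is the situation described earlier in this thread.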

@verdel

verdel commented Apr 22, 2024

Is it normal behavior that the PatchID in the EphemeralRunnerSet will be reset to 0 for each empty batch?

@nikola-jokic
Member Author

It should be okay, since patch ID 0 would only mean: "make sure this state is matched". However, your question just gave me a better idea. Let me test it out and I'll get back to you.

@verdel

verdel commented Apr 22, 2024

It should be okay, since patch ID 0 would only mean: "make sure this state is matched"

Thank you very much for the detailed explanation.

Comment on lines +234 to +241
if targetRunnerCount == w.config.MinRunners {
// We have an empty batch, and the last patch was the min runners.
// Since this is an empty batch, and we are at the min runners, they should all be idle.
// If the controller created a few more pods by accident (during scale down events),
// this situation allows the controller to scale down to the min runners.
// However, it is important to keep the patch sequence increasing so we don't ignore one batch.
desiredPatchID = 0
}

It seems to me that this code is unnecessary. We want to trigger a reconcile when we receive an empty batch and targetRunnerCount == w.config.MinRunners by changing the PatchID value, but this will happen anyway because of this code:

w.patchSeq++
desiredPatchID := w.patchSeq

Why reset this counter to zero?

Member Author


The purpose of setting patchID to 0 is to remove ephemeral runners that were created but shouldn't have been.
There is a rare case where the listener receives a scale down event. Let's say the desired state was 5 and 1 job has finished.
During the patch, 2 ephemeral runners finish.
This situation would cause the ephemeral runner set controller to create 1 more pod (we match the current state of the cluster, which is 3, against the patch that says we need a state of 4). It is a very rare case, but it can happen when runners finish at the same time. At that point, the ephemeral runner set will create 1 more runner.

If the cluster is busy, this situation can actually improve performance, since on the next job you will already have a runner ready to pick it up, so it is re-used.
But when your cluster becomes idle, that is when we really have to remove these runners that were created due to the timing difference. So to self-correct, we force the min runners state once they are all idle.
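
As a back-of-the-envelope illustration of that race, using the numbers from the example above (this is not controller code, just arithmetic):

package main

import "fmt"

func main() {
	// The desired state was 5 and one job finished, so the listener patched the
	// EphemeralRunnerSet down to 4 replicas.
	patchedReplicas := 4

	// While the patch was being applied, two runners finished and their pods
	// were cleaned up, leaving 3 runners in the cluster.
	runnersInCluster := 3

	// The controller matches the cluster against the patch and creates the
	// difference: one runner nobody asked for, which patch ID 0 later removes
	// once the scale set is idle.
	fmt.Println("extra runners created by the race:", patchedReplicas-runnersInCluster) // 1
}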

@verdel

verdel commented Apr 22, 2024

@nikola-jokic, I still have a question; perhaps you could help with some additional information.

Does the empty message (empty batch) occur because, when the GetMessage function is called and there are no messages in the queue, the server returns http.StatusAccepted once the 50-second long-poll connection times out? Or does the server actually respond with http.StatusOk but with an empty body?

@nikola-jokic
Member Author

Thank you, @verdel, for asking all the right questions! This question made me realize that we don't need to force the state on every empty batch. Whenever we scale up, we are matching the desired state. Scale down is already handled by the very nature of the ephemeral runner, so we simply need to clean them up. Only when the scale set is completely idle, i.e. there are no jobs coming and there are no busy runners, do we "force" the min runners state. This allows the controller to self-correct.

@verdel

verdel commented Apr 22, 2024

Only when the scale set is completely idle, i.e. there are no jobs coming and there are no busy runners, do we "force" the min runners state

How will you determine the current number of busy runners and that it equals zero? It seems you want to obtain the number of active runners every time you receive an empty batch (every 50 seconds), and if it equals zero, then patch the EphemeralRunnerSet to set Replicas=0. Am I understanding this correctly?

Since I am not very familiar with the specifics of the controller's operation, I cannot fully understand whether the following situation might recur due to an incorrect determination of the number of busy runners:

  • We have a runner pod in pending status, created due to a task appearing in the queue
  • We do not count it as active/busy and will create a patch with Replicas=0 at the next empty batch

@nikola-jokic
Member Author

I'll try to explain my reasoning.

When the job is received, the number of runners is numberOfJobs + minRunners. Let's say minRunners == 0 and numberOfJobs == 1. The listener would issue a patch for target replicas 1 and set the lastPatch to 1.

Now, we start creating a pod, and it is in a pending state. The next batch comes in, and it is empty, so count == 0 && jobsCompleted == 0.
We would hit the if case, and targetRunnerCount = max(w.lastPatch, targetRunnerCount) would resolve to w.lastPatch. The lastPatch can only go down when the jobsCompleted count is > 0. Since lastPatch is 1 and minRunners is 0, we would only increment the patchID. This ensures that when we have busy runners (or runners that are about to be busy), we don't scale down. At this point, we probably don't even need to issue a patch, but after 50s we can trigger the check to ensure that, if someone deleted the ephemeral runner resource by accident, we can recover from that.

Now let's say that the job is done. We get the message saying we need 0 runners, and 1 job is done. We are not hitting the if case, and we calculate the targetRunnerCount, which would be 0. This value is set as the last patch, and patchID is incremented. On the next empty batch, we are hitting the same if case. At this point, the number of runners must be the number of idle runners (i.e. the minRunners count). The reason all of them must be idle is:

  1. The last patch was to minRunners, so the Actions back end told us that we need 0 runners, and there were runners that had finished.
  2. targetRunnerCount must always be >= minRunners.
  3. The only way lastPatch can be less than the previous lastPatch is when we have jobsCompleted, so we can't scale down before the runner picks up the job.
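
For illustration, a small self-contained Go sketch that replays this walkthrough. It is simplified and hypothetical; the field names (lastPatch, patchSeq, minRunners) mirror the discussion, not the exact listener implementation.

package main

import "fmt"

// listener is a simplified, hypothetical stand-in for the worker discussed above.
type listener struct {
	minRunners int
	lastPatch  int // last desired replica count sent to the EphemeralRunnerSet
	patchSeq   int // monotonically increasing patch sequence
}

// step handles one batch and returns the desired replicas and the patch ID.
func (l *listener) step(count, jobsCompleted int) (target, patchID int) {
	target = count + l.minRunners
	if count == 0 && jobsCompleted == 0 {
		// Empty batch: re-use the last desired state.
		target = max(l.lastPatch, l.minRunners) // Go 1.21+ builtin max
	}
	l.lastPatch = target
	l.patchSeq++
	patchID = l.patchSeq
	if count == 0 && jobsCompleted == 0 && target == l.minRunners {
		// Completely idle: patch ID 0 forces the min-runners state so the
		// controller can clean up runners created by the timing race.
		patchID = 0
	}
	return target, patchID
}

func main() {
	l := &listener{minRunners: 0}
	batches := []struct{ count, jobsCompleted int }{
		{1, 0}, // job assigned: scale to 1
		{0, 0}, // empty batch while the pod is still pending: stay at 1
		{0, 1}, // job finished: back to minRunners (0)
		{0, 0}, // empty batch while idle: patch ID 0 forces the min state
	}
	for _, b := range batches {
		target, id := l.step(b.count, b.jobsCompleted)
		fmt.Printf("count=%d jobsCompleted=%d -> target=%d patchID=%d\n",
			b.count, b.jobsCompleted, target, id)
	}
}

Running it keeps target=1 for the empty batch while the pod is pending, and produces patchID=0 only on the final idle batch, matching the three points above.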

@verdel

verdel commented Apr 23, 2024

Let's summarize to make sure we understand all the cases in the same way.

Originally, the whole idea of this patch was to restore the number of runners in the case where a runner in a pending state was terminated instead of a runner that had completed its task and was in the process of finishing up. It is also necessary to handle the case where something forcibly terminated a runner that had not yet completed its task.

In general, the only way to handle all cases is to patch the EphemeralRunnerSet regardless of whether there is an event from GitHub. You use an empty batch for this.

These empty batches result from long polling requests to the GitHub API and occur every 50 seconds. The GitHub API on its end has a timeout of 50 seconds and apparently returns HTTP 202 if there are no tasks in the queue within 50 seconds from the start of the request.
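
For illustration only, a hypothetical long-poll client sketch that would surface this behaviour as an empty batch. This is not the actual github/actions client; the package name, function name, and request details are made up.

package ghaclient

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// getMessage is a made-up helper: it long-polls a message endpoint and returns
// a nil message when the server answers 202 Accepted after its ~50s timeout,
// i.e. the queue was empty, which the listener then treats as an empty batch.
func getMessage(ctx context.Context, c *http.Client, url, token string) (json.RawMessage, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := c.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusAccepted:
		// Long poll expired with nothing in the queue: empty batch.
		return nil, nil
	case http.StatusOK:
		var msg json.RawMessage
		if err := json.NewDecoder(resp.Body).Decode(&msg); err != nil {
			return nil, err
		}
		return msg, nil
	default:
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
}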

Now, a question about the operation of the controller that monitors the EphemeralRunnerSet. Let's imagine the situation:

  • The MinRunners value in the EphemeralRunnerSet is 0
  • A job is started in GitHub Actions
  • An event comes in: TotalAssignedJobs=1, jobsCompleted=0 (am I understanding this correctly?)
  • The Replicas value increases to 1, an ephemeral runner pod starts, and it begins performing the task.
  • Another job starts in GitHub Actions.
  • An event comes in: TotalAssignedJobs=2, jobsCompleted=0 (am I understanding this correctly?)
  • The Replicas value increases to 2, another ephemeral runner pod starts, and it is in a Pending state
  • The first ephemeral runner pod reports completing the work but hasn't finished yet. We receive an event: TotalAssignedJobs=1, jobsCompleted=1 (am I understanding this correctly?)
  • The Replicas value decreases to 1, and the controller monitoring the EphemeralRunnerSet might terminate the pod in Pending state

But if at this moment the EphemeralRunnerSet indicates Replicas=1, why wouldn't the controller create 1 pod again after the first pod finishes?

The same happens in the case where the EphemeralRunnerSet indicates Replicas=1, and we forcibly terminate an ephemeral runner pod. Why wouldn’t the controller restore the number of replicas to the number indicated in the EphemeralRunnerSet?

Overall, the implementation outlined in this PR solves the problem. If we want to change the EphemeralRunnerSet to forcibly start the Reconcile process, which will align the number of available ephemeral runner pods with the Replicas value, the only available method is to change the PatchID (since Replicas in this case does not change).

Resetting the PatchID to 0 is also a reasonable solution (it will also solve the issue of this counter overflowing). I was initially a bit confused by the name of the PatchID attribute and the fact that it increments for non-empty batches and resets to 0 for empty ones. It took me a while to see that it is not really a "revision id", which in theory should only increase and never decrease. In reality, it is just a trigger that initiates a forced Reconcile, and its value only needs to differ from the previous one.

To repeat, the only question I have left is why the controller responsible for processing the EphemeralRunnerSet wouldn't align the number of runner pods with the Replicas value without a forced Reconcile triggered by changing the EphemeralRunnerSet.

@nikola-jokic
Member Author

This summary is almost completely correct. We do not scale based on jobs assigned; we scale based on the statistics. On each message that is not an empty batch, the server lets us know the desired number of runners (not counting the minRunners) needed to do the jobs. So in your example, when the new job comes in, you would have count == 2.

To answer your question, the controller should not always match the number of replicas, because the time it takes for the listener to patch the new desired replicas can be, and usually is, greater than the time in which the controller can react to ephemeral runner changes.

Each change on an ephemeral runner triggers the reconciliation loop of the ephemeral runner set. If we tried to match the state each time a pod is done, we would end up creating a pod, then the patch to scale down would arrive, and then we would have to clean it up.

Also, an important thing to note: the patch ID helps avoid these problems when there is a job spike. When we start creating ephemeral runners, we tag them with the patch ID. Then each transition from pending to running does not trigger an unnecessary count check and postpones ephemeral runner cleanups. This ensures that ARC scales up as quickly as possible, since each cleanup would again trigger the reconciliation loop, slowing the reaction to scaling up.

Every state change is communicated by the listener, so only then does the ephemeral runner controller need to correct the state. Also, when we start scaling down, these checks are performed more frequently, since the maximum of all patchIDs across ephemeral runners can only be less than or equal to the patch ID of the ephemeral runner set (it will be less when the ephemeral runners with the maximum patch ID are done and cleaned up). However, in this case we can tolerate these checks, and this is when we can potentially overscale. That is why we need patch ID 0 to self-correct when the cluster is idle.

@verdel

verdel commented Apr 23, 2024

I carefully reviewed the code of the ephemeral runner set controller and saw how the PatchID is used. Thank you very much for the detailed response.

I was puzzled by this part of the code. In what situation could count become less than or equal to zero if the case condition is total > ephemeralRunnerSet.Spec.Replicas?

Collaborator

@Link- left a comment


Let's move forward with this right now, but as we discussed, we should invest time in improving the maintainability of this down the line.

@Link- merged commit ab92e4e into master on May 17, 2024
15 checks passed
@Link- deleted the nikola-jokic/reuse-last-known-state branch on May 17, 2024 at 13:12
Labels
gha-runner-scale-set (Related to the gha-runner-scale-set mode)
Development

Successfully merging this pull request may close these issues.

0.9.1 does not work with on-demand provisioned nodes
3 participants