
Add pod label with allocation state #3664

Open
towolf opened this issue Feb 20, 2024 · 8 comments
Labels
kind/feature New features for Agones

Comments

@towolf (Contributor) commented Feb 20, 2024

We collect pod-level performance metrics with kubelet and cAdvisor. This is useful for us to rightsize resource requests and to maximize utilization of the clusters. However to make the calculation more accurate it would be good if we could discriminate idle Ready pods from Allocated ones.

In this picture you can see that the Ready pods in the buffer form a band at the bottom of the graph, and this band distorts the percentiles and averages.

[image: pod resource usage graph; the Ready pods form a band at the bottom]

To exclude these pods from the data, I need to be able to select on a label that holds the allocation status, preferably on the pod itself, so that it is attached to the metrics.

Pod labels are mutable, so my idea was that the Agones GameServer controller could add and keep updated an allocated=true|false label on the pod.

If I understand this well enough, this would be an additional label that does not interfere with Agones operation, although it would double the cardinality of the metrics.

What do you think? Is this feasible?
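For illustration, a controller could apply such a label with a strategic merge patch against the Pod. The `agones.dev/allocated` key below is hypothetical (Agones does not set it today); this sketch only builds the patch body that a controller would then send via the Kubernetes API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildAllocationLabelPatch returns a strategic-merge-patch body that a
// controller could send to a Pod to set a hypothetical
// "agones.dev/allocated" label to "true" or "false".
func buildAllocationLabelPatch(allocated bool) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"labels": map[string]string{
				"agones.dev/allocated": fmt.Sprintf("%t", allocated),
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	body, _ := buildAllocationLabelPatch(true)
	// Prints: {"metadata":{"labels":{"agones.dev/allocated":"true"}}}
	fmt.Println(string(body))
}
```

With client-go, a controller would pass this body to the Pods client's Patch call using the strategic merge patch type; the metrics pipeline could then select on the label value.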

@towolf added the kind/feature label on Feb 20, 2024
@roberthbailey (Member)

The only other concern I have is that, since it doesn't appear we currently update the Pod resource when we allocate (source), this would increase the load on the Kubernetes control plane by requiring an update to a second object in etcd. This could have performance implications for allocation throughput.

@towolf (Contributor, Author) commented Feb 27, 2024

Would that really be that bad?

Just today I wished there were another thing we could update when we allocate: the safe-to-evict annotation. That one is rather more annoying. For us, Ready pods are safe to evict, but Allocated ones aren't.

So I guess, according to your statement, that is also not possible?

@roberthbailey (Member)

Would that really be that bad?

It depends on your use case. When Agones was first envisioned, the common scenario was game sessions lasting tens of minutes, so doing a bit of extra work on an allocation wasn't a big deal. However, we now also see use cases where game sessions last tens of seconds and a high allocation rate is critical.

@roberthbailey (Member)

Just today I wished there were another thing we could update when we allocate: the safe-to-evict annotation. That one is rather more annoying. For us, Ready pods are safe to evict, but Allocated ones aren't.

@zmerlynn - Have we considered allowing Ready game servers to be evicted?

I think there is a race condition here:

  1. The cluster autoscaler decides to remove a node and evicts all pods.
  2. The game server gets allocated and Agones applies the annotation.
  3. The pod gets drained and the allocated game server is shut down abruptly.

With Kubernetes being eventually consistent, there isn't a way to guarantee that the cluster autoscaler won't decide to evict and remove a node while game servers are being allocated.

@markmandel (Member)

So to the original comment on this post:

Would that really be that bad?

Yes 😄 any extra load affects performance.

That being said, if this is something you want, I would strongly advocate for building it as a third-party controller that you add to your Agones cluster (and please open source it! We'd be happy to give it a listing on the website). If it becomes something that's utilised by a large majority of users, then we can consider it for contribution to core.

We love our ecosystem projects: https://agones.dev/site/docs/third-party-content/libraries-tools/

And this wouldn't be a complicated controller to write either.
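As a rough illustration of how small the core of such a controller could be, it would watch GameServers and reconcile a label on the backing Pod. The mapping below is a guess at the shape of that logic; the `agones.dev/allocated` label key is hypothetical, and the GameServer state is represented here as the plain string from `status.state`:

```go
package main

import "fmt"

// allocatedLabel is a hypothetical label key; Agones does not set this itself.
const allocatedLabel = "agones.dev/allocated"

// desiredValue maps a GameServer status.state to the label value a
// labeling controller would want on the backing Pod.
func desiredValue(state string) string {
	if state == "Allocated" {
		return "true"
	}
	return "false"
}

// needsUpdate reports whether the Pod's labels must be patched to match
// the GameServer state, and what the label value should be.
func needsUpdate(podLabels map[string]string, state string) (bool, string) {
	want := desiredValue(state)
	return podLabels[allocatedLabel] != want, want
}

func main() {
	labels := map[string]string{} // freshly created Pod: no label yet
	if update, want := needsUpdate(labels, "Ready"); update {
		labels[allocatedLabel] = want // in a real controller: patch the Pod
	}
	fmt.Println(labels[allocatedLabel]) // prints "false" for a Ready server
}
```

The rest of the controller is plumbing: an informer on GameServers, a workqueue, and a Pod patch when `needsUpdate` returns true.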

Just today I wished there were another thing we could update when we allocate: the safe-to-evict annotation. That one is rather more annoying. For us, Ready pods are safe to evict, but Allocated ones aren't.

@roberthbailey captured the potential race condition perfectly.

But if you want to control eviction, https://agones.dev/site/docs/advanced/controlling-disruption/ is the place to read.

@zmerlynn (Collaborator)

@towolf What's your session length? And this is GKE if I remember, right?

Adding safe-to-evict=false on allocation is infeasible due to the race that @roberthbailey mentioned. The best we could do in that situation is to insist on a long terminationGracePeriodSeconds, but the cluster autoscaler by default ignores a terminationGracePeriodSeconds greater than 10 minutes, so that doesn't entirely close the race. If you're on GKE we might have options to talk about, though; happy to chat on Slack.

@towolf (Contributor, Author) commented Feb 28, 2024

@zmerlynn We have a wide range of session lengths, some are only a minute, some go on for hours. We kill any gameservers that are still allocated after 24 hours. We have shared "lobby" servers where players meet and can hang out all day. So, I'm not sure if tGPS is suitable for us at all. We're considering somehow shortening the sessions, but they'll never be shorter than 10 minutes.

We are mostly on GKE now, but our bare-metal contingent is on RKE2.

Here's the spectrum:

[images: charts of the session length distribution]

@markmandel (Member)

As an FYI, we're working on #3680 to show how to write a small controller to drop into a cluster, to lower the barrier to entry there (the code is remarkably small - you can do it in one main.go).
