Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Add timeout to contexts in client calls #6125

Open
wants to merge 18 commits into
base: v2
Choose a base branch
from

Conversation

sakoush
Copy link
Member

@sakoush sakoush commented Dec 4, 2024

What this PR does / why we need it:

This PR introduced timeouts to client grpc calls in the places that they make sense. This will allow the system in some cases not to wait indefinitely for a response. One example is when we are making a model infer call in `modelgateway.
There are other places where we have introduced a timeout logic on top of GRPC calls and in these cases we left them as they are.

Which issue(s) this PR fixes:

Fixes # INFRA-1252 (internal)

Special notes for your reviewer:

@sakoush sakoush added the v2 label Dec 4, 2024
@sakoush sakoush requested a review from lc525 as a code owner December 4, 2024 14:03
@sakoush sakoush force-pushed the INFRA-1252/fix_context_modelgateway branch from a487916 to 34a4c27 Compare December 4, 2024 16:21
@sakoush sakoush force-pushed the INFRA-1252/fix_context_modelgateway branch from 21a12c1 to 6033617 Compare December 9, 2024 21:25
return nil
})
if retryErr != nil {
logger.Error(err, "Failed to remove finalizer after retries")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have !event.Active but latestExperiment.ObjectMeta.DeletionTimestamp.IsZero() ? Because in that case we don't even attempt to remove the finalizer, and this error message might be confusing. For example, I don't know if RetryOnConflict doesn't return an error on context timeout. I suppose we still have the underlying error wrapped here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also should we continue after logging here (rather than continuing with the status update)?

@@ -161,10 +161,12 @@ func (s *SchedulerClient) SubscribeModelEvents(ctx context.Context, grpcClient s
// Handle terminated event to remove finalizer
if canRemoveFinalizer(latestVersionStatus.State.State) {
retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
latestModel := &v1alpha1.Model{}
ctxWithTimeout, cancel := context.WithTimeout(ctx, constants.K8sAPICallTimeout)
Copy link
Member

@lc525 lc525 Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I've realised here (but it's true for the other places where you have introduced the context) is that, despite the name K8sAPICallTimeout covers not just one k8s API call but multiple calls done in one function. So in the example here, both the s.Get(...) and s.Update(...) need to complete within a total time < than the timeout. Especially true when we're passing the same timeout context further down the call stack as is the case with updateModelStatus. Might still be fine given that it's set to a generous 2 minutes by default

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the intention to simplify things. I can come up with a better naming for the constant to make it clearer.

@@ -67,7 +67,7 @@ parameters:
t.Run(test.name, func(t *testing.T) {
fakeClientset := fake.NewSimpleClientset(test.secret)
s := NewSecretsHandler(fakeClientset, test.secret.Namespace)
data, err := s.GetSecretConfig(test.secretName)
data, err := s.GetSecretConfig(test.secretName, 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this converted into a "1 nanosecond" Duration by default? Even for a mock client, perhaps it would be better to set it to 1 * Millisecond to avoid any flaky tests.


// inference
const (
InferTimeoutDefault = 10 * time.Minute // TODO: expose this as a config (map)?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I would like us to expose the inference timeout more explicitly and preferably in one place. Atm it's not entirely clear (at least to me) where things timeout (envoy? mlserver? agent? one of the gateways? etc). Perhaps we should add a jira ticket to handle this.

Copy link
Member

@lc525 lc525 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, I feel more confident with having timeouts everywhere, but I think this change will require intensive testing before the next release, to make sure things are solid (for example, under network partitions, slow connections, etc). Also, I think we might need to expose some of those timeouts as settings further down the line.

I've left a couple of comments/clarifying questions, but all are minor. The git diff did not help on this particular PR, as lots of code appeared changed but most was just moved slightly. Hopefully I haven't missed anything obvious.

Thank you very much for taking this on, it was long overdue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants