fix: Add timeout to contexts in client calls #6125
base: v2
Conversation
			return nil
		})
		if retryErr != nil {
			logger.Error(err, "Failed to remove finalizer after retries")
Can we have `!event.Active` but `latestExperiment.ObjectMeta.DeletionTimestamp.IsZero()`? Because in that case we don't even attempt to remove the finalizer, and this error message might be confusing. For example, I don't know whether `RetryOnConflict` returns an error on context timeout. I suppose we still have the underlying error wrapped here.
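For reference, a minimal runnable sketch (not the PR's code) of how client-go's `retry.RetryOnConflict` handles a context timeout: it retries only on conflict errors, so a `DeadlineExceeded` error surfaced by the wrapped call is returned immediately on the first attempt:

```go
package main

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/util/retry"
)

func main() {
	// Create a context that is already expired by the time the retry loop runs.
	ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()
	time.Sleep(2 * time.Millisecond)

	retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// A real body would call something like s.Get(ctx, ...); here we just
		// surface the context error the same way a client call would.
		return ctx.Err()
	})

	// Not a conflict error, so RetryOnConflict gives up immediately and
	// returns the underlying context.DeadlineExceeded.
	fmt.Println(retryErr, apierrors.IsConflict(retryErr))
}
```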
Also, should we `continue` after logging here (rather than continuing with the status update)?
@@ -161,10 +161,12 @@ func (s *SchedulerClient) SubscribeModelEvents(ctx context.Context, grpcClient s
 		// Handle terminated event to remove finalizer
 		if canRemoveFinalizer(latestVersionStatus.State.State) {
 			retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 				latestModel := &v1alpha1.Model{}
+				ctxWithTimeout, cancel := context.WithTimeout(ctx, constants.K8sAPICallTimeout)
One thing I've realised here (and it's true for the other places where you have introduced the context) is that, despite the name, `K8sAPICallTimeout` covers not just one k8s API call but multiple calls made within one function. So in the example here, both `s.Get(...)` and `s.Update(...)` need to complete within a total time less than the timeout. This is especially true when we pass the same timeout context further down the call stack, as is the case with `updateModelStatus`. It might still be fine given that it's set to a generous 2 minutes by default.
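To illustrate with a self-contained sketch (the `step` helper below is a hypothetical stand-in for calls like `s.Get(...)` and `s.Update(...)`): a single `context.WithTimeout` sets one deadline that all calls made with that context share.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// step stands in for a single k8s API call that either finishes after d
// or aborts when the shared context expires.
func step(ctx context.Context, d time.Duration) error {
	select {
	case <-time.After(d):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// One 100ms budget for everything done with this context.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	// The first call consumes 70ms of the budget, leaving ~30ms for the
	// second, which therefore fails with "context deadline exceeded".
	fmt.Println(step(ctx, 70*time.Millisecond))
	fmt.Println(step(ctx, 70*time.Millisecond))
}
```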
This is intentional, to simplify things. I can come up with a better name for the constant to make it clearer.
@@ -67,7 +67,7 @@ parameters:
 		t.Run(test.name, func(t *testing.T) {
 			fakeClientset := fake.NewSimpleClientset(test.secret)
 			s := NewSecretsHandler(fakeClientset, test.secret.Namespace)
-			data, err := s.GetSecretConfig(test.secretName)
+			data, err := s.GetSecretConfig(test.secretName, 1)
Isn't this converted into a "1 nanosecond" `Duration` by default? Even for a mock client, perhaps it would be better to set it to `1 * Millisecond` to avoid any flaky tests.
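A quick illustration of the point: in Go, an untyped constant `1` used where a `time.Duration` is expected means one nanosecond.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// The bare constant 1 becomes a 1-nanosecond Duration.
	var d time.Duration = 1
	fmt.Println(d) // 1ns

	// A more forgiving value for tests, as suggested above.
	fmt.Println(1 * time.Millisecond) // 1ms
}
```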
 // inference
 const (
 	InferTimeoutDefault = 10 * time.Minute // TODO: expose this as a config (map)?
In general, I would like us to expose the inference timeout more explicitly, and preferably in one place. At the moment it's not entirely clear (at least to me) where things time out (envoy? mlserver? agent? one of the gateways? etc.). Perhaps we should add a Jira ticket to handle this.
LGTM. I feel more confident with having timeouts everywhere, but I think this change will require intensive testing before the next release to make sure things are solid (for example, under network partitions, slow connections, etc.). Also, I think we might need to expose some of those timeouts as settings further down the line.
I've left a couple of comments/clarifying questions, but all are minor. The git diff did not help on this particular PR, as lots of code appeared changed but most was just moved slightly. Hopefully I haven't missed anything obvious.
Thank you very much for taking this on; it was long overdue.
What this PR does / why we need it:
This PR introduces timeouts for client gRPC calls in the places where they make sense. This allows the system, in some cases, to avoid waiting indefinitely for a response. One example is making a model infer call in `modelgateway`.
There are other places where timeout logic already exists on top of gRPC calls; those were left as they are.
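As a rough illustration of the pattern being applied (the names `callInfer` and `inferTimeout` below are hypothetical stand-ins, not the PR's actual identifiers):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// inferTimeout mirrors the kind of constant the PR introduces
// (e.g. InferTimeoutDefault); the value is illustrative.
const inferTimeout = 10 * time.Minute

// callInfer stands in for a gRPC client call; real gRPC stubs take a
// context as their first argument in exactly the same way.
func callInfer(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err() // the call is abandoned once the deadline passes
	case <-time.After(50 * time.Millisecond): // pretend the server replied
		return nil
	}
}

func main() {
	// Derive a bounded context and pass it into the client call, so the
	// caller never waits indefinitely for a response.
	ctx, cancel := context.WithTimeout(context.Background(), inferTimeout)
	defer cancel()

	if err := callInfer(ctx); err != nil {
		fmt.Println("infer call failed:", err)
	} else {
		fmt.Println("infer call completed within the timeout")
	}
}
```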
Which issue(s) this PR fixes:
Fixes # INFRA-1252 (internal)
Special notes for your reviewer: