fix: Add timeout to contexts in client calls #6125

sakoush · 2024-12-04T14:03:24Z

What this PR does / why we need it:

This PR introduces timeouts to client gRPC calls in the places where they make sense. This allows the system, in some cases, to avoid waiting indefinitely for a response. One example is the model infer call in `modelgateway`.
In other places we had already introduced timeout logic on top of gRPC calls; those cases are left as they are.

Which issue(s) this PR fixes:

Fixes # INFRA-1252 (internal)

Special notes for your reviewer:

lc525 · 2024-12-10T15:09:01Z

operator/scheduler/experiment.go

+				return nil
+			})
+			if retryErr != nil {
+				logger.Error(err, "Failed to remove finalizer after retries")


Can we have !event.Active while latestExperiment.ObjectMeta.DeletionTimestamp.IsZero()? In that case we don't even attempt to remove the finalizer, and this error message might be confusing. For example, I don't know whether RetryOnConflict returns an error on context timeout. I suppose we still have the underlying error wrapped here.

Also, should we `continue` after logging here (rather than proceeding with the status update)?

lc525 · 2024-12-10T15:18:56Z

operator/scheduler/model.go

@@ -161,10 +161,12 @@ func (s *SchedulerClient) SubscribeModelEvents(ctx context.Context, grpcClient s
 		// Handle terminated event to remove finalizer
 		if canRemoveFinalizer(latestVersionStatus.State.State) {
 			retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
-				latestModel := &v1alpha1.Model{}
+				ctxWithTimeout, cancel := context.WithTimeout(ctx, constants.K8sAPICallTimeout)


One thing I've realised here (and it's true for the other places where you have introduced the context) is that, despite the name, K8sAPICallTimeout covers not just one k8s API call but multiple calls made in one function. So in the example here, both s.Get(...) and s.Update(...) need to complete within a total time shorter than the timeout. This is especially true when we pass the same timeout context further down the call stack, as is the case with updateModelStatus. It might still be fine, given that it's set to a generous 2 minutes by default.

This is intentional, to simplify things. I can come up with a better name for the constant to make it clearer.

lc525 · 2024-12-10T16:04:38Z

scheduler/pkg/agent/k8s/secrets_test.go

@@ -67,7 +67,7 @@ parameters:
 		t.Run(test.name, func(t *testing.T) {
 			fakeClientset := fake.NewSimpleClientset(test.secret)
 			s := NewSecretsHandler(fakeClientset, test.secret.Namespace)
-			data, err := s.GetSecretConfig(test.secretName)
+			data, err := s.GetSecretConfig(test.secretName, 1)


Isn't this converted into a 1-nanosecond Duration by default? Even for a mock client, it would perhaps be better to set it to 1 * time.Millisecond to avoid flaky tests.

lc525 · 2024-12-10T16:17:13Z

scheduler/pkg/util/constants.go

+
+// inference
+const (
+	InferTimeoutDefault = 10 * time.Minute // TODO: expose this as a config (map)?


In general, I would like us to expose the inference timeout more explicitly and preferably in one place. At the moment it's not entirely clear (at least to me) where things time out (envoy? mlserver? agent? one of the gateways? etc.). Perhaps we should add a Jira ticket to handle this.

lc525

lgtm, I feel more confident with having timeouts everywhere, but I think this change will require intensive testing before the next release, to make sure things are solid (for example, under network partitions, slow connections, etc). Also, I think we might need to expose some of those timeouts as settings further down the line.

I've left a couple of comments/clarifying questions, but all are minor. The git diff did not help on this particular PR, as lots of code appeared changed but most was just moved slightly. Hopefully I haven't missed anything obvious.

Thank you very much for taking this on, it was long overdue.

sakoush added the v2 label Dec 4, 2024

sakoush requested a review from lc525 as a code owner December 4, 2024 14:03

sakoush force-pushed the INFRA-1252/fix_context_modelgateway branch from a487916 to 34a4c27 Compare December 4, 2024 16:21

sakoush added 18 commits December 9, 2024 21:23

add timeout context from infer call (modelgateway)

4a4ca6e

refactor utility func

2e77dd8

add timeout context to pipeline gateway

9d9643a

remove timeout setting from kafka ctx

3222080

fix caller

7f537cc

set timeout context on process request

acd071e

fix test

02de053

add a test for grpc call timeout

634c2ff

add agent k8s api call timeout

1986c0c

add context timeout for shutting down services

4deacf7

refactor function name

f641af9

add timeout for controller k8s api calls

5b08b6a

add timeout for control plane context

d5cb307

add note

9e3e635

add timeout context to reconcile logic

94050d6

fix circular dep for tracing

48a5ab1

add note

04f4aee

fix lint (operator)

6033617

sakoush force-pushed the INFRA-1252/fix_context_modelgateway branch from 21a12c1 to 6033617 Compare December 9, 2024 21:25

lc525 reviewed Dec 10, 2024

lc525 approved these changes Dec 10, 2024
