-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a default timeout for Connector.Connect #235
Comments
It looks like the app module doesn't support tweaking the driver parameters that control that retry loop. We can fix that easily enough. You're not seeing the affected connection retry forever, right? With the default parameters it should be capped at 1 second: Line 215 in ab30b33
go-dqlite/internal/protocol/connector.go Lines 50 to 52 in ab30b33
|
It might be prudent to also wrap all the contexts in the driver with a timeout as well. Unfortunately, we can't provide our own contexts in those locations, so a context wrapped in a Lines 517 to 525 in fd457b7
I don't mind providing a patch to the driver package for this. |
You can do something like this to set a default timeout if not specified by the passed context: _, ok := ctx.Deadline()
if !ok {
// Set default timeout of 30s if no deadline context provided.
var cancel context.CancelFunc
ctx, cancel = context.WithTimeout(ctx, time.Duration(30*time.Second))
defer cancel()
} That way if someone does want a longer timeout you're not forcing a particular policy in the driver. |
FWIW LXD sets a default timeout on the transaction rather than each query: https://github.com/lxc/lxd/blob/master/lxd/db/query/transaction.go#L15-L16 Which has been useful while we transition to passing context into each query too. It also avoids us accidentally keeping transactions open too long and blocking writes to the DB. |
@tomponline The problem is that dqlite hangs forever when the database is unreachable. If this happens during an
Except the context isn't available via the
We do the same thing in Juju. |
Does cancelling the transaction not end the query too. I've not observed LXD getting stuck on Execs when the transaction is cancelled. |
Agreed. I suppose if using The reason I added the 10s timeout to |
Maybe the non tx related functions, such as plain That way folks won't find themselves blocked indefinitely, but also the driver is never overriding an externally requested timeout period. |
So if the context has a deadline attached then the following is enacted. So for instances that don't supply a context with a deadline, we could set it there also. go-dqlite/internal/protocol/protocol.go Lines 58 to 62 in fd457b7
|
Sounds like a plan to me. |
Am I correct that the only place where a user can't provide a context is Not sure why the @SimonRichardson are you positively seeing a The logs attached in the very first message of this issue seem a retry loop for getting a If // The provided context.Context is for dialing purposes only
// (see net.DialContext) and should not be stored or used for
// other purposes. A default timeout should still be used
// when dialing as a connection pool may call Connect
// asynchronously to any query. I'm a not totally convinced that a default timeout should be added to |
As mentioned, as far as I understand it the only place where we would need to add a default timeout to achieve the behavior above should be |
Not 100% positive yet. |
FWIW I kind of see why you wouldn't want If Juju's main use case for cancellation/rollback is graceful shutdown (e.g. of a worker), then I'd say that a top-level context that you can cancel would be the way to go (as opposed to timeouts), and I presume that's already the case in Juju's code. Assuming that's true, as long as all juju's database calls get passed a context linked to that top-level one, then all transactions should already immediately abort/rollback as soon as the top-level context is cancelled (except for |
This is correct.
We're using context. |
Ok, so assuming that the only hang observed is the one originally pasted at the very beginning of this issue (the connect loop of |
Let me get back to you on this. I've not been able replicate it. |
First thing to check on the go-dqlite site would that cancelling a context actually indirectly interrupts that loop. Roughly something like:
|
Sorry, I'm just now starting to look at this, for my understanding
Isn't it expected that you hang forever if you don't use I agree though that it's prudent to add a default context with a sane timeout to |
Yes, we are using I leave it to discretion of the Dqlite team as to the specific mitigation, but I feel this behaviour would violate the assumptions of consumers. |
Juju has occasion to change the bind address of the initial Dqlite node upon entering HA.
When this happens, existing connections keep trying to connect, with our logs repeated output like this:
An existing call to
Exec
is locked up at this point:This is problematic given Juju's architecture, because our worker loops can be blocked from actually shutting down.
Particularly bad is that this happens with
db.Ping
, which we use for health checks.The text was updated successfully, but these errors were encountered: