
Add a default timeout for Connector.Connect #235

Open

manadart opened this issue Mar 20, 2023 · 20 comments
Labels
Bug (Confirmed to be a bug) · Incomplete (Waiting on more information from reporter)

Comments

@manadart

Juju has occasion to change the bind address of the initial Dqlite node upon entering HA.

When this happens, existing connections keep trying to connect, with our logs showing repeated output like this:

2023-03-09 08:57:14 WARNING juju.worker.dbaccessor connector.go:77 attempt 2507: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
2023-03-09 08:57:14 WARNING juju.worker.dbaccessor connector.go:77 attempt 2507: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
2023-03-09 08:57:15 WARNING juju.worker.dbaccessor connector.go:77 attempt 2508: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused
2023-03-09 08:57:15 WARNING juju.worker.dbaccessor connector.go:77 attempt 2508: server 127.0.0.1:17666: dial: dial tcp 127.0.0.1:17666: connect: connection refused

An existing call to Exec is locked up at this point:

#       0x46a8d4        time.Sleep+0x134                                                                /usr/local/go/src/runtime/time.go:195
#       0x1c14f28       github.com/canonical/go-dqlite/internal/protocol.makeRetryStrategies.func1+0x48 /home/joseph/go/pkg/mod/github.com/canonical/[email protected]/internal/protocol/connector.go:321
#       0x1c12799       github.com/Rican7/retry.shouldAttempt+0x39                                      /home/joseph/go/pkg/mod/github.com/!rican7/[email protected]/retry.go:32
#       0x1c126f6       github.com/Rican7/retry.Retry+0x96                                              /home/joseph/go/pkg/mod/github.com/!rican7/[email protected]/retry.go:19
#       0x1c131c4       github.com/canonical/go-dqlite/internal/protocol.(*Connector).Connect+0xa4      /home/joseph/go/pkg/mod/github.com/canonical/[email protected]/internal/protocol/connector.go:74
#       0x1c42628       github.com/canonical/go-dqlite/driver.(*Connector).Connect+0x268                /home/joseph/go/pkg/mod/github.com/canonical/[email protected]/driver/driver.go:261
#       0x1c05d22       database/sql.(*DB).conn+0x762                                                   /usr/local/go/src/database/sql/sql.go:1393
#       0x1c0ce52       database/sql.(*Stmt).connStmt+0x252                                             /usr/local/go/src/database/sql/sql.go:2724
#       0x1c0c364       database/sql.(*Stmt).ExecContext+0x184                                          /usr/local/go/src/database/sql/sql.go:2628
#       0x46e7204       database/sql.(*Stmt).Exec+0x44                                                  /usr/local/go/src/database/sql/sql.go:2651

This is problematic given Juju's architecture, because our worker loops can be blocked from actually shutting down.

Particularly bad is that this happens with db.Ping, which we use for health checks.

@cole-miller
Contributor

It looks like the app module doesn't support tweaking the driver parameters that control that retry loop. We can fix that easily enough. You're not seeing the affected connection retry forever, right? With the default parameters it should be capped at 1 second:

driver, err := driver.New(store, driver.WithDialFunc(driverDial), driver.WithLogFunc(o.Log))

if config.BackoffCap == 0 {
	config.BackoffCap = time.Second
}

@SimonRichardson
Member

It might be prudent to also wrap all the contexts in the driver with a timeout. Unfortunately, we can't provide our own contexts in those locations, so the driver would have to wrap its own contexts in a WithTimeout of 30 seconds or greater. If a commit can't complete within 30 seconds then it's probably never going to succeed, and we should give up at that point.

go-dqlite/driver/driver.go

Lines 517 to 525 in fd457b7

func (tx *Tx) Commit() error {
	ctx := context.Background()
	if _, err := tx.conn.ExecContext(ctx, "COMMIT", nil); err != nil {
		return driverError(tx.log, err)
	}
	return nil
}

I don't mind providing a patch to the driver package for this.
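
A minimal sketch of what such a patch might look like for the Commit path above, assuming we simply bound the internally created context (the 30-second value is the one suggested here, not something the driver currently does):

func (tx *Tx) Commit() error {
	// Sketch only: bound the context so a COMMIT against an unreachable
	// cluster cannot block forever.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	if _, err := tx.conn.ExecContext(ctx, "COMMIT", nil); err != nil {
		return driverError(tx.log, err)
	}

	return nil
}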

@tomponline
Member

tomponline commented Jun 7, 2023

You can do something like this to set a default timeout if not specified by the passed context:

	_, ok := ctx.Deadline()
	if !ok {
		// Set default timeout of 30s if no deadline context provided.
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, time.Duration(30*time.Second))
		defer cancel()
	}

That way if someone does want a longer timeout you're not forcing a particular policy in the driver.

@tomponline
Member

tomponline commented Jun 7, 2023

FWIW LXD sets a default timeout on the transaction rather than each query:

https://github.com/lxc/lxd/blob/master/lxd/db/query/transaction.go#L15-L16

This has been useful while we transition to passing a context into each query too.

It also avoids us accidentally keeping transactions open too long and blocking writes to the DB.
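
For illustration, a transaction helper in that spirit might look roughly like this; it is a sketch of the idea, not the actual LXD code:

// Transaction runs f inside a transaction whose context carries a default
// deadline when the caller didn't supply one of its own.
func Transaction(ctx context.Context, db *sql.DB, f func(context.Context, *sql.Tx) error) error {
	if _, ok := ctx.Deadline(); !ok {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
		defer cancel()
	}

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}

	if err := f(ctx, tx); err != nil {
		_ = tx.Rollback()
		return err
	}

	return tx.Commit()
}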

@SimonRichardson
Member

@tomponline The problem is that dqlite hangs forever when the database is unreachable. If this happens during an Exec or ExecContext then it will block. The problem I'm showing is that this is still a possibility, because the driver does a COMMIT (there are many more) with a context that isn't bound to a timeout, so it can hang.

You can do something like this to set a default timeout if not specified by the passed context:

Except the context isn't available via the *Tx, and while the conn does have a context, it's deprecated, so we shouldn't attempt to use it.

FWIW LXD sets a default timeout on the transaction rather than each query:

We do the same thing in Juju.

https://github.com/juju/juju/blob/09db847e363a757089dcb569158d7ea0dc3a3a7a/database/txn/transaction.go#L172-L173

@tomponline
Member

Does cancelling the transaction not end the query too? I've not observed LXD getting stuck on Execs when the transaction is cancelled.

@tomponline
Member

tomponline commented Jun 7, 2023

Except the context isn't available via the *Tx, and while the conn does have a context, it's deprecated, so we shouldn't attempt to use it.

Agreed. I suppose that if using Exec and not ExecContext, one would not expect there to be a default timeout in the driver, and indefinite blocking is a possibility (in any driver, not just dqlite).

The reason I added the 10s timeout to BeginTx in LXD was that I found it did cause tx.Exec() to stop if dqlite was blocking for some reason. We did see some indefinite blocking before that, and we don't see it anymore.

@tomponline
Member

tomponline commented Jun 7, 2023

Maybe the non-tx-related functions, such as plain Exec, Ping, etc., could have a default timeout added, and functions that accept a context, like ExecContext, BeginTx, etc., could add one if the context supplied doesn't have a deadline?

That way folks won't find themselves blocked indefinitely, but also the driver is never overriding an externally requested timeout period.
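
That policy could be expressed with a small helper along these lines (a hypothetical sketch, packaging the deadline check from the earlier snippet so both the plain and the *Context methods can share it):

// withDefaultDeadline returns ctx unchanged if it already carries a deadline,
// otherwise it wraps it with the driver's default timeout.
func withDefaultDeadline(ctx context.Context, d time.Duration) (context.Context, context.CancelFunc) {
	if _, ok := ctx.Deadline(); ok {
		return ctx, func() {}
	}
	return context.WithTimeout(ctx, d)
}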

@SimonRichardson
Member

If the context has a deadline attached, then the following is enacted. For instances that don't supply a context with a deadline, we could set one there also.

if deadline, ok := ctx.Deadline(); ok {
	p.conn.SetDeadline(deadline)
	budget = time.Until(deadline)
	defer p.conn.SetDeadline(time.Time{})
}

@SimonRichardson
Member

That way folks won't find themselves blocked indefinitely, but also the driver is never overriding an externally requested timeout period.

Sounds like a plan to me.

@freeekanayaka
Contributor

If the context has a deadline attached, then the following is enacted. For instances that don't supply a context with a deadline, we could set one there also.

if deadline, ok := ctx.Deadline(); ok {
	p.conn.SetDeadline(deadline)
	budget = time.Until(deadline)
	defer p.conn.SetDeadline(time.Time{})
}

Am I correct that the only place where a user can't provide a context is tx.Commit()?

Not sure why the go/sql package has chosen to not support contexts for tx.Commit() (and I don't see any newly introduced extension that changes that), perhaps because that is kind of dangerous and leads to undefined behavior in general (has the tx succeeded or not?). In any case a COMMIT query in dqlite can't hang forever, because at some point (relatively quickly with default values, a few seconds) the leader will step down and stop trying to commit if it can't reach a majority.

@SimonRichardson are you positively seeing a tx.Commit() call that hangs?

The logs attached in the very first message of this issue seem to show a retry loop for getting a driver.Conn, which I would say is caused by Connector.Connect() not being passed a ctx with a deadline. I didn't check Go's code, but I'd expect the ctx passed to Connector.Connect() to be the same as whatever the user has passed to tx.ExecContext and friends.

If tx.Exec() is being used instead of tx.ExecContext() that might explain it, and we may want to add a default timeout to our Connector.Connect() implementation, as also suggested by the Go documentation for Connector.Connect:

	// The provided context.Context is for dialing purposes only
	// (see net.DialContext) and should not be stored or used for
	// other purposes. A default timeout should still be used
	// when dialing as a connection pool may call Connect
	// asynchronously to any query.

I'm not totally convinced that a default timeout should be added to protocol.Call() though, as proposed here, since, as said, tx.Commit() should not hang and everything else can be controlled by the user, unless I'm missing something.

@freeekanayaka
Contributor

Maybe the non-tx-related functions, such as plain Exec, Ping, etc., could have a default timeout added, and functions that accept a context, like ExecContext, BeginTx, etc., could add one if the context supplied doesn't have a deadline?

That way folks won't find themselves blocked indefinitely, but also the driver is never overriding an externally requested timeout period.

As mentioned, as far as I understand it, the only place where we would need to add a default timeout to achieve the behavior above is protocol.Connector.Connect().
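
As a rough sketch of that change (the method shape, the connectImpl helper, and the 15-second default are placeholders, not go-dqlite's actual internals):

func (c *Connector) Connect(ctx context.Context) (*Protocol, error) {
	if _, ok := ctx.Deadline(); !ok {
		// The connection pool may call Connect asynchronously to any query,
		// so never run the dial/retry loop with an unbounded context.
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, 15*time.Second) // placeholder default
		defer cancel()
	}

	return c.connectImpl(ctx) // placeholder for the existing retry/dial logic
}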

@SimonRichardson
Member

are you positively seeing a tx.Commit() call that hangs?

Not 100% positive yet.

@freeekanayaka
Contributor

FWIW I kind of see why you wouldn't want tx.Commit() to get a context parameter. Essentially it seems that what go/sql supports is effectively sugar around transaction rollback (you just cancel the context you pass to the various tx methods), which is a normal database operation. On the other hand, once a COMMIT has been issued there's usually no way to roll back, and it either succeeds or fails.

If Juju's main use case for cancellation/rollback is graceful shutdown (e.g. of a worker), then I'd say that a top-level context that you can cancel would be the way to go (as opposed to timeouts), and I presume that's already the case in Juju's code. Assuming that's true, as long as all juju's database calls get passed a context linked to that top-level one, then all transactions should already immediately abort/rollback as soon as the top-level context is cancelled (except for tx.Commit() which will finish in at most a few seconds, possibly returning an error due to cluster unavailability). Relying on timeouts for graceful shutdown would actually be sub-optimal. If juju's database calls are not already all making use of context.Context, it'd be nice to change that, in order to support a more streamlined graceful shutdown sequence.
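
To make that concrete, here is a hedged sketch of the pattern (the worker type and its dying channel are placeholders, not Juju code):

// loop threads a single top-level context, cancelled on shutdown, through
// every database call so in-flight queries and transactions abort promptly.
func (w *worker) loop(db *sql.DB) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	go func() {
		<-w.dying // placeholder shutdown signal
		cancel()
	}()

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx, "UPDATE state SET value = 1"); err != nil {
		_ = tx.Rollback()
		return err
	}
	// Commit takes no context, but as noted above it returns within a few
	// seconds even when the cluster is unavailable.
	return tx.Commit()
}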

@SimonRichardson
Member

and I presume that's already the case in Juju's code

This is correct.

If juju's database calls are not already all making use of context.Context, it'd be nice to change that

We're using context.

@freeekanayaka
Contributor

and I presume that's already the case in Juju's code

This is correct.

If juju's database calls are not already all making use of context.Context, it'd be nice to change that

We're using context.

Ok, so assuming that the only hang observed is the one originally pasted at the very beginning of this issue (the connect loop of protocol.Connector.Connect()), we'd need to investigate why Juju's top-level context cancellation is not propagated down to that loop.

@SimonRichardson
Member

Let me get back to you on this. I've not been able to replicate it.

@freeekanayaka
Contributor

First thing to check on the go-dqlite side would be that cancelling a context actually interrupts that loop, if indirectly. Roughly something like the following (see the test sketch after this list):

  • Run some successful queries against an in-memory cluster of dqlite.Node
  • Stop the dqlite cluster by calling dqlite.Node.Close() against all nodes, but leave the db object open
  • Spawn a goroutine that calls db.ExecContext(ctx, "...")
  • Cancel the ctx and check that it stops the connect loop that the go/sql code should have triggered by calling protocol.Connector.Connect() in order to get a working connection
  • The goroutine should terminate immediately
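
A rough sketch of such a test; startTestCluster and stopAllNodes are hypothetical helpers standing in for the real in-memory dqlite.Node setup:

func TestConnectLoopStopsOnCancel(t *testing.T) {
	// Hypothetical helper: returns a *sql.DB backed by an in-memory dqlite cluster.
	db, nodes := startTestCluster(t)

	// Run a successful query while the cluster is up.
	if _, err := db.ExecContext(context.Background(), "CREATE TABLE t (n INT)"); err != nil {
		t.Fatal(err)
	}

	// Stop the cluster (dqlite.Node.Close() on every node) but keep db open.
	stopAllNodes(t, nodes)

	// Issue a query in a goroutine; go/sql should call Connector.Connect and
	// enter the retry loop, since no working connection exists.
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan error, 1)
	go func() {
		_, err := db.ExecContext(ctx, "INSERT INTO t (n) VALUES (1)")
		done <- err
	}()

	// Cancel the context and check the goroutine terminates promptly.
	cancel()
	select {
	case err := <-done:
		t.Logf("query returned after cancellation: %v", err)
	case <-time.After(5 * time.Second):
		t.Fatal("connect loop did not stop after context cancellation")
	}
}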

@MathieuBordere added the Bug (Confirmed to be a bug) and Incomplete (Waiting on more information from reporter) labels on Jun 12, 2023
@MathieuBordere

Sorry, I'm only just now starting to look at this. For my understanding:

#       0x46e7204       database/sql.(*Stmt).Exec+0x44                                                  /usr/local/go/src/database/sql/sql.go:2651

Isn't it expected that you hang forever if you don't use ExecContext? Can't you replace existing Exec calls with ExecContext? Same with Ping, can't you use PingContext?

I agree though that it's prudent to add a default context with a sane timeout to protocol.Connector.Connect.
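
For example, the health check mentioned at the top could be bounded like this (the 10-second value is illustrative):

ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

// With a deadline on the context, an unreachable cluster makes the check fail
// fast instead of spinning in the connect retry loop.
if err := db.PingContext(ctx); err != nil {
	log.Printf("database health check failed: %v", err)
}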

@manadart
Author

Yes, we are using <thing>Context exclusively now.

I leave it to the discretion of the Dqlite team as to the specific mitigation, but I feel this behaviour would violate the assumptions of consumers.

@cole-miller cole-miller self-assigned this Sep 18, 2023
@cole-miller cole-miller changed the title sql.DB instances hang forever when database is unreachable Add a default timeout for Connector.Connect Sep 2, 2024