Queries failing due to no hosts available in the pool. #325

Closed
ouamer-dahmani opened this issue Oct 31, 2024 · 3 comments

@ouamer-dahmani

Hello,

I am encountering issues where queries are not being retried despite a retry policy being configured when creating a new Cluster object.

Reads and writes work fine, but at some point some of them start failing with the error: gocql: no hosts available in the pool.
Delving into the code, I see that it should indeed retry the queries (I forced a query execution error in the debugger).

I then added logging to the cluster:

cluster.Logger = logger
cluster.QueryObserver = logger
cluster.BatchObserver = logger
cluster.ConnectObserver = logger

The logger gets called for queries that succeed but never for those that fail. I wonder if that is because the queries are not even run once, since there are no hosts in the connection pool?

I sometimes see connection events before the failures (anywhere from a few milliseconds to a few minutes earlier), but that is not always the case, and they are not error logs either:
Connect: Dial Duration: 5.383348ms, Host: 10.173.92.242

I know that the network on my Kubernetes cluster is sometimes flaky, but I assumed this would be handled gracefully by reconnections in the connection pool and retries on the queries.

I am running version v1.13.0 of the driver.
I see that the v1.14.x releases have changes around connections, but I am unsure whether they are related to the issues I am having, and I have held off on updating because I have not had time to test them.

@dkropachev
Collaborator

Could you please provide your ClusterConfig, including the HostSelectionPolicy and retry policy?

@ouamer-dahmani
Author

Hello!

It is equivalent to the following. I used high values to see if they would help ride out the potential instability.

cluster := gocql.NewCluster(cfg.Hosts...)
cluster.Keyspace = cfg.Keyspace
cluster.Timeout = 5 * time.Second
cluster.RetryPolicy = &gocql.ExponentialBackoffRetryPolicy{
	Min:        500 * time.Millisecond,
	Max:        5 * time.Second,
	NumRetries: 5,
}
cluster.Consistency = gocql.LocalQuorum
cluster.Authenticator = cfg.Authenticator
cluster.PoolConfig.HostSelectionPolicy = gocql.RoundRobinHostPolicy()
cluster.DisableInitialHostLookup = false
cluster.DisableShardAwarePort = true

@dkropachev
Collaborator

@ouamer-dahmani, what most likely happens is this:

  1. Due to the unstable connection, the driver loses its connections to all nodes at some point.
  2. When that happens, the executor does not even get to the retry policy. It just iterates over the hosts provided by RoundRobinHostPolicy, looking for one that has open connections and can be used to execute the query. Since it finds no such host, it ends up returning &Iter{err: ErrNoConnections}.

It works the same way on current versions as well, so you can't fix it by upgrading the driver.
I would suggest manually retrying on this error until we fix the retry logic.

I am closing this issue in favor of #326.
But feel free to continue the discussion here if it relates to this particular case.

@dkropachev dkropachev self-assigned this Oct 31, 2024