Contact points randomization #1029

Open
wprzytula opened this issue Jul 9, 2024 · 5 comments

Comments

@wprzytula
Collaborator

Problem

When the driver is given an ordered list of initial contact points, it attempts to open the control connection to the nodes in that order. If the cluster is operating normally, the first node always accepts the control connection and is burdened with it (it has to send events when they are triggered, and topology and schema metadata when queried) until either the connected driver disconnects or the node goes down. This imbalance should be avoided.

Solution

cpp-driver enables random shuffling of the initial contact points list by default. This spreads control connections evenly across nodes. We should do the same and, like cpp-driver, expose a config option to disable this behaviour (mainly useful for deterministic testing).
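
A minimal sketch of the proposed behaviour, not the driver's actual code. The names `ContactPointsConfig`, `shuffle`, `contact_order` and `time_seed` are hypothetical, and the hand-rolled xorshift generator is there only to keep the example dependency-free (a real implementation would more likely use the `rand` crate):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Tiny xorshift64 generator (illustration only, not cryptographically secure).
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift must not be seeded with zero.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Hypothetical config knob; the name is an assumption, not the driver's API.
struct ContactPointsConfig {
    shuffle: bool,
}

/// Returns the order in which the contact points would be tried.
fn contact_order(points: &[String], cfg: &ContactPointsConfig, seed: u64) -> Vec<String> {
    let mut order = points.to_vec();
    if cfg.shuffle {
        let mut rng = XorShift64::new(seed);
        // Fisher-Yates shuffle: swap position i with a random position in 0..=i.
        // The modulo introduces a slight bias, which is acceptable for a sketch.
        for i in (1..order.len()).rev() {
            let j = (rng.next_u64() % (i as u64 + 1)) as usize;
            order.swap(i, j);
        }
    }
    order
}

/// Seed that differs between driver instances, so that clients do not all
/// pick the same first node.
fn time_seed() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos() as u64)
        .unwrap_or(1)
}
```

With `shuffle: false` the list is returned unchanged, which is the deterministic mode tests would rely on. With `shuffle: true` and `time_seed()`, each driver instance is likely to start from a different node.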

@wprzytula wprzytula added area/load-balancing cpp-rust-driver-p1 Functionality required by cpp-rust-driver labels Jul 9, 2024
@wprzytula wprzytula added this to the 1.0.0 milestone Jul 9, 2024
@mykaul
Contributor

mykaul commented Jul 9, 2024

A similar but separate issue: all shards (except shard 0) are also contacted in the same order. This causes a small storm when a node comes back up. I understand we have to contact shard 0 first, but the rest should be contacted in random order.

@Lorak-mmk
Collaborator

Why do we have to contact shard 0 first?

@wprzytula
Collaborator Author

> Why do we have to contact shard 0 first?

Because we only know that nodes contain at least shard 0. No other shard's presence is guaranteed.

@piodul
Collaborator

piodul commented Jul 9, 2024

> Because we only know that nodes contain at least shard 0. No other shard's presence is guaranteed.

The driver does not choose a shard when establishing the first connection. Usually, the first connection should be established to the non-shard-aware port and Scylla will choose the shard that is least loaded with connections.

After the first connection is made and the driver learns how many shards the node has, it will start connecting to all other shards at once (that's how it works right now).
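
To make that sequence concrete, here is a small sketch of which shards are left to connect to once the first connection is up. The function name and parameters are hypothetical; `first_shard` stands for whichever shard Scylla assigned, which need not be shard 0:

```rust
/// Shards that still lack a connection after the first one is established.
/// `nr_shards` is the shard count reported by the node; `first_shard` is the
/// shard Scylla picked for the first connection.
fn remaining_shards(nr_shards: u16, first_shard: u16) -> Vec<u16> {
    (0..nr_shards).filter(|&s| s != first_shard).collect()
}
```

For a node with 4 shards where the first connection landed on shard 2, this yields shards 0, 1 and 3, and the current behaviour is to open connections to all of them at once.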

@mykaul
Contributor

mykaul commented Jul 9, 2024

> Because we only know that nodes contain at least shard 0. No other shard's presence is guaranteed.

Indeed.

> The driver does not choose a shard when establishing the first connection. Usually, the first connection should be established to the non-shard-aware port and Scylla will choose the shard that is least loaded with connections.

True.

> After the first connection is made and the driver learns how many shards the node has, it will start connecting to all other shards at once (that's how it works right now).

Yes, and it may cause (as we've seen in the past) a connection storm, since every driver that sees the new node come up will do the same. Ideally, the driver should randomize and pace the connections to all other shards.
(Of course, there are other optimizations I'd love to see in the connection phase, which would lessen the impact.)
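
One possible shape for "randomize and pace", as a sketch only. The function `paced_schedule`, the fixed `interval`, and the inline xorshift step are assumptions made for illustration; nothing here reflects how the driver is implemented:

```rust
use std::time::Duration;

/// Builds a connection schedule: shards in pseudo-random order, each paired
/// with the delay after which its connection attempt would start.
/// `seed` is assumed to differ per driver instance so that clients do not
/// all hit the shards in the same order.
fn paced_schedule(shards: &[u16], interval: Duration, seed: u64) -> Vec<(u16, Duration)> {
    let mut order = shards.to_vec();
    // xorshift64 state must be non-zero.
    let mut state: u64 = if seed == 0 { 1 } else { seed };
    // Fisher-Yates shuffle driven by xorshift64 steps.
    for i in (1..order.len()).rev() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        let j = (state % (i as u64 + 1)) as usize;
        order.swap(i, j);
    }
    // Space the attempts out: the k-th shard starts after k * interval.
    order
        .into_iter()
        .enumerate()
        .map(|(k, shard)| (shard, interval * k as u32))
        .collect()
}
```

The caller would sleep for each delay before opening the corresponding connection. A fixed interval is the simplest pacing choice. Adding jitter would further decorrelate clients, at the cost of a less predictable warm-up time.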

@wprzytula wprzytula self-assigned this Jul 11, 2024