Contact points randomization #1029

wprzytula · 2024-07-09T09:17:17Z

Problem

When the driver is given an ordered list of initial contact points, it attempts to open the control connection to the nodes in that order. If the cluster is operating normally, the first node always accepts the control connection and remains burdened with it (it has to send events when they are triggered, and topology and schema metadata when queried) until either the connected driver disconnects or the node goes down. This imbalance should be avoided.

Solution

cpp-driver enables random shuffling of the initial contact points list by default. This spreads control connections across nodes instead of concentrating them on the first contact point. We should do the same and, like cpp-driver, expose a config option to disable this behaviour (mainly useful for deterministic testing).
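A minimal sketch of what this could look like, not the driver's actual API. `ContactPointsConfig`, its `shuffle` field, `XorShift64` and `order_contact_points` are hypothetical names introduced here for illustration. The small xorshift generator is only a std-only stand-in so the example is self-contained; a real implementation would more likely use a proper RNG crate.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Hypothetical config knob (assumed name, not an existing driver option).
/// Shuffling is on by default and can be turned off for deterministic tests.
#[derive(Clone, Copy)]
pub struct ContactPointsConfig {
    pub shuffle: bool,
}

impl Default for ContactPointsConfig {
    fn default() -> Self {
        Self { shuffle: true }
    }
}

/// Tiny xorshift64 generator, used here only to avoid external crates.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn from_seed(seed: u64) -> Self {
        // xorshift must never start from an all-zero state.
        Self(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn from_entropy() -> Self {
        // std seeds every RandomState with random keys, so hashing
        // nothing still yields an unpredictable value.
        Self::from_seed(RandomState::new().build_hasher().finish())
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Returns the order in which the control connection would be attempted.
pub fn order_contact_points(
    mut points: Vec<String>,
    cfg: ContactPointsConfig,
    rng: &mut XorShift64,
) -> Vec<String> {
    if cfg.shuffle {
        // Fisher-Yates shuffle; the modulo bias is negligible for short lists.
        for i in (1..points.len()).rev() {
            let j = (rng.next_u64() % (i as u64 + 1)) as usize;
            points.swap(i, j);
        }
    }
    points
}
```

With `ContactPointsConfig { shuffle: false }` the list comes back in the user-supplied order, which is the deterministic behaviour tests would rely on. With the default config and `XorShift64::from_entropy()`, each driver instance would try the contact points in its own random order.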

mykaul · 2024-07-09T09:28:35Z

Also / similar but separate issue: all shards (but 0) are also contacted in the same order. This causes a small storm when a node comes back up. I understand we have to contact shard 0 first, but the rest should be in a random order.

Lorak-mmk · 2024-07-09T09:46:43Z

Why do we have to contact shard 0 first?

wprzytula · 2024-07-09T09:52:40Z

> Why do we have to contact shard 0 first?

Because we only know that nodes contain at least shard 0. No other shard's presence is guaranteed.

piodul · 2024-07-09T09:56:31Z

> Because we only know that nodes contain at least shard 0. No other shard's presence is guaranteed.

The driver does not choose a shard when establishing the first connection. Usually, the first connection should be established to the non-shard-aware port and Scylla will choose the shard that is least loaded with connections.

After the first connection is made and the driver learns how many shards the node has, it will start connecting to all other shards at once (that's how it works right now).

mykaul · 2024-07-09T10:07:31Z

> Because we only know that nodes contain at least shard 0. No other shard's presence is guaranteed.

Indeed.

> The driver does not choose a shard when establishing the first connection. Usually, the first connection should be established to the non-shard-aware port and Scylla will choose the shard that is least loaded with connections.

True.

> After the first connection is made and the driver learns how many shards the node has, it will start connecting to all other shards at once (that's how it works right now).

Yes, and it may cause (as we've seen in the past) a connection storm, since all drivers that see the new node come up will do the same. Ideally, the driver should randomize and pace the connections to all other shards.
(Of course, there are other optimizations I'd love to see in the connection phase, which would reduce the impact.)
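As a rough illustration of "randomize and pace", the following sketch builds a connection schedule for the remaining shards once the first connection has been made. `shard_connection_schedule`, its parameters and the fixed `pace` interval are assumptions made up for this example and do not describe how the driver works today. The seed is passed in explicitly so the result is reproducible in tests; the inline xorshift step is a std-only stand-in for a real RNG.

```rust
use std::time::Duration;

/// Returns `(shard, delay)` pairs: every shard except `first_shard`
/// (the one the node picked for the initial connection), in a shuffled
/// order, with the k-th connection delayed by `pace * (k + 1)`.
pub fn shard_connection_schedule(
    nr_shards: u32,
    first_shard: u32,
    pace: Duration,
    seed: u64,
) -> Vec<(u32, Duration)> {
    // Shards that still need a connection.
    let mut shards: Vec<u32> = (0..nr_shards).filter(|s| *s != first_shard).collect();

    // xorshift64 step; the state must not be zero.
    let mut state = if seed == 0 { 1 } else { seed };
    let mut next = move || {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        state
    };

    // Fisher-Yates shuffle so that different drivers hit shards in
    // different orders.
    for i in (1..shards.len()).rev() {
        let j = (next() % (i as u64 + 1)) as usize;
        shards.swap(i, j);
    }

    // Space the attempts out instead of opening them all at once.
    shards
        .into_iter()
        .enumerate()
        .map(|(k, shard)| (shard, pace * (k as u32 + 1)))
        .collect()
}
```

For a node with 8 shards where the first connection landed on shard 3, `shard_connection_schedule(8, 3, Duration::from_millis(50), seed)` yields seven entries with delays from 50 ms to 350 ms. Adding per-driver jitter to the delays would be a natural extension, since a fixed pace alone still lets many drivers connect in lockstep.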

wprzytula added the area/load-balancing and cpp-rust-driver-p1 (Functionality required by cpp-rust-driver) labels Jul 9, 2024

wprzytula added this to the 1.0.0 milestone Jul 9, 2024

wprzytula self-assigned this Jul 11, 2024
