
Make cluster meet reliable under link failures #461

Open · wants to merge 3 commits into base: unstable

Conversation

@srgsanky commented May 8, 2024

When there is a link failure while an ongoing MEET request is being sent, the sending node stops sending any more MEET packets and starts sending PINGs. Since every node responds to PINGs from unknown nodes with a PONG, the receiving node never adds the sending node. But the sending node adds the receiving node when it sees the PONG. This can lead to asymmetry in cluster membership. This change makes the sender keep sending MEET until it sees a PONG, avoiding the asymmetry.

@srgsanky (Author) commented May 8, 2024

Posting this for initial comments. I can migrate the test based on the new framework once #442 is merged.

codecov bot commented May 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.94%. Comparing base (d52c8f3) to head (2ff9879).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #461      +/-   ##
============================================
+ Coverage     69.83%   69.94%   +0.11%     
============================================
  Files           109      109              
  Lines         61801    61808       +7     
============================================
+ Hits          43158    43234      +76     
+ Misses        18643    18574      -69     
Files Coverage Δ
src/cluster_legacy.c 86.50% <100.00%> (+0.28%) ⬆️
src/debug.c 53.60% <100.00%> (+0.13%) ⬆️

... and 14 files with indirect coverage changes

@srgsanky (Author) commented May 8, 2024

@madolson @hpatro @PingXie can one of you help review this change?

src/cluster_legacy.c (outdated)
* normal PING packets. */
node->flags &= ~CLUSTER_NODE_MEET;

/* NOTE: We cannot clear the MEET flag from the node until we get a response
Member

This makes sense to me, but I thought we briefly discussed only sending the meet once, and on reconnect just not sending another meet. The previous logic was "We'll only send a single meet"; I'm wondering if there was any logic somewhere that relied on that behavior.

Contributor

Are we getting into the territory that CLUSTER MEET should only be sent when it was generated by the user/admin?

Member

Yes.

Author

The initial MEET is still triggered by an admin/user command.

    /* We can clear the flag after the first packet is sent.
     * If we'll never receive a PONG, we'll never send new packets
     * to this node. Instead after the PONG is received and we
     * are no longer in meet/handshake status, we want to send
     * normal PING packets. */

The previous code comment above doesn't restrict the MEET to be sent exactly once. I can't think of any scenario where sending multiple MEETs will break - this is similar to an admin sending MEET multiple times. We send MEET only when the connection is established in clusterLinkConnectHandler, so we still limit MEET to when the connection is newly set up, not every cluster cron run.

Member

this is similar to when an admin sends MEET multiple times

I believe the cluster deduplicates multiple identical requests to the same IP/port.

Contributor

So, is there a worst-case scenario where the node we originally intended to connect to has changed, and with this change we send the MEET command to the wrong node?

Member

I can't think of any. I went through the code and it seems okay to get a double MEET. My ask is: can we update the test to also cause this edge case? I believe we can drop the pong messages on the source node, kill the connection, and trigger a second meet message.

Author

In the test case I am using multiple MEETs to make sure we don't stop dropping the link early.

    wait_for_condition 1000 50 {
        [CI $b cluster_stats_messages_meet_received] >= 3
    } else {
      fail "Cluster node $a never sent multiple MEETs to $b"
    }

Member

B isn't processing these, right? It's just immediately dropping the first three and not processing them; it only ever processes the 4th one, correct?


Comment on lines +2030 to +2031
int cluster_close_link_on_packet_drop; /* Debug config that goes along with cluster_drop_packet_filter.
When set, the link is closed on packet drop. */
Contributor

Is there any alternative you thought of for testing without introducing this particular flag?

Author

In order to test this, the target node has to close the physical connection, not just drop the packet. Simply dropping the packet is not a valid scenario in production: we use TCP and expect reliable transmission.

The only other approach I can think of is to use iptables to drop packets while the connection is being set up. I couldn't find examples in the tcl tests that use iptables. Do you have a better approach in mind?

Contributor

In the cluster tests for failover, we use SIGSTOP to stop a node. (There's some TCL helper function for it, like proc pause_process I think.) Then it doesn't respond to any network traffic. SIGCONT is used to wake it up again.

Author

SIGSTOP and SIGCONT may not reproduce the scenario this PR is fixing. We want the connection to be established successfully so that the sender sends the MEET message, but the connection has to be torn down before the MEET is received by the receiver. That makes it tricky.

Contributor

OK

@hpatro (Contributor) commented May 13, 2024

I think it's worth investing in redis/redis#11095 to avoid this issue altogether.

@srgsanky (Author)

I think it's worth investing on this redis/redis#11095 to avoid this issue altogether.

Thanks, I wasn't aware of this linked issue. IMO these two issues can be solved independently. The linked issue tries to improve the admin experience for the MEET command, whereas this PR addresses a specific gap in the MEET implementation.

  • With SYNC MEET, we will have to change the admin client timeout. This timeout can possibly trickle up the stack in a control plane implementation.
  • If we choose to attempt the handshake for a longer period of time, we either have to filter out nodes in handshake from the cluster nodes output for non-admin clients, or make clients filter out nodes with this new flag. That can require a client-side change to avoid connecting to a node in handshake and the resulting availability issues.

The problem addressed in this PR (asymmetric cluster membership) can happen with SYNC MEET as well due to link failures, so it is worth solving. The handshake nodes will still be removed after the handshake timeout (same as the node_timeout of 15s). Wdyt?

@madolson (Member)

The problem addressed in this PR (asymmetric cluster membership) can happen with SYNC MEET as well due to link failures. So, it is worth solving it. The handshake nodes will still be removed after the handshake timeout (same as node_timeout of 15s). Wdyt?

Yeah, I still believe this is a problem even with #11095.

@zuiderkwast (Contributor)

Awesome material for our next release which will be full of cluster improvements. Is it worth mentioning in release notes?

Btw @srgsanky you need to commit with -s. See the instructions on the DCO CI job's details page.

@madolson madolson added the release-notes This issue should get a line item in the release notes label May 13, 2024
@madolson (Member)

Awesome material for our next release which will be full of cluster improvements. Is it worth mentioning in release notes?

I would also be inclined to backport it.

@srgsanky (Author)

Awesome material for our next release which will be full of cluster improvements. Is it worth mentioning in release notes?

Btw @srgsanky you need to commit with -s. See the instructions on the DCO CI job's details page.

When I tried to merge the new changes into my fork, I ended up with a merge commit:

* 2ff9879fa (HEAD -> unstable, origin/unstable, origin/HEAD) Moved test under unit and addressed other comments
*   b826ef77a Merge branch 'valkey-io:unstable' into unstable
|\
| * d52c8f30e Include stddef in zmalloc.h (#516)
| * dcc9fd4fe Resolve numtests counter error (#514)
...
| * 315b7573c Update server function's name to valkey (#456)
* | 49a884c06 Make cluster meet reliable under link failures
|/
* 4e944cede Migrate kvstore.c unit tests to new test framework. (#446)

I want to sign off just 49a884c, but the rebase is adding a signoff to all the commits 315b757..d52c8f3, which were not made by me.

Do you have any recommendation to fix this?

As an alternate option, I can start fresh and add a new commit from the tip of unstable. I am not sure if I will be able to reuse this PR.

@srgsanky srgsanky force-pushed the unstable branch 2 times, most recently from d8aa71c to 2ff9879 Compare May 19, 2024 21:22
@zuiderkwast (Contributor)

I believe it's possible to undo a merge with git reset --hard 49a884c06 (the commit before the merge commit), then rebase to add the --signoff, then do git merge unstable again. The commit you added after the merge commit can be cherry-picked after all this. Just remember the commit id.

If nothing works, then it's always possible to start from scratch with a new branch and cherry-pick all your commits into it. Then you can rename the branches and force-push to this PR's branch.

@madolson (Member)

@srgsanky The commit missing the DCO is just the top one. You should just be able to do git commit -s --amend with a no-op and force push over what you have.

@madolson (Member) left a comment

Just some minor nitpicks around the tests, it overall LGTM.

Comment on lines +5 to +10
tags {tls:skip external:skip cluster} {

set base_conf [list cluster-enabled yes]
start_multiple_servers 2 [list overrides $base_conf] {

test "Cluster nodes are reachable" {
Member

Some tests have this indentation since we wanted to preserve git history, when we ported them from the old to new framework. For new tests, they should ideally be indented. Maybe we should just format them at this point, since we are already losing so much history.

Comment on lines +30 to +36
set b 0
set a 1

test "Cluster nodes haven't met each other" {
assert {[llength [get_cluster_nodes $a]] == 1}
assert {[llength [get_cluster_nodes $b]] == 1}
}
Member

Suggested change
set b 0
set a 1
test "Cluster nodes haven't met each other" {
assert {[llength [get_cluster_nodes $a]] == 1}
assert {[llength [get_cluster_nodes $b]] == 1}
}
test "Cluster nodes haven't met each other" {
assert {[llength [get_cluster_nodes 1]] == 1}
assert {[llength [get_cluster_nodes 0]] == 1}
}

If you much prefer a and b, I'm okay with it, but it seems just as useful to me to call them 0 and 1.


set b_port [srv 0 port]

R $a CLUSTER MEET 127.0.0.1 $b_port
Member

Suggested change
R $a CLUSTER MEET 127.0.0.1 $b_port
R 1 CLUSTER MEET 127.0.0.1 [srv 0 port]

Similar to previous point about avoiding renaming the nodes a and b.

@@ -44,6 +44,7 @@ test "Cluster nodes hard reset" {
R $id config set repl-diskless-load disabled
R $id config set cluster-announce-hostname ""
R $id DEBUG DROP-CLUSTER-PACKET-FILTER -1
R $id DEBUG CLOSE-CLUSTER-LINK-ON-PACKET-DROP 0
Member

Suggested change
R $id DEBUG CLOSE-CLUSTER-LINK-ON-PACKET-DROP 0

I assume this is no longer needed now that the tests were moved?

wait_for_condition 1000 50 {
[CI $b cluster_stats_messages_meet_received] >= 3
} else {
fail "Cluster node $a never sent multiple MEETs to $b"
Member

Suggested change
fail "Cluster node $a never sent multiple MEETs to $b"
fail "Cluster node $a never sent multiple MEETs to $b"

We should figure out a linter for TCL.


Labels
release-notes This issue should get a line item in the release notes