Cluster does not recover from temporary network partition #2140
Replies: 10 comments 5 replies
-
This is definitely not the case in production systems across hundreds of installs I've seen personally. Can you describe your cluster setup more completely? Are you using Docker, or cloud instances with private networking, or bare metal installs (e.g. Raspberry Pis)?
-
This is running on a VPN networking appliance based on Debian 9, run in an (amd64) VM (typically VMware) or on real iron. We are using the provided pre-built .deb (stretch) packages. Our cluster configuration is a bit unusual, in that we adjust sharding (and n) to keep a shard on every node, so that every node always has a full copy of the data. We build the cluster one node at a time; each new node joins the cluster, and then the shard metadata is replicated to the new node. We are maintaining one fairly small database (at most a few thousand small documents), with a cluster size of 1-4 nodes (each of which must always have the data available, hence the resharding). Otherwise, it's pretty standard stuff, I think.
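For illustration only, a rough sketch of what that kind of setup could look like at database-creation time (the endpoint, credentials, database name, and q value below are assumptions, and adjusting n on an already-existing database is a more involved resharding step that this does not show):

```python
import requests

# Assumed endpoint, credentials, and database name -- illustration only.
COUCH = "http://admin:password@127.0.0.1:5984"
DB = "appdata"

# How many nodes are currently configured in the cluster.
nodes = requests.get(f"{COUCH}/_membership").json()["cluster_nodes"]

# Create the database with one shard replica per node (n = node count),
# so every node holds a full copy of the data.
resp = requests.put(f"{COUCH}/{DB}", params={"n": len(nodes), "q": 2})
print(resp.status_code, resp.json())
```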
-
/cc @nickva ever seen this? This might be an actual bug, but I don't know if it's one we care to fix with the 4.0 plans.
-
@rwpfeifer I'm sorry that we don't have any more information to provide here. The only thing I can think of is that, if you are actually tearing down the network interface itself, epmd is effectively being restarted out from under the running CouchDB. No Erlang distributed process can survive epmd being restarted from underneath it, to my knowledge, so the only workaround for you would be to ensure that when you interrupt networking, you do not also tear down the virtual interface in the VM guest at the same time.
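If it helps, a small diagnostic you could run on the affected host (purely a suggestion, not something CouchDB ships) to see whether the node is still registered with epmd after the interface comes back:

```python
import subprocess

# Ask the local epmd which Erlang node names it has registered.
out = subprocess.run(["epmd", "-names"], capture_output=True, text=True)
print(out.stdout)

# If epmd was restarted while the interface was down, the running
# CouchDB VM may no longer be registered even though it is still up.
if "couchdb" not in out.stdout:
    print("warning: no 'couchdb' name registered with epmd on this host")
```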
-
Unfortunately I'm facing the same issue. I have two VMs on Digital Ocean that are connected through the internal LAN. After some days or hours of activity the machines just split-brain. I guess "behind the curtain" the VM may be moving from one physical host to another one. The logs get filled with things like:
(That is the "remote" machine's IP.) Restarting CouchDB fixes the problem. CouchDB 2.3.1 on FreeBSD 12.1
-
This is the "beginning" of the problem, which happened at 14:26 today:
Machine A
Machine B
-
Are the IPs changing? What does /_membership show on the various nodes? Are you still reading/writing when this partition is not resolving itself?
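For reference, a quick way to compare what each node reports from /_membership (the addresses and credentials below are placeholders):

```python
import requests

# Placeholder addresses/credentials for the two nodes.
NODES = ["http://admin:password@10.10.0.1:5984",
         "http://admin:password@10.10.0.2:5984"]

for url in NODES:
    m = requests.get(f"{url}/_membership", timeout=5).json()
    # cluster_nodes = configured members; all_nodes = nodes this
    # node currently knows about / is connected to.
    print(url)
    print("  cluster_nodes:", sorted(m["cluster_nodes"]))
    print("  all_nodes:    ", sorted(m["all_nodes"]))
```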
-
I had the same problem happen again half an hour ago on a different couple of machines. Same kind of symptoms, same kind of logs. Netstat during the broken condition showed an active TCP connection between the machines, and with tcpdump I could see traffic flowing. Yet at the same time in the log I had dozens of
and
I'll investigate further; if you have any diagnostic suggestions, feel free to tell me.
-
Hey guys, this could be what's happening to us. Was there any resolution?
-
This problem occurred to us on two CouchDB setups in an Azure Kubernetes cluster. One setup had run for a year or two, since before I came to the team, but in February the nodes suddenly partitioned. The cluster recovered after the problematic pod was restarted. Another setup is a cluster we used to validate CouchDB backups.
Maybe we should write a script as a health checker to detect the partition situation and let Kubernetes kill the pod.
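A minimal sketch of what such a health check could look like (the endpoint, credentials, and pass/fail rule are assumptions); wired up as a Kubernetes exec liveness probe it would cause the pod to be restarted once the node stops seeing its peers:

```python
#!/usr/bin/env python3
"""Exit non-zero when this CouchDB node can no longer see all cluster members."""
import sys
import requests

# Assumed local endpoint and credentials; adjust for the real deployment.
COUCH = "http://admin:password@127.0.0.1:5984"

try:
    m = requests.get(f"{COUCH}/_membership", timeout=5).json()
except requests.RequestException as err:
    print(f"membership check failed: {err}")
    sys.exit(1)

# cluster_nodes = configured members; all_nodes = nodes currently connected.
missing = set(m["cluster_nodes"]) - set(m["all_nodes"])
if missing:
    print(f"partitioned from: {sorted(missing)}")
    sys.exit(1)

print("cluster membership OK")
```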
-
Discovered that if a network connectivity issue makes a node in a CouchDB cluster unreachable (routing issue, someone trips over a cable, etc.), after about a minute or so the affected node will disconnect and never attempt to reconnect. This leaves the cluster broken, and the only apparent way to recover is to manually restart CouchDB, which re-establishes connections.
To duplicate:
I set up a small cluster (3 nodes, CouchDB 2.3.1 on Debian 9) and verified that a database replicates across them. Noted that there was an open TCP socket to port 9100 from each peer.
Disconnected the network (virtual, on a VirtualBox VM) to one of them. After about a minute the sockets involving the affected node closed. Also noticed that an attempt to update a database hung until the socket closed (then completed with success).
Upon re-connecting the affected node, noted that that node is no longer synced to the rest of the cluster, and never recovers. There is apparently no mechanism to re-establish the broken connections. Stopping and re-starting any node's CouchDB will re-establish normal operation. This does not appear to be related to link state or other conditions; simple loss of routing is confirmed to cause it.
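For anyone trying to reproduce this, a rough sketch of the replication check described in the steps above (node addresses, credentials, and database name are placeholders):

```python
import requests

# Placeholder endpoints for the three test nodes.
NODES = ["http://admin:password@node1:5984",
         "http://admin:password@node2:5984",
         "http://admin:password@node3:5984"]
DB = "parttest"

# Create the database and write a document through the first node.
requests.put(f"{NODES[0]}/{DB}")
requests.put(f"{NODES[0]}/{DB}/probe", json={"value": 1})

# Read the same document back through every node; in a healthy
# cluster each node serves it via internal shard replication.
for url in NODES:
    r = requests.get(f"{url}/{DB}/probe")
    print(url, r.status_code)
```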
This would seem to be a fairly glaring reliability issue. If there is some mechanism to handle this, it does not appear in the documentation.