Cluster does not recover from temporary network partition #2140
Replies: 10 comments 5 replies
-
This is definitely not the case in production systems across hundreds of installs I've seen personally. Can you describe your cluster setup more completely? Are you using Docker, or cloud instances with private networking, or bare metal installs (e.g. Raspberry Pis)?
-
This is running on a VPN networking appliance based on Debian 9, run in an (amd64) VM (typically VMware) or on real iron. We are using the provided pre-built .deb (stretch) packages. Our cluster configuration is a bit unusual, in that we adjust sharding (and n) to keep a shard on every node, so that every node always has a full copy of the data. We build the cluster one node at a time; each new node joins the cluster, and then the shard metadata is replicated to the new node. We are maintaining one fairly small database (at most a few thousand small documents), with a cluster size of 1-4 nodes (each of which must always have the data available, hence the resharding). Otherwise, it's pretty standard stuff, I think.
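For illustration only, a rough sketch of what that kind of setup could look like at database-creation time (the endpoint, credentials, database name, and q value below are assumptions, and adjusting n on an already-existing database is a more involved resharding step that this does not show):

```python
import requests

# Assumed endpoint, credentials, and database name -- illustration only.
COUCH = "http://admin:password@127.0.0.1:5984"
DB = "appdata"

# How many nodes are currently configured in the cluster.
nodes = requests.get(f"{COUCH}/_membership").json()["cluster_nodes"]

# Create the database with one shard replica per node (n = node count),
# so every node holds a full copy of the data.
resp = requests.put(f"{COUCH}/{DB}", params={"n": len(nodes), "q": 2})
print(resp.status_code, resp.json())
```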
-
/cc @nickva ever seen this? This might be an actual bug, but I don't know if it's one we care to fix with the 4.0 plans.
-
@rwpfeifer I'm sorry that we don't have any more information to provide here. The only thing I can think of is that, if you are actually tearing down the network interface itself, epmd is effectively being restarted out from under the running CouchDB. No Erlang distributed process can survive epmd being restarted from underneath it, to my knowledge, so the only workaround for you would be to ensure that when you interrupt networking, you do not also tear down the virtual interface in the VM guest at the same time.
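If it helps, a small diagnostic you could run on the affected host (purely a suggestion, not something CouchDB ships) to see whether the node is still registered with epmd after the interface comes back:

```python
import subprocess

# Ask the local epmd which Erlang node names it has registered.
out = subprocess.run(["epmd", "-names"], capture_output=True, text=True)
print(out.stdout)

# If epmd was restarted while the interface was down, the running
# CouchDB VM may no longer be registered even though it is still up.
if "couchdb" not in out.stdout:
    print("warning: no 'couchdb' name registered with epmd on this host")
```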
-
Unfortunately I'm facing the same issue. I have two VMs on Digital Ocean that are connected through the internal LAN. After some days or hours of activity the machines just split-brain. I guess "behind the curtain" the VM may be moving from one physical host to another one. The logs get filled with things like:
(That is the "remote" machine's IP.) Restarting CouchDB fixes the problem. CouchDB 2.3.1 on FreeBSD 12.1
-
This is the "beginning" of the problem, which happened at 14:26 today:
Machine A
Machine B
-
Are the IPs changing? What does /_membership show on the various nodes? Are you still reading/writing when this partition is not resolving itself?
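For reference, a quick way to compare what each node reports from /_membership (the addresses and credentials below are placeholders):

```python
import requests

# Placeholder addresses/credentials for the two nodes.
NODES = ["http://admin:password@10.10.0.1:5984",
         "http://admin:password@10.10.0.2:5984"]

for url in NODES:
    m = requests.get(f"{url}/_membership", timeout=5).json()
    # cluster_nodes = configured members; all_nodes = nodes this
    # node currently knows about / is connected to.
    print(url)
    print("  cluster_nodes:", sorted(m["cluster_nodes"]))
    print("  all_nodes:    ", sorted(m["all_nodes"]))
```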
-
I had the same problem happen again half an hour ago on a different couple of machines. Same kind of symptoms, same kind of logs. Netstat during the broken condition showed an active TCP connection between the machines, and with tcpdump I could see traffic flowing. Yet at the same time in the log I had dozens of
and
I'll investigate further; if you have any diagnostic suggestions, feel free to tell me.
-
Hey guys, this could be what's happening to us. Was there any resolution?
-
This problem occurred to us on two CouchDB setups in an Azure Kubernetes cluster. One setup had run for a year or two, since before I came to the team, but in February the nodes suddenly partitioned. The cluster recovered after the problematic pod was restarted. Another setup is a cluster we used to validate CouchDB backups.
Maybe we should write a script as a health checker to detect the partition situation and let Kubernetes kill the pod.
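A minimal sketch of what such a health check could look like (the endpoint, credentials, and pass/fail rule are assumptions); wired up as a Kubernetes exec liveness probe it would cause the pod to be restarted once the node stops seeing its peers:

```python
#!/usr/bin/env python3
"""Exit non-zero when this CouchDB node can no longer see all cluster members."""
import sys
import requests

# Assumed local endpoint and credentials; adjust for the real deployment.
COUCH = "http://admin:password@127.0.0.1:5984"

try:
    m = requests.get(f"{COUCH}/_membership", timeout=5).json()
except requests.RequestException as err:
    print(f"membership check failed: {err}")
    sys.exit(1)

# cluster_nodes = configured members; all_nodes = nodes currently connected.
missing = set(m["cluster_nodes"]) - set(m["all_nodes"])
if missing:
    print(f"partitioned from: {sorted(missing)}")
    sys.exit(1)

print("cluster membership OK")
```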
-
Discovered that if a network connectivity issue makes a node in a CouchDB cluster unreachable (routing issue, someone trips over a cable, etc.), after about a minute or so the affected node will disconnect and never attempt to reconnect. This leaves the cluster broken, and the only apparent way to recover is to manually restart CouchDB, which re-establishes connections.
To duplicate:
I set up a small cluster (3 nodes, CouchDB 2.3.1 on Debian 9) and verified that a database replicates across them. Noted that there was an open TCP socket to port 9100 from each peer.
Disconnected the network (virtual, on a VirtualBox VM) to one of them. After about a minute the sockets involving the affected node closed. Also noticed that an attempt to update a database hung until the socket closed (then completed with success).
Upon re-connecting the affected node, noted that that node is no longer synced to the rest of the cluster, and never recovers. There is apparently no mechanism to re-establish the broken connections. Stopping and re-starting any node's CouchDB will re-establish normal operation. This does not appear to be related to link state or other conditions; simple loss of routing is confirmed to cause it.
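For anyone trying to reproduce this, a rough sketch of the replication check described in the steps above (node addresses, credentials, and database name are placeholders):

```python
import requests

# Placeholder endpoints for the three test nodes.
NODES = ["http://admin:password@node1:5984",
         "http://admin:password@node2:5984",
         "http://admin:password@node3:5984"]
DB = "parttest"

# Create the database and write a document through the first node.
requests.put(f"{NODES[0]}/{DB}")
requests.put(f"{NODES[0]}/{DB}/probe", json={"value": 1})

# Read the same document back through every node; in a healthy
# cluster each node serves it via internal shard replication.
for url in NODES:
    r = requests.get(f"{url}/{DB}/probe")
    print(url, r.status_code)
```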
This would seem to be a fairly glaring reliability issue. If there is some mechanism to handle this, it does not appear in the documentation.