import timeouts page

memcached · Sep 5, 2024 · 1abc652 · 1abc652
1 parent e199fc5
commit 1abc652
Show file tree

Hide file tree

Showing 2 changed files with 171 additions and 0 deletions.
diff --git a/content/troubleshooting/_index.md b/content/troubleshooting/_index.md
@@ -3,3 +3,5 @@ title = 'Troubleshooting'
 date = 2024-09-02T11:18:44-07:00
 weight = 5
 +++
+
+[Troubleshooting Client Timeouts](/troubleshooting/timeouts)
diff --git a/content/troubleshooting/timeouts.md b/content/troubleshooting/timeouts.md
@@ -0,0 +1,169 @@
++++
+title = 'Client Timeouts'
+date = 2024-09-05T11:13:11-07:00
+weight = 1
++++
+
+## Troubleshooting Timeouts
+
+Client complaining about "timeout errors", but not sure how to track it down?
+Here's a simple utility for examining your situation.
+
+### First, check listen_disabled_num
+
+Before you go ahead with troubleshooting, you'll want to telnet to your
+memcached instance and run `stats`, then look for "listen_disabled_num". This
+is a poorly named counter which describes how many times you've reached
+maxconns. Each time memcached hits maxconns it will delay new connections,
+which means you'll possibly get timeouts.
+
+Also, disable or tune any firewalls you may have in the way.
+
+### Then, carefully check the usual suspects
+
+Is the machine in swap? You will see random lag bubbles if your OS is swapping
+memcached to disk periodically.
+
+Is the machine overloaded? 0% CPU idle with a load of 400 and memcached
+probably isn't getting enough CPU time. You can try `nice` or `renice`, or
+just run less on the machine. If you're severely overloaded on CPU, you might
+notice the mc_conn_tester below reporting very high wait times for `set`
+commands.
+
+Is the memcached server 32bit? 32bit hosts have less memory available to the
+kernel for TCP sockets and friends. We've observed some odd behavior under
+large numbers of open sockets and high load with 32bit systems. Strongly
+consider going 64bit, as it may help some hard to trace problems go away,
+including segfaults due to the 2/4g memory limit.
+
+### Next, mc_conn_tester.pl
+
+Fetch this:
+
+https://www.memcached.org/files/mc_conn_tester.pl
+
+```
+$ ./mc_conn_tester.pl -s memcached-host -p 11211 -c 1000 --timeout 1
+Averages: (conn: 0.00081381) (set: 0.00001603) (get: 0.00040122)
+$ ./mc_conn_tester.pl --help
+Usage:
+    mc_conn_tester.pl [options]
+
+Options:
+    -s --server hostname
+            Connect to an alternate hostname.
+[...etc...]
+```
+
+This is a minimal utility for testing a quick routine with a memcached
+instance. It will connect, attempt a couple sets, attempt a few gets, then loop and
+repeat.
+
+The utility does not use any memcached client and instead does minimal, raw
+commands with the ASCII protocol. Thus helping to rule out client bugs.
+
+If it reaches a timeout, you can see how far along in the cycle it was:
+
+```
+Fail: (timeout: 1) (elapsed: 1.00427794) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000)
+Fail: (timeout: 1) (elapsed: 1.00133896) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000)
+Fail: (timeout: 1) (elapsed: 1.00135303) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000)
+Fail: (timeout: 1) (elapsed: 1.00145602) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000)
+```
+
+In the above line, it has a total elapsed time of the test, and then the times
+at which each sub-test succeeded. In the above scanario it wasn't able to
+connect to memcached, so all tests failed.
+
+```
+Fail: (timeout: 1) (elapsed: 0.00121498) (conn: 0.00114512) (set: 1.00002694) (get: 0.00000000)
+Fail: (timeout: 1) (elapsed: 0.00368810) (conn: 0.00360799) (set: 1.00003314) (get: 0.00000000)
+Fail: (timeout: 1) (elapsed: 0.00128603) (conn: 0.00117397) (set: 1.00004005) (get: 0.00000000)
+Fail: (timeout: 1) (elapsed: 0.00115108) (conn: 0.00108099) (set: 1.00002789) (get: 0.00000000)
+```
+
+In this case, it failed waiting for "get" to complete.
+
+If you want to log all of the tests mc_conn_tester.pl runs, open the file and
+change the line:
+
+```
+my $debug = 0;
+```
+
+to
+
+```
+my $debug = 1;
+```
+
+You will then see normal lines begin with `loop:` and failed tests will start
+with `Fail:` as usual.
+
+### You're probably dropping packets.
+
+In most cases, where listen_disabled_num doesn't apply, you're likely dropping
+packets for some reason. Either a firewall is in the way and has run out of
+stateful tracking slots, or your network card or switch is dropping packets.
+
+You'll most likely see this manifest as:
+
+```
+Fail: (timeout: 1) (elapsed: 1.00145602) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000)
+```
+
+... where `conn:` and the rest are all zero. So the test was not able to
+connect to memcached.
+
+On most systems SYN retries are 3 seconds, which is awfully long. Losing a
+single SYN packet will certainly mean a timeout. This is easily proven:
+
+```
+$ ./mc_conn_tester.pl -s memcached-host -c 5000 --timeout 1 > log_one_second
+&& ./mc_conn_tester.pl -s memcached-host -c 5000 --timeout 4 > log_three_seconds
+&& ./mc_conn_tester.pl -s memcached-host -c 5000 --timeout 8 > log_eight_seconds
+```
+
+... Run 5000 tests each round (you can adjust this if you wish). The first one
+having a timeout of 1s, which is often the client default. Then next with 4s,
+which would allow for one SYN packet to be lost but still pass the test. Then
+finally 8s, which allows two SYN packets to be lost in a row and yet still
+succeed.
+
+If you see the number of `Fail:` lines in each log file *decrease*, then your
+network is likely dropping SYN packets.
+
+Fixing that, however, is beyond the scope of this document.
+
+### TIME_WAIT buckets or local port exhaustion
+
+If mc_conn_tester.pl is seeing connection timeouts (conn: is 0), you may be
+running out of local ports, firewall states, or TIME_WAIT buckets. This can
+happen if you are opening and closing connections quicker than the sockets can
+die off.
+
+Use netstat to see how many you have open, and if the number is high enough
+that it may be problematic. `netstat -n | grep -c TIME_WAIT`.
+
+Details of how to tune these variables are outside the scope of this document,
+but google for "Linux TCP network tuning TIME_WAIT" (or whatever OS you have) will
+usually give you good results. Look for the variables below and understand
+their meaning before tuning.
+
+```
+!THESE ARE EXAMPLES, NOT RECOMMENDED VALUES!
+net.ipv4.ip_local_port_range = 16384 65534
+net.ipv4.tcp_max_tw_buckets = 262144
+net.ipv4.tcp_fin_timeout = 30
+net.ipv4.tcp_tw_reuse = 1
+net.ipv4.tcp_tw_recycle = 1
+```
+
+Also read up on iptables and look up information on managing conntrack states
+or conntrack buckets. If you find some links you love e-mail us and we'll link
+them here.
+
+### But your utility never fails!
+
+Odds are good your client has a bug :( Try reaching out to the client author
+for help.