import timeouts page
dormando committed Sep 5, 2024
1 parent e199fc5 commit 1abc652
Showing 2 changed files with 171 additions and 0 deletions.
2 changes: 2 additions & 0 deletions content/troubleshooting/_index.md
@@ -3,3 +3,5 @@ title = 'Troubleshooting'
date = 2024-09-02T11:18:44-07:00
weight = 5
+++

[Troubleshooting Client Timeouts](/troubleshooting/timeouts)
169 changes: 169 additions & 0 deletions content/troubleshooting/timeouts.md
@@ -0,0 +1,169 @@
+++
title = 'Client Timeouts'
date = 2024-09-05T11:13:11-07:00
weight = 1
+++

## Troubleshooting Timeouts

Client complaining about "timeout errors", but not sure how to track it down?
Here's a simple utility for examining your situation.

### First, check listen_disabled_num

Before you go ahead with troubleshooting, you'll want to telnet to your
memcached instance and run `stats`, then look for "listen_disabled_num". This
is a poorly named counter which describes how many times you've reached
maxconns. Each time memcached hits maxconns it will delay new connections,
which means you'll possibly get timeouts.
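
If you'd rather script the check than use telnet, something like the following sketch could work. It assumes a memcached instance reachable on `localhost:11211`; the function names `parse_stats` and `fetch_stats` are made up for illustration. It sends the ASCII `stats` command over a raw socket and parses the `STAT name value` lines that come back.

```python
import socket


def parse_stats(text):
    """Turn 'STAT name value' lines from a stats reply into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fetch_stats(host="localhost", port=11211, timeout=1.0):
    """Send 'stats' over a raw socket and read until the END marker.

    host and port are assumptions; point them at your own instance.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats\r\n")
        buf = b""
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break  # server closed the connection early
            buf += chunk
    return parse_stats(buf.decode("ascii", "replace"))
```

Calling `fetch_stats()["listen_disabled_num"]` a few minutes apart and comparing the two values should show whether the counter is still climbing.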

Also, disable or tune any firewalls you may have in the way.

### Then, carefully check the usual suspects

Is the machine in swap? You will see random lag bubbles if your OS is swapping
memcached to disk periodically.

Is the machine overloaded? At 0% CPU idle with a load of 400, memcached
probably isn't getting enough CPU time. You can try `nice` or `renice`, or
just run less on the machine. If you're severely overloaded on CPU, you might
notice the mc_conn_tester below reporting very high wait times for `set`
commands.
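
One rough way to put a number on "overloaded" is load average per CPU core. The sketch below is only an illustration: the helper names are hypothetical, and the threshold of 1.0 per core is an assumption, not a memcached recommendation.

```python
import os


def load_per_core(load_1min, cpu_count):
    """Return the 1-minute load average divided by the number of CPUs."""
    return load_1min / max(cpu_count, 1)


def looks_overloaded(load_1min, cpu_count, threshold=1.0):
    """Flag a host whose load per core exceeds the threshold.

    The default threshold of 1.0 is an illustrative assumption.
    """
    return load_per_core(load_1min, cpu_count) > threshold


def current_load_per_core():
    """Read live values; os.getloadavg() is only available on Unix."""
    return load_per_core(os.getloadavg()[0], os.cpu_count() or 1)
```

A load of 400 on an 8 core machine works out to 50 per core, which is far past the point where memcached would be fighting for CPU time.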

Is the memcached server 32-bit? 32-bit hosts have less memory available to the
kernel for TCP sockets and friends. We've observed some odd behavior under
large numbers of open sockets and high load with 32-bit systems. Strongly
consider going 64-bit, as it may help some hard-to-trace problems go away,
including segfaults due to the 2/4 GB memory limit.

### Next, mc_conn_tester.pl

Fetch this:

https://www.memcached.org/files/mc_conn_tester.pl

```
$ ./mc_conn_tester.pl -s memcached-host -p 11211 -c 1000 --timeout 1
Averages: (conn: 0.00081381) (set: 0.00001603) (get: 0.00040122)
$ ./mc_conn_tester.pl --help
Usage:
mc_conn_tester.pl [options]
Options:
-s --server hostname
Connect to an alternate hostname.
[...etc...]
```

This is a minimal utility for running a quick test routine against a memcached
instance. It will connect, attempt a couple of sets, attempt a few gets, then
loop and repeat.

The utility does not use any memcached client and instead sends minimal, raw
commands with the ASCII protocol, which helps rule out client bugs.
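
To give a feel for what "raw commands with the ASCII protocol" means, here is a minimal sketch of one connect/set/get cycle. It is not a port of mc_conn_tester.pl: the function names, the key `mc_conn_test`, and the way times are recorded are all assumptions made for illustration, and the timeout here applies to each socket operation rather than to the whole cycle.

```python
import socket
import time


def build_set(key, value):
    """ASCII 'set' with flags 0 and no expiry: set <key> 0 0 <bytes>."""
    return b"set %s 0 0 %d\r\n%s\r\n" % (key, len(value), value)


def build_get(key):
    """ASCII 'get' for a single key."""
    return b"get %s\r\n" % key


def read_until(sock, marker):
    """Read from the socket until the reply ends with the given marker."""
    buf = b""
    while not buf.endswith(marker):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("server closed the connection")
        buf += chunk
    return buf


def run_cycle(host, port=11211, timeout=1.0, key=b"mc_conn_test"):
    """Connect, set, get; return (ok, seconds spent in each finished stage)."""
    times = {"conn": 0.0, "set": 0.0, "get": 0.0}
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            times["conn"] = time.monotonic() - start
            t = time.monotonic()
            sock.sendall(build_set(key, b"1"))
            if not read_until(sock, b"\r\n").startswith(b"STORED"):
                raise ConnectionError("set was not stored")
            times["set"] = time.monotonic() - t
            t = time.monotonic()
            sock.sendall(build_get(key))
            read_until(sock, b"END\r\n")
            times["get"] = time.monotonic() - t
    except OSError:
        # Covers timeouts, refused connections, and the errors raised above.
        return False, times
    return True, times
```

A stage that never finished keeps its time at zero, so a failed connect leaves all three values at zero.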

If it reaches a timeout, you can see how far along in the cycle it was:

```
Fail: (timeout: 1) (elapsed: 1.00427794) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00133896) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00135303) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00145602) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000)
```

Each of the above lines shows the total elapsed time of the test, and then the
times at which each sub-test succeeded. In the above scenario it wasn't able to
connect to memcached, so all tests failed.

```
Fail: (timeout: 1) (elapsed: 0.00121498) (conn: 0.00114512) (set: 1.00002694) (get: 0.00000000)
Fail: (timeout: 1) (elapsed: 0.00368810) (conn: 0.00360799) (set: 1.00003314) (get: 0.00000000)
Fail: (timeout: 1) (elapsed: 0.00128603) (conn: 0.00117397) (set: 1.00004005) (get: 0.00000000)
Fail: (timeout: 1) (elapsed: 0.00115108) (conn: 0.00108099) (set: 1.00002789) (get: 0.00000000)
```

In this case, `conn:` and `set:` recorded times but `get:` is still zero, so it
failed waiting for "get" to complete.
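
If you have a lot of these lines to sift through, a small parser can do the reading for you. The sketch below assumes the line format shown in the samples above; `parse_result` and `stalled_stage` are hypothetical helper names. It applies the same rule as the text: the first stage whose time is still zero is the one that never completed.

```python
import re

# Matches pairs such as "(conn: 0.00114512)" or "(timeout: 1)".
FIELD = re.compile(r"\((\w+): ([0-9.]+)\)")


def parse_result(line):
    """Split a result line into its prefix and a dict of numeric fields."""
    status = line.split(":", 1)[0].strip()
    fields = {name: float(value) for name, value in FIELD.findall(line)}
    return status, fields


def stalled_stage(fields):
    """Return the first of conn/set/get whose time is zero, or None."""
    for stage in ("conn", "set", "get"):
        if fields.get(stage, 0.0) == 0.0:
            return stage
    return None
```

Feeding it the first sample block gives `conn` for every line, and the second block gives `get`.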

If you want to log all of the tests mc_conn_tester.pl runs, open the file and
change the line:

```
my $debug = 0;
```

to

```
my $debug = 1;
```

You will then see normal lines begin with `loop:` and failed tests will start
with `Fail:` as usual.

### You're probably dropping packets.

In most cases, where listen_disabled_num doesn't apply, you're likely dropping
packets for some reason. Either a firewall is in the way and has run out of
stateful tracking slots, or your network card or switch is dropping packets.

You'll most likely see this manifest as:

```
Fail: (timeout: 1) (elapsed: 1.00145602) (conn: 0.00000000) (set: 0.00000000) (get: 0.00000000)
```

... where `conn:` and the rest are all zero. So the test was not able to
connect to memcached.

On many systems the first SYN retry comes after 3 seconds (newer Linux kernels
default to 1 second), which is awfully long next to a typical client timeout.
Losing a single SYN packet will certainly mean a timeout. This is easily proven:

```
$ ./mc_conn_tester.pl -s memcached-host -c 5000 --timeout 1 > log_one_second \
  && ./mc_conn_tester.pl -s memcached-host -c 5000 --timeout 4 > log_four_seconds \
  && ./mc_conn_tester.pl -s memcached-host -c 5000 --timeout 8 > log_eight_seconds
```

... This runs 5000 tests each round (you can adjust this if you wish). The
first round has a timeout of 1s, which is often the client default. The next
uses 4s, which would allow for one SYN packet to be lost but still pass the
test. The last uses 8s, which allows two SYN packets to be lost in a row and
yet still succeed.

If you see the number of `Fail:` lines in each log file *decrease*, then your
network is likely dropping SYN packets.
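
Comparing the logs can be scripted. The sketch below is one possible way to do it, with hypothetical helper names; it only assumes that failed tests start with `Fail:`, as shown earlier. Pass it the failure counts in order of increasing timeout.

```python
def count_failures(lines):
    """Count lines that start with the 'Fail:' prefix."""
    return sum(1 for line in lines if line.startswith("Fail:"))


def count_failures_in_file(path):
    """Count 'Fail:' lines in one log file written by a test round."""
    with open(path) as handle:
        return count_failures(handle)


def failures_decrease(counts):
    """True if failures never rise as the timeout grows and end lower.

    counts is ordered from the shortest timeout to the longest.
    """
    never_rises = all(b <= a for a, b in zip(counts, counts[1:]))
    return never_rises and counts[-1] < counts[0]
```

For example, counts of 12, 3 and 0 across the three rounds would point at lost SYN packets, while 5, 5 and 5 would suggest the longer timeout isn't helping and something else is going on.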

Fixing that, however, is beyond the scope of this document.

### TIME_WAIT buckets or local port exhaustion

If mc_conn_tester.pl is seeing connection timeouts (conn: is 0), you may be
running out of local ports, firewall states, or TIME_WAIT buckets. This can
happen if you are opening and closing connections quicker than the sockets can
die off.

Use netstat to count how many sockets are sitting in TIME_WAIT, and judge
whether the number is high enough to be a problem:
`netstat -n | grep -c TIME_WAIT`.
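
On Linux the same count can be read straight from `/proc/net/tcp`, where the fourth column holds the socket state as a hex code and `06` means TIME_WAIT. This sketch is Linux-only and the function names are invented for illustration.

```python
TIME_WAIT = "06"  # hex state code for TIME_WAIT in /proc/net/tcp on Linux


def count_time_wait(lines):
    """Count sockets in TIME_WAIT from /proc/net/tcp-style lines."""
    count = 0
    for line in lines:
        fields = line.split()
        # Columns: sl, local_address, rem_address, st, ...
        if len(fields) > 3 and fields[3] == TIME_WAIT:
            count += 1
    return count


def count_time_wait_live(paths=("/proc/net/tcp", "/proc/net/tcp6")):
    """Sum TIME_WAIT sockets over the IPv4 and IPv6 tables, if present."""
    total = 0
    for path in paths:
        try:
            with open(path) as handle:
                total += count_time_wait(handle)
        except FileNotFoundError:
            pass  # e.g. IPv6 disabled, or not a Linux host
    return total
```

Compare the result against the size of your local port range to get a sense of how close you are to running out.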

Details of how to tune these variables are outside the scope of this document,
but searching for "Linux TCP network tuning TIME_WAIT" (or whatever OS you
have) will usually give you good results. Look for the variables below and
understand their meaning before tuning.

```
# !THESE ARE EXAMPLES, NOT RECOMMENDED VALUES!
net.ipv4.ip_local_port_range = 16384 65534
net.ipv4.tcp_max_tw_buckets = 262144
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_tw_reuse = 1
# tcp_tw_recycle breaks clients behind NAT and was removed in Linux 4.12
net.ipv4.tcp_tw_recycle = 1
```
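
Before changing anything, it helps to see what your host is currently using. The sketch below only reads values and changes nothing. It assumes a Linux host where these names map onto files under `/proc/sys`; `sysctl_path`, `read_sysctl` and `current_values` are hypothetical names.

```python
SYSCTLS = [
    "net.ipv4.ip_local_port_range",
    "net.ipv4.tcp_max_tw_buckets",
    "net.ipv4.tcp_fin_timeout",
    "net.ipv4.tcp_tw_reuse",
    "net.ipv4.tcp_tw_recycle",
]


def sysctl_path(name):
    """Map a dotted sysctl name onto its file under /proc/sys."""
    return "/proc/sys/" + name.replace(".", "/")


def read_sysctl(name):
    """Return the value as a list of strings, or None if unavailable."""
    try:
        with open(sysctl_path(name)) as handle:
            return handle.read().split()
    except OSError:
        return None  # key missing on this kernel, or not a Linux host


def current_values(names=SYSCTLS):
    """Read every listed key; nothing is modified."""
    return {name: read_sysctl(name) for name in names}
```

A key that comes back as `None` simply doesn't exist on that kernel, which is worth knowing before you copy a setting from a tuning guide.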

Also read up on iptables and look up information on managing conntrack states
or conntrack buckets. If you find some links you love, e-mail us and we'll link
them here.

### But your utility never fails!

Odds are good your client has a bug :( Try reaching out to the client author
for help.
