
GCE perf #7

Open
miekg opened this issue May 9, 2017 · 38 comments


miekg commented May 9, 2017

I've started playing around in GCE.

3 machines:

  • instance-1: n1-standard-2
  • instance-2: n1-standard-4
  • sender: n1-standard-4

All machines run Ubuntu-1703. CoreDNS is the 007 release from GitHub.


miekg commented May 9, 2017

Some simple queryperf runs to get a baseline:

0.2: Queries per second: 58800.788936 qps
0.3: Queries per second: 74555.654706 qps


miekg commented May 9, 2017

Using SO_REUSEADDR did not make any impact at all! SO_REUSEPORT even made performance worse; I don't know what I saw in #4, but I can't easily replicate it.


miekg commented May 9, 2017

Profiler output while profiling. All of this is done against the n1-standard-4 instance.
Queries per second: 51117.913849 qps

% go tool pprof pprof.coredns.localhost:6053.samples.cpu.001.pb.gz
Entering interactive mode (type "help" for commands)
(pprof) top
24650ms of 48350ms total (50.98%)
Dropped 209 nodes (cum <= 241.75ms)
Showing top 10 nodes out of 161 (cum >= 1770ms)
      flat  flat%   sum%        cum   cum%
    9740ms 20.14% 20.14%    10280ms 21.26%  syscall.Syscall
    4170ms  8.62% 28.77%     4170ms  8.62%  runtime.futex
    2400ms  4.96% 33.73%     5760ms 11.91%  runtime.mallocgc
    1980ms  4.10% 37.83%     2740ms  5.67%  runtime.scanobject
    1490ms  3.08% 40.91%     1490ms  3.08%  runtime.epollwait
    1170ms  2.42% 43.33%     3290ms  6.80%  runtime.pcvalue
    1040ms  2.15% 45.48%     1040ms  2.15%  runtime.readvarint
    1000ms  2.07% 47.55%     1000ms  2.07%  runtime.usleep
     860ms  1.78% 49.33%     5830ms 12.06%  runtime.gentraceback
     800ms  1.65% 50.98%     1770ms  3.66%  github.com/coredns/coredns/vendor/github.com/miekg/dns.packDomainName
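
For context, the profile is served by CoreDNS's pprof middleware (its default address is localhost:6053, matching the profile filename), and whoami is the only middleware answering queries. The thread doesn't show the Corefile, so this is only a guess at what it roughly looks like:

. {
    whoami
    pprof
}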


miekg commented May 9, 2017

This looks pretty OK-ish.

(pprof) cum
(pprof) top10
10.57s of 48.35s total (21.86%)
Dropped 209 nodes (cum <= 0.24s)
Showing top 10 nodes out of 161 (cum >= 9.85s)
      flat  flat%   sum%        cum   cum%
         0     0%     0%     35.03s 72.45%  runtime.goexit
     0.23s  0.48%  0.48%     24.35s 50.36%  github.com/coredns/coredns/vendor/github.com/miekg/dns.(*Server).serve
     0.02s 0.041%  0.52%     20.41s 42.21%  github.com/coredns/coredns/vendor/github.com/miekg/dns.HandlerFunc.ServeDNS
     0.02s 0.041%  0.56%     20.39s 42.17%  github.com/coredns/coredns/core/dnsserver.(*Server).ServePacket.func1
     0.21s  0.43%  0.99%     20.21s 41.80%  github.com/coredns/coredns/core/dnsserver.(*Server).ServeDNS
     0.02s 0.041%  1.03%     19.21s 39.73%  github.com/coredns/coredns/middleware/whoami.(*Whoami).ServeDNS
     0.26s  0.54%  1.57%     19.19s 39.69%  github.com/coredns/coredns/middleware/whoami.Whoami.ServeDNS
     0.05s   0.1%  1.68%     13.20s 27.30%  github.com/coredns/coredns/vendor/github.com/miekg/dns.(*response).WriteMsg
     9.74s 20.14% 21.82%     10.28s 21.26%  syscall.Syscall
     0.02s 0.041% 21.86%      9.85s 20.37%  github.com/coredns/coredns/vendor/github.com/miekg/dns.(*response).Write


miekg commented May 9, 2017

Uploading the profile
pprof.coredns.localhost:6053.samples.cpu.001.pb.gz


miekg commented May 9, 2017

Looking at the web view, syscall.Syscall again stands out.


miekg commented May 9, 2017

(pprof) text syscall.Syscall
10.21s of 48.35s total (21.12%)
Dropped 2 nodes (cum <= 0.24s)
Showing top 10 nodes out of 37 (cum >= 7.88s)
      flat  flat%   sum%        cum   cum%
     9.74s 20.14% 20.14%     10.28s 21.26%  syscall.Syscall
     0.20s  0.41% 20.56%      0.20s  0.41%  runtime.casgstatus
     0.12s  0.25% 20.81%      0.32s  0.66%  runtime.reentersyscall
     0.07s  0.14% 20.95%      0.21s  0.43%  runtime.exitsyscall
     0.07s  0.14% 21.10%      0.07s  0.14%  runtime.futex
     0.01s 0.021% 21.12%      0.33s  0.68%  runtime.entersyscall
         0     0% 21.12%      7.88s 16.30%  github.com/coredns/coredns/core/dnsserver.(*Server).ServeDNS
         0     0% 21.12%      2.40s  4.96%  github.com/coredns/coredns/core/dnsserver.(*Server).ServePacket
         0     0% 21.12%      7.88s 16.30%  github.com/coredns/coredns/core/dnsserver.(*Server).ServePacket.func1
         0     0% 21.12%      7.88s 16.30%  github.com/coredns/coredns/middleware/whoami.(*Whoami).ServeDNS


miekg commented May 9, 2017

Well, syscall.Syscall just performs the syscall, and that takes time. (pprof) list syscall.Syscall shows the source, which lists the children above (runtime.casgstatus and friends). So far this is actually working as intended, but it does feel like something needs to be optimized here.


miekg commented May 9, 2017

There are two questions:

  1. Is this syscall.Syscall an abnormal thing?
  2. Should it be faster?


miekg commented May 9, 2017

OK:

(pprof) list syscall.sendmsg
Total: 48.35s
ROUTINE ======================== syscall.sendmsg in /home/miek/upstream/go/src/syscall/zsyscall_linux_amd64.go
      10ms      7.89s (flat, cum) 16.32% of Total
         .          .   1745:}
         .          .   1746:
         .          .   1747:// THIS FILE IS GENERATED BY THE COMMAND AT THE TOP; DO NOT EDIT
         .          .   1748:
         .          .   1749:func sendmsg(s int, msg *Msghdr, flags int) (n int, err error) {
         .      7.88s   1750:	r0, _, e1 := Syscall(SYS_SENDMSG, uintptr(s), uintptr(unsafe.Pointer(msg)), uintptr(flags))
         .          .   1751:	n = int(r0)
         .          .   1752:	if e1 != 0 {
         .          .   1753:		err = errnoErr(e1)
         .          .   1754:	}
      10ms       10ms   1755:	return
         .          .   1756:}
         .          .   1757:
         .          .   1758:// THIS FILE IS GENERATED BY THE COMMAND AT THE TOP; DO NOT EDIT
         .          .   1759:
         .          .   1760:func mmap(addr uintptr, length uintptr, prot int, flags int, fd int, offset int64) (xaddr uintptr, err error) {

The syscall is sending the message. That makes sense and looks legit, i.e. it answers question 1 from above.


miekg commented May 9, 2017

Ah, the slowness might be the use of defer() in my patch to Go DNS. I'll retest.
Not true; this is only called once per socket.


miekg commented May 9, 2017

Not using any of the functions in udp_linux.go results in a similar profile, with a large chunk taken by Syscall:

26030ms of 52680ms total (49.41%)
Dropped 220 nodes (cum <= 263.40ms)
Showing top 10 nodes out of 156 (cum >= 2040ms)
      flat  flat%   sum%        cum   cum%
    9430ms 17.90% 17.90%     9890ms 18.77%  syscall.Syscall
    4550ms  8.64% 26.54%     4550ms  8.64%  runtime.futex
    2760ms  5.24% 31.78%     6890ms 13.08%  runtime.mallocgc
    2690ms  5.11% 36.88%     3600ms  6.83%  runtime.scanobject
    1600ms  3.04% 39.92%     4200ms  7.97%  runtime.pcvalue
    1250ms  2.37% 42.29%     1250ms  2.37%  runtime.readvarint
    1200ms  2.28% 44.57%     1200ms  2.28%  runtime.epollwait
     970ms  1.84% 46.41%     2400ms  4.56%  github.com/coredns/coredns/vendor/github.com/miekg/dns.packDomainName
     790ms  1.50% 47.91%     6850ms 13.00%  runtime.gentraceback
     790ms  1.50% 49.41%     2040ms  3.87%  runtime.step

It is nice to see dns.packDomainName show up, but just writing the responses takes 3x longer...


miekg commented May 9, 2017

OK, completely removing OOB UDP sending does not impact performance; it hovers around 70K qps.

Going into the kernel to send these UDP packets is slow (and we do it for every packet). I'm now starting to wonder how other UDP servers deal with this.
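
One technique other UDP servers use (not something tried in this thread) is batching, so a single syscall covers many packets via recvmmsg(2)/sendmmsg(2). A minimal Linux-only echo-loop sketch using the golang.org/x/net/ipv4 batch API; the port and batch sizes are illustrative:

package main

import (
	"log"
	"net"

	"golang.org/x/net/ipv4"
)

func main() {
	conn, err := net.ListenPacket("udp4", ":1053")
	if err != nil {
		log.Fatal(err)
	}
	p := ipv4.NewPacketConn(conn)

	// Pre-allocate a batch of 32 messages, 512 bytes each.
	msgs := make([]ipv4.Message, 32)
	for i := range msgs {
		msgs[i].Buffers = [][]byte{make([]byte, 512)}
	}

	for {
		// One recvmmsg(2) call reads up to len(msgs) packets.
		n, err := p.ReadBatch(msgs, 0)
		if err != nil {
			log.Fatal(err)
		}
		for i := 0; i < n; i++ {
			// Trim each buffer to the bytes actually received;
			// Addr was filled in by ReadBatch.
			msgs[i].Buffers[0] = msgs[i].Buffers[0][:msgs[i].N]
		}
		// One sendmmsg(2) call echoes the whole batch back.
		if _, err := p.WriteBatch(msgs[:n], 0); err != nil {
			log.Fatal(err)
		}
		for i := range msgs {
			// Restore full buffer capacity for the next read.
			msgs[i].Buffers[0] = msgs[i].Buffers[0][:512]
		}
	}
}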


miekg commented May 9, 2017

@johnbelamaric do you know how this is optimized in C programs?


miekg commented May 9, 2017

This also explains the speedup I saw locally (although not the 3x increase); AFAICT Linux optimizes out the network when going through localhost, which would mean no kernel context switch...?


miekg commented May 9, 2017

https://groups.google.com/forum/#!topic/golang-nuts/tBxagDKvrm8 mentions Go DNS (sigh) and the slowness of syscalls on Linux.


miekg commented May 9, 2017

It might make sense to write a super simple UDP echo server from scratch in Go and see how that performs.

@johnbelamaric

As for optimization in C: no, but let me ping some folks here and see if they know.


miekg commented May 9, 2017

Thanks. The core question seems to be: how long should it take to write 200 B to a UDP socket, and is this slower than it should be in Go (or, as it stands now, in Go DNS)?
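
To put a rough number on that, a micro-benchmark along these lines could be used (a sketch, not part of this repo; BenchmarkWriteUDP is a made-up name):

package udpbench

import (
	"net"
	"testing"
)

// BenchmarkWriteUDP times a single ~200 B write to a local UDP socket.
func BenchmarkWriteUDP(b *testing.B) {
	// Throw-away listener so the writes have a destination.
	dst, err := net.ListenPacket("udp", "127.0.0.1:0")
	if err != nil {
		b.Fatal(err)
	}
	defer dst.Close()

	conn, err := net.Dial("udp", dst.LocalAddr().String())
	if err != nil {
		b.Fatal(err)
	}
	defer conn.Close()

	msg := make([]byte, 200)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := conn.Write(msg); err != nil {
			b.Fatal(err)
		}
	}
}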


miekg commented May 10, 2017

This is a super simple Go UDP server: https://gist.github.com/miekg/d9bc045c89578f3cc66a214488e68227
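
For readers without access to the gist, the shape of such a server is roughly the following (a minimal sketch, not the exact gist contents): it reads in the main goroutine and writes each answer from its own goroutine.

package main

import (
	"log"
	"net"
)

func serve(conn *net.UDPConn, remote *net.UDPAddr, msg []byte) {
	// Echo the packet back; a real DNS server would at least set the QR bit.
	conn.WriteToUDP(msg, remote)
}

func main() {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 1053})
	if err != nil {
		log.Fatal(err)
	}
	buf := make([]byte, 512)
	for {
		// Read in the main goroutine ...
		n, remote, err := conn.ReadFromUDP(buf)
		if err != nil {
			continue
		}
		msg := make([]byte, n)
		copy(msg, buf[:n])
		// ... and write each answer from its own goroutine.
		go serve(conn, remote, msg)
	}
}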

Doing the same queryperf in GCE yields:

./bin/queryperf-linux_amd64 -d domain.lst.queryperf -l 5 -s 10.132.0.3 -p 1053
DNS Query Performance Testing Tool
Version: $Id: queryperf.c,v 1.12 2007/09/05 07:36:04 marka Exp $
[Status] Processing input data
[Status] Sending queries (beginning with 10.132.0.3)
[Status] Testing complete
Statistics:
  Parse input file:     multiple times
  Run time limit:       5 seconds
  Ran through file:     144313 times
  Queries sent:         432942 queries
  Queries completed:    432942 queries
  Queries lost:         0 queries
  Queries delayed(?):   0 queries
  RTT max:              0.002241 sec
  RTT min:              0.000026 sec
  RTT average:          0.000200 sec
  RTT std deviation:    0.000072 sec
  RTT out of range:     0 queries
  Percentage completed: 100.00%
  Percentage lost:        0.00%
  Started at:           Wed May 10 09:31:13 2017
  Finished at:          Wed May 10 09:31:18 2017
  Ran for:              5.000065 seconds
  Queries per second:   86587.274365 qps

~85K qps, 20% faster than Go DNS. Nice, but not earth-shattering.


miekg commented May 10, 2017

LOL, with profiling this drops to "Queries per second: 72013.668522 qps".


miekg commented May 10, 2017

Fetching profile from http://localhost:6060/debug/pprof/profile
Please wait... (30s)
Saved profile in /home/miek/pprof/pprof.udp.localhost:6060.samples.cpu.001.pb.gz
Entering interactive mode (type "help" for commands)
(pprof) top
28760ms of 38180ms total (75.33%)
Dropped 147 nodes (cum <= 190.90ms)
Showing top 10 nodes out of 90 (cum >= 340ms)
      flat  flat%   sum%        cum   cum%
   13440ms 35.20% 35.20%    14220ms 37.24%  syscall.Syscall6
    6070ms 15.90% 51.10%     6070ms 15.90%  runtime.futex
    4650ms 12.18% 63.28%     4650ms 12.18%  runtime.epollwait
    1630ms  4.27% 67.55%     1630ms  4.27%  runtime.usleep
     620ms  1.62% 69.17%     2200ms  5.76%  runtime.mallocgc
     590ms  1.55% 70.72%      590ms  1.55%  runtime.casgstatus
     560ms  1.47% 72.18%    10160ms 26.61%  runtime.findrunnable
     560ms  1.47% 73.65%      560ms  1.47%  runtime.memclrNoHeapPointers
     330ms  0.86% 74.52%      340ms  0.89%  runtime.unlock
     310ms  0.81% 75.33%      340ms  0.89%  runtime.lock

(pprof) cum
(pprof) top10
14.22s of 38.18s total (37.24%)
Dropped 147 nodes (cum <= 0.19s)
Showing top 10 nodes out of 90 (cum >= 11.40s)
      flat  flat%   sum%        cum   cum%
     0.01s 0.026% 0.026%     24.36s 63.80%  runtime.goexit
     0.13s  0.34%  0.37%     16.05s 42.04%  main.serve
     0.10s  0.26%  0.63%     15.92s 41.70%  net.(*UDPConn).WriteTo
     0.02s 0.052%  0.68%     15.82s 41.44%  net.(*UDPConn).writeTo
     0.18s  0.47%  1.15%     14.96s 39.18%  net.(*netFD).writeTo
    13.44s 35.20% 36.35%     14.22s 37.24%  syscall.Syscall6
     0.05s  0.13% 36.49%     12.03s 31.51%  runtime.mcall
     0.20s  0.52% 37.01%     11.67s 30.57%  runtime.schedule
     0.05s  0.13% 37.14%     11.59s 30.36%  syscall.Sendto
     0.04s   0.1% 37.24%     11.40s 29.86%  syscall.sendto


miekg commented May 10, 2017

(pprof) disasm syscall.Syscall
Total: 38.18s
ROUTINE ======================== syscall.Syscall6
    13.44s     14.22s (flat, cum) 37.24% of Total
         .      520ms     477d00: CALL runtime.entersyscall(SB)
         .          .     477d05: MOVQ 0x10(SP), DI
         .          .     477d0a: MOVQ 0x18(SP), SI
         .          .     477d0f: MOVQ 0x20(SP), DX
         .          .     477d14: MOVQ 0x28(SP), R10
         .          .     477d19: MOVQ 0x30(SP), R8
         .          .     477d1e: MOVQ 0x38(SP), R9
         .          .     477d23: MOVQ 0x8(SP), AX
         .          .     477d28: SYSCALL
    13.42s     13.42s     477d2a: CMPQ $-0xfff, AX
         .          .     477d30: JBE 0x477d52
         .          .     477d32: MOVQ $-0x1, 0x40(SP)
         .          .     477d3b: MOVQ $0x0, 0x48(SP)
         .          .     477d44: NEGQ AX
         .          .     477d47: MOVQ AX, 0x50(SP)
         .          .     477d4c: CALL runtime.exitsyscall(SB)
         .          .     477d51: RET
         .          .     477d52: MOVQ AX, 0x40(SP)
      10ms       10ms     477d57: MOVQ DX, 0x48(SP)
         .          .     477d5c: MOVQ $0x0, 0x50(SP)
      10ms      270ms     477d65: CALL runtime.exitsyscall(SB)


miekg commented May 10, 2017

UDP echo server in C (does not set the QR bit, just echoes the bytes back): https://gist.github.com/miekg/a61d55a8ec6560ad6c4a2747b21e6128

Queryperf: 84343.155846 qps (same order of magnitude).


miekg commented May 10, 2017

https://github.com/kavu/go_reuseport makes this somewhat faster in the Go case. It fluctuates in my localhost testing from 60K to 90K.
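
For reference, on newer Go versions (1.11+) the same SO_REUSEPORT trick can be done without an external package, via net.ListenConfig's Control hook; a sketch, assuming golang.org/x/sys/unix:

package main

import (
	"context"
	"log"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

func listenReusePort(addr string) (net.PacketConn, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var serr error
			err := c.Control(func(fd uintptr) {
				// SO_REUSEPORT lets several sockets bind the same port,
				// with the kernel load-balancing packets between them.
				serr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			})
			if err != nil {
				return err
			}
			return serr
		},
	}
	return lc.ListenPacket(context.Background(), "udp", addr)
}

func main() {
	conn, err := listenReusePort(":1053")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}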

@johnbelamaric

From my colleague Kevin who is a veritable performance guru:

The first question I'd ask is if they're seeing 100% CPU.

If not, then they need to see if their path to sendmsg (including any internal mutexes) has opportunities to block.

In C, we make sure we never do I/O with a lock. Even with a single thread though, a simple test does a hell of a lot more than that.


miekg commented May 10, 2017

top - 14:09:17 up 1 day,  3:34,  2 users,  load average: 1.26, 0.35, 0.12
Tasks: 132 total,   1 running, 131 sleeping,   0 stopped,   0 zombie
%Cpu0  : 17.8 us, 28.4 sy,  0.0 ni, 53.4 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu1  : 18.2 us, 29.6 sy,  0.0 ni, 52.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 19.9 us, 25.2 sy,  0.0 ni, 37.4 id,  0.0 wa,  0.0 hi, 17.5 si,  0.0 st
%Cpu3  : 14.6 us, 31.9 sy,  0.0 ni, 53.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 15398136 total, 13453972 free,   216784 used,  1727380 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 14836896 avail Mem

Not CPU bound.

Following some of the Go code paths, they seem sane-ish, but all kinds of stuff is happening because of goroutine scheduling and so forth.

Thanks for the update.

@johnbelamaric

Does the simple echo server use goroutines? It would be interesting to see the raw performance of a single-threaded server with no goroutines.


miekg commented May 10, 2017

That should be simple to do. Yes, it is using multiple goroutines to write the answers; the reading is done from the main goroutine. SO_REUSEPORT could be helpful there, but initial, small-scale testing didn't show any improvement.


miekg commented May 10, 2017

Geh, going completely sequential, i.e. changing go serve() into a plain serve() call, yields better performance: still in the 70-80 Kqps range, but more often on the higher end, e.g. 90K. See the sketch below.
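
In terms of the earlier echo-server sketch, the sequential variant looks roughly like this (again a sketch; the real code is in the gist). The answer is written from the same goroutine that read the query, so there is no per-packet goroutine and no buffer copy:

package main

import (
	"log"
	"net"
)

func main() {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 1053})
	if err != nil {
		log.Fatal(err)
	}
	buf := make([]byte, 512)
	for {
		n, remote, err := conn.ReadFromUDP(buf)
		if err != nil {
			continue
		}
		// Sequential: write the answer inline instead of `go serve(...)`.
		conn.WriteToUDP(buf[:n], remote)
	}
}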


miekg commented May 10, 2017

This is totally on par with the C program now. On my Linode VPS both programs go up to ~90 Kqps with the queryperf as committed to this repo.


miekg commented May 10, 2017

Userspace <-> kernel space context switches are hard; let's go fishing?


miekg commented May 10, 2017

I'm going to put NSD on these nodes tomorrow and check what it does, because I believe we should reverse the testing: clearly 80-100 Kqps is the max this "hardware" can do. Let's optimize relative to that, i.e. subtract syscall.Syscall from all the profiles and see what's left.

This depends a bit on my NSD testing; if that yields 2x the qps, something is very wrong somewhere.


miekg commented May 11, 2017

NSD is even more nonsensical and drops replies, so I get only 140 qps. Setup: NSD 4 (installed via apt-get).
nsd.conf:

server:
    port: 1053
    server-count: 1
    ip4-only: yes
    zonesdir: "/home/miek/zones"
    pidfile: "/home/miek/nsd-pid"
    database: "/home/miek/nsd.db"
    username: miek
remote-control:
    control-enable: no
zone:
    name: example.org
    zonefile: "/home/miek/zones/example.org"

NSD was run with nsd -c nsd.conf -d.

I think #7 (comment) is the most sensible way forward if we want to optimize performance.


miekg commented May 11, 2017

So the basic idea is this: get a UDP baseline with udpserv.go and/or the C program that does the same thing, then benchmark CoreDNS on the same hardware, see how far off we are, and optimize.


achille-roussel commented Jan 15, 2018

Hello @miekg

Did you ever find time to finish this up?

Edit: never mind, I missed the other open issues; it seems you moved on to using different setups. Thanks for all the details!


miekg commented Jan 16, 2018 via email
