Load Testing on Docker
I've been running load tests to compare the performance of the RestComm binary against the RestComm-Docker image. For this purpose, I developed a RestComm application that performs a Gather and then Says the digit pressed by the caller. The call duration is 6 seconds.
I set up an EC2 m3.2xlarge instance with the specifications shown below. I chose this instance type mainly for its higher core count, which greatly improves RestComm performance, since the Media Server allocates a number of threads proportional to the number of available cores.
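For readers who want to reproduce the test application, here is a minimal sketch of the two RCML documents such an application might return. This is not the exact application used in these tests: the `/say-digit` action URL, the prompt text, and the function names are hypothetical, and it assumes RestComm passes the pressed digits to the action URL in a `Digits` parameter.

```python
import xml.etree.ElementTree as ET


def gather_rcml(action_url: str) -> str:
    """Build the first RCML document: prompt the caller and collect one digit."""
    response = ET.Element("Response")
    gather = ET.SubElement(
        response,
        "Gather",
        {"action": action_url, "method": "POST", "numDigits": "1"},
    )
    # The prompt text is illustrative only.
    ET.SubElement(gather, "Say").text = "Please press a digit."
    return ET.tostring(response, encoding="unicode")


def say_digit_rcml(digits: str) -> str:
    """Build the second RCML document: read back what the caller pressed."""
    response = ET.Element("Response")
    ET.SubElement(response, "Say").text = f"You pressed {digits}."
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    # "/say-digit" is a hypothetical endpoint that would call say_digit_rcml()
    # with the value of the Digits parameter.
    print(gather_rcml("/say-digit"))
    print(say_digit_rcml("5"))
```

Serving these two documents from any small HTTP endpoint should be enough to exercise the same Gather-then-Say path.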
Model | vCPU | Mem (GiB) | SSD Storage (GB) |
---|---|---|---|
m3.medium | 1 | 3.75 | 1 x 4 |
m3.large | 2 | 7.5 | 1 x 32 |
m3.xlarge | 4 | 15 | 2 x 40 |
m3.2xlarge | 8 | 30 | 2 x 80 |
Before running the tests, I configured RestComm as follows:
- Set logging threshold to ERROR (both JBoss and AKKA)
- Configure the Media Server's resource pools ($MS_HOME/deploy/server-beans.xml) to accommodate load peaks by increasing the initial size of each pool (endpoints, players, connections, etc).
- Increase the MGCP timeout in restcomm.xml to 1000ms (from 500ms).
- Configure the JVM for both RestComm and the Media Server as follows:
JAVA_OPTS="-Xms8g -Xmx8g -Xmn512m -XX:+CMSIncrementalPacing -XX:CMSIncrementalDutyCycle=100 -XX:CMSIncrementalDutyCycleMin=100 -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:MaxPermSize=512m"
The results of the binary load tests were as follows:
Concurrent Calls | Call Rate (cps) | Total Calls | Successful | Failed | Comments |
---|---|---|---|---|---|
86 | 10 | 20000 | 20000 | 0 | |
150 | 50 | 50000 | 50000 | 0 | |
167 | 20 | 50000 | 50000 | 0 | |
200 | 30 | 50000 | 50000 | 0 | |
260 | 30 | 50000 | 49954 | 46 | 46 timeouts because SIPp did not receive a BYE; RestComm logs show a SIP Servlet exception when sending the BYE; no errors on the Media Server side |
300 (peak 355) | 30 | 100000 | 66744 | 18678 | Test aborted after 85422 calls; MGCP timeouts started after 50k calls; from then on results got worse and a CPU leak soon rendered the MS unresponsive |
250 | 50 | 44020 | 31824 | 12196 | MGCP timeouts; no errors in the MS log; MS CPU leak prevented the test from continuing |
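To make a row of the table above easier to reproduce, here is a rough sketch of how its columns could map onto SIPp command-line options. The exact SIPp invocation and scenario used in these tests are not recorded here, so everything in this block is an assumption: the `gather-say.xml` scenario file and the `127.0.0.1:5080` target are placeholders, and treating "Concurrent Calls" as SIPp's `-l` limit is my reading of the column.

```python
import shlex


def sipp_command(
    target: str,
    call_rate_cps: int,
    max_concurrent: int,
    total_calls: int,
    scenario_file: str = "gather-say.xml",  # hypothetical scenario file
    call_duration_ms: int = 6000,           # the 6-second call duration
) -> list[str]:
    """Build a SIPp argument list for one load-test run."""
    return [
        "sipp", target,
        "-sf", scenario_file,         # custom scenario (would need to send a DTMF digit)
        "-r", str(call_rate_cps),     # call rate in calls per second
        "-l", str(max_concurrent),    # cap on simultaneous calls
        "-m", str(total_calls),       # stop after this many calls
        "-d", str(call_duration_ms),  # length of scenario pauses that have no fixed duration
    ]


if __name__ == "__main__":
    # The 260 concurrent calls / 30 cps / 50000 calls row; the target is a placeholder.
    print(shlex.join(sipp_command("127.0.0.1:5080", 30, 260, 50000)))
```

Note that `-d` only takes effect if the scenario contains a pause without an explicit duration, so a real scenario file may encode the 6 seconds directly instead.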
The best result I could obtain was 260 concurrent calls at a call rate of 30 calls per second. Although there were 46 failures, these happened because RestComm was unable to send a BYE back to SIPp, which means that in practice each call was established successfully and only failed to hang up gracefully. Higher loads started to generate MGCP timeouts from the Media Server, which led to a CPU leak that soon degraded the quality of the tests and ultimately left the Media Server unresponsive.
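To put the failure counts in perspective, the following sketch computes the failure percentage of each binary run from the figures in the results table. The class and field names are mine; only the numbers come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    concurrent: int   # "Concurrent Calls" column
    rate_cps: int     # "Call Rate (cps)" column
    successful: int
    failed: int

    @property
    def completed(self) -> int:
        """Calls that actually finished (may be below the planned total)."""
        return self.successful + self.failed

    @property
    def failure_pct(self) -> float:
        """Failed calls as a percentage of completed calls."""
        return 100.0 * self.failed / self.completed


# Rows copied from the binary results table.
RUNS = [
    Run(86, 10, 20000, 0),
    Run(150, 50, 50000, 0),
    Run(167, 20, 50000, 0),
    Run(200, 30, 50000, 0),
    Run(260, 30, 49954, 46),
    Run(300, 30, 66744, 18678),
    Run(250, 50, 31824, 12196),
]

if __name__ == "__main__":
    for run in RUNS:
        print(
            f"{run.concurrent:>4} calls @ {run.rate_cps:>2} cps: "
            f"{run.completed} completed, {run.failure_pct:.2f}% failed"
        )
```

The 260-call run fails in well under 1% of calls, while the two heavier runs each fail in more than a fifth of their completed calls, which matches the point where the MGCP timeouts set in.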
Finally, I performed the same round of tests on the RestComm-Docker image, using a configuration similar to the one described above. One detail worth mentioning is that Docker was running in host mode (--net=host) instead of the default bridge mode, because of well-known performance issues. The highest load that resulted in a clean run was 60 concurrent calls at a call rate of 15 calls per second. Quite a performance hit.
My conclusion so far is that Docker imposes a performance penalty on the RestComm-Docker image. It might help to investigate the well-known issues and bottlenecks inherent to Docker and the workarounds currently adopted by the community. For example, running the Docker container in host mode helps performance and reduces memory consumption compared to bridge mode. It may also be worth investigating Docker performance with the OverlayFS filesystem, as described here.
On the other hand, improving the responsiveness of the Media Server’s MGCP stack should reduce the number of timeouts, which in turn should help prevent the CPU leak. That would likely translate into better test results and might even allow higher call rates. Media Server issues to keep an eye on: #109, #58, #60, #92.