Multi datacenter results
Our goal is to scale Voldemort across multiple datacenters with minimal impact on latency and throughput. To achieve this we need to make the following changes to Voldemort:
- Add new semantics to Voldemort node to determine its proximity to other nodes
- Change the routing logic to incorporate the above knowledge to make smarter routing decisions
The details of the new semantics can be found here. In summary, we propose the formation of zones, where a zone is a group of nodes that are geographically close to each other. A zone could therefore be a group of nodes in a datacenter, or even a group of nodes in a rack. Every zone knows its distance to the other zones in the form of a proximity list. We have also added the ability for every Voldemort store to specify the number of zones it wants to block for before completing a request.
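As a rough illustration of these semantics (the class and field names below are assumptions for this sketch, not Voldemort's actual code), a zone can be modeled as an id plus an ordered proximity list, and each store carries the number of zones it must block for:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the zone semantics described above.
public class ZoneSketch {

    /** A zone groups nodes that are physically close (a datacenter or a rack). */
    static class Zone {
        final int zoneId;
        // Other zone ids, ordered from nearest to farthest.
        final List<Integer> proximityList;

        Zone(int zoneId, List<Integer> proximityList) {
            this.zoneId = zoneId;
            this.proximityList = proximityList;
        }
    }

    public static void main(String[] args) {
        // The topology used in the tests below: zone 0 = two nodes on the
        // west coast, zone 1 = one node on the east coast.
        Zone west = new Zone(0, Arrays.asList(1));
        Zone east = new Zone(1, Arrays.asList(0));

        // Per-store setting (assumed name): how many distinct zones a request
        // must block for before it is considered complete.
        int requiredZones = 2; // 2 = cross-datacenter quorum, 1 = local only

        System.out.println("zone " + west.zoneId + " is closest to zone "
                + west.proximityList.get(0) + ", blocking for " + requiredZones + " zone(s)");
    }
}
```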
The core of this project is the second change, where we use the proximity list to make smarter routing decisions.
This page contains the results of our tests comparing the following strategies:
- Naive, datacenter-unaware consistent hashing routing (i.e. what we have in production today, but running over two datacenters)
- Our new zone aware consistent hashing, which can operate in two configurations
- Cross datacenter quorums that guarantee that reads and writes go to at least N different zones. This requires blocking on cross-datacenter requests so impacts latency, but can guarantee consistency and availability of a write even in the face of a datacenter failure.
- Background replication – This strategy only blocks on operations in the local datacenter and makes all cross-datacenter operations asynchronous. It relies on vector clock versioning and application-level resolution to handle the conflicts that can result from concurrent writes in different datacenters.
These two strategies are configured on a per-store (i.e. per-table) basis. This allows applications that want strong consistency guarantees to have them, while those willing to sacrifice some consistency for higher performance can get that as well. The latter is the common case in our sample of uses.
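As a minimal sketch of how the two per-store configurations differ (the names below are illustrative assumptions, not Voldemort's routing code), a request can be considered complete once both the replica quorum and the store's zone requirement are satisfied, with any remaining replicas handled in the background:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: a request completes once the replica quorum AND
// the per-store zone requirement are both met; the rest is done asynchronously.
public class ZoneQuorumSketch {

    static class StoreConfig {
        final int requiredResponses; // R for reads, W for writes
        final int requiredZones;     // 1 = background replication, 2 = cross-DC quorum
        StoreConfig(int requiredResponses, int requiredZones) {
            this.requiredResponses = requiredResponses;
            this.requiredZones = requiredZones;
        }
    }

    /** Responses seen so far: a count plus the set of zones they came from. */
    static boolean isComplete(StoreConfig cfg, int responses, Set<Integer> zonesSeen) {
        return responses >= cfg.requiredResponses
                && zonesSeen.size() >= cfg.requiredZones;
    }

    public static void main(String[] args) {
        StoreConfig crossDc = new StoreConfig(2, 2);   // must block on a remote zone
        StoreConfig localOnly = new StoreConfig(2, 1); // blocks only locally

        Set<Integer> zones = new HashSet<>();
        zones.add(0); // two responses, both from the local zone 0
        System.out.println(isComplete(crossDc, 2, zones));   // false: still needs zone 1
        System.out.println(isComplete(localOnly, 2, zones)); // true: local quorum is enough
    }
}
```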
As part of this effort we also re-implemented the routing layer of Voldemort to better handle a large number of slow requests. This is called the pipelined routed store in the results below.
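The core idea behind the pipelining can be sketched roughly as follows (an illustrative sketch using `CompletableFuture`, not the actual pipelined routed store implementation): all replicas are contacted in parallel and the caller is released as soon as the preferred number of responses arrives, while slower replicas finish in the background. For the zone-aware configurations, the same completion check would additionally require responses from the configured number of distinct zones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: contact every replica in parallel and unblock the
// caller once "preferred" responses arrive; stragglers complete on their own.
public class PipelineSketch {

    static CompletableFuture<List<String>> routeGet(
            List<CompletableFuture<String>> replicaRequests, int preferred) {
        CompletableFuture<List<String>> result = new CompletableFuture<>();
        List<String> responses = new CopyOnWriteArrayList<>();
        AtomicInteger outstanding = new AtomicInteger(replicaRequests.size());

        for (CompletableFuture<String> request : replicaRequests) {
            request.whenComplete((value, error) -> {
                if (error == null) {
                    responses.add(value);
                    // Unblock the caller once enough replicas have answered.
                    if (responses.size() >= preferred) {
                        result.complete(new ArrayList<>(responses));
                    }
                }
                // If all replicas have answered (or failed) without reaching
                // the preferred count, return whatever we collected.
                if (outstanding.decrementAndGet() == 0 && !result.isDone()) {
                    result.complete(new ArrayList<>(responses));
                }
            });
        }
        return result;
    }
}
```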
We aim to answer two questions with our testing:
- What is the effect on latency due to this new routing algorithm in a multi DC scenario?
- What is the effect on throughput?
In both cases there are two baselines we can compare to
- The existing non-datacenter-aware code running in multiple datacenters. We should never be much worse than this, even when giving guarantees about the datacenters.
- The existing code running in a single datacenter. Obviously we cannot be better than this when we introduce high latency requests, but we would like background replication to not be much worse.
In order to evaluate our new routing strategy we ran our tests on Amazon EC2. We deployed small instance nodes (1.7 GB memory, 160 GB instance storage, Ubuntu) in two different Amazon datacenters, east and west. An important thing to remember is that these are low-end machines, so the numbers are not meaningful in absolute terms; however, as a relative comparison (e.g. single DC vs. multi DC) we think they are representative.
For the test, we set the number of client threads equal to the number of connections allowed per node.
Parameter | Value
---|---
N (number of replicas stored for every key) | 3
R (number of replicas we block for during reads) | 2
W (number of replicas we block for during writes) | 2
Records entered into the store during the warm-up phase | 200,000
Number of operations run during the benchmark | 200,000
Value size | 1024 bytes
Cross-datacenter latency | 81 ms
Same-datacenter latency | 0.509 ms
AMI on West Coast | ami-4f5f0e0a
AMI on East Coast | ami-dd957ab4
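For reference, the workload was generated with a simple multi-threaded driver along these lines, assuming the standard Voldemort client API (the bootstrap URL, store name, and exact client settings are illustrative placeholders, not the precise benchmark code):

```java
import java.util.Random;

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;

// Sketch of a single benchmark thread; each of the 10-60 client threads
// ran a loop like this against the cluster.
public class BenchmarkSketch {

    public static void main(String[] args) {
        int threads = 10;              // varied from 10 to 60 across runs
        int operations = 200000;       // operations per benchmark run
        byte[] value = new byte[1024]; // 1 KB values, as in the table above

        ClientConfig config = new ClientConfig()
                .setBootstrapUrls("tcp://some-voldemort-node:6666")
                // Client threads were kept equal to connections per node.
                .setMaxConnectionsPerNode(threads);
        StoreClientFactory factory = new SocketStoreClientFactory(config);
        StoreClient<String, byte[]> client = factory.getStoreClient("test-store");

        Random random = new Random();
        long start = System.nanoTime();
        for (int i = 0; i < operations; i++) {
            String key = Integer.toString(random.nextInt(operations));
            long begin = System.nanoTime();
            client.get(key); // client.put(key, value) for the write benchmark
            long latencyMs = (System.nanoTime() - begin) / 1_000_000;
            // per-operation latencies are collected to compute the median,
            // 95th and 99th percentile numbers reported in the tables below
        }
        double elapsedSec = (System.nanoTime() - start) / 1_000_000_000.0;
        System.out.println("throughput: " + (operations / elapsedSec) + " ops/sec");
    }
}
```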
The following is a graph of median latency (Y axis) vs. number of client threads (X axis). In other words, we increase the amount of parallelism (in the form of client threads) and visualize its effect on latency. The following is a description of each test case:
- Naive MultiDC (red) – The naive (DC-non-aware) consistent hashing algorithm deployed in a multiple datacenter environment. In other words, we created 2 nodes on the west coast and 1 node on the east coast. The naive algorithm treats all these 3 nodes equally.
- Naive SingleDC (green) – Uses the same naive consistent hashing algorithm, but this time in a single datacenter environment. That is we used all 3 nodes from the west coast. This is the equivalent of our current configuration.
- Zone MultiDC – Local (blue) – This uses our new zone aware routing algorithm using only background replication. We ran this in a multiple datacenter environment wherein we defined 2 zones. Zone 0 contained 2 nodes from the west coast while Zone 1 contained 1 node from the east.
- Zone MultiDC – Remote (pink) – The zone-aware routing in a configuration requiring each request to interact with one non-local zone.
- Blue vs. Green – The performance for the single-DC case and the multi-DC case with background replication are quite comparable. This means that background replication does not significantly affect the client, and we can add support for additional datacenters without serious performance degradation in this configuration.
- Red vs. Pink – We see slightly higher latency for both reads and writes with our new zone-aware routing strategy compared to running the naive algorithm in a multi-DC environment. But this slight increase in latency now guarantees that we always have a copy of our data in another zone, whereas the previous strategy would often end up purely local by random chance. In other words, enforcing the guaranteed-datacenter semantics has little incremental cost in a multi-datacenter scenario while giving cross-datacenter guarantees.
Our goal is to maintain high throughput even in a multi-datacenter scenario.
As expected, an increase in the number of client threads results in an increase in throughput in all cases.
- Blue vs. Green – This summarizes the throughput hit we incur by moving to multiple datacenters without requiring cross-datacenter semantics. Comparing these two lines, we actually see an improvement in read throughput in the multi-DC environment with zone-aware routing compared to single-DC. This is counter-intuitive, but the cause is the overall improvements we made to request pipelining as part of this project. For writes, on the other hand, multi-DC with zone routing gives lower throughput than single-DC below 40 threads, after which its throughput increases.
- Red vs. Pink – There are no surprises when comparing these two tests: with the naive algorithm (red), some requests never cross over to the other zone, whereas in the pink case we force the request to block until a remote-zone response arrives, incurring a minimum latency equal to the cross-zone latency (compare the "min" columns of Tables 1 and 7 below). Throughput is therefore lower in the multi-DC environment, but with the strong guarantee of having a remote replica.
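As a rough sanity check on this ceiling: with the 81 ms cross-datacenter latency measured above, a single synchronous client thread in the remote-zone configuration can complete at most about 1000 / 81 ≈ 12 reads per second, so 60 threads top out near 60 × 12 ≈ 740 reads/sec. The roughly 487 reads/sec measured at 60 threads is consistent with that bound once queueing and the higher observed median (~104 ms) are taken into account.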
All latencies are in milliseconds; throughput is in operations per second.
Op | Threads | Min lat | Max lat | Median lat | 95th percentile lat | 99th percentile lat | Avg lat | Throughput
---|---|---|---|---|---|---|---|---
reads | 10 | 0 | 2623 | 82 | 125 | 144 | 72.63036 | 135.6980068 |
reads | 20 | 0 | 2917 | 83 | 126 | 147 | 73.691925 | 265.5157444 |
reads | 30 | 0 | 3561 | 83 | 126 | 152 | 74.78994899 | 394.0055988 |
reads | 40 | 0 | 6364 | 84 | 132 | 185 | 80.05115 | 489.6272468 |
reads | 50 | 0 | 6458 | 84 | 139 | 211 | 80.241145 | 603.7681168 |
reads | 60 | 0 | 6872 | 85 | 163 | 231 | 85.18961896 | 686.6198392 |
Op | Threads | Min lat | Max lat | Median lat | 95th percentile lat | 99th percentile lat | Avg lat | Throughput
---|---|---|---|---|---|---|---|---
writes | 10 | 1 | 3216 | 161 | 275 | 305 | 159.0387 | 62.31158534 |
writes | 20 | 1 | 4045 | 258 | 368 | 411 | 230.46631 | 85.93250517 |
writes | 30 | 1 | 5771 | 257 | 367 | 417 | 232.0618862 | 127.5510204 |
writes | 40 | 1 | 3451 | 260 | 372 | 438 | 238.29019 | 166.0486888 |
writes | 50 | 1 | 8816 | 262 | 382 | 458 | 244.95525 | 199.9532109 |
writes | 60 | 1 | 5690 | 270 | 412 | 577 | 255.967517 | 231.5768997 |
Op | Threads | Min lat | Max lat | Median lat | 95th percentile lat | 99th percentile lat | Avg lat | Throughput
---|---|---|---|---|---|---|---|---
reads | 10 | 0 | 675 | 19 | 63 | 80 | 22.48538 | 431.0242645 |
reads | 20 | 0 | 3089 | 22 | 69 | 128 | 27.74373 | 702.4346385 |
reads | 30 | 0 | 1862 | 26 | 101 | 182 | 33.9590209 | 855.3844311 |
reads | 40 | 0 | 1028 | 29 | 135 | 215 | 43.47711 | 900.2115497 |
reads | 50 | 0 | 5735 | 36 | 153 | 255 | 52.49223 | 929.9384381 |
reads | 60 | 0 | 4085 | 45 | 179 | 302 | 60.89041404 | 954.1575028 |
Op | Threads | Min lat | Max lat | Median lat | 95th percentile lat | 99th percentile lat | Avg lat | Throughput
---|---|---|---|---|---|---|---|---
writes | 10 | 1 | 846 | 33 | 100 | 140 | 39.65215 | 247.1637954 |
writes | 20 | 1 | 1637 | 72 | 161 | 244 | 77.77219 | 253.1011215 |
writes | 30 | 1 | 1939 | 79 | 205 | 340 | 95.21793179 | 310.030414 |
writes | 40 | 1 | 1728 | 92 | 250 | 442 | 115.32439 | 341.08855 |
writes | 50 | 1 | 2460 | 130 | 325 | 566 | 153.07419 | 322.1763658 |
writes | 60 | 1 | 5677 | 145 | 390 | 681 | 176.4291317 | 332.9393551 |
Op | Threads | Min lat | Max lat | Median lat | 95th percentile lat | 99th percentile lat | Avg lat | Throughput
---|---|---|---|---|---|---|---|---
reads | 10 | 0 | 536 | 3 | 54 | 71 | 15.008425 | 644.8180807 |
reads | 20 | 0 | 438 | 19 | 62 | 121 | 23.67133 | 823.021559 |
reads | 30 | 0 | 578 | 23 | 76 | 144 | 29.54253425 | 991.8863695 |
reads | 40 | 0 | 676 | 28 | 128 | 162 | 37.27433 | 1047.877524 |
reads | 50 | 0 | 1014 | 33 | 142 | 182 | 46.05392 | 1062.185659 |
reads | 60 | 0 | 1816 | 44 | 156 | 196 | 54.81218622 | 1074.881629 |
Op | Threads | Min lat | Max lat | Median lat | 95th percentile lat | 99th percentile lat | Avg lat | Throughput
---|---|---|---|---|---|---|---|---
writes | 10 | 1 | 1426 | 82 | 128 | 170 | 91.33087 | 108.6284678 |
writes | 20 | 1 | 1477 | 88 | 182 | 242 | 102.14024 | 194.158546 |
writes | 30 | 1 | 1461 | 94 | 205 | 304 | 109.3179718 | 271.5583375 |
writes | 40 | 2 | 2418 | 104 | 228 | 327 | 122.22935 | 323.102821 |
writes | 50 | 1 | 2286 | 114 | 259 | 391 | 136.03922 | 361.6335712 |
writes | 60 | 2 | 4386 | 128 | 285 | 426 | 150.7876451 | 391.7298005 |
Op | Threads | Min lat | Max lat | Median lat | 95th percentile lat | 99th percentile lat | Avg lat | Throughput
---|---|---|---|---|---|---|---|---
reads | 10 | 81 | 1132 | 87 | 126 | 135 | 94.971145 | 104.5631352 |
reads | 20 | 81 | 1592 | 91 | 130 | 200 | 99.110385 | 200.0568161 |
reads | 30 | 81 | 4015 | 94 | 146 | 245 | 103.759916 | 285.10904 |
reads | 40 | 81 | 2573 | 98 | 187 | 274 | 109.02197 | 362.7223034 |
reads | 50 | 81 | 4022 | 102 | 210 | 292 | 117.18184 | 421.5478394 |
reads | 60 | 81 | 3573 | 104 | 220 | 312 | 121.6437244 | 487.0801978 |
Op | Threads | Min lat | Max lat | Median lat | 95th percentile lat | 99th percentile lat | Avg lat | Throughput
---|---|---|---|---|---|---|---|---
writes | 10 | 163 | 1408 | 204 | 253 | 290 | 203.92787 | 48.85813649 |
writes | 20 | 245 | 1925 | 294 | 369 | 456 | 305.79862 | 65.06561868 |
writes | 30 | 245 | 2365 | 301 | 397 | 527 | 315.4977198 | 94.39750791 |
writes | 40 | 245 | 4271 | 313 | 442 | 630 | 331.62486 | 119.7034228 |
writes | 50 | 246 | 3437 | 325 | 497 | 689 | 349.99634 | 141.7836382 |
writes | 60 | 246 | 7279 | 344 | 531 | 910 | 373.7387355 | 159.2229918 |