
Multi datacenter results


Aim

To scale Voldemort to handle multiple datacenters with minimal impact on latency and throughput. To achieve this we need to make the following changes to Voldemort:

  1. Add new semantics to Voldemort node to determine its proximity to other nodes
  2. Change the routing logic to incorporate the above knowledge to make smarter routing decisions

The details of the new semantics can be found here. In summary, we propose the formation of zones, where a zone is a group of nodes that are geographically close to each other. Thus a zone could be a group of nodes in a data-center or even a group of nodes in a rack. Every zone has knowledge of its distance to other zones in the form of a proximity list. We have also added the ability for every Voldemort store to specify the number of zones it should block for before completing a request.
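
To make this concrete, here is a minimal sketch, with hypothetical types rather than the actual Voldemort classes, of how a routing layer could use a proximity list to order candidate replicas so that local-zone nodes are tried first:

```java
import java.util.*;

// Illustrative sketch only: hypothetical types, not the actual Voldemort classes.
public class ZoneSketch {

    // A zone is a group of nodes that are "close" to each other (a datacenter or a rack).
    // Its proximity list names the other zones in order of increasing distance.
    record Zone(int id, List<Integer> proximityList) {}

    record Node(int id, int zoneId) {}

    // Order candidate replicas so that local-zone nodes come first, followed by
    // nodes in progressively more distant zones.
    static List<Node> orderByProximity(List<Node> replicas, Zone localZone) {
        Map<Integer, Integer> rank = new HashMap<>();
        rank.put(localZone.id(), 0);
        List<Integer> proximity = localZone.proximityList();
        for (int i = 0; i < proximity.size(); i++) {
            rank.put(proximity.get(i), i + 1);
        }
        List<Node> ordered = new ArrayList<>(replicas);
        ordered.sort(Comparator.comparingInt((Node n) ->
                rank.getOrDefault(n.zoneId(), Integer.MAX_VALUE)));
        return ordered;
    }

    public static void main(String[] args) {
        Zone west = new Zone(0, List.of(1));   // zone 0 (west); zone 1 (east) is its only remote zone
        List<Node> replicas = List.of(new Node(0, 0), new Node(1, 1), new Node(2, 0));
        System.out.println(orderByProximity(replicas, west));   // zone-0 nodes first, then the zone-1 node
    }
}
```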

The central part of this project is item (2), wherein we use this proximity-list knowledge to make better routing decisions.

This page contains the results of our tests comparing the following strategies:

  1. Naive datacenter non-aware consistent hashing routing (i.e. what we have in production today, but running over two data centers)
  2. Our new zone-aware consistent hashing, which can operate in two configurations (a rough sketch of both modes follows this list):
    • Cross-datacenter quorums that guarantee that reads and writes go to at least N different zones. This requires blocking on cross-datacenter requests and therefore impacts latency, but it can guarantee consistency and availability of a write even in the face of a datacenter failure.
    • Background replication – This strategy performs blocking operations only in the local datacenter and makes all non-local operations asynchronous. It relies on vector clock versioning and application-level resolution to reconcile the conflicts that would result from two concurrent writes in different datacenters.
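
A rough sketch of both modes, using hypothetical names rather than the actual Voldemort API: `quorumWrite` blocks until the acknowledgements span enough distinct zones, while `backgroundWrite` blocks only on local-zone replicas and lets the remote writes complete asynchronously.

```java
import java.util.*;
import java.util.concurrent.*;

// Illustrative sketch only -- hypothetical names, not the actual Voldemort API.
public class RoutingModesSketch {

    record Node(int id, int zoneId) {}

    static final ExecutorService POOL = Executors.newCachedThreadPool();

    // Stand-in for the real network call; returns the id of the zone that acknowledged.
    static int sendWrite(Node n, byte[] value) {
        return n.zoneId();
    }

    // Mode 1: cross-datacenter quorum. Block until `required` acks have arrived AND
    // the acks span at least `zoneCount + 1` distinct zones (the local zone plus
    // `zoneCount` remote zones in the common case). Assumes the cluster has enough zones.
    static void quorumWrite(List<Node> replicas, byte[] value, int required, int zoneCount)
            throws Exception {
        CompletionService<Integer> done = new ExecutorCompletionService<>(POOL);
        for (Node n : replicas) {
            done.submit(() -> sendWrite(n, value));
        }
        Set<Integer> zonesAcked = new HashSet<>();
        int acks = 0;
        while (acks < required || zonesAcked.size() < zoneCount + 1) {
            zonesAcked.add(done.take().get());   // take() blocks for the next completed response
            acks++;
        }
        // Enough replicas and enough zones have acknowledged; the client can return.
    }

    // Mode 2: background replication. Block only on local-zone replicas; writes to
    // remote zones are fired asynchronously, and concurrent cross-datacenter writes
    // are reconciled later via vector clocks and application-level resolution.
    static void backgroundWrite(List<Node> replicas, byte[] value, int localZoneId)
            throws Exception {
        List<Future<Integer>> localAcks = new ArrayList<>();
        for (Node n : replicas) {
            Future<Integer> f = POOL.submit(() -> sendWrite(n, value));
            if (n.zoneId() == localZoneId) {
                localAcks.add(f);                // only local acknowledgements are awaited
            }
        }
        for (Future<Integer> f : localAcks) {
            f.get();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Node> replicas = List.of(new Node(0, 0), new Node(1, 0), new Node(2, 1)); // 2 west, 1 east
        byte[] value = "hello".getBytes();
        quorumWrite(replicas, value, 2, 1);   // W = 2, blocking for 1 remote zone
        backgroundWrite(replicas, value, 0);  // block only on zone-0 (local) replicas
        POOL.shutdown();
    }
}
```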

These two strategies are configured on a per-store (i.e. table) basis. This allows applications that want strong consistency guarantees to have them, while applications willing to sacrifice some consistency for higher performance can get that as well. The latter is the common case in our sample of uses.

As part of this effort we also re-implemented the routing layer of Voldemort to better handle a large number of slow requests. This is referred to as the pipelined routed store in the results below.
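
The general idea behind this redesign can be sketched as follows (made-up names, not the actual pipelined routed store): all replica requests are issued up front, the caller's future is completed as soon as the preferred number of responses has arrived, and a slow remote replica never ties up a caller thread.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only -- not the actual pipelined routed store implementation.
public class PipelinedGetSketch {

    // Issue all replica reads up front and complete the caller as soon as the
    // `preferred` number of responses has arrived; slow replicas finish in the
    // background without holding any caller thread.
    static CompletableFuture<byte[]> pipelinedGet(List<CompletableFuture<byte[]>> replicaReads,
                                                  int preferred) {
        CompletableFuture<byte[]> result = new CompletableFuture<>();
        AtomicInteger received = new AtomicInteger();
        for (CompletableFuture<byte[]> read : replicaReads) {
            read.thenAccept(value -> {
                // Each arriving response advances the state; the `preferred`-th
                // response completes the caller's future.
                if (received.incrementAndGet() == preferred) {
                    result.complete(value);
                }
            });
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        CompletableFuture<byte[]> fastLocal = CompletableFuture.supplyAsync("v1"::getBytes, pool);
        CompletableFuture<byte[]> slowRemote = CompletableFuture.supplyAsync(() -> {
            try { Thread.sleep(81); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            return "v1".getBytes();   // ~cross-datacenter latency before this reply arrives
        }, pool);

        byte[] value = pipelinedGet(List.of(fastLocal, slowRemote), 1).get();  // returns after the fast reply
        System.out.println(new String(value));
        pool.shutdown();
    }
}
```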

We aim to answer two questions with our testing:

  1. What is the effect on latency due to this new routing algorithm in a multi DC scenario?
  2. What is the effect on throughput?

In both cases there are two baselines we can compare against:

  1. The existing non-datacenter-aware code running in multiple datacenters. We should never be much worse than this, even when giving guarantees across datacenters.
  2. The existing code running in a single datacenter. Obviously we cannot be better than this when we introduce high-latency requests, but we would like background replication to not be much worse.

Test Environment

In order to evaluate our new routing strategy we ran our tests on Amazon EC2. We deployed small instance nodes (1.7 GB memory, 160 GB instance storage, Ubuntu) in 2 different Amazon data-centers, viz. east and west. An important thing to remember here is that these are low-end machines and the numbers are not meaningful in absolute terms; however, as a relative comparison (e.g. single DC vs. multi DC) we think they are representative.

For the test, we set the number of client threads equal to the number of connections allowed per node.

N (number of replicas stored for every key): 3
R (number of replicas we would block for during reads): 2
W (number of replicas we would block for during writes): 2
Existing records in our store entered during the warm-up phase: 200000
Number of operations run during the benchmark: 200000
Value size: 1024 bytes
Cross-datacenter latency: 81 ms
Same-datacenter latency: 0.509 ms
AMI on West Coast: ami-4f5f0e0a
AMI on East Coast: ami-dd957ab4
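
Given these parameters, the latency floors seen later in the result tables follow directly: with Zone Count = 1 every request has to wait for at least one cross-datacenter round trip, while with Zone Count = 0 the required responses can all be served from the local zone. A tiny back-of-the-envelope check (the numbers are simply the ones from the table above):

```java
// Back-of-the-envelope check of the latency floors implied by the test parameters above.
public class LatencyFloorCheck {
    public static void main(String[] args) {
        double crossDcMs = 81.0;    // cross-datacenter latency from the table above
        double sameDcMs = 0.509;    // same-datacenter latency from the table above

        // Zone Count = 1: the quorum must include a remote-zone response, so every
        // request pays at least one cross-datacenter round trip. This shows up as
        // the ~81 ms Min lat in Table 7.
        System.out.println("remote-quorum latency floor ~ " + crossDcMs + " ms");

        // Zone Count = 0: the required responses can all come from the local zone,
        // so the floor is essentially the same-datacenter latency (Tables 5-6).
        System.out.println("local-quorum latency floor  ~ " + sameDcMs + " ms");
    }
}
```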

Latency

The following is a graph of median latency (Y axis) vs. number of client threads (X axis). In other words, we increase the amount of parallelism (in the form of client threads) and observe its effect on latency. The following is a description of each test case:

  • Naive MultiDC (red) – The naive (DC-non-aware) consistent hashing algorithm deployed in a multiple datacenter environment. In other words, we created 2 nodes on the west coast and 1 node on the east coast. The naive algorithm treats all these 3 nodes equally.
  • Naive SingleDC (green) – Uses the same naive consistent hashing algorithm, but this time in a single datacenter environment. That is we used all 3 nodes from the west coast. This is the equivalent of our current configuration.
  • Zone MultiDC – Local (blue) – This uses our new zone-aware routing algorithm with only background replication. We ran this in a multiple datacenter environment wherein we defined 2 zones. Zone 0 contained 2 nodes from the west coast while Zone 1 contained 1 node from the east.
  • Zone MultiDC – Remote (pink) – The zone-aware routing in a configuration requiring each request to interact with one non-local zone.

Reads

Writes

Summary

  1. Blue vs. Green – The performance for the single-DC case and the multi-DC case with background replication is quite comparable. This means that background replication does not significantly affect the client, and we can add support for additional datacenters without serious performance degradation in this configuration.
  2. Red vs. Pink – We see slightly higher latency for both reads and writes with our new zone-aware routing strategy when compared to running the naive algorithm in a multi-DC environment. But this slight increase in latency now guarantees that we will always have a copy of our data in another zone, whereas the previous strategy would often, by random chance, keep all copies local. In other words, enforcing cross-datacenter semantics has little incremental cost in a multi-datacenter scenario.

Throughput

Our goal is to be able to maintain high throughput even in a multi datacenter scenario.

Reads

Writes

Summary

As expected, an increase in the number of client threads results in an increase in throughput in all cases.

  1. Blue vs. Green – This summarizes the throughput hit we incur by moving to multiple datacenters without requiring cross-datacenter semantics. Comparing these two scenarios, we actually see an improvement in read throughput in the multi-DC environment with zone-aware routing compared to the single-DC case. This is counter-intuitive, but the cause is the overall improvements we made in the pipelining of requests as part of this project. For writes, on the other hand, multi-DC with zone routing gives lower throughput than single-DC up to about 40 threads, after which it overtakes the single-DC case.
  2. Red vs. Pink – There are no surprises when comparing these two tests: with the naive algorithm (red), a request may, by chance, never go across to the other zone, whereas with the zone-aware remote configuration (pink) we force the request to block until a remote-zone response is received, thereby incurring a minimum latency equal to the cross-zone latency (compare the Min lat columns of Tables 1 and 7 below; a quick arithmetic check follows this list). Thus the throughput is lower in the multi-DC environment, but with the strong guarantee of having a remote replica.
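
As a rough sanity check on the tables below, the throughput and average-latency columns are consistent with a closed-loop benchmark in which each client thread keeps one request outstanding, so throughput is approximately threads divided by average latency (this is an observation from the numbers, not something the test setup states explicitly):

```java
// Rough sanity check: for a closed-loop benchmark, throughput ~= threads / average latency.
public class LittlesLawCheck {
    public static void main(String[] args) {
        check("Table 1, reads, 10 threads", 10, 72.63, 135.70);
        check("Table 7, reads, 10 threads", 10, 94.97, 104.56);
    }

    static void check(String label, int threads, double avgLatencyMs, double measuredOpsPerSec) {
        double predicted = threads / (avgLatencyMs / 1000.0);
        System.out.printf("%s: predicted %.1f ops/sec, measured %.1f ops/sec%n",
                label, predicted, measuredOpsPerSec);
    }
}
```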

Result Dump

All latencies are in ms; throughput is in operations per second.

Table 1 – Naive MultiDC Reads

Op Threads Min lat Max lat Median lat 95th percentile lat 99th percentile lat Avg lat Throughput
reads 10 0 2623 82 125 144 72.63036 135.6980068
reads 20 0 2917 83 126 147 73.691925 265.5157444
reads 30 0 3561 83 126 152 74.78994899 394.0055988
reads 40 0 6364 84 132 185 80.05115 489.6272468
reads 50 0 6458 84 139 211 80.241145 603.7681168
reads 60 0 6872 85 163 231 85.18961896 686.6198392

Table 2 – Naive MultiDC Writes

Op Threads Min lat Max lat Median lat 95th percentile lat 99th percentile lat Avg lat Throughput
writes 10 1 3216 161 275 305 159.0387 62.31158534
writes 20 1 4045 258 368 411 230.46631 85.93250517
writes 30 1 5771 257 367 417 232.0618862 127.5510204
writes 40 1 3451 260 372 438 238.29019 166.0486888
writes 50 1 8816 262 382 458 244.95525 199.9532109
writes 60 1 5690 270 412 577 255.967517 231.5768997

Table 3 – Naive SingleDC Reads

Op Threads Min lat Max lat Median lat 95th percentile lat 99th percentile lat Avg lat Throughput
reads 10 0 675 19 63 80 22.48538 431.0242645
reads 20 0 3089 22 69 128 27.74373 702.4346385
reads 30 0 1862 26 101 182 33.9590209 855.3844311
reads 40 0 1028 29 135 215 43.47711 900.2115497
reads 50 0 5735 36 153 255 52.49223 929.9384381
reads 60 0 4085 45 179 302 60.89041404 954.1575028

Table 4 – Naive SingleDC Writes

Op Threads Min lat Max lat Median lat 95th percentile lat 99th percentile lat Avg lat Throughput
writes 10 1 846 33 100 140 39.65215 247.1637954
writes 20 1 1637 72 161 244 77.77219 253.1011215
writes 30 1 1939 79 205 340 95.21793179 310.030414
writes 40 1 1728 92 250 442 115.32439 341.08855
writes 50 1 2460 130 325 566 153.07419 322.1763658
writes 60 1 5677 145 390 681 176.4291317 332.9393551

Table 5 – Zone MultiDC Local – Reads ( Zone Count = 0 )

Op Threads Min lat Max lat Median lat 95th percentile lat 99th percentile lat Avg lat Throughput
reads 10 0 536 3 54 71 15.008425 644.8180807
reads 20 0 438 19 62 121 23.67133 823.021559
reads 30 0 578 23 76 144 29.54253425 991.8863695
reads 40 0 676 28 128 162 37.27433 1047.877524
reads 50 0 1014 33 142 182 46.05392 1062.185659
reads 60 0 1816 44 156 196 54.81218622 1074.881629

Table 6 – Zone MultiDC Local – Writes ( Zone Count = 0 )

Op Threads Min lat Max lat Median lat 95th percentile lat 99th percentile lat Avg lat Throughput
writes 10 1 1426 82 128 170 91.33087 108.6284678
writes 20 1 1477 88 182 242 102.14024 194.158546
writes 30 1 1461 94 205 304 109.3179718 271.5583375
writes 40 2 2418 104 228 327 122.22935 323.102821
writes 50 1 2286 114 259 391 136.03922 361.6335712
writes 60 2 4386 128 285 426 150.7876451 391.7298005

Table 7 – Zone MultiDC Remote – Reads ( Zone Count = 1 )

Op Threads Min lat Max lat Median lat 95th percentile lat 99th percentile lat Avg lat Throughput
reads 10 81 1132 87 126 135 94.971145 104.5631352
reads 20 81 1592 91 130 200 99.110385 200.0568161
reads 30 81 4015 94 146 245 103.759916 285.10904
reads 40 81 2573 98 187 274 109.02197 362.7223034
reads 50 81 4022 102 210 292 117.18184 421.5478394
reads 60 81 3573 104 220 312 121.6437244 487.0801978

Table 8 – Zone MultiDC Remote – Writes ( Zone Count = 1 )

Op Threads Min lat Max lat Median lat 95th percentile lat 99th percentile lat Avg lat Throughput
writes 10 163 1408 204 253 290 203.92787 48.85813649
writes 20 245 1925 294 369 456 305.79862 65.06561868
writes 30 245 2365 301 397 527 315.4977198 94.39750791
writes 40 245 4271 313 442 630 331.62486 119.7034228
writes 50 246 3437 325 497 689 349.99634 141.7836382
writes 60 246 7279 344 531 910 373.7387355 159.2229918