
Multi datacenter results


Aim

To scale Voldemort to handle multiple datacenters with minimal impact on latency and throughput. To achieve this we need to make the following changes to Voldemort:

  1. Add new semantics to Voldemort node to determine its proximity to other nodes
  2. Change the routing logic to incorporate the above knowledge to make smarter routing decisions

The details of the new semantics can be found here. In summary, we propose the formation of zones, where a zone is a group of nodes that are geographically close to each other. Thus a zone could be a group of nodes in a data-center or even a group of nodes in a rack. Every zone has knowledge of its distance to other zones in the form of a proximity list. We have also added the ability for every Voldemort store to specify the number of zones it should block for before completing a request.
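
To make this concrete, here is a minimal sketch, with hypothetical types rather than the actual Voldemort classes, of how a routing layer could use a proximity list to order candidate replicas so that local-zone nodes are tried first:

```java
import java.util.*;

// Illustrative sketch only: hypothetical types, not the actual Voldemort classes.
public class ZoneSketch {

    // A zone is a group of nodes that are "close" to each other (a datacenter or a rack).
    // Its proximity list names the other zones in order of increasing distance.
    record Zone(int id, List<Integer> proximityList) {}

    record Node(int id, int zoneId) {}

    // Order candidate replicas so that local-zone nodes come first, followed by
    // nodes in progressively more distant zones.
    static List<Node> orderByProximity(List<Node> replicas, Zone localZone) {
        Map<Integer, Integer> rank = new HashMap<>();
        rank.put(localZone.id(), 0);
        List<Integer> proximity = localZone.proximityList();
        for (int i = 0; i < proximity.size(); i++) {
            rank.put(proximity.get(i), i + 1);
        }
        List<Node> ordered = new ArrayList<>(replicas);
        ordered.sort(Comparator.comparingInt((Node n) ->
                rank.getOrDefault(n.zoneId(), Integer.MAX_VALUE)));
        return ordered;
    }

    public static void main(String[] args) {
        Zone west = new Zone(0, List.of(1));   // zone 0 (west); zone 1 (east) is its only remote zone
        List<Node> replicas = List.of(new Node(0, 0), new Node(1, 1), new Node(2, 0));
        System.out.println(orderByProximity(replicas, west));   // zone-0 nodes first, then the zone-1 node
    }
}
```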

The central part of this project is item (2), wherein we use this proximity-list knowledge to make better routing decisions.

This page contains the results of our tests comparing the following strategies:

  1. Naive datacenter non-aware consistent hashing routing (i.e. what we have in production today, but running over two data centers)
  2. Our new zone-aware consistent hashing, which can operate in two configurations (a rough sketch of both modes follows this list):
    • Cross-datacenter quorums that guarantee that reads and writes go to at least N different zones. This requires blocking on cross-datacenter requests and therefore impacts latency, but it can guarantee consistency and availability of a write even in the face of a datacenter failure.
    • Background replication – This strategy performs blocking operations only in the local datacenter and makes all non-local operations asynchronous. It relies on vector clock versioning and application-level resolution to reconcile the conflicts that would result from two concurrent writes in different datacenters.
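
A rough sketch of both modes, using hypothetical names rather than the actual Voldemort API: `quorumWrite` blocks until the acknowledgements span enough distinct zones, while `backgroundWrite` blocks only on local-zone replicas and lets the remote writes complete asynchronously.

```java
import java.util.*;
import java.util.concurrent.*;

// Illustrative sketch only -- hypothetical names, not the actual Voldemort API.
public class RoutingModesSketch {

    record Node(int id, int zoneId) {}

    static final ExecutorService POOL = Executors.newCachedThreadPool();

    // Stand-in for the real network call; returns the id of the zone that acknowledged.
    static int sendWrite(Node n, byte[] value) {
        return n.zoneId();
    }

    // Mode 1: cross-datacenter quorum. Block until `required` acks have arrived AND
    // the acks span at least `zoneCount + 1` distinct zones (the local zone plus
    // `zoneCount` remote zones in the common case). Assumes the cluster has enough zones.
    static void quorumWrite(List<Node> replicas, byte[] value, int required, int zoneCount)
            throws Exception {
        CompletionService<Integer> done = new ExecutorCompletionService<>(POOL);
        for (Node n : replicas) {
            done.submit(() -> sendWrite(n, value));
        }
        Set<Integer> zonesAcked = new HashSet<>();
        int acks = 0;
        while (acks < required || zonesAcked.size() < zoneCount + 1) {
            zonesAcked.add(done.take().get());   // take() blocks for the next completed response
            acks++;
        }
        // Enough replicas and enough zones have acknowledged; the client can return.
    }

    // Mode 2: background replication. Block only on local-zone replicas; writes to
    // remote zones are fired asynchronously, and concurrent cross-datacenter writes
    // are reconciled later via vector clocks and application-level resolution.
    static void backgroundWrite(List<Node> replicas, byte[] value, int localZoneId)
            throws Exception {
        List<Future<Integer>> localAcks = new ArrayList<>();
        for (Node n : replicas) {
            Future<Integer> f = POOL.submit(() -> sendWrite(n, value));
            if (n.zoneId() == localZoneId) {
                localAcks.add(f);                // only local acknowledgements are awaited
            }
        }
        for (Future<Integer> f : localAcks) {
            f.get();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Node> replicas = List.of(new Node(0, 0), new Node(1, 0), new Node(2, 1)); // 2 west, 1 east
        byte[] value = "hello".getBytes();
        quorumWrite(replicas, value, 2, 1);   // W = 2, blocking for 1 remote zone
        backgroundWrite(replicas, value, 0);  // block only on zone-0 (local) replicas
        POOL.shutdown();
    }
}
```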

These two strategies are configured on a per-store (i.e. table) basis. This allows applications that want strong consistency guarantees to have them, while applications willing to sacrifice some consistency for higher performance can get that as well. The latter is the common case in our sample of uses.

As part of this effort we also re-implemented the routing layer of Voldemort to better handle a large number of slow requests. This is referred to as the pipelined routed store in the results below.
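
The general idea behind this redesign can be sketched as follows (made-up names, not the actual pipelined routed store): all replica requests are issued up front, the caller's future is completed as soon as the preferred number of responses has arrived, and a slow remote replica never ties up a caller thread.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only -- not the actual pipelined routed store implementation.
public class PipelinedGetSketch {

    // Issue all replica reads up front and complete the caller as soon as the
    // `preferred` number of responses has arrived; slow replicas finish in the
    // background without holding any caller thread.
    static CompletableFuture<byte[]> pipelinedGet(List<CompletableFuture<byte[]>> replicaReads,
                                                  int preferred) {
        CompletableFuture<byte[]> result = new CompletableFuture<>();
        AtomicInteger received = new AtomicInteger();
        for (CompletableFuture<byte[]> read : replicaReads) {
            read.thenAccept(value -> {
                // Each arriving response advances the state; the `preferred`-th
                // response completes the caller's future.
                if (received.incrementAndGet() == preferred) {
                    result.complete(value);
                }
            });
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        CompletableFuture<byte[]> fastLocal = CompletableFuture.supplyAsync("v1"::getBytes, pool);
        CompletableFuture<byte[]> slowRemote = CompletableFuture.supplyAsync(() -> {
            try { Thread.sleep(81); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            return "v1".getBytes();   // ~cross-datacenter latency before this reply arrives
        }, pool);

        byte[] value = pipelinedGet(List.of(fastLocal, slowRemote), 1).get();  // returns after the fast reply
        System.out.println(new String(value));
        pool.shutdown();
    }
}
```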

We aim to answer two questions with our testing:

  1. What is the effect on latency due to this new routing algorithm in a multi DC scenario?
  2. What is the effect on throughput?

In both cases there are two baselines we can compare against:

  1. The existing non-datacenter-aware code running in multiple datacenters. We should never be much worse than this, even when giving guarantees across datacenters.
  2. The existing code running in a single datacenter. Obviously we cannot be better than this when we introduce high-latency requests, but we would like background replication to not be much worse.

Test Environment

In order to evaluate our new routing strategy we ran our tests on Amazon EC2. We deployed small instance nodes (1.7 GB memory, 160 GB instance storage, Ubuntu) in 2 different Amazon data-centers, viz. east and west. An important thing to remember here is that these are low-end machines and the numbers are not meaningful in absolute terms; however, as a relative comparison (e.g. single DC vs. multi DC) we think they are representative.

For the test, we set the number of client threads equal to the number of connections allowed per node.

N (number of replicas stored for every key): 3
R (number of replicas we would block for during reads): 2
W (number of replicas we would block for during writes): 2
Existing records in our store entered during the warm-up phase: 200000
Number of operations run during the benchmark: 200000
Value size: 1024 bytes
Cross-datacenter latency: 81 ms
Same-datacenter latency: 0.509 ms
AMI on West Coast: ami-4f5f0e0a
AMI on East Coast: ami-dd957ab4
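
Given these parameters, the latency floors seen later in the result tables follow directly: with Zone Count = 1 every request has to wait for at least one cross-datacenter round trip, while with Zone Count = 0 the required responses can all be served from the local zone. A tiny back-of-the-envelope check (the numbers are simply the ones from the table above):

```java
// Back-of-the-envelope check of the latency floors implied by the test parameters above.
public class LatencyFloorCheck {
    public static void main(String[] args) {
        double crossDcMs = 81.0;    // cross-datacenter latency from the table above
        double sameDcMs = 0.509;    // same-datacenter latency from the table above

        // Zone Count = 1: the quorum must include a remote-zone response, so every
        // request pays at least one cross-datacenter round trip. This shows up as
        // the ~81 ms Min lat in Table 7.
        System.out.println("remote-quorum latency floor ~ " + crossDcMs + " ms");

        // Zone Count = 0: the required responses can all come from the local zone,
        // so the floor is essentially the same-datacenter latency (Tables 5-6).
        System.out.println("local-quorum latency floor  ~ " + sameDcMs + " ms");
    }
}
```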

Latency

The following is a graph of median latency (Y axis) vs. number of client threads (X axis). In other words, we increase the amount of parallelism (in the form of client threads) and observe its effect on latency. The following is a description of each test case:

  • Naive MultiDC (red) – The naive (DC-non-aware) consistent hashing algorithm deployed in a multiple datacenter environment. In other words, we created 2 nodes on the west coast and 1 node on the east coast. The naive algorithm treats all these 3 nodes equally.
  • Naive SingleDC (green) – Uses the same naive consistent hashing algorithm, but this time in a single datacenter environment. That is we used all 3 nodes from the west coast. This is the equivalent of our current configuration.
  • Zone MultiDC – Local (blue) – This uses our new zone-aware routing algorithm with only background replication. We ran this in a multiple datacenter environment wherein we defined 2 zones. Zone 0 contained 2 nodes from the west coast while Zone 1 contained 1 node from the east.
  • Zone MultiDC – Remote (pink) – The zone-aware routing in a configuration requiring each request to interact with one non-local zone.

Reads

Writes

Summary

  1. Blue vs. Green – The performance for the single-DC case and the multi-DC case with background replication is quite comparable. This means that background replication does not significantly affect the client, and we can add support for additional datacenters without serious performance degradation in this configuration.
  2. Red vs. Pink – We see slightly higher latency for both reads and writes with our new zone-aware routing strategy when compared to running the naive algorithm in a multi-DC environment. But this slight increase in latency now guarantees that we will always have a copy of our data in another zone, whereas the previous strategy would often, by random chance, keep all copies local. In other words, enforcing cross-datacenter semantics has little incremental cost in a multi-datacenter scenario.

Throughput

Our goal is to be able to maintain high throughput even in a multi datacenter scenario.

Reads

Writes

Summary

As expected, an increase in the number of client threads results in an increase in throughput in all cases.

  1. Blue vs. Green – This summarizes the throughput hit we incur by moving to multiple datacenters without requiring cross-datacenter semantics. Comparing these two scenarios, we actually see an improvement in read throughput in the multi-DC environment with zone-aware routing compared to the single-DC case. This is counter-intuitive, but the cause is the overall improvements we made in the pipelining of requests as part of this project. For writes, on the other hand, multi-DC with zone routing gives lower throughput than single-DC up to about 40 threads, after which it overtakes the single-DC case.
  2. Red vs. Pink – There are no surprises when comparing these two tests: with the naive algorithm (red), a request may, by chance, never go across to the other zone, whereas with the zone-aware remote configuration (pink) we force the request to block until a remote-zone response is received, thereby incurring a minimum latency equal to the cross-zone latency (compare the Min lat columns of Tables 1 and 7 below; a quick arithmetic check follows this list). Thus the throughput is lower in the multi-DC environment, but with the strong guarantee of having a remote replica.
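
As a rough sanity check on the tables below, the throughput and average-latency columns are consistent with a closed-loop benchmark in which each client thread keeps one request outstanding, so throughput is approximately threads divided by average latency (this is an observation from the numbers, not something the test setup states explicitly):

```java
// Rough sanity check: for a closed-loop benchmark, throughput ~= threads / average latency.
public class LittlesLawCheck {
    public static void main(String[] args) {
        check("Table 1, reads, 10 threads", 10, 72.63, 135.70);
        check("Table 7, reads, 10 threads", 10, 94.97, 104.56);
    }

    static void check(String label, int threads, double avgLatencyMs, double measuredOpsPerSec) {
        double predicted = threads / (avgLatencyMs / 1000.0);
        System.out.printf("%s: predicted %.1f ops/sec, measured %.1f ops/sec%n",
                label, predicted, measuredOpsPerSec);
    }
}
```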

Result Dump

All latencies are in ms; throughput is in operations per second.

Table 1 – Naive MultiDC Reads

Op Threads Min lat Max lat Median lat 95th percentile lat 99th percentile lat Avg lat Throughput
reads 10 0 2623 82 125 144 72.63036 135.6980068
reads 20 0 2917 83 126 147 73.691925 265.5157444
reads 30 0 3561 83 126 152 74.78994899 394.0055988
reads 40 0 6364 84 132 185 80.05115 489.6272468
reads 50 0 6458 84 139 211 80.241145 603.7681168
reads 60 0 6872 85 163 231 85.18961896 686.6198392

Table 2 – Naive MultiDC Writes

Op Threads Min lat Max lat Median lat 95th percentile lat 99th percentile lat Avg lat Throughput
writes 10 1 3216 161 275 305 159.0387 62.31158534
writes 20 1 4045 258 368 411 230.46631 85.93250517
writes 30 1 5771 257 367 417 232.0618862 127.5510204
writes 40 1 3451 260 372 438 238.29019 166.0486888
writes 50 1 8816 262 382 458 244.95525 199.9532109
writes 60 1 5690 270 412 577 255.967517 231.5768997

Table 3 – Naive SingleDC Reads

Op Threads Min lat Max lat Median lat 95th percentile lat 99th percentile lat Avg lat Throughput
reads 10 0 675 19 63 80 22.48538 431.0242645
reads 20 0 3089 22 69 128 27.74373 702.4346385
reads 30 0 1862 26 101 182 33.9590209 855.3844311
reads 40 0 1028 29 135 215 43.47711 900.2115497
reads 50 0 5735 36 153 255 52.49223 929.9384381
reads 60 0 4085 45 179 302 60.89041404 954.1575028

Table 4 – Naive SingleDC Writes

Op Threads Min lat Max lat Median lat 95th percentile lat 99th percentile lat Avg lat Throughput
writes 10 1 846 33 100 140 39.65215 247.1637954
writes 20 1 1637 72 161 244 77.77219 253.1011215
writes 30 1 1939 79 205 340 95.21793179 310.030414
writes 40 1 1728 92 250 442 115.32439 341.08855
writes 50 1 2460 130 325 566 153.07419 322.1763658
writes 60 1 5677 145 390 681 176.4291317 332.9393551

Table 5 – Zone MultiDC Local – Reads ( Zone Count = 0 )

Op Threads Min lat Max lat Median lat 95th percentile lat 99th percentile lat Avg lat Throughput
reads 10 0 536 3 54 71 15.008425 644.8180807
reads 20 0 438 19 62 121 23.67133 823.021559
reads 30 0 578 23 76 144 29.54253425 991.8863695
reads 40 0 676 28 128 162 37.27433 1047.877524
reads 50 0 1014 33 142 182 46.05392 1062.185659
reads 60 0 1816 44 156 196 54.81218622 1074.881629

Table 6 – Zone MultiDC Local – Writes ( Zone Count = 0 )

Op Threads Min lat Max lat Median lat 95th percentile lat 99th percentile lat Avg lat Throughput
writes 10 1 1426 82 128 170 91.33087 108.6284678
writes 20 1 1477 88 182 242 102.14024 194.158546
writes 30 1 1461 94 205 304 109.3179718 271.5583375
writes 40 2 2418 104 228 327 122.22935 323.102821
writes 50 1 2286 114 259 391 136.03922 361.6335712
writes 60 2 4386 128 285 426 150.7876451 391.7298005

Table 7 – Zone MultiDC Remote – Reads ( Zone Count = 1 )

Op Threads Min lat Max lat Median lat 95th percentile lat 99th percentile lat Avg lat Throughput
reads 10 81 1132 87 126 135 94.971145 104.5631352
reads 20 81 1592 91 130 200 99.110385 200.0568161
reads 30 81 4015 94 146 245 103.759916 285.10904
reads 40 81 2573 98 187 274 109.02197 362.7223034
reads 50 81 4022 102 210 292 117.18184 421.5478394
reads 60 81 3573 104 220 312 121.6437244 487.0801978

Table 8 – Zone MultiDC Remote – Writes ( Zone Count = 1 )

Op Threads Min lat Max lat Median lat 95th percentile lat 99th percentile lat Avg lat Throughput
writes 10 163 1408 204 253 290 203.92787 48.85813649
writes 20 245 1925 294 369 456 305.79862 65.06561868
writes 30 245 2365 301 397 527 315.4977198 94.39750791
writes 40 245 4271 313 442 630 331.62486 119.7034228
writes 50 246 3437 325 497 689 349.99634 141.7836382
writes 60 246 7279 344 531 910 373.7387355 159.2229918