Evaluating the Use of Replication Factor = 1 During Zone Failures in a 3-Zone Mimir Deployment #9411

ilmiye · 2024-09-25T15:38:39Z


Hello,

The Grafana Mimir documentation on zone-aware replication states:

“Deploying Grafana Mimir clusters to more zones than the configured replication factor does not have a negative impact. Deploying Grafana Mimir clusters to fewer zones than the configured replication factor can cause writes to the replica to be missed or fail completely. If there are fewer than floor(replication factor / 2) zones with failing replicas, reads and writes can withstand zone failures.”

Our current replication factor is 3, which, based on the statement above, suggests we should be able to survive the loss of one Mimir zone. However, in practice, during a recent incident with one of our clusters, we were unable to maintain operations under heavy load following an AZ (Availability Zone) failure.
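To make the expectation concrete, here is a minimal sketch of the quorum arithmetic, assuming the usual majority rule in which a write needs floor(RF / 2) + 1 successful replicas. The function names are hypothetical and are not part of any Mimir API.

```python
def write_quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge a write (assumed majority rule)."""
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor: int) -> int:
    """Replicas that can fail while the assumed quorum is still reachable."""
    return replication_factor - write_quorum(replication_factor)


for rf in (1, 3):
    print(
        f"RF={rf}: quorum={write_quorum(rf)}, "
        f"tolerated failures={tolerated_replica_failures(rf)}"
    )
```

Under this assumption, RF = 3 with one replica per zone tolerates one failed zone, while RF = 1 tolerates none.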

We are currently deploying Grafana Mimir on EKS, aligning AWS zones with Mimir zones for multi-zone replication.

Given that our deployment failed to survive an AZ loss, I’m seeking short-term strategies to ensure availability during such scenarios, at least until the affected zone recovers.

One option I'm considering is to remove the lost AZ's ingesters from the ring and temporarily reduce the replication factor to 1. This would lower the write quorum to 1, allowing writes to proceed. However, I'm unsure how this would affect the read path, particularly because recent samples held only in ingester memory may be inconsistent across ingesters until they are flushed to S3. How would the queriers and the read path behave with a replication factor of 1 and only two operational zones?
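The read-path concern can be illustrated with a toy model. It is not Mimir's actual ring or querier logic: it only counts, for hypothetical series placed on RF distinct zones, how many still have at least one in-memory copy in a surviving zone. The zone names and the placement function are assumptions made for illustration.

```python
ZONES = ("zone-a", "zone-b", "zone-c")  # hypothetical zone names


def replica_zones(series_id: int, replication_factor: int) -> list[str]:
    """Place a series on RF distinct zones, starting at a hash-like offset."""
    start = series_id % len(ZONES)
    return [ZONES[(start + i) % len(ZONES)] for i in range(replication_factor)]


def readable_fraction(
    replication_factor: int, failed_zones: set[str], n_series: int = 300
) -> float:
    """Fraction of series with at least one copy outside the failed zones."""
    readable = 0
    for series_id in range(n_series):
        zones = replica_zones(series_id, replication_factor)
        if any(zone not in failed_zones for zone in zones):
            readable += 1
    return readable / n_series


print(readable_fraction(3, {"zone-c"}))  # every series keeps a surviving copy
print(readable_fraction(1, {"zone-c"}))  # series placed only in zone-c are lost
```

In this simplified view, data written with RF = 1 has a single in-memory copy, so any further ingester or zone loss before the data reaches object storage makes those recent samples unreadable.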

This state with replication factor = 1 would only be a temporary measure to survive an AZ loss.

Looking forward to your advice and insights.

Replies: 0 comments

Select a reply

Evaluating the Use of Replication Factor = 1 During Zone Failures in a 3-Zone Mimir Deployment #9411

ilmiye Sep 25, 2024

Replies: 0 comments

ilmiye
Sep 25, 2024