According to the Grafana Mimir documentation:

“Deploying Grafana Mimir clusters to more zones than the configured replication factor does not have a negative impact. Deploying Grafana Mimir clusters to fewer zones than the configured replication factor can cause writes to the replica to be missed or fail completely. If there are fewer than floor(replication factor / 2) zones with failing replicas, reads and writes can withstand zone failures.”
Our current replication factor is 3, which, based on the statement above, suggests we should be able to survive the loss of one Mimir zone. However, in practice, during a recent incident with one of our clusters, we were unable to maintain operations under heavy load following an AZ (Availability Zone) failure.
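To make that expectation concrete, here is a minimal sketch of the arithmetic, assuming the usual majority-quorum rule of floor(replication factor / 2) + 1 successful replicas per write. The function names are illustrative only and are not Mimir APIs. It counts replicas and nothing else, so it says nothing about whether the two surviving zones have enough capacity for the full load.

```python
def write_quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge a write under a majority quorum (assumed rule)."""
    return replication_factor // 2 + 1


def tolerated_replica_failures(replication_factor: int) -> int:
    """Replicas that can fail while a quorum is still reachable."""
    return replication_factor - write_quorum(replication_factor)


if __name__ == "__main__":
    for rf in (1, 3, 5):
        print(
            f"replication_factor={rf} "
            f"quorum={write_quorum(rf)} "
            f"tolerated_failures={tolerated_replica_failures(rf)}"
        )
```

With one replica per zone, a replication factor of 3 gives a quorum of 2 and so tolerates one failed zone, which matches our reading of the documentation.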
We are currently deploying Grafana Mimir on EKS, aligning AWS zones with Mimir zones for multi-zone replication.
Given that our deployment failed to survive an AZ loss, I’m seeking short-term strategies to ensure availability during such scenarios, at least until the affected zone recovers.
One potential solution I’m considering is to remove the ingesters of the lost AZ from the ring and temporarily reduce the replication factor to 1. This would lower the write quorum to 1, allowing writes to proceed. However, I am unsure how this would affect the read path, particularly because recent samples held in ingester memory may be inconsistent across ingesters until they are flushed to S3. How would the queriers and the read path behave with a replication factor of 1 and only two operational zones?
This state with replication factor = 1 would only be a temporary measure to survive an AZ loss.
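My concern with the read path can be illustrated with a toy model. This is a sketch under stated assumptions, not Mimir's ring algorithm: series are placed on consecutive ingesters by a simple modulo, the ingester names are hypothetical, and "unreadable" only means that no in-memory copy of the series is on a reachable ingester.

```python
def owners(series_id: int, ingesters: list[str], replication_factor: int) -> list[str]:
    """Toy placement: pick `replication_factor` consecutive ingesters (NOT Mimir's ring)."""
    start = series_id % len(ingesters)
    return [ingesters[(start + i) % len(ingesters)] for i in range(replication_factor)]


def unreadable_series(
    num_series: int, ingesters: list[str], replication_factor: int, down: set[str]
) -> int:
    """Count series whose every in-memory copy sits on an unavailable ingester."""
    return sum(
        1
        for series_id in range(num_series)
        if all(o in down for o in owners(series_id, ingesters, replication_factor))
    )


if __name__ == "__main__":
    # Hypothetical ingesters in the two zones that are still operational.
    ingesters = ["zone-a-0", "zone-a-1", "zone-b-0", "zone-b-1"]
    down = {"zone-b-1"}  # one additional ingester becomes unavailable
    for rf in (1, 3):
        lost = unreadable_series(100, ingesters, rf, down)
        print(f"replication_factor={rf}: {lost} of 100 series have no reachable copy")
```

In this model, a replication factor of 1 leaves every series with a single in-memory owner, so losing any one further ingester makes its not-yet-flushed series unavailable to queries, whereas with three copies the same failure is absorbed.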
Looking forward to your advice and insights.