Cascading disk fill-up due to WAL file accumulation in multi-node PostgreSQL cluster
Description
We encountered a critical issue in our PostgreSQL cluster (1 master, 1 replica, 1 repo host) where a disk fill-up on the repo host led to a cascading failure of the entire cluster. The problem propagated from the repo host to the master node and then to the replica, causing a system-wide outage.
Steps to Reproduce
Set up a PostgreSQL cluster with 1 master, 1 replica, and 1 repo host
Allow the disk on the repo host to fill up due to storing old backups
Observe that WAL files start accumulating on the master node
Master node's disk fills up, causing it to fail
System switches to the replica node
The same WAL accumulation occurs on the replica, leading to its failure as well
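The accumulation described in the steps above could be detected before the disk fills. Below is a minimal sketch, assuming the standard PostgreSQL layout in which `pg_wal/archive_status` holds one `.ready` marker file per WAL segment still waiting to be archived. The function names and thresholds are hypothetical and not part of Everest or the PG operator.

```python
import shutil
from pathlib import Path


def count_pending_wal(archive_status_dir):
    """Count WAL segments still waiting to be archived (*.ready markers)."""
    return sum(1 for _ in Path(archive_status_dir).glob("*.ready"))


def disk_used_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def wal_backlog_alert(archive_status_dir, max_pending=100, max_used=0.8):
    """Return a list of warning strings; the list is empty when both checks pass.

    max_pending and max_used are illustrative thresholds (assumptions),
    to be tuned per cluster.
    """
    warnings = []
    pending = count_pending_wal(archive_status_dir)
    if pending > max_pending:
        warnings.append(f"{pending} WAL segments pending archive")
    used = disk_used_fraction(archive_status_dir)
    if used > max_used:
        warnings.append(f"disk {used:.0%} full")
    return warnings
```

Run periodically on the master (for example, against `$PGDATA/pg_wal/archive_status`), a check like this would flag a growing archive backlog while there is still free space to act on it.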
Impact
Complete failure of the PostgreSQL cluster
Data unavailability
Potential data loss or corruption
Significant downtime and operational impact
Environment
Everest: 1.2.0
PostgreSQL version: 16.1
Kubernetes version: 1.27.16
Workaround
We did not find a workaround for this problem. The cluster was recreated from backup.
george-pogosyan changed the title from "Host disk filled up due to unchecked growth of usage by repo host" to "Cascading disk fill-up due to WAL file accumulation in multi-node PostgreSQL cluster" on Oct 23, 2024
Thanks for the feedback. It took me a while to discuss this issue with our experts, but it looks like we found an improvement we can make to fix this problem. You can track the progress of this issue via this PG operator ticket: https://perconadev.atlassian.net/browse/K8SPG-685