
Cascading disk fill-up due to WAL file accumulation in multi-node PostgreSQL cluster #781

Open
george-pogosyan opened this issue Oct 23, 2024 · 1 comment


george-pogosyan commented Oct 23, 2024

Title

Cascading disk fill-up due to WAL file accumulation in multi-node PostgreSQL cluster

Description

We encountered a critical issue in our PostgreSQL cluster (1 master, 1 replica, 1 repo host) where a disk fill-up on the repo host led to a cascading failure of the entire cluster. The problem propagated from the repo host to the master node and then to the replica, causing a system-wide outage.
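
For context on the mechanism: PostgreSQL keeps completed WAL segments in pg_wal until the archive command (pgBackRest pushing to the repo host in this setup) reports success, so once the repo host's disk is full, pg_wal on the primary grows without bound, and after failover the same happens on the promoted replica. The following is a minimal sketch, not part of the original report, of how the failing archiving could be spotted early by polling pg_stat_archiver; the connection string is a placeholder and must be adapted to the actual cluster.

```python
# Sketch only (not from the report): poll pg_stat_archiver to detect failing
# WAL archiving. DSN, user, and database are placeholders.
import time

import psycopg2  # assumed to be available where this runs

DSN = "host=cluster-name-pgbouncer.everest.svc port=5432 dbname=postgres user=postgres"


def report_archiver_state(conn):
    """Print whether WAL archiving is succeeding or has failed."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT archived_count, failed_count, last_failed_wal, last_failed_time"
            " FROM pg_stat_archiver"
        )
        archived, failed, last_failed_wal, last_failed_time = cur.fetchone()
    if failed:
        print(f"WAL archiving has failed {failed} times;"
              f" last failed segment {last_failed_wal} at {last_failed_time}")
    else:
        print(f"WAL archiving healthy: {archived} segments archived")


if __name__ == "__main__":
    conn = psycopg2.connect(DSN)
    try:
        while True:
            report_archiver_state(conn)
            time.sleep(60)
    finally:
        conn.close()
```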

Steps to Reproduce

  1. Set up a PostgreSQL cluster with 1 master, 1 replica, and 1 repo host
  2. Allow the disk on the repo host to fill up due to storing old backups
  3. Observe that WAL files start accumulating on the master node (see the monitoring sketch after this list)
  4. Master node's disk fills up, causing it to fail
  5. System switches to the replica node
  6. The same WAL accumulation occurs on the replica, leading to its failure as well
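
A minimal monitoring sketch for step 3, assuming kubectl access to the cluster; the label selector, container name, and WAL path are guesses based on a Crunchy-style layout and must be verified against the actual Everest-managed pods:

```python
# Sketch only (not from the report): report pg_wal usage on the PostgreSQL pods.
# Namespace, label selector, container name, and WAL path are assumptions; the
# selector may also need narrowing so it matches only the instance pods.
import subprocess

NAMESPACE = "everest"
LABEL = "postgres-operator.crunchydata.com/cluster=cluster-name"  # assumed label
CONTAINER = "database"          # assumed container name in the instance pods
WAL_PATH = "/pgdata/pg16_wal"   # assumed WAL location for PostgreSQL 16
THRESHOLD_MB = 2048             # warn well before the 10Gi volume fills


def instance_pods():
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", NAMESPACE, "-l", LABEL,
         "-o", "jsonpath={.items[*].metadata.name}"],
        check=True, capture_output=True, text=True)
    return out.stdout.split()


def wal_size_mb(pod):
    out = subprocess.run(
        ["kubectl", "exec", "-n", NAMESPACE, pod, "-c", CONTAINER,
         "--", "du", "-sm", WAL_PATH],
        check=True, capture_output=True, text=True)
    return int(out.stdout.split()[0])


if __name__ == "__main__":
    for pod in instance_pods():
        size = wal_size_mb(pod)
        state = "WARNING" if size > THRESHOLD_MB else "ok"
        print(f"{pod}: pg_wal ~{size} MiB [{state}]")
```

Run periodically, a check like this would flag runaway pg_wal growth well before the 10Gi data volume fills.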

Impact

  • Complete failure of the PostgreSQL cluster
  • Data unavailability
  • Potential data loss or corruption
  • Significant downtime and operational impact

Environment

  • Everest: 1.2.0
  • PostgreSQL version: 16.1
  • PostgreSQL operator: 2.3.1
  • Kubernetes version: 1.27.16

Workaround

We did not find a workaround for this problem; the cluster had to be recreated from a backup.

Attachments

[image attachment]

apiVersion: everest.percona.com/v1alpha1
kind: DatabaseCluster
metadata:
  creationTimestamp: '2024-10-09T10:54:47Z'
  finalizers:
    - everest.percona.com/upstream-cluster-cleanup
    - foregroundDeletion
  generation: 8
  labels:
    backupStorage-yc-s3: used
    clusterName: cluster-name
    monitoringConfigName: pmm
  managedFields:
    - apiVersion: everest.percona.com/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers:
            .: {}
            v:"everest.percona.com/upstream-cluster-cleanup": {}
            v:"foregroundDeletion": {}
          f:labels:
            .: {}
            f:backupStorage-yc-s3: {}
            f:clusterName: {}
            f:monitoringConfigName: {}
        f:spec:
          f:engine:
            f:userSecretsName: {}
          f:proxy:
            f:resources:
              .: {}
              f:cpu: {}
              f:memory: {}
      manager: manager
      operation: Update
      time: '2024-10-09T14:18:51Z'
    - apiVersion: everest.percona.com/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:backup:
            .: {}
            f:enabled: {}
            f:pitr:
              .: {}
              f:backupStorageName: {}
              f:enabled: {}
            f:schedules: {}
          f:engine:
            .: {}
            f:config: {}
            f:replicas: {}
            f:resources:
              .: {}
              f:cpu: {}
              f:memory: {}
            f:storage:
              .: {}
              f:class: {}
              f:size: {}
            f:type: {}
            f:version: {}
          f:monitoring:
            .: {}
            f:monitoringConfigName: {}
          f:proxy:
            .: {}
            f:expose:
              .: {}
              f:type: {}
            f:replicas: {}
      manager: Mozilla
      operation: Update
      time: '2024-10-17T10:09:08Z'
    - apiVersion: everest.percona.com/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:crVersion: {}
          f:details: {}
          f:hostname: {}
          f:observedGeneration: {}
          f:port: {}
          f:ready: {}
          f:size: {}
          f:status: {}
      manager: manager
      operation: Update
      subresource: status
      time: '2024-10-20T13:25:41Z'
  name: cluster-name
  namespace: everest
  resourceVersion: '468620532'
  uid: ef3850cf-722c-43a8-a708-42001d3f4287
spec:
  backup:
    enabled: true
    pitr:
      backupStorageName: yc-s3
      enabled: true
    schedules:
      - backupStorageName: yc-s3
        enabled: true
        name: backup-gxe
        retentionCopies: 7
        schedule: 0 10 * * *
  engine:
    config: ''
    replicas: 2
    resources:
      cpu: '0.6'
      memory: 1G
    storage:
      class: longhorn
      size: 10Gi
    type: postgresql
    userSecretsName: everest-secrets-cluster-name
    version: '16.1'
  monitoring:
    monitoringConfigName: pmm
  proxy:
    expose:
      type: internal
    replicas: 2
    resources:
      cpu: '0'
      memory: '0'
status:
  crVersion: 2.3.1
  details: |
    postgres:
      size: 2
      ready: 2
      instancesets:
      - name: instance1
        size: 2
        ready: 2
    pgbouncer:
      size: 2
      ready: 2
    state: ready
    host: cluster-name-pgbouncer.everest.svc
  hostname: cluster-name-pgbouncer.everest.svc
  observedGeneration: 8
  port: 5432
  ready: 4
  size: 4
  status: ready

@george-pogosyan changed the title from "Host disk filled up due to unchecked growth of usage by repo host" to "Cascading disk fill-up due to WAL file accumulation in multi-node PostgreSQL cluster" on Oct 23, 2024
@PeterSzcz

Hey @george-pogosyan

Thanks for the feedback. It took me a while to discuss this issue with our experts, but it looks like we found an improvement we can make to fix this problem. You can track the progress of this issue via this PG operator ticket: https://perconadev.atlassian.net/browse/K8SPG-685
