
Cascading disk fill-up due to WAL file accumulation in multi-node PostgreSQL cluster #781

Open
george-pogosyan opened this issue Oct 23, 2024 · 1 comment


george-pogosyan commented Oct 23, 2024

Title

Cascading disk fill-up due to WAL file accumulation in multi-node PostgreSQL cluster

Description

We encountered a critical issue in our PostgreSQL cluster (1 master, 1 replica, 1 repo host) where a disk fill-up on the repo host led to a cascading failure of the entire cluster. The problem propagated from the repo host to the master node and then to the replica, causing a system-wide outage.
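
For context on the mechanism: PostgreSQL keeps completed WAL segments in pg_wal until the archive command (pgBackRest pushing to the repo host in this setup) reports success, so once the repo host's disk is full, pg_wal on the primary grows without bound, and after failover the same happens on the promoted replica. The following is a minimal sketch, not part of the original report, of how the failing archiving could be spotted early by polling pg_stat_archiver; the connection string is a placeholder and must be adapted to the actual cluster.

```python
# Sketch only (not from the report): poll pg_stat_archiver to detect failing
# WAL archiving. DSN, user, and database are placeholders.
import time

import psycopg2  # assumed to be available where this runs

DSN = "host=cluster-name-pgbouncer.everest.svc port=5432 dbname=postgres user=postgres"


def report_archiver_state(conn):
    """Print whether WAL archiving is succeeding or has failed."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT archived_count, failed_count, last_failed_wal, last_failed_time"
            " FROM pg_stat_archiver"
        )
        archived, failed, last_failed_wal, last_failed_time = cur.fetchone()
    if failed:
        print(f"WAL archiving has failed {failed} times;"
              f" last failed segment {last_failed_wal} at {last_failed_time}")
    else:
        print(f"WAL archiving healthy: {archived} segments archived")


if __name__ == "__main__":
    conn = psycopg2.connect(DSN)
    try:
        while True:
            report_archiver_state(conn)
            time.sleep(60)
    finally:
        conn.close()
```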

Steps to Reproduce

  1. Set up a PostgreSQL cluster with 1 master, 1 replica, and 1 repo host
  2. Allow the disk on the repo host to fill up due to storing old backups
  3. Observe that WAL files start accumulating on the master node (see the monitoring sketch after this list)
  4. Master node's disk fills up, causing it to fail
  5. System switches to the replica node
  6. The same WAL accumulation occurs on the replica, leading to its failure as well
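
A minimal monitoring sketch for step 3, assuming kubectl access to the cluster; the label selector, container name, and WAL path are guesses based on a Crunchy-style layout and must be verified against the actual Everest-managed pods:

```python
# Sketch only (not from the report): report pg_wal usage on the PostgreSQL pods.
# Namespace, label selector, container name, and WAL path are assumptions; the
# selector may also need narrowing so it matches only the instance pods.
import subprocess

NAMESPACE = "everest"
LABEL = "postgres-operator.crunchydata.com/cluster=cluster-name"  # assumed label
CONTAINER = "database"          # assumed container name in the instance pods
WAL_PATH = "/pgdata/pg16_wal"   # assumed WAL location for PostgreSQL 16
THRESHOLD_MB = 2048             # warn well before the 10Gi volume fills


def instance_pods():
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", NAMESPACE, "-l", LABEL,
         "-o", "jsonpath={.items[*].metadata.name}"],
        check=True, capture_output=True, text=True)
    return out.stdout.split()


def wal_size_mb(pod):
    out = subprocess.run(
        ["kubectl", "exec", "-n", NAMESPACE, pod, "-c", CONTAINER,
         "--", "du", "-sm", WAL_PATH],
        check=True, capture_output=True, text=True)
    return int(out.stdout.split()[0])


if __name__ == "__main__":
    for pod in instance_pods():
        size = wal_size_mb(pod)
        state = "WARNING" if size > THRESHOLD_MB else "ok"
        print(f"{pod}: pg_wal ~{size} MiB [{state}]")
```

Run periodically, a check like this would flag runaway pg_wal growth well before the 10Gi data volume fills.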

Impact

  • Complete failure of the PostgreSQL cluster
  • Data unavailability
  • Potential data loss or corruption
  • Significant downtime and operational impact

Environment

  • Everest: 1.2.0
  • PostgreSQL version: 16.1
  • PostgreSQL operator: 2.3.1
  • Kubernetes version: 1.27.16

Workaround

We did not find a workaround for this problem; the cluster had to be recreated from a backup.

Attachments

[image attachment]

apiVersion: everest.percona.com/v1alpha1
kind: DatabaseCluster
metadata:
  creationTimestamp: '2024-10-09T10:54:47Z'
  finalizers:
    - everest.percona.com/upstream-cluster-cleanup
    - foregroundDeletion
  generation: 8
  labels:
    backupStorage-yc-s3: used
    clusterName: cluster-name
    monitoringConfigName: pmm
  managedFields:
    - apiVersion: everest.percona.com/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers:
            .: {}
            v:"everest.percona.com/upstream-cluster-cleanup": {}
            v:"foregroundDeletion": {}
          f:labels:
            .: {}
            f:backupStorage-yc-s3: {}
            f:clusterName: {}
            f:monitoringConfigName: {}
        f:spec:
          f:engine:
            f:userSecretsName: {}
          f:proxy:
            f:resources:
              .: {}
              f:cpu: {}
              f:memory: {}
      manager: manager
      operation: Update
      time: '2024-10-09T14:18:51Z'
    - apiVersion: everest.percona.com/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:backup:
            .: {}
            f:enabled: {}
            f:pitr:
              .: {}
              f:backupStorageName: {}
              f:enabled: {}
            f:schedules: {}
          f:engine:
            .: {}
            f:config: {}
            f:replicas: {}
            f:resources:
              .: {}
              f:cpu: {}
              f:memory: {}
            f:storage:
              .: {}
              f:class: {}
              f:size: {}
            f:type: {}
            f:version: {}
          f:monitoring:
            .: {}
            f:monitoringConfigName: {}
          f:proxy:
            .: {}
            f:expose:
              .: {}
              f:type: {}
            f:replicas: {}
      manager: Mozilla
      operation: Update
      time: '2024-10-17T10:09:08Z'
    - apiVersion: everest.percona.com/v1alpha1
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:crVersion: {}
          f:details: {}
          f:hostname: {}
          f:observedGeneration: {}
          f:port: {}
          f:ready: {}
          f:size: {}
          f:status: {}
      manager: manager
      operation: Update
      subresource: status
      time: '2024-10-20T13:25:41Z'
  name: cluster-name
  namespace: everest
  resourceVersion: '468620532'
  uid: ef3850cf-722c-43a8-a708-42001d3f4287
spec:
  backup:
    enabled: true
    pitr:
      backupStorageName: yc-s3
      enabled: true
    schedules:
      - backupStorageName: yc-s3
        enabled: true
        name: backup-gxe
        retentionCopies: 7
        schedule: 0 10 * * *
  engine:
    config: ''
    replicas: 2
    resources:
      cpu: '0.6'
      memory: 1G
    storage:
      class: longhorn
      size: 10Gi
    type: postgresql
    userSecretsName: everest-secrets-cluster-name
    version: '16.1'
  monitoring:
    monitoringConfigName: pmm
  proxy:
    expose:
      type: internal
    replicas: 2
    resources:
      cpu: '0'
      memory: '0'
status:
  crVersion: 2.3.1
  details: |
    postgres:
      size: 2
      ready: 2
      instancesets:
      - name: instance1
        size: 2
        ready: 2
    pgbouncer:
      size: 2
      ready: 2
    state: ready
    host: cluster-name-pgbouncer.everest.svc
  hostname: cluster-name-pgbouncer.everest.svc
  observedGeneration: 8
  port: 5432
  ready: 4
  size: 4
  status: ready

@george-pogosyan changed the title from "Host disk filled up due to unchecked growth of usage by repo host" to "Cascading disk fill-up due to WAL file accumulation in multi-node PostgreSQL cluster" on Oct 23, 2024
@PeterSzcz

Hey @george-pogosyan

Thanks for the feedback. It took me a while to discuss this issue with our experts, but it looks like we found an improvement we can make to fix this problem. You can track the progress of this issue via this PG operator ticket: https://perconadev.atlassian.net/browse/K8SPG-685
