Mirrored CQv1 in 3.13.1 runs into an exception in rabbit_msg_store:remove_message/3 #11236
-
Describe the bug

We recently upgraded from 3.10 to 3.13.1 and are now experiencing semi-frequent queue crashes (3 occurrences in about a week). As far as we can tell, nothing out of the ordinary happens during the crashes. A rolling restart of the cluster nodes does fix it. We will be upgrading to 3.13.2, but I couldn't see any fix there that matches this issue. We are actively working on moving to quorum queues, but realistically that is a few weeks off. The first error message is:
Logs: downloaded-logs-20240514-214643.zip
Definitions: rabbit_rabbitmq-nsb-server-1.rabbitmq-nsb-nodes.tradera-production_2024-5-14.zip

Reproduction steps

Unknown

Expected behavior

No queue crashes

Additional context

rabbitmq-diagnostics report: report.txt

Setup:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq-nsb
  namespace: tradera-production
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
              - rabbitmq-nsb
          topologyKey: kubernetes.io/hostname
        weight: 100
  image: docker.io/bitnami/rabbitmq:3.13.1-debian-12-r0
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers: []
            priorityClassName: high-priority
            topologySpreadConstraints:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/name: rabbitmq-nsb
              maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
  persistence:
    storage: 128Gi
    storageClassName: pd-ssd
  rabbitmq:
    additionalConfig: |
      collect_statistics_interval = 10000
      # If a consumer does not ack its delivery within the timeout (30 minutes by default),
      # its channel will be closed with a PRECONDITION_FAILED channel exception.
      ## https://www.rabbitmq.com/consumers.html#acknowledgement-timeout
      ### We have at least one job that takes 1 hour, so a 2-hour timeout gives leeway
      consumer_timeout = 7200000
      # Fraction of total memory used before publishing is blocked
      vm_memory_high_watermark.relative = 0.6
      # Fraction of vm_memory_high_watermark used before RabbitMQ aggressively starts paging to disk
      # 0.5 * 0.6 = 30% of memory
      # RabbitMQ production recommendation uses 0.99 https://github.com/rabbitmq/cluster-operator/blob/3f746ecd32d1103c1153b8ac34a4771bfea146f7/docs/examples/production-ready/rabbitmq.yaml
      vm_memory_high_watermark_paging_ratio = 0.5
      # Free disk space required before publishing is blocked, expressed as a fraction of total memory
      # E.g. 10Gi memory * 1.5 = 15Gi of free disk space required
      # (these thresholds are worked through for our 20Gi limit in the sketch below the stream status table)
      disk_free_limit.relative = 1.5
      log.console = true
      log.console.level = info
      log.console.formatter = json
      log.file = false
    additionalPlugins:
    - rabbitmq_shovel
    - rabbitmq_shovel_management
  replicas: 3
  resources:
    limits:
      memory: 20Gi
    requests:
      cpu: "3"
      memory: 20Gi
  service:
    annotations:
      cloud.google.com/backend-config: '{"ports": { "management": "iap" }}'
      external-dns.alpha.kubernetes.io/hostname: rabbitmq-nsb.tradera.service.,rabbitmq-nsb-new.tradera.service.
      external-dns.alpha.kubernetes.io/ttl: "300"
      networking.gke.io/load-balancer-type: Internal
      prometheus.io/path: /metrics/per-object
      prometheus.io/port: "15692"
      prometheus.io/scrape: "true"
      service.kubernetes.io/topology-mode: auto
    type: LoadBalancer

After a migration we have gotten old nodes 'stuck' in streams. We have been running with this warning for a long time, so I believe it is unrelated. We have tried to fix it but have been unable to; I think we had a discussion here where we got help trying to remove them, but it didn't work (I cannot find the discussion/issue now), and since it wasn't causing issues it has unfortunately been de-prioritized. This is what we see in the logs for it:
This is what we see when we query RabbitMQ for the stream; I do, however, think it is unrelated:

❯ kubectl exec sts/rabbitmq-nsb-server -- rabbitmq-streams stream_status nsb.v2.verify-stream-flag-enabled
Defaulted container "rabbitmq" out of: rabbitmq, setup-container (init)
Status of stream nsb.v2.verify-stream-flag-enabled on node rabbit@rabbitmq-nsb-server-2.rabbitmq-nsb-nodes.tradera-production ...
┌─────────┬────────────────────────────────────────────────────────────────────┬───────┬────────┬──────────────────┬──────────────┬─────────┬──────────┐
│ role │ node │ epoch │ offset │ committed_offset │ first_offset │ readers │ segments │
├─────────┼────────────────────────────────────────────────────────────────────┼───────┼────────┼──────────────────┼──────────────┼─────────┼──────────┤
│ replica │ rabbit@rabbitmq-nsb-server-0.rabbitmq-nsb-nodes.tradera-production │ 65 │ -1 │ -1 │ 0 │ 0 │ 1 │
├─────────┼────────────────────────────────────────────────────────────────────┼───────┼────────┼──────────────────┼──────────────┼─────────┼──────────┤
│ writer │ rabbit@rabbitmq-nsb-server-1.rabbitmq-nsb-nodes.tradera-production │ 65 │ -1 │ -1 │ 0 │ 2 │ 0 │
├─────────┼────────────────────────────────────────────────────────────────────┼───────┼────────┼──────────────────┼──────────────┼─────────┼──────────┤
│ replica │ rabbit@rabbitmq-nsb-server-2.rabbitmq-nsb-nodes.tradera-production │ 65 │ -1 │ -1 │ 0 │ 0 │ 1 │
└─────────┴────────────────────────────────────────────────────────────────────┴───────┴────────┴──────────────────┴──────────────┴─────────┴──────────┘
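For reference, here is what the relative memory/disk settings in additionalConfig above work out to in practice. This is an illustrative sketch only, assuming the 20Gi container memory limit from the spec; the variable names are not RabbitMQ identifiers:

```python
# Quick sketch of the effective thresholds implied by the config above,
# assuming the 20Gi memory limit from the RabbitmqCluster spec.
GIB = 1024 ** 3
memory_limit = 20 * GIB

high_watermark = 0.6 * memory_limit        # vm_memory_high_watermark.relative
paging_threshold = 0.5 * high_watermark    # vm_memory_high_watermark_paging_ratio
disk_free_limit = 1.5 * memory_limit       # disk_free_limit.relative
consumer_timeout_ms = 2 * 60 * 60 * 1000   # 2 hours expressed in milliseconds

print(f"publishing blocked above : {high_watermark / GIB:.0f} GiB of memory")      # 12 GiB
print(f"paging to disk starts at : {paging_threshold / GIB:.0f} GiB (30% of RAM)")  # 6 GiB
print(f"free disk required       : {disk_free_limit / GIB:.0f} GiB")                # 30 GiB
print(f"consumer_timeout         : {consumer_timeout_ms} ms")                       # 7200000
```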
-
Our team will not spend any more time on classic mirrored queues. They will be removed completely in 4.0 later this year, after being very visibly deprecated for several years.
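For anyone planning the same migration as the reporter, here is a minimal sketch of declaring a quorum queue in place of a mirrored classic one. It assumes the pika client and a broker reachable on localhost with default credentials; the queue name is purely illustrative:

```python
import pika

# Connect with default credentials on localhost (adjust for a real cluster).
conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# Quorum queues must be durable, and the type is fixed at declaration time
# via the x-queue-type argument.
ch.queue_declare(
    queue="nsb.v2.example",  # hypothetical queue name
    durable=True,
    arguments={"x-queue-type": "quorum"},
)
conn.close()
```

Note that an existing queue's type cannot be changed in place; migrating typically means declaring a new quorum queue and moving traffic over, for example with the shovel plugin already enabled in the spec above.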
-
This looks similar to #11111 and #10902, addressed in […]. If this exception can be reproduced with a non-mirrored classic queue (v2, since v1 was removed in […]) …
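One way such a repro could be set up, as a sketch: declare a classic (non-mirrored) queue pinned to v2 storage and run the same workload against it. This assumes pika, a broker on localhost with default credentials, and that no ha-* policy matches the queue; the queue name is hypothetical:

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# A non-mirrored classic queue, explicitly pinned to the v2 message store;
# ensure no ha-* policy matches this name so it stays unmirrored.
ch.queue_declare(
    queue="repro.cqv2.test",  # hypothetical name
    durable=True,
    arguments={
        "x-queue-type": "classic",
        "x-queue-version": 2,
    },
)
conn.close()
```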
-
@lhoguin, who worked on #11111 and #10902, confirms that this is a different manifestation of the same root cause.