Mirrored CQv1 in 3.13.1 runs into an exception in rabbit_msg_store:remove_message/3 #11236
-
Describe the bug

We recently upgraded from 3.10 to 3.13.1 and are now experiencing semi-frequent queue crashes (3 occurrences in about a week). As far as we can tell, nothing out of the ordinary happens during the crashes. A rolling restart of the cluster nodes does fix it. We will be upgrading to 3.13.2, but I couldn't see any fix there that matches this issue. We are actively working on moving to quorum queues, but realistically that is a few weeks off. The first error message is:
Logs: downloaded-logs-20240514-214643.zip
Definitions: rabbit_rabbitmq-nsb-server-1.rabbitmq-nsb-nodes.tradera-production_2024-5-14.zip

Reproduction steps

Unknown

Expected behavior

No queue crashes

Additional context

rabbitmq-diagnostics report: report.txt

Setup:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq-nsb
  namespace: tradera-production
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
              - rabbitmq-nsb
          topologyKey: kubernetes.io/hostname
        weight: 100
  image: docker.io/bitnami/rabbitmq:3.13.1-debian-12-r0
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers: []
            priorityClassName: high-priority
            topologySpreadConstraints:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/name: rabbitmq-nsb
              maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
  persistence:
    storage: 128Gi
    storageClassName: pd-ssd
  rabbitmq:
    additionalConfig: |
      collect_statistics_interval = 10000
      # If a consumer does not ack its delivery within the timeout (30 minutes by default),
      # its channel will be closed with a PRECONDITION_FAILED channel exception.
      ## https://www.rabbitmq.com/consumers.html#acknowledgement-timeout
      ### We have at least one job that takes 1 hour, so a 2-hour timeout gives leeway
      consumer_timeout = 7200000
      # Fraction of total memory used before publishing is blocked
      vm_memory_high_watermark.relative = 0.6
      # Fraction of vm_memory_high_watermark used before RabbitMQ aggressively starts paging to disk
      # 0.5 * 0.6 = 30% of memory
      # RabbitMQ production recommendation uses 0.99 https://github.com/rabbitmq/cluster-operator/blob/3f746ecd32d1103c1153b8ac34a4771bfea146f7/docs/examples/production-ready/rabbitmq.yaml
      vm_memory_high_watermark_paging_ratio = 0.5
      # Free disk space required before publishing is blocked, expressed as a fraction of total memory
      # E.g. 10Gi memory * 1.5 = 15Gi of free disk space required
      # (these thresholds are worked through for our 20Gi limit in the sketch below the stream status table)
      disk_free_limit.relative = 1.5
      log.console = true
      log.console.level = info
      log.console.formatter = json
      log.file = false
    additionalPlugins:
    - rabbitmq_shovel
    - rabbitmq_shovel_management
  replicas: 3
  resources:
    limits:
      memory: 20Gi
    requests:
      cpu: "3"
      memory: 20Gi
  service:
    annotations:
      cloud.google.com/backend-config: '{"ports": { "management": "iap" }}'
      external-dns.alpha.kubernetes.io/hostname: rabbitmq-nsb.tradera.service.,rabbitmq-nsb-new.tradera.service.
      external-dns.alpha.kubernetes.io/ttl: "300"
      networking.gke.io/load-balancer-type: Internal
      prometheus.io/path: /metrics/per-object
      prometheus.io/port: "15692"
      prometheus.io/scrape: "true"
      service.kubernetes.io/topology-mode: auto
    type: LoadBalancer

After a migration we have gotten old nodes 'stuck' in streams. We have been running with this warning for a long time, so I believe it is unrelated. We have tried to fix it but have been unable to; I think we had a discussion here where we got help trying to remove them, but it didn't work (I cannot find the discussion/issue now), and since it wasn't causing issues it has unfortunately been de-prioritized. This is what we see in the logs for it:
This is what we see when we query RabbitMQ for the stream; I do, however, think it is unrelated:

❯ kubectl exec sts/rabbitmq-nsb-server -- rabbitmq-streams stream_status nsb.v2.verify-stream-flag-enabled
Defaulted container "rabbitmq" out of: rabbitmq, setup-container (init)
Status of stream nsb.v2.verify-stream-flag-enabled on node rabbit@rabbitmq-nsb-server-2.rabbitmq-nsb-nodes.tradera-production ...
┌─────────┬────────────────────────────────────────────────────────────────────┬───────┬────────┬──────────────────┬──────────────┬─────────┬──────────┐
│ role │ node │ epoch │ offset │ committed_offset │ first_offset │ readers │ segments │
├─────────┼────────────────────────────────────────────────────────────────────┼───────┼────────┼──────────────────┼──────────────┼─────────┼──────────┤
│ replica │ rabbit@rabbitmq-nsb-server-0.rabbitmq-nsb-nodes.tradera-production │ 65 │ -1 │ -1 │ 0 │ 0 │ 1 │
├─────────┼────────────────────────────────────────────────────────────────────┼───────┼────────┼──────────────────┼──────────────┼─────────┼──────────┤
│ writer │ rabbit@rabbitmq-nsb-server-1.rabbitmq-nsb-nodes.tradera-production │ 65 │ -1 │ -1 │ 0 │ 2 │ 0 │
├─────────┼────────────────────────────────────────────────────────────────────┼───────┼────────┼──────────────────┼──────────────┼─────────┼──────────┤
│ replica │ rabbit@rabbitmq-nsb-server-2.rabbitmq-nsb-nodes.tradera-production │ 65 │ -1 │ -1 │ 0 │ 0 │ 1 │
└─────────┴────────────────────────────────────────────────────────────────────┴───────┴────────┴──────────────────┴──────────────┴─────────┴──────────┘
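For reference, here is what the relative memory/disk settings in additionalConfig above work out to in practice. This is an illustrative sketch only, assuming the 20Gi container memory limit from the spec; the variable names are not RabbitMQ identifiers:

```python
# Quick sketch of the effective thresholds implied by the config above,
# assuming the 20Gi memory limit from the RabbitmqCluster spec.
GIB = 1024 ** 3
memory_limit = 20 * GIB

high_watermark = 0.6 * memory_limit        # vm_memory_high_watermark.relative
paging_threshold = 0.5 * high_watermark    # vm_memory_high_watermark_paging_ratio
disk_free_limit = 1.5 * memory_limit       # disk_free_limit.relative
consumer_timeout_ms = 2 * 60 * 60 * 1000   # 2 hours expressed in milliseconds

print(f"publishing blocked above : {high_watermark / GIB:.0f} GiB of memory")      # 12 GiB
print(f"paging to disk starts at : {paging_threshold / GIB:.0f} GiB (30% of RAM)")  # 6 GiB
print(f"free disk required       : {disk_free_limit / GIB:.0f} GiB")                # 30 GiB
print(f"consumer_timeout         : {consumer_timeout_ms} ms")                       # 7200000
```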
-
Our team will not spend any more time on classic mirrored queues. They will be removed completely in 4.0 later this year, after being very visibly deprecated for several years.
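For anyone planning the same migration as the reporter, here is a minimal sketch of declaring a quorum queue in place of a mirrored classic one. It assumes the pika client and a broker reachable on localhost with default credentials; the queue name is purely illustrative:

```python
import pika

# Connect with default credentials on localhost (adjust for a real cluster).
conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# Quorum queues must be durable, and the type is fixed at declaration time
# via the x-queue-type argument.
ch.queue_declare(
    queue="nsb.v2.example",  # hypothetical queue name
    durable=True,
    arguments={"x-queue-type": "quorum"},
)
conn.close()
```

Note that an existing queue's type cannot be changed in place; migrating typically means declaring a new quorum queue and moving traffic over, for example with the shovel plugin already enabled in the spec above.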
-
This looks similar to #11111 and #10902, addressed in […]. If this exception can be reproduced with a non-mirrored classic queue (v2, since v1 was removed in […]) …
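One way such a repro could be set up, as a sketch: declare a classic (non-mirrored) queue pinned to v2 storage and run the same workload against it. This assumes pika, a broker on localhost with default credentials, and that no ha-* policy matches the queue; the queue name is hypothetical:

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# A non-mirrored classic queue, explicitly pinned to the v2 message store;
# ensure no ha-* policy matches this name so it stays unmirrored.
ch.queue_declare(
    queue="repro.cqv2.test",  # hypothetical name
    durable=True,
    arguments={
        "x-queue-type": "classic",
        "x-queue-version": 2,
    },
)
conn.close()
```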
-
@lhoguin, who worked on #11111 and #10902, confirms that this is a different manifestation of the same root cause.