Replies: 5 comments 10 replies
-
@skry-dev thanks for using RabbitMQ and for providing a reasonably detailed set of information. Please note how I edited your question to make it easier to read. Another option is to attach files to your comments on GitHub. In this specific case, we know nothing about the environment in which you are running RabbitMQ.
-
Your PerfTest workload will result in attempting to send 150,000,000 bytes of data per second (500 * 100 * 3000). This is equivalent to 1.2 gigabits per second. Your environment probably can't keep up. My recommendation would be to scale back some combination of message rate, size and publisher count until you see acceptable latencies. @mkuratczyk will have insights here as well.
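The arithmetic above can be checked quickly in a shell. The numbers are the ones quoted in this comment (100 publishers, each sending 500 messages/s of 3,000 bytes); adjust them to whatever your actual PerfTest flags were:

```shell
# Throughput implied by the workload: publishers x rate x message size.
publishers=100
rate=500          # messages per second per publisher
size=3000         # bytes per message

bytes_per_sec=$(( publishers * rate * size ))
bits_per_sec=$(( bytes_per_sec * 8 ))

echo "${bytes_per_sec} bytes/s"   # 150000000 bytes/s
echo "${bits_per_sec} bits/s"     # 1200000000 bits/s, i.e. 1.2 gigabits/s
```

Note that this is application payload only; AMQP framing, quorum queue replication to the other two nodes, and disk writes multiply the real bandwidth and I/O cost.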
-
With this much load you should run your producer and consumer PerfTest instances on separate hosts. Since they are running on the same host they may be competing for resources.
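A sketch of how the split could look with two PerfTest instances. The URIs, queue name, and rates here are placeholders, and this assumes PerfTest's `--producers 0` / `--consumers 0` / `--predeclared` options for running publish-only and consume-only instances; check `bin/runjava com.rabbitmq.perf.PerfTest --help` on your version:

```shell
# Host A: publish-only instance (no consumers here).
bin/runjava com.rabbitmq.perf.PerfTest \
  --uri amqp://rabbit-node-1.example.com \
  --producers 100 --consumers 0 \
  --size 3000 --rate 500 \
  --queue qq.test --quorum-queue

# Host B: consume-only instance; the queue already exists,
# so tell PerfTest not to redeclare it.
bin/runjava com.rabbitmq.perf.PerfTest \
  --uri amqp://rabbit-node-1.example.com \
  --producers 0 --consumers 100 \
  --predeclared --queue qq.test
```

This way publisher-side CPU, consumer-side CPU, and the broker are not all contending for the same cores and NIC.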
-
RabbitMQ is completely overloaded in this scenario. First of all, you are not using publisher confirms, so you keep publishing even though RabbitMQ can't keep up. This leads to Erlang process message queues growing longer and longer, which is why you see high latency. I'd say the short answer is: you are demanding a lot from RabbitMQ and there's no simple change that'd guarantee this workload can be handled well. You can use observer to confirm the above, and you can record a flamegraph to see where CPU time is spent, but there likely won't be a single obvious thing that we can just improve. Some things you could try:
But as I said - I can't guarantee that any of this will be sufficient. You can also reconsider whether you really need a workload like this - perhaps you can use a stream, perhaps you can use classic non-mirrored queues, perhaps you can split the workload between multiple RabbitMQ clusters, etc. In the future, we are thinking about having multiple Ra subsystems with QQs assigned (probably randomly) to one of them. This way there wouldn't be a single Erlang process which needs to handle everything - it'd be split between a few such processes, so we could use multiple CPUs and use the hard disk better. And if a single hard disk would still not be sufficient (fsync latency is the main bottleneck), you'd be able to mount multiple disks, one per Ra system, so everything would parallelize better. I'm fairly sure we will make this change at some point, but no guarantees about when.
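On the publisher confirms point above: PerfTest can enable confirms and cap the number of unconfirmed in-flight messages, so publishers back off when the broker falls behind instead of flooding it. A sketch, with an illustrative URI, queue name, and window size (the `--confirm` value sets the max outstanding unconfirmed publishes; verify the flag on your PerfTest version):

```shell
# Publishers block once 50 messages are awaiting confirmation,
# which turns broker backpressure into lower publish rates
# instead of unbounded queue growth and latency.
bin/runjava com.rabbitmq.perf.PerfTest \
  --uri amqp://rabbit-node-1.example.com \
  --producers 100 --consumers 100 \
  --size 3000 --rate 500 \
  --confirm 50 \
  --queue qq.test --quorum-queue
```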
-
Your receiving rate is lower than your sending rate, so messages have to queue in the, erm, queues. That is why you see higher latency. Try a workload with fewer consumers, set a QoS (prefetch) value for each consumer in the region of 20-50, and experiment from there.
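The suggestion above could be tried with something like the following. The consumer count and prefetch value are starting points to experiment from, and the URI/queue name are placeholders (this assumes PerfTest's `--qos` flag, which sets per-consumer prefetch):

```shell
# Fewer consumers, each with a prefetch window of ~30 unacked messages,
# so the broker doesn't push more deliveries than consumers can process.
bin/runjava com.rabbitmq.perf.PerfTest \
  --uri amqp://rabbit-node-1.example.com \
  --producers 100 --consumers 20 \
  --qos 30 \
  --size 3000 --rate 500 \
  --queue qq.test --quorum-queue
```

Then vary `--consumers` and `--qos` while watching the end-to-end latency PerfTest reports.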
-
Hi,
I ran a performance test and saw high latencies in the message flow.
End-to-end latency in the test goes up to 15 seconds. It shouldn't take more than a few seconds for my use case.
I checked resources and the network but didn't find any anomaly.
How can I improve message latency?
My test command:
My cluster properties:
I have a 3-node RabbitMQ (version 3.12.12) cluster.
RabbitMQ runs as a Docker container on each node.
Each node has a 200 GB disk, 16 CPUs and 20 GB RAM.
I used quorum queues.
rabbitmq.conf
docker-compose.yml
Test results:
Thank you.
Expected behavior
I need to improve quorum queue performance.