
[WIP, RFC] Problem: lots of dynamic memory allocation, leading to performance degradation and non-determinism #3911

Draft · wants to merge 58 commits into master

Conversation

@mjvankampen (Contributor) commented May 12, 2020:

See #3644

Basically the problem we are trying to solve is that the default allocator (malloc/free) is:

  • slow
  • unpredictable

@f18m showed that re-using pre-allocated memory can increase performance significantly. My own use case is somewhat different: I prefer to have as little dynamic memory allocation as possible in my applications, because determinism matters to me. Allocation on the heap is a source of non-determinism.

I took @f18m's pull request and tried to make it a bit nicer: users can provide their own queues, and the queue grows dynamically.

Open discussion points:

  • Do we need to allow the user to provide an allocator, or is the zero-copy feature enough?
  • SPSC queue: remove, I guess; @somdoron is convinced it will not work, and @f18m tends to agree (I don't have the insight to judge this).
  • <C++11 implementation: std::queue + mutex, or hide the global pool behind ifdefs? I prefer a (slower) <C++11 implementation.
  • External integration: is it OK to integrate concurrentqueue.h like this? Writing a custom queue for zmq seems too big an effort, but this does not align with C4 2.3.5.
  • Naming: ZMQ_MSG_ALLOCATOR_GLOBAL_POOL or ZMQ_ALLOCATOR_GLOBAL_POOL? Is it used anywhere else outside the msg API?
  • What level of testing is sufficient? I would add another test with a larger size.
  • Add another message type for the allocator: is this required, or is the current setup fine?
  • Add the allocator to the context, so that other msg_init calls also use the custom allocator. Is this a feature that is required/possible (at this stage)?
  • Any other suggestions, such as required features?
  • Is the current dynamic resize OK? It adds a bit of overhead vs. compile-time bins.

Todo

  • Add RFC for better PR description
  • Update documentation
  • Add statistics from memory pool
  • Fix CI: broken build on some platforms
  • Add unit tests: coverage below par
  • Change to function pointers; virtuals cannot be exposed through zmq.h
  • Fix missing headers (something with pch)
  • Add free_fn for allocator

RFC

Problem

ZMQ uses quite a bit of dynamic memory allocation. This results in

  • some loss in performance, as heap allocation takes time, can introduce implicit synchronisation, and can lead to heap fragmentation; and
  • non-determinism, as most new/malloc implementations do not have an upper bound on execution time.

The first is not a problem in itself, but since performance is one of ZMQ's strengths, it is always good to improve it. The second is an issue in systems that require a bound on latency.

Solution direction

The simplest solution would be to allow users to provide a custom allocator to be used instead of the default one. This has the advantage that users can customize the application to their needs and solve the above issues: they can pick the trade-off between memory, speed, and determinism themselves. Of course it adds some disadvantages as well:

  • some overhead is likely to be introduced (or some templating);
  • a new API is required; and
  • this allows users to break ZMQ with their custom allocator, resulting in more questions for the ZMQ community.

To mitigate these somewhat, a standard allocator implementation should be provided.

Default allocator usage

There are quite a few allocation calls in zmq: approximately 100 new and 50 malloc calls. Replacing all of them is quite a bit of work and most likely unnecessary, as most happen during initialisation. The focus should be on the allocations that happen often and all the time: the code that allocates for every message sent or received. This points to allocations in:

  • msg.hpp/cpp, and
  • decoders (decoder_allocators.hpp/cpp).

At the moment (correct me if I'm wrong) the decoder does not allocate per message but per stream. It seems the biggest gain is in the messages, which makes the decoder a nice-to-have (at this point).

APIs

  • message level API: set an allocator per message.

Another option is to set an allocator for all messages (via the context). In my opinion this is a nice-to-have and should only be implemented if someone asks for it.

zmq_allocator_new(type)

Construct an allocator of the given type. This is a way to provide default allocators to users; if a user wants to provide their own custom allocator, this function is not used.

zmq_allocator_destroy(allocator)

Destroy an allocator (either user-provided or native) after you are done with it. It may only be called once all messages have been handled, which means after terminating the context.

zmq_msg_init_allocator(message, size, allocator)

Initialize a message using a custom allocator; very similar to zmq_msg_init_size, except it takes an allocator argument.

zmq_allocator_t

A struct consisting of function pointers.

void*(*allocate_fn) (void *allocator, size_t len)
Allocate function

void(*deallocate_fn) (void *allocator, void *data_)
Deallocate function

bool(*check_tag_fn) (void *allocator)
Check the tag of an allocator to see if the allocator pointer is valid

void *allocator
Pointer to an allocator itself (on which to use the above functions)

optional (not required for the msg interface):
void(*release_fn) (void *allocator, void *data_)
Release data; this means the allocator is no longer the owner.
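
Putting the pieces together, usage could look like the sketch below. This is illustrative only: the field order, the ZMQ_MSG_ALLOCATOR_GLOBAL_POOL constant (one of the naming candidates above), and the error handling are assumptions, not the final API.

typedef struct zmq_allocator_t
{
    void *(*allocate_fn) (void *allocator, size_t len);   /* allocate len bytes            */
    void (*deallocate_fn) (void *allocator, void *data_); /* return data_ to the allocator */
    bool (*check_tag_fn) (void *allocator);               /* sanity-check the pointer      */
    void *allocator;                                      /* the allocator instance itself */
    void (*release_fn) (void *allocator, void *data_);    /* optional: give up ownership   */
} zmq_allocator_t;

/* Hypothetical lifecycle of a built-in allocator: */
void *ctx = zmq_ctx_new ();
void *pool = zmq_allocator_new (ZMQ_MSG_ALLOCATOR_GLOBAL_POOL); /* assumed constant */
zmq_msg_t msg;
zmq_msg_init_allocator (&msg, 4096, pool);
/* ... send or receive, then: */
zmq_msg_close (&msg);
zmq_ctx_term (ctx);           /* all messages must be handled first  */
zmq_allocator_destroy (pool); /* only now is destroying the pool safe */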

Provided allocators

Default

A basic allocator using new/delete. Basically the same as what we have now, but with a slight added cost of indirection through a vtable (which might be eliminated by using an interface and marking things as final).
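
A minimal sketch of that idea (class and member names are assumptions; note the todo above about switching to function pointers, since virtuals cannot be exposed through zmq.h):

class allocator_base_t
{
  public:
    virtual ~allocator_base_t () {}
    virtual void *allocate (size_t len_) = 0;
    virtual void deallocate (void *data_) = 0;
};

// Marking the concrete class final lets the compiler devirtualize
// calls made through allocator_default_t pointers or references.
class allocator_default_t final : public allocator_base_t
{
  public:
    void *allocate (size_t len_) override { return operator new (len_); }
    void deallocate (void *data_) override { operator delete (data_); }
};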

Global pool

This is an allocator that keeps queues of memory chunks, organised in bins by chunk size. It reduces the amount of dynamic memory allocation (thus gaining speed) at the cost of reserving more memory than strictly needed. To keep the startup footprint reasonable, memory is allocated on demand, in powers of two: if more chunks are required than a bin holds, the number of chunks in that bin is doubled; if a request exceeds the biggest bin, a new bin of twice that size is created.

Depending on the workload, this can lead to excessive memory use: for example, large bursts of a single message size, with the message size changing between bursts. If we want to prevent this, it might be wise to set a limit on the pre-allocated memory and, when it is exceeded, either fall back to new/delete or return an error. Or we trust the user not to pick this specific allocator for such workloads.
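
A compressed sketch of that bin logic (the 256-byte minimum chunk size is an assumption for illustration):

// Map a requested size to a power-of-two bin.
static size_t bin_for (size_t size_)
{
    size_t bin = 0;
    size_t chunk = 256; // assumed smallest chunk size
    while (chunk < size_) {
        chunk <<= 1; // each successive bin holds chunks twice as large
        ++bin;
    }
    return bin;
}

// Growth policy described above: when a bin runs out of chunks, its
// chunk count is doubled; when size_ maps past the largest existing
// bin, a new bin with twice the chunk size is created on demand.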

Other?

f18m and others added 30 commits, starting August 13, 2019 (among them: completely removing malloc/free from the hot path of the benchmark util). One commit note: in practice I observed that 256-byte messages require roughly 1024 pre-allocated messages to reach maximum performance on my PC; this presumably scales with message size.
ZMQ_EXPORT int
zmq_msg_init_allocator (zmq_msg_t *msg_, size_t size_, void *allocator_);

struct zmq_allocator_t
Member:

Does this need to be exposed? If so, can the type not be used in the API, as in ZMQ_EXPORT zmq_allocator_t *zmq_msg_allocator_new (int type_);?

Contributor Author (@mjvankampen):

If I don't expose this one, users cannot provide their own allocator with the current API. This is a design decision: I would allow users to provide their own allocators (maybe they want fixed bins with new, fixed bins with errors, an off-the-shelf allocator, etc.). But of course that depends on how much you want to expose to the user. What do you think about this?

void(*deallocate_fn) (void *allocator, void *data_);

// Return true if this is an allocator and alive, otherwise false
bool(*check_tag_fn) (void *allocator);
@gummif (Member), May 17, 2020:

Is this required? In what circumstances will this be useful?

Contributor Author (@mjvankampen):

When users want to provide their own allocator instead of using one of the built-ins.

Member:

What I mean is specifically the check_tag_fn function. I don't see why it is necessary at all. Is it used for trying to catch programming errors at runtime?

Contributor Author (@mjvankampen):

It would allow you to check whether a void pointer points to what you think it does. It was in the previous PR, so I left it in. I need to check whether it is actually used anywhere.
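
For reference, a sketch of the tag pattern as used elsewhere in libzmq (the constant value and names here are made up for illustration):

class allocator_t
{
  public:
    allocator_t () : _tag (0xCA11AB1E) {}
    ~allocator_t () { _tag = 0xdeadbeef; } // mark dead; see the volatile caveat below
    bool check_tag () const { return _tag == 0xCA11AB1E; }

  private:
    uint32_t _tag; // magic value used to validate void pointers from the C API
};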

return success;
}

size_t size_approx () const { return _queue.size (); }
Member:

This is not thread-safe.

Contributor Author (@mjvankampen):

Hence the size_approx () instead of size (). Might be tricky, though; I'm not sure what size () returns during a concurrent modification. I'll see if I can find it somewhere in the standard, and otherwise add a mutex.

Reviewer:

Depending on the architecture, you could get a torn read. Although since it is a size_t, it should be the platform's native word size and therefore safe-ish. If size_approx needs to be fast, you could alternatively use a std::atomic size counter with relaxed memory ordering.
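
A sketch of that alternative (names assumed), pairing the mutex-protected queue with a relaxed atomic counter:

#include <atomic>
#include <mutex>
#include <queue>

template <typename T> class counted_queue_t
{
  public:
    void push (const T &value_)
    {
        std::lock_guard<std::mutex> lock (_sync);
        _queue.push (value_);
        _size.fetch_add (1, std::memory_order_relaxed);
    }

    bool pop (T &out_)
    {
        std::lock_guard<std::mutex> lock (_sync);
        if (_queue.empty ())
            return false;
        out_ = _queue.front ();
        _queue.pop ();
        _size.fetch_sub (1, std::memory_order_relaxed);
        return true;
    }

    // Never tears and needs no lock, but is only approximate by design:
    // concurrent pushes and pops may not be reflected yet.
    size_t size_approx () const { return _size.load (std::memory_order_relaxed); }

  private:
    std::queue<T> _queue;
    std::mutex _sync;
    std::atomic<size_t> _size{0};
};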

_storage_mutex.lock ();
size_t oldSize = _storage.size ();
if (oldSize <= bl) {
_storage.resize (bl + 1);
Member:

While this function is protected by a mutex, these member variables are accessed in other functions without locks, which is not safe.

Contributor Author (@mjvankampen):

Thanks! Changed a few things around. Should be better now. Still need to check thoroughly.


private:
// Used to check whether the object is a socket.
uint32_t _tag;

Reviewer:

I see _tag is marked as dead in destructor, but the compiler will optimize that out unless _tag is volatile-qualified (which could have performance implications elsewhere).

header->capacity = nextBlockIndexCapacity;
header->tail.store((prevCapacity - 1) & (nextBlockIndexCapacity - 1), std::memory_order_relaxed);

blockIndex.store(header, std::memory_order_release);

MEMORY_LEAK: memory dynamically allocated at line 2931 by call to moodycamel::ConcurrentQueueDefaultTraits::malloc, is not reachable after line 2963, column 4.

if (!raw)
return nullptr;
char* ptr = details::align_for<TAlign>(reinterpret_cast<char*>(raw) + sizeof(void*));
*(reinterpret_cast<void**>(ptr) - 1) = raw;

MEMORY_LEAK: memory dynamically allocated at line 3556 by call to moodycamel::ConcurrentQueueDefaultTraits::malloc, is not reachable after line 3560, column 3.

// If it's < three-quarters full, add to the old one anyway so that we don't have to wait for the next table
// to finish being allocated by another thread (and if we just finished allocating above, the condition will
// always be true)
if (newCount < (mainHash->capacity >> 1) + (mainHash->capacity >> 2)) {

MEMORY_LEAK: memory dynamically allocated at line 3430 by call to moodycamel::ConcurrentQueueDefaultTraits::malloc, is not reachable after line 3458, column 8.

@axelriet (Contributor):

I've looked at this as I, too, believe it's worth doing something: in some scenarios (inproc, and presumably with other very fast transports) the dominant location where time is spent is the heap. However, I think the approach taken here is too complex. All that is needed is a couple of strategically placed hooks for malloc/free, a couple of typedefs, and one public function to pass in two function pointers; then plug in a faster heap. I did just that and used Intel's TBB scalable allocator instead of the CRT heap, and the speed gains are in the 20% range below the VSM threshold (33 bytes of message payload or less), ranging from 30% to 70% depending on message size after that. In fact it works so well that I later made using TBB a build option in itself, so TBB can be the new default, or you can supply custom functions in case you think you can do better.

@f18m (Contributor) commented Dec 21, 2023:

(quoting @axelriet's comment above)

Hi @axelriet, do you have any "PoC-quality code" you can share for your TBB-based approach?
I would be interested in having a look... (I originally raised #3644 but then exhausted the time I had available to complete the feature... as you said, maybe it was too complex an approach after all.)

@axelriet (Contributor) commented Dec 21, 2023:

@f18m Yep, you can check my fork of the project. It's a bit of a battlefield atm, but it builds on Windows, all tests pass, and it's a drop-in binary replacement, so it should be easy to experiment with. The alloc hooks and defaulting to TBB are two separate things that you can combine or not. If you want to try Intel's allocator, just get the latest oneAPI from their website. They don't supply a static lib (it should be possible to build one, though), so you'll have to add tbbmalloc.dll to your binaries. ETA: I've TBB-ified all the low-level constructs now (tries, queues...).

The custom alloc and TBB alloc work is spread amongst several commits, the main ones being this one and that one. I think I can claim a solid 30% increase in throughput using inproc_thr.exe, but I expect the benefits to be even greater on real-world server workloads with lots of competition inside and outside the process. There is one new public function to support the feature:

ZMQ_EXPORT (bool)
zmq_set_custom_msg_allocator (_In_ zmq_custom_msg_alloc_fn *malloc_,
                              _In_ zmq_custom_msg_free_fn *free_);

If never called, the default is either malloc/free or TBB's depending on build options.
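
For instance, hooking in TBB's scalable allocator could look like this (hypothetical usage: it assumes the two typedefs mirror the malloc/free signatures, which is exactly the shape of scalable_malloc and scalable_free):

#include <tbb/scalable_allocator.h>

// scalable_malloc is void *(size_t) and scalable_free is void (void *),
// so they can be passed straight through:
if (!zmq_set_custom_msg_allocator (scalable_malloc, scalable_free)) {
    // hooks rejected; the library keeps its default heap
}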

As a side note I've added support for Microsoft Hyper-V's HvSocket transport for guest-host communication (which incidentally is now the fastest zmq IPC on Windows) as well as Vsock for Linux VMs to talk to their host.

@axelriet (Contributor) commented Jan 2, 2024:

Benefits of a custom allocator for inproc workloads: here I'm using Intel's scalable allocator, part of Intel oneAPI / Threading Building Blocks. There is definitely an advantage. One thing jumps out: the VSM optimization is very effective. Too bad it's limited to 33 bytes or less. What could be done to raise that number is to reduce the threshold for long group names from 14 + z down to 10 + z chars, and increase the VSM threshold from <= 33 bytes to <= 37 bytes, without changing the existing zmq_msg_t size and therefore without breaking binary compatibility with existing clients. That's a modest improvement, but it would allow more small messages to go below the VSM break, and it's easy to restrict group names to a smaller length (to avoid an external allocation) by convention.

[image: benchmark chart of inproc throughput with the custom (TBB scalable) allocator vs. the default]

@axelriet (Contributor) commented Jan 3, 2024:

I went ahead in my fork and was able to increase the VSM threshold to 40 bytes (up from 33) by repacking the group_t and msg_t structs and reducing short group names to 7 chars.

Group names of 8 chars and up therefore switch to long group names now (vs. 15 chars and up previously), but the fast VSM performance applies to messages of up to 40 bytes, which may be important to some users.

I think 40/7 is a better compromise than the original 33/14 choice: users can trivially shrink group names to 7 chars or less if they want maximum performance for small messages, but they could not previously increase the VSM threshold beyond 33 if their data didn't fit under it. Now they have seven more bytes to play with. I get about 9M msg/sec in inproc_thr for messages up to 40 bytes now.
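
The back-of-envelope arithmetic, using the numbers from this thread (the actual field layout is internal to libzmq, so treat this as a sketch):

// sizeof (zmq_msg_t) stays 64 bytes, so binary compatibility holds.
// Short group name buffer: 14 chars + NUL = 15 bytes
//                      ->   7 chars + NUL =  8 bytes, freeing 7 bytes
// VSM payload threshold:    33 + 7 = 40 bytes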
