Accelerate intra-node IPC with shared memory #6267
Comments
What is the workload? Have you tried the dask.distributed scheduler? You can set up a system with sensible defaults by running the following:

```python
from dask.distributed import Client

client = Client()
# then run your normal Dask code
```

https://docs.dask.org/en/latest/scheduling.html#dask-distributed-local
In general, a system like Plasma will be useful when you want to do a lot of random-access changes to a large data structure and you have to use many processes for some reason. In my experience, the number of cases where this is true is very low. Unless you're doing something like a deep-learning parameter server on one machine and can't use threads for some reason, there is almost always a simpler solution.
A data loading pipeline shouldn't really require any communication, and certainly not high-speed random-access modifications to a large data structure. It sounds like you just want a bunch of processes (because you have code that holds the GIL) and want to minimize data movement between those processes. The dask.distributed scheduler should have you covered there; you might want to add the …
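A minimal sketch of the process-based setup suggested above (the file names, worker counts, and the load_and_preprocess helper are illustrative, not from the thread): GIL-holding Python code runs in separate single-threaded worker processes, so tasks don't contend for the GIL and data stays on the worker that produced it.

```python
from dask.distributed import Client
import dask.bag as db

def load_and_preprocess(path):
    # stand-in for pure-Python parsing that holds the GIL
    return {"path": path, "checksum": sum(ord(c) for c in path)}

if __name__ == "__main__":
    # one single-threaded worker process per task slot avoids GIL contention
    client = Client(processes=True, n_workers=8, threads_per_worker=1)

    paths = [f"data/part-{i:03d}.json" for i in range(100)]
    results = db.from_sequence(paths, npartitions=8).map(load_and_preprocess).compute()
    print(len(results))
```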
In addition to what Matt said, we have tended to keep Dask's dependencies pretty lightweight when possible. My guess is if we were to implement shared memory it would either involve …

That said, if serialization is really a bottleneck for you, I would suggest you take a closer look at what is being serialized. If it's not something that Dask serializes efficiently (like NumPy arrays), then it might just be that you need to implement Dask serialization for it. If you have some simple Python classes consisting of things Dask already knows how to serialize efficiently, you might be able to just register those classes with Dask. It will then recurse through them and serialize them efficiently.

Additionally, if you are using a Python with pickle protocol 5 support and a recent version of Dask, you can get efficient serialization with plain pickle thanks to out-of-band pickling ( dask/distributed#3784 ). Though you would have to check and make sure you are meeting those requirements. This may also require some work on your end to ensure your objects use things that can be handled out-of-band, by either wrapping them in …
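A small sketch of the two approaches mentioned above: registering a plain Python class for Dask serialization, and relying on pickle protocol 5 out-of-band buffers. The `Record` class is hypothetical; `register_generic`, `serialize`, and `deserialize` come from `distributed.protocol`, though the exact API may differ between versions.

```python
import pickle
import numpy as np
from distributed.protocol import register_generic, serialize, deserialize

class Record:
    """Hypothetical container whose attributes Dask already handles well."""
    def __init__(self, data, label):
        self.data = data      # large NumPy buffer
        self.label = label    # small metadata

# Let Dask recurse through Record's attributes instead of pickling it whole.
register_generic(Record)

rec = Record(np.arange(1_000_000, dtype="float64"), "example")
header, frames = serialize(rec)               # frames carry the raw array memory
restored = deserialize(header, frames)
assert np.array_equal(restored.data, rec.data)

# Pickle protocol 5: large buffers travel out-of-band instead of being copied
# into the pickle byte stream.
buffers = []
payload = pickle.dumps(rec.data, protocol=5, buffer_callback=buffers.append)
again = pickle.loads(payload, buffers=buffers)
assert np.array_equal(again, rec.data)
```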
Maybe. We're not really bound by bandwidth there yet. Even if we were, the people who are concerned about performance for dataframe shuffle operations are only really concerned when we start talking about very large datasets, for which single-node systems wouldn't be appropriate.
On Thu, Jun 11, 2020 at 5:11 PM Dave Hirschfeld wrote:

> plasma might be ideally suited for e.g. shuffling operations, #6164
Though if you have thoughts on how …
In the context of distributed you could have a plasma store per node and, instead of having workers communicate data directly, have them send the data to the plasma store on the receiving node and only send the guid / unique reference to the worker. All workers on that node would then have access to that data (by passing around the guid) without having to copy or deserialize it.

I think that could have pretty big performance benefits for a number of workloads. IIUC that's basically what ray (https://ray-project.github.io/2017/08/08/plasma-in-memory-object-store.html) does:

> To illustrate the benefits of Plasma, we demonstrate an 11x speedup (on a machine with 20 physical cores) for sorting a large pandas DataFrame (one billion entries). The baseline is the built-in pandas sort function, which sorts the DataFrame in 477 seconds. To leverage multiple cores, we implement the following standard distributed sorting scheme...

Anyway, it might be a very big piece of work, so not something I could invest time in. I thought I'd mention it as an option though if people are considering big changes to improve performance.
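A rough sketch of the per-node store idea, using the pyarrow.plasma bindings that shipped with pyarrow at the time (the plasma module has since been removed from recent pyarrow releases); the store path and sizes are illustrative.

```python
# Start a store on the node first, e.g.:
#   plasma_store -m 1000000000 -s /tmp/plasma
import numpy as np
import pyarrow.plasma as plasma

client = plasma.connect("/tmp/plasma")

# Producer: put a large array into the node-local store and keep only the ID.
data = np.random.random(10_000_000)
object_id = client.put(data)

# Any other process on the same node can connect to the same store and fetch
# the object; only the small object ID needs to travel between workers, and
# the returned array is a read-only view backed by the store's shared memory.
view = client.get(object_id)
print(view.sum())
```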
Yeah, I think that having some sort of shuffling service makes sense (this is also what Spark does). I'm not sure that we need all of the machinery that comes along with Plasma though, which is a bit of a bear. My guess is that a system that just stores data in normal vanilla RAM on each process would do the trick.
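A very rough sketch of the "plain RAM on each process" idea: each worker process keeps an ordinary dict of shards, and a later task scheduled on the same worker picks them up by key. All names here are hypothetical, and routing tasks to the right worker (e.g. via worker restrictions on submit) is left out.

```python
# This module would live on each worker; the dict is just process-local memory.
_SHUFFLE_BUFFER = {}

def stash_shard(partition_id, shard):
    """Called by producer tasks running on this worker."""
    _SHUFFLE_BUFFER.setdefault(partition_id, []).append(shard)
    return partition_id

def collect_partition(partition_id):
    """Called by a consumer task pinned to the same worker."""
    return _SHUFFLE_BUFFER.pop(partition_id, [])
```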
I could totally be wrong though. It would be great if people wanted to run experiments here and report back.
Has there been any further discussion on the multiprocessing shared memory implementation? I also run dask on single machines with high core counts and have read-only data structures that I want shared.
@alexis-intellegens the Ray developers created a Dask scheduler for this called …

```python
# don't do this:
dask.compute(dask_fn(large_object))

# instead do this:
large_object_ref = ray.put(large_object)
dask.compute(dask_fn(large_object_ref))
```

Ray will automatically de-reference the object for you.
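For context, a minimal sketch of wiring Dask to Ray, assuming the comment above refers to the Dask-on-Ray scheduler exposed as ray.util.dask.ray_dask_get (check the Ray docs for the exact entry point in your version):

```python
import ray
import dask.array as da
from ray.util.dask import ray_dask_get

ray.init()

x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
# Execute the Dask graph on Ray workers, which share Ray's plasma object store.
result = x.sum().compute(scheduler=ray_dask_get)
print(result)
```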
Very interesting! I'll give it a go. Thanks @Hoeze
Out of curiosity, what would happen if I made a shared memory object (via Python 3.8 multiprocessing) and tried to access it in dask workers? I'll try it later today.
That should work; they'd pickle as references to the shared memory buffer and be remapped in the receiving process (provided all your workers are running on the same machine, otherwise you'd get an error). In general I think we're unlikely to add direct shared memory support in dask itself, but users are free to make use of it in custom workloads using e.g. …

As stated above, shared memory would make the most sense if you have objects that can be mapped to shared memory without copying (meaning they contain large buffers, like a numpy array) but also still hold the GIL. In practice this is rare: if you're using large buffers you probably are also doing something numeric (like numpy), in which case you release the GIL and threads work fine.

Closing.
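A small sketch of the experiment described above, under the assumption that all workers run on one machine; the array size and the mean_of_shared helper are illustrative.

```python
import numpy as np
from multiprocessing import shared_memory
from dask.distributed import Client

def mean_of_shared(name, shape, dtype):
    # Re-attach to the same shared block inside the worker process.
    block = shared_memory.SharedMemory(name=name)
    view = np.ndarray(shape, dtype=dtype, buffer=block.buf)
    result = float(view.mean())
    del view            # release the exported buffer before closing
    block.close()
    return result

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=8 * 1_000_000)
    arr = np.ndarray((1_000_000,), dtype="float64", buffer=shm.buf)
    arr[:] = np.random.random(1_000_000)

    client = Client(processes=True)   # all workers on this one machine
    print(client.submit(mean_of_shared, shm.name, arr.shape, "float64").result())

    client.close()
    del arr
    shm.close()
    shm.unlink()
```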
When implementing e.g. a data loading pipeline for machine learning with Dask, I can choose either the threaded scheduler or the process-based (forking) scheduler.
I often face the issue that the threaded scheduler effectively uses only 150% CPU, no matter how many cores it gets, because of Python code that does not parallelize.
The forking scheduler sometimes works better, but only if the data loading is very CPU-intensive.
Recently I tried Ray, and for some reason it sped up some of my prediction models about 5-fold.
I'm not 100% up to date with the latest developments in Dask, but AFAIK Dask serializes all data when sending it between workers. That's why I assume the huge speed difference is due to Plasma, the shared-memory object store, which allows zero-copy transfers of Arrow arrays from the worker to TensorFlow.
=> I'd like to share two ideas for how Plasma or Ray could be helpful for Dask:
Allow producer to calculate data and consumer to read it without (de)serialization or copying
Related issues: