Accelerate Scheduler with Cython, PyPy, or C #854
Here is another, more real-world benchmark: https://gist.github.com/48b7c4b610db63b2ee816bd387b5a328 Though I plan to try to clean up performance on this one a bit in pure Python first. |
Any experience running with PyPy? How much is the difference? I think it's a bit complicated to run the scheduler on pypy and the workers with cpython (if you want numpy or pandas). According to @GaelVaroquaux 's testimonial on Cython:
With Cython it's easier to achieve C-like performance without leaving Python. I had a great experience with it too. |
My perspective is the following: The advantage of PyPy is that we would maintain a single codebase. We would also speed up everything for free (including tornado). I'm not normally excited about PyPy. I think that dask's distributed scheduler is an interesting exception. PyPy seems to be fairly common among web projects. The distributed scheduler looks much more like a web project than a data science project. The advantage of Cython is that users can install and run it normally using their normal tool chain. Also the community that works on Dask has more experience with Cython than with PyPy. The disadvantage is that we will need to maintain two copies of several functions to support non-Cython users, and things will grow out of date. I do not intend to make Cython a hard dependency. The advantage of C is that it would force us to reconsider our data structures. |
I'm using PyPy for client, workers and scheduler and it works fine. I also tried to run the scheduler on PyPy and client with CPython, that worked too. I think that accelerating the scheduler in this way is a good idea, just wanted to state that at least with PyPy you can try it now right away, without any changes to distributed itself :-) |
cc: @marshyski this might be of interest to you from a Go perspective. |
Did you mean non-CPython users? It doesn't seem impossible to stay compatible with PyPy without code duplication. I guess Numba is out of scope: the scheduler depends, and possibly will depend, on hashtable structures, right? |
No, I mean non-Cython users. I don't intend to make Dask depend on Cython near term. |
Numba is unlikely to accelerate the scheduler code. |
The disadvantage is that we will need to maintain two copies of several
functions to support non-Cython users, and things will grow out of
date.
You should explore using Cython's pure Python mode:
http://cython.readthedocs.io/en/latest/src/tutorial/pure.html
Maybe it will be enough for what you need.
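A minimal sketch of what pure Python mode looks like: the module stays importable as ordinary Python, but gains C types when compiled by Cython. The function here is illustrative, not from the dask codebase, and the shim lets it run even where Cython isn't installed.

```python
try:
    import cython
except ImportError:
    # Fallback shim so the module still runs without Cython installed:
    # the decorator becomes a no-op and the type names map to Python ints.
    class _CythonShim:
        int = long = int

        @staticmethod
        def locals(**_kw):
            return lambda f: f

    cython = _CythonShim()


@cython.locals(n=cython.int, i=cython.int, total=cython.long)
def triangular(n):
    # When compiled, the loop below runs on C integers; uncompiled,
    # the annotations are ignored and it behaves as plain Python.
    total = 0
    for i in range(n):
        total += i
    return total
```

Either way, `triangular(5)` returns 10, so one test suite can cover both the compiled and uncompiled variants, which also addresses the decay concern above.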
Also, I think that if you need to have two implementations of functions,
a good practice is to test them together. It will help avoid decay.
I do not intend to make Cython a hard dependency.
The problem would not be Cython as a dependency, as it needs only to be a
build dependency, or even better, you can generate the C files and ship
them when you do releases (this is what we do in scikit-learn). The
problem is that you are adding compiled code to your project. It makes it
more likely that things go wrong on installation. You need to start
distributing binaries, and hence start compiling for the different
platforms and the different versions of Python. Systems that have
non-uniform platforms (think a multi-machine environment) suffer.
I have always considered that adding compiled code to a project was a big
step, and I try to limit it. Note that limiting it is not always avoiding
it: I am very very happy that we decided for compiled code in
scikit-learn. It does increase the adoption cost for dask to add such a
requirement.
|
Quick timing update showing PyPy vs CPython for creating typical dask graphs. We find that PyPy is in the expected 2-5x faster range we've seen with Cython in the past:

PyPy:

```
>>>> import time
>>>> start = time.time(); d = {('x', i): (apply, lambda x: x + 1, [1, 2, 3, i], {}) for i in range(100000)}; end = time.time(); print(end - start)
0.11324095726
```

CPython:

```
In [1]: import time
In [2]: start = time.time(); d = {('x', i): (apply, lambda x: x + 1, [1, 2, 3, i], {}) for i in range(100000)}; end = time.time(); print(end - start)
0.357743024826
```

Given what turns up as a bottleneck, I can imagine wanting to cythonize core parts of dask.array as well (like |
Quick timing update showing PyPy vs CPython for creating typical dask graphs.
We find that PyPy is in the expected 2-5x faster range we've seen with Cython
in the past:
Usually that's because there is some type information that should be
added to make Cython faster.
|
I'm aware. To be clear, my previous comment was comparing PyPy to CPython, not Cython. We observe speedups similar to those we've seen when accelerating data-structure-dominated Python code with Cython in the past, namely 2-5x. In my experience 100x speedups only occur in Cython when accelerating numeric code, where Python's dynamic dispatch is a larger fraction of the cost. |
OK, I had misunderstood your comment. |
I'm not sure what you mean by "making Dask depend on Cython". Cython would only be a build-time dependency --- you'd ship either the generated C files via PyPI, Conda, etc., or the user would require a C compiler. It's similar to writing a C extension, just in a better language. |
@honnibal of course you're correct. I should have said something like "making the dask development process depend on Cython and the dask source installation process depend on a C compiler" both of which add a non-trivial cost. |
Couldn't you integrate the Cython building into Travis before shipping to PyPI, and have a dask and dask-c version on PyPI? That seems like it has all the advantages and none of the costs. Also, you're completely right that Cython won't be faster than PyPy for your scheduler (or only a tiny bit faster), but my argument against relying on PyPy is that it isn't production-ready for, and/or doesn't support, a lot of important things, including a lot of the data science ecosystem that people would use Dask with. Cython already has complete compatibility with everything and is used in massive deployments by Google etc. I also have a related question to all this, which is the reason I stumbled across this thread in the first place: |
Yes.
This adds a significant cost in development and build maintenance.
It is very hard (and often incorrect) to make claims about one being faster or slower than the other generally. Things are more complex than that.
I'm going to claim that this is unrelated. Please open a separate issue if you have a bug or ask a question on stack overflow (preferably with a minimal example) if you have a usage question. |
@justinkterry You won't really see a speed benefit from Cython unless you plan out your data structures in C. That's not nearly as hard as people suggest, and I think it actually makes code better, not worse. But it does mean maintaining a separate dask-c fork is really costly. As far as the Travis build process goes: Yeah, that does work. But the effort of automating the artifact release is really a lot. You have to use both Travis and Appveyor, and Travis's OSX stuff is not very nice, because the problem is hard. I'm also not sure Travis will stay so free for so long. I suspect they're losing a tonne of money. In general the effort of shipping a Python C extension is really quite a lot. Sometimes I feel like it's harder to build, package and ship my NLP and ML libraries than it is to write them. If it included C extensions, a library like Dask would have a build matrix with the following dimensions:
That's over 1,000 combinations, so you can't test the matrix exhaustively. Building and shipping wheels for all combinations is also really difficult, so a lot of users will have to source install. This means that a lot of things that shouldn't matter do. For instance, the peak memory usage might spike during compilation for some compilers, bringing down small nodes (obviously an important use-case for Dask!). This might happen on some platforms, but not others --- and the breakage might be introduced in a point release when Dask upgraded the version of Cython used to generate the code. Is it worth it? Well, for me the choice is between writing extension and going off and doing something completely different. My libraries couldn't exist in pure Python. But for a very marginal benefit, I think you'd rather be shipping a pure Python library. Btw, I also think PyPy might not be that helpful? The workers would have to be running CPython, right? Most tasks you want to schedule with Dask will run poorly on PyPy. |
@honnibal Thank you very much for your detailed explanation; that actually helps a lot. What would you recommend doing if I want to call a bunch of functions in dask that would be highly accelerated by C? Just package each one with cython and call it from the script using dask? My only concern with doing that is the time it takes to go between python and C, because it's a very large number of very short functions. |
@justinkterry I recommend raising a question on Stack Overflow using the #dask tag. |
Hi everyone. So Matt and I did some benchmarks, looked at PyPy, and here are the takeaways:
I think we can get another 2x speedup from PyPy with moderate effort. I can't promise I'll find time immediately, but if someone pesters me at some stage in the near to mid future, I can have a look. Cheers, |
This provides some motivation to arrange per-task information into objects. It is currently spread across ~20 or so dictionaries. This would have the extra advantage of maybe being clearer to new developers (although this is subjective). We would also have to see what effect this would have on CPython performance. A rewrite of this size is feasible, but is also something that we would want to discuss heavily before performing. Presumably there are a number of other improvements that we might want to implement at the same time.
My guess is that your time would be better spent by providing us with advice on how best to use PyPy during this process. My guess is that it would be unpleasant for anyone not already familiar with Dask's task scheduler to actually perform this rewrite. |
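A hypothetical sketch of that consolidation, shrinking the ~20 parallel dicts down to one dict of objects. The attribute names below are illustrative, not the scheduler's actual state:

```python
class TaskState:
    # __slots__ keeps per-task memory low and makes attribute access
    # a fixed-offset lookup rather than a per-instance __dict__ lookup.
    __slots__ = ("key", "priority", "dependencies", "who_has")

    def __init__(self, key):
        self.key = key
        self.priority = 0
        self.dependencies = set()
        self.who_has = set()


# Before: state spread across parallel dicts, all keyed by task key:
#   priority["x-1"] = 5; dependencies["x-1"].add("y-2"); ...
# After: a single dict of per-task objects.
tasks = {"x-1": TaskState("x-1")}
tasks["x-1"].priority = 5
tasks["x-1"].dependencies.add("y-2")
```

Every scheduler transition then touches one object rather than reaching into many dicts, which is also what would give Cython or PyPy typed attribute access to optimize.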
@mrocklin as a general note, if you're thinking about moving things from dicts to objects, then the |
So on the plus side, I added some small PyPy improvements to be not as bad when handling bytearray (and it has nothing to do with the handling of refcounts). Where are the bytearrays constructed? It might be better (for CPython too) to create a list and use b"".join instead of having a gigantic bytearray. As for how to get some basics, I do the following:
At this stage, I think what we need to do is take the core functions and make smaller benchmarks, otherwise it's a touch hard to do anything. My hunch is that a lot of dicts and forest-like code of a few functions makes it hard for the JIT to make sense of it, and that's why it's too slow, but it's just a hunch for now
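The `b"".join` suggestion above can be sketched as follows (the chunk count and sizes are arbitrary example numbers):

```python
# Accumulate chunks in a list and join once at the end, rather than
# repeatedly appending to one large bytearray, which forces periodic
# resizes and copies as it grows.
chunks = []
for _ in range(1000):
    chunks.append(b"x" * 10)   # e.g. one serialized frame per iteration

payload = b"".join(chunks)     # single allocation for the final bytes
```

The join computes the total length first and copies each chunk exactly once, which tends to behave well on both CPython and PyPy.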
I'm not sure we want to go too deep into PyPy-specific tuning. The idea of converting the forest-of-dicts approach to a per-task object scheme sounds reasonable in principle, though I'm not sure how much it would speed things up. We also don't want to risk making CPython slower by accident, as it is our primary platform (and our users' as well). |
Right, I'm pretty sure PyPy-specific tuning is a terrible idea. It's also rather unlikely to make a measurable difference on CPython. Measuring on PyPy though DOES make sense (especially since it runs almost 2x faster in the first place). Do you have benchmarks running on CPython all the time? Because if not, then "making it slower by accident" is a completely moot point. |
We do have a benchmarks suite (and some of us have individual benchmarks they run on a casual basis), but unfortunately we haven't automated its running (yet?). |
If I'm guessing correctly, it is in Tornado and should be fixed by tornadoweb/tornado#2169 |
Replacing OrderedDict with dict helps a bit on PyPy. I presume we can do that always on PyPy, and on CPython >= 3.6? Should help there too |
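A minimal sketch of that swap, with a version guard for older interpreters:

```python
import sys
from collections import OrderedDict

# Plain dicts preserve insertion order on CPython 3.7+ (guaranteed by the
# language spec; an implementation detail on 3.6) and on PyPy, so
# OrderedDict is only needed where its extra API (move_to_end, order-aware
# equality) is actually used.
def ordered_mapping():
    if sys.version_info >= (3, 6):
        return {}
    return OrderedDict()

d = ordered_mapping()
d["first"] = 1
d["second"] = 2
```

Plain dicts are also smaller and faster to create than OrderedDict, so the swap helps on CPython as well, not just PyPy.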
cc @fjetter nothing to do here, but I wanted you to be aware. It looks like this is happening whenever the scheduler writes a message. We check if the message is large enough that we should serialize it in a separate thread. In the case of the scheduler all messages should already be pre-serialized, so this check should probably be skipped. This could be solved by passing through some For the other components I'm still curious what within those functions is slow. Is it data structure access? Manipulating Task objects? It's hard to dive more deeply to figure out what about CPython itself is slow in particular. |
I wanted to check in here and bump this a bit. Two comments.
|
@Kobzol did you ever end up profiling things with |
I didn't use |
I have some questions about Cython. In @shwina's work here he Cythonized the TaskState and other classes in the Scheduler (distributed/distributed/scheduler_cy.pyx, lines 584 to 627 at 79f1f89).
This yielded only modest speedup. I'm curious if it is possible to take this further, and modify the types in the class itself with compound annotations like `dependencies: Dict[str, TaskState]`.

More specifically, some technical questions: Does Cython have the necessary infrastructure to understand compound types like this? Do we instead need to declare types on the various methods like

It looks like the previous effort didn't attempt to Cythonize the

cc @jakirkham who I think might be able to answer the Cython questions above and has, if I recall correctly, recent experience doing this with UCX-Py. Also cc @quasiben who has been doing profiling here recently. |
You can use the Usually, people type the variables that they assign the results to. Or use type casts. But that introduces either a requirement for Cython syntax or some runtime overhead in pure Python (due to a function call to Without looking at how the |
Given the dict is used with a pre-defined set of keys, I think we could define a big |
|
Thanks for the response @scoder . I think that we're becoming more comfortable with switching the entire file to Cython. So what I'm hearing is that we'll do something like the following:

```
cdef class TaskState:
    dependencies: dict

    cdef transition_foo_bar(self, ts: TaskState):
        key: str
        value: TaskState
        for key, value in ts.dependencies.items():
            ...
```

This is more C-like in that we're declaring up-front the types of various variables used within a function. Am I right in understanding that Cython will use these type hints effectively when unpacking |
If we are comfortable moving to more typical Cython syntax, are we also comfortable making use of C++ objects in Cython? For example we could do things like this...

```
from libcpp.map cimport map
from libcpp.string cimport string

cdef class Obj:
    cdef map[string, int] data
```

Asking as this would allow us to make the kind of optimizations referred to above. |
I would be inclined to do this incrementally and see what is needed. My guess is that there will be value in the attributes of a TaskState object being easy to manipulate in Python as well. This will probably be useful when we engage the networking side of the scheduler, or the Bokeh dashboard. To me the following somewhat incremental path seems good:
|
I'm skeptical that using a C++ |
Sure the code above was intended to provide a sample. Not the optimal solution necessarily. |
Slightly orthogonal, but I think the biggest potential speedup for the scheduler would come from vectorizing scheduler operations. That is, instead of having dicts and sets that are iterated on, find a way to express the scheduling algorithm in terms of vector/matrix operations, and use Numpy to accelerate them. Regardless of the implementation language, implementing set operations in terms of hash table operations will always be costly. (this has several implications, such as having to identify tasks and other entities by integer ids rather than arbitrary dict keys) |
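A toy sketch of the vectorization idea, assuming tasks are identified by integer ids; the dependency counts below are made-up example data:

```python
import numpy as np

# Instead of iterating over sets/dicts to find runnable tasks, keep
# per-task state in arrays indexed by integer task id and express the
# "which tasks are ready?" query as a single vectorized comparison.
n_unfinished_deps = np.array([0, 2, 0, 1, 0, 0, 3, 0])

# ids of tasks whose dependencies are all finished
runnable = np.flatnonzero(n_unfinished_deps == 0)
```

The scan cost per task drops to a few nanoseconds, at the price of re-expressing the scheduling logic in array form and mapping arbitrary keys to dense integer ids up front.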
Jumping on a discussion from forever ago.... that was essentially the issue with trying to speed it up on PyPy. No matter what you do, dict operations are expensive. If the only thing you do is dict operations, then CPython/PyPy will be the same speed, and probably faster than an equivalent C implementation. But the implementation in C will probably not use dicts everywhere, because it's hard. I would strongly suggest doing that work before moving to anything else, even if it makes CPython not faster (but it should make it faster).
|
Thanks for the comments @pitrou and @fijal (also, it's good to hear from both of you after so long). I agree that vectorization would probably push us well into the millions-of-tasks-per-second mark. If you look at HPC schedulers in academia one sees this kind of throughput. At some point we'll have to think about this, and I look forward to that. I still think that we should be able to do better than the 5k tasks-per-second limit that we have today. Dicts are slow, yes, but not that slow. @quasiben did some low-level profiling with NVIDIA profilers and found that the majority of our time in CPython wasn't in PyDict_GetItem, but had more to do with attribute access (I think). Previous Cythonization efforts left, I think, some performance on the table. I'd like to explore that to see if we can get up to 50k or so (which would be a huge win for us) before moving on to larger architectural changes. |
Hi Matthew,

It might be an unpopular opinion, but CPython performance does not tell you much about how the code would perform on anything else (e.g. PyPy, Numba, Cython, Rust, C). E.g. attribute access on PyPy mostly gets compiled to an array access.
|
Yeah, I'm hoping that we could get the same performance result with Cython. My hope (perhaps naive) is that there are large optimizations here that PyPy wasn't able to find, and that by writing this code manually and inspecting the generated C we might be able to find something that PyPy missed. I'm making that bet mostly based on the belief that dict/attribute access isn't that slow, but I'll admit that I'm making it in ignorance of how awesome PyPy is. Also, to be clear, I'm not saying "we're going with Cython" but rather "we should explore diving more deeply into Cython and see how it goes" |
Hi Matthew,

I'm really not telling you where you should or should not go. I was merely saying that everything that *does* any form of optimization will have a very different profile (more like PyPy) than CPython, and looking at where CPython spends its time is not that interesting. Using ints as keys, preallocating arrays (as opposed to resizable ones or dictionaries), etc. are all sensible strategies that will yield good results in anything optimizing, like say Java, *except* in CPython.

Best,
Maciej
|
Ah! Got it. |
On the interpreter side, conda-forge and Anaconda both build CPython with profile-guided optimizations (PGO). So at least on that side, we are probably as well optimized as we can hope. |
Another option to consider if this were revisited would be mypyc, which mypy dogfoods (IOW, mypy compiles itself with mypyc). This is likely a better fit than the other approaches as it uses standard Python type hints, which have increasingly been added to the codebase, and it is targeting server applications, which Distributed basically is. That said, mypyc itself is considered alpha (though it has been around for a little while) ( mypyc/mypyc#780 ). Also (and this may be inexperience on my part) it seems to look at all files instead of just the one being requested, and does not provide all errors at once (so one incrementally fixes issues, which can be a bit slow). However given the other points above, these may improve over time. So worth keeping an eye on. |
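For context, a module targeted at mypyc is just standard type-hinted Python; mypyc compiles the annotations to native classes and typed calls while CPython runs the same file unchanged. The class and function below are illustrative, not from the distributed codebase:

```python
from typing import Dict, Set


class TaskState:
    # mypyc compiles annotated attributes of a "native" class into fixed
    # C struct fields, giving fast attribute access without Cython syntax.
    def __init__(self, key: str) -> None:
        self.key: str = key
        self.dependencies: Set[str] = set()


def count_ready(tasks: Dict[str, TaskState]) -> int:
    # Plain typed Python; the same source runs interpreted or compiled.
    return sum(1 for ts in tasks.values() if not ts.dependencies)
```

Since the hints are ordinary typing annotations, the code keeps working under plain CPython and under mypy itself, so there is no second copy of the source to maintain.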
Given that it's considered alpha, I'm not sure you want to go through debugging potential mypyc bugs on the Distributed source code. Also, I don't see any benchmark results on non-trivial use cases, though I may be missing something. |
We are sometimes bound by the administrative overhead of the distributed scheduler. The scheduler is pure Python, and a bundle of core data structures (lists, sets, dicts). It generally has an overhead of a few hundred microseconds per task. When graphs become large (hundreds of thousands of tasks) this overhead can become troublesome.
There are a few potential solutions:
Generally efforts here have to be balanced with the fact that the scheduler will continue to change, and we're likely to continue writing it in Python, so any performance improvement would have the extra constraint that it can't add significant development inertia or friction.
Here are a couple of cProfile-able scripts that stress scheduler performance: https://gist.github.com/mrocklin/eb9ca64813f98946896ec646f0e4a43b
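A hedged sketch of profiling such a script with cProfile; the workload below is a stand-in for the scripts in the gist, not their actual contents:

```python
import cProfile
import io
import pstats


def build_graph():
    # Stand-in workload: building a large dask-style graph dict,
    # similar in spirit to the scheduler-stressing scripts.
    return {("x", i): (lambda x: x + 1, i) for i in range(100_000)}


profiler = cProfile.Profile()
profiler.enable()
graph = build_graph()
profiler.disable()

# Print the five functions with the largest cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
```

Sorting by cumulative time is usually the right first view for scheduler-style code, since it surfaces the transition functions that dominate even when each individual call is cheap.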