[QST][Bug?] Can I fit/evaluate many XGBoost models on the same cluster? #8623
Just a note that the reproducer is certainly submitting tasks from the workers. Therefore this documentation (https://distributed.dask.org/en/stable/task-launch.html#submit-tasks-from-worker) is probably relevant. Perhaps there is a deadlock that can be avoided if xgboost knows it is running from a worker and can secede/rejoin when necessary?
You might try making lots of Dask clusters? This is done in this example:
https://github.com/coiled/dask-xgboost-nyctaxi/blob/main/Modeling%203%20-%20Parallel%20HPO%20of%20XGBoost%20with%20Optuna%20and%20Dask%20(multi%20cluster).ipynb
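In rough outline, that pattern looks like the following. This is a minimal sketch, assuming `LocalCluster` and random data rather than whatever the notebook actually uses; the cluster sizes, data shapes, and parameters are illustrative.

```python
# One short-lived cluster per model, so no two xgboost.dask runs compete
# for the same workers. Sizes, shapes, and parameters are illustrative.
import dask.array as da
from dask.distributed import Client, LocalCluster
from xgboost import dask as dxgb

def train_one(params):
    with LocalCluster(n_workers=2, threads_per_worker=1) as cluster:
        with Client(cluster) as client:
            X = da.random.random((1_000, 10), chunks=(250, 10))
            y = (da.random.random(1_000, chunks=250) > 0.5).astype(int)
            dtrain = dxgb.DaskDMatrix(client, X, y)
            return dxgb.train(client, params, dtrain, num_boost_round=10)

# Each call owns its cluster outright, so several can run side by side
# (e.g. one per Optuna trial) without interfering.
results = [train_one({"tree_method": "hist"}) for _ in range(3)]
```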
Thanks @mrocklin! I saw this repository the other day, and I actually did suggest the multi-cluster approach to the original user who ran into this issue. It is clear to me that the multi-cluster approach should work well. What is not clear to me is the specific reason the single-cluster approach hangs. If it is possible for XGBoost to make the single-cluster approach work well, they seem interested in knowing what that change would look like.
I agree that it would be good if the xgboost.dask approach could happily share the cluster. If your question is "does it?" then I think the answer is "no", at least as of the first prototype; you'd have to check the current source code, or ask the folks who wrote it, to be sure. If the question is "could it?" then sure, the answer is yes, although there are likely tricky things you'd want to figure out, like whether xgboost jobs should queue, shard the cluster, or run concurrently, and how to manage that.
Thank you for the suggestion. Is there any particular pattern I should avoid so the cluster doesn't hang?
Let's say it should run sequentially, which is the simplest case. Currently, XGBoost uses a lock to ensure exclusive access to the workers. If nested parallelism is not supported, an error would also help.
I'm not sure I understand the question being asked. If the question is "how would you design xgboost.dask so that it could be run multiple times at once?" then I'd need to think a lot longer (although maybe someone else can help); that's an interesting question that probably deserves a lot of thought. If the question is "how do you enforce sequential work?" then it probably depends on what you're doing. For example, you could use a lock so that only one training run proceeds at a time. If you're doing something like the above then you'd want to swap out the blocking waits for something that lets the worker thread secede while it waits. If your question is "is this a bug?" then I think my answer is "no": you're just dealing with lots of Locks, and you've set them up in such a way that they deadlock. Using something like secede/rejoin would probably help.
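For concreteness, here is a sketch of the sequential option, assuming the lock is acquired from the client process so that no single-threaded worker ever blocks on it. The lock name and the `train_fn` hook are illustrative, not part of any existing API.

```python
# Serialize xgboost.dask runs with a distributed Lock acquired from the
# client process, so no single-threaded worker ever blocks waiting on it.
from dask.distributed import Client, Lock

client = Client()                    # assumes an existing cluster
lock = Lock("xgb-train")             # named lock, shared cluster-wide

def run_sequentially(param_list, train_fn):
    results = []
    for params in param_list:
        with lock:                   # only one training run at a time
            results.append(train_fn(client, params))
    return results
```

If the lock instead has to be taken from inside a worker task, that blocking wait is where `secede()`/`rejoin()` would come in, releasing the worker thread while it waits.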
That makes sense. So, if I understand correctly, the observed hang is caused by the worker having a single thread, and that sole thread being used to wait for the lock?
There's only one lock in xgboost. I can't think of another way to ensure the workers' availability.
Since the underlying implementation of XGBoost's distributed training is collective (every worker must participate in each communication step), the lock is how xgboost ensures all workers are available at once. If my understanding of the underlying cause is correct, then I agree that it's not a bug but more a limitation of the current design. This brings us back to the feature request for dask to support collective/mpi tasks so that xgboost can remove/replace the lock. Will leave that for another dask sync. ;-)
I don't know for sure. I'd have to look at the xgboost.dask code to see what is going on.
I recommend raising an issue. Most technical conversation for Dask happens in issues, not at the monthly meetings. Alternatively, if you want to talk synchronously with someone, you could e-mail that person and ask for a conversation. (I am not the person to talk to here, FWIW.)
I very briefly looked over the implementation of the xgboost dask integration. It's a little difficult to wrap my head around what's going on. However, I'm pretty sure it's because you are indeed submitting tasks from within tasks without seceding.
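For reference, the safe version of submit-from-worker per the docs linked earlier uses `worker_client`, which secedes from the worker's thread pool while the task waits on its subtasks. A generic sketch, not xgboost-specific:

```python
# Submit-from-worker done safely: worker_client secedes from the worker's
# thread pool while this task waits on its subtasks, then rejoins on exit.
from dask.distributed import Client, worker_client

def inner(x):
    return x + 1

def outer(xs):
    with worker_client() as client:  # frees the worker thread while waiting
        futures = client.map(inner, xs)
        return client.gather(futures)

client = Client()
print(client.submit(outer, [1, 2, 3]).result())  # [2, 3, 4]
```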
Thank you for the suggestion, will try to write a feature request.
Description of possible bug
When I try to fit/evaluate many `xgboost.dask` models in parallel, one or more of the fit/evaluation futures hangs forever. The hang only occurs when `threads_per_worker=1` (which is the default for GPU-enabled clusters in dask-cuda).

Is this a bug, or is the reproducer shown below known to be wrong or dangerous for a specific reason?
Reproducer
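A minimal sketch of the failing pattern described above; the data shapes, parameters, and number of models here are illustrative, not the original code.

```python
# Several xgboost.dask training runs submitted as tasks to one cluster
# whose workers each have a single thread. Shapes, parameters, and the
# number of models are illustrative.
import dask.array as da
from dask.distributed import Client, LocalCluster, get_client
from xgboost import dask as dxgb

def fit_one(seed):
    client = get_client()            # we are running inside a worker task
    rs = da.random.RandomState(seed)
    X = rs.random_sample((1_000, 10), chunks=(250, 10))
    y = (rs.random_sample(1_000, chunks=250) > 0.5).astype(int)
    dtrain = dxgb.DaskDMatrix(client, X, y)
    return dxgb.train(client, {"tree_method": "hist"}, dtrain,
                      num_boost_round=10)["history"]

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=4, threads_per_worker=1)
    client = Client(cluster)
    futures = [client.submit(fit_one, i, pure=False) for i in range(8)]
    print(client.gather(futures))    # one or more futures may hang forever
```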
Further background
There is a great blog article from Coiled that demonstrates Optuna-based HPO with XGBoost and dask. The article states: "the current `xgboost.dask` implementation takes over the entire Dask cluster, so running many of these at once is problematic." Note that the "problematic" practice described in that article is exactly what the reproducer above is doing. With that said, it is not clear to me why one or more workers might hang.

NOTE: I realize this is probably more of an `xgboost` issue than a `distributed` issue. However, it seems clear that significant dask/distributed knowledge is needed to pin down the actual problem. Any and all help, advice, or intuition is greatly appreciated!