Add support for multiple datasets #33
Thanks for your kind words! I'm glad you find it helpful. That's certainly an interesting proposal. There's a relevant discussion in a scikit-learn PR: scikit-learn/scikit-learn#8960. I'd specifically point out this post by @dengemann.
Could this possibly be what you are talking about when you say:
If it is, then instead of having entirely separate datasets, one way you could achieve the same result in Xcessiv is by combining all your datasets, and then setting up your base learners to be sklearn Pipelines whose first step (e.g. a FunctionTransformer) selects only the columns that particular learner should see.

I'm afraid that Xcessiv is very dependent on the scikit-learn (X, y) single-dataset paradigm, and I'm not sure how multiple sources would fit in with the rest of the codebase without a major overhaul or departure from that paradigm. I might add here that I intend to keep the Xcessiv interface as closely tied to the scikit-learn interface as possible. Those guys have put a lot of thought into it, and because Xcessiv follows it, things just kind of mesh together perfectly.

Perhaps you could give a more specific example of your use case?
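For illustration, here's a minimal sketch of that column-selection approach, assuming a plain NumPy feature matrix; the column indices and the choice of classifier are placeholders, not anything prescribed by Xcessiv:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestClassifier

def select_columns(X, cols=(0, 1, 2)):
    """Keep only the columns this base learner should see."""
    return np.asarray(X)[:, list(cols)]

# A base learner that silently ignores all but its own slice of the
# combined dataset. Xcessiv only sees a regular sklearn estimator.
base_learner = Pipeline([
    ('select', FunctionTransformer(select_columns, validate=False)),
    ('clf', RandomForestClassifier(n_estimators=100)),
])
```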
Hi @reiinakano, thanks for the pointers!

My use case is a regression problem. I was thinking of having one subset of estimators directly use the input features to predict the continuous target variable, and another subset of estimators use the input features to predict the "interval" in which the target variable would fall (a multi-class classification problem), and then using predict_proba on those classifiers as inputs to another subset of regressors.

The problem with this currently is that the classifier base estimators' data has a different target variable (i.e. the interval id) than the regressor base estimators' continuous target variable. As a result, I not only need to split the input features (which I could do by following the approach you shared above) but also feed different target variables (the 'y') into the different subsets of estimators, which is why I thought multiple datasets would be a nice option.

I think having multiple datasets still goes well with the scikit-learn approach, because each estimator would still only train on one specific dataset. The main point, rather, is that Xcessiv would allow users to import and define these multiple datasets, and then configure each base estimator instance with one of them. In essence, the user defines a list of imported datasets and wires each base estimator instance to one of these datasets. The ensembles and stacks made from these base estimators can also be wired to a specific dataset, in order to use the stacked predictions to predict the target variable the user actually wants.

I hope I've explained it more clearly than before. If not, please tell me and I will be more than willing to clarify further. Cheers!
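For concreteness, here's a small hypothetical example of the interval labels described above, with arbitrary bin edges:

```python
import numpy as np

y_regression = np.array([0.3, 1.7, 2.2, 4.9, 3.1])  # continuous target
bin_edges = np.array([1.0, 2.0, 3.0, 4.0])           # interval boundaries

# Each sample gets the id of the interval its target falls into,
# turning the regression target into a multi-class label.
y_interval = np.digitize(y_regression, bin_edges)
# y_interval -> array([0, 1, 2, 4, 3])
```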
Yup, your use case is certainly clear now. It's an interesting one as well. How prevalent is this, though? Is it a commonly used approach or something only done in very special cases?

My problem here would be during the exporting of a stacked ensemble. The stacked ensemble exported should be usable on its own as a single base learner, i.e. it must take a single (X, y) dataset through the standard scikit-learn fit/predict interface.

Anyway, I can actually see that you can still achieve your use case in the current Xcessiv implementation, albeit in a very hacky way. First, let's assume that your problem is mainly a classification one, although you have a column containing a label suited for regression. Arrange and combine your dataset so it now looks like this:

ftr1 | ftr2 | ftr3 | regression_y | classification_y

When defining your main dataset, you use classification_y as y, and keep regression_y in your X. Now, for all your classification base learners, you use a pipeline with FunctionTransformer to exclude the regression_y column (and any other columns you want to exclude) before your regular classifier. No problem there.

Things get a bit hairy for your regressors, but remember that you have full control over what your base learners do with X and y. You can define a regressor that, during fit, pulls regression_y out of X and trains against it instead of the classification y.

It's quite convoluted, but if you export your entire stacked ensemble in the end, you would be able to call predict on it just like any other single estimator.

What do you think?
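A hedged sketch of what such a regressor could look like, assuming regression_y is carried as the last column of X; the wrapper class and the inner estimator are illustrative assumptions, not Xcessiv API:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.ensemble import RandomForestRegressor

class EmbeddedTargetRegressor(BaseEstimator, RegressorMixin):
    """Trains on a regression target smuggled inside X as a column."""

    def __init__(self, estimator=None, target_col=-1):
        self.estimator = estimator
        self.target_col = target_col

    def fit(self, X, y=None):
        # Ignore the classification y; extract the embedded regression
        # target and drop it from the features before fitting.
        X = np.asarray(X)
        y_reg = X[:, self.target_col]
        X_feat = np.delete(X, self.target_col, axis=1)
        self.estimator_ = self.estimator or RandomForestRegressor()
        self.estimator_.fit(X_feat, y_reg)
        return self

    def predict(self, X):
        # Drop the embedded target column at prediction time too.
        X_feat = np.delete(np.asarray(X), self.target_col, axis=1)
        return self.estimator_.predict(X_feat)
```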
Thinking about it some more, I assume this is something mostly done for regression (because of the binning thing)? If so, then I think it's an even simpler fix. Use regression_y as your one main y, and have your classification base learners bin it into interval labels themselves during fit.

Remember, base learners in Xcessiv are just neat little black boxes that convert features (X) into meta-features. What a base learner does with y internally is entirely up to you.

Thanks for this. It never occurred to me that it'd actually make sense to use classifiers for a regression problem. And it's something that can only be done through stacking! How powerful!

EDIT: You did point out it was a regression problem! My bad, didn't notice!
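A sketch of that simpler fix, assuming quantile-based bins and an arbitrary bin count; none of these class or parameter names come from Xcessiv:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.linear_model import LogisticRegression

class BinnedTargetClassifier(BaseEstimator, ClassifierMixin):
    """Bins a continuous y into intervals at fit time, then classifies."""

    def __init__(self, estimator=None, n_bins=5):
        self.estimator = estimator
        self.n_bins = n_bins

    def fit(self, X, y):
        # Derive interval ids from the continuous target using quantiles,
        # so each bin holds roughly the same number of samples.
        edges = np.quantile(y, np.linspace(0, 1, self.n_bins + 1)[1:-1])
        y_binned = np.digitize(y, edges)
        self.estimator_ = self.estimator or LogisticRegression(max_iter=1000)
        self.estimator_.fit(X, y_binned)
        self.classes_ = self.estimator_.classes_
        return self

    def predict_proba(self, X):
        # These class probabilities become the meta-features for stacking.
        return self.estimator_.predict_proba(X)

    def predict(self, X):
        return self.estimator_.predict(X)
```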
Great! Thanks for your help! I'll give it a try and let you know how it goes!
Hi there!
This is an AWESOME library! I simply cannot express my happiness with this super helpful library enough.
Anyways, I think it would be really nice if Xcessiv supported multiple datasets, similar to its support for multiple estimators, ensembles, etc.
This way, we could define and import multiple datasets, and then configure each individual base estimator instance to take input from a user-specified dataset. This would be important for feeding heterogeneous data into a subset of the estimators, and even for importing different versions of the same dataset (classification, regression, etc.), so that some estimators can be classifiers, some can be regressors, and the regressors can take the classifiers' predict_proba results as input (see the hypothetical sketch below).
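To make the proposal concrete, here is a purely hypothetical sketch of the wiring being suggested; none of these names exist in Xcessiv today, and the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Two views of the same data: a continuous target and its binned version.
X = np.random.rand(100, 3)
y_continuous = X @ np.array([1.0, 2.0, 3.0])
y_interval = np.digitize(y_continuous, [1.5, 3.0, 4.5])

datasets = {
    'regression': (X, y_continuous),
    'classification': (X, y_interval),
}

# The proposal: each base estimator instance is configured with the key
# of the one dataset it trains on.
wiring = [
    (LinearRegression(), 'regression'),
    (LogisticRegression(max_iter=1000), 'classification'),
]

for estimator, key in wiring:
    X_d, y_d = datasets[key]
    estimator.fit(X_d, y_d)
```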
Will this implementation break the existing one? Or, in other words, is it possible to extend the current version to support multiple datasets? If you have tips for developing this, kindly share them and I will also try to implement it (I'm kind of new to contributing code on GitHub, so if you can give me a brief pointer to the files/packages I'd need to look into, I'll figure out the rest).
Thank you so much for your hard work! This library is a God-send for all machine learning developers, newbies, and Kagglers out there!