Add support for multiple datasets #33

techscientist · 2017-06-09T18:50:21Z

Hi there!

This is an AWESOME library! I simply cannot express my happiness with this super helpful library enough..

Anyways, I think it would be really nice if Xcessiv supports multiple datasets, similar to its support for multiple estimators, ensembles, etc.

This way, we could define and import multiple datasets, and then configure each individual base estimator instance to take input from a user-specified dataset that they have defined. This would be important for importing heterogeneous data into a subset of the estimators, and even import different versions of the same dataset (classification, regression, etc.) so that some estimators can be classifiers, some can be regressors, and regressors can take input from classifier predict_proba results, etc.

Will this implementation break the existing one? Or, in other words, is it possible to extend the current version to support multiple datasets? If you know some tips for developing this, then kindly share them with me and I will also try to implement it (I'm kind of new to GitHub in terms of contributing code, so if you can give me a brief pointer to the file/packages I would need to look into, I'll figure out the rest).

Thank you so much for your hard work! This library is a God-send for all machine learning developers, newbies, and Kagglers out there!

reiinakano · 2017-06-09T19:29:26Z

Thanks for your kind words! I'm glad you find it helpful.

That's certainly an interesting proposal.

There's an interesting discussion in a scikit-learn PR scikit-learn/scikit-learn#8960. I'm going to specifically point out this post by @dengemann

It seems to be a good moment to stop for a second and consider another interesting and relevant practice in stacking, that is feature stacking. My goal is to ensure that the developments here will not obstruct or block support for feature stacking at an early stage. So what is this and why it interesting. In many applied contexts, one may want to learn a model on features from different domains, governed by different distributions and covariance functions, e.g., Text and images, images and signals, continuous and discrete variables etc. Sometimes it can be more advisable to map each of these input sources to different first-level learners or simply don't estimate one single joint model for these different classes of inputs.
As the variant proposed here, the feature stacking would use learning on stacked layers of predictions. The main difference is that the different first level learners are trained on different columns of X.

Could this possibly be what you are talking about when you say:

importing heterogeneous data into a subset of the estimators, and even import different versions of the same dataset (classification, regression, etc.) so that some estimators can be classifiers, some can be regressors, and regressors can take input from classifier predict_proba results

If it is, then instead of having entirely separate datasets, one way you could achieve the same result in Xcessiv is by combining all your datasets, and then setting up your base learners to be sklearn Pipelines with sklearn.preprocessing.FunctionTransformer selecting only a subset of your feature columns, as shown in http://scikit-learn.org/stable/auto_examples/preprocessing/plot_function_transformer.html#sphx-glr-auto-examples-preprocessing-plot-function-transformer-py. That way, some base learners could be trained on a subset and others on another subset.

I'm afraid that Xcessiv is very dependent on the scikit-learn (X, y) single dataset paradigm and I'm not sure how multiple sources would fit in with the rest of the codebase without a major overhaul or departure from that paradigm. I might add here that I intend to keep the Xcessiv interface as closely tied to the scikit-learn interface as possible. Those guys have put a lot of thought into it and the result of Xcessiv following that is that things just kind of mesh together perfectly.

Perhaps you could give a more specific example of your use case?

techscientist · 2017-06-10T11:56:19Z

Hi @reiinakano , thanks for the pointers!

My use case is a regression problem.

So, I was thinking of having one subset of estimators directly use the input features to predict the continuous target variable, and another subset of estimators using the input features to predict the "interval" in which the target variable would be (a multi-class classification problem), and then using predict_proba on these classifiers as inputs to another subset of regressors.

So, the problem with this currently, is that the classifier base estimator's data has a different target variable (ie. the interval id), but the regressor base estimators's data has a continuous target variable. As a result, I not only need to split the input features (which i could do by following the approach you shared above) but also input different target variables (the 'y') into the different subset of estimators, which is why I thought multiple datasets would be a nice option.

But I think having multiple datasets still goes well with scikit-learn approach, because each estimator will only train on one specific dataset. The main point, rather, is that Xcessiv allows users to import and define these multiple datasets, and then configure each instance of a base estimator with one of these specific datasets. In essence, the user defines a list of imported datasets and then wires each base estimator instance with one of these datasets.

The ensembles and stacks made from these base estimators can also be wired with a specific dataset, in order to use the stacked predictions to predict the right target variable desired by the user.

I hope I've explained it clearer than before. If not, please tell me and I will be more than willing to further clarify.

Cheers!

reiinakano · 2017-06-10T12:44:05Z

Yup, your use case is certainly clear now. It's an interesting one as well. How prevalent is this though? Is this a commonly used approach or something only done in very special cases?

My problem here would be during the exporting of a stacked ensemble. The stacked ensemble exported should be usable on its own as a single base learner. i.e. must have fit(X, y). This allows for easy deployment in production, generating predictions on unlabeled datasets, or multi-level stacking. If the ensemble is a combination of base learners from different datasets, then that requirement won't be preserved.

Anyway, I can actually see that you can still achieve your use case in the current Xcessiv implementation, albeit in a very hacky way.

First, let's assume that your problem is mainly a classification one, although you have a column containing a label suited for regression. Arrange and combine your dataset so it now looks like this:

ftr1 | ftr2 | ftr3 | regression_y | classification_y

When defining your main dataset, you use classification_y as y, and keep regression_y in your X.

Now for all your classification base learners, you use a pipeline with FunctionTransformer to exclude the regression_y column (and any other columns you want to exclude) before your regular classifier. No problem there.

Things get a bit hairy for your regressors. But remember that you have full control over what is done with your base learners' fit(X, y) and predict(X) methods. You should rewrite your fit method, and in it, you disregard the passed in y (which corresponds to the classification_y column), and instead separate the regression_y column in the passed in X. You would then fit the actual regressor on the leftover X features and regression_y label column. For the predict(X) method, you would modify the passed in X again and cut out the regression_y column, and call predict on this new X.

It's quite convoluted, but if you export your entire stacked ensemble in the end, you would be able to call fit(X, y) on a full dataset without needing to specify a different data source for each base learner. Each base learner can take care of cutting out columns on its own. You can then proceed to use it as a base learner and do multi-level stacking.

What do you think?

reiinakano · 2017-06-10T12:48:24Z

Thinking about it some more, I assume this is something mostly done for regression (because of the binning thing)? If so, then I think it's an even simpler fix. Use regression_y as your y column and generate your binned class column on the fly for classifiers i.e. in classifier fit method, convert passed in y to binned classes before actually fitting. There would be no need for a separate classification_y column (and none of the embarrassing hacks in my previous post). Plus you'd get to do neat things like different bins for different classifiers e.g. linear bins, logarithmic bins, and stacking them all together!

Remember, base learners in Xcessiv are just neat little black boxes that convert features (X) into their corresponding meta-features. They're all just transformers! You can get as creative as you want with that little black box inside. Whether they're regressors or classifiers is moot. In fact, Xcessiv doesn't really have a concept of regressor or classifier, just meta-features!

Thanks for this. It never occurred to me that it'd actually make sense to use classifiers for a regression problem. And it's something that can only be done through stacking! How powerful!

EDIT: You did point out it was a regression problem! My bad, didn't notice!

techscientist · 2017-06-10T13:06:18Z

Great! Thanks for your help! I'll give it a try and let you know how it goes!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for multiple datasets #33

Add support for multiple datasets #33

techscientist commented Jun 9, 2017

reiinakano commented Jun 9, 2017 •

edited

Loading

techscientist commented Jun 10, 2017 •

edited

Loading

reiinakano commented Jun 10, 2017

reiinakano commented Jun 10, 2017 •

edited

Loading

techscientist commented Jun 10, 2017

Add support for multiple datasets #33

Add support for multiple datasets #33

Comments

techscientist commented Jun 9, 2017

reiinakano commented Jun 9, 2017 • edited Loading

techscientist commented Jun 10, 2017 • edited Loading

reiinakano commented Jun 10, 2017

reiinakano commented Jun 10, 2017 • edited Loading

techscientist commented Jun 10, 2017

reiinakano commented Jun 9, 2017 •

edited

Loading

techscientist commented Jun 10, 2017 •

edited

Loading

reiinakano commented Jun 10, 2017 •

edited

Loading