
TargetMeanDiscretiser: sorts variables in bins and replaces bins by target mean value #419

Open
wants to merge 29 commits into base: main

Conversation

@Morgan-Sell (Collaborator)

Closes #394.

The transformer accepts a dictionary that defines how numeric variables will be discretized/organized into bins. It then calculates the target mean for each bin and replaces the bin values with that mean.

@Morgan-Sell (Collaborator Author)

@solegalli,

I was initially thinking that this class would be similar to ArbitraryDiscretiser, where the user passes the desired bins. However, I'm now realizing we may want to incorporate the EqualWidthDiscretiser and EqualFrequencyDiscretiser. What are your thoughts?

@Morgan-Sell (Collaborator Author)

Hi @solegalli,

With regard to the implementation, are you envisioning the following order of operations, or something of the sort?

  • var_A is a numeric variable in dataframe X.
  • fit() copies the original values of var_A.
  • A discretizer, e.g. EqualFrequencyDiscretiser, discretizes and assigns a bin number to the respective values. Let's call this variable var_A_disc.
  • MeanEncoder.fit() accepts var_A_disc as X and var_A as y.

In the case of multiple variables, the code would iterate through the numeric variables while performing the operations described above; see the sketch below.
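A rough sketch of those steps for a single variable (illustrative only; it mirrors the list above, including passing the original var_A values as y to MeanEncoder.fit(), which is exactly the point in question):

```python
import pandas as pd

from feature_engine.discretisation import EqualFrequencyDiscretiser
from feature_engine.encoding import MeanEncoder

X = pd.DataFrame({"var_A": [0.1, 0.4, 0.35, 0.8, 0.95, 0.2, 0.6, 0.75]})

# fit() keeps a copy of the original values of var_A
var_A_original = X["var_A"].copy()

# a discretiser assigns a bin to each value (var_A_disc);
# return_object=True so the encoder accepts the binned variable
discretiser = EqualFrequencyDiscretiser(q=4, variables=["var_A"], return_object=True)
X_disc = discretiser.fit_transform(X)

# MeanEncoder.fit() takes the binned variable as X and the original var_A as y
encoder = MeanEncoder(variables=["var_A"])
encoder.fit(X_disc, var_A_original)
X_encoded = encoder.transform(X_disc)
```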

@solegalli (Collaborator) left a comment


Hi @Morgan-Sell

Thank you for getting started with this class.

The implementation looks good to me. I think you kind of answered your own questions, right? Or is there anything else I can add?

The main thing: for the first implementation, I would not include the arbitrary discretiser. The rest looks good.

Thank you!

variables: Union[None, int, str, List[Union[str, int]]] = None,
bins: int = 5,
strategy: str = "equal_frequency",
binning_dict: Dict[Union[str, int], List[Union[str, int]]] = None,
Collaborator


We do not need this parameter.

Collaborator Author


Are you referring to binning_dict? Yes, makes sense, given that we're removing the ArbitraryDiscretiser.

binning_dict: Dict[Union[str, int], List[Union[str, int]]] = None,
errors: str = "ignore",
) -> None:
# TODO: do we include ArbitraryDiscretiser?
Collaborator


I would prefer to just use the equal bin or equal frequency only.

Collaborator Author


I'm assuming equal bin is analogous to equal width.

Collaborator


Yes, sorry.

@Morgan-Sell (Collaborator Author)

Hi @solegalli,

I'm developing the tests for TargetMeanDiscretiser(). I see that the team has made substantial edits to the unit tests. I'm using the EqualFrequencyDiscretiser and EqualWidthDiscretiser tests as guides.

Both classes have the following test:

X_t = [x for x in range(0, 10)]
assert all(x for x in X["var"].unique() if x not in X_t)

This seems to check whether the transformed variable X["var"] contains an integer in the range of 0 to 9. What's the rationale for this test?

Thanks!

@Morgan-Sell (Collaborator Author)

@solegalli,

I'm creating the TargetMeanDiscretiser unit tests. I mentioned in an earlier post that I was using the EqualWidthDiscretiser and EqualFrequencyDiscretiser unit tests as guides. However, I think those unit tests do not reflect our most recent practices, e.g. @pytest.mark.parametrize. At the same time, I see that the team has made significant edits to the tests directory. Which unit tests should I use as a guide?

Thanks!


@solegalli (Collaborator)

Which line of code are you referring to?

@solegalli (Collaborator) left a comment


Hi @Morgan-Sell

Sorry for the late response. I added a few minor comments here and there. We do have the target mean functionality already, so there is no need to code it again.

Basically, we need a pipeline with the discretiser returning the bins as object, followed by the MeanEncoder.
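A minimal sketch of that pipeline, assuming the existing EqualFrequencyDiscretiser and MeanEncoder and a toy dataframe (the data and parameter values are illustrative, not the final implementation):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline

from feature_engine.discretisation import EqualFrequencyDiscretiser
from feature_engine.encoding import MeanEncoder

rng = np.random.default_rng(0)
X = pd.DataFrame({"var_A": rng.normal(size=100), "var_B": rng.normal(size=100)})
y = pd.Series(rng.integers(0, 2, size=100))

# discretise into bins returned as object, then replace each bin with the target mean
pipe = Pipeline([
    ("discretiser", EqualFrequencyDiscretiser(q=5, return_object=True)),
    ("encoder", MeanEncoder()),
])
X_t = pipe.fit_transform(X, y)
```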

The discretisers' tests in general are a bit poor, I agree with you. This is some legacy code; they are some of our first transformers.

Having said this, most tests would already be covered out of the box by adding the new transformer to test_check_estimators.py, which you have done already.

So for the new tests, I would just ensure that the transformer outputs the expected result.
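As a sketch of what such a test could look like (the parameter names follow the diff above; the import path is assumed and may change before merging):

```python
import pandas as pd

from feature_engine.discretisation import TargetMeanDiscretiser  # assumed import path


def test_transform_replaces_bins_with_target_mean():
    X = pd.DataFrame({"var_A": [1, 2, 3, 4, 5, 6, 7, 8]})
    y = pd.Series([0, 0, 1, 1, 0, 1, 1, 1])

    transformer = TargetMeanDiscretiser(bins=2, strategy="equal_frequency")
    X_t = transformer.fit_transform(X, y)

    # with 2 equal-frequency bins, values 1-4 fall in the first bin (target mean 0.5)
    # and values 5-8 in the second bin (target mean 0.75)
    assert X_t["var_A"].tolist() == [0.5, 0.5, 0.5, 0.5, 0.75, 0.75, 0.75, 0.75]
```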

@solegalli (Collaborator)

> Which unit tests should I use as a guide?

I would say, have a look at the tests for the TargetMeanRegressor and TargetMeanClassifier.

@Morgan-Sell (Collaborator Author)

@solegalli,

I'm an idiot! I misinterpreted the instructions in the "TargetMeanDiscretiser", and as a result created _encode_X() and the rest of that nonsense.

Let me know about adding the ArbitraryDiscretiser(). I'll update the tests soon!

@solegalli (Collaborator) left a comment


Hi @Morgan-Sell

This is looking really good, thank you.

The tests are failing. Any idea which ones, and how to fix them?

Also, did you run isort and flake8 on the new files? That would order the imports at the top and tell you which ones are unused; we need to remove those to pass the code style tests.

Next steps: documentation! Hurray!

We need to expand the docstrings here, and then add the transformer in docs/index and also within the api and user_guide folders, in the last one with a nice demo.

I guess you can take some inspiration from the MeanEncoder's documentation?

("discretiser", self._make_discretiser()),
("encoder", MeanEncoder(
variables=self.variables_numerical_,
ignore_format=True)
Collaborator


This param should be False.

Collaborator Author


I thought ignore_format should be True to encode the integers that represent the bins created by the discretiser. Otherwise, the transformer throws an error because the variables that are being encoded are not strings.

Collaborator


That also works. I think it is safest to tell the discretiser to return object variables (return_object=True) and then use the default parameters of the encoder.
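For reference, a short sketch of the two options discussed in this thread (both use parameters that already exist in feature_engine; which one the final class uses is the open question here):

```python
from feature_engine.discretisation import EqualFrequencyDiscretiser
from feature_engine.encoding import MeanEncoder

# option currently in the PR: integer bins, encoder told to ignore the variable format
discretiser = EqualFrequencyDiscretiser(q=5)
encoder = MeanEncoder(ignore_format=True)

# alternative suggested above: bins returned as object, encoder with default parameters
discretiser = EqualFrequencyDiscretiser(q=5, return_object=True)
encoder = MeanEncoder()
```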

@Morgan-Sell (Collaborator Author)

Hi @solegalli,

I worked on the documentation (still need to review it). I've added the class to multiple index.rst files; however, it seems that I'm still missing the table of contents, i.e. the toctree. Where is it?

Also, the new testing methodology is really cool! I'm still unclear exactly how it operates though. Maybe we can discuss it sometime?

Cheers!

@solegalli (Collaborator)

> I worked on the documentation (still need to review it).

You need to add the transformer to this file: https://github.com/feature-engine/feature_engine/blob/main/docs/user_guide/discretisation/index.rst

I am off on hols from Friday; maybe we can do it on my return?

Cheers

X, y = check_X_y(X, y)

# identify numerical variables
self.variables_numerical_ = _find_or_check_numerical_variables(
Collaborator


The attribute should be self.variables_.

That should resolve the failing test.

@solegalli (Collaborator)

One of the common tests is failing with this error:

E AttributeError: 'TargetMeanDiscretiser' object has no attribute 'variables_'

@Morgan-Sell (Collaborator Author)

Thanks, @solegalli!

Enjoy your vacay! And, yes, it would be great to review the tests when you return ;)

@solegalli (Collaborator)

@Morgan-Sell

Did I completely forget about this PR?

Looks like it is more or less good to go?

@solegalli (Collaborator)

OK, I went through this quickly; the docstrings still need to be added.

The idea of this transformer was to perform discretisation and then replace the values with the mean of the target.

Now, this functionality can already be achieved by combining the equal-frequency or equal-width discretiser with the MeanEncoder in a pipeline.

So I am having second thoughts as to whether we should create this class, when it is possible to obtain the same result with the transformers that we already have.

Would you mind if we keep this PR on hold?

Apologies :/

@solegalli added the wontfix label (this will not be worked on) on Jul 5, 2022
@Morgan-Sell (Collaborator Author)

Hi @solegalli,

I'll pause on editing the docstrings.

How did you envision this class being different from combining the equal-frequency or equal-width discretiser with the MeanEncoder in a pipeline? I thought that, algorithmically, this was the purpose.

I guess one way to think of this class is as a tool in the feature-engine toolbox that a user may not be aware of until reading the feature-engine docs, assuming people read the documentation in depth ;)

@solegalli (Collaborator)

Hi @Morgan-Sell

There is a paper from the KDD 2009 data science competition in which the authors discretise all numerical variables and then replace their values with the target mean. They then use these values as predictions to, basically, select features.

For categorical variables, they replace the categories with the target mean, and they proceed to select features in the same way.

So the MeanEncoder and the TargetMeanDiscretiser aim to do the same kind of thing: the encoder replaces categories with the target mean, and the discretiser sorts the variables into intervals and then replaces the intervals with the target mean.

This is where my thinking came from.
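A toy sketch of that idea using the existing transformers (the data and the roc_auc scoring choice are illustrative, not taken from the paper):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

from feature_engine.discretisation import EqualFrequencyDiscretiser
from feature_engine.encoding import MeanEncoder

rng = np.random.default_rng(0)
X = pd.DataFrame({"var_A": rng.normal(size=500)})
y = pd.Series((X["var_A"] + rng.normal(scale=0.5, size=500) > 0).astype(int))

# sort the variable into intervals, then replace each interval with the target mean
disc = EqualFrequencyDiscretiser(q=5, return_object=True)
enc = MeanEncoder()
X_enc = enc.fit_transform(disc.fit_transform(X), y)

# the encoded values act as single-variable "predictions" that can be scored
# to rank and select features, as described above
print(roc_auc_score(y, X_enc["var_A"]))
```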

Most users would not be aware of this possibility because, well, they don't know all the literature and probably don't have time to catch up. So having the class ready is helpful, in the sense that it makes the transformation "visible".

On the other hand, it is more work for us, given that the encoding could already be done with the classes that exist. So yeah, I am in two minds.

What do you think?

@solegalli (Collaborator)

I made a PR with some changes here:

Morgan-Sell#11

Only two things remain to be done and then we are good to go!

@solegalli removed the wontfix label on Aug 17, 2022
@solegalli (Collaborator)

@Morgan-Sell, this is in reply to whether or not we should move forward. I am still in two minds. What's your view?

@Morgan-Sell (Collaborator Author)

I'm leaning towards creating the transformer.

I see Python packages as toolboxes and like to consider the novice craftsman - e.g. me - when designing the tools. Like you said, people are going to open the toolbox and be amazed by the collection of our wonderful tools. It's unlikely that a novice would make the connection between the numerical discretisers and the mean encoder.

Also, the user may be unaware of sklearn's Pipeline class.

I think we've already done most of the work.

Once created, how much maintenance is required for each class? I'm assuming it's not much, but it could depend on the class/transformer's complexity. I guess this class will need to be updated whenever changes are made to the numerical discretisers or the mean encoder.

I'm still leaning towards creating the class for the newbies as I'm one of them ;)

@solegalli (Collaborator)

Sounds good. Let's finish the IV selector, and then we come back to this one. In the meantime, we sleep on the idea a bit longer :p
