Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New feature: Lag or windows features grouped by #727

Open
wants to merge 24 commits into
base: main
Choose a base branch
from

Conversation

Ezzaldin97
Copy link

Create a New feature for the #668 issue

  • create a feature for lag features to create lag features based on a list or variable.
  • create a feature for window features to create rolling window features based on a list or variable.
  • create a feature for expanding window features to create expanding window features based on a list or variable.
  • fix type-hint error in drop_psi_features.

Copy link

codecov bot commented Feb 25, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.33%. Comparing base (1a090ba) to head (9d999b0).
Report is 4 commits behind head on main.

❗ Current head 9d999b0 differs from pull request most recent head 09db782. Consider uploading reports for the commit 09db782 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #727      +/-   ##
==========================================
+ Coverage   98.31%   98.33%   +0.01%     
==========================================
  Files         103      103              
  Lines        3928     3965      +37     
  Branches      764      774      +10     
==========================================
+ Hits         3862     3899      +37     
  Misses         23       23              
  Partials       43       43              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

feature_engine/selection/drop_psi_features.py Show resolved Hide resolved
_check_X_matches_training_df,
check_X,
)
from feature_engine.dataframe_checks import (_check_contains_inf,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formatting

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that black, isort and flack8 will automatically adjust the formatting and code style, because no problems appeared during development, and CI

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that this is the case...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On circleCI we have automatic checks for formatting, but it does not fix it automatically. You need to isort and black your files before pushing them for the tests to pass.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to resolve it, and push again.

if isinstance(group_by_variables, list):
if len(set(group_by_variables)) != len(group_by_variables):
raise ValueError("group_by_variables contains duplicate values")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To improve the code I will do the following: after the checks on group_by_variables, I will put this variable into a list if the variable is just a string. In this way from now on you know that you are working with a list of string and you don't need to do other checks (like for example the one at line 180).

You could do something like

if group_by_variables:
             if isinstance(group_by_variables, str):
                      self.group_by_variables = [group_by_variables]
             elif not (
                      isinstance(group_by_variables, list) and
                      all(isinstance(element, str) for element in group_by_variables)
             ):
                       raise ValueError(
                              "group_by_variables must be an string or a list of strings. "
                              f"Got {group_by_variables} instead."
                         )
            else:
                     # note that if you are here, then group_by_variables is a list
                     if len(set(group_by_variables)) != len(group_by_variables):
                     raise ValueError("group_by_variables contains duplicate 

Copy link
Collaborator

@solegalli solegalli Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, for consistency with the sklearn api we cannot modify the input parameters. We need to leave them as they are.

@@ -51,6 +42,9 @@ class BaseForecastTransformer(BaseEstimator, TransformerMixin, GetFeatureNamesOu

{drop_original}

group_by_variables: str, list of str, default=None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd call this method group_by to keep it similar to pandas method.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we do allow variable names to be integers, so in theory it could also take int and list of ints.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, sure I am going to rename this parameter, and also allow to pass it as integer or list of integers, and use check_variables methods to check if it exists in dataframe, thank you.

# check if passed list has duplicates.
if isinstance(group_by_variables, list):
if len(set(group_by_variables)) != len(group_by_variables):
raise ValueError("group_by_variables contains duplicate values")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like tests for these 2 error catches are missing? Haven't gotten to the end of the PR so forgive me if I am wrong.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am going to add some test cases for them.

@@ -165,6 +177,18 @@ def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
if self.missing_values == "raise":
self._check_na_and_inf(X)

if self.group_by_variables:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add this functionality to feature-engine? Pandas will fail if the variables are not in the dataframe.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@solegalli Yes, That's True Pandas will fail because of that, I am going to remove it, and push again.

@@ -165,6 +177,18 @@ def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
if self.missing_values == "raise":
self._check_na_and_inf(X)

if self.group_by_variables:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing that comes before this, in line 168, we are explicity forbidding duplicate values in the index, but if we allow groupby, we need to think what we do with this check

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@solegalli I think no duplicates will be in resultant dataframe, because we are creating the features for every group like the example below:

>>> import pandas as pd
    >>> from feature_engine.timeseries.forecasting import ExpandingWindowFeatures
    >>> X = pd.DataFrame(dict(date = ["2022-09-18",
    >>>                          "2022-09-19",
    >>>                          "2022-09-20",
    >>>                          "2022-09-21",
    >>>                          "2022-09-22",
    >>>                          "2022-09-18",
    >>>                          "2022-09-19",
    >>>                          "2022-09-20",
    >>>                          "2022-09-21",
    >>>                          "2022-09-22"],
    >>>                  x1 = [1,2,3,4,5, 3,5,6,8,11],
    >>>                  x2 = [6,7,8,9,10, 2,9,10,15,2],
    >>>                  x3=['a','a','a','a','a', 'b','b','b','b','b']
    >>>                ))
    >>> ewf = ExpandingWindowFeatures(group_by_variables='x3')
    >>> ewf.fit_transform(X)

the result is:

date  x1  x2 x3  x1_expanding_mean  x2_expanding_mean
    0  2022-09-18   1   6  a                NaN                NaN
    1  2022-09-19   2   7  a           1.000000                6.0
    2  2022-09-20   3   8  a           1.500000                6.5
    3  2022-09-21   4   9  a           2.000000                7.0
    4  2022-09-22   5  10  a           2.500000                7.5
    5  2022-09-18   3   2  b                NaN                NaN
    6  2022-09-19   5   9  b           3.000000                2.0
    7  2022-09-20   6  10  b           4.000000                5.5
    8  2022-09-21   8  15  b           4.666667                7.0
    9  2022-09-22  11   2  b           5.500000                9.0

returned expanding window features
"""
tmp_data = []
for _, group in grouped_df:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do we need to loop?

Are we creating a grouped df for every variable passed to group_by_variables?

And is this the desired functionality? For time series forecasting, would we not have all ts in 1 col and then we would group by one or more variables that identify the ts, but we would not create many groups?

When would we need to create many groups?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me explain what I need to do here, the reason behind adding group_by_variables to time series transformers is because of this issue #668 , when we need to create some lags, rolling window, or expanding window features based on a set of groups.
the above code loop over the set of groups to create the features for every group then concatenate them, and sort by index to return the dataframe to its original
let me explain it in the following code

X = pd.DataFrame(dict(date = ["2022-09-18",
                             "2022-09-19",
                             "2022-09-20",
                             "2022-09-21",
                             "2022-09-22",
                             "2022-09-18",
                             "2022-09-19",
                             "2022-09-20",
                             "2022-09-21",
                             "2022-09-22"],
                     x1 = [1,2,3,4,5, 3,5,6,8,11],
                     x2 = [6,7,8,9,10, 2,9,10,15,2],
                     x3=['a','a','a','a','a', 'b','b','b','b','b'],
                     x4=['c','c','c','w','w','c','c','w','w','w']
))

X_grouped = X.groupby(['x3', 'x4'])
for _, group in X_grouped:
    print(group)

the result is the dataframes of every group of ('x3', 'x4')

date  x1  x2 x3 x4
0  2022-09-18   1   6  a  c
1  2022-09-19   2   7  a  c
2  2022-09-20   3   8  a  c
         date  x1  x2 x3 x4
3  2022-09-21   4   9  a  w
4  2022-09-22   5  10  a  w
         date  x1  x2 x3 x4
5  2022-09-18   3   2  b  c
6  2022-09-19   5   9  b  c
         date  x1  x2 x3 x4
7  2022-09-20   6  10  b  w
8  2022-09-21   8  15  b  w
9  2022-09-22  11   2  b  w

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thank you for the explanation. Pandas should apply shift and rolling and expanding to the groups out of the box, there is no need to loop, as far as I understand. See for example these resources: https://www.statology.org/pandas-lag-by-group/

Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Ezzaldin97

Thank you so much for the enormous contribution!

Apologies for the delayed review. There are a few things that I don't understand regarding the logic we are trying to implement.

this transformer is designed for forecasting. In forecasting we normally have all time series in one column and then we have one or more columns that identify them with an id or a combination of characteristics for example store number and product number.

So I thought we'd group one either by one or many variables. But the loop threw me off. maybe there is something that I am not understanding?

Or maybe you see a use case for multiple grouping? in which case we'd need to allow both functionalities to coexist,

Thank you for your patience and support!

@Ezzaldin97
Copy link
Author

hi @solegalli,
I am looking forward to your feedback and any suggestions you may have to improve the pull request. Thank you in advance for your time and assistance.

Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Ezzaldin97

Thank you so much for the amount of work that went into implementing this functionality. I am really sorry for the delay in taking a look at the proposed changes. I know it is frustrating when you put a lot of work towards something, and it is not acknowledged. I apologize. I had a very tight agenda in the last 2 months, which means that I couldn't really catch up with what was going on with Feature-engine.

I think we are aligned with the idea of the functionality that we need to add to these classes. My first impression is that the logic should be much simpler, because pandas can groupby, and then apply shift and rolling to the groups directly, without us having to loop over the groups.

It would be great if you could take a look and try to simplify the logic of the classes and outsource everything to pandas.

Let me know if this makes sense. Thank you so much for your effort!

@@ -475,7 +475,7 @@ def fit(self, X: pd.DataFrame, y: pd.Series = None):
threshold_cat = self.threshold

# Compute the PSI by looping over the features
self.psi_values_ = {}
self.psi_values_: Dict = {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We resolved this in a different PR. Could we remove this change from here please?

@@ -51,6 +51,9 @@ class BaseForecastTransformer(BaseEstimator, TransformerMixin, GetFeatureNamesOu

{drop_original}

group_by: str, str, int, or list of strings or integers, default=None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are using pandas groupby under the hood, so the docs here should probably be identical or just a summary of what we see here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html and then refer the user to pandas groupby's documentation for more details

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated it using a summary of pandas groupby

@@ -81,6 +85,7 @@ def __init__(
self.variables = _check_variables_input_value(variables)
self.missing_values = missing_values
self.drop_original = drop_original
self.group_by = _check_variables_input_value(group_by)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we defer the functionality to pandas, then we don't need this check. We just assign and let pandas handle the rest.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, pandas will handle it, Thanks for your help 🙏

@@ -93,6 +93,9 @@ class ExpandingWindowFeatures(BaseForecastTransformer):

{drop_original}

group_by: str, str, int, or list of strings or integers, default=None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are repeating the same string over and over, instead of writing it multiple times, we'd create a single text in the _docstrings module, and import it instead. Like we do with fit_transform for example.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added to _docstring, Thanks for the clarification.

@@ -139,6 +142,36 @@ class ExpandingWindowFeatures(BaseForecastTransformer):
2 2022-09-20 3 8 1.5 6.5
3 2022-09-21 4 9 2.0 7.0
4 2022-09-22 5 10 2.5 7.5
create expanding window features based on other variables.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the example in the class' docstrings is just meant for the user to "copy and paste" a simple example, not a full blown demo. For that we have the user guide. Could we please keep the original example?

"""
tmp_data = []
for _, group in grouped_df:
tmp = (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to loop over each group. Pandas does that under the hood if I recall correctly. So we'd just add groupby before .expanding. Check these resources:

https://www.statology.org/pandas-lag-by-group/
https://stackoverflow.com/questions/37231844/pandas-creating-a-lagged-column-with-grouped-data

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a simple way to perform the group_by operation to calculate expanding window features using the .apply() method in pandas

@@ -117,6 +120,26 @@ class LagFeatures(BaseForecastTransformer):
2 2022-09-20 3 8 2.0 7.0 1.0 6.0
3 2022-09-21 4 9 3.0 8.0 2.0 7.0
4 2022-09-22 5 10 4.0 9.0 3.0 8.0
create lags based on other variables.
>>> import pandas as pd
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we please keep the original example? Demos go in the user-guide :)

lag feature or dataframe of lag features
"""
tmp_data = []
for _, group in grouped_df:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to loop over the groups to apply the lags? pandas does the lags per group automatically.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried many approaches to simplify this approach, but it is only working when using periods argument with shift() method like the line in 231
, however when using freq argument with shift() method it doesn't work, so I used loop to make it work.
kindly advice if we can simplify it.

returned expanding window features
"""
tmp_data = []
for _, group in grouped_df:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thank you for the explanation. Pandas should apply shift and rolling and expanding to the groups out of the box, there is no need to loop, as far as I understand. See for example these resources: https://www.statology.org/pandas-lag-by-group/

@Ezzaldin97
Copy link
Author

Hi @solegalli
I have updated, and pushed the latest changes according to your last instructions, but there is still no enhancement in lag_features.py file can be done, I have checked many solution to remove the loop when working with freq argument, but no one get the expected output.
the unit test of CircleCI unexpectedly failed, I don't know why, but all tests passed on local
image
please advice to resolve this issue

@KananMahammadli
Copy link
Contributor

@solegalli @Ezzaldin97 if it is ok with you, I would like to implement lag and window feature updates. I wanted to first confirm the design of the implementation before starting. When working with multiple time series, we often have covariates in time-series order, some having future values available (such as weather features), some not; we also use these lag and window features for a target. Therefore, for each variable, one needs to apply different window sizes, and also, for each window, different aggregation methods become important. I suggest using the define method: window_size(s) pairs as a dictionary for each variable and providing them as a final dictionary.
Ex:

Python

feature_method_window_size_dict = {
   "feature1": {
             "mean": 8,
             "std": [8, 9, 10],
            },
   "feature2": {
             "sum": [2, 6],
             "max":  [3, 5, 7],
            },
}

after agreeing on the design, I can start on this feature.

@solegalli
Copy link
Collaborator

Hey @KananMahammadli

I do agree with you that we'd use different windows and different operations for different features, and allowing the split with the dictionary will be handy for the most advanced user. For less experienced users, the more the class can resolve by itself the better, and in this sense, applying the same window and same functions to all features is sort of the simplest.

This PR is about allowing the transformers to create and add the lags and windows grouped by a third variable or variables that would, generally indicate some grouping, like name of the product, or land of sale, or this kind of thing.

A different PR could be to extend the functionality of the class so that it offers more granularity for the advanced user in the method that you just described. And if we do so, we need to do it in a way that does not break backward compatibility whenever possible.

I'd prefer if we could add the grouping function first and the granularity later. But this is open source, so we welcome all help, if you want to start otherwise, you are free to do it :)

@KananMahammadli
Copy link
Contributor

KananMahammadli commented Aug 30, 2024

hi @solegalli,

Sorry for not being clear enough; since this PR doesn't move forward, I thought of handling groupby in a separate PR (because finding why implementation here fails and correcting the bugs is time-consuming, modifying the current version of the feature engine to support groups is a better option). While adding the groupby, I also considered adding the flexibility of variables. I agree that it can confuse some users. Therefore I would like to get your opinion; we can do it in two ways I think:

  • separate classes for fixed and flexible variable usage
    in this setup, I will first add groupby feature to the current fixed variable method, then create a separate class for a flexible method that also supports the groupby
  • single class with enabler
    here we can add enable_flexible_variables, by default this is False, and in that case, we just use the variables argument, and it becomes the same as the current logic, but if given True, then we require mapping to be provided and check if the provided dictionary aligns with the format requirement

@solegalli
Copy link
Collaborator

Hi @KananMahammadli

Thanks for the detail.

Feel free to start a new PR for the groupby function. Please add a note to mention that it superseeds this PR so we can keep track :)

I think single class is easier to maintain and it avoids boilerplate. So I'd suggest sticking to one class, unless the code base is very different, in which case, it could be justified. For end users I think it gets confusing if we have 2 classes that do more or less the same.

If we add the new functionality in one class, it helps if we keep the addition of additional parameters to the minimum. Here for example:

imputer_dict: dict, default=None

If the user enters an imputer_dict, that superseeds the parameter arbitrary value, and we make it clear in the docstrings, so that the user only needs to change one parameter for the intended functionality.

The classes for lags and windows, as they stand today, they have the same parameters that pandas lag and rolling and shift. So I'd probably would not change that, because the users, or at least some of them, are familiar with those parameters already. So it is easy to jump from pandas to feature-engine.

I would add an extra parameter called, I don't know lag_dictionary, or, something informative (can't think of anything now), and when that parameter is not None, then we have the special functionality that you mention.

It would be easier to review, if we have 1 pr for each change. Thanks a lot for the initiative.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants