
Scale, Normalize or Standardize ? #14

Open
pasquierjb opened this issue Jun 5, 2018 · 1 comment

@pasquierjb
Collaborator

At the moment, the features are standardized before the evaluation loops (mean removal and division by the standard deviation) with the following line in master.py:
data_features = (data_features - data_features.mean()) / data_features.std()

They are also normalized (mean removal and division by the l2-norm) in each cross-validation fold with the following line in modeller.py:
model = Ridge(normalize=True)

This is not optimal because:

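  • Normalization cancels out standardization in the Ridge regression: each column is re-centered and divided by its l2-norm inside the fold, so the earlier division by the standard deviation has no effect on the fit
  • All data transformations should be done independently in the cross-validation folds
  • Normalization is done for Ridge and not for the other models (this is not necessarily an issue)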
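The first point can be checked numerically. The sketch below is only an illustration on synthetic data: `center_and_l2_normalize` is a hypothetical helper that mimics the column-wise operation described for normalize=True, and `X_raw`, `X_std` and `fold` are made-up names, not objects from master.py or modeller.py.

```python
import numpy as np

def center_and_l2_normalize(X):
    """Hypothetical helper: center each column, then divide it by its l2-norm."""
    X_centered = X - X.mean(axis=0)
    return X_centered / np.linalg.norm(X_centered, axis=0)

rng = np.random.default_rng(0)
# Synthetic features on very different scales (assumption for illustration).
X_raw = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(40, 3))

# Global standardization, similar in spirit to the line in master.py.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)

# Pretend the first 30 rows are the training part of one fold.
fold = slice(0, 30)
from_raw = center_and_l2_normalize(X_raw[fold])
from_std = center_and_l2_normalize(X_std[fold])

# The normalized training matrix is the same with or without the
# earlier standardization, so standardizing first changes nothing here.
identical = np.allclose(from_raw, from_std)
print(identical)
```

The result follows from the fact that centering and dividing by the column norm is invariant to any earlier shift and positive rescaling of that column.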

Strangely, for some configs (2000, for example), removing the normalization from the Ridge regression has a large impact on the results (R2 drops from 20% to 0%)!

One way to implement more complex transformations inside each cross-validation fold is to use the Pipeline class of sklearn. For example, to perform scaling (between 0 and 1) followed by Ridge, we would do:

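from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
model = Ridge()
minmax_scaler = MinMaxScaler()
pipeline = make_pipeline(minmax_scaler, model)
scores = cross_val_score(pipeline, X, y)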

However, my attempts to combine normalization and Ridge in a pipeline have led to very different results compared to using the normalize=True argument of the Ridge regression...
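One possible explanation, which is an assumption since the attempted pipeline is not shown here: sklearn's Normalizer rescales each sample (row) to unit norm, whereas normalize=True rescaled each feature (column) after centering. In addition, dividing columns by their l2-norm instead of their standard deviation changes the effective strength of alpha by a factor of n_samples. The sketch below, on synthetic data with hypothetical variable names, reproduces the column-wise normalization by hand and compares it with a StandardScaler pipeline whose alpha is multiplied by n_samples.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer, StandardScaler

rng = np.random.default_rng(0)
# Synthetic data (assumption for illustration only).
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])
y = X @ np.array([0.5, -0.2, 0.01]) + rng.normal(size=50)

alpha = 1.0
n_samples = X.shape[0]

# Column-wise version: center each feature and divide by its l2-norm,
# as described for normalize=True, then fit a plain Ridge.
X_centered = X - X.mean(axis=0)
X_normed = X_centered / np.linalg.norm(X_centered, axis=0)
pred_manual = Ridge(alpha=alpha).fit(X_normed, y).predict(X_normed)

# Pipeline version: StandardScaler divides by the std, which is the
# l2-norm divided by sqrt(n_samples), so alpha is scaled by n_samples.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=alpha * n_samples))
pred_pipe = pipe.fit(X, y).predict(X)
same_predictions = np.allclose(pred_manual, pred_pipe)

# Normalizer works on rows instead: every sample ends up with unit norm.
row_norms = np.linalg.norm(Normalizer().fit_transform(X), axis=1)
print(same_predictions, np.allclose(row_norms, 1.0))
```

If the pipeline used Normalizer, or kept the same alpha after switching scalers, a large difference in R2 would be expected rather than surprising.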

@pasquierjb
Collaborator Author

@lorenzori I changed the standardization of the features (dividing by the std) to a max normalization (dividing by the max) to fix the problem of outliers in the features. The impact on R2 in Mali was minimal, but this does not solve the problem of applying a different re-scaling to the evaluation set and the scoring set.
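A common way to keep the re-scaling consistent is to fit the scaler once on the evaluation data and reuse the fitted object on the scoring data. The sketch below assumes non-negative features, for which sklearn's MaxAbsScaler (division by the maximum absolute value) matches division by the max; `X_eval` and `X_score` are hypothetical stand-ins for the two sets.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

rng = np.random.default_rng(0)
# Hypothetical non-negative features for the two sets.
X_eval = rng.uniform(0.0, 50.0, size=(100, 3))
X_score = rng.uniform(0.0, 80.0, size=(20, 3))

# Learn the per-feature maxima on the evaluation set only.
scaler = MaxAbsScaler().fit(X_eval)

# Apply the same maxima to both sets, so a given raw value maps to
# the same scaled value wherever it appears.
X_eval_scaled = scaler.transform(X_eval)
X_score_scaled = scaler.transform(X_score)
```

With this approach, scoring values can exceed 1 when the scoring set contains larger raw values than the evaluation set, which is the intended behaviour rather than an error.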
