At the moment the features are standardized before the evaluation loops (mean removal and division by the standard deviation) with the following in master.py:
data_features = (data_features - data_features.mean()) / data_features.std()
And they are also normalized (mean removal and division by the l2-norm) in each cross-validation fold with the following in modeller.py:
model = Ridge(normalize=True)
This is not optimal because:
Normalization cancels out the standardization in the Ridge regression
All data transformations should be done independently in the cross-validation folds
Normalization is done for Ridge and not for the other models (this is not necessarily an issue)
Strangely, for some configs (2000 for example) removing the normalization in the Ridge regression has a large impact on the results (R2 drops from 20% to 0%)!
A possibility to implement more complex transformations inside each cross-validation fold is to use the Pipeline class of sklearn. For example, to perform scaling (between 0 and 1) and Ridge, we would do:
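The original code example was not captured here; a minimal sketch of such a pipeline, assuming sklearn's MinMaxScaler for the [0, 1] scaling and toy data standing in for the real features, would be:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Scale each feature to [0, 1], then fit Ridge; inside cross_val_score the
# scaler is re-fit on the training portion of each fold only, so no
# information leaks from the held-out fold.
model = Pipeline([
    ('scaler', MinMaxScaler()),
    ('ridge', Ridge()),
])

# Toy data standing in for data_features / the target (illustrative only)
rng = np.random.RandomState(0)
X = rng.rand(50, 4)
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.randn(50)

scores = cross_val_score(model, X, y, cv=5, scoring='r2')
```

This is the fix for the second point above: because the pipeline is the estimator passed to cross-validation, the transformation is learned independently in each fold.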
However, my attempts to combine normalization and Ridge in a pipeline have led to very different results compared to using the normalize=True argument of the Ridge regression...
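One likely source of that discrepancy (an assumption, not confirmed in this thread): sklearn's Normalizer rescales each sample (row) to unit l2 norm, whereas Ridge's normalize=True argument (since deprecated and removed) centred each feature (column) and divided it by its l2 norm. A sketch of the difference on an illustrative array:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Normalizer works per ROW: each sample is scaled to unit l2 norm.
row_normed = Normalizer().fit_transform(X)

# normalize=True worked per COLUMN: centre each feature, then divide
# it by its l2 norm (reproduced manually here).
X_centered = X - X.mean(axis=0)
col_normed = X_centered / np.linalg.norm(X_centered, axis=0)
```

So a Pipeline with Normalizer is not equivalent to Ridge(normalize=True), and different results are expected.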
@lorenzori I changed the standardization (dividing by the std) of the features to a max normalization (dividing by the max) to fix the problem of outliers in the features. The impact on R2 in Mali was minimal, but this does not solve the problem of applying a different re-scaling between the evaluation and the scoring set.
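A minimal sketch of that change, assuming only the divisor was swapped (the column names and values here are illustrative, not from master.py); with a large outlier, dividing by the max keeps the rescaled values bounded, while dividing by the std lets the outlier dominate:

```python
import pandas as pd

# Illustrative features with an outlier (100.0) in column 'a'
data_features = pd.DataFrame({'a': [1.0, 2.0, 100.0],
                              'b': [0.5, 0.6, 0.7]})

# Before: standardization (mean removal, division by the std)
std_scaled = (data_features - data_features.mean()) / data_features.std()

# After: max normalization (mean removal, division by the max)
max_scaled = (data_features - data_features.mean()) / data_features.max()
```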