Releases · AnotherSamWilson/miceforest
Major Update 6
Major Update 6 comes with improvements in API usability.
- Native support for imputing numpy arrays has been dropped; it made the code too complex.
- Mean match customization is now much simpler, and is handled entirely through parameters instead of custom classes. The parameters `mean_match_strategy` and `mean_match_candidates` are all that are needed to control mean matching.
- The saving and loading of kernels was modernized to use `__getstate__` and `__setstate__`, without the need for a `load_kernel` method.
- Major improvements to testing suite.
- Plotting was moved to plotnine.
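To make the mean matching parameters concrete, here is a minimal, self-contained sketch of the idea behind mean matching (illustrative only, not miceforest's internal code): for each row needing imputation, the candidate rows with the closest model predictions are found, and the observed value of one of them is imputed. The number of candidates considered corresponds roughly to `mean_match_candidates`.

```python
import numpy as np

def mean_match(pred_missing, pred_candidates, candidate_values, k=5, seed=0):
    """Toy mean matching: for each prediction on a missing row, find the
    k candidate rows with the closest predictions, then impute the
    observed value of one of those candidates, chosen at random."""
    rng = np.random.default_rng(seed)
    pred_missing = np.asarray(pred_missing, dtype=float)
    pred_candidates = np.asarray(pred_candidates, dtype=float)
    candidate_values = np.asarray(candidate_values, dtype=float)
    imputed = np.empty_like(pred_missing)
    for i, p in enumerate(pred_missing):
        # Indices of the k candidates whose predictions are nearest to p.
        nearest = np.argsort(np.abs(pred_candidates - p))[:k]
        imputed[i] = candidate_values[rng.choice(nearest)]
    return imputed
```

Because imputed values are drawn from actually observed values, mean matching preserves the marginal distribution of the variable better than plugging in raw predictions.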
Release for Zenodo DOI
This release will generate a DOI for this project.
Stable v5.6.0
This release implemented some major changes:
- Implemented `MeanMatchScheme`
- Implemented mean matching on shap values
- Tighter controls and warnings around categorical levels
- Included type hints for major functions.
This release is marked as stable because the API will not see significant changes in the future.
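Mean matching on shap values means candidates are compared in a multidimensional feature-contribution space, rather than by a single predicted value. A hedged, self-contained sketch of that nearest-neighbor step (not miceforest's actual implementation):

```python
import numpy as np

def match_on_vectors(vecs_missing, vecs_candidates, candidate_values, k=3, seed=0):
    """For each missing row's vector (e.g. its shap values), find the k
    candidate rows with the closest vectors by Euclidean distance, and
    impute the observed value of one of them, chosen at random."""
    rng = np.random.default_rng(seed)
    vecs_missing = np.atleast_2d(np.asarray(vecs_missing, dtype=float))
    vecs_candidates = np.atleast_2d(np.asarray(vecs_candidates, dtype=float))
    candidate_values = np.asarray(candidate_values)
    imputed = []
    for row in vecs_missing:
        dist = np.linalg.norm(vecs_candidates - row, axis=1)
        nearest = np.argsort(dist)[:k]
        imputed.append(candidate_values[rng.choice(nearest)])
    return np.array(imputed)
```

Matching on a vector of contributions rather than a scalar prediction lets two rows with similar predictions but different underlying drivers be treated as different neighbors.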
v5.0.0
- New main classes (`ImputationKernel`, `ImputedData`) replace the old classes (`KernelDataSet`, `MultipleImputedKernel`, `ImputedDataSet`, `MultipleImputedDataSet`).
- Data can now be referenced and imputed in place. This saves a lot of memory allocation and is much faster.
- Data can now be completed in place. This allows for only a single copy of the dataset to be in memory at any given time, even if performing multiple imputation.
- The `mean_match_subset` parameter has been replaced with `data_subset`. This subsets the data used to build the model, as well as the mean matching candidates.
- More performance improvements around when data is copied and where it is stored.
- Raw data is now stored as the original. Can handle pandas DataFrame and numpy ndarray.
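In-place completion can be pictured as writing the stored imputed values back into the one raw array the kernel keeps, instead of copying the dataset per imputation. A minimal sketch of the pattern (names are illustrative, not miceforest's API):

```python
import numpy as np

def complete_in_place(data, imputations):
    """Fill missing cells of `data` in place.

    `imputations` maps column index -> (row_indices, imputed_values).
    No copy of `data` is made, so only one dataset is ever in memory.
    """
    for col, (rows, values) in imputations.items():
        data[rows, col] = values
    return data
```

Returning the same array (rather than a copy) is what keeps memory flat even under multiple imputation: each completed dataset can be produced, used, and overwritten in turn.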
Major update
This release improved a number of areas:
- Huge performance improvements, especially if categorical variables were being imputed. These come from not predicting candidate data if we don't need to, using a much faster neighbors search, using numpy internally for indexing instead of pandas, and others.
- Ability to tune parameters of models, and use best parameters for mice.
- Improvements to code layout - got rid of ImputationSchema.
- Raw data is now stored as a numpy array to save space and improve indexing.
- Numpy arrays can be imputed, if you want to avoid pandas.
- A choice of multiple built-in mean matching functions.
- Mean matching functions can handle most lightgbm objectives.
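The tune-then-impute flow can be sketched generically: search for the model parameters that minimize out-of-sample error on the observed rows, then reuse the winner for every mice iteration. A toy stand-in using polynomial degree as the tuned parameter (illustrative only; miceforest tunes lightgbm parameters, not polynomials):

```python
import numpy as np

def tune_degree(x, y, degrees=(1, 2, 3), holdout=0.3, seed=0):
    """Pick the polynomial degree with the lowest holdout MSE,
    mimicking 'tune once, then reuse the best parameters'."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_hold = max(1, int(len(x) * holdout))
    test, train = idx[:n_hold], idx[n_hold:]
    best_deg, best_mse = None, np.inf
    for deg in degrees:
        coefs = np.polyfit(x[train], y[train], deg)
        mse = np.mean((np.polyval(coefs, x[test]) - y[test]) ** 2)
        if mse < best_mse:
            best_deg, best_mse = deg, mse
    return best_deg, best_mse
```

The payoff is the same as described above: the (expensive) search runs once, and the cheap refits inside mice all use the best parameters found.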
Switch to lightgbm
This is a major release, with breaking API changes:
- The random forest package is now lightgbm
- Much more lightweight (serialized kernels tend to be 5x smaller or more)
- Much faster on big datasets (for comparable parameters)
- More flexible: we can now use gbdt boosting if we wish. lightgbm is more flexible in general.
- Added a mean_match_subset parameter. This will help greatly speed up many processes.
- `mean_match_candidates` now lazily accepts dicts, as long as the keys are a subset of the variables in `variable_schema`.
- Model parameters can be specified by variable, or globally.
- Mean matching function can be overwritten if the user wishes.
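The dict handling described for `mean_match_candidates` (and for per-variable vs. global model parameters) can be sketched as a small resolver: a scalar applies to every variable, while a dict may name only a subset, with the rest falling back to a default. This is illustrative code, not the library's internals:

```python
def resolve_setting(setting, variables, default):
    """Expand a global scalar, or a partial per-variable dict, into a
    complete {variable: value} mapping. Dict keys must be a subset of
    the known variables."""
    if isinstance(setting, dict):
        unknown = set(setting) - set(variables)
        if unknown:
            raise ValueError(f"Unknown variables: {sorted(unknown)}")
        return {v: setting.get(v, default) for v in variables}
    return {v: setting for v in variables}
```

Accepting partial dicts "lazily" like this keeps call sites short: users only spell out the variables they actually want to override.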
Major Update
- Models from all iterations can be saved with `save_models == 2`.
- Kernel classes inherit from base imputed classes, which allows methods to be called on imputed datasets obtained from `impute_new_data()`.
- A time log was added.
- MultipleImputedDataset is now a collection of ImputedDataSets with methods for comparing them. Subscripting gives the desired dataset.
- Tests updated to be much more comprehensive
- Datasets can now be added and removed from a MultipleImputedDataSet/MultipleImputedKernel.
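The collection behavior described above (subscripting to get a dataset, plus adding and removing datasets) can be sketched with a minimal container class; the names here are illustrative stand-ins, not the actual miceforest classes:

```python
class DatasetCollection:
    """Toy stand-in for a multiple-imputation container: holds several
    imputed datasets keyed by integer, supports subscripting, adding,
    and removing."""

    def __init__(self):
        self._datasets = {}

    def __getitem__(self, key):
        # Subscripting gives the desired dataset.
        return self._datasets[key]

    def __len__(self):
        return len(self._datasets)

    def append(self, dataset):
        # New datasets get the next unused integer key.
        self._datasets[max(self._datasets, default=-1) + 1] = dataset

    def remove(self, key):
        del self._datasets[key]
```

Keying by integer (rather than list position) means removing one dataset does not renumber the others, so comparisons between specific datasets stay stable.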
Stable Release
Automatic testing, coverage, and formatting have been implemented. The code is (reasonably) bug free.