Feature selection for hard voting classifier and NN sparse weight initialization.
I am naming this software package in memory of my late nephew Max Joshua Hamster (* 2005, † June 18, 2022).
Load toy data set and convert features to binary.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import scale
X = scale(load_breast_cancer().data, axis=0) > 0 # convert to binary features
y = load_breast_cancer().target
Select binary features. Each row in the results list contains the n_select column indices of X, a flag indicating whether each binary feature was negated, and the sum of absolute MCC correlation coefficients between the selected features.
import maxjoshua as mh
idx, neg, rho, results = mh.binsel(
    X, y, preselect=0.8, oob_score=True, subsample=0.5,
    n_select=5, unique=True, n_draws=100, random_state=42)
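The returned idx, neg, and rho describe the finally selected feature combination, while results holds the other evaluated combinations. A minimal inspection sketch (it assumes each row of results is laid out as described above):

print(idx)  # column indices of the selected binary features
print(neg)  # flags marking which selected features are used negated
print(rho)  # sum of absolute MCC correlations between the selected features
for sel_idx, sel_neg, sel_rho in results:  # compare alternative combinations
    print(sel_idx, sel_neg, sel_rho)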
Algorithm.
The task is to select e.g. n_select features from a pool of many features. These features might be the predictions of binary classifiers. The selected features are then combined into one hard-voting classifier. A voting classifier should have the following properties:
- each voter (a binary feature) should be highly correlated with the target variable (see the MCC sketch after this list),
- the selected features should be uncorrelated with each other.
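Both properties can be quantified with the Matthews correlation coefficient (MCC), which measures the correlation between two binary variables. A minimal sketch of the first property using scikit-learn (the helper rank_by_mcc is hypothetical, not part of maxjoshua):

import numpy as np
from sklearn.metrics import matthews_corrcoef

def rank_by_mcc(X, y):
    # hypothetical helper: absolute MCC between each binary column of X and the target y
    return np.array([abs(matthews_corrcoef(y, X[:, j].astype(int)))
                     for j in range(X.shape[1])])

scores = rank_by_mcc(X, y)
print(scores.argsort()[::-1][:10])  # the 10 candidate voters most correlated with y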
The algorithm works as follows:
- Generate multiple correlation matrices by bootstrapping. This includes computing corr(X_i, X_j) as well as corr(Y, X_i). Also store the out-of-bag (OOB) samples for evaluation.
- For each correlation matrix do ...
  a. Preselect the i* with the highest abs(corr(Y, X_i)) estimates (e.g. pick the n_pre highest absolute correlations).
  b. Slice the correlation matrix corr(X_i*, X_j*) and find the least correlated combination of n_select features (see korr.mincorr).
  c. Compute the OOB performance (see step 1) of the hard-voter with the selected n_select features.
- Select the feature combination with the best OOB performance as the final model.
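Given idx and neg from the mh.binsel example above, the selected binary features can be combined into a hard-voting prediction by majority vote. A minimal sketch (it assumes that a True entry in neg means the corresponding selected feature enters the vote negated):

import numpy as np
# stack the selected binary features as voters, negating where flagged
votes = np.column_stack([
    ~X[:, i] if flip else X[:, i]
    for i, flip in zip(idx, neg)])
# majority vote across the n_select voters
y_pred = votes.mean(axis=1) > 0.5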
Load toy dataset.
from sklearn.preprocessing import scale
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
X = scale(housing["data"], axis=0)
y = scale(housing["target"])
Select real-numbered features. Each row in the results list contains the n_select column indices of X, the ridge regression coefficients beta, and the RMSE loss.
Warning! Please note that the features X and the target y must be scaled, because mh.fltsel uses an L2-penalty on the beta coefficients and doesn't use an intercept term to shift y.
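To see why, here is a minimal sketch of an intercept-free ridge estimate (for illustration only, not the mh.fltsel implementation): without an intercept the fit is forced through the origin, so the target must be centered, and because the L2-penalty shrinks all coefficients equally, the feature columns should have comparable (unit) variance.

import numpy as np
# intercept-free ridge estimate: beta = (X'X + l2 * I)^{-1} X'y
def ridge_no_intercept(X, y, l2=0.01):
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + l2 * np.eye(k), X.T @ y)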
import maxjoshua as mh
from sklearn.preprocessing import scale
idx, beta, loss, results = mh.fltsel(
    scale(X), scale(y), preselect=0.8, oob_score=True, subsample=0.5,
    n_select=5, unique=True, n_draws=100, random_state=42, l2=0.01)
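The selected columns and their coefficients can be turned into predictions directly. A minimal sketch (it assumes beta is ordered like the columns in idx):

import numpy as np
# linear prediction from the selected columns and their ridge coefficients
y_pred = scale(X)[:, idx] @ np.asarray(beta)
# RMSE against the scaled target, comparable to the reported loss
rmse = np.sqrt(np.mean((y - y_pred) ** 2))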
The idea is to run mh.fltsel to generate an ensemble of linear models, and to combine them in a sparse linear neural network layer, i.e., the number of output neurons is the ensemble size.
In case of small datasets, the sparse NN layer is non-trainable because each submodel was already estimated and selected with two-way data splits in mh.fltsel (see oob_scores and subsample).
The sparse NN layer basically produces the submodel predictions for the meta model in the next layer, i.e., a simple dense linear layer.
The inputs of the sparse NN layer must be normalized, for which a layer normalization layer is trained.
import maxjoshua as mh
import tensorflow as tf
import sklearn.preprocessing
# create toy dataset
import sklearn.datasets
X, y = sklearn.datasets.make_regression(
    n_samples=1000, n_features=100, n_informative=20, n_targets=3)
# feature selection
# - always scale the inputs and targets -
indices, values, num_in, num_out = mh.pretrain_submodels(
    sklearn.preprocessing.scale(X),
    sklearn.preprocessing.scale(y),
    num_out=64, n_select=3)
# specify the model
model = tf.keras.models.Sequential([
    # sub-models
    mh.SparseLayerAsEnsemble(
        num_in=num_in,
        num_out=num_out,
        sp_indices=indices,
        sp_values=values,
        sp_trainable=False,
        norm_trainable=True,
    ),
    # meta model
    tf.keras.layers.Dense(
        units=3, use_bias=False,
        # kernel_constraint=tf.keras.constraints.NonNeg()
    ),
    # scale up
    mh.InverseTransformer(
        units=3,
        init_bias=y.mean(),
        init_scale=y.std()
    )
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(
        learning_rate=3e-4, beta_1=.9, beta_2=.999, epsilon=1e-7, amsgrad=True),
    loss='mean_squared_error'
)
# train
history = model.fit(X, y, epochs=3)
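After training, the model behaves like any other Keras model; a brief usage sketch:

# predict with the trained stack of sub-models and meta model
y_pred = model.predict(X)
print(y_pred.shape)  # one row per sample, one column per regression target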
The maxjoshua git repo is available as a PyPI package:
pip install maxjoshua
Install a development environment:
python3.7 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -r requirements-dev.txt
pip install -r requirements-demo.txt
(If your git repo is stored in a folder whose path contains whitespace, then don't use the subfolder .venv. Use an absolute path without whitespace instead.)
- Jupyter for the examples:
jupyter lab
- Check syntax:
flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')
- Run Unit Tests:
pytest
Publish
pandoc README.md --from markdown --to rst -s -o README.rst
python setup.py sdist
twine upload -r pypi dist/*
Clean up
find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .venv
Please open an issue for support.
Please contribute using GitHub Flow: create a branch, add commits, and open a pull request.