Autoencoder-imputed distance-preserved embedding (AIDE) is a dimension reduction algorithm for large-scale data. It combines Multidimensional Scaling (MDS) and an autoencoder (AE), aiming to preserve the pairwise distances between data points imputed by the AE when reducing dimensionality.
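The distance-preservation idea can be illustrated with a minimal NumPy sketch: an MDS-style stress term measures how much pairwise distances change after dimension reduction. This is only an illustration of the objective, not the actual AIDE implementation, which couples such a term with the autoencoder's imputation and trains both jointly.
import numpy as np

# Illustration only: an MDS-style stress on a fixed low-dimensional
# representation, not AIDE's actual (jointly trained) objective.
def pairwise_dist(X):
    # Euclidean distance matrix for the rows of X.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(d2, 0.0))

def mds_stress(X_high, Z_low):
    # Mean squared mismatch between pairwise distances before and after reduction.
    return float(np.mean((pairwise_dist(X_high) - pairwise_dist(Z_low)) ** 2))

rng = np.random.default_rng(0)
X = rng.random((100, 1000)).astype(np.float32)              # original high-dimensional data
Z = X @ rng.standard_normal((1000, 2)).astype(np.float32)   # a crude 2-D projection
print(mds_stress(X, Z))  # nonzero stress: the projection distorts pairwise distances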
To use AIDE, simply run the setup.py script to install:
python3 setup.py install
or just add the path of AIDE to $PYTHONPATH without installation:
git clone https://github.com/tinglabs/aide.git
cd aide
export PYTHONPATH=`pwd`:$PYTHONPATH
For small data, a numpy.ndarray or scipy.sparse.csr_matrix can be passed to AIDE directly, with dtype float32 or float64:
import scipy.sparse as sp
import numpy as np
from aide import AIDE, AIDEConfig
n_samples, n_features = 1000, 2000
dtype = np.float32 # np.float32 or np.float64
X = np.random.rand(n_samples, n_features).astype(dtype)
# To pass a sparse matrix instead, sparsify and convert, e.g.:
# X[X < 0.5] = 0.0; X = sp.csr_matrix(X)
config = AIDEConfig()
encoder = AIDE(name='test_aide', save_folder='test_aide')
embedding = encoder.fit_transform(X, config) # np.ndarray; (n_samples, config.mds_units[-1])
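Since fit_transform returns an ordinary NumPy array, its shape can be sanity-checked directly; a minimal follow-up (nothing AIDE-specific here):
# One row per input sample; the width is the size of the last MDS layer.
assert embedding.shape[0] == n_samples
print(embedding.shape, embedding.dtype)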
To avoid running out of memory, large data should first be preprocessed and saved in .tfrecord format, with dtype float32. See test/test.py for details.
For np.ndarray:
import numpy as np
from aide import AIDE, AIDEConfig
from aide.utils_tf import write_ary_to_tfrecord, write_ary_shards_to_tfrecord
n_samples, n_features = 100000, 2000
dtype = np.float32 # np.float32 only
X = np.random.rand(n_samples, n_features).astype(dtype)
train_data_folder = 'train_ary_shards'
pred_data_path = 'pred_ary.tfrecord'
write_ary_shards_to_tfrecord(X, tf_folder=train_data_folder, shard_num=10, shuffle=True)  # shuffled shards for training
write_ary_to_tfrecord(X, tf_path=pred_data_path, shuffle=False)  # single unshuffled file for prediction, keeping row order
info_dict = {'n_samples': n_samples, 'n_features': n_features, 'issparse': False}
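Fitting then uses the same pattern shown at the end of the sparse example below, wrapping the tfrecord paths and info_dict into a single tuple:
X = (train_data_folder, pred_data_path), info_dict
config = AIDEConfig()
encoder = AIDE(name='test_aide', save_folder='test_aide')
embedding = encoder.fit_transform(X, config)  # np.ndarray; (n_samples, config.mds_units[-1])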
For scipy.sparse.csr_matrix:
import scipy.sparse as sp
import numpy as np
from aide import AIDE, AIDEConfig
from aide.utils_tf import write_csr_to_tfrecord, write_csr_shards_to_tfrecord
n_samples, n_features = 100000, 2000
dtype = np.float32 # np.float32 only
X = np.random.rand(n_samples, n_features).astype(dtype)
X[X < 0.5] = 0.0; X = sp.csr_matrix(X)
train_data_folder = 'train_csr_shards'
pred_data_path = 'pred_csr.tfrecord'
write_csr_shards_to_tfrecord(X, tf_folder=train_data_folder, shard_num=10, shuffle=True)
write_csr_to_tfrecord(X, tf_path=pred_data_path, shuffle=False)
info_dict = {'n_samples': n_samples, 'n_features': n_features, 'issparse': True}
X = (train_data_folder, pred_data_path), info_dict
config = AIDEConfig()
encoder = AIDE(name='test_aide', save_folder='test_aide')
embedding = encoder.fit_transform(X, config) # np.ndarray; (n_samples, config.mds_units[-1])
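The final embedding is again a plain NumPy array with one row per sample, so it can be persisted with standard NumPy I/O (the file name is just an example):
assert embedding.shape[0] == info_dict['n_samples']
np.save('aide_embedding.npy', embedding)  # reload later with np.load('aide_embedding.npy')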