AIDE

Autoencoder-imputed distance-preserved embedding (AIDE) is a dimensionality reduction algorithm for large-scale data. It combines multidimensional scaling (MDS) with an autoencoder (AE): when reducing dimension, it aims to preserve the pairwise distances between the imputed data points generated by the AE.
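
To make the distance-preservation idea concrete, here is a minimal NumPy sketch of an MDS-style stress: the mean squared difference between pairwise Euclidean distances in the input space and in the embedding. It only illustrates the general objective. The helper names `pairwise_dist` and `stress` are hypothetical and not part of the aide package, and the loss AIDE actually optimizes (which works on AE-imputed data) may differ.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distance matrix between all rows of X."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives caused by round-off

def stress(X_high, X_low):
    """Mean squared mismatch between pairwise distances in the two spaces."""
    d_high = pairwise_dist(X_high)
    d_low = pairwise_dist(X_low)
    return float(np.mean((d_high - d_low) ** 2))

rng = np.random.default_rng(0)
X = rng.random((50, 20))
identical = stress(X, X)         # an "embedding" that keeps every distance
truncated = stress(X, X[:, :2])  # keeping only 2 coordinates distorts distances
```

A lower stress means the embedding keeps the pairwise distances more faithfully. Here `identical` is zero by construction, while `truncated` is positive.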

Installation

To use AIDE, simply run the setup.py script to install:

python3 setup.py install

or just add the path of AIDE to $PYTHONPATH without installation:

git clone https://github.com/tinglabs/aide.git
cd aide
export PYTHONPATH=`pwd`:$PYTHONPATH
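
After either route, a quick check that Python can locate the package may save some debugging. The sketch below uses only the standard library. `is_importable` is a hypothetical helper name, and the module name `aide` is taken from the import statements in the examples that follow.

```python
import importlib.util

def is_importable(module_name):
    """Return True if Python can locate a top-level module on the current path."""
    return importlib.util.find_spec(module_name) is not None

if is_importable("aide"):
    print("aide found on the Python path")
else:
    print("aide not found: check the install step or $PYTHONPATH")
```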

Running

Small data

For small data, a numpy.ndarray or scipy.sparse.csr_matrix with dtype float32 or float64 can be passed to AIDE directly:

import scipy.sparse as sp
import numpy as np
from aide import AIDE, AIDEConfig

n_samples, n_features = 1000, 2000
dtype = np.float32	# np.float32 or np.float64
X = np.random.rand(n_samples, n_features).astype(dtype)
# X[X < 0.5] = 0.0; X = sp.csr_matrix(X)

config = AIDEConfig()
encoder = AIDE(name='test_aide', save_folder='test_aide')
embedding = encoder.fit_transform(X, config) # np.ndarray; (n_samples, config.mds_units[-1])
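
If the data arrives as integers or in another sparse format, it can be cast before it is handed to AIDE. The following is a sketch under the dtype and container requirements stated above. `to_aide_input` is a hypothetical helper, not part of the aide package.

```python
import numpy as np
import scipy.sparse as sp

def to_aide_input(X, dtype=np.float32):
    """Cast dense or sparse data to a float dtype; sparse input becomes CSR."""
    dt = np.dtype(dtype)
    if dt not in (np.dtype(np.float32), np.dtype(np.float64)):
        raise ValueError("dtype must be float32 or float64")
    if sp.issparse(X):
        return sp.csr_matrix(X, dtype=dt)
    return np.asarray(X, dtype=dt)

counts = np.random.randint(0, 5, size=(100, 30))  # integer data, e.g. raw counts
X_dense = to_aide_input(counts)                    # float32 ndarray
X_sparse = to_aide_input(sp.coo_matrix(counts))    # float32 csr_matrix
```

Either `X_dense` or `X_sparse` could then take the place of `X` in the example above.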

Large data

To avoid memory errors, large data must first be preprocessed and saved in .tfrecord format, with dtype float32. See test/test.py for details.
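
As a rough guide to when data counts as "large", the dense in-memory size can be estimated from the shape and dtype. This is back-of-the-envelope arithmetic only. `dense_nbytes` is a hypothetical helper, and the point at which memory errors actually occur depends on the machine and on any intermediate copies made during training.

```python
import numpy as np

def dense_nbytes(n_samples, n_features, dtype=np.float32):
    """Bytes needed to hold a dense (n_samples, n_features) array of dtype."""
    return n_samples * n_features * np.dtype(dtype).itemsize

n_samples, n_features = 100000, 2000
size_gb = dense_nbytes(n_samples, n_features) / 1e9  # 0.8 GB as float32
```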

Make .tfrecord files

For np.ndarray:

import numpy as np
from aide import AIDE, AIDEConfig
from aide.utils_tf import write_ary_to_tfrecord, write_ary_shards_to_tfrecord

n_samples, n_features = 100000, 2000
dtype = np.float32	# np.float32 only
X = np.random.rand(n_samples, n_features).astype(dtype)

train_data_folder = 'train_ary_shards'
pred_data_path = 'pred_ary.tfrecord'
write_ary_shards_to_tfrecord(X, tf_folder=train_data_folder, shard_num=10, shuffle=True)
write_ary_to_tfrecord(X, tf_path=pred_data_path, shuffle=False)
info_dict = {'n_samples': n_samples, 'n_features': n_features, 'issparse': False}

For scipy.sparse.csr_matrix:

import scipy.sparse as sp
import numpy as np
from aide import AIDE, AIDEConfig
from aide.utils_tf import write_csr_to_tfrecord, write_csr_shards_to_tfrecord

n_samples, n_features = 100000, 2000
dtype = np.float32	# np.float32 only
X = np.random.rand(n_samples, n_features).astype(dtype)
X[X < 0.5] = 0.0; X = sp.csr_matrix(X)

train_data_folder = 'train_csr_shards'
pred_data_path = 'pred_csr.tfrecord'
write_csr_shards_to_tfrecord(X, tf_folder=train_data_folder, shard_num=10, shuffle=True)
write_csr_to_tfrecord(X, tf_path=pred_data_path, shuffle=False)
info_dict = {'n_samples': n_samples, 'n_features': n_features, 'issparse': True}
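
In both cases the `info_dict` can be derived from the data itself instead of being typed by hand. The sketch below assumes the three keys shown above are the ones AIDE expects. `make_info_dict` is a hypothetical helper, not part of the aide package.

```python
import numpy as np
import scipy.sparse as sp

def make_info_dict(X):
    """Build the metadata dict from a 2-D dense array or sparse matrix."""
    n_samples, n_features = X.shape
    return {
        'n_samples': int(n_samples),
        'n_features': int(n_features),
        'issparse': bool(sp.issparse(X)),
    }

X_dense = np.zeros((6, 4), dtype=np.float32)
X_csr = sp.csr_matrix(X_dense)
info_dense = make_info_dict(X_dense)  # issparse is False
info_csr = make_info_dict(X_csr)      # issparse is True
```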

Run AIDE with .tfrecord files

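# Reuses train_data_folder, pred_data_path and info_dict from the previous step.
# X is a 2-tuple: ((train_folder, pred_path), info_dict)
X = (train_data_folder, pred_data_path), info_dict
config = AIDEConfig()
encoder = AIDE(name='test_aide', save_folder='test_aide')
embedding = encoder.fit_transform(X, config) # np.ndarray; (n_samples, config.mds_units[-1])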
