Hashed Random Projection layer for TF2/Keras.
Hashed Random Projections (HRP), binary representations, encoding/decoding for storage (notebook)
The random projection or hyperplane is randomly initialized.
The initial state of the PRNG (random_state) is set explicitly (default: 42) to ensure reproducibility.
import keras_hrp as khrp
import tensorflow as tf
BATCH_SIZE = 32
NUM_FEATURES = 64
OUTPUT_SIZE = 1024
# demo inputs
inputs = tf.random.normal(shape=(BATCH_SIZE, NUM_FEATURES))
# instantiate layer
layer = khrp.HashedRandomProjection(
output_size=OUTPUT_SIZE,
random_state=42 # Default: 42
)
# run it
outputs = layer(inputs)
assert outputs.shape == (BATCH_SIZE, OUTPUT_SIZE)
import keras_hrp as khrp
import tensorflow as tf
import numpy as np
BATCH_SIZE = 32
NUM_FEATURES = 64
OUTPUT_SIZE = 1024
# demo inputs
inputs = tf.random.normal(shape=(BATCH_SIZE, NUM_FEATURES))
# create hyperplane as numpy array
myhyperplane = np.random.randn(NUM_FEATURES, OUTPUT_SIZE)
# instantiate layer
layer = khrp.HashedRandomProjection(hyperplane=myhyperplane)
# run it
outputs = layer(inputs)
assert outputs.shape == (BATCH_SIZE, OUTPUT_SIZE)
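To make the idea behind the layer more concrete, the following is a minimal, dependency-free sketch of a hashed random projection: the input is multiplied by a random matrix and only the sign of each result is kept as a bit. It is not the keras_hrp implementation. The helper names make_hyperplane and hashed_random_projection, the standard normal initialization, and the threshold at zero are assumptions made for illustration.

```python
import random


def make_hyperplane(num_features, output_size, seed=42):
    """Draw a (num_features x output_size) matrix of standard normal values.

    Hypothetical helper: a fixed seed plays the role of random_state.
    """
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 1.0) for _ in range(output_size)]
        for _ in range(num_features)
    ]


def hashed_random_projection(x, hyperplane):
    """Project x onto each column and keep 1 if the result is positive, else 0."""
    output_size = len(hyperplane[0])
    bits = []
    for j in range(output_size):
        dot = sum(x[i] * hyperplane[i][j] for i in range(len(x)))
        bits.append(1 if dot > 0 else 0)
    return bits


# Small demo: 4 input features hashed into 16 bits
hyperplane = make_hyperplane(num_features=4, output_size=16, seed=42)
x = [0.5, -1.2, 0.3, 2.0]
bits = hashed_random_projection(x, hyperplane)

# The same seed reproduces the same hyperplane and therefore the same hash
same = hashed_random_projection(x, make_hyperplane(4, 16, seed=42))
```

Because only the sign of each projection is kept, scaling the input by a positive factor leaves the hash unchanged in this sketch.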
In Python, boolean arrays (e.g., NumPy's bool dtype) store each 1-bit value as an 8-bit integer, i.e., 1 byte.
Some database technologies behave in a similar way and use 8 times the theoretically required storage space (e.g., the Postgres boolean type uses 1 byte instead of 1 bit).
To save memory or storage space, chunks of 8 boolean vector elements can be packed into a single 1-byte int8 number.
import keras_hrp as khrp
import numpy as np
# given boolean values
hashvalues = np.array([1, 0, 1, 0, 1, 1, 0, 0])
# serialize boolean to int8
serialized = khrp.bool_to_int8(hashvalues)
# deserialize int8 to boolean
deserialized = khrp.int8_to_bool(serialized)
# check
np.testing.assert_array_equal(deserialized, hashvalues)
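The packing step can also be sketched without any dependencies, to show what happens to each chunk of 8 bits. This is an illustration only and not the code behind khrp.bool_to_int8 or khrp.int8_to_bool. The function names pack_bits and unpack_bits, the most-significant-bit-first order, and the use of unsigned bytes (rather than signed int8) are assumptions and may differ from the library.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, 8 bits per byte, MSB first."""
    if len(bits) % 8 != 0:
        raise ValueError("number of bits must be a multiple of 8")
    packed = []
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            # Shift the bits collected so far and append the next one
            byte = (byte << 1) | (1 if bit else 0)
        packed.append(byte)
    return bytes(packed)


def unpack_bits(packed):
    """Inverse of pack_bits: expand each byte back into 8 values of 0/1."""
    return [(byte >> shift) & 1 for byte in packed for shift in range(7, -1, -1)]


hashvalues = [1, 0, 1, 0, 1, 1, 0, 0]
packed = pack_bits(hashvalues)   # a single byte: 0b10101100 == 172
restored = unpack_bits(packed)
```

With this scheme, a 1024-bit hash fits into 128 bytes instead of the 1024 bytes needed when every bit occupies its own byte.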
The keras-hrp git repo is available as a PyPI package:
pip install keras-hrp
pip install git+ssh://git@github.com/ulf1/keras-hrp.git
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt --no-cache-dir
pip install -r requirements-dev.txt --no-cache-dir
pip install -r requirements-demo.txt --no-cache-dir
(If your git repo is stored in a folder whose path contains whitespace, don't use the subfolder .venv. Use an absolute path without whitespace instead.)
- Jupyter for the examples:
jupyter lab
- Check syntax:
flake8 --ignore=F401 --exclude=$(grep -v '^#' .gitignore | xargs | sed -e 's/ /,/g')
- Run Unit Tests:
PYTHONPATH=. pytest
Publish
# pandoc README.md --from markdown --to rst -s -o README.rst
python setup.py sdist
twine upload -r pypi dist/*
find . -type f -name "*.pyc" | xargs rm
find . -type d -name "__pycache__" | xargs rm -r
rm -r .pytest_cache
rm -r .venv
Please open an issue for support.
Please contribute using GitHub Flow: create a branch, add commits, and open a pull request.
The "Evidence" project was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 433249742 (GU 798/27-1; GE 1119/11-1).
- Until 31 Aug 2023 (v0.1.0), the code repository was maintained within DFG project 433249742.
- Since 01 Sep 2023 (v0.2.0), the code repository has been maintained by @ulf1.
Please cite the arXiv Preprint when using this software for any purpose.
@misc{hamster2023rediscovering,
title={Rediscovering Hashed Random Projections for Efficient Quantization of Contextualized Sentence Embeddings},
author={Ulf A. Hamster and Ji-Ung Lee and Alexander Geyken and Iryna Gurevych},
year={2023},
eprint={2304.02481},
archivePrefix={arXiv},
primaryClass={cs.CL}
}