escherba/python-cityhash

CityHash/FarmHash

Python wrapper for FarmHash and CityHash, a family of fast non-cryptographic hash functions.

Getting Started

To install from PyPI:

pip install cityhash

To install in a Conda environment:

conda install -c conda-forge python-cityhash

The package exposes Python APIs for CityHash and FarmHash under cityhash and farmhash namespaces, respectively. Each provides 32-, 64- and 128-bit implementations.

Usage Examples

Stateless hashing

Usage example for FarmHash:

>>> from farmhash import FarmHash32, FarmHash64, FarmHash128
>>> FarmHash32("abc")
1961358185
>>> FarmHash64("abc")
2640714258260161385
>>> FarmHash128("abc")
76434233956484675513733017140465933893

Hardware-independent fingerprints

Fingerprints are seedless hashes that are guaranteed to be hardware- and platform-independent. This can be useful for networking applications which require persisting hashed values.

>>> from farmhash import Fingerprint128
>>> Fingerprint128("abc")
76434233956484675513733017140465933893
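
Since fingerprints are meant to be persisted or sent between machines, a short sketch of one way to serialize a 128-bit value may help. The integer below is copied from the Fingerprint128("abc") example above. The 16-byte big-endian layout is an assumption chosen for illustration, not something this package prescribes.

```python
# Value copied from the Fingerprint128("abc") example above.
fingerprint = 76434233956484675513733017140465933893

# 128 bits = 16 bytes; fixing the byte order explicitly keeps the
# stored form identical across machines (big-endian is an arbitrary choice).
encoded = fingerprint.to_bytes(16, byteorder="big")

# Reading the bytes back with the same byte order recovers the integer.
decoded = int.from_bytes(encoded, byteorder="big")
```

Any fixed width and byte order would do, as long as the writer and the reader agree on them.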

Incremental hashing

CityHash and FarmHash do not support incremental hashing and thus are not ideal for hashing of character streams. If you require incremental hashing, consider another hashing library, such as MetroHash or xxHash.
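
If the data arrives in chunks but still fits in memory, one possible workaround is to join the chunks and hash them in a single call. The sketch below is only an illustration: hash_stream is a hypothetical helper, not part of this package, and zlib.crc32 stands in for FarmHash64 so that the example runs without the package installed.

```python
import zlib
from typing import Callable, Iterable


def hash_stream(chunks: Iterable[bytes], hash_func: Callable[[bytes], int]) -> int:
    """Hash a chunked input with a one-shot hash function.

    Hypothetical helper: the chunks are joined in memory first, so this
    is buffering, not true incremental hashing.
    """
    return hash_func(b"".join(chunks))


# zlib.crc32 is a stand-in; in real use, pass farmhash.FarmHash64 instead.
chunks = [b"hello, ", b"world"]
digest = hash_stream(chunks, zlib.crc32)
```

The result equals the hash of the concatenated input. For streams too large to buffer, a library with a real incremental interface remains the better choice.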

Fast hashing of NumPy arrays

The Buffer Protocol allows Python objects to expose their data as raw byte arrays for fast access without having to copy to a separate location in memory. NumPy is one well-known library that extensively uses this protocol.
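
To make this concrete, here is a minimal sketch that uses only the standard library to inspect the bytes an object exposes through the buffer protocol. The helper exposed_bytes is a hypothetical name introduced for illustration, and it copies the data only so the contents can be displayed.

```python
from array import array


def exposed_bytes(obj) -> bytes:
    """Return the raw bytes that obj exposes through the buffer protocol."""
    with memoryview(obj) as view:  # wraps the object's memory without copying
        if not view.contiguous:
            raise ValueError("buffer is not contiguous")
        return view.tobytes()  # copy made here only to show the contents


# Different object types can expose the same underlying bytes.
from_bytearray = exposed_bytes(bytearray(b"abc"))
from_array = exposed_bytes(array("B", [97, 98, 99]))

# Slicing with a step produces a strided, non-contiguous view.
strided = memoryview(b"abcdef")[::2]
```

Both calls return b"abc", while the strided view reports contiguous as False and would be rejected by the helper.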

All hashing functions in this package will read byte arrays from objects that expose them via the buffer protocol. Here is an example showing hashing of a three-dimensional NumPy array:

>>> import numpy as np
>>> from farmhash import FarmHash64
>>> arr = np.zeros((256, 256, 4))
>>> FarmHash64(arr)
1550282412043536862

The NumPy arrays need to be contiguous for this to work. To convert a non-contiguous array, use NumPy's ascontiguousarray() function.
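
A minimal sketch of that conversion, using only NumPy calls, might look like this. The array contents and the slicing pattern are made up for illustration, and the final hashing call is shown as a comment so the snippet runs without this package installed.

```python
import numpy as np

arr = np.arange(12, dtype=np.int64).reshape(3, 4)

# Taking every other column yields a strided view over the original memory.
view = arr[:, ::2]
contiguous_before = bool(view.flags["C_CONTIGUOUS"])

# ascontiguousarray() copies the data into one contiguous block.
fixed = np.ascontiguousarray(view)
contiguous_after = bool(fixed.flags["C_CONTIGUOUS"])

# fixed can now be passed to a hashing function, e.g. FarmHash64(fixed).
```

The conversion copies the data, so for large arrays it is worth checking the flags first and converting only when needed.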

SSE4.2 support

For x86-64 platforms, the PyPI repository for this package includes wheels compiled with SSE4.2 support. The 32- and 64-bit (but not the 128-bit) variants of FarmHash significantly benefit from SSE4.2 instructions.

The vanilla CityHash functions (in the cityhash module) do not take advantage of SSE4.2. Instead, one can use the cityhashcrc module provided with this package, which exposes 128- and 256-bit CRC functions that do harness SSE4.2. These functions are very fast and beat FarmHash128 on speed (FarmHash does not include a 256-bit function). However, since FarmHash is the intended successor of CityHash, I would be careful with the CityHash-CRC functions and would first verify that they provide sufficient randomness for your intended application.

Development

Local workflow

For those wanting to contribute, here is a quick start using Make commands:

git clone https://github.com/escherba/python-cityhash.git
cd python-cityhash
make env           # create a virtual environment
make test          # run Python tests
make cpp-test      # run C++ tests
make shell         # enter IPython shell

To find out which Make targets are available, enter:

make help

Distribution

The package wheels are built using cibuildwheel and are distributed to PyPI using GitHub Actions. The wheels contain compiled binaries and are available for the following platforms: windows-amd64, ubuntu-x86, linux-x86_64, linux-aarch64, and macosx-x86_64.

See Also

For other fast non-cryptographic hash functions available as Python extensions, see MetroHash, MurmurHash, and xxHash.

Authors

The original CityHash Python bindings are due to Alexander [Amper] Marshalov. They were rewritten in Cython by Eugene Scherba, who also added the FarmHash bindings. The CityHash and FarmHash algorithms and their C++ implementation are by Google.

License

This software is licensed under the MIT License. See the included LICENSE file for details.