The knncolle Python package implements Python bindings to the C++ library of the same name for nearest neighbor (NN) searches. Downstream packages can re-use the NN search algorithms in knncolle, either via Python or by directly calling C++ through shared pointers. This is inspired by the BiocNeighbors Bioconductor package, which does the same thing for R packages.
Install it:
pip install knncolle
And run the desired search:
# Mocking up data with 20 dimensions, 1000 observations
import numpy
y = numpy.random.rand(20, 1000)
# Building a search index with vantage point trees:
import knncolle
params = knncolle.VptreeParameters()
idx = knncolle.build_index(params, y)
# Performing the search:
res = knncolle.find_knn(idx, num_neighbors=10)
res.index
res.distance
Check out the reference documentation for details.
We can easily switch to a different algorithm by just passing a different params
object.
For example, we could use the Approximate Nearest Neighbors Oh Yeah (Annoy) algorithm:
an_params = knncolle.AnnoyParameters()
an_idx = knncolle.build_index(an_params, y)
We can also tweak the search parameters in our Parameters
object, during or after construction.
For example, with the hierarchical navigable small worlds (HNSW) algorithm:
h_params = knncolle.HnswParameters(num_links=20, distance="Manhattan")
h_params.ef_construction = 150
h_idx = knncolle.build_index(h_params, y)
Currently, we support Annoy, HNSW, vantage point trees, k-means k-nearest neighbors, and (for testing) an exhaustive brute-force search. More algorithms can be added by extending knncolle as described below without any change to end-user code.
Given a query dataset, we can find the nearest neighbors in the prebuilt search index:
q = numpy.random.rand(20, 50)
qres = knncolle.query_knn(idx, q, num_neighbors=10)
qres.index
qres.distance
We can ask find_knn()
to report variable numbers of neighbors for each observation:
variable_k = (numpy.random.rand(y.shape[1]) * 10).astype(numpy.uint32)
var_res = knncolle.find_knn(idx, num_neighbors=variable_k)
var_res.index
var_res.distance
We can find all observations within a distance threshold of each observation via find_neighbors()
.
This also supports a variable threshold for each observation as well as querying of observations in a separate dataset.
range_res = knncolle.find_neighbors(idx, threshold=10)
range_res.index
range_res.distance
The raison d'être of the knncolle Python package is to enable re-use within (pybind11-wrapped) C++ code in other Python packages. The idea is that downstream packages will link against the knncolle C++ interface so that they can re-use the search indices created by the knncolle Python package. This allows downstream packages to (i) save time by avoiding the need to re-compile all algorithms and (ii) support more algorithms in knncolle extensions. To do so:
- Add
knncolle.includes()
andassorthead.includes()
to the compiler's include path for the package. This can be done throughinclude_dirs=
of theExtension()
definition insetup.py
or by adding atarget_include_directories()
in CMake, depending on the build system. - Call
knncolle.build_index()
to construct aGenericIndex
instance. This exposes a shared pointer to the C++-allocated index via itsptr
property. - Pass
ptr
to C++ code as auintptr_t
referencing aknncolle::Prebuilt
. which can be interrogated as described in the knncolle documentation.
So, for example, the C++ code in our downstream package might look like this:
#include "knncolle_py.h"
int do_something(uintptr_t ptr) {
const auto& prebuilt = knncolle_py::cast_prebuilt(ptr)->ptr;
// Do something with the search index interface.
return 1;
}
PYBIND11_MODULE(lib_downstream, m) {
m.def("do_something", &do_something);
}
Which can then be called from Python:
from . import lib_downstream as lib
from knncolle import GenericIndex
def do_something(idx: GenericIndex):
return lib.do_something(idx.ptr)
In some scenarios, it may be more convenient to construct the search index inside C++,
e.g., if the dataset to be searched is not available before the call to the C++ function.
This can be accommodated by accepting a uintptr_t
to a knncolle::Builder
in the C++ code:
#include "knncolle_py.h"
int do_something_mk2(uintptr_t ptr) {
const auto& builder = knncolle_py::cast_builder(ptr)->ptr;
// The builder is a algorithm-specific factory that accepts a matrix and
// returns a search index for that algorithm. Presumably we construct a
// new search index inside this function and use it.
return 1;
}
PYBIND11_MODULE(lib_downstream, m) {
m.def("do_something_mk2", &do_something_mk2);
}
A pointer to the knncolle::Builder
is then be created in Python by the define_builder()
function, and then passed to C++:
from . import lib_downstream as lib
from knncolle import define_builder, Parameters
def do_something(param: Parameters):
builder, cls = define_builder(param)
return lib.do_something_mk2(builder.ptr)
Check out the included header for more definitions.
The best way to extend knncolle is to do so in C++. This involves writing subclasses of the interfaces in the knncolle library. Once this is done, it is a simple matter of writing the following Python bindings:
- Implement a
SomeNewParameters
class that inherits fromParameters
. - Implement a
SomeNewIndex
class that inherits fromGenericIndex
. This should accept a singleptr
in its constructor and have aptr
property that returns the same value. - Register a
define_builder()
method that dispatches onSomeNewParameters
. This should call into C++ and return a tuple containing aBuilder
object and theSomeNewIndex
constructor.
No new methods are required for find_knn()
, build_index()
, etc. as the default method will work automatically if a define_builder()
method is available.
This approach also allows the new method to be used in C++ code of downstream packages.
If it is not possible to implement the search algorithm in C++, we can still extend knncolle in Python. Each extension package should:
- Implement a
SomeNewParameters
class that inherits fromParameters
. - Implement a
SomeNewIndex
class that inherits fromIndex
. This can have an arbitrary structure, i.e., it does not need to have aptr
property. - Register a
build_index()
method that dispatches onSomeNewParameters
. This should return an instance ofSomeNewIndex
. - Register a method for any number of these generics:
find_knn()
,find_distance()
,find_neighbors()
,query_knn()
,query_distance()
,query_neighbors()
. These methods should dispatch onSomeNewParameters
and return the appropriate result object.
This approach will not support re-use by C++ code in other Python packages.