A Julia package for locality-sensitive hashing to accelerate similarity search.
Traditionally, if you have a data point x
, and want to find the most similar
point(s) to x
in your database, you would compute the similarity between x
and all of the points in your database, and keep whichever points were the most
similar. For instance, this type of approach is used by the classic k-nearest
neighbors algorithm.
However, it has two major problems:
- The time to find the most similar point(s) to
x
is linear in the number of points in your database. This can make similarity search prohibitively expensive for even moderately large datasets. - In addition, the time complexity to compute the similarity between two datapoints is typically linear in the number of dimensions of those datapoints. If your data are high-dimensional (i.e. in the thousands to millions of dimensions), every similarity computation you perform can be fairly costly.
Locality-sensitive hashing (LSH) is a technique for accelerating these kinds
of similarity searches. Instead of measuring how similar your query point is to
every point in your database, you calculate a few hashes of the query point and
only compare it against those points with which it experiences a hash collision.
Locality-sensitive hash functions are randomly generated, with the fundamental
property that as the similarity between x
and y
increases, the probability
of a hash collision between x
and y
also increases.
You can install LSHFunctions.jl from the Julia REPL with
pkg> add LSHFunctions
So far, there are hash functions for the similarity functions:
- Cosine similarity (
SimHash
) - Jaccard similarity (
MinHash
) - L1 (Manhattan / "taxicab") distance:
L1Hash
- L2 (Euclidean) distance:
L2Hash
- Inner product
SignALSH
(recommended)MIPSHash
- Function-space hashes (supports L1, L2, and cosine similarity)
MonteCarloHash
ChebHash
This package still needs a lot of work, including improvement to the documentation and API.
The easiest way to start constructing new hash functions is by calling
LSHFunction
with the following syntax:
hashfn = LSHFunction(similarity function,
number of hash functions to generate;
[LSH family-specific keyword arguments])
For example, the following snippet generates 10 locality-sensitive hash
functions (bundled together into a single SimHash
) for cosine similarity:
julia> using LSHFunctions;
julia> hashfn = LSHFunction(cossim, 10);
julia> n_hashes(hashfn)
10
julia> similarity(hashfn)
cossim
You can hash inputs by calling hashfn()
:
julia> x = randn(128);
julia> x_hashes = hashfn(x);
For more details, check out the LSHFunctions.jl documentation.