A Julia package for locality-sensitive hashing to accelerate similarity search.
Traditionally, if you have a data point x, and want to find the most similar
point(s) to x in your database, you would compute the similarity between x
and all of the points in your database, and keep whichever points were the most
similar. For instance, this type of approach is used by the classic k-nearest
neighbors algorithm.
However, it has two major problems:
- The time to find the most similar point(s) to
xis linear in the number of points in your database. This can make similarity search prohibitively expensive for even moderately large datasets. - In addition, the time complexity to compute the similarity between two datapoints is typically linear in the number of dimensions of those datapoints. If your data are high-dimensional (i.e. in the thousands to millions of dimensions), every similarity computation you perform can be fairly costly.
Locality-sensitive hashing (LSH) is a technique for accelerating these kinds
of similarity searches. Instead of measuring how similar your query point is to
every point in your database, you calculate a few hashes of the query point and
only compare it against those points with which it experiences a hash collision.
Locality-sensitive hash functions are randomly generated, with the fundamental
property that as the similarity between x and y increases, the probability
of a hash collision between x and y also increases.
You can install LSHFunctions.jl from the Julia REPL with
pkg> add LSHFunctions
So far, there are hash functions for the similarity functions:
- Cosine similarity (
SimHash) - Jaccard similarity (
MinHash) - L1 (Manhattan / "taxicab") distance:
L1Hash - L2 (Euclidean) distance:
L2Hash - Inner product
SignALSH(recommended)MIPSHash
- Function-space hashes (supports L1, L2, and cosine similarity)
MonteCarloHashChebHash
This package still needs a lot of work, including improvement to the documentation and API.
The easiest way to start constructing new hash functions is by calling
LSHFunction with the following syntax:
hashfn = LSHFunction(similarity function,
number of hash functions to generate;
[LSH family-specific keyword arguments])
For example, the following snippet generates 10 locality-sensitive hash
functions (bundled together into a single SimHash ) for cosine similarity:
julia> using LSHFunctions;
julia> hashfn = LSHFunction(cossim, 10);
julia> n_hashes(hashfn)
10
julia> similarity(hashfn)
cossimYou can hash inputs by calling hashfn():
julia> x = randn(128);
julia> x_hashes = hashfn(x);For more details, check out the LSHFunctions.jl documentation.