# LSHFunctions.jl

A Julia package for locality-sensitive hashing to accelerate similarity search.

## What's LSH?

Traditionally, if you have a data point `x`

, and want to find the most similar
point(s) to `x`

in your database, you would compute the similarity between `x`

and all of the points in your database, and keep whichever points were the most
similar. For instance, this type of approach is used by the classic k-nearest
neighbors algorithm.
However, it has two major problems:

- The time to find the most similar point(s) to
`x`

is linear in the number of points in your database. This can make similarity search prohibitively expensive for even moderately large datasets. - In addition, the time complexity to compute the similarity between two datapoints is typically linear in the number of dimensions of those datapoints. If your data are high-dimensional (i.e. in the thousands to millions of dimensions), every similarity computation you perform can be fairly costly.

**Locality-sensitive hashing** (LSH) is a technique for accelerating these kinds
of similarity searches. Instead of measuring how similar your query point is to
every point in your database, you calculate a few hashes of the query point and
only compare it against those points with which it experiences a hash collision.
Locality-sensitive hash functions are randomly generated, with the fundamental
property that as the similarity between `x`

and `y`

increases, the probability
of a hash collision between `x`

and `y`

also increases.

## Installation

You can install LSHFunctions.jl from the Julia REPL with

```
pkg> add LSHFunctions
```

## Supported similarity functions

So far, there are hash functions for the similarity functions:

- Cosine similarity (
`SimHash`

) - Jaccard similarity (
`MinHash`

) - L1 (Manhattan / "taxicab") distance:
`L1Hash`

- L2 (Euclidean) distance:
`L2Hash`

- Inner product
`SignALSH`

(recommended)`MIPSHash`

- Function-space hashes (supports L1, L2, and cosine similarity)
`MonteCarloHash`

`ChebHash`

This package still needs a lot of work, including improvement to the documentation and API.

## Examples

The easiest way to start constructing new hash functions is by calling
`LSHFunction`

with the following syntax:

```
hashfn = LSHFunction(similarity function,
number of hash functions to generate;
[LSH family-specific keyword arguments])
```

For example, the following snippet generates 10 locality-sensitive hash
functions (bundled together into a single `SimHash`

) for cosine similarity:

```
julia> using LSHFunctions;
julia> hashfn = LSHFunction(cossim, 10);
julia> n_hashes(hashfn)
10
julia> similarity(hashfn)
cossim
```

You can hash inputs by calling `hashfn()`

:

```
julia> x = randn(128);
julia> x_hashes = hashfn(x);
```

For more details, check out the LSHFunctions.jl documentation.