MinHash.jl

Fast and generic implementation of the minhash algorithm
Author jakobnissen
Popularity
14 Stars
Updated Last
10 Months Ago
Started In
March 2020

MinHash

Dev CI Codecov

Efficient minhashing in Julia

MinHash.jl offers generic, efficient MinHash sketching, and functions to efficiently compute the number of shared minhashes between sketches. This package is envisioned to be used as a dependency for other Julia packages that needs minhashing.

Interface

Types

MinHasher{F}(s::Integer)

A MinHasher object performs the minhashing, using function F as a hash function, and storing the s smallest hashes only. F defaults to Base.hash.

MinHashSketch(::MinHasher)

Stores the information of a MinHasher (namely, the hash function, maximal number of hashes, and the hashes themselves) in a more efficient type. This type should be used to store the actual hashes.

Methods

update!(::MinHasher, it)

Iterate over it, adding each element to the minhasher.

sketch(F, it, s::Integer) Hash all elements of it using function F, storing at most the s smallest hashes. Equivalent to:

hasher = MinHasher{F}(s)
update!(hasher, it)
return MinHashSketch(hasher)

sketch(it, s::Integer)

Same as sketch(Base.hash, it, s)

intersectionlength(a::MinHashSketch, b::MinHashSketch)

Efficiently compute the number of hashes both in a and b. Does not check that the hash functions for the two sketches are the same, result will be meaningless if they are not.

intersectionlength(::AbstractVector{MinHashSketch})

Efficiently compute a lower triangular matrix (of type Matrix{Int}) of shared hashes for all pairs in the input vector. For long vectors, this is much more efficient than calculating the distances pairwise.