MinHash
Efficient minhashing in Julia
MinHash.jl
offers generic, efficient MinHash sketching, and functions to efficiently compute the number of shared minhashes between sketches. This package is envisioned to be used as a dependency for other Julia packages that needs minhashing.
Interface
Types
MinHasher{F}(s::Integer)
A MinHasher object performs the minhashing, using function F
as a hash function, and storing the s
smallest hashes only. F
defaults to Base.hash
.
MinHashSketch(::MinHasher)
Stores the information of a MinHasher
(namely, the hash function, maximal number of hashes, and the hashes themselves) in a more efficient type. This type should be used to store the actual hashes.
Methods
update!(::MinHasher, it)
Iterate over it
, adding each element to the minhasher.
minhash(F, it, s::Integer)
Hash all elements of it
using function F
, storing at most the s
smallest hashes. Equivalent to:
hasher = MinHasher{F}(s)
update!(hasher, it)
return MinHashSketch(hasher)
minhash(it, s::Integer)
Same as minhash(Base.hash, it, s)
intersectionlength(a::MinHashSketch, b::MinHashSketch)
Efficiently compute the number of hashes both in a
and b
. Does not check that the hash functions for the two sketches are the same, result will be meaningless if they are not.
intersectionlength(::AbstractVector{MinHashSketch})
Efficiently compute a lower triangular matrix (of type Matrix{Int}
) of shared hashes for all pairs in the input vector. For long vectors, this is much more efficient than calculating the distances pairwise.