Efficient minhashing in Julia
MinHash.jl
offers generic, efficient MinHash sketching, and functions to efficiently compute the number of shared minhashes between sketches. This package is envisioned to be used as a dependency for other Julia packages that needs minhashing.
MinHasher{F}(s::Integer)
A MinHasher object performs the minhashing, using function F
as a hash function, and storing the s
smallest hashes only. F
defaults to Base.hash
.
MinHashSketch(::MinHasher)
Stores the information of a MinHasher
(namely, the hash function, maximal number of hashes, and the hashes themselves) in a more efficient type. This type should be used to store the actual hashes.
update!(::MinHasher, it)
Iterate over it
, adding each element to the minhasher.
sketch(F, it, s::Integer)
Hash all elements of it
using function F
, storing at most the s
smallest hashes. Equivalent to:
hasher = MinHasher{F}(s)
update!(hasher, it)
return MinHashSketch(hasher)
sketch(it, s::Integer)
Same as sketch(Base.hash, it, s)
intersectionlength(a::MinHashSketch, b::MinHashSketch)
Efficiently compute the number of hashes both in a
and b
. Does not check that the hash functions for the two sketches are the same, result will be meaningless if they are not.
intersectionlength(::AbstractVector{MinHashSketch})
Efficiently compute a lower triangular matrix (of type Matrix{Int}
) of shared hashes for all pairs in the input vector. For long vectors, this is much more efficient than calculating the distances pairwise.