Efficient minhashing in Julia
MinHash.jl offers generic, efficient MinHash sketching, and functions to efficiently compute the number of shared minhashes between sketches. This package is envisioned to be used as a dependency for other Julia packages that needs minhashing.
MinHasher{F}(s::Integer)
A MinHasher object performs the minhashing, using function F as a hash function, and storing the s smallest hashes only. F defaults to Base.hash.
MinHashSketch(::MinHasher)
Stores the information of a MinHasher (namely, the hash function, maximal number of hashes, and the hashes themselves) in a more efficient type. This type should be used to store the actual hashes.
update!(::MinHasher, it)
Iterate over it, adding each element to the minhasher.
sketch(F, it, s::Integer)
Hash all elements of it using function F, storing at most the s smallest hashes. Equivalent to:
hasher = MinHasher{F}(s)
update!(hasher, it)
return MinHashSketch(hasher)
sketch(it, s::Integer)
Same as sketch(Base.hash, it, s)
intersectionlength(a::MinHashSketch, b::MinHashSketch)
Efficiently compute the number of hashes both in a and b. Does not check that the hash functions for the two sketches are the same, result will be meaningless if they are not.
intersectionlength(::AbstractVector{MinHashSketch})
Efficiently compute a lower triangular matrix (of type Matrix{Int}) of shared hashes for all pairs in the input vector. For long vectors, this is much more efficient than calculating the distances pairwise.