StableHashTraits
The aim of StableHashTraits is to make it easy to compute a stable hash of any Julia value with minimal boilerplate using trait-based dispatch; here, "stable" means the hash value will not change across Julia versions or between Julia sessions.
For example:
using StableHashTraits
struct MyType
    a
    b
end
StableHashTraits.hash_method(::MyType) = UseProperties()
stable_hash(MyType(1,2)) == stable_hash((a=1, b=2)) # true
Why use stable_hash instead of Base.hash?
This package can be useful when:
- you want to ensure the hash value will not change when you update Julia or start a new session, OR
- you want to compute a hash for an object that does not have hash defined.
This is useful for content-addressed caching, in which e.g. some function of a value is stored at a location determined by a hash. Given the value, one can recompute the hash to determine where to look to see if the function evaluation on that value has already been cached.
It isn't intended for secure hashing.
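The content-addressed caching described above can be sketched as follows. This is a minimal illustration, not part of the package: `RESULT_CACHE` and `cached_call` are hypothetical names, the store is an in-memory `Dict` rather than a location on disk, and only the value (not the function) contributes to the key.

```julia
using StableHashTraits

# Hypothetical in-memory store mapping a stable hash to a cached result.
const RESULT_CACHE = Dict{Any,Any}()

# Compute `f(x)` once per distinct value of `x`; later calls with an equal
# value recompute the hash and find the stored result under the same key.
function cached_call(f, x)
    key = stable_hash(x)
    return get!(() -> f(x), RESULT_CACHE, key)
end

cached_call(sum, [1, 2, 3])  # computes the result and stores it
cached_call(sum, [1, 2, 3])  # looks up the stored result
```

A persistent cache would use the key to build a file name or database key instead, which is where stability across sessions starts to matter.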
Details
There is one exported method: stable_hash. You call this on any number of objects and the returned value is a hash of those objects (the argument order matters).
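As a small illustration of hashing several objects at once, the sketch below calls `stable_hash` with the same two arguments in both orders. It assumes the call signature as described in this README; the variable names are illustrative only.

```julia
using StableHashTraits

h_ab = stable_hash("config", 42)
h_ba = stable_hash(42, "config")

# Because argument order matters, these two hashes are not expected to match.
order_matters = h_ab != h_ba
```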
You can customize its behavior for particular types by implementing the trait StableHashTraits.hash_method. Any method of hash_method should simply return one of the following values.
- UseWrite(): writes the object to a binary format using StableHashTraits.write(io, x) and takes a hash of that (this is the default behavior). StableHashTraits.write(io, x) falls back to Base.write(io, x) if no specialized methods are defined for x.
- UseIterate(): assumes the object is iterable and computes a hash of all elements.
- UseProperties(): assumes a struct of some type and uses propertynames and getproperty to compute a hash of all fields. You can further customize its behavior by passing the symbol :ByOrder (to hash properties in the order they are listed by propertynames), which is the default, or :ByName (sorting properties by their name before hashing).
- UseTable(): assumes the object is a table (Tables.istable returns true) and uses Tables.columns and Tables.columnnames to compute a hash of each column's content and name, à la UseProperties. This method should rarely need to be specified by the user, as the fallback method for Any should normally handle this case.
- UseQualifiedName(): hashes the string parentmodule(T).nameof(T), where T is the type of the object. Throws an error if the name includes # (e.g. an anonymous function). If you wish to include this qualified name and another method, pass one of the other methods as an argument (e.g. UseQualifiedName(UseProperties())). This can be used to include the type as part of the hash. Do you want a named tuple with the same properties as your custom struct to hash to the same value? If you don't, then use UseQualifiedName.
- UseSize(method): hashes the result of calling size on the object and uses method to hash the contents of the value (e.g. UseIterate).
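To make two of the options above concrete, here is a sketch that assumes the traits behave as described in this list. The struct names are hypothetical, and the comparisons show what the descriptions imply rather than verified output.

```julia
using StableHashTraits

struct TaggedPoint   # hypothetical type whose name should be part of the hash
    x
    y
end
struct PointXY       # hypothetical types that differ only in field order
    x
    y
end
struct PointYX
    y
    x
end

StableHashTraits.hash_method(::TaggedPoint) = UseQualifiedName(UseProperties())
StableHashTraits.hash_method(::PointXY) = UseProperties(:ByName)
StableHashTraits.hash_method(::PointYX) = UseProperties(:ByName)

# Including the qualified name should keep the struct distinct from a
# named tuple with the same properties.
tagged_differs = stable_hash(TaggedPoint(1, 2)) != stable_hash((x=1, y=2))

# Sorting properties by name should make field declaration order irrelevant.
by_name_agree = stable_hash(PointXY(1, 2)) == stable_hash(PointYX(2, 1))
```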
Your hash will be stable if the output for the given method remains the same: e.g. if write is the same for an object that uses UseWrite, its hash will be the same; if the properties are the same for UseProperties, the hash will be the same; etc.
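One way to control what "the same output" means for UseWrite is to specialize `StableHashTraits.write` for your own type. The sketch below assumes the `write(io, x)` signature given above; `Celsius` is a hypothetical type used only for illustration.

```julia
using StableHashTraits

struct Celsius   # hypothetical type, for illustration only
    degrees::Float64
end

# No hash_method is defined for Celsius, so the default UseWrite() applies
# and the hash depends only on the bytes written here.
StableHashTraits.write(io, x::Celsius) = Base.write(io, x.degrees)

same_value = stable_hash(Celsius(21.5)) == stable_hash(Celsius(21.5))
```

As long as this method keeps writing the same bytes for the same value, the hash should stay the same even if the struct gains cached or derived fields later.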
Implemented methods of hash_method
- Any: either UseWrite() OR UseTable() for any object x where Tables.istable(x) is true
- Function: UseQualifiedName()
- NamedTuple: UseProperties()
- AbstractVector, Tuple, Pair: UseIterate()
- AbstractArray: UseSize(UseIterate())
- Missing, Nothing: UseQualifiedName()
- VersionNumber: UseProperties()
- UUID: UseProperties()
- Dates.AbstractTime: UseProperties()
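The consequences of these defaults for arrays can be checked directly. The sketch below relies only on behavior this README states elsewhere: multi-dimensional arrays include their size in the hash, while vectors and tuples are both hashed by iteration.

```julia
using StableHashTraits

m = [1 2; 3 4]

# AbstractArray uses UseSize(UseIterate()), so the shape is part of the hash
# and a matrix should not collide with its flattened vector.
matrix_matches_vec = stable_hash(m) == stable_hash(vec(m))

# AbstractVector and Tuple both use UseIterate(), so these are expected to collide.
vector_matches_tuple = stable_hash([1, 2, 3]) == stable_hash((1, 2, 3))
```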
For more complicated scenarios where implementing hash_method will not suffice, refer to the documentation of transform and write. For instance, Set objects are supported using transform.
Breaking changes
In 0.3:
To prevent reshaped arrays from having the same hash (stable_hash([1 2; 3 4]) == stable_hash(vec([1 2; 3 4]))), the hashes for all arrays with more than 1 dimension have changed.
In 0.2:
To support hashing of all tables (Tables.istable(x) == true), hashes have changed for such objects when:
- calling stable_hash(x) did not previously error
- x is not a DataFrame (these previously errored)
- x is not a NamedTuple of table columns (these have the same hash as before)
- x is not an AbstractArray of NamedTuple rows (these have the same hash as before)
- x can be successfully written to an IO buffer via Base.write or StableHashTraits.write (otherwise it previously errored)
- x has no specialized stable_hash method defined for it (otherwise the hash will be the same)
Any such table now uses the method UseTable, rather than UseWrite, and so would have the same hash as a DataFrame or NamedTuple with the same column contents instead of its previous hash value. For example, if you had a custom table type MyCustomTable for which you only defined a StableHashTraits.write method and no hash_method, its hash will be changed unless you now define hash_method(::MyCustomTable) = UseWrite().
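The MyCustomTable scenario might look like the sketch below. The table type, its single field, and the byte layout chosen in `write` are all hypothetical; only `hash_method`, `UseWrite`, and the `StableHashTraits.write(io, x)` signature come from this README, and the `Tables` methods are the standard Tables.jl column-access interface.

```julia
using StableHashTraits, Tables

# Hypothetical column-oriented table type, for illustration only.
struct MyCustomTable
    columns::NamedTuple
end
Tables.istable(::Type{MyCustomTable}) = true
Tables.columnaccess(::Type{MyCustomTable}) = true
Tables.columns(t::MyCustomTable) = t.columns

# The write method that defined this table's hash before 0.2:
# every value of every column is written in order.
function StableHashTraits.write(io, t::MyCustomTable)
    n = 0
    for column in t.columns, value in column
        n += Base.write(io, value)
    end
    return n
end

# Opt back in to UseWrite(); without this line the table would be hashed
# with UseTable() because Tables.istable is true for it.
StableHashTraits.hash_method(::MyCustomTable) = UseWrite()

table_hash = stable_hash(MyCustomTable((a=[1, 2], b=[3.0, 4.0])))
```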
Avoiding Type Piracy
It can be very tempting to define hash_method for types that were defined by another package or from Base. This is type piracy, and can easily lead to two different packages defining the same method: in this case, the method which gets used depends on the order of using statements... yuck.
To avoid this problem, it is possible to define a version of any method you specialize (e.g. hash_method, transform and/or write) with one additional argument. This final argument can be anything you want, so long as it is a type you have defined. For example:
using StableHashTraits, DataFrames
struct MyContext end
StableHashTraits.hash_method(::DataFrame, ::MyContext) = UseProperties(:ByOrder)
stable_hash(DataFrame(a=1:2, b=1:2); context=MyContext())
By default the context is StableHashTraits.GlobalContext, and fallback methods are defined that pass through to the methods without a context argument (e.g. hash_method(x, context) = hash_method(x)).
In this way, you only need to define methods for the types that have non-default behavior for your context; furthermore, those who have no need of a particular context object can simply define methods without it.
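The same pattern works for types from Base. The sketch below assumes the context mechanism behaves as described above; `SortedNamesContext` is a hypothetical type, and the comparisons show what the description implies rather than verified output.

```julia
using StableHashTraits

# Hypothetical context type; any type you own can serve as a context.
struct SortedNamesContext end

# Applies only when this context is passed, so the behavior of NamedTuple
# for everyone else is left untouched and no type piracy occurs.
StableHashTraits.hash_method(::NamedTuple, ::SortedNamesContext) = UseProperties(:ByName)

ctx = SortedNamesContext()

# With the context, property order should no longer affect the hash.
with_context = stable_hash((b=2, a=1); context=ctx) == stable_hash((a=1, b=2); context=ctx)

# Without it, the default :ByOrder behavior should still distinguish the two.
without_context = stable_hash((b=2, a=1)) == stable_hash((a=1, b=2))
```

The integer values inside the named tuples have no method for this context, so they fall through to the context-free defaults.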
Hashing gotchas
Here is a list of hash collisions that have been deemed acceptable in practice:
stable_hash(sin) == stable_hash("Base.sin")
stable_hash([1,2,3]) == stable_hash((1,2,3))
stable_hash(DataFrame(x=1:10)) == stable_hash((; x=collect(1:10)))
stable_hash(1:10) == stable_hash((;start=1, stop=10))