The aim of StableHashTraits is to make it easy to compute a stable hash of any Julia value with minimal boilerplate using trait-based dispatch; here, "stable" means the value will not change across Julia versions (or between Julia sessions).
For example:
using StableHashTraits
using StableHashTraits: Transformer
using Dates
struct MyType
data::Vector{UInt8}
metadata::Dict{Symbol, Any}
end
# ignore `metadata`, `data` will be hashed using fallbacks for `AbstractArray` type
StableHashTraits.transformer(::Type{<:MyType}) = Transformer(pick_fields(:data))
# NOTE: `pick_fields` is a helper function implemented by `StableHashTraits`
# it creates a named tuple with the given object fields; in the above code it is used
# in its curried form e.g. `pick_fields(:data)` is the same as `x -> pick_fields(x, :data)`
a = MyType(read("myfile.txt"), Dict{Symbol, Any}(:read => Dates.now()))
b = MyType(read("myfile.txt"), Dict{Symbol, Any}(:read => Dates.now()))
stable_hash(a; version=4) == stable_hash(b; version=4) # true
Users can define a method of transformer
to customize how an object is hashed. It should dispatch on the type to be transformed, and return a function wrapped in Transformer
. During hashing, this function is called and its result is the value that is actually hashed.
StableHashTraits aims to guarantee a stable hash so long as you only upgrade to non-breaking versions (e.g. StableHashTraits = "1"
in [compat]
of Project.toml
); any changes in an object's hash in this case would be considered a bug.
Warning
Hash versions 4 constitutes a substantial redesign of StableHashTraits so as to avoid reliance on the internals of Base julia and packages. Hash versions 1-3 are deprecated and will be removed in a soon-to-be released StableHashTraits 2.0. Hash version 4 will remain unchanged in this 2.0 release. For backwards compatibility, hash version 1 is currently the default version if you don't specify a version. You can read the documentation for hash version 1 here and hash version 2-3 here.
StableHashTraits is designed to be used in cases where there is an object you wish to serialize in a content-addressed cache. How and when objects pass the same input to a hashing algorithm is meant to be predictable and well defined, so that the user can reliably define methods of transformer
to modify this behavior.
This covers the behavior when using the latest hash version (4). You can read the documentation for hash version 1 here and hash version 2-3 here.
When you call stable_hash(x; version=4)
, StableHashTraits hashes both the value x
and its type T
. Rather than hashing the type T
itself directly, in most cases instead SructTypes.StructType(T)
is hashed, using
SructTypes.jl. For example, since the "StructType" of Float64 and Float32 are both NumberType
, when hashing Float64 and Float32 values, value and NumberType
are hashed. This provides a simple trait-based system that doesn't need to rely on internal details. See below for more details.
You can customize how the value is hashed using StableHashTraits.transformer
,
and how its type is hashed using StableHashTraits.transform_type
.
If you need to customize either of these functions for a type that you don't own, you can use a @context to avoid type piracy.
StructType.DataType
denotes a type that is some kind of "record"; i.e. its content is defined by the fields (getfield(f) for f in fieldnames(T)
) of the type. Since it is the default, it is used to hash most types.
To hash the value, each field value (getfield(f) for f in fieldnames(T)
) is hashed.
If StructType(T) <: StructTypes.UnorderedStruct
(the default), the field values are first sorted by the lexographic order of the field names.
The type of a data type is hashed using string(nameof(T))
, the fieldnames(T)
, (sorting them for UnorderedStruct
), along with a hash of the type of each element of fieldtypes(T)
according to their StructType
.
No type parameters are hashed by default. To hash these you need to specialize on StableHashTraits.transform_type
for your struct. Note that because fieldtypes(T)
is hashed, you don't need to do this unless your type parameters are not used in the specification of your field types.
ArrayType
is used when hashing a sequence of values.
To hash the value, each element of an array type is hashed using iterate
. If the object isa AbstractArray
, the size(x)
of the object is also hashed.
If StableHashTraits.is_ordered
returns false
the elements are first sort
ed according to StableHashTraits.hash_sort_by
.
To hash the type, the string "StructTypes.ArrayType"
is hashed (meaning that the kind of array won't matter to the hash value), and the type of the elype
is hashed, according to its StructType
. If the type <: AbstractArray
, the ndims(T)
is hashed.
To hash the value, each key-value pair of a dict type is hashed, as returned by StructType.keyvaluepairs(x)
.
If StableHashTraits.is_ordered
returns false
(which is the default return value) the pairs are first sort
ed according their keys using StableHashTraits.hash_sort_by
.
To hash the type, the string "StructTypes.DictType"
is hashed (meaning that the kind of dictionary won't matter), and the type of the keytype
and valtype
is hashed, according to its StructType
.
AbstractRange
constitutes an exception to the rule that we use StructType
: for efficient hashing, ranges are treated as another first-class container object, separate from array types.
The value is hashed as (first(x), step(x), last(x))
.
The type is hashed as "Base.AbstractRange"
along with the type of the eltype
, according to its StructType
. Thus, the type of range doesn't matter (just that it is a range).
To hash the value, the result of Base.write
ing the object is hashed.
To hash the type, the value of string("StructType.", nameof_string(StructType(T))))
is used (c.f. StableHashTraits.nameof_string
for details). Note that this means the type of the value itself is not being hashed, rather a string related to its struct type.
For any StructType.CustomStruct
, the object is first StructType.lower
ed and the result is hashed according to the lowered StructType
.
There is no value hashed for missing
or nothing
; the type is hashed as the string "Base.Missing"
and "Base.Nothing"
respectively. Note in particular the string "Base.Missing"
does not have the same hash as missing
, since the former would have its struct type hashed.
Null and Singleton types are hashed solely according to their type (no value is hashed)
Their types is hashed by StableHashTraits.nameof_string
This means the module of the type does not matter: the module of a type is often considered an implementation detail, so it is left out to avoid unexpected hash changes from non-breaking releases that change the module of a type.
Note
If you wish to disambiguate functions or types that have the same name but that come from different modules you can overload StableHashTraits.transform_type
for those functions. If you want to include the module name for a broad set of types, rather than explicitly specifying a module name for each type, you may want to consider calling StableHashTraits.module_nameof_string
in the body of your transform_type
method. This can avoid a number of footguns when including the module names: for example, module_nameof_string
renames Core
to Base
to elide Base julia changes to the location of a functions between these two modules and it renames pluto workspace modules to prevent structs from having a different hash each time the notebook is run.
Function values are hashed according to their their fields (getfield(f) for f in fieldnames(f)
) as per StructType.UnorderedStruct
; functions can have fields when they are curried (e.g. ==(2)
), and so, for this reason, the fields are included in the hash by default.
The type of the function is hashed according to its StableHashTraits.nameof_string
, therefore excluding its module. The exact module of a function is often considered an implementation detail, so it is left out to avoid unexpected hash changes from non-breaking releases that change the module of a function.
When hashing a type as a value (e.g. stable_hash(Int; version=4)
) the value of `StableHashTraits.nameof_string is hashed. The exact module of a type is often considered an implementation detail, so it is left out to avoid unexpected hash changes from non-breaking releases that change the module of a type.
All of the following hash examples follow directly from the definitions above, but may not be so obvious to the reader.
Most of the behaviors described below can be customized/changed by using your own hash StableHashTraits.@context
, which can be passed as the second argument to stable_hash
. StableHashTraits tries to defer to StructTypes for most defaults instead of making more opinionated choices.
The order of NamedTuple pairs does not matter, because NamedTuple
has a struct type of UnorderedStruct
:
stable_hash((;a=1,b=2); version=4) == stable_hash((;b=2,a=1); version=4)
Two structs with the same fields and name hash equivalently, because the default struct type is UnorderedStruct
:
module A
struct X
bar::Int
foo::Float64
end
end
module B
struct X
foo::Float64
bar::Int
end
end
stable_hash(B.X(2, 1); version=4) == stable_hash(A.X(1, 2); version=4)
Different array types with the same content hash to the same value:
stable_hash(view([1,2,3], 1:2); version=4) == stable_hash([1,2]; version=4)
Byte-equivalent arrays of all NumberType
values will hash to the same value:
stable_hash([0.0, 0.0]; version=4) == stable_hash([0, 0]; version=4)
stable_hash([0.0f0, 0.0f0]; version=4) != stable_hash([0, 0]; version=4) # not byte equivalent
Also, even though the bytes are the same, since the size is hashed, we have:
stable_hash([0.0f0, 0.0f0]; version=4) != stable_hash([0]; version=4)
If the eltype has a different StructType
, no collision will occur:
stable_hash(Any[0.0, 0.0]; version=4) != stable_hash([0, 0]; version=4)
Even if the mathematical values are the same, if the bytes are not the same no collision will occur:
stable_hash([1.0, 2.0]; version=4) != stable_hash([1, 2]; version=4)
Two types with the same name but different type parameters will hash the same (unless you define
a transform_type_value
method for your type to include those type parameters in its return value):
struct MyType{T} end
stable_hash(MyType{:a}) == stable_hash(MyType{:b}) # true
Numerical changes will, of course, change the hash. One way this can catch you off guard
are some differences in StaticArray
outputs between julia versions:
julia> using StaticArrays, StableHashTraits;
julia> begin
rotmatrix2d(a) = @SMatrix [cos(a) sin(a); -sin(a) cos(a)]
rotate(a, p) = rotmatrix2d(a) * p
rotate((pi / 4), SVector{2}(0.42095778959006, -0.42095778959006))
end;
In julia 1.9.4:
julia> bytes2hex(stable_hash(rotate((pi / 4), SVector{2}(0.42095778959006, -0.42095778959006)); version=4))
"4ccdc172688dd2b5cd50ba81071a19217c3efe2e3b625e571542004c8f96c797"
julia> rotate((pi / 4), SVector{2}(0.42095778959006, -0.42095778959006))
2-element SVector{2, Float64} with indices SOneTo(2):
7.419375817039376e-17
-0.5953242152248626
In julia 1.6.7
julia> bytes2hex(stable_hash(rotate((pi / 4), SVector{2}(0.42095778959006, -0.42095778959006)); version=4))
"3b8d998f3106c05f8b74ee710267775d0d0ce0e6780c1256f4926d3b7dcddf9e"
julia> rotate((pi / 4), SVector{2}(0.42095778959006, -0.42095778959006))
2-element SVector{2, Float64} with indices SOneTo(2):
5.551115123125783e-17
-0.5953242152248626
This release includes a new hash version 4 that has breaking API changes relative to earlier versions, documented above. The prior API is deprecated, however remains the default to avoid breaking users's code. In version 2 of StableHashTraits, which will be released in relatively short order, only hash version 4 will be available.
This release includes a bugfix to stable_type_id
and the underlying hashes that depend on it (true for most types). This bug caused stable_type_id
to yield a different value depending on the scope in which stable_type_id
was first called for a given type.
Now that 1.3 is available, 1.2 should not be used, as it addresses the same bug with a better API, rather than the hotfix applied here.
1.2 defines hash version 3, which uses a fixed version of stable_type_id
that can be used by leveraging hash version 3. E.g. if you call stable_hash(x, version=3)
or use HashVersion{3}()
where you would have used HashVersion{2}()
you will not be susceptible to the bug. If you make use of stable_type_id
directly and want to avoid this bug, you should use StableHashTraits.stable_type_id_fixed
.
Because existing uses of StableHashTraits
might depend on the extant, broken behavior, versions 1 and 2 of hashing remain unchanged.
This release includes speed improvements of about 100 fold.
- Feature:
HashVersion{2}
is a new hash context that can be up to ~100x faster thanHashVersion{1}
. - Feature: The requirements for
HashVersion{2}
on the passed hash function have been relaxed, such thatalg=crc32
should again work (no need to callalg=(x,s=UInt32(0)) -> crc32c(copy(x),s)
). - Feature:
@ConstantHash
allows for precomputed hash values of constant strings and numbers. - Feature:
stable_typename_id
andstable_type_id
provide compile-time 64 bit hashes of the types of objects - Feature:
root_version
: Most users can safely ignore this new function. If you are implementing a root context (one that returnsparent_context(::MyContext) = nothing
) you will need to define this function. It indicates what version of the hashing implementations to use (1 or 2). It defaults to 1 to avoid changing the hash values of existing root contexts, but should be defined to return 2 to make use of the more optimized implementations used byHashVersion{2}
. - Deprecation:
HashVersion{1}
has been deprecated, favor version 2 over 1 in all cases where backwards compatibility is not required. - Deprecation:
qualified_name
andqualified_type
have been deprecated, in favor ofstable_typename_id
andstable_type_id
. - Deprecation:
ConstantHash
has been deprecated in favor of the more efficient@ConstantHash
. To remove deprecated API: any call toConstantHash(x)
wherex
is a constant literal should be changed to@ConstantHash(x)
. Ifx
is an expression you can useFnHash(_ -> x)
to achieve the same result. Note however that the use of a non-literal is probably a code smell, ashash_method
should normally only depend on the type of its arguments.
This is a very breaking release, almost all values hash differently and the API has changed.
However, far fewer manual definitions of hash_method
become necessary. The fallback for
Any
should handle many more cases.
- Breaking:
transform
has been removed, its features are covered byFnHash
andHashAndContext
. - Breaking:
stable_hash
no longer accepts multiple objects to hash (wrap them in a tuple instead); it now accepts a single object to hash, and the second positional argument is the context (see below for details on contexts). - Breaking: The default
alg
forstable_hash
issha256
; to use the old default (crc32c) you can passalg=(x,s=UInt32(0)) -> crc32c(copy(x),s)
. - Deprecation: The traits to return from
hash_method
have changed quite a bit. You will need to replace the old names as follows to avoid deprecation warnings during your tests:- Favor
StructHash()
(which usesfieldnames
instead ofpropertynames
) toUseProperties()
. - BUT to reproduce
UseProperties()
, callStructHash(propertynames => getproperty)
- Replace
UseQualifiedName()
withFnHash(qualified_name, HashWrite())
- Replace
UseSize(method)
with(FnHash(size), method)
- Replace
UseTable
withFnHash(Tables.columns, StructHash(Tables.columnnames => Tables.getcolumn))
- Favor
- Deprecation: The fallback methods for hashing are defined within a specific
context (
HashVersion{1}
). Any contexts you make should define aparent_context
method that returns e.g.HashVersion{1}
so that the fallback implementation for any methods ofhash_method
you don't implement work properly. (A default version ofparent_context
raises a deprecation warning and returnsHashVersion{1}
). Refer to the discussion below about contexts.
To prevent reshaped arrays from having the same hash (stable_hash([1 2; 3 4]) == stable_hash(vec([1 2; 3 4]))
) the hashes for all arrays with more than 1 dimension have
changed.
To support hasing of all tables (Tables.istable(x) == true
), hashes have changed for such
objects when:
- calling
stable_hash(x)
did not previously error x
is not aDataFrame
(these previously errored)x
is not aNamedTuple
of tables columns (these have the same hash as before)x
is not anAbstractArray
ofNamedTuple
rows (these have the same hash as before)x
can be successfully written to an IO buffer viaBase.write
orStableHashTraits.write
(otherwise it previously errored)x
has no specializedstable_hash
method defined for it (otherwise the hash will be the same)
Any such table now uses the method UseTable
, rather than UseWrite
, and so would have the
same hash as a DataFrame
or NamedTuple
with the same column contents instead of its
previous hash value. For example if you had a custom table type MyCustomTable
for which
you only defined a StableHashTraits.write
method and no hash_method
, its hash will be
changed unless you now define hash_method(::MyCustomTable) = UseWrite()
.