This is a fork of BSON.jl, with much better performance for loading composite data types in particular. The 'qs' stands for "quick structs".
BSONqs appears to be between 2-4x faster than BSON in read benchmarks. See the performance section below for more info.
Usage is mostly the same as the original package. You should be able to use this as a drop in replacement, except that you may need to alias the package name.
using BSONqs
const BSON = BSONqs
Currently the load
function does not support some of the more exotic data
types, however you may use load_compat
instead. There are also now two forms
of the parse
function.
parse(x::Union{IO, String})
parse(x::Union{IO, String}, ctx::ParseCtx; mmap=false)
The later provides the best performance in most circumstances and is the least
compatible. It should be noted that load(x; args...) = parse(x, ParseCtx(); args...)
, but that parse(x)
is not the same as load_compat(x)
.
On Linux atleast you can set mmap=true
for better perfomance when loading
from a file.
Finally there is an experimental interface for lazily loading a BSON document one member at a time. This also allows you to specify the type of each document member.
import Mmap
open("a_file.bson") do fio
io = IOBuffer(Mmap.mmap(fio))
doc = Document(io, DefaultMemberType)
# Get a doc member, parsing it as the DefaultMemberType
val = doc[:a_member_key]
# Get a doc member, parsing it as T
val = doc[T, :a_member_key]
# iterate over all the members, parsing them as DefaultMemberType
for (k, v) in doc
...
end
# Load the entire document into a Dict, again using DefaultMemberType
Dict(doc)
end
Lazy or partial loading could be useful if you have a large document with many
large members. I have used the word 'partial', because calling Document
will
scan the input stream and build an index which may not be considered lazy
enough. Note that there is no automatic caching of parsed results.
Specifying a concrete type allows some parsing to be skipped.
This library works best with large, repetitive datasets with concrete types. You should always try to use concrete types in structs in Julia (see the official docs).
This library makes heavy use of @generated
and type specific functions, so
the compilation time is longer than the original. This means that (in theory)
for smaller or highly irregular datasets, the original library may be faster
for the first call to load
where the dataset is small and contains one or
more uncached data types.
Also for generic BSON documents, without any Julia type data, this library may or may not be faster.
If your platform supports it, use mmap
. It is generally faster anyway, but
in this case it also allows us to avoid unecessary copies and
allocations. Below is a comparison of the original library and BSONqs with and
without mmap on a common dataset (on the second pass).
julia> @time BSON.load("vgg19.bson");
0.626661 seconds (1.97 k allocations: 1.071 GiB, 33.95% gc time)
julia> @time BSONqs.load("vgg19.bson");
0.482821 seconds (1.82 k allocations: 1.071 GiB, 42.69% gc time)
julia> @time BSONqs.load("vgg19.bson"; mmap=true);
0.237986 seconds (1.82 k allocations: 548.121 MiB, 26.12% gc time)
In any case, if performance is important to you, you should create a benchmark with your data. Finally note that this library mainly uses the same serialization code as the original, so write performance should be the same (although at the time of writing the original has a bug which makes it much worse). If you are interested in write performance then please let me know.