BGZF is a compression format that supports random access using virtual file offsets.
See the SAM/BAM file format specs for the details of BGZF: https://samtools.github.io/hts-specs/SAMv1.pdf.
using BGZFStreams
# The first argument is a filename or an IO object (e.g. IOStream).
stream = BGZFStream("data.bgz")
# BGZFStream is a subtype of IO and works like a usual IO object.
while !eof(stream)
    byte = read(stream, UInt8)
    # do something...
end
# BGZFStream is also seekable with a VirtualOffset.
seek(stream, VirtualOffset(0, 2))
# The current virtual file offset is available.
virtualoffset(stream)
close(stream)
The BGZFStreams.jl package exports three types and a function to the package user:
- Types:
  - BGZFStream: an IO stream of the BGZF file format
  - VirtualOffset: data offset in a BGZF file
  - BGZFDataError: an error type thrown when reading a malformed byte stream
- Function:
  - virtualoffset(stream): returns the current virtual file offset of stream
The BGZFStream type wraps an underlying IO object and transparently inflates
(for reading) or deflates (for writing) data. Since it is a subtype of IO, an
instance of it behaves like other IO objects, but the seek method takes a
virtual offset instead of a normal file offset as its second argument.
The VirtualOffset type represents a 64-bit virtual file offset as described in
the specification of the SAM file format. That is, the most significant 48 bits
of a virtual offset hold the byte offset, within the compressed BGZF file, of
the beginning of a BGZF block, and the least significant 16 bits hold the byte
offset within that block's uncompressed data.
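The 48/16-bit layout described above can be sketched with plain integer arithmetic. The helpers pack_voffset and unpack_voffset below are hypothetical names introduced only for illustration; they are not part of BGZFStreams.jl, and the package's own VirtualOffset type may store or validate its fields differently.

```julia
# Hypothetical helpers for illustration only; not part of BGZFStreams.jl.

# Pack a block offset in the compressed file (upper 48 bits) and an offset
# within the uncompressed block (lower 16 bits) into one 64-bit value.
function pack_voffset(block_offset::Integer, inblock_offset::Integer)
    0 <= block_offset < (Int64(1) << 48) ||
        throw(ArgumentError("block offset must fit in 48 bits"))
    0 <= inblock_offset < (Int64(1) << 16) ||
        throw(ArgumentError("in-block offset must fit in 16 bits"))
    return (UInt64(block_offset) << 16) | UInt64(inblock_offset)
end

# Split a packed value back into (block offset, in-block offset).
unpack_voffset(v::UInt64) = (v >> 16, v & 0xffff)

# A block starting at byte 4096 of the compressed file, 2 bytes into its
# uncompressed data.
v = pack_voffset(4096, 2)
block_offset, inblock_offset = unpack_voffset(v)
```

Because the block offset occupies the high bits, comparing two packed values as plain integers orders them by position in the file, which is the property that makes virtual offsets convenient as index keys.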
The BGZFDataError type is a subtype of Exception and is thrown when invalid
data are read.
The virtualoffset(stream::BGZFStream) function returns the current virtual file
offset. More specifically, it returns the virtual offset of the next byte to be
read while reading and of the next byte to be written while writing.
A major bottleneck in processing a BGZF file is inflation and deflation. The
throughput of reading data is ~100 MiB/s, which is considerably slower than
reading raw bytes from a file. To alleviate this problem, this package supports
parallelized inflation when reading compressed data. This requires the
multi-threading support introduced in Julia 0.5. The JULIA_NUM_THREADS
environment variable sets the number of threads used for processing.
bash-3.2$ JULIA_NUM_THREADS=2 julia -q
julia> using Base.Threads
julia> nthreads()
2
We appreciate contributions from users including reporting bugs, fixing issues, improving performance and adding new features.
Take a look at the contributing files for detailed contributor and maintainer guidelines and the code of conduct.
We also welcome financial contributions in full transparency on our open collective. Anyone can file an expense. If the expense makes sense for the development, the core contributors will approve it and the person who filed the expense will be reimbursed.
Thank you to all our backers and sponsors!
Love our work and community? Become a backer.
Does your company use BioJulia? Help keep BioJulia feature rich and healthy by sponsoring the project. Your logo will show up here with a link to your website.
If you have a question about contributing or using BioJulia software, come on over and chat with us on the Julia Slack workspace, or try the Bio category of the Julia Discourse site.