BGZFStreams.jl

BGZF Stream
Author BioJulia
Started In
June 2016

BGZFStreams


BGZF is a compression format that supports random access using virtual file offsets.

See the SAM/BAM file format specs for the details of BGZF: https://samtools.github.io/hts-specs/SAMv1.pdf.

Synopsis

using BGZFStreams

# The first argument is a filename or an IO object (e.g. IOStream).
stream = BGZFStream("data.bgz")

# BGZFStream is a subtype of IO and works like a usual IO object.
while !eof(stream)
    byte = read(stream, UInt8)
    # do something...
end

# BGZFStream is also seekable with a VirtualOffset.
seek(stream, VirtualOffset(0, 2))

# The current virtual file offset is available.
virtualoffset(stream)

close(stream)

Usage

The BGZFStreams.jl package exports three types and one function:

  • Types:
    • BGZFStream: an IO stream of the BGZF file format
    • VirtualOffset: data offset in a BGZF file
    • BGZFDataError: an error type thrown when reading a malformed byte stream
  • Function:
    • virtualoffset(stream): returns the current virtual file offset of stream

The BGZFStream type wraps an underlying IO object and transparently inflates (for reading) or deflates (for writing) data. Since it is a subtype of IO, an instance behaves like other IO objects, except that the seek method takes a virtual offset instead of a normal file offset as its second argument.

The VirtualOffset type represents a 64-bit virtual file offset as described in the specification of the SAM file format. The most significant 48 bits hold the byte offset, within the BGZF file, of the start of a BGZF block, and the least significant 16 bits hold the byte offset within that block's uncompressed data.
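The 48/16-bit split can be illustrated with plain integer arithmetic. The following is a minimal sketch only: `pack_voffset` and `unpack_voffset` are hypothetical helper names introduced for illustration and are not part of BGZFStreams.jl, which provides the VirtualOffset type for this purpose.

```julia
# Hypothetical helpers illustrating the virtual offset layout;
# they are not part of the BGZFStreams.jl API.

# Combine a block offset (upper 48 bits) and an in-block offset (lower 16 bits).
function pack_voffset(block_offset::Integer, inblock_offset::Integer)
    0 <= block_offset < (1 << 48) ||
        throw(ArgumentError("block offset must fit in 48 bits"))
    0 <= inblock_offset < (1 << 16) ||
        throw(ArgumentError("in-block offset must fit in 16 bits"))
    return (UInt64(block_offset) << 16) | UInt64(inblock_offset)
end

# Split a 64-bit virtual offset back into (block offset, in-block offset).
unpack_voffset(v::UInt64) = (Int(v >> 16), Int(v & 0xffff))

v = pack_voffset(1024, 2)
block_offset, inblock_offset = unpack_voffset(v)
```

Because the lower field is 16 bits wide, an in-block offset can address at most 65536 bytes of uncompressed data, which is why the helper rejects larger values.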

The BGZFDataError type is a subtype of Exception and is thrown when invalid data are read.

The virtualoffset(stream::BGZFStream) function returns the current virtual file offset. More specifically, it returns the virtual offset of the next byte to be read while reading, and of the next byte to be written while writing.
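One common use of virtualoffset is to remember a position and return to it later with seek. The sketch below uses only the calls shown in the Synopsis; `read_twice` is a hypothetical helper name, and the code assumes the file at `path` is a valid BGZF file containing at least `n` uncompressed bytes.

```julia
using BGZFStreams

# Hypothetical helper: read `n` bytes, seek back to where the read started,
# and read the same `n` bytes again.
function read_twice(path::AbstractString, n::Integer)
    stream = BGZFStream(path)
    try
        # Virtual offset of the next byte to be read.
        start = virtualoffset(stream)
        first_pass = [read(stream, UInt8) for _ in 1:n]

        # seek takes a virtual offset, not a plain byte position.
        seek(stream, start)
        second_pass = [read(stream, UInt8) for _ in 1:n]

        return first_pass, second_pass
    finally
        close(stream)
    end
end
```

Both returned vectors should hold the same bytes, since the second pass starts from the saved virtual offset.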

Parallel Processing

A major bottleneck when processing a BGZF file is inflation and deflation. The throughput of reading data is ~100 MiB/s, which is considerably slower than reading raw bytes from a file. To alleviate this problem, this package supports parallelized inflation when reading compressed data. This requires the multi-threading support introduced in Julia 0.5. The JULIA_NUM_THREADS environment variable sets the number of threads used for processing.

bash-3.2$ JULIA_NUM_THREADS=2 julia -q
julia> using Base.Threads

julia> nthreads()
2

Contributing

We appreciate contributions from users, including reporting bugs, fixing issues, improving performance, and adding new features.

Take a look at the contributing files for detailed contributor and maintainer guidelines, and the code of conduct.

Financial contributions

We also welcome financial contributions in full transparency on our open collective. Anyone can file an expense. If the expense makes sense for development, it will be approved by the core contributors and the person who filed it will be reimbursed.

Backers & Sponsors

Thank you to all our backers and sponsors!

Love our work and community? Become a backer.


Does your company use BioJulia? Help keep BioJulia feature rich and healthy by sponsoring the project. Your logo will show up here with a link to your website.

Questions?

If you have a question about contributing to or using BioJulia software, come and chat with us on the Julia Slack workspace, or try the Bio category of the Julia Discourse site.