Parquet.jl

Julia implementation of parquet columnar file format reader
Popularity
32 Stars
Updated Last
6 Months Ago
Started In
January 2016

Parquet

Build Status Build status Coverage Status

Reader

Load a parquet file. Only metadata is read initially, data is loaded in chunks on demand. (Note: ParquetFiles.jl also provides load support for Parquet files under the FileIO.jl package.)

ParFile represents a Parquet file at path open for reading. Options to map logical types can be provided via map_logical_types.

ParFile(path; map_logical_types) => ParFile

map_logical_types can be one of:

  • false: no mapping is done (default)
  • true: default mappings are attempted on all columns (bytearray => String, int96 => DateTime)
  • A user supplied dict mapping column names to a tuple of type and a converter function

ParFile also keeps a handle to the open file and the file metadata and also holds a LRU cache of raw bytes of the pages read. If the parquet file references other files in its metadata, they will be opened as and when required for reading and closed when they are not needed anymore.

The close method closes the reader, releases open files and makes cached internal data structures available for GC. A ParFile instance must not be used once closed.

julia> using Parquet

julia> parfile = "customer.impala.parquet";

julia> p = ParFile(parfile; map_logical_types=true)
Parquet file: customer.impala.parquet
    version: 1
    nrows: 150000
    created by: impala version 1.2-INTERNAL (build a462ec42e550c75fccbff98c720f37f3ee9d55a3)
    cached: 0 column chunks

Examine the schema.

julia> nrows(p)
150000

julia> ncols(p)
8

julia> colnames(p)
8-element Array{Array{String,1},1}:
 ["c_custkey"]
 ["c_name"]
 ["c_address"]
 ["c_nationkey"]
 ["c_phone"]
 ["c_acctbal"]
 ["c_mktsegment"]
 ["c_comment"]

julia> schema(p)
Schema:
    schema {
      optional INT64 c_custkey
      optional BYTE_ARRAY c_name
      optional BYTE_ARRAY c_address
      optional INT32 c_nationkey
      optional BYTE_ARRAY c_phone
      optional DOUBLE c_acctbal
      optional BYTE_ARRAY c_mktsegment
      optional BYTE_ARRAY c_comment
    }

Create cursor to iterate over batches of column values. Each iteration returns a named tuple of column names with batch of column values. One batch corresponds to one row group of the parquet file.

julia> cc = Parquet.BatchedColumnsCursor(par)
Batched Columns Cursor on customer.impala.parquet
    rows: 1:150000
    batches: 1
    cols: c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment

julia> batchvals, state = iterate(cc);

julia> propertynames(batchvals)
(:c_custkey, :c_name, :c_address, :c_nationkey, :c_phone, :c_acctbal, :c_mktsegment, :c_comment)

julia> length(batchvals.c_name)
150000

julia> batchvals.c_name[1:5]
5-element Array{Union{Missing, String},1}:
 "Customer#000000001"
 "Customer#000000002"
 "Customer#000000003"
 "Customer#000000004"
 "Customer#000000005"

Create cursor to iterate over records. In parallel mode, multiple remote cursors can be created and iterated on in parallel.

julia> rc = RecordCursor(p)
Record Cursor on customer.impala.parquet
    rows: 1:150000
    cols: c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment

julia> records = collect(rc);

julia> length(records)
150000

julia> first_record = first(records);

julia> isa(first_record, NamedTuple)
true

julia> propertynames(first_record)
(:c_custkey, :c_name, :c_address, :c_nationkey, :c_phone, :c_acctbal, :c_mktsegment, :c_comment)

julia> first_record.c_custkey
1

julia> first_record.c_name
"Customer#000000001"

julia> first_record.c_address
"IVhzIApeRb ot,c,E"

The reader will interpret logical types based on the map_logical_types provided. The following logical type mapping methods are available in the Parquet package and are applied by default if map_logical_types is set to true.

  • logical_timestamp(v; offset::Dates.Period=Dates.Second(0)): Applicable for timestamps that are INT96 values. Without this they are represented in a Int128 type. With this they are converted to DateTime types.
  • logical_string(v): Applicable for strings that are BYTE_ARRAYvalues. Without this, they are represented in aVector{UInt8}type. With this they are converted toString` types.

Variants of these methods or custom methods can also be applied by caller.

Writer

You can write any Tables.jl column-accessible table that contains columns of these types and their union with Missing: Int32, Int64, String, Bool, Float32, Float64.

However, CategoricalArrays are not yet supported. Furthermore, these types are not yet supported: Int96, Int128, Date, and DateTime.

Writer Example

tbl = (
    int32 = Int32.(1:1000),
    int64 = Int64.(1:1000),
    float32 = Float32.(1:1000),
    float64 = Float64.(1:1000),
    bool = rand(Bool, 1000),
    string = [randstring(8) for i in 1:1000],
    int32m = rand([missing, 1:100...], 1000),
    int64m = rand([missing, 1:100...], 1000),
    float32m = rand([missing, Float32.(1:100)...], 1000),
    float64m = rand([missing, Float64.(1:100)...], 1000),
    boolm = rand([missing, true, false], 1000),
    stringm = rand([missing, "abc", "def", "ghi"], 1000)
)

file = tempname()*".parquet"
write_parquet(file, tbl)