This package implements the TeaFile format, a row-oriented binary format intended primarily for time-series data.
The primary API is compatible with the Tables.jl interface.
We create the following toy dataset:
using Dates
using DataFrames
x = DataFrame(t=[DateTime(2000), DateTime(2001), DateTime(2002)], a=[1, 2, 3], b=[10.0, 20.0, 30.0])
This produces the following table:
3×3 DataFrame
 Row │ t                    a      b
     │ DateTime             Int64  Float64
─────┼─────────────────────────────────────
   1 │ 2000-01-01T00:00:00      1     10.0
   2 │ 2001-01-01T00:00:00      2     20.0
   3 │ 2002-01-01T00:00:00      3     30.0
To write this to disk, we use TeaFiles.write.
A tea file contains a header with various metadata, including column names and types, which are automatically inferred from the table's schema.
Other supported metadata can be specified with optional arguments to TeaFiles.write.
Note that the first column of DateTime type, if present, is used as the primary index for the tea file.
As such, its values must be non-decreasing to comply with the specification.
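The ordering requirement can be checked before writing. The following is a minimal sketch using only Base and Dates; `is_nondecreasing` is a hypothetical helper name, not part of TeaFiles, and whether TeaFiles.write performs an equivalent check itself is not assumed here.

```julia
using Dates

# Hypothetical helper (not part of TeaFiles): true when no timestamp
# is earlier than the one before it. Repeated timestamps are allowed.
is_nondecreasing(times) = issorted(times)

t = [DateTime(2000), DateTime(2001), DateTime(2002)]

ok_sorted   = is_nondecreasing(t)                                  # true
ok_repeated = is_nondecreasing([DateTime(2000), DateTime(2000)])   # true
ok_reversed = is_nondecreasing(reverse(t))                         # false
```

If the check fails, sorting the table by its time column before writing would restore the required order.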
using TeaFiles
TeaFiles.write("moo.tea", x)
The data can be read back with TeaFiles.read, which returns a Tables-compatible object.
We can pipe this into the DataFrame constructor to get an object that is equal to the original.
TeaFiles.read("moo.tea") |> DataFrame
If there is a time column, it is guaranteed that its values will be non-decreasing.
We can therefore efficiently read a small time interval in a large file by performing a binary search to find the start point.
One can specify this interval as an argument to TeaFiles.read, for example:
y = TeaFiles.read("moo.tea"; lower=DateTime(2001)) |> DataFrame
println(y)
gives:
2×3 DataFrame
 Row │ t                    a      b
     │ DateTime             Int64  Float64
─────┼─────────────────────────────────────
   1 │ 2001-01-01T00:00:00      2     20.0
   2 │ 2002-01-01T00:00:00      3     30.0
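The idea behind the interval read can be illustrated on an in-memory vector. This is only a sketch of the binary-search step using Base's `searchsortedfirst`; it does not show how TeaFiles performs the search on disk, and the variable names are chosen for illustration.

```julia
using Dates

# Sorted (non-decreasing) timestamps, as in the toy dataset.
t = [DateTime(2000), DateTime(2001), DateTime(2002)]
lower = DateTime(2001)

# Binary search for the index of the first timestamp >= lower.
start = searchsortedfirst(t, lower)

# Everything from that index onward falls inside the requested interval.
selected = t[start:end]
```

Because the search is logarithmic in the number of rows, only a small part of a large sorted column needs to be inspected to locate the start point.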
- We define the epoch relative to 0001-01-01. The specification states that the reference is 0000-01-01; however, this appears to be an error. The example given within the specification, and the Python and .NET implementations by DiscreteLogics, are consistent with a reference of 0001-01-01.
- The specification makes no mention of time zones, and therefore we work with time-zone naive DateTime objects in Julia. We recommend storing times in UTC to avoid ambiguities around DST changepoints.
- We do not plan to support the .NET decimal type (type code 0x200 in the standard).
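To make the epoch note concrete, the sketch below computes the Unix epoch as a day count under each candidate reference date. It assumes, for illustration only, that the epoch is expressed as a whole number of days from the reference date; the variable names are not taken from TeaFiles.

```julia
using Dates

unix_epoch = Date(1970, 1, 1)

# Day count relative to 0001-01-01, the reference used by this package.
days_from_year1 = Dates.value(unix_epoch - Date(1, 1, 1))   # 719162

# Day count relative to 0000-01-01, the reference named in the specification.
# Year 0 is a leap year in the proleptic Gregorian calendar, so the two
# counts differ by 366 days.
days_from_year0 = Dates.value(unix_epoch - Date(0, 1, 1))   # 719528

# Round trip: adding the day count to the reference recovers the epoch.
roundtrip = Date(1, 1, 1) + Day(days_from_year1)
```

A file written under one reference and read under the other would therefore have every timestamp shifted by 366 days.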