DLMReader.jl

High-performance delimited-file reader and writer for Julia
Author sl-solution
Started In
August 2021

DLMReader

An efficient multi-threaded package for reading and writing delimited files. It is designed as a file parser for InMemoryDatasets.jl.

DLMReader reads and writes AbstractDatasets types only; other types must be converted to or from AbstractDatasets.

It works very well for huge files (long and/or wide).

DLMReader does not guess the delimiter: if it is not a comma (,), it must be passed via the delimiter keyword argument. By default, DLMReader assumes that strings are not quoted; if they are quoted, the user must pass the quote character via the quotechar keyword argument.
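To illustrate why the quote character matters, here is a minimal sketch in plain Julia. It is not DLMReader's implementation: the helper split_fields is hypothetical, and its keyword names only mirror the delimiter and quotechar keywords described above.

```julia
# Hypothetical helper (not part of DLMReader): split one line into fields,
# optionally honoring a quote character.
function split_fields(line::AbstractString; delimiter::Char = ',',
                      quotechar::Union{Char,Nothing} = nothing)
    fields = String[]
    buf = IOBuffer()
    inquote = false
    for ch in line
        if quotechar !== nothing && ch == quotechar
            inquote = !inquote                 # toggle quoted state, drop the quote
        elseif ch == delimiter && !inquote
            push!(fields, String(take!(buf)))  # field boundary outside quotes
        else
            write(buf, ch)
        end
    end
    push!(fields, String(take!(buf)))          # last field
    return fields
end

line = "1,\"Doe, Jane\",3"
unquoted = split_fields(line)                   # quotes treated as ordinary text
quoted   = split_fields(line; quotechar = '"')  # comma inside quotes is kept
```

Without a quote character the comma inside the quoted name is treated as a field boundary, which is why a reader that does not guess needs to be told about quoting.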

Documentation

Features

DLMReader.jl has some features that distinguish it from other packages for reading delimited files. A few of them are listed below:

  • Informats: DLMReader uses informats to call a class of functions on the raw text before parsing its value(s). This provides a flexible and extensible way to parse values with special patterns. For instance, the predefined informat COMMA! lets users read a numeric column that contains thousands separators and/or the dollar sign, e.g. with this informat the raw text "$12,000.00" is parsed as "12000.00". Informats also support function composition, e.g. COMMA! ∘ ACC! parses "$(12,000.00)" as "-12000.00", i.e. ACC! is applied first and COMMA! is then applied to its result.

    • Additionally, informats can be applied to the whole line before individual values are processed.

  • Fixed-width text: If users pass the column locations via the fixed keyword argument, the package reads those columns in fixed-width format. For instance, passing fixed = Dict(1=>1:1, 2=>2:2) parses "10" as "[1,0]". Mixing fixed-width and delimited formats is also allowed.

  • Multiple observations per line: The package allows reading more than one observation per line. This can be done by passing the multiple_obs = true keyword argument. The multithreading feature (plus some other features) will be switched off if this option is set.

  • Fast file writer: DLMReader exploits the byrow function from InMemoryDatasets.jl to write delimited files to disk. This enables DLMReader to convert values to strings using multiple threads.

  • Alternative delimiters: Users can pass a vector of delimiters to the function. In this case, filereader treats any of the passed delimiters as a field delimiter.

  • Multiple date formats: Users can pass different date formats for different columns.

  • Different integer bases: DLMReader allows users to pass the integer base when parsing integers if it is different from 10.

  • String as delimiter: Users can pass a string as the delimiter between values. This must be passed via the dlmstr keyword argument.

  • Informative warnings/info: If something goes wrong during the reading phase, the package provides detailed warnings/info to help the user investigate the issue.
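As an illustration of the fixed-width item in the list above, the following plain-Julia sketch cuts a line into columns by position. The helper read_fixed is hypothetical and only imitates the idea behind the fixed keyword; it assumes ASCII input (so that character positions equal byte indices) and integer columns, and it says nothing about how DLMReader does this internally.

```julia
# Hypothetical helper (not part of DLMReader): read integer columns from
# fixed positions of a line. `fixed` maps column number => character range.
function read_fixed(line::AbstractString, fixed::Dict{Int,UnitRange{Int}})
    cols = sort(collect(keys(fixed)))   # process columns in order 1, 2, ...
    return [parse(Int, strip(line[fixed[c]])) for c in cols]
end

digits_10 = read_fixed("10", Dict(1 => 1:1, 2 => 2:2))       # the "10" example
padded    = read_fixed("33  57", Dict(1 => 1:4, 2 => 5:6))   # padded first column
```

The first call mirrors the "10" example from the list: each one-character range becomes its own column.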

Benchmarks

See here for some benchmarks.

Examples

The following files are used in the examples. It is assumed that they are located in the current working directory.

ex1.csv

a, b, c
1,2,NA
2,3,2001-1-2
2,4,2020-4-2
1,2,2000-12-1

ex2.csv

a::b::C::DD
12::1345::15::15
12::13::15::15
12::13::15::15
12::13::15::15
12::13::15::15
12::13::15::15
12::13::15::15
12::13::::15
12::13::15::15
12::13::15::157

ex3.csv

1
2
4;5
6
8;9
1
4;

ex4.csv

1   3,5
2   4,6
33  5,7

ex5.csv

x1;x2:x3,x4
1;2;123;3
2;4,4,5

ex6.csv

id1 $2,000,000 3
id2 $34,000 4
id3 $200,000 1

And the code to read them into Julia:

julia> using DLMReader
julia> filereader("ex1.csv", dtformat = Dict(3 => dateformat"y-m-d"))
julia> filereader("ex2.csv", dlmstr = "::")
julia> filereader("ex3.csv", types = [Int, Int, Int], header = false, linebreak = ';', delimiter = '\n')
julia> filereader("ex4.csv", fixed = Dict(1 => 1:4), header = false)
julia> filereader("ex5.csv", delimiter = [';', ':', ','])
julia> filereader("ex6.csv", delimiter = ' ', informat = Dict(2=>COMMA!), header = [:ID, :price, :quarter])

COMMA! is a built-in informat that removes commas from numbers. If the number contains dollar or sterling signs, it removes them as well. The trimmed text is then sent to the parser to be converted to a number.
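The behaviour described here can be imitated with ordinary string functions. The sketch below is an illustration only: comma_sketch and acc_sketch are hypothetical stand-ins written for this example. They are not the package's COMMA! and ACC! informats, and they operate on plain Strings rather than on the reader's raw text.

```julia
# Hypothetical stand-in for the described effect of COMMA!:
# drop commas and dollar/sterling signs.
comma_sketch(s::AbstractString) = filter(c -> !(c in (',', '$', '£')), s)

# Hypothetical stand-in for the described effect of ACC!:
# accounting-style parentheses mean a negative number.
function acc_sketch(s::AbstractString)
    if occursin('(', s) && occursin(')', s)
        return "-" * filter(c -> !(c in ('(', ')')), s)
    end
    return String(s)
end

# Composition: acc_sketch runs first, comma_sketch runs on its result.
clean = comma_sketch ∘ acc_sketch

# raw"..." avoids Julia's string interpolation of the `$` character.
plain    = comma_sketch(raw"$12,000.00")
negative = clean(raw"$(12,000.00)")
amount   = parse(Float64, negative)   # only now is the text converted to a number
```

The point of the design is that cleaning and parsing stay separate: the informat only rewrites text, and the usual number parser does the conversion.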

Extra examples

julia> filereader(IOBuffer("1,2,3,4,5\n6,7,8\n10\n"),
                  header = [:x1, :x2],
                  types = [Int, Int],
                  multiple_obs = true)
5×2 Dataset
 Row │ x1        x2
     │ identity  identity
     │ Int64?    Int64?
─────┼────────────────────
   1 │        1         2
   2 │        3         4
   3 │        5         6
   4 │        7         8
   5 │       10   missing
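One way to picture what multiple_obs = true does in the example above is to treat the input as a single stream of values and cut it into rows of two. The following sketch, in plain Julia without DLMReader, reproduces that regrouping. The helper regroup is hypothetical and is not how the package is implemented.

```julia
# Hypothetical helper (not part of DLMReader): read all values as one stream
# and regroup them into observations of `ncol` values each.
function regroup(text::String, ncol::Int; delimiter::Char = ',')
    vals = Union{Int,Missing}[]
    for line in eachline(IOBuffer(text)), field in split(line, delimiter)
        push!(vals, parse(Int, field))
    end
    # Pad the last observation with `missing` if the stream runs out early.
    while length(vals) % ncol != 0
        push!(vals, missing)
    end
    return [vals[i:i+ncol-1] for i in 1:ncol:length(vals)]
end

rows = regroup("1,2,3,4,5\n6,7,8\n10\n", 2)
```

Because line boundaries no longer mark the end of an observation, the input cannot simply be divided among threads by line, which is consistent with multithreading being switched off for this option.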

julia> filereader(IOBuffer(""" name1 name2 avg1 avg2  y
              0   A   D   75   5    32
              1   A   D   75   5    32
              2   D   L   32   7    12
              3   F   C   99   8    42
              4   F   C   99   8    42
              5   C   A   43   6    39
              6   C   A   43   6    39
              7   L   R   53   3    11
              8   R   F   21   2    25
              9   R   F   21   2    25
              """), delimiter = ' ', ignorerepeated = true, emptycolname = true)
10×6 Dataset
 Row │ NONAME1   name1     name2     avg1      avg2      y
     │ identity  identity  identity  identity  identity  identity
     │ Int64?    String?   String?   Int64?    Int64?    Int64?
─────┼────────────────────────────────────────────────────────────
   1 │        0  A         D               75         5        32
   2 │        1  A         D               75         5        32
   3 │        2  D         L               32         7        12
   4 │        3  F         C               99         8        42
   5 │        4  F         C               99         8        42
   6 │        5  C         A               43         6        39
   7 │        6  C         A               43         6        39
   8 │        7  L         R               53         3        11
   9 │        8  R         F               21         2        25
  10 │        9  R         F               21         2        25
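The effect of ignorerepeated = true in the example above can be pictured with Base's split function. This sketch assumes the keyword means that a run of consecutive delimiters counts as a single delimiter, which the keyword name and the output above suggest but this README does not state. It uses only plain Julia, not DLMReader.

```julia
line = "2   D   L   32   7    12"

# Every single space is a delimiter: runs of spaces produce empty fields.
every_space = split(line, ' ')

# Runs of spaces collapse into one delimiter: only the six values remain.
collapsed = split(line, ' '; keepempty = false)
```

With space-aligned files such as this one, collapsing repeated delimiters is what makes each line yield exactly one field per column.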
