SASLib.jl
SASLib is a fast reader for sas7bdat files. The goal is to allow easier integration with SAS processes. Only sas7bdat
format is supported. SASLib is licensed under the MIT Expat license.
Installation
Pkg.add("SASLib")
Read Performance
I did benchmarking mostly on my Macbook Pro laptop. In general, the Julia implementation is somewhere between 10-100x faster than the Python Pandas. Test results are documented in the test/perf_results_<version>
folders.
Latest performance test results for v1.0.0 is as follows:
Test | Result |
---|---|
py_jl_homimp_50.md | 30x faster than Python/Pandas |
py_jl_numeric_1000000_2_100.md | 10x faster than Python/Pandas |
py_jl_productsales_100.md | 50x faster than Python/Pandas |
py_jl_test1_100.md | 120x faster than Python/Pandas |
py_jl_topical_30.md | 30x faster than Python/Pandas |
User Guide
julia> using SASLib
Reading SAS Files
Use the readsas
function to read a SAS7BDAT file.
julia> rs = readsas("productsales.sas7bdat")
Read productsales.sas7bdat with size 1440 x 10 in 0.00256 seconds
SASLib.ResultSet (1440 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 925.0, 850.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-01-01
2: 999.0, 297.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-02-01
3: 608.0, 846.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-03-01
4: 642.0, 533.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-04-01
5: 656.0, 646.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-05-01
⋮
Accessing Results
There are several ways to access the data conveniently without using any third party packages. Each cell value may be retrieved directly via the regular [i,j]
index. Accessing an entire row or column returns a tuple and a vector respectively.
Direct cell access
julia> rs[4,2]
533.0
julia> rs[4, :PREDICT]
533.0
Indexing by row number returns a named tuple
julia> rs[1]
(ACTUAL = 925.0, PREDICT = 850.0, COUNTRY = "CANADA", REGION = "EAST", DIVISION = "EDUCATION", PRODTYPE = "FURNITURE", PRODUCT = "SOFA", QUARTER = 1.0, YEAR = 1993.0, MONTH = 1993-01-01)
Columns access by name via indexing or as a property
julia> rs[:ACTUAL]
1440-element Array{Float64,1}:
925.0
999.0
608.0
⋮
julia> rs.ACTUAL
1440-element Array{Float64,1}:
925.0
999.0
608.0
⋮
Slice a range of rows
julia> rs[2:4]
SASLib.ResultSet (3 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 999.0, 297.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-02-01
2: 608.0, 846.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-03-01
3: 642.0, 533.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-04-01
Slice a subset of columns
julia> rs[:ACTUAL, :PREDICT, :YEAR, :MONTH]
SASLib.ResultSet (1440 rows x 4 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:YEAR, 4:MONTH
1: 925.0, 850.0, 1993.0, 1993-01-01
2: 999.0, 297.0, 1993.0, 1993-02-01
3: 608.0, 846.0, 1993.0, 1993-03-01
4: 642.0, 533.0, 1993.0, 1993-04-01
5: 656.0, 646.0, 1993.0, 1993-05-01
⋮
Mutation
You may assign values at the cell level, causing a side effect in memory:
julia> srs = rs[:ACTUAL, :PREDICT, :YEAR, :MONTH][1:2]
SASLib.ResultSet (2 rows x 4 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:YEAR, 4:MONTH
1: 925.0, 850.0, 1993.0, 1993-01-01
2: 999.0, 297.0, 1993.0, 1993-02-01
julia> srs[2,2] = 3
3
julia> rs[1:2]
SASLib.ResultSet (2 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 925.0, 850.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-01-01
2: 999.0, 3.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-02-01
Iteration
ResultSet implements the usual standard iteration interface, so it's easy to walk through the results:
julia> mean(r.ACTUAL - r.PREDICT for r in rs)
16.695833333333333
Metadata
There are simple functions to retrieve meta information about a ResultSet.
names(rs)
size(rs)
length(rs)
Tables.jl / DataFrame
It may be beneficial to convert the result set to DataFrame for more complex queries and manipulations.
The SASLib.ResultSet
object implements the Tables.jl interface,
so you can directly create a DataFrame as shown below:
julia> df = DataFrame(rs);
julia> first(df, 5)
5×10 DataFrame
│ Row │ ACTUAL │ PREDICT │ COUNTRY │ REGION │ DIVISION │ PRODTYPE │ PRODUCT │ QUARTER │ YEAR │ MONTH │
│ │ Float64 │ Float64 │ String │ String │ String │ String │ String │ Float64 │ Float64 │ Dates…⍰ │
├─────┼─────────┼─────────┼─────────┼────────┼───────────┼───────────┼─────────┼─────────┼─────────┼────────────┤
│ 1 │ 925.0 │ 850.0 │ CANADA │ EAST │ EDUCATION │ FURNITURE │ SOFA │ 1.0 │ 1993.0 │ 1993-01-01 │
│ 2 │ 999.0 │ 297.0 │ CANADA │ EAST │ EDUCATION │ FURNITURE │ SOFA │ 1.0 │ 1993.0 │ 1993-02-01 │
│ 3 │ 608.0 │ 846.0 │ CANADA │ EAST │ EDUCATION │ FURNITURE │ SOFA │ 1.0 │ 1993.0 │ 1993-03-01 │
│ 4 │ 642.0 │ 533.0 │ CANADA │ EAST │ EDUCATION │ FURNITURE │ SOFA │ 2.0 │ 1993.0 │ 1993-04-01 │
│ 5 │ 656.0 │ 646.0 │ CANADA │ EAST │ EDUCATION │ FURNITURE │ SOFA │ 2.0 │ 1993.0 │ 1993-05-01 │
Inclusion/Exclusion of Columns
Column Inclusion
It is always faster to read only the columns that you need. The include_columns
argument comes in handy:
julia> rs = readsas("productsales.sas7bdat", include_columns=[:YEAR, :MONTH, :PRODUCT, :ACTUAL])
Read productsales.sas7bdat with size 1440 x 4 in 0.00151 seconds
SASLib.ResultSet (1440 rows x 4 columns)
Columns 1:ACTUAL, 2:PRODUCT, 3:YEAR, 4:MONTH
1: 925.0, SOFA, 1993.0, 1993-01-01
2: 999.0, SOFA, 1993.0, 1993-02-01
3: 608.0, SOFA, 1993.0, 1993-03-01
4: 642.0, SOFA, 1993.0, 1993-04-01
5: 656.0, SOFA, 1993.0, 1993-05-01
⋮
Column Exclusion
Likewise, you can read all columns except the ones you don't want as specified in exclude_columns
argument:
julia> rs = readsas("productsales.sas7bdat", exclude_columns=[:YEAR, :MONTH, :PRODUCT, :ACTUAL])
Read productsales.sas7bdat with size 1440 x 6 in 0.00265 seconds
SASLib.ResultSet (1440 rows x 6 columns)
Columns 1:PREDICT, 2:COUNTRY, 3:REGION, 4:DIVISION, 5:PRODTYPE, 6:QUARTER
1: 850.0, CANADA, EAST, EDUCATION, FURNITURE, 1.0
2: 297.0, CANADA, EAST, EDUCATION, FURNITURE, 1.0
3: 846.0, CANADA, EAST, EDUCATION, FURNITURE, 1.0
4: 533.0, CANADA, EAST, EDUCATION, FURNITURE, 2.0
5: 646.0, CANADA, EAST, EDUCATION, FURNITURE, 2.0
⋮
Case Sensitivity and Column Number
Column symbols are matched in a case insensitive manner with SAS column names.
Both include_columns
and exclude_columns
accept column number. In fact, you can mixed column symbols and column numbers as such:
julia> readsas("productsales.sas7bdat", include_columns=[:actual, :predict, 8, 9, 10])
Read productsales.sas7bdat with size 1440 x 5 in 0.16378 seconds
SASLib.ResultSet (1440 rows x 5 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:QUARTER, 4:YEAR, 5:MONTH
1: 925.0, 850.0, 1.0, 1993.0, 1993-01-01
2: 999.0, 297.0, 1.0, 1993.0, 1993-02-01
3: 608.0, 846.0, 1.0, 1993.0, 1993-03-01
4: 642.0, 533.0, 2.0, 1993.0, 1993-04-01
5: 656.0, 646.0, 2.0, 1993.0, 1993-05-01
⋮
Incremental Reading
If you need to read files incrementally, you can use the SASLib.open
function to obtain a handle of the file. Then, use the SASLib.read
function to fetch a number of rows. Remember to close the handler with SASLib.close
to avoid memory leak.
julia> handler = SASLib.open("productsales.sas7bdat")
SASLib.Handler[productsales.sas7bdat]
julia> rs = SASLib.read(handler, 2)
Read productsales.sas7bdat with size 2 x 10 in 0.06831 seconds
SASLib.ResultSet (2 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 925.0, 850.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-01-01
2: 999.0, 297.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-02-01
julia> rs = SASLib.read(handler, 3)
Read productsales.sas7bdat with size 3 x 10 in 0.00046 seconds
SASLib.ResultSet (3 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 608.0, 846.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-03-01
2: 642.0, 533.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-04-01
3: 656.0, 646.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-05-01
julia> SASLib.close(handler)
Note that there is no facility at the moment to jump and read a subset of rows.
SASLib always read from the beginning.
String Column Constructor
By default, string columns are read into a special AbstractArray structure called ObjectPool
in order to conserve memory space that might otherwise be wasted for duplicate string values. SASLib tries to be smart -- when it encounters too many unique values (> 10%) in a large array (> 2000 rows), it falls back to a regular Julia array.
You can use a different array type (e.g. CategoricalArray or PooledArray) for any columns as you wish by specifying a string_array_fn
parameter when reading the file. This argument must be a Dict that maps a column symbol into a function that takes an integer argument and returns any array of that size.
Here's the normal case:
julia> rs = readsas("productsales.sas7bdat", include_columns=[:COUNTRY, :REGION]);
Read productsales.sas7bdat with size 1440 x 2 in 0.00193 seconds
julia> typeof.(columns(rs))
2-element Array{DataType,1}:
SASLib.ObjectPool{String,UInt16}
SASLib.ObjectPool{String,UInt16}
If you really want a regular String array, you can force SASLib to do so as such:
julia> rs = readsas("productsales.sas7bdat", include_columns=[:COUNTRY, :REGION],
string_array_fn=Dict(:COUNTRY => (n)->fill("",n)));
Read productsales.sas7bdat with size 1440 x 2 in 0.00333 seconds
julia> typeof.(columns(rs))
2-element Array{DataType,1}:
Array{String,1}
SASLib.ObjectPool{String,UInt16}
For convenience, SASLib.REGULAR_STR_ARRAY
is a function that does exactly that. In addition, if you need all columns to be configured the same then the key of the string_array_fn
dict may be just the symbol :_all_
.
julia> rs = readsas("productsales.sas7bdat", include_columns=[:COUNTRY, :REGION],
string_array_fn=Dict(:_all_ => REGULAR_STR_ARRAY));
Read productsales.sas7bdat with size 1440 x 2 in 0.00063 seconds
julia> typeof.(columns(rs))
2-element Array{DataType,1}:
Array{String,1}
Array{String,1}
Numeric Columns Constructor
In general, SASLib allocates native arrays when returning numerical column data. However, you can provide a custom constructor so you would be able to either pre-allcoate the array or construct a different type of array. The number_array_fn
parameter is a Dict
that maps column symbols to the custom constructors. Similar to string_array_fn
, this Dict may be specified with a special symbol :_all_
to indicate such constructor be used for all numeric columns.
Example - create SharedArray
:
julia> rs = readsas("productsales.sas7bdat", include_columns=[:ACTUAL,:PREDICT],
number_array_fn=Dict(:ACTUAL => (n)->SharedArray{Float64}(n)));
Read productsales.sas7bdat with size 1440 x 2 in 0.00385 seconds
julia> typeof.(columns(rs))
2-element Array{DataType,1}:
SharedArray{Float64,1}
Array{Float64,1}
Example - preallocate arrays:
julia> A = zeros(1440, 2);
julia> f1(n) = @view A[:, 1];
julia> f2(n) = @view A[:, 2];
julia> readsas("productsales.sas7bdat", include_columns=[:ACTUAL,:PREDICT],
number_array_fn=Dict(:ACTUAL => f1, :PREDICT => f2));
Read productsales.sas7bdat with size 1440 x 2 in 0.00041 seconds
julia> A[1:5,:]
5×2 Array{Float64,2}:
925.0 850.0
999.0 297.0
608.0 846.0
642.0 533.0
656.0 646.0
Column Type Conversion
Often, you want a column to be an integer but the SAS7BDAT stores everything as Float64. Specifying the column_type
argument does the conversion for you.
julia> rs = readsas("productsales.sas7bdat", column_types=Dict(:ACTUAL=>Int))
Read productsales.sas7bdat with size 1440 x 10 in 0.08043 seconds
SASLib.ResultSet (1440 rows x 10 columns)
Columns 1:ACTUAL, 2:PREDICT, 3:COUNTRY, 4:REGION, 5:DIVISION, 6:PRODTYPE, 7:PRODUCT, 8:QUARTER, 9:YEAR, 10:MONTH
1: 925, 850.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-01-01
2: 999, 297.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-02-01
3: 608, 846.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 1.0, 1993.0, 1993-03-01
4: 642, 533.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-04-01
5: 656, 646.0, CANADA, EAST, EDUCATION, FURNITURE, SOFA, 2.0, 1993.0, 1993-05-01
julia> typeof(rs[:ACTUAL])
Array{Int64,1}
File Metadata
You may obtain meta data for a SAS data file using the metadata
function.
julia> md = metadata("productsales.sas7bdat")
File: productsales.sas7bdat (1440 x 10)
1:ACTUAL(Float64) 5:DIVISION(String) 9:YEAR(Float64)
2:PREDICT(Float64) 6:PRODTYPE(String) 10:MONTH(Date/Missings.Missing)
3:COUNTRY(String) 7:PRODUCT(String)
4:REGION(String) 8:QUARTER(Float64)
It's OK to access the fields directly.
julia> fieldnames(SASLib.Metadata)
9-element Array{Symbol,1}:
:filename
:encoding
:endianness
:compression
:pagesize
:npages
:nrows
:ncols
:columnsinfo
julia> md = metadata("test3.sas7bdat");
julia> md.compression
:RDC
Related Packages
ReadStat.jl uses the ReadStat C-library. However, ReadStat-C does not support reading RDC-compressed binary files.
StatFiles.jl is a higher-level package built on top of ReadStat.jl and implements the FileIO interface.
Python Pandas package has an implementation of SAS file reader that SASLib borrows heavily from.
Credits
- Jared Hobbs, the author of the SAS reader code from Pandas. See LICENSE_SAS7BDAT.md.
- Evan Miller, the author of ReadStat C/C++ library. See LICENSE_READSTAT.md.
- David Anthoff, who provided many valuable ideas at the early stage of development.
- Tyler Beason
- susabi
I also want to thank all the active members at the Julia Discourse community. This project wouldn't be possible without all the help I got from the community. That's the beauty of open-source development.