Impute

Impute.jl provides various methods for handling missing data in Vectors, Matrices and Tables.

Installation

julia> using Pkg; Pkg.add("Impute")

Quickstart

Let's start by loading our dependencies:

julia> using DataFrames, Impute

We'll also want some test data containing missings to work with:

julia> df = Impute.dataset("test/table/neuro") |> DataFrame
469×6 DataFrame
 Row │ V1         V2         V3       V4        V5         V6
     │ Float64?   Float64?   Float64  Float64?  Float64?   Float64?
─────┼───────────────────────────────────────────────────────────────
   1 │ missing       -203.7    -84.1      18.5  missing    missing
   2 │ missing       -203.0    -97.8      25.8      134.7  missing
   3 │ missing       -249.0    -92.1      27.8      177.1  missing
   4 │ missing       -231.5    -97.5      27.0      150.3  missing
   5 │ missing    missing     -130.1      25.8      160.0  missing
   6 │ missing       -223.1    -70.7      62.1      197.5  missing
   7 │ missing       -164.8    -12.2      76.8      202.8  missing
   8 │ missing       -221.6    -81.9      27.5      144.5  missing
  ⋮  │     ⋮          ⋮         ⋮        ⋮          ⋮          ⋮
 463 │    -242.6     -142.0    -21.8      69.8      148.7  missing
 464 │    -235.9     -128.8    -33.1      68.8      177.1  missing
 465 │ missing       -140.8    -38.7      58.1      186.3  missing
 466 │ missing       -149.5    -40.3      62.8      139.7      242.5
 467 │    -247.6     -157.8    -53.3      28.3      122.9      227.6
 468 │ missing       -154.9    -50.8      28.1      119.9      201.1
 469 │ missing       -180.7    -70.9      33.7      114.8      222.5
                                                     454 rows omitted

Our first instinct might be to drop every observation (row) that contains a missing value, but this leaves us with too few rows to work with:

julia> Impute.filter(df; dims=:rows)
4×6 DataFrame
 Row │ V1       V2       V3       V4       V5       V6
     │ Float64  Float64  Float64  Float64  Float64  Float64
─────┼──────────────────────────────────────────────────────
   1 │  -247.0   -132.2    -18.8     28.2     81.4    237.9
   2 │  -234.0   -140.8    -56.5     28.0    114.3    222.9
   3 │  -215.8   -114.8    -18.4     65.3    171.6    249.7
   4 │  -247.6   -157.8    -53.3     28.3    122.9    227.6
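
Before dropping rows, it can help to check how many complete observations there are. The following is a minimal sketch on a made-up table (the `toy` DataFrame and its values are hypothetical and not part of the neuro dataset), using `completecases` from DataFrames alongside the same `Impute.filter` call shown above:

```julia
using DataFrames, Impute

# Hypothetical toy table for illustration only.
toy = DataFrame(a = [1.0, missing, 3.0, 4.0], b = [missing, 2.0, 3.0, 4.0])

# Number of rows that contain no missing values.
n_complete = count(completecases(toy))

# Drop every row that contains at least one missing value.
filtered = Impute.filter(toy; dims=:rows)
```

If `n_complete` is small relative to the size of the table, as with the 4 of 469 rows above, filtering is probably the wrong tool and imputation is worth considering.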

We could try imputing the values with linear interpolation, but that still leaves missing data at the head and tail of our dataset:

julia> Impute.interp(df)
469×6 DataFrame
 Row │ V1           V2         V3       V4        V5         V6
     │ Float64?     Float64?   Float64  Float64?  Float64?   Float64?
─────┼───────────────────────────────────────────────────────────────────
   1 │ missing        -203.7     -84.1      18.5  missing    missing
   2 │ missing        -203.0     -97.8      25.8      134.7  missing
   3 │ missing        -249.0     -92.1      27.8      177.1  missing
   4 │ missing        -231.5     -97.5      27.0      150.3  missing
   5 │ missing        -227.3    -130.1      25.8      160.0  missing
   6 │ missing        -223.1     -70.7      62.1      197.5  missing
   7 │ missing        -164.8     -12.2      76.8      202.8  missing
   8 │ missing        -221.6     -81.9      27.5      144.5  missing
  ⋮  │      ⋮           ⋮         ⋮        ⋮          ⋮          ⋮
 463 │    -242.6      -142.0     -21.8      69.8      148.7      224.125
 464 │    -235.9      -128.8     -33.1      68.8      177.1      230.25
 465 │    -239.8      -140.8     -38.7      58.1      186.3      236.375
 466 │    -243.7      -149.5     -40.3      62.8      139.7      242.5
 467 │    -247.6      -157.8     -53.3      28.3      122.9      227.6
 468 │ missing        -154.9     -50.8      28.1      119.9      201.1
 469 │ missing        -180.7     -70.9      33.7      114.8      222.5
                                                         454 rows omitted
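
The same behaviour is easier to see on a short vector. This is a small sketch with made-up values (the vector `v` is hypothetical): interior gaps are filled from their two neighbours, while leading and trailing missings have only one neighbour and are left alone.

```julia
using Impute

# Hypothetical vector with a leading, an interior, and a trailing missing.
v = [missing, 1.0, missing, 3.0, missing]

# Only the interior gap between 1.0 and 3.0 can be interpolated.
interpolated = Impute.interp(v)
```

Here the middle element becomes 2.0, and the first and last elements stay `missing`.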

Finally, we can chain multiple simple methods together to give a complete dataset:

julia> Impute.interp(df) |> Impute.locf() |> Impute.nocb()
469×6 DataFrame
 Row │ V1        V2         V3       V4        V5        V6
     │ Float64?  Float64?   Float64  Float64?  Float64?  Float64?
─────┼────────────────────────────────────────────────────────────
   1 │  -233.6      -203.7     -84.1      18.5     134.7   222.7
   2 │  -233.6      -203.0     -97.8      25.8     134.7   222.7
   3 │  -233.6      -249.0     -92.1      27.8     177.1   222.7
   4 │  -233.6      -231.5     -97.5      27.0     150.3   222.7
   5 │  -233.6      -227.3    -130.1      25.8     160.0   222.7
   6 │  -233.6      -223.1     -70.7      62.1     197.5   222.7
   7 │  -233.6      -164.8     -12.2      76.8     202.8   222.7
   8 │  -233.6      -221.6     -81.9      27.5     144.5   222.7
  ⋮  │    ⋮           ⋮         ⋮        ⋮         ⋮        ⋮
 463 │  -242.6      -142.0     -21.8      69.8     148.7   224.125
 464 │  -235.9      -128.8     -33.1      68.8     177.1   230.25
 465 │  -239.8      -140.8     -38.7      58.1     186.3   236.375
 466 │  -243.7      -149.5     -40.3      62.8     139.7   242.5
 467 │  -247.6      -157.8     -53.3      28.3     122.9   227.6
 468 │  -247.6      -154.9     -50.8      28.1     119.9   201.1
 469 │  -247.6      -180.7     -70.9      33.7     114.8   222.5
                                                  454 rows omitted
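
To see what each stage of the chain contributes, here is a sketch on a made-up vector (the vector `v` is hypothetical). Interpolation fills interior gaps, `locf` (last observation carried forward) then fills the trailing missings, and `nocb` (next observation carried backward) fills the leading ones.

```julia
using Impute

# Hypothetical vector with a leading, an interior, and a trailing missing.
v = [missing, 1.0, missing, 3.0, missing]

step1 = Impute.interp(v)   # fills the interior gap only
step2 = Impute.locf(step1) # carries 3.0 forward into the tail
step3 = Impute.nocb(step2) # carries 1.0 backward into the head

# Equivalent chained form, as used in the example above.
chained = Impute.interp(v) |> Impute.locf() |> Impute.nocb()
```

The order matters: running `locf` before `interp` would fill the interior gap with a repeated value and leave nothing to interpolate.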

Warning:

  • Your approach should depend on the properties of your data (e.g., MCAR, MAR, MNAR).
  • In-place calls aren't guaranteed to mutate the original data, but they will try to avoid copying where possible. In the future, it may be possible to detect whether in-place operations are permitted on an array or table using traits: