# MLPreprocessing

Package Status |
Package Evaluator |
Build Status |
---|---|---|

## Overview

Utility package that provides end user friendly methods for feature scalings and polynomial
basis expansion. Feature scalings work on `Matrix`

, `Vector`

and `DataFrames`

. It is possible to
have observations stored as columns or rows of a matrix. In order to distinguish between these cases
one can provide the parameter `obsdim`

, where `obsdim=1`

corresponds to "observations as rows" and
`obsdim=2`

to "observations as columns". Transformations can be computed on a subset
of columns/rows by defining a vector `operate_on`

.

### StandardScaler

Standardization of data sets result in variables with a mean of 0 and variance of 1.
A common use case would be to fit a `StandardScaler`

to the training data and later
apply the same transformation to the test data. `StandardScaler`

is used with the
functions `fit()`

, `transform()`

and `fit_transform()`

as shown below.

```
fit(StandardScaler, X[, μ, σ; obsdim, operate_on])
fit_transform(StandardScaler, X[, μ, σ; obsdim, operate_on])
```

`X`

: Data of type Matrix or `DataFrame`

.

`μ`

: Vector or scalar describing the translation.
Defaults to mean(X; dims=obsdim)

`σ`

: Vector or scalar describing the scale.
Defaults to std(X; dims=obsdim)

`obsdim`

: Specify which axis corresponds to observations.
Defaults to obsdim=2 (observations are columns of matrix)
For DataFrames `obsdim`

is obsolete and rescaling occurs
column wise.

`operate_on`

: Specify the indices of columns or rows to be centered.
Defaults to all columns/rows.
For DataFrames this must be a vector of symbols, not indices.
E.g. `operate_on`

=[1,3] will perform centering on columns
with index 1 and 3 only (if obsdim=1, else rows 1 and 3)

Note on DataFrames:
Columns containing `missing`

values are skipped.
Columns containing non numeric elements are skipped.

Examples:

```
Xtrain = rand(100, 4)
Xtest = rand(10, 4)
x = rand(4)
Dtrain = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10])
Dtest = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10])
scaler = fit(StandardScaler, Xtrain)
scaler = fit(StandardScaler, Xtrain, obsdim=1)
scaler = fit(StandardScaler, Xtrain, obsdim=1, operate_on=[1,3])
transform(Xtest, scaler)
transform!(Xtest, scaler)
transform(x, scaler)
transform!(x, scaler)
scaler = fit(StandardScaler, Dtrain)
scaler = fit(StandardScaler, Dtrain, operate_on=[:A,:B])
transform(Dtest, scaler)
transform!(Dtest, scaler)
Xscaled, scaler = fit_transform(StandardScaler, X, obsdim=1, operate_on=[1,2,4])
scaler = fit_transform!(StandardScaler, X, obsdim=1, operate_on=[1,2,4])
```

Note that for `transform!`

the data matrix `X`

has to be of type <: AbstractFloat
as the scaling occurs inplace. (E.g. cannot be of type Matrix{Int64}). This is not
the case for `transform`

however.
For `DataFrames`

`transform!`

can be used on columns of type <: Integer.

### FixedRangeScaler

`FixedRangeScaler`

is used with the functions `fit()`

, `transform()`

and `fit_transform()`

to scale data in a Matrix `X`

or DataFrame to a fixed range [lower:upper].
After fitting a `FixedRangeScaler`

to one data set, it can be used to perform the same
transformation to a new set of data. E.g. fit the `FixedRangeScaler`

to your training
data and then apply the scaling to the test data at a later stage. (See examples below).

```
fit(FixedRangeScaler, X[, lower, upper; obsdim, operate_on])
fit_transform(FixedRangeScaler, X[, lower, upper; obsdim, operate_on])
```

`X`

: Data of type Matrix or `DataFrame`

.

`lower`

: (Scalar) Lower limit of new range.
Defaults to 0.

`upper`

: (Scalar) Upper limit of new range.
Defaults to 1.

`obsdim`

: Specify which axis corresponds to observations.
Defaults to obsdim=2 (observations are columns of matrix)
For DataFrames `obsdim`

is obsolete and rescaling occurs
column wise.

`operate_on`

: Specify the indices of columns or rows to be centered.
Defaults to all columns/rows.
For DataFrames this must be a vector of symbols, not indices.
E.g. `operate_on`

=[1,3] will perform centering on columns
with index 1 and 3 only (if obsdim=1, else rows 1 and 3)

Note on DataFrames:
Columns containing `NA`

values are skipped.
Columns containing non numeric elements are skipped.

Examples:

```
Xtrain = rand(100, 4)
Xtest = rand(10, 4)
x = rand(10)
D = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10])
scaler = fit(FixedRangeScaler, Xtrain)
scaler = fit(FixedRangeScaler, Xtrain, -1, 1)
scaler = fit(FixedRangeScaler, Xtrain, -1, 1, obsdim=1)
scaler = fit(FixedRangeScaler, Xtrain, -1, 1, obsdim=1, operate_on=[1,3])
scaler = fit(FixedRangeScaler, D, -1, 1, operate_on=[:A,:B])
Xscaled = transform(Xtest, scaler)
transform!(Xtest, scaler)
Xscaled, scaler = fit_transform(FixedRangeScaler, X, -1, 1, obsdim=1, operate_on=[1,2,4])
scaler = fit_transform!(FixedRangeScaler, X, -1, 1, obsdim=1, operate_on=[1,2,4])
```

### Lower Level Functions

The lower level functions on which `StandardScaler`

and `FixedRangeScaler`

are built on can also
be used seperately.

#### center!()

` μ = center!(X[, μ; obsdim, operate_on])`

Shift `X`

along `obsdim`

by `μ`

according to X = X - μ
where `X`

is of type Matrix or Vector and `D`

of type DataFrame.

#### fixedrange!()

` lower, upper, xmin, xmax = fixedrange!(X[, lower, upper, xmin, xmax; obsdim, operate_on])`

Normalize `X`

or `D`

along `obsdim`

to the interval [lower:upper]
where `X`

is of type Matrix or Vector and `D`

of type DataFrame.
If `lower`

and `upper`

are omitted the default range is [0:1].

#### standardize!()

` μ, σ = standardize!(X[, μ, σ; obsdim, operate_on])`

Standardize `X`

along `obsdim`

according to X = (X - μ) / σ.
If μ and σ are omitted they are computed such that variables have a mean of zero.

### Polynomial Basis Expansion

` M = expand_poly(x[, degree=5, obsdim]) `

Perform a polynomial basis expansion of the given `degree`

for the vector `x`

.

```
julia> expand_poly(1:5, degree=3)
3×5 Array{Float64,2}:
1.0 2.0 3.0 4.0 5.0
1.0 4.0 9.0 16.0 25.0
1.0 8.0 27.0 64.0 125.0
julia> expand_poly(1:5, degree=3, obsdim=1)
5×3 Array{Float64,2}:
1.0 1.0 1.0
2.0 4.0 8.0
3.0 9.0 27.0
4.0 16.0 64.0
5.0 25.0 125.0
julia> expand_poly(1:5, 3, ObsDim.First()); # same but type-stable
```