ScientificTypes.jl

An API for dispatching on the "scientific" type of data instead of the machine type


This package makes a distinction between machine type and scientific type of a Julia object:

  • The machine type refers to the Julia type being used to represent the object (for instance, Float64).

  • The scientific type is one of the types defined in ScientificTypesBase.jl reflecting how the object should be interpreted (for instance, Continuous or Multiclass).
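The distinction can be illustrated with a minimal sketch, assuming the default convention shipped with ScientificTypes; the variable names are illustrative only:

```julia
using ScientificTypes

x = 3.14
typeof(x)      # machine type: Float64
scitype(x)     # scientific type: Continuous

n = 42
scitype(n)     # scientific type: Count

# For arrays, `elscitype` reports the scientific type of the elements.
v = [1.0, 2.0, missing]
elscitype(v)   # Union{Missing, Continuous}
```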

Installation

using Pkg
Pkg.add("ScientificTypes")

Who is this repository for?

  • developers of statistical and scientific software who want to articulate their data type requirements in a generic, purpose-oriented way, and who are furthermore happy to adopt an existing convention about what data types should be used for what purpose (a convention first developed for the MLJ ecosystem, but useful in a general context)

What's provided here?

The module ScientificTypes defined in this repo re-exports the scientific types and associated methods defined in ScientificTypesBase.jl and provides:

  • a collection of scitype definitions that articulate a default convention.

  • a coerce function, for changing machine types to reflect a specified scientific interpretation (scientific type)

  • an autotype function for "guessing" the intended scientific type of data
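As a rough sketch of how coerce and autotype might be used together (the toy data and variable names are hypothetical, and the suggestions returned by autotype depend on its heuristics, so no output is shown for that step):

```julia
using ScientificTypes

# Integers are interpreted as Count under the default convention.
v = [1, 2, 1, 2, 1]
elscitype(v)                 # Count

# `coerce` changes the machine type so the data reads as Multiclass.
w = coerce(v, Multiclass)
elscitype(w)                 # Multiclass{2}

# `autotype` inspects a table and suggests scitypes for its columns.
# A named tuple of vectors is used here as a simple table.
X = (height = [1.5, 1.7, 1.6, 1.8], group = [1, 2, 1, 2])
suggestions = autotype(X)    # a dictionary mapping column names to scitypes
Xc = coerce(X, suggestions)  # apply whatever was suggested
schema(Xc)
```

Reviewing the suggestions before passing them to coerce is usually sensible, since autotype only guesses the intended interpretation.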

Very quick start

For more information and examples please refer to the manual.

using ScientificTypes, DataFrames
X = DataFrame(
    a = randn(5),
    b = [-2.0, 1.0, 2.0, missing, 3.0],
    c = [1, 2, 3, 4, 5],
    d = [0, 1, 0, 1, 0],
    e = ['M', 'F', missing, 'M', 'F'],
    )
sch = schema(X)

will print

┌───────┬────────────────────────────┬─────────────────────────┐
│ names │ scitypes                   │ types                   │
├───────┼────────────────────────────┼─────────────────────────┤
│ a     │ Continuous                 │ Float64                 │
│ b     │ Union{Missing, Continuous} │ Union{Missing, Float64} │
│ c     │ Count                      │ Int64                   │
│ d     │ Count                      │ Int64                   │
│ e     │ Union{Missing, Unknown}    │ Union{Missing, Char}    │
└───────┴────────────────────────────┴─────────────────────────┘

Individual parts of the schema are available as properties; for example:

julia> sch.names
(:a, :b, :c, :d, :e)
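The other columns of the printed schema can be read the same way. A small self-contained sketch, using a hypothetical two-column named-tuple table rather than the DataFrame above:

```julia
using ScientificTypes

# A named tuple of equal-length vectors is treated as a table.
T = (a = [1.0, 2.0], c = [1, 2])
s = schema(T)

s.names      # (:a, :c)
s.scitypes   # (Continuous, Count)
s.types      # machine types, e.g. (Float64, Int64) on a 64-bit system
```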

To specify instead that b should be regarded as Count, and that both d and e are Multiclass, we use the coerce function:

Xc = coerce(X, :b=>Count, :d=>Multiclass, :e=>Multiclass)
schema(Xc)

which prints

┌───────┬───────────────────────────────┬────────────────────────────────────────────────┐
│ names │ scitypes                      │ types                                          │
├───────┼───────────────────────────────┼────────────────────────────────────────────────┤
│ a     │ Continuous                    │ Float64                                        │
│ b     │ Union{Missing, Count}         │ Union{Missing, Int64}                          │
│ c     │ Count                         │ Int64                                          │
│ d     │ Multiclass{2}                 │ CategoricalValue{Int64, UInt32}                │
│ e     │ Union{Missing, Multiclass{2}} │ Union{Missing, CategoricalValue{Char, UInt32}} │
└───────┴───────────────────────────────┴────────────────────────────────────────────────┘

Acknowledgements and history

ScientificTypes is based on code from MLJScientificTypes.jl (now deprecated) and in particular builds on contributions of Anthony Blaom (@ablaom), Thibaut Lienart (@tlienart), Samuel Okon (@OkonSamuel), and others not recorded in the ScientificTypes commit history.

ScientificTypes.jl 2.0 implements the DefaultConvention, which coincides with the deprecated MLJ convention of MLJScientificTypes.jl 0.4.8. The code at ScientificTypes 1.1.2 (which defined only the API) became ScientificTypesBase.jl 1.0.