ScientificTypes
A light-weight, dependency-free, Julia interface defining a collection of types (without instances) for implementing conventions about the scientific interpretation of data.
This package makes a distinction between the machine type and scientific type of a Julia object:
- The machine type refers to the Julia type being used to represent the object (for instance, Float64).
- The scientific type is one of the types defined in this package reflecting how the object should be interpreted (for instance, Continuous or Multiclass).
The distinction is useful because the same machine type is often used to represent data with differing scientific interpretations (Int is used for product numbers, a factor, but also for a person's weight, a continuous variable), while the same scientific type is frequently represented by different machine types (both Int and Float64 are used to represent weights, for example).
For an implementation of a concrete convention assigning specific scientific types (interpretations) to Julia objects, see instead the MLJScientificTypes package.
Finite{N}
├─ Multiclass{N}
└─ OrderedFactor{N}
Infinite
├─ Continuous
└─ Count
Image{W,H}
├─ ColorImage{W,H}
└─ GrayImage{W,H}
ScientificTimeType
├─ ScientificDate
├─ ScientificTime
└─ ScientificDateTime
Table{K}
Textual
ManifoldPoint{MT}
Unknown
Figure 1. The type hierarchy defined in ScientificTypes.jl (the Julia native Missing type is also regarded as a scientific type).
Who is this repository for?
This package should only be used by developers who intend to define their own scientific type convention. The MLJScientificTypes.jl package implements such a convention, first adopted in the MLJ universe, but which can be adopted by other statistical and scientific software.
The purpose of this package is to provide a mechanism for articulating conventions around the scientific interpretation of data. With such a convention in place, a numerical algorithm declares its data requirements in terms of scientific types, the user has a convenient way to check that their data complies with those requirements, and the developer understands precisely the constraints that a data specification places on the actual machine type of the data supplied.
What is provided here?
1. Scientific types
ScientificTypes provides the new Julia types appearing in Figure 1 above. Each signifies a "scientific type" and is intended for use in method dispatch (e.g., as a trait value). Instances of the types play no role.
The types Finite{N}, Multiclass{N} and OrderedFactor{N} are all parameterized by the number of levels N, while Image{W,H}, GrayImage{W,H} and ColorImage{W,H} are all parameterized by the image width and height dimensions, (W, H). The type ManifoldPoint{MT}, intended for points lying on a manifold, is parameterized by the type MT of the manifold to which the points belong.
The scientific type ScientificDate is for representing dates (for example, the 23rd of April, 2029), ScientificTime represents time within a 24-hour day, while ScientificDateTime represents both a time of day and date. These types mirror the types Date, Time and DateTime from the Julia standard library Dates (and indeed, in the MLJ convention the difference is only a formal one).
The type parameter K in Table{K} is for conveying the scientific type(s) of a table's columns. See "More on the Table type" below.
The Julia native types Missing and Nothing are also regarded as scientific types.
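As a rough illustration of how instance-free, parameterized types can drive dispatch, the sketch below defines stand-ins for part of Figure 1 inside a throwaway module. The module name HierarchySketch, the helper describe, and the choice to declare the stand-ins as abstract types are all assumptions made for this example; they are not part of ScientificTypes.

```julia
module HierarchySketch

# Stand-ins mirroring part of Figure 1 (assumption: declared here as abstract types).
abstract type Finite{N} end
abstract type Multiclass{N} <: Finite{N} end
abstract type OrderedFactor{N} <: Finite{N} end
abstract type Infinite end
abstract type Continuous <: Infinite end
abstract type Count <: Infinite end

# Hypothetical helper: it dispatches on the type itself, never on an instance,
# and recovers the number of levels N from the type parameter.
describe(::Type{<:Finite{N}}) where {N} = "finite with $N levels"
describe(::Type{<:Infinite}) = "infinite"

end # module

HierarchySketch.describe(HierarchySketch.Multiclass{3})  # "finite with 3 levels"
HierarchySketch.describe(HierarchySketch.Count)          # "infinite"
```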
2. The scitype and Scitype methods
ScientificTypes provides a method scitype for articulating a particular convention: scitype(X) is the scientific type of object X. For example, in the MLJ convention, implemented by MLJScientificTypes, one has scitype(3.14) = Continuous and scitype(42) = Count.
Aside. scitype is not a mapping of types to types but from instances to types. This is because one may want to distinguish the scientific type of objects having the same machine type. For example, in the MLJ convention, some CategoricalArrays.CategoricalValue objects have the scitype OrderedFactor but others are Multiclass. In CategoricalArrays.jl the ordered attribute is not a type parameter and so it can only be extracted from instances.
The developer implementing a particular scientific type convention overloads the scitype method appropriately. However, this package provides certain rudimentary fallback behaviour; only Property 1 below should be altered by the developer:
Property 0. scitype(missing) = Missing and scitype(nothing) = Nothing (regarding Missing and Nothing as native scientific types).
Property 1. scitype(X) = Unknown, unless X is a tuple, an abstract array, missing or nothing (the last two being covered by Property 0).
Property 2. The scitype of a k-tuple is Tuple{S1, S2, ..., Sk} where Sj is the scitype of the jth element.
For example, in the MLJ convention:
julia> scitype((1, 4.5))
Tuple{Count, Continuous}
Property 3. The scitype of an AbstractArray, A, is AbstractArray{U} where U is the union of the scitypes of the elements of A, with one exception: if typeof(A) <: AbstractArray{Union{Missing,T}} for some T different from Any, then the scitype of A is AbstractArray{Union{Missing, U}}, where U is the union over all non-missing elements, even if A has no missing elements.
The exception is made for performance reasons. For example, in the MLJ convention:
julia> v = [1.3, 4.5, missing]
julia> scitype(v)
AbstractArray{Union{Missing, Continuous},1}
julia> scitype(v[1:2])
AbstractArray{Union{Missing, Continuous},1}
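Properties 0 to 3 can be mimicked in a few lines. The following is a self-contained toy, not the package's implementation: the module FallbackSketch, the functions toy_scitype and toy_union, and the two rules sending floats to Continuous and integers to Count are assumptions chosen to resemble the MLJ examples above.

```julia
module FallbackSketch

# Stand-ins for three of the types in Figure 1.
abstract type Unknown end
abstract type Continuous end
abstract type Count end

toy_scitype(::Any) = Unknown               # Property 1: the default
toy_scitype(::Missing) = Missing           # Property 0
toy_scitype(::Nothing) = Nothing           # Property 0
toy_scitype(::AbstractFloat) = Continuous  # toy convention (assumption)
toy_scitype(::Integer) = Count             # toy convention (assumption)

# Property 2: a tuple of scitypes, one per element.
toy_scitype(t::Tuple) = Tuple{map(toy_scitype, t)...}

# Union of the scitypes of all elements of an iterable.
function toy_union(itr)
    U = Union{}
    for x in itr
        U = Union{U, toy_scitype(x)}
    end
    return U
end

# Property 3, including the exception for eltypes of the form Union{Missing,T}.
function toy_scitype(A::AbstractArray{T,N}) where {T,N}
    if T !== Any && Missing <: T
        # Missing is always included, whether or not A contains a missing value.
        return AbstractArray{Union{Missing, toy_union(skipmissing(A))}, N}
    end
    return AbstractArray{toy_union(A), N}
end

end # module

v = [1.3, 4.5, missing]
FallbackSketch.toy_scitype(v)         # union of Missing and Continuous, 1 dimension
FallbackSketch.toy_scitype(v[1:2])    # same result: the eltype still allows missing
FallbackSketch.toy_scitype((1, 4.5))  # Tuple{Count, Continuous}
```

Note that this toy still visits every element in both branches; the point is only to show which union is formed in each case.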
Performance note. Computing type unions over large arrays is expensive and, depending on the convention's implementation and the array eltype, computing the scitype can be slow. In the common case that the scitype of an array can be determined from the machine type of the object alone, the implementer of a new convention can speed up computations by implementing a Scitype method. Do ?ScientificTypes.Scitype for details.
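The idea behind a type-level shortcut can be sketched as follows. This is a toy under stated assumptions, not the package's Scitype API: the names ShortcutSketch, toy_Scitype and toy_scitype are hypothetical, and the real method signature should be taken from the Scitype doc-string.

```julia
module ShortcutSketch

abstract type Unknown end
abstract type Continuous end
abstract type Count end

# Type-level rule (hypothetical): machine type -> scitype, no instance needed.
# Returning `nothing` means "cannot be decided from the machine type alone".
toy_Scitype(::Type) = nothing
toy_Scitype(::Type{<:AbstractFloat}) = Continuous
toy_Scitype(::Type{<:Integer}) = Count

# Instance-level rule for non-array objects.
toy_scitype(x) = something(toy_Scitype(typeof(x)), Unknown)

function toy_scitype(A::AbstractArray{T,N}) where {T,N}
    S = toy_Scitype(T)
    if S !== nothing
        return AbstractArray{S,N}   # fast path: no element is inspected
    end
    U = Union{}
    for x in A                      # slow path: one visit per element
        U = Union{U, toy_scitype(x)}
    end
    return AbstractArray{U,N}
end

end # module

ShortcutSketch.toy_scitype(rand(1_000_000))  # decided from the eltype Float64 alone
ShortcutSketch.toy_scitype(Any[1, 2.0])      # eltype Any forces the element-wise path
```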
3. Trait dictionary
ScientificTypes provides a dictionary TRAIT_FUNCTION_GIVEN_NAME for registering names (symbols) for boolean-valued trait functions used to dispatch scitype in cases where direct type dispatch is inadequate. See below for details.
4. Convenience methods
ScientificTypes provides the following convenience functions:
- trait(X) - return the name of the registered trait that holds for X
- set_convention(C) - activate the convention named C
- set_convention() - inspect the active convention
- scitype_union(A) - return the union of the scitypes of all elements of iterable A
- elscitype(A) - return the "element scitype" of array A
Query the doc-strings for details.
More on the Table type
An object of scitype Table{K} is expected to have a notion of "columns", which are AbstractVectors, and the intention of the type parameter K is to encode the scientific type(s) of its columns. Specifically, developers are requested to adhere to the following:
Tabular data convention. If scitype(X) <: Table, then in fact scitype(X) == Table{Union{scitype(c1), ..., scitype(cn)}} where c1, c2, ..., cn are the columns of X. With this definition, common type checks can be performed with tables. For instance, you could check that each column of X has an element scitype that is either Continuous or Finite:
scitype(X) <: Table{<:Union{AbstractVector{<:Continuous}, AbstractVector{<:Finite}}}
A built-in Table constructor provides a shorthand for the right-hand side:
scitype(X) <: Table(Continuous, Finite)
Note that Table(Continuous, Finite) is a type union and not a Table instance.
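The subtype checks above can be tried out without any table at hand. The sketch below uses stand-in types and a hypothetical helper table_scitype that assembles a table scitype from column scitypes according to the tabular data convention; the Table(...) shorthand is written to match the expansion quoted above and is an assumption about how such a shorthand might be defined.

```julia
module TableSketch

abstract type Finite{N} end
abstract type Multiclass{N} <: Finite{N} end
abstract type Infinite end
abstract type Continuous <: Infinite end
abstract type Count <: Infinite end
abstract type Table{K} end

# Shorthand in the spirit of the text: returns a type, never an instance.
Table(Ts...) = Table{<:Union{(AbstractVector{<:T} for T in Ts)...}}

# Hypothetical helper: table scitype from the scitypes of its columns.
table_scitype(column_scitypes...) = Table{Union{column_scitypes...}}

end # module

using .TableSketch: Table, Continuous, Finite, Infinite, Multiclass, table_scitype

S = table_scitype(AbstractVector{Continuous}, AbstractVector{Multiclass{3}})

S <: Table(Continuous, Finite)  # true: every column is Continuous or Finite
S <: Table(Continuous)          # false: the Multiclass{3} column does not qualify
S <: Table(Infinite, Finite)    # true: Continuous <: Infinite
```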
Defining a new convention
If you want to implement your own convention, you can use MLJScientificTypes.jl as a blueprint.
Defining such a convention involves the following steps:
- declare a new convention,
- add explicit scitype (and Scitype) definitions,
- register any traits that were needed to define scitypes,
- optionally define coerce methods for your convention.
Each step is explained below, taking the MLJ convention as an example.
Naming the convention
In the module, define a
struct MyConvention <: ScientificTypes.Convention end
and add an init function with:
function __init__()
    ScientificTypes.set_convention(MyConvention())
end
Adding explicit scitype declarations
When overloading scitype one needs to dispatch over the convention, as in this example:
ScientificTypes.scitype(::Integer, ::MLJ) = Count
In some cases, however, the scientific type to be attributed to an object might depend on the evaluation of a boolean-valued trait function. There is a mechanism for "registering" such traits to streamline trait-based dispatch of the scitype method. This is best illustrated with an example.
In the MLJ convention, all containers that meet the Tables.jl interface are deemed to have scitype Table. These are detected using the Tables.jl trait istable. Our first step is to choose a name for the trait, in this case :table. Our scitype declaration then reads:
function ScientificTypes.scitype(X, ::MLJ, ::Val{:table})
    K = <some type depending on columns of X>
    return Table{K}
end
For this to work we now need to register the trait, which means adding an entry to the TRAIT_FUNCTION_GIVEN_NAME dictionary. This should be done within the __init__ function of the defining package:
function __init__()
    ScientificTypes.set_convention(MLJ())
    ScientificTypes.TRAIT_FUNCTION_GIVEN_NAME[:table] = Tables.istable
end
Important limitation. One may not add a trait function to the TRAIT_FUNCTION_GIVEN_NAME dictionary if it holds true on some object X for which an existing trait already holds true.
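To make the registration mechanism concrete, here is a self-contained toy that imitates it. Everything in it is an assumption for illustration: the module TraitSketch, the dictionary TRAITS (standing in for TRAIT_FUNCTION_GIVEN_NAME), the functions toy_trait and toy_scitype, the :other fallback name, and the choice to treat a NamedTuple of vectors as a "table" instead of relying on Tables.istable.

```julia
module TraitSketch

abstract type Unknown end
abstract type Continuous end
abstract type Count end
abstract type Table{K} end

struct ToyConvention end                 # hypothetical convention type

const TRAITS = Dict{Symbol,Function}()   # stand-in for TRAIT_FUNCTION_GIVEN_NAME

# Name of the registered trait holding for X, or :other if none holds.
# Dict iteration order is unspecified, which is harmless only because
# at most one registered trait may hold for any given object.
function toy_trait(X)
    for (name, f) in TRAITS
        f(X) && return name
    end
    return :other
end

# Toy trait function: a NamedTuple whose fields are all vectors.
is_toy_table(X) = X isa NamedTuple && all(c -> c isa AbstractVector, values(X))

# Toy column rules (assumption), decided from the column eltype.
col_scitype(::AbstractVector{<:AbstractFloat}) = AbstractVector{Continuous}
col_scitype(::AbstractVector{<:Integer}) = AbstractVector{Count}
col_scitype(::AbstractVector) = AbstractVector{Unknown}

# Entry point: look up the trait, then dispatch on convention and trait name.
toy_scitype(X) = toy_scitype(X, ToyConvention(), Val(toy_trait(X)))
toy_scitype(X, ::ToyConvention, ::Val) = Unknown
function toy_scitype(X, ::ToyConvention, ::Val{:table})
    K = Union{map(col_scitype, values(X))...}
    return Table{K}
end

# In a package this registration would belong in __init__.
TRAITS[:table] = is_toy_table

end # module

X = (height = [1.8, 1.6], age = [30, 41])
TraitSketch.toy_trait(X)         # :table
TraitSketch.toy_scitype(X)       # a Table whose parameter covers both column scitypes
TraitSketch.toy_scitype("text")  # Unknown
```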
Defining a coerce function
It may be very useful to define a function to coerce machine types so as to correct an unintended scientific interpretation, according to a given convention. In the MLJ convention, this is implemented by defining coerce methods (no stub is provided by ScientificTypes). For instance, consider the simplified definition:
function coerce(y::AbstractArray{T}, T2::Type{<:Union{Missing,Continuous}}) where T <: Union{Missing,Real}
    return float(y)
end
Under this definition, coerce([1, 2, 4], Continuous) returns [1.0, 2.0, 4.0], which has scitype AbstractVector{Continuous}.
In the case of tabular data, one might additionally define coerce methods to selectively coerce data in specified columns. See MLJScientificTypes for examples.
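A column-selective coerce might look roughly like the following. This is a minimal sketch under stated assumptions, not the MLJScientificTypes implementation: tables are modelled as NamedTuples of vectors, Continuous is a local stand-in type, and the column => scitype pair syntax of the table method is hypothetical.

```julia
module CoerceSketch

abstract type Continuous end   # stand-in for the ScientificTypes type

# Array method, following the simplified definition in the text.
function coerce(y::AbstractArray{T}, ::Type{<:Union{Missing,Continuous}}) where T <: Union{Missing,Real}
    return float(y)
end

# Hypothetical table method: coerce only the named columns, leave the rest untouched.
function coerce(X::NamedTuple, specs::Pair{Symbol}...)
    targets = Dict(specs)
    cols = map(keys(X)) do name
        haskey(targets, name) ? coerce(X[name], targets[name]) : X[name]
    end
    return NamedTuple{keys(X)}(cols)
end

end # module

X = (age = [30, 41], id = [1, 2])
Y = CoerceSketch.coerce(X, :age => CoerceSketch.Continuous)
Y.age  # [30.0, 41.0]
Y.id   # [1, 2], unchanged
```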