[Linux] | Coverage |
---|---|
A light-weight, dependency-free, Julia interface defining a collection of types (without instances) for implementing conventions about the scientific interpretation of data.
This package makes a distinction between the machine type and scientific type of a Julia object:
-
The machine type refers to the Julia type being used to represent the object (for instance,
Float64
). -
The scientific type is one of the types defined in this package reflecting how the object should be interpreted (for instance,
Continuous
orMulticlass{3}
).
The distinction is useful because the same machine type is often used
to represent data with differing scientific interpretations - Int
is used for product numbers (a factor) but also for a person's weight
(a continuous variable) - while the same scientific type is frequently
represented by different machine types - both Int
and Float64
are used to represent weights, for example.
For implementation of a concrete convention assigning specific scientific types (interpretations) to julia objects, see instead the ScientificTypes.jl package.
Formerly "ScientificTypesBase.jl" code lived at "ScientificTypes.jl". Since version 2.0 the code at "ScientificTypes.jl" is code that formerly resided at "MLJScientificTypes.jl" (now deprecated).
Finite{N}
├─ Multiclass{N}
└─ OrderedFactor{N}
Infinite
├─ Continuous
└─ Count
Image{W,H}
├─ ColorImage{W,H}
└─ GrayImage{W,H}
ScientificTimeType
├─ ScientificDate
├─ ScientificTime
└─ ScientificDateTime
Sampleable{Ω}
└─ Density{Ω}
Annotated{S}
AnnotationFor{S}
Multiset{S}
Table{K}
Textual
ManifoldPoint{MT}
Compositional{D}
Unknown
Figure 1. The type hierarchy defined in ScientificTypesBase.jl (The Julia native
Missing
andNothing
type are also regarded as a scientific types).
This package should only be used by developers who intend to define their own scientific type convention. The ScientificTypes.jl package (versions 2.0 and higher) implements such a convention, first adopted in the MLJ universe, but which can be adopted by other statistical and scientific software.
The purpose of this package is to provide a mechanism for articulating conventions around the scientific interpretation of data. With such a convention in place, a numerical algorithm declares its data requirements in terms of scientific types, the user has a convenient way to check compliance of his data with that requirement, and the developer understands precisely the constraints his data specification places on the actual machine type of the data supplied.
ScientificTypesBase provides the new julia types appearing in Figure 1 above, signifying "scientific type" for use in method dispatch (e.g., for trait values). Instances of the types play no role.
The types Finite{N}
, Multiclass{N}
and OrderedFactor{N}
are all
parametrised by the number of levels N
, while Image{W,H}
,
GrayImage{W,H}
and ColorImage{W,H}
are all parametrised by the
image width and height dimensions, (W, H)
. The parameter Ω
in
Sampleable{Ω}
and Density{Ω}
is the scientific type of the sample
space. The type ManifoldPoint{MT}
, intended for points lying on a
manifold, is parameterized by the type MT
of the manifold to which
the points belong.
The scientific type ScientificDate
is for representing dates (for
example, the 23rd of April, 2029), ScientificTime
represents time
within a 24-hour day, while ScientificDateTime
represents both a
time of day and date. These types mirror the types Date
, Time
and
DateTime
from the Julia standard library Dates (and indeed, in the
convention defined in ScientificTypes.jl
the difference is only a formal one).
The type parameter K
in Table{K}
is for conveying the scientific
type(s) of a table's columns. See More on the Table
type.
The julia native types Missing
and Nothing
are also regarded as scientific
types.
ScientificTypesBase provides a method scitype
for articulating a
particular convention: scitype(X, C())
is the scientific type of object
X
under convention C
. For example, in the DefaultConvention
convention, implemented
by ScientificTypes,
one has scitype(3.14, Defaultconvention()) = Continuous
and
scitype(42, Defaultconvention()) = Count
.
Aside.
scitype
is not a mapping of types to types but from instances to types. This is because one may want to distinguish the scientific type of objects having the same machine type. For example, in theDefaultConvention
implemented in ScientificTypes.jl, someCategoricalArrays.CategoricalValue
objects have the scitypeOrderedFactor
but others areMulticlass
. In CategoricalArrays.jl theordered
attribute is not a type parameter and so it can only be extracted from instances.
The developer implementing a particular scientific type convention
overloads the scitype
method
appropriately. However, this package provides certain rudimentary
fallback behaviour:
Property 0. For any convention C
, scitype(missing, C()) == Missing
and scitype(nothing, C()) == Nothing
(regarding Missing
and Nothing
as native scientific types).
Property 1. For any convention C
scitype(X, C()) == Unknown
, unless
X
is a tuple, an abstract array, nothing
, or missing
.
Property 2. For any convention C
, The scitype of a k
-tuple is
Tuple{S1, S2, ..., Sk}
where Sj
is the scitype of the j
th element under
convention C
.
For example, in the Defaultconvention
convention implemented
by ScientificTypes:
julia> scitype((1, 4.5), Defaultconvention())
Tuple{Count, Continuous}
Property 3. For any given convention C
, the scitype of an
AbstractArray
, A
, is alwaysAbstractArray{U}
where U
is the union
of the scitypes of the elements of A
under convention C
, with one
exception: If typeof(A) <:AbstractArray{Union{Missing,T}}
for some T
different from Any
, then the scitype of A
is AbstractArray{Union{Missing, U}}
,
where U
is the union over all non-missing elements under convention C
, even
if A
has no missing elements.
This exception is made for performance reasons. In DefaultConvention
implemented
by ScientificTypes:
julia> v = [1.3, 4.5, missing]
julia> scitype(v, DefaultConvention())
AbstractArray{Union{Missing, Continuous}, 1}
julia> scitype(v[1:2], DefaultConvention())
AbstractArray{Union{Missing, Continuous},1}
Performance note. Computing type unions over large arrays is expensive and, depending on the convention's implementation and the array eltype, computing the scitype can be slow. In the common case that the scitype of an array can be determined from the machine type of the object alone, the implementer of a new connvention can speed up compututations by implementing a
Scitype
method. Do?ScientificTypesBase.Scitype
for details.
An object of scitype Table{K}
is expected to have a notion of
"columns", which are AbstractVector
s, and the intention of the type
parameter K
is to encode the scientific type(s) of its
columns. Specifically, developers are requested to adhere to the
following:
Tabular data convention. If scitype(X, C()) <: Table
, for a given
convention C
then in fact
scitype(X, C()) == Table{Union{scitype(c1, C), ..., scitype(cn, C)}}
where c1
, c2
, ..., cn
are the columns of X
. With this
definition, common type checks can be performed with tables. For
instance, you could check that each column of X
has an element
scitype that is either Continuous
or Finite
:
scitype(X, C()) <: Table{<:Union{AbstractVector{<:Continuous}, AbstractVector{<:Finite}}}
A built-in Table
constructor provides a shorthand for the right-hand side:
scitype(X, C()) <: Table(Continuous, Finite)
Note that Table(Continuous, Finite)
is a type union and not a Table
instance.
If you want to implement your own convention, you can consider the ScientificTypes.jl as a blueprint.
The steps below summarise the possible steps in defining such a convention:
- declare a new convention,
- add explicit
scitype
(andScitype
) definitions, - optionally define
coerce
methods for your convention
Each step is explained below, taking DefaultConvenion
as an example.
In the module, define a singleton as thus
struct MyConvention <: ScientificTypesBase.Convention end
When overloading scitype
one needs to dipatch over the convention,
as in this example:
ScientificTypesBase.scitype(::Integer, ::MyConvention) = Count
To avoid method ambiguities, avoid dispatching only on the first argument. For example, defining
ScientificTypesBase.scitype(::AbstractFloat, C) = Continous
would lead to ambiguities in another package defining
ScientificTypesBase.scitype(a, ::MyConvention) = Count
Since ScientificTypesBase.jl
does not define a single-argument scitype(X)
method, an implementation of a new scientific convention will typically want to explicitly implement the single argument method in their package, to save users from needing to explicitly specify a convention. That is, so the user can call scitype(2.3)
instead of scitype(2.3, MyConvention())
.
For example, one declares:
scitype(X) = scitype(X, MyConvention())
It may be very useful to define a function to coerce machine types so
as to correct an unintended scientific interpretation, according to a
given convention. In the DefaultConvention
convention, this is implemented by
defining coerce
methods (no stub provided by ScientificTypesBase
)
For instance consider the simplified:
function coerce(y::AbstractArray{T}, T2::Type{<:Union{Missing, Continuous}}
) where T <: Union{Missing, Real}
return float(y)
end
Under this definition, coerce([1, 2, 4], Continuous)
is mapped to
[1.0, 2.0, 4.0]
, which has scitype AbstractVector{Continuous}
.
In the case of tabular data, one might additionally define coerce
methods to selectively coerce data in specified columns. See
ScientificTypes
for examples.