FeatureDescriptors.jl is an interface package for describing features used in models, feature engineering, and other data-science workflows.
Descriptors provide a high-level way to define the properties of, and relationships between, a collection of features and their corresponding data.
Associating a type with a given dataset lets users write methods that dispatch on that feature and compose pipelines that are agnostic to the characteristics of any particular feature.
Subtypes of Descriptors inherit the properties of the supertype, but these can be overloaded as required.
For example, say that some weather data is contained in a weather.csv file where each column describes a different feature we are interested in using, such as :temperature and :humidity.
We can define a general Weather Descriptor with subtypes Temperature and Humidity that are loaded from the same table but use the appropriate columns:
using FeatureDescriptors
abstract type Weather <: Descriptor end
FeatureDescriptors.sources(::Type{<:Weather}) = ["weather.csv"] # only one table is needed
FeatureDescriptors.categorical_keys(::Type{<:Weather}) = [] # no categories necessary
abstract type Temperature <: Weather end
# Type{<:Temperature} (rather than Type{Temperature}) so that subtypes also inherit the key
FeatureDescriptors.quantity_key(::Type{<:Temperature}) = :temperature
abstract type Humidity <: Weather end
FeatureDescriptors.quantity_key(::Type{<:Humidity}) = :humidity
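The way these properties are inherited can be checked without the package installed. The following is a minimal sketch that uses local stand-ins for Descriptor, sources, quantity_key and categorical_keys (assumptions made so the snippet runs on its own, not the package's implementation). DailyTemperature and DailyHumidity are hypothetical subtypes. The sketch suggests that a method defined on Type{<:T} applies to T and all of its subtypes, whereas a method defined on exactly Type{T} applies to T only.

```julia
# Local stand-ins for the FeatureDescriptors interface (assumption: present only so
# the snippet is self-contained; with the package you would extend its functions).
abstract type Descriptor end
function sources end
function quantity_key end
function categorical_keys end

abstract type Weather <: Descriptor end
sources(::Type{<:Weather}) = ["weather.csv"]
categorical_keys(::Type{<:Weather}) = Symbol[]

abstract type Temperature <: Weather end
quantity_key(::Type{<:Temperature}) = :temperature  # Temperature and its subtypes

abstract type Humidity <: Weather end
quantity_key(::Type{Humidity}) = :humidity          # exactly Humidity, nothing else

# Hypothetical subtypes, used only to show how dispatch resolves.
abstract type DailyTemperature <: Temperature end
abstract type DailyHumidity <: Humidity end

# Properties defined on Weather are inherited all the way down.
@assert sources(DailyTemperature) == ["weather.csv"]
@assert isempty(categorical_keys(Humidity))

# Type{<:Temperature} covers subtypes; Type{Humidity} does not.
@assert quantity_key(DailyTemperature) == :temperature
@assert hasmethod(quantity_key, Tuple{Type{Humidity}})
@assert !hasmethod(quantity_key, Tuple{Type{DailyHumidity}})
```

Writing the signature with <: is therefore the safer default whenever subtypes are expected to reuse a property.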
A more specific feature can also be defined, such as MeanTemperature, for instance when the feature requires some feature engineering before it can be used:
abstract type MeanTemperature <: Temperature end
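using DataFrames  # provides combine and groupby
using FeatureTransforms
using Statistics  # provides mean
# A trivial feature engineering step in preparing MeanTemperature:
# average the quantity over each :time group
function FeatureTransforms.transform(D::Type{<:MeanTemperature}, df)
    return combine(groupby(df, :time), quantity_key(D) => mean => quantity_key(D))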
end

Finally, another useful feature might be stored in a different table entirely, while we still want to encode its relationship to the others. For example, we might have rainfall data saved in another file that we also want to use:
abstract type Precipitation <: Weather end
FeatureDescriptors.sources(::Type{<:Precipitation}) = ["rainfall.csv"]
FeatureDescriptors.quantity_key(::Type{<:Precipitation}) = :rainfall

All Descriptors are required to implement the following:
- A sources method that specifies where to retrieve the data. A Descriptor may be associated with multiple sources, because it may be necessary to perform feature engineering to create the derived feature.
- A quantity_key method that denotes the name of the quantitative variable for the feature, such as :temperature or :price. For the sake of transparency and simplicity, only one quantity_key may be associated with a given Descriptor.
- A categorical_keys method that denotes the names of the categorical variables for the feature, such as :colour or :type. If no categorical variables are needed, this returns an empty vector.
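Because every Descriptor answers the same three questions, downstream code can be written once against the interface. The following is a minimal sketch of that idea. It again uses local stand-ins for the package's functions. The Price descriptor, the prices.csv source, and the helpers required_columns and select_feature are all hypothetical and are not part of FeatureDescriptors.

```julia
# Local stand-ins for the FeatureDescriptors interface (assumption: present only so
# the snippet is self-contained).
abstract type Descriptor end
function sources end
function quantity_key end
function categorical_keys end

# Hypothetical descriptor with one categorical variable.
abstract type Price <: Descriptor end
sources(::Type{<:Price}) = ["prices.csv"]
quantity_key(::Type{<:Price}) = :price
categorical_keys(::Type{<:Price}) = [:type]

# Hypothetical helper: the columns a feature needs, derived purely from the
# interface methods (categorical keys first, then the single quantity key).
required_columns(D::Type{<:Descriptor}) = vcat(categorical_keys(D), quantity_key(D))

# Hypothetical helper: keep only the declared columns of a table that is stored
# as a NamedTuple of column vectors.
select_feature(D::Type{<:Descriptor}, table::NamedTuple) =
    NamedTuple{Tuple(required_columns(D))}(table)

table = (type = ["a", "b"], price = [1.5, 2.0], colour = ["red", "blue"])
feature = select_feature(Price, table)

@assert keys(feature) == (:type, :price)  # :colour is dropped
@assert feature.price == [1.5, 2.0]
```

Neither helper mentions Price by name, so the same code would work for any other Descriptor that implements the three methods.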
You can ensure your Descriptor is implemented correctly by calling the TestUtils.test_interface function.
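To give a sense of what an interface check might look for, here is a minimal sketch built on Julia's Test standard library. The function check_interface is hypothetical and is not the package's TestUtils.test_interface, whose exact signature and checks are not described here. The specific assertions, and the local stand-ins for the interface functions, are assumptions made for illustration.

```julia
using Test

# Local stand-ins for the FeatureDescriptors interface (assumption: present only so
# the snippet is self-contained).
abstract type Descriptor end
function sources end
function quantity_key end
function categorical_keys end

abstract type Precipitation <: Descriptor end
sources(::Type{<:Precipitation}) = ["rainfall.csv"]
quantity_key(::Type{<:Precipitation}) = :rainfall
categorical_keys(::Type{<:Precipitation}) = Symbol[]

# Hypothetical check (NOT TestUtils.test_interface): confirms that the three
# required methods exist for D and return plausible values.
function check_interface(D::Type{<:Descriptor})
    @testset "$D interface" begin
        @test !isempty(sources(D))                         # at least one source
        @test quantity_key(D) isa Symbol                   # exactly one quantity key
        @test all(k -> k isa Symbol, categorical_keys(D))  # zero or more categorical keys
    end
end

check_interface(Precipitation)
```

A Descriptor that is missing one of the three methods would raise a MethodError inside the test set, which is recorded as an error for that test.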