DataManipulation.jl

Author: JuliaAPlavin
Popularity: 3 stars
Last updated: 4 months ago
Started in: February 2024

General and composable utilities for manipulating tabular, quasi-tabular, and non-tabular datasets.

Base Julia provides the most foundational building blocks for data processing, such as map and filter functions. The goal of DataManipulation is to extend common data processing functionality on top of that. This package provides more general mapping over datasets, grouping, selecting, reshaping and so on.
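
As a point of reference, here is a minimal sketch of the Base-only building blocks mentioned above, applied to a small table stored as a vector of named tuples. The variable names and values are made up for illustration.

```julia
# A small "table" as a plain vector of named tuples (made-up values)
rows = [(name="a", value=1.5), (name="b", value=-2.0), (name="c", value=4.0)]

# filter keeps the rows for which the predicate holds
positive = filter(r -> r.value > 0, rows)

# map transforms each row; here it extracts one column
kept_names = map(r -> r.name, positive)

# the same steps can be nested, which gets harder to read as pipelines grow
total = sum(map(r -> r.value, filter(r -> r.value > 0, rows)))
```

The nested form in the last line is the kind of code that piping and the more general mapping functions aim to simplify.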

DataManipulation handles basic data structures already familiar to Julia users and doesn't require switching to custom specialized dataset types. All functions aim to stay composable and support datasets represented as Julia collections: arrays, dictionaries, tuples, many Tables types, and so on.

Note: This package is intended primarily for interactive usage: load DataManipulation and get access to a wide range of data processing tools. Other packages should prefer depending on the individual packages listed below, when possible. Nevertheless, DataManipulation follows semver and is well-tested.

DataManipulation functionality consists of two major parts.

  • Reexports from companion packages, each focused on a single area. These packages are designed so that they work together nicely and provide consistent and uniform interfaces:

    • DataPipes.jl: boilerplate-free piping for data manipulation
    • FlexiMaps.jl: functions such as flatmap, filtermap, and mapview, extending the all-familiar map
    • FlexiGroups.jl: group data by arbitrary keys & more
    • Skipper.jl: skip(predicate, collection), a generalization of skipmissing
    • Accessors.jl: modify nested data structures conveniently with optics

    See the docs of these packages for more details.

    Additionally, FlexiJoins.jl is considered a companion package with related goals and a compatible API. For now it's somewhat heavy in terms of dependencies and isn't included in DataManipulation, but it may be added in the future.

  • Functions defined in DataManipulation itself: they do not clearly belong to a narrower-scope package, or just have not been split out yet. See the docstrings for details.

    • findonly, filteronly, filterfirst, uniqueonly: functions with the semantics of the corresponding Base function combinations, but more efficient and with natural support for Accessors.set.
    • sortview, uniqueview, collectview: non-copying versions of corresponding Base functions.
    • zero-cost property selection, indexing, and nesting by regex: sr"regex" literal, nest function
    • And more: discreterange, shift_range, rev
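
A brief tour of several functions listed above may help to see how they fit together. This is a sketch under assumptions rather than a reference: the call signatures follow my understanding of the companion packages' documented usage, the "like ..." comments restate the Base-combination semantics described in the list, and all variable names and inputs are made up. Consult the docstrings for the authoritative behavior.

```julia
using DataManipulation  # per the list above, this reexports the companion packages

xs = [1, 2, 3, 4, 5, 6]

# DataPipes: inside @p, `_` builds a lambda and the previous result
# is passed along as the last argument of each step
even_squares = @p xs |> filter(iseven(_)) |> map(_^2)

# FlexiMaps: flatmap concatenates the per-element results,
# filtermap maps and drops the elements mapped to `nothing`
expanded = flatmap(x -> [x, 10x], [1, 2])
third_squares = filtermap(x -> x % 3 == 0 ? x^2 : nothing, xs)

# FlexiGroups: group by an arbitrary key function
by_parity = group(isodd, xs)

# Skipper: skip the elements matching a predicate
total = sum(skip(isnan, [1.0, NaN, 2.0]))

# Accessors: "modify" an immutable nested value by creating an updated copy
nt = (a=1, b=(c=2, d=3))
nt2 = @set nt.b.c = 20

# Functions defined in DataManipulation itself
the_big_one = filteronly(>(5), xs)    # like only(filter(>(5), xs))
its_index = findonly(>(5), xs)        # like only(findall(>(5), xs))
first_even = filterfirst(iseven, xs)  # like first(filter(iseven, xs))
common = uniqueonly([7, 7, 7])        # like only(unique([7, 7, 7]))
sorted = sortview([3, 1, 2])          # sorted view, no copy of the data
```

Note that `@set` leaves `nt` untouched and returns a new named tuple, which is what makes it usable with immutable data.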

Featured example

using DataManipulation

# let's say you have this raw table, probably read from a file:
julia> data_raw = [(; t_a=rand(), t_b=rand(), t_c=rand(), id=rand(1:5), i) for i in 1:10]
10-element Vector{...}:
 (t_a = 0.18651300247498126, t_b = 0.17891408921013918, t_c = 0.25088919057346093, id = 4, i = 1)
 (t_a = 0.008638783104697567, t_b = 0.2725301420722497, t_c = 0.3731421925708567, id = 1, i = 2)
 (t_a = 0.9263839548209668, t_b = 0.043017734093856785, t_c = 0.35927442939296217, id = 2, i = 3)
 ...

# we want to work with `t_a,b,c` values from the table, but it's not very convenient as-is:
# they are mixed with other unrelated columns, `id` and `i`
# let's gather all `t`s into one place by nesting another namedtuple:
julia> data_1 = @p data_raw |> map(nest(_, sr"(t)_(\w+)"))
10-element Vector{...}:
 (t = (a = 0.18651300247498126, b = 0.17891408921013918, c = 0.25088919057346093), id = 4, i = 1)
 (t = (a = 0.008638783104697567, b = 0.2725301420722497, c = 0.3731421925708567), id = 1, i = 2)
 (t = (a = 0.9263839548209668, b = 0.043017734093856785, c = 0.35927442939296217), id = 2, i = 3)
 ...

# much better!
# in practice, all related steps can be written into a single pipeline @p ...,
# here, we split them to demonstrate individual functions

# for the sake of example, let's normalize all `t`s to sum to 1 for each row:
julia> data_2 = @p data_1 |> mapset(t=Tuple(_.t) ./ sum(_.t))
10-element Vector{...}:
 (t = (0.3026254665729048, 0.29029589897330355, 0.40707863445379167), id = 4, i = 1)
 (t = (0.013202867673153734, 0.4165146131252091, 0.5702825192016372), id = 1, i = 2)
 (t = (0.6972233052557745, 0.0323763884223678, 0.27040030632185774), id = 2, i = 3)
 ...

# finally, let's find the maximum over all `t`s for each `id`
# we'll demonstrate two approaches leading to the same result here

# group immediately, then aggregate individual `t`s for each group at two levels - within row, and among rows:
julia> @p data_2 |> group(_.id) |> map(gr -> maximum(r -> maximum(r.t), gr))
5-element Dictionaries.Dictionary{Int64, Float64}
 1 │ 0.5702825192016372
 2 │ 0.6972233052557745
 3 │ 0.8107403478840245
 4 │ 0.4865089865249148
 5 │ 0.44064846734993746

# alternatively, flatten `t` first into a flat column, then group and aggregate at a single level:
julia> data_2_flat = @p data_2 |> flatmap(_.t, (;_..., t=_2))
30-element Vector{...}:
 (t = 0.3026254665729048, id = 4, i = 1)
 (t = 0.29029589897330355, id = 4, i = 1)
 (t = 0.40707863445379167, id = 4, i = 1)
 (t = 0.013202867673153734, id = 1, i = 2)
 ...
julia> @p data_2_flat |> group(_.id) |> map(gr -> maximum(r -> r.t, gr))
5-element Dictionaries.Dictionary{Int64, Float64}
 1 │ 0.5702825192016372
 2 │ 0.6972233052557745
 3 │ 0.8107403478840245
 4 │ 0.4865089865249148
 5 │ 0.44064846734993746
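
The comments in the example mention that all steps can be written as a single `@p` pipeline. A sketch of what that might look like follows, using the multi-line `@p begin ... end` form of DataPipes and the first grouping approach. The small fixed input table is an assumption made here so that the result is reproducible; it replaces the `rand()` data above.

```julia
using DataManipulation

# Small fixed table instead of rand(), so the result is reproducible (made-up values)
data_raw = [
    (t_a=1.0, t_b=1.0, t_c=2.0, id=1, i=1),
    (t_a=3.0, t_b=1.0, t_c=0.0, id=2, i=2),
    (t_a=1.0, t_b=3.0, t_c=4.0, id=1, i=3),
]

# Each line of the block is one pipeline step that receives the previous result
max_per_id = @p begin
    data_raw
    map(nest(_, sr"(t)_(\w+)"))                # gather t_a, t_b, t_c under `t`
    mapset(t=Tuple(_.t) ./ sum(_.t))           # normalize each row's `t`s to sum to 1
    group(_.id)                                # group rows by `id`
    map(gr -> maximum(r -> maximum(r.t), gr))  # maximum within rows and among rows
end
```

With this input, the normalized rows are (0.25, 0.25, 0.5), (0.75, 0.25, 0.0) and (0.125, 0.375, 0.5), so the maximum is 0.5 for `id = 1` and 0.75 for `id = 2`.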