General and composable utilities for manipulating tabular, quasitabular, and nontabular datasets.
Base Julia provides the most foundational building blocks for data processing, such as map
and filter
functions. The goal of DataManipulation
is to extend common data processing functionality on top of that. This package provides more general mapping over datasets, grouping, selecting, reshaping and so on.
DataManipulation
handles basic data structures already familiar to Julia users and doesn't require switching to custom specialized dataset types. All functions aim to stay composable and support datasets represented as Julia collections: arrays, dictionaries, tuples, many Tables
types, ... .
Note This package is intended primarily for interactive usage: load DataManipulation
and get access to a wide range of data processing tools. Other packages should prefer depending on individual packages listed below, when possible. Nevertheless, DataManipulation
follows semver and is welltested.
DataManipulation
functionality consists of two major parts.

Reexports from companion packages, each focused on a single area. These packages are designed so that they work together nicely and provide consistent and uniform interfaces:
 DataPipes.jl: boilerplatefree piping for data manipulation
 FlexiMaps.jl: functions such as
flatmap
,filtermap
, andmapview
, extending the allfamiliarmap
 FlexiGroups.jl: group data by arbitrary keys & more
 Skipper.jl:
skip(predicate, collection)
, a generalization ofskipmissing
 Accessors.jl: modify nested data structures conveniently with optics
See the docs of these packages for more details.
Additionally, FlexiJoins.jl is considered a companion package with relevant goals and compatible API. For now it's somewhat heavy in terms of dependencies and isn't included in
DataManipulation
, but can be added in the future. 
Functions defined in
DataManipulation
itself: they do not clearly belong to a narrowerscope package, or just have not been split out yet. See the docstrings for details.findonly
,filteronly
,filterfirst
,uniqueonly
: functions with the semantics of corresponding Base function combinations, but more efficient and naturally supportAccessors.set
.sortview
,uniqueview
,collectview
: noncopying versions of corresponding Base functions. zerocost property selection, indexing, and nesting by regex:
sr"regex"
literal,nest
function  And more:
discreterange
,shift_range
,rev
using DataManipulation
# let's say you have this raw table, probably read from a file:
julia> data_raw = [(; t_a=rand(), t_b=rand(), t_c=rand(), id=rand(1:5), i) for i in 1:10]
10element Vector{...}:
(t_a = 0.18651300247498126, t_b = 0.17891408921013918, t_c = 0.25088919057346093, id = 4, i = 1)
(t_a = 0.008638783104697567, t_b = 0.2725301420722497, t_c = 0.3731421925708567, id = 1, i = 2)
(t_a = 0.9263839548209668, t_b = 0.043017734093856785, t_c = 0.35927442939296217, id = 2, i = 3)
...
# we want to work with `t_a,b,c` values from the table, but it's not very convenient asis:
# they are mixed with other unrelated columns, `id` and `i`
# let's gather all `t`s into one place by nesting another namedtuple:
julia> data_1 = @p data_raw > map(nest(_, sr"(t)_(\w+)"))
10element Vector{...}:
(t = (a = 0.18651300247498126, b = 0.17891408921013918, c = 0.25088919057346093), id = 4, i = 1)
(t = (a = 0.008638783104697567, b = 0.2725301420722497, c = 0.3731421925708567), id = 1, i = 2)
(t = (a = 0.9263839548209668, b = 0.043017734093856785, c = 0.35927442939296217), id = 2, i = 3)
...
# much better!
# in practice, all related steps can be written into a single pipeline @p ...,
# here, we split them to demonstrate individual functions
# for the sake of example, let's normalize all `t`s to sum to 1 for each row:
julia> data_2 = @p data_1 > mapset(t=Tuple(_.t) ./ sum(_.t))
10element Vector{...}:
(t = (0.3026254665729048, 0.29029589897330355, 0.40707863445379167), id = 4, i = 1)
(t = (0.013202867673153734, 0.4165146131252091, 0.5702825192016372), id = 1, i = 2)
(t = (0.6972233052557745, 0.0323763884223678, 0.27040030632185774), id = 2, i = 3)
...
# finally, let's find the maximum over all `t`s for each `id`
# we'll demonstrate two approaches leading to the same result here
# group immediately, then aggregate individual `t`s for each group at two levels  within row, and among rows:
julia> @p data_2 > group(_.id) > map(gr > maximum(r > maximum(r.t), gr))
5element Dictionaries.Dictionary{Int64, Float64}
1 │ 0.5702825192016372
2 │ 0.6972233052557745
3 │ 0.8107403478840245
4 │ 0.4865089865249148
5 │ 0.44064846734993746
# alternatively, flatten `t` first into a flat column, then group and aggregate at a single level:
julia> data_2_flat = @p data_2 > flatmap(_.t, (;_..., t=_2))
30element Vector{...}:
(t = 0.3026254665729048, id = 4, i = 1)
(t = 0.29029589897330355, id = 4, i = 1)
(t = 0.40707863445379167, id = 4, i = 1)
(t = 0.013202867673153734, id = 1, i = 2)
...
julia> @p data_2_flat > group(_.id) > map(gr > maximum(r > r.t, gr))
5element Dictionaries.Dictionary{Int64, Float64}
1 │ 0.5702825192016372
2 │ 0.6972233052557745
3 │ 0.8107403478840245
4 │ 0.4865089865249148
5 │ 0.44064846734993746