LazyGroupBy.jl exports a single API grouped. It can be used to run
group-by operation using the dot-call syntax:
reducer.(..., grouped(key, collection), ...)where reducer runs on each group (thus, grouped(key, collection)
can be considered a as a key-value pairs with
Dictionaries.jl-like
broadcasting rule). Roughly speaking, grouped(key, collection) is
equivalent to Dict(k_1 => [v_11, v_12, ...], k_2 => [v_21, v_22, ...], ...) where k_i is an output of value of
key(v_ij) for v_ij in collection and each call of reducer is
evaluated with a group "vector" [v_i1, v_i2, ...].
For example:
julia> using LazyGroupBy
julia> collect.(grouped(isodd, 1:7))
Transducers.GroupByViewDict{Bool,Array{Int64,1},…} with 2 entries:
false => [2, 4, 6]
true => [1, 3, 5, 7]
julia> length.(grouped(isodd, 1:7))
Transducers.GroupByViewDict{Bool,Int64,…} with 2 entries:
false => 3
true => 4
julia> keys.(grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]))
Transducers.GroupByViewDict{Bool,Array{Int64,1},…} with 2 entries:
false => [1, 7, 9]
true => [2, 3, 4, 5, 6, 8, 10]
julia> foldl.(tuple, grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]))
Transducers.GroupByViewDict{Bool,Any,…} with 2 entries:
false => ((0, 4), 0)
true => ((((((7, 3), 1), 5), 9), 3), 5)
julia> foldl.(tuple, grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]); init = -1)
Transducers.GroupByViewDict{Bool,Tuple{Any,Int64},…} with 2 entries:
false => (((-1, 0), 4), 0)
true => (((((((-1, 7), 3), 1), 5), 9), 3), 5)
julia> extrema_rf((min1, max1), (min2, max2)) = (min(min1, min2), max(max1, max2));
julia> mapfoldl.(x -> (x, x), extrema_rf, grouped(isodd, [0, 7, 3, 1, 5, 9, 4, 3, 0, 5]))
Transducers.GroupByViewDict{Bool,Tuple{Int64,Int64},…} with 2 entries:
false => (0, 4)
true => (1, 9)Following generic and standard reducers are supported:
collect.(op, grouped(...))→DICT{Key,Vector{...}}view.(grouped(_, array))→DICT{Key,SubArray}map.(f, grouped(...))length.(op, grouped(...))→DICT{Key,Int}count.([f,] op, grouped(...))→DICT{Key,Int}sum.([f,] op, grouped(...))→DICT{Key,Number}prod.([f,] op, grouped(...))→DICT{Key,Number}any.(f, op, grouped(...))→DICT{Key,Bool}all.(f, op, grouped(...))→DICT{Key,Bool}minimum.([f,] op, grouped(...))maximum.([f,] op, grouped(...))extrema.([f,] op, grouped(...))keys.(op, grouped(_, collection))→DICT{Key,Vector{keytype(collection)}}pairs.(op, grouped(_, collection))→DICT{Key,DICT{keytype(collection),valtype(collection)}}findfirst.(f, grouped(_, array))→DICT{Key,keytype(collection)}findlast.(f, grouped(_, array))→DICT{Key,keytype(collection)}findall.(f, grouped(_, array))→DICT{Key,Vector{keytype(collection)}}foldl.(op, grouped(...); [init])mapfoldl.(f, op, grouped(...); [init])
where DICT{K,V} above is a short-hand for AbstractDict{<:K,<:V}
and Key is the type of the values returned from key function
passed to grouped.
For more complex tasks, Transducers.jl and OnlineStats.jl can also be used:
foldl.(op, xf, grouped(...); [init])foldxl.(op, [xf,] grouped(...); [init])foldxt.(op, [xf,] grouped(...); [init])(multi-threaded)foldxd.(op, [xf,] grouped(...); [init])(distributed)collect.(xf, grouped(...))tcollect.(xf, grouped(...))(multi-threaded version ofcollect)dcollect.(xf, grouped(...))(distributed version ofcollect)
where xf::Transducer is initiated for each group individually and
op is either a two-argument function or an OnlineStat object
(e.g., OnlineStats.Mean).
The dot-call syntax is used for defining the "domain-specific
language" (DSL) and it is different from the standard semantics of
broadcasting on arrays. In particular, reducer.(..., grouped(key, collection), ...) may not actually call reducer. Rather, it is
pattern-matched and dispatched to an alternative definition based on
Transducers.jl.
LazyGroupBy.jl is implemented as a direct transformation to
foldl/foldxt/foldxd and GroupBy from Transducer.jl. Consider
foldl.(rf, xf, grouped(key, collection); init = init)This is simply translated to
foldl(right, GroupBy(key, xf, rf, init), collection)Other reducers like sum and collect are implemented in terms of
above transformation.