DataFrames.jl has a special mini-language for data transformations, which is powerful but often verbose. Here's an example:
transform(df, :A => ByRow(x -> x + 1) => :B)
With DataFrameMacros.jl, you don't have to separately specify the input columns, the transformation function, and the output columns. You can just write it as a normal expression which is transformed to the mini-language for you:
@transform(df, :B = :A + 1)
DataFrameMacros.jl also helps you when you have to transform many columns in a similar way.
Every expression in DataFrameMacros.jl is automatically executed once for every column in a multi-column specifier, such as All
, Between
, Not
or regular expressions like r"cat|dog"
.
Here's how you could divide all columns of a DataFrame that start with "a"
by 10 and add the suffix _div_10
to each new name:
@select(df, "{}_div_10" = {r"^a"} / 10)
To execute more complex name transformations on the left-hand side, you can use {n}
expressions.
For example, to concatenate each first column name with only the last split off part of each second column name in a multi-column transformation, you could write:
@select(df, {1} * split({2})[end] = {first_set_of_cols} + {second_set_of_cols})
You can also use multiple columns together as a Tuple with the double brace syntax, which is useful when you need to run an aggregation over those columns in an expression. In this example we keep all rows where the value in the :January
column is larger than the median from :February
to :December
:
@subset(df, :January > median({{ Between(:February, :December) }}))
DataFrameMacros.jl exports these macros:
@transform
and@transform!
@select
and@select!
@groupby
@combine
@subset
and@subset!
@sort
and@sort!
@unique
DataFramesMeta.jl is another package that provides macros for DataFrames.jl. The syntax looks similar in many cases, but here are some noteworthy differences:
- Except
@combine
, all macros work row-wise by default in DataFrameMacros.jl@transform(df, :y = :x + 1) @combine(df, :sum = sum(:x))
- In DataFrameMacros.jl, you can apply the same expression to several columns in
{}
braces at once and even broadcast across multiple sets of columns. You can also use a shortcut syntax to derive new column names from old ones.
@transform(df, "{}_plus_one" = {r"^col"} + 1) # for all columns starting with "col"
- In DataFrameMacros.jl, you can use special
{{ }}
multi-column expressions where you can operate on a tuple of all values at once which makes it easier to do aggregates across columns.@select(df, :july_larger_than_rest = :july > maximum({{Not(:july)}}))
- DataFrameMacros.jl uses
{}
to signal column expressions instead of$()
@select(df, :y = {"x"} + 1) col = :x @transform(df, :z = {col} * 5)
- DataFrameMacros.jl has a special syntax to make use of
transform!
on a view returned fromsubset
.@transform!(df, @subset(:x > 5), :x = :y + 10) # update x in all rows where x > 5
- DataFrameMacros.jl has a
@groupby
macro, which is a shortcut to executetransform
and thengroupby
on a DataFrame.jl. This is nice when you want to group on a column that you have to create first. Instead ofYou can write:df2 = @transform(df, :Y = :X + 1) groupby(df2, :Y)
@groupby(df, :Y = :X + 1)
If any of these points have changed, please open an issue.
You can use the separate package Chain.jl for writing chains of transformations, this way you don't have to repeat the DataFrame argument every time. This is similar to the tidyverse piping syntax that you might know from R:
@chain df begin
@subset(:A > 50)
@transform(:B = :A + :C)
@groupby(iseven(:B), :D)
@combine(:mean_E = mean(:E))
@sort(abs(:F))
end