Kezdi.jl is a Julia package that provides a Stata-like interface for data manipulation and analysis. It is designed to be easy to use for Stata users who are transitioning to Julia.[1]
It imports and reexports CSV, DataFrames, FixedEffectModels, FreqTables, ReadStatTables, Statistics, and StatsBase. These packages are not covered in this documentation; see their own documentation for details.
Kezdi.jl is currently in beta. We have more than 400 unit tests and high code coverage. The package, however, is not guaranteed to be bug-free. If you encounter any issues, please report them as a GitHub issue.
If you would like to receive updates on the package, please star the repository on GitHub and sign up for email notifications here.
To install the package, run the following command in Julia's REPL:
using Pkg; Pkg.add("Kezdi")
Every Kezdi.jl command is a macro that begins with @. These commands operate on a global DataFrame that is set using the setdf function. Alternatively, commands can be executed within a @with block that sets the DataFrame for the duration of the block.
using Kezdi
using RDatasets
setdf(dataset("datasets", "mtcars"))
@rename HP Horsepower
@rename Disp Displacement
@rename WT Weight
@rename Cyl Cylinders
@tabulate Gear
@keep @if Gear == 4
@keep MPG Horsepower Weight Displacement Cylinders
@summarize MPG
@regress log(MPG) log(Horsepower) log(Weight) log(Displacement) fe(Cylinders), robust
Alternatively, you can use the @with block to avoid writing to a global DataFrame:
using Kezdi
using RDatasets
df = dataset("datasets", "mtcars")
renamed_df = @with df begin
    @rename HP Horsepower
    @rename Disp Displacement
    @rename WT Weight
    @rename Cyl Cylinders
end
@with renamed_df begin
    @tabulate Gear
    @keep @if Gear == 4
    @keep MPG Horsepower Weight Displacement Cylinders
    @summarize MPG
    @regress log(MPG) log(Horsepower) log(Weight) log(Displacement) fe(Cylinders), robust
end

| Command | Stata | Julia 1st run | Julia 2nd run | Speedup |
|---|---|---|---|---|
| @egen | 4.90s | 1.60s | 0.41s | 10x |
| @collapse | 0.92s | 0.18s | 0.13s | 8x |
| @tabulate | 2.14s | 0.46s | 0.10s | 20x |
| @summarize | 10.40s | 0.58s | 0.37s | 28x |
| @regress | 0.89s | 1.93s | 0.16s | 6x |
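
The gap between the first and second Julia runs in the table presumably reflects compilation latency: Julia compiles a function the first time it is called with a given set of argument types. The sketch below, in plain Julia, shows one way to time both calls. The function `sum_of_logs` and the random data are illustrative assumptions and are not part of the benchmark above.

```julia
# Hypothetical workload; not taken from the Kezdi.jl benchmark.
sum_of_logs(x) = sum(log, x)

data = rand(1_000_000) .+ 1.0   # values in [1, 2), so log is well defined

first_run = @elapsed sum_of_logs(data)    # includes compilation time
second_run = @elapsed sum_of_logs(data)   # reuses the compiled method

println("first run:  ", first_run, " s")
println("second run: ", second_run, " s")
```

Exact timings depend on the machine and the Julia version, so no expected output is shown.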
@generate logHP = log(Horsepower)
The function can operate on individual elements,
get_make(text) = split(text, " ")[1]
@generate Make = get_make(Model)
or on the entire column:
function geometric_mean(x::Vector)
    n = length(x)
    return exp(sum(log.(x)) / n)
end
@collapse geom_MPG = geometric_mean(MPG), by(Cylinders)
To maximize convenience for Stata users, Kezdi.jl has a number of differences to standard Julia and DataFrames syntax.
While there are a few convenience functions, most Kezdi.jl commands are macros that begin with @.
@tabulate Gear
Due to this non-standard syntax, Kezdi.jl uses the comma to separate options.
@regress log(MPG) log(Horsepower), robust
Here log(MPG) and log(Horsepower) are the dependent and independent variables, respectively, and robust is an option. Options may also have arguments, like
@regress log(MPG) log(Horsepower), cluster(Cylinders)
Column names of the data frame can be used directly in the commands without the need to prefix them with the data frame name or using a Symbol.
@generate logHP = log(Horsepower)
Other data manipulation packages in Julia require column names to be passed as symbols or strings. Kezdi.jl does not require this, and it will not work if you try to use symbols or strings.
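
For contrast, here is a minimal sketch of the same log transformation written in plain DataFrames.jl, where the column is reached through the data frame or named by a Symbol. The three horsepower values and the column names `logHP` and `logHP2` are made up for illustration.

```julia
using DataFrames

# Illustrative data; the values are arbitrary.
df = DataFrame(Horsepower = [110, 93, 175])

# Plain DataFrames.jl: the column is referenced through the data frame
# and the function is broadcast explicitly with the dot.
df.logHP = log.(df.Horsepower)

# The same transformation with transform, naming columns by Symbols.
df2 = transform(df, :Horsepower => ByRow(log) => :logHP2)

# Kezdi.jl form from the text (needs `using Kezdi` and `setdf(df)`):
#   @generate logHP = log(Horsepower)
```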
Julia reserved words, like begin, export, function, and standard types like String, Int, Float64, etc., cannot be used as variable names in Kezdi.jl. If you have a column whose name is a reserved word, rename it before passing it to Kezdi.jl.
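
One way to do the renaming is with plain DataFrames.jl before the data frame reaches Kezdi.jl. This is a sketch under assumptions: the data and the replacement names `start` and `func` are hypothetical.

```julia
using DataFrames

# Hypothetical data whose column names clash with Julia keywords.
df = DataFrame("begin" => [1, 2, 3], "function" => ["a", "b", "c"])

# Rename the offending columns using string names, which sidesteps
# writing the keywords as bare identifiers.
rename!(df, "begin" => "start", "function" => "func")

println(names(df))
```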
All functions are automatically vectorized, so there is no need to use the . operator to broadcast functions over elements of a column.
@generate logHP = log(Horsepower)
If you want to turn off automatic vectorization, use the ~ notation,
@generate logHP = ~log(Horsepower)
The exception is when the function operates on Vectors, in which case Kezdi.jl understands you want to apply the function to the entire column.
@collapse mean_HP = mean(Horsepower), by(Cylinders)
If you need to apply such a function to individual elements of a column, vectorize it by adding . after the function name:
@generate words = split(Model, " ")
@generate n_words = length.(words)
Here, words becomes a vector of vectors, where each element is a vector of words in the corresponding Model string. The function length. will operate on each cell in words, counting the number of words in each Model string. By contrast, length(words) would return the number of elements in the words vector, which is the number of rows in the DataFrame.
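
The same distinction can be checked in plain Julia, outside Kezdi.jl, where broadcasting has to be written out by hand. The three model names and horsepower values below are illustrative.

```julia
using Statistics

models = ["Mazda RX4", "Datsun 710", "Hornet 4 Drive"]
horsepower = [110.0, 93.0, 110.0]

log_hp = log.(horsepower)     # element-wise: one value per row
mean_hp = mean(horsepower)    # whole column: a single number

words = split.(models, " ")   # a vector of vectors of words
n_words = length.(words)      # words per model: [2, 2, 3]
n_rows = length(words)        # number of rows: 3
```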
Almost every command can be followed by an @if condition that filters the data frame. The command will only be executed on the subset of rows for which the condition evaluates to true. The condition can use any combination of column names and functions.
@summarize MPG @if Horsepower > median(Horsepower)
Kezdi.jl ignores missing values when aggregating over entire columns.
@with DataFrame(A = [1, 2, missing, 4]) begin
    @collapse mean_A = mean(A)
end
returns mean_A = 2.33.
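
For comparison, plain Julia propagates missing values unless they are skipped explicitly. The sketch below reproduces the 2.33 figure with `skipmissing`. Whether Kezdi.jl uses `skipmissing` internally is an assumption and is not stated in this document.

```julia
using Statistics

A = [1, 2, missing, 4]

plain_mean = mean(A)                  # missing, because one element is missing
skipped_mean = mean(skipmissing(A))   # (1 + 2 + 4) / 3, about 2.33
```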
The variable _n refers to the row number in the data frame, _N denotes the total number of rows. These can be used in @if conditions, as well.
@with DataFrame(A = [1, 2, 3, 4]) begin
    @keep @if _n < 3
end
To allow for Stata-like syntax, all commands begin with @. These are macros that rewrite your Kezdi.jl code to DataFrames.jl commands.
@tabulate Gear
@keep @if Gear == 4
@keep Model MPG Horsepower Weight Displacement Cylinders
The @if condition is non-standard behavior in Julia, so it is also implemented as a macro.
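
To see why macros make this syntax possible, note that a Julia macro receives its arguments as unevaluated code (Symbols and expressions) and can rewrite them freely. The toy macro below is hypothetical and is not how Kezdi.jl is implemented. It only turns its space-separated arguments into strings.

```julia
# Hypothetical toy macro; not part of Kezdi.jl.
macro argstrings(args...)
    # Each argument arrives as a Symbol or Expr, before any evaluation,
    # so names such as MPG do not need to exist as variables.
    return Expr(:vect, map(string, args)...)
end

cols = @argstrings MPG Horsepower log(Weight)
println(cols)
```

Here `cols` holds the text of each argument, which is the kind of information a command macro needs in order to build a DataFrames.jl call.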
Unlike Stata, where egen and collapse have different syntax, Kezdi.jl uses the same syntax for both commands.
@egen mean_HP = mean(Horsepower), by(Cylinders)
@collapse mean_HP = mean(Horsepower), by(Cylinders)
To maintain compatibility with Julia, we had to rename some functions. For example, count is called rowcount, and missing is called ismissing in Kezdi.jl.
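
In plain Julia these names already have meanings of their own, which is presumably why the Stata names could not be reused: `missing` is the missing value itself, `ismissing` is the test, and `count` counts the elements that satisfy a predicate. A short sketch with made-up data:

```julia
v = [1, missing, 3, missing]

flags = ismissing.(v)              # element-wise test for missing values
n_missing = count(ismissing, v)    # elements for which the predicate is true
n_total = length(v)                # total number of elements
```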
Inspiration for the package came from Tidier.jl, a similar package launched by Karandeep Singh that provides a dplyr-like interface for Julia. Johannes Boehm has also developed a similar package, Douglass.jl.
The package is built on top of DataFrames.jl, FreqTables.jl and FixedEffectModels.jl. The @with macro relies on Chain.jl by Julius Krumbiegel.
The package is named after Gabor Kezdi, a Hungarian economist who has made significant contributions to teaching data analysis.
Footnotes
[1] Stata is a registered trademark of StataCorp LLC. Kezdi.jl is not affiliated with StataCorp LLC.