Fast estimation of generalized linear models with high dimensional categorical variables in Julia
This package estimates generalized linear models with high dimensional categorical variables. It builds on Matthieu Gomez's FixedEffects.jl and Amrei Stammann's Alpaca.


Example use

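using GLFixedEffectModels, GLM, Distributions
using RDatasets
using CategoricalArrays  # categorical() is defined in CategoricalArrays

df = dataset("datasets", "iris")

# Binary outcome: 1.0 if SepalLength exceeds 5.0, else 0.0
df.binary = zeros(Float64, size(df,1))
df[df.SepalLength .> 5.0,:binary] .= 1.0

# Categorical variable used as the fixed effect
df.SpeciesDummy = categorical(df.Species)

# Random three-level categorical variable, used below as a second cluster dimension
idx = rand(1:3,size(df,1),1)
a = ["A","B","C"]
df.Random = vec([a[i] for i in idx])
df.RandomCategorical = categorical(df.Random)

# Logit with one regressor and species fixed effects
m = @formula binary ~ SepalWidth + fe(SpeciesDummy)
x = nlreg(df, m, Binomial(), LogitLink(), start = [0.2] )

# Two regressors, with standard errors clustered along two dimensions
m = @formula binary ~ SepalWidth + PetalLength + fe(SpeciesDummy)
nlreg(df, m, Binomial(), LogitLink(), Vcov.cluster(:SpeciesDummy,:RandomCategorical) , start = [0.2, 0.2] )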


The main function is nlreg(), which returns a GLFixedEffectModel <: RegressionModel.

nlreg(df, formula::FormulaTerm,
    distribution::Distribution,
    link::GLM.Link,
    vcov::CovarianceEstimator; ...)

The required arguments are:

  • df: a Table
  • formula: A formula created using @formula.
  • distribution: A Distribution. See the documentation of GLM.jl for valid distributions.
  • link: A Link function. See the documentation of GLM.jl for valid link functions.
  • vcov: A CovarianceEstimator to compute the variance-covariance matrix. It can be omitted, as in the first example above.

The optional arguments are:

  • save::Union{Bool, Symbol} = false: Should the residuals and the estimated fixed effects be saved in a dataframe? Use save = :residuals to save only the residuals. Use save = :fe to save only the fixed effects.
  • method::Symbol = :cpu: The computation method. The alternative, :gpu, requires CuArrays; in that case, set double_precision = false to use Float32. This option works the same way as in the FixedEffectModels.jl package.
  • contrasts::Dict = Dict(): An optional Dict of contrast codings for each categorical variable in the formula. Any unspecified variable will use DummyCoding.
  • maxiter::Integer = 1000: Maximum number of iterations in the Newton-Raphson routine.
  • maxiter_center::Integer = 10000: Maximum number of iterations for the centering procedure.
  • double_precision::Bool = true: Should the demeaning operation use Float64 rather than Float32?
  • dev_tol::Real: Tolerance level for the first stopping condition of the maximization routine.
  • rho_tol::Real: Tolerance level for the step-halving in the maximization routine.
  • step_tol::Real: Tolerance level that accounts for rounding errors inside the step-halving routine.
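  • center_tol::Real: Tolerance level for the stopping condition of the centering algorithm. Defaults to 1e-8 if double_precision = true, 1e-6 otherwise.
  • start: A vector of starting values for the coefficients, one per regressor, as in the examples above.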
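
As a rough sketch of how the optional arguments combine with the required ones, the call below passes a few of the keywords listed above. The keyword names come from that list; the numeric values are arbitrary illustrations rather than recommended settings, and the explicit CategoricalArrays import is an assumption about where categorical() is found in your environment.

```julia
using GLFixedEffectModels, GLM, Distributions
using RDatasets, CategoricalArrays

df = dataset("datasets", "iris")

# Binary outcome: 1.0 when SepalLength exceeds 5.0, 0.0 otherwise
df.binary = Float64.(df.SepalLength .> 5.0)
df.SpeciesDummy = categorical(df.Species)

m = @formula binary ~ SepalWidth + fe(SpeciesDummy)

# vcov is omitted here, as in the first example of this README.
# The values below are illustrative assumptions, not tuned settings.
result = nlreg(df, m, Binomial(), LogitLink();
    start = [0.2],            # one starting value per regressor
    maxiter = 200,            # cap on Newton-Raphson iterations
    maxiter_center = 5000,    # cap on centering iterations
    double_precision = true,  # demean in Float64
    center_tol = 1e-8)        # stopping tolerance for centering
```

Tightening center_tol or raising maxiter_center trades speed for precision in the centering step, so it may be worth loosening them first when experimenting on large data.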

Things that still need to be implemented

  • Better default starting values
  • Bias correction
  • Weights
  • Better StatsBase interface & prediction
  • Better benchmarking
  • Integration with RegressionTables.jl

Related Julia packages

  • FixedEffectModels.jl estimates linear models with high dimensional categorical variables (with or without endogenous regressors).
  • FixedEffects.jl is a package for fast pseudo-demeaning operations using LSMR. Both this package and FixedEffectModels.jl build on this.
  • Alpaca.jl is a wrapper to the Alpaca R package, which solves the same tasks as this package.
  • GLM.jl estimates generalized linear models, but without explicit support for high dimensional categorical regressors.
  • Econometrics.jl provides routines to estimate multinomial logit and other models.
  • RegressionTables.jl will, in the future, support pretty printing of results from this package.


