Skip to content

Latest commit

 

History

History
527 lines (384 loc) · 24.4 KB

README.md

File metadata and controls

527 lines (384 loc) · 24.4 KB

EventStudyInteracts.jl

This package is a Julia replication of the Stata package eventstudyinteract provided by Sun and Abraham (2021).

The estimation process is identical to that of eventstudyinteract, and this package used FixedEffectModels.jl to implement the Stata command reghdfe functionality.

As introduced in the Stata package eventstudyinteract provided by Sun and Abraham (2021), the function of this package is:

eventstudyinteract implements the interaction weighted (IW) estimator and constructs pointwise confidence interval for the estimation of dynamic treatment effects. To estimate the dynamic effects of an absorbing treatment, researchers often use two-way fixed effects (TWFE) regressions that include leads and lags of the treatment (event study specification). Units are categorized into different cohorts based on their initial treatment timing. Sun and Abraham (2020) proposes this estimator as an alternative to the TWFE regression in the presence of treatment effects heterogeneous across cohorts. Under treatment effects heterogeneity, the TWFE regression can result in estimates with uninterpretable weights, which can be assessed by the Stata module eventstudyweights. The IW estimator is implemented in three steps. First, estimate the interacted regression with reghdfe, where the interactions are between relative time indicators and cohort indicators. Second, estimate the cohort shares underlying each relative time. Third, take the weighted average of estimates from the first step, with weights set to the estimated cohort shares.

Introduction

  • Julia is faster than Stata, and FixedEffectModels.jl is much faster than reghdfe.

  • I am a beginner in programming and the code is a complete translation of the Stata package eventstudyinteract. I used newbing to help me with programming, and it is very helpful for beginners.

  • I used FixedEffectModels.jl directly for estimation and I am grateful to the author of this package. Other parts of this package, such as reporting of estimation results, also refer to FixedEffectModels.jl.

  • The Stata package eventstudyinteract is very flexible, and one important feature is that it can compare event study estimates for subsamples, i.e. perform heterogeneous estimation. I did not find this feature in the existing Julia package DiffinDiffs.jl or the  R packages fixest. I think the design of the Stata package eventstudyinteract is more user-friendly.

  • This package fully supports the features of FixedEffectModels.jl, such as using GPU for estimation. The syntax of this package is similar to that of FixedEffectModels.jl.

Installation

The package is registered in the General registry and so can be installed at the REPL with ] add EventStudyInteracts.

Syntax

result =  reg(df,    
    formula,
    rel_varlist::Vector{Symbol},
    control_cohort::Symbol,
    cohort::Symbol,
    vcov,
    contrasts = contrasts,
    weights = weights,
    save = save,
    method = method,
    nthreads = nthreads,
    double_precision = double_precision,
    tol = tol,
    maxiter = maxiter,
    drop_singletons = drop_singletons,
    progress_bar = progress_bar,
    dof_add = dof_add,
    subset = subset)

The syntax is similar to FixedEffectModels.jl (You must use a version higher than v1.9.1),but users need to specify separately rel_varlist::Vector{Symbol}, control_cohort::Symbol, and cohort::Symbol. The rel_varlist is the list of relative time indicators as you would have included in the canonical two-way fixed effects regression. The rel_varlist must be defined as a Vector{Symbol}.

rel_varlist = Symbol[:g_4, :g_3, :g_2, :g0, :g1, :g2, :g3]

See illustration for an example of generating these relative time indicators. The relative time indicators should take the value of zero for never treated units. Users should shape their dataset to a long format where each observation is at the unit-time level. See illustration for an example of specifying the syntax. The syntax is similar to FixedEffectModels.jl in specifying fixed effects and the type of standard error reported.

Furthermore, it also requires the user to specify the cohort categories as well as which cohort is the control cohort. Note that Sun and Abraham (2021) only establishes the validity of the IW estimators for balanced panel data without covariates. eventstudyinteract does not impose a balanced panel by dropping units with observations that are missing in any iod. Instead, it evaluates the IW estimators for unbalanced panel data by estimating the interacted regression using all observations.

The control_cohort must be defined as a Symbol, which is a binary variable that corresponds to the control cohort, which can be never-treated units or last-treated units. If using last-treated unit as control cohort, exclude the time periods when the last cohort receives treatment.

The cohort must be defined as a Symbol, which is a categorical variable that corresponds to the initial treatment timing of each unit. If there are units that receive multiple treatments, Sun and Abraham (2021) defines the initial treatment timing to be based on the first treatment. This categorical variable should be set to be missing for never treated units.

Examples

  • You can download the nlswork.dta data from the repository, which is the example data used by eventstudyinteract, and the method of generating variables is the same as that of eventstudyinteract.
using DataFrames
using FixedEffectModels
using EventStudyInteracts
using ReadStatTables
using RegressionTables
using CUDA
using Plots

# Load the 1968 extract of the National Longitudinal Survey of Young Women and Mature Women.
df =DataFrame(readstat("EventStudyInteracts\dataset\nlswork.dta"))

# Code the cohort categorical variable based on when the individual first joined the union, which will be inputted in cohort(varname).

df.union_year = ifelse.(coalesce.(df.union, false) .== 1, df.year, missing)
# This is really frustrating. In Stata, the simple code bysort idcode: egen first_union = min(union_year) can achieve this. Why is Julia so complicated?
function min_skipmissing(x)
    v = collect(skipmissing(x))
    return isempty(v) ? missing : minimum(v)
end

transform!(groupby(df, :idcode), :union_year => min_skipmissing => :first_union)

select!(df, Not(:union_year))

# Code the relative time categorical variable.
df.ry = df.year - df.first_union

# For the first example, we take the control cohort to be individuals that never unionized.
df.never_union = ismissing.(df.first_union)

# Check if there is a sufficient number of treated units for each relative time. With very few units it might be better to bin the relative times and assume constant treatment effects within the bin.
function tab1(df::DataFrame, catevar::Symbol)
    gdf = groupby(df, catevar)
    result = combine(gdf, nrow => :freq)
    result.percent = result.freq / nrow(df) * 100
    sort!(result, catevar)
    result.cum = cumsum(result.percent)
    return result
end

tab1(df,:ry)
# We will consider the dynamic effect of union status on income. We first generate these relative time indicators, and leave out the distant leads due to few observations. Implicitly this assumes that effects outside the lead windows are zero.
relmin = abs(minimum(skipmissing(df.ry)))
relmax = abs(maximum(skipmissing(df.ry)))

for k in relmin:-1:2
    df[!, Symbol("g_$k")] = Int.(coalesce.(df.ry .== -k, false))
end

for k in 0:relmax
    df[!, Symbol("g$k")] = Int.(coalesce.(df.ry .== k, false))
end

absorb = [:idcode,:year]

formula1 = term(:ln_wage) ~ term(:south) + sum(fe.(term.(absorb)))

# To test whether the package can run successfully when there are collinear variables.
df.g_19 .= 0

# The rel_varlist must be defined as a Vector{Symbol}. In the example below, when defining an empty Vector, you cannot use rel_varlist1 = [].

rel_varlist1 = Symbol[:g_4, :g_5]

for i in relmin:-1:2
    push!(rel_varlist1, Symbol("g_"*string(i)))
end

for i in 0:relmax
    push!(rel_varlist1, Symbol("g"*string(i)))
end

control_cohort1 = :never_union

cohort1 = :first_union

vcov1 = Vcov.cluster(:idcode)

m1 = eventreg(df, formula1, rel_varlist1, control_cohort1, cohort1, vcov1)

#                    EventStudyInteract IW estimator
# ========================================================================
# Number of obs:                27979  Degrees of freedom:             166
# R2:                           0.666  R2 Adjusted:                  0.664
# F-Stat:                     2.81691  p-value:                      0.000
# R2 within:                    0.030  Iterations:                      11
# ========================================================================
# ln_wage |   Estimate Std.Error   t value Pr(>|t|)   Lower 95%  Upper 95%
# ------------------------------------------------------------------------
# g_19    |        0.0       0.0       NaN      NaN         0.0        0.0
# g_18    | -0.0609283 0.0513893  -1.18562    0.236   -0.161679  0.0398221
# g_17    | -0.0489819 0.0432146  -1.13346    0.257   -0.133706  0.0357418
# g_16    | -0.0680484 0.0495311  -1.37385    0.170   -0.165156  0.0290589
# g_15    |  -0.135218 0.0448805  -3.01284    0.003   -0.223208 -0.0472281
# g_14    |  -0.103295 0.0443376  -2.32975    0.020   -0.190221   -0.01637
# g_13    | -0.0684827 0.0416284   -1.6451    0.100   -0.150097  0.0131311
# g_12    |  -0.132015 0.0385567  -3.42392    0.001   -0.207606 -0.0564232
# g_11    |   -0.15239 0.0376567  -4.04683    0.000   -0.226217  -0.078563
# g_10    |   -0.11842 0.0291013  -4.06925    0.000   -0.175474 -0.0613663
# g_9     |  -0.172996 0.0330077  -5.24109    0.000   -0.237709  -0.108284
# g_8     |  -0.116192  0.030757  -3.77773    0.000   -0.176492 -0.0558915
# g_7     |  -0.140938 0.0287493  -4.90233    0.000   -0.197302 -0.0845745
# g_6     |  -0.133223 0.0369878  -3.60181    0.000   -0.205739 -0.0607073
# g_5     |  -0.131416 0.0238826  -5.50261    0.000   -0.178239 -0.0845938
# g_4     |  -0.150598 0.0330246  -4.56016    0.000   -0.215344 -0.0858518
# g_3     | -0.0559142 0.0202356  -2.76316    0.006  -0.0955869 -0.0162416
# g_2     | -0.0565837 0.0191366  -2.95683    0.003  -0.0941016 -0.0190657
# g0      |  0.0515795 0.0146284   3.52598    0.000      0.0229  0.0802591
# g1      |  0.0961542 0.0194692   4.93879    0.000   0.0579842   0.134324
# g2      |  0.0677024 0.0182556   3.70858    0.000   0.0319116   0.103493
# g3      |  0.0663372  0.018431   3.59923    0.000   0.0302027   0.102472
# g4      |   0.132223 0.0276378   4.78414    0.000   0.0780381   0.186408
# g5      |  0.0565864 0.0182183   3.10603    0.002   0.0208689  0.0923039
# g6      |   0.117224 0.0215077    5.4503    0.000   0.0750571    0.15939
# g7      |   0.072123 0.0221207   3.26042    0.001   0.0287545   0.115491
# g8      |  0.0666257 0.0193959   3.43504    0.001   0.0285993   0.104652
# g9      |  0.0929835 0.0349948   2.65707    0.008    0.024375   0.161592
# g10     |  0.0675955 0.0237323   2.84825    0.004   0.0210676   0.114124
# g11     |   0.106953 0.0212674   5.02896    0.000   0.0652575   0.148649
# g12     |  0.0647985 0.0365021    1.7752    0.076 -0.00676515   0.136362
# g13     |   0.134787 0.0412699   3.26599    0.001   0.0538762   0.215698
# g14     |   0.155875 0.0501801   3.10631    0.002   0.0574951   0.254255
# g15     |  0.0665542 0.0426601   1.56011    0.119  -0.0170823   0.150191
# g16     |   0.181412 0.0511286   3.54815    0.000   0.0811727   0.281651
# g17     |  0.0233864 0.0438817  0.532943    0.594  -0.0626452   0.109418
# g18     | -0.0477967 0.0750543 -0.636828    0.524   -0.194943  0.0993498
# ========================================================================
  • Event study plots. We may define a coefplot function for an event study plot.
function mycoefplot(m)
        n = coefnames(m)
        vals = coef(m)
        errors = stderror(m)
        Plots.scatter(
            n,
            vals,
            legend = false,
            yerror = 1.96 .* errors,
            #title = "Coefficient plot",
            xtickfont = font(10, "Times"),
            ytickfont = font(10, "Times"),
            titlefont = font(14, "Times"),
            # markersize = 2
    )
    Plots.vline!([3.5,],linestyle = :dash,linecolor=:red)
    Plots.hline!([0,],linestyle = :dash,linecolor=:black,linewidth=:2)
end

plot1 = mycoefplot(m1)
  • Binning. Pre-treatment effects seem relatively constant, which might suggest binning the many leads.  TODO: current implementation of bins does not follow Sun and Abraham (2020) exactly due to coding challenge.  But it is valid if effects in the bin are constant for each cohort.
df.g_l4 = Int.(coalesce.(df.ry .<= -4, false))

rel_varlist2 = Symbol[:g_l4]

for i in 3:-1:2
    push!(rel_varlist2, Symbol("g_"*string(i)))
end

for i in 0:18
    push!(rel_varlist2, Symbol("g"*string(i)))
end

m2 = eventreg(df, formula1, rel_varlist2, control_cohort1, cohort1, vcov1)
  • Using the last treated as the control cohort. Alternatively, we can take the control cohort to be individuals that were unionized last.
df.last_union = Int.(coalesce.(df.first_union .== 88, false))
control_cohort2 = :last_union
# If using the last-treated cohort as the control, be sure to restrict the analysis sample to exclude the never-treated (if any) and to be before the treated periods for the last-treated cohort.

m3 = eventreg(df[(.!ismissing.(df.first_union)) .& (df.year .< 88), :], formula1, rel_varlist2, control_cohort1, cohort1, vcov1)
  • Aggregating event study estimates. In Stata one can use lincom looks for coefficients and variance covariance matrix stored in e(b) and e(V). There is no package like lincom in Julia. I write another package [LinComs.jl]https://github.com/xiaobaaaa/LinComs.jl) to do this. You can refer to the following code after estimating the results.
rel_varlist1[20:38]
expr = Expr(:call, :/, Expr(:call, :+, rel_varlist1...), length(rel_varlist1))
# Or,
expr = :((g_20 + g_19 + g_18 + g_17 + g_16 + g_15 + g_14 + g_13 + g_12 + g_11 + g_10 + g_9 + g_8 + g_7 + g_6 + g_5 + g_4 + g_3 + g_2 + g0 + g1 + g2 + g3 + g4 + g5 + g6 + g7 + g8 + g9 + g10 + g11 + g12 + g13 + g14 + g15 + g16 + g17 + g18) / 38)
# And then,
lincom(m1,expr)
# Returns,
                                      lincom                                      
==================================================================================
ln_wage            |    Estimate Std.Error   t value Pr(>|t|)  Lower 95% Upper 95%
----------------------------------------------------------------------------------
Linear Combination | -0.00369359 0.0131072 -0.281798    0.778 -0.0293907 0.0220035
==================================================================================

The results obtained from LinComs.jl can also be used with RegressionTables.jl to generate Regression Tables.

  • Compare event study estimates for subsamples. Suppose we want to compare the average effect over the first five years of joining the union between college graduates and non-college graduates. We can first estimate their separate effects by interacting the relative time indicators with icator of college graduates:
for k in 18:-1:2
    df[!, Symbol("g_", k, "_collgrad0")] = ifelse.(coalesce.(df.ry .== -k, false) .& coalesce.(df.collgrad .== 0, false), 1, 0)
end
for k in 0:18
    df[!, Symbol("g", k, "_collgrad0")] = ifelse.(coalesce.(df.ry .== k, false) .& coalesce.(df.collgrad .== 0, false), 1, 0)
end
df.g_l4_collgrad0 = ifelse.(coalesce.(df.ry .<= -4, false) .& coalesce.(df.collgrad .== 0, false), 1, 0)
for k in 18:-1:2
    df[!, Symbol("g_", k, "_collgrad1")] = ifelse.(coalesce.(df.ry .== -k, false) .& coalesce.(df.collgrad .== 1, false), 1, 0)
end
for k in 0:18
    df[!, Symbol("g", k, "_collgrad1")] = ifelse.(coalesce.(df.ry .== k, false) .& coalesce.(df.collgrad .== 1, false), 1, 0)
end
df.g_l4_collgrad1 = ifelse.(coalesce.(df.ry .<= -4, false) .& coalesce.(df.collgrad .== 1, false), 1, 0)

rel_varlist3 = Symbol[:g_l4_collgrad0]

for i in 3:-1:2
    push!(rel_varlist3, Symbol("g_"*string(i)*"_collgrad0"))
end

for i in 0:18
    push!(rel_varlist3, Symbol("g"*string(i)*"_collgrad0"))
end

push!(rel_varlist3, :g_l4_collgrad1)
for i in 3:-1:2
    push!(rel_varlist3, Symbol("g_"*string(i)*"_collgrad1"))
end

for i in 0:18
    push!(rel_varlist3, Symbol("g"*string(i)*"_collgrad1"))
end

m4 = eventreg(df, formula1, rel_varlist3, control_cohort1, cohort1, vcov1)

Output

eventreg returns a light object like FixedEffectModels.reg. It is composed of the vector of coefficients & the covariance matrix (use coef, coefnames, vcov on the output of reg) a boolean vector reporting rows used in the estimation a set of scalars (number of observations, the degree of freedoms, r2, etc).

We can use RegressionTables.jl to get the Regression Tables.

regtable(m1, m2, m3,m4; renderSettings = asciiOutput(), estim_decoration=make_estim_decorator([0.01, 0.05, 0.1]))

regtable(m1, m2, m3,m4; renderSettings = htmlOutput("base_reg.html"), estim_decoration=make_estim_decorator([0.01, 0.05, 0.1]))

Use other features of FixedEffectModels.jl

We can also use other features of FixedEffectModels.jl, such as using GPU for estimation.

eventreg(df, formula1, rel_varlist1, control_cohort1, cohort1, vcov1,  method = :gpu, double_precision = false)

Another example

Refer to Difference-in-Difference (DiD) | DiD.

# Generate sample data
using Random

units = 30
start = 1
finish = 60
nlvls = 5
time = finish - start + 1
obsv = units * time

df = DataFrame(id = repeat(1:units, inner=time), t = repeat(start:finish, outer=units))

Random.seed!(20211222)

df.Y .= 0 # outcome variable
df.D .= 0 # intervention variable
df.cohort .= repeat([rand(0:nlvls) for i in 1:units], inner=time) # treatment cohort

effect = [rand(2:10) for i in unique(df.cohort)]
first_treat = [rand(start:(finish + 20)) for i in unique(df.cohort)]

df.effect = similar(df.cohort)
df.first_treat = similar(df.cohort, Union{Int64, Missing})

for i in 0:nlvls
    df.effect[df.cohort .== i] .= effect[i+1]
    df.first_treat[df.cohort .== i] .= first_treat[i+1]
end

df.first_treat[df.first_treat .> finish] .= missing


df.D = ifelse.(coalesce.(df.t .>= df.first_treat, false), 1, 0)
df.rel_time .= df.t .- df.first_treat
df.Y = df.id .+ df.t .+ ifelse.(df.D .== 1, df.effect .* df.rel_time, 0) .+ randn(nrow(df))

plot(df.t, df.Y, label=false)

tab1(df,:cohort)

df.never_treat = ismissing.(df.first_treat)

tab1(df,:rel_time)

# We will consider the dynamic effect of union status on income. We first generate these relative time indicators, and leave out the distant leads due to few observations. Implicitly this assumes that effects outside the lead windows are zero.

relmin = abs(minimum(skipmissing(df.rel_time)))
relmax = abs(maximum(skipmissing(df.rel_time)))

for k in relmin:-1:2
    df[!, Symbol("g_$k")] = Int.(coalesce.(df.rel_time .== -k, false))
end

for k in 0:relmax
    df[!, Symbol("g$k")] = Int.(coalesce.(df.rel_time .== k, false))
end

absorb = [:id,:t]

formula1 = term(:Y) ~ sum(fe.(term.(absorb)))

rel_varlist1 = Symbol[]

for i in relmin:-1:2
    push!(rel_varlist1, Symbol("g_"*string(i)))
end

for i in 0:relmax
    push!(rel_varlist1, Symbol("g"*string(i)))
end

control_cohort1 = :never_treat

cohort1 = :first_treat

m5 = eventreg(df, formula1, rel_varlist1, control_cohort1, cohort1)

plot2 = mycoefplot(m5)

Performance

EventStudyInteracts.jl is 3.69 times faster than Stata when controlling only for id and year fixed effects with a sample size of 2 million observations. EventStudyInteracts.jl is 4.75 times faster than Stata when controlling for an additional two fixed effects. As the sample size increases and more fixed effects are controlled for, performance improvement should be even greater. Furthermore, EventStudyInteracts.jl can also utilize CUDA. Many thanks to FixedEffectModels.jl for its outstanding performance!

  • Julia Codes

    using CSV, DataFrames, Random, Plots, FixedEffectModels, EventStudyInteracts
    
    units = 100000
    start = 1
    finish = 20
    nlvls = 5
    time = finish - start + 1
    obsv = units * time
    K=100
    id1 = rand(1:(obsv/K), obsv)
    id2 = rand(1:K, obsv)
    
    df = DataFrame(id = repeat(1:units, inner=time), t = repeat(start:finish, outer=units), id1 = id1, id2 = id2)
    
    Random.seed!(20211222)
    
    df.Y .= 0 # outcome variable
    df.D .= 0 # intervention variable
    df.cohort .= repeat([rand(0:nlvls) for i in 1:units], inner=time) # treatment cohort
    
    effect = [rand(2:10) for i in unique(df.cohort)]
    first_treat = [rand(start:(finish + 20)) for i in unique(df.cohort)]
    
    df.effect = similar(df.cohort)
    df.first_treat = similar(df.cohort, Union{Int64, Missing})
    
    for i in 0:nlvls
        df.effect[df.cohort .== i] .= effect[i+1]
        df.first_treat[df.cohort .== i] .= first_treat[i+1]
    end
    
    df.first_treat[df.first_treat .> finish] .= missing
    
    
    df.D = ifelse.(coalesce.(df.t .>= df.first_treat, false), 1, 0)
    df.rel_time .= df.t .- df.first_treat
    df.Y = df.t .+ ifelse.(df.D .== 1, df.effect .* df.rel_time, 0) .+ randn(nrow(df))
    
    plot(df.t, df.Y, label=false)
    
    
    df.never_treat = ismissing.(df.first_treat)
    
    # We will consider the dynamic effect of union status on income. We first generate these relative time indicators, and leave out the distant leads due to few observations. Implicitly this assumes that effects outside the lead windows are zero.
    
    relmin = abs(minimum(skipmissing(df.rel_time)))
    relmax = abs(maximum(skipmissing(df.rel_time)))
    
    for k in relmin:-1:2
        df[!, Symbol("g_$k")] = Int.(coalesce.(df.rel_time .== -k, false))
    end
    
    for k in 0:relmax
        df[!, Symbol("g$k")] = Int.(coalesce.(df.rel_time .== k, false))
    end
    
    absorb = [:id,:t]
    
    formula1 = term(:Y) ~ sum(fe.(term.(absorb)))
    
    rel_varlist1 = Symbol[]
    
    for i in relmin:-1:2
        push!(rel_varlist1, Symbol("g_"*string(i)))
    end
    
    for i in 0:relmax
        push!(rel_varlist1, Symbol("g"*string(i)))
    end
    
    control_cohort1 = :never_treat
    
    cohort1 = :first_treat
    
    CSV.write("bench.csv", df)
    
    m1 = eventreg(df, formula1, rel_varlist1, control_cohort1, cohort1)
    @time m1 = eventreg(df, formula1, rel_varlist1, control_cohort1, cohort1)
    # 76.965495 seconds (395.81 k allocations: 218.291 GiB, 23.78% gc time)
    absorb = [:id,:id1,:id2,:t]
    formula2 = term(:Y) ~ sum(fe.(term.(absorb)))
    @time m2 = eventreg(df, formula2, rel_varlist1, control_cohort1, cohort1)
    # 88.459680 seconds (3.09 M allocations: 218.534 GiB, 21.19% gc time, 0.86% compilation time)
  • Stata Codes

    import delimited "~\EventStudyInteracts\benchmark\bench.csv", clear
    timer clear
    set rmsg on
    gen never_treat1 = never_treat=="true"
    eventstudyinteract y g_* g0-g16, cohort(first_treat) control_cohort(never_treat1) absorb(i.id i.t)
    r; t=284.17
    eventstudyinteract y g_* g0-g16, cohort(first_treat) control_cohort(never_treat1) absorb(i.id i.id1 i.id2 i.t)
    r; t=420.61

References

GitHub - FixedEffects/FixedEffectModels.jl: Fast Estimation of Linear Models with IV and High Dimensional Categorical Variables

lsun20/EventStudyInteract (github.com)

Difference-in-Difference (DiD) | DiD