Notes on replication of the Chetty and Hendren (C+H) design for LATF project
- chetty-hendren-Nordic README
- What is the C+H design?
- Key facts about C+H's data
- Example analysis using simulated data
- Questions and comments
- CH replication code
Chetty and Hendren proposed a research design for assessing the impact of neighbourhoods during childhood on adult outcomes. In a nutshell, the design assumes that:
- the younger the child is when the move occurs, the longer they will be exposed to the new neighbourhood
- the age at which children move neighbourhoods is as-good-as-random
Therefore, if neighbourhood poverty (for example) had a negative effect on adult outcomes, then we ought to see a stronger negative association between neighbourhood poverty and adult outcomes for children who moved at age 8 than for children who moved at age 18.
This is due to a difference in expected exposure that is assumed to be random (see the second assumption above).
There are only two simple steps to the C+H design:
Step 1: Estimate a series of linear regressions for each age group
- Restrict sample to everyone that has moved at least once in childhood.
- For all children who move, find out what age they moved at ($m$)
- Find out the characteristics of the neighbourhood they moved to ($\bar y$)
- For each age group $m$, regress adult outcome $y$ (e.g. wage at 25) onto childhood neighbourhood characteristic ($\bar y$)
- The coefficient for $\bar y$ in each age-specific regression model is $\hat \beta_m$
Example age regression: Each model = 1 age (
===================================================================================================================
Dependent variable:
-----------------------------------------------------------------------------------------------
y
(1) (2) (3) (4) (5)
-------------------------------------------------------------------------------------------------------------------
yNhood 0.123 -0.087 -0.250 0.142 -0.249
(0.334) (0.326) (0.264) (0.282) (0.301)
Constant 39.689** 55.444*** 56.227*** 47.040*** 68.000***
(17.316) (16.427) (13.407) (14.094) (15.252)
-------------------------------------------------------------------------------------------------------------------
Observations 94 91 99 99 108
R2 0.001 0.001 0.009 0.003 0.006
Adjusted R2 -0.009 -0.010 -0.001 -0.008 -0.003
Residual Std. Error 29.997 (df = 92) 29.486 (df = 89) 27.186 (df = 97) 27.865 (df = 97) 28.280 (df = 106)
F Statistic 0.136 (df = 1; 92) 0.071 (df = 1; 89) 0.892 (df = 1; 97) 0.255 (df = 1; 97) 0.687 (df = 1; 106)
===================================================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01
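Step 1 can be sketched on simulated data as follows. This is a minimal illustration and not C+H's code: the data-generating process, the rate `TRUE_RATE`, the outcome age of 24 and the helper `ols` are all assumptions made for the example, and only the Python standard library is used.

```python
import random


def ols(x, y):
    """Return (intercept, slope) from a simple least-squares fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


random.seed(1)
OUTCOME_AGE = 24  # assumed age at which the adult outcome is measured
TRUE_RATE = 0.04  # hypothetical effect per year of exposure

# Simulated movers: age at move m, destination characteristic y_bar, outcome y.
movers = []
for _ in range(5000):
    m = random.randint(8, 18)
    y_bar = random.gauss(50, 10)
    y = 20 + TRUE_RATE * (OUTCOME_AGE - m) * y_bar + random.gauss(0, 5)
    movers.append((m, y_bar, y))

# Step 1: one regression of y on y_bar for each age-at-move group.
beta_hat = {}
for age in range(8, 19):
    group = [(yb, y) for m, yb, y in movers if m == age]
    _, beta_hat[age] = ols([g[0] for g in group], [g[1] for g in group])

for age, b in beta_hat.items():
    print(f"m = {age:2d}  beta_hat = {b:.3f}")
```

In this simulation the coefficients shrink towards zero as the age at move rises, because exposure is built in as `OUTCOME_AGE - m` years.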
Step 2: Regress $\hat \beta_m$ on $m$
- Run a linear model where $\hat \beta_m$ from step one is regressed on age $m$ (independent variable)
- Examine results to see if $\hat \beta_m$ gets closer to zero as $m$ increases
- The above is expected if older children get less exposure to neighbourhoods during childhood than younger children
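Step 2 can be sketched in the same way. The $\hat \beta_m$ values below are hypothetical (a linear decline of 0.04 per year of age at move, plus noise) and stand in for the output of step 1; the helper `ols` is again an assumption of this example.

```python
import random


def ols(x, y):
    """Return (intercept, slope) from a simple least-squares fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


random.seed(2)
ages = list(range(8, 19))

# Hypothetical step-1 coefficients: closer to zero the later the move.
beta_hat = [0.04 * (24 - m) + random.gauss(0, 0.02) for m in ages]

# Step 2: regress beta_hat on age at move m.
intercept, delta = ols(ages, beta_hat)
print(f"slope of beta_hat on m: {delta:.3f}")
```

A negative slope means that each extra year of age at move (one year less exposure) weakens the association between the neighbourhood characteristic and the adult outcome.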
First, a simple regression of adult outcomes
The relationship between
A fundamental argument in the C+H design is that the bias term
The causal interpretation of
Imagine a trial evaluation where all members receive the treatment eventually. The trial starts in January and ends in December of the same year. There is a random treatment and control group.
However, we randomly assign trial members to 12 groups. For group A, the treatment is dispensed in January; for group B, in February; and so on. Therefore, members differ at random in their exposure to (or dosage of) the treatment: some are exposed for an entire year, others for one month.
start group | treatment start | treatment outcome | control outcome | Effect (treatment - control) |
---|---|---|---|---|
A | Jan | 1 | 0 | 1 |
B | Feb | 0.9 | 0 | 0.9 |
C | Mar | 0.8 | 0 | 0.8 |
etc | .. | .. | .. | .. |
For each group, we can calculate the treatment effect. The difference in treatment effect between groups A and B reflects the difference in exposure time to the treatment.
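As a small worked example, the effects and the between-group differences can be computed from the three rows given in the table. The month offsets (0 for January, 1 for February, 2 for March) are an encoding chosen for this sketch.

```python
# (group, months of delay before treatment starts, treatment outcome, control outcome)
rows = [
    ("A", 0, 1.0, 0.0),
    ("B", 1, 0.9, 0.0),
    ("C", 2, 0.8, 0.0),
]

# Treatment effect per group = treatment outcome - control outcome.
effects = {group: treated - control for group, _, treated, control in rows}

# Change in effect for each extra month of delay (one month less exposure).
diffs = [round(effects[b] - effects[a], 10) for a, b in zip("AB", "BC")]

print(effects)
print(diffs)
```

Each extra month of delay lowers the effect by the same amount here, which is the exposure effect the design is after.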
In practice, we cannot perfectly manipulate exposure due to non-compliance; treatment and control members can drop out. So the difference in treatment effect between A and B reflects the intention to vary exposure to the treatment. For group A, the intended exposure length is 12 months.
This is the experimental analogy to the C+H design. In C+H, if outcomes are measured when a child is 24:
- the trial length = 24
- treatment is $\bar y$ which is continuous instead of binary
- start group / treatment start = age of move $m$
- for group $m$, intended exposure length = 24 - $m$
- for group $m$, the treatment effect is not $\beta_m$ ...
- ... nonetheless the intention to vary exposure effect is still $\delta_\beta$
However, in C+H design, the treatment effect for age group
This is why
Given this assumption, we can still measure the treatment effect for age group
Now imagine that, in the example trial, there is an extra group Z that receives the treatment AFTER the trial has ended. However, we measure their outcomes at the same time as everyone else. In short, for this group, we record their outcomes BEFORE the treatment starts (e.g. outcomes recorded in December but treatment starting in January of the next year).
The treatment effect for group Z is therefore equal to the fixed bias term. We can use this information to derive the unbiased treatment effects for groups A, B, etc.
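The group Z logic can be illustrated with made-up numbers. The bias of 0.3 and the true effects below are hypothetical; the point is only that subtracting group Z's measured effect recovers the true effects when the bias is the same for every group.

```python
BIAS = 0.3  # hypothetical fixed bias shared by all groups

# Hypothetical true effects; Z has not been treated when outcomes are measured.
true_effect = {"A": 1.0, "B": 0.9, "C": 0.8, "Z": 0.0}

# What we would actually measure: true effect plus the fixed bias.
measured = {group: effect + BIAS for group, effect in true_effect.items()}

# Group Z's measured effect is pure bias, so it estimates the bias term.
bias_estimate = measured["Z"]

# Subtract the estimated bias to recover the effects for the treated groups.
corrected = {group: measured[group] - bias_estimate for group in "ABC"}

print(corrected)
```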
In C+H's case, they assume that the exposure effect of neighbourhoods ends after a certain age. Therefore,
Their data come from federal tax records, restricted to:
- children born between 1980 and 1988
- parental tax income from 1996 to 2012
- children with a valid social security number
- children who are citizens as of 2013 (for measuring parental income)
- children whose parents have positive income (only 1.5% of children have parents with zero or negative income; these cases are unusual and not necessarily poor, see footnote 11)
- children living in commuting zones with a population over 250k in the 2000 census (excludes 19.6% of observations, p. 1117)
- children with income data at age 24 or later
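When formatting other data to match, the restrictions listed above could be collected into a single filter. This is a sketch only: every field name (`birth_year`, `valid_ssn`, `citizen_2013`, `parent_income`, `cz_pop_2000`, `last_income_age`) is hypothetical and does not come from C+H's files, and the two records are invented.

```python
def in_sample(child):
    """Apply the sample restrictions to one child record (hypothetical fields)."""
    return (
        1980 <= child["birth_year"] <= 1988
        and child["valid_ssn"]
        and child["citizen_2013"]
        and child["parent_income"] > 0
        and child["cz_pop_2000"] > 250_000
        and child["last_income_age"] >= 24
    )


# Two invented records: the second fails the commuting zone population rule.
children = [
    {"birth_year": 1982, "valid_ssn": True, "citizen_2013": True,
     "parent_income": 41_000, "cz_pop_2000": 900_000, "last_income_age": 26},
    {"birth_year": 1985, "valid_ssn": True, "citizen_2013": True,
     "parent_income": 30_000, "cz_pop_2000": 120_000, "last_income_age": 25},
]

sample = [child for child in children if in_sample(child)]
print(len(sample))
```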
parent = the (first) tax filer who claims the child as a dependent and who was aged 15-40 when the child was born.
Sample is split into stayers (no moves in 1996-2012; also referred to as permanent residents) and movers.
Out of 24.6 million children at base, 19.5 million are stayers! (p. 1118).
To detect an effect size like theirs (0.04), you need an SE of roughly 0.02 or smaller. The SE in their study is ~0.002.
The required sample size is X movers, where X =
> 7e6/100
[1] 70000
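The reasoning behind this division can be spelled out. Assuming the standard error shrinks with the square root of the sample size, an SE ten times larger than C+H's needs a sample 100 times smaller. The 7e6 figure is taken from the calculation above; the square-root scaling is an assumption of this sketch.

```python
n_ch = 7e6        # number of movers used in the calculation above
se_ch = 0.002     # approximate SE in C+H
se_target = 0.02  # largest SE that still detects an effect of 0.04

# If SE is proportional to 1 / sqrt(n), then n scales with (se_ch / se_target) ** 2.
n_required = n_ch * (se_ch / se_target) ** 2

print(round(n_required))
```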
This may be possible with Sweden:
- 0-14 years: 17.54% (male 904,957 / female 855,946)
- 15-24 years: 11.06% (male 573,595 / female 537,358)
In Norway, for the cohort born between 1970 and 1990, we have ~ 45k moves a year between 1990-1998 (@ref Henrik email Aug 2021).
In Sweden: Stefanie Bastani (maiden name: Heidrich) as part of PhD (paper 3) (https://orcid.org/0000-0001-8888-1823) (http://umu.diva-portal.org/smash/person.jsf?pid=authority-person%3A63764&dswid=3998)
sample (image courtesy of Bertha)
The idea for the replication is to:
- Get CH's replication do files
- Format the Nordic data to match CH's data, using a mixture of their do files and article
- Restructure the formatted datasets so that they have exactly the same structure as the files used in step 1 (e.g. same .dta format, same variable names)
- Rerun CH's do files from step 1 to get a perfect replication
Current issue: CH do files mention filenames and variable names not elaborated upon in their data dictionary!
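One way to take stock of this issue is to list every `.dta` file a do file mentions and compare the list against the data dictionary. The sketch below is an assumption about how that check might look: the do-file text, the file names and the set `documented_files` are all invented for illustration.

```python
import re

# Invented do-file text; the real do files would be read from disk.
do_file_text = '''
use "movers_sample.dta", clear
merge 1:1 child_id using "cz_outcomes.dta"
'''

# Invented list of files that the data dictionary describes.
documented_files = {"cz_outcomes.dta"}

# Collect every token that looks like a .dta file name.
referenced = set(re.findall(r"[\w./-]+\.dta", do_file_text))

# Files used by the do file but missing from the dictionary.
undocumented = sorted(referenced - documented_files)
print(undocumented)
```

The same pattern could be extended to variable names, although those are harder to pick out of Stata code with a regular expression alone.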