PyTwoWay is the Python package associated with the following paper:
"How Much Should We Trust Estimates of Firm Effects and Worker Sorting?" by Stéphane Bonhomme, Kerstin Holzheu, Thibaut Lamadon, Elena Manresa, Magne Mogstad, and Bradley Setzler. No. w27368. National Bureau of Economic Research, 2020.
The package implements a series of estimators for models with two-sided heterogeneity:
- two-way fixed effects estimator as proposed by Abowd, Kramarz, and Margolis
- homoskedastic bias correction as in Andrews et al.
- heteroskedastic bias correction as in Kline, Saggio, and Sølvsten
- group fixed effects estimator as in Bonhomme, Lamadon, and Manresa
- group correlated random effects estimator as presented in the main paper
- fixed-point revealed preference estimator as in Sorkin
- estimator as in Borovičková and Shimer for a modified definition of sorting
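To make the first item in the list concrete, here is a minimal sketch of the two-way fixed effects model, in which each outcome is the sum of a worker effect and a firm effect. It is not PyTwoWay's implementation: the function `fit_two_way_fe`, the toy data, and the alternating-averages solver are illustrative assumptions, and none of the bias corrections listed above are applied.

```python
def fit_two_way_fe(obs, n_workers, n_firms, n_iter=200):
    """Least-squares fit of y = alpha[i] + psi[j] by alternating averages.

    obs is a list of (worker, firm, y) tuples. Every worker and firm must
    appear at least once, and firms must be connected through movers.
    """
    alpha = [0.0] * n_workers
    psi = [0.0] * n_firms
    for _ in range(n_iter):
        # Worker effects: average residual after removing firm effects
        w_sum, w_cnt = [0.0] * n_workers, [0] * n_workers
        for i, j, y in obs:
            w_sum[i] += y - psi[j]
            w_cnt[i] += 1
        alpha = [s / c for s, c in zip(w_sum, w_cnt)]
        # Firm effects: average residual after removing worker effects
        f_sum, f_cnt = [0.0] * n_firms, [0] * n_firms
        for i, j, y in obs:
            f_sum[j] += y - alpha[i]
            f_cnt[j] += 1
        psi = [s / c for s, c in zip(f_sum, f_cnt)]
    # Effects are identified only up to a constant: normalize psi[0] = 0
    shift = psi[0]
    return [a + shift for a in alpha], [p - shift for p in psi]


# Toy, noise-free data (hypothetical): workers 0, 1 and 3 are movers
true_alpha = [1.0, 0.5, -0.5, 0.0]
true_psi = [0.0, 1.0, 2.0]
matches = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 0), (2, 0), (3, 2), (3, 0)]
obs = [(i, j, true_alpha[i] + true_psi[j]) for i, j in matches]

alpha_hat, psi_hat = fit_two_way_fe(obs, n_workers=4, n_firms=3)
print(alpha_hat, psi_hat)
```

Because the toy data contain no noise, the fitted effects coincide with the true ones after normalization. With noisy data, variance decompositions built from such estimates are biased, which is what the bias corrections in the list address.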
If you want to give it a try, example notebooks are available for the FE, CRE, BLM, Sorkin, and Borovickova-Shimer estimators. These are fully interactive notebooks with simple examples that simulate data and run the estimators.
The package provides a Python interface. Installation is handled by pip or Conda (TBD). The source of the package is available on GitHub at PyTwoWay. The online documentation is hosted here.
The code is relatively efficient. The benchmark below compares PyTwoWay's speed with that of LeaveOutTwoWay, a MATLAB package for estimating AKM and its bias corrections; most of PyTwoWay's iterative solvers finish faster.
To install via pip, from the command line run:
pip install pytwoway
To make sure you are running the most up-to-date version of PyTwoWay, from the command line run:
pip install --upgrade pytwoway
Please DO NOT download the Conda version of the package, as it is outdated!
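To see which version is currently installed, one option is to query the package metadata from Python. This is a small sketch using only the standard library; the helper name `installed_version` is ours and is not part of PyTwoWay.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed PyTwoWay version, or None if it is not installed
print(installed_version("pytwoway"))
```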
Please check out the documentation for detailed examples of how to use PyTwoWay. If you have a question that the documentation doesn't answer, please also check the past Issues to see if someone else has already asked this question and an answer has been provided. If you still can't find an answer, please open a new Issue and we will try to answer as quickly as possible.
Data are simulated with BipartitePandas using the following code:
import numpy as np
import bipartitepandas as bpd
sim_params = bpd.sim_params({'n_workers': 500000, 'firm_size': 10, 'p_move': 0.05})
rng = np.random.default_rng(1234)
sim_data = bpd.SimBipartite(sim_params).simulate(rng)
The model is then estimated on this data using the PyTwoWay class FEEstimator and the MATLAB package LeaveOutTwoWay. For estimation using PyTwoWay, all solvers other than AMG use the incomplete Cholesky decomposition as a preconditioner.
Benchmarks were run on a 2021 MacBook Pro 14" with 16 GB RAM and an Apple M1 Pro processor with 8 cores.
Some summary statistics about the largest leave-one-match-out set:
Package | #obs | #firms | #movers |
---|---|---|---|
KSS | 2,255,370 | 44,510 | 88,542 |
PyTwoWay | 2,269,665 | 44,601 | 89,098 |
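The sets in this table are restricted to firms that remain linked through worker moves. As a simplified illustration, the sketch below finds the largest plain connected set of firms with a union-find structure. This is an assumption-laden toy: the leave-one-match-out set reported above is stricter (it must stay connected after dropping any single match), and the function `largest_connected_firms` is hypothetical, not the BipartitePandas implementation.

```python
from collections import Counter


def largest_connected_firms(obs):
    """obs is a list of (worker, firm) pairs.

    Returns the firms in the connected component (firms linked by movers)
    that contains the most observations.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    first_firm = {}
    for worker, firm in obs:
        find(firm)  # register the firm
        if worker in first_firm:
            # A worker seen at two firms links those firms
            union(first_firm[worker], firm)
        else:
            first_firm[worker] = firm

    sizes = Counter(find(firm) for _, firm in obs)
    biggest_root = sizes.most_common(1)[0][0]
    return {firm for _, firm in obs if find(firm) == biggest_root}


# Hypothetical data: firms A-B-C are linked by movers 0 and 1, D-E by mover 3
toy_obs = [(0, "A"), (0, "B"), (1, "B"), (1, "C"), (2, "D"), (3, "D"), (3, "E")]
connected = largest_connected_firms(toy_obs)
print(sorted(connected))
```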
Run time:
Solver | Cleaning | Estimation | Total |
---|---|---|---|
KSS | N/A | N/A | 55.2s |
PYTW-AMG | 4.0s | 3m2s | 3m6s |
PYTW-BICG | 4.0s | 20.4s | 24.4s |
PYTW-BICGSTAB | 4.0s | 21.9s | 25.9s |
PYTW-CG | 4.0s | 19.6s | 23.6s |
PYTW-CGS | 4.0s | 20.6s | 24.6s |
PYTW-GMRES | 4.0s | 32.9s | 36.9s |
PYTW-MINRES | 4.0s | 10.7s | 14.7s |
PYTW-QMR | 4.0s | 3m53s | 3m57s |
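To compare the totals on a common scale, the short sketch below converts the reported times to seconds and computes each solver's speed relative to KSS. The values are copied from the table above; the helper `to_seconds` is ours.

```python
import re


def to_seconds(text):
    """Convert a duration such as '3m2s' or '20.4s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = match.groups()
    return 60 * int(minutes or 0) + float(seconds)


# Total run times as reported in the table
totals = {
    "KSS": "55.2s",
    "PYTW-AMG": "3m6s",
    "PYTW-BICG": "24.4s",
    "PYTW-BICGSTAB": "25.9s",
    "PYTW-CG": "23.6s",
    "PYTW-CGS": "24.6s",
    "PYTW-GMRES": "36.9s",
    "PYTW-MINRES": "14.7s",
    "PYTW-QMR": "3m57s",
}

baseline = to_seconds(totals["KSS"])
# Ratio above 1 means faster than KSS, below 1 means slower
speedups = {name: baseline / to_seconds(t) for name, t in totals.items()}
fastest = min(totals, key=lambda name: to_seconds(totals[name]))

for name, ratio in sorted(speedups.items(), key=lambda kv: -kv[1]):
    print(f"{name:14s} {ratio:5.2f}x")
```

On these numbers, MINRES has the shortest total, while AMG and QMR are slower than KSS.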
If you want to contribute to the package, the easiest way is to test that it's working properly! If you notice a part of the package is giving incorrect results, please add a new post in Issues and we will do our best to fix it as soon as possible.
We are also happy to consider any suggestions to improve the package and documentation, whether to add a new feature, make a feature more user-friendly, or make the documentation clearer. Please also post suggestions in Issues.
Finally, if you would like to help with developing the package, please make a fork of the repository and submit pull requests with any changes you make! These will be promptly reviewed, and hopefully accepted!
We are extremely grateful for all contributions made by the community!
Solving large sparse linear systems relies on a combination of PyAMG (this is the package we use to estimate the different decompositions on US data) and SciPy's iterative sparse linear solvers.
Many tools for handling sparse matrices come from SciPy.
Additional preconditioners for linear solvers come from PyMatting (installing the package is not required, as the necessary files have been copied into the submodule preconditioners). The incomplete Cholesky preconditioner in turn relies on Numba.
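For readers unfamiliar with preconditioning, the sketch below shows the general idea with a preconditioned conjugate gradient solver in plain Python. Note the substitution: it uses a simple Jacobi (diagonal) preconditioner rather than the incomplete Cholesky preconditioner mentioned above, and a small dense matrix rather than a sparse one. The function `preconditioned_cg` is illustrative only and is not code from PyTwoWay, SciPy, or PyMatting.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def preconditioned_cg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite matrix A
    (list of lists), using a Jacobi (diagonal) preconditioner."""
    n = len(b)

    def matvec(v):
        return [dot(row, v) for row in A]

    def precond(r):
        # Jacobi preconditioner: divide each residual entry by A's diagonal
        return [r[k] / A[k][k] for k in range(n)]

    x = [0.0] * n
    r = list(b)          # residual b - A x, with x = 0
    z = precond(r)       # preconditioned residual
    p = list(z)          # search direction
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            break
        Ap = matvec(p)
        step = rz / dot(p, Ap)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, Ap)]
        z = precond(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


# Small symmetric positive definite system whose solution is [1, 2, 3]
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [6.0, 10.0, 8.0]
solution = preconditioned_cg(A, b)
print(solution)
```

A better preconditioner, such as incomplete Cholesky, typically reduces the number of iterations on large, ill-conditioned systems at the cost of a one-time factorization.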
Constrained optimization is handled by QPSolvers.
Progress bars are generated with tqdm.
Parameter dictionaries are constructed using ParamsDict.
Data cleaning is handled by BipartitePandas.
We also rely on a number of standard libraries, such as NumPy, Pandas, matplotlib, etc.
Optionally, the code is compatible with:
- multiprocess. Installing this may help if multiprocessing is raising errors related to pickling objects.
- PyTorch. This may speed up BLM estimation, and adds the option to compute some operations using the GPU.
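A common way to support optional dependencies like these is to fall back to the standard library when the optional package is missing. The sketch below shows that generic pattern; it is an assumption about how such a fallback can be written, not a description of PyTwoWay's internals.

```python
import importlib.util

# Prefer the optional multiprocess package; otherwise use the standard library
try:
    import multiprocess as mp
except ImportError:
    import multiprocessing as mp

# Detect PyTorch without importing it
HAS_TORCH = importlib.util.find_spec("torch") is not None

print(f"parallel backend: {mp.__name__}, cores: {mp.cpu_count()}")
print(f"PyTorch available: {HAS_TORCH}")
```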
Please use the following citation to cite PyTwoWay in academic publications:
BibTeX entry:
@techreport{bhlmms2020,
  title={How Much Should We Trust Estimates of Firm Effects and Worker Sorting?},
  author={Bonhomme, St{\'e}phane and Holzheu, Kerstin and Lamadon, Thibaut and Manresa, Elena and Mogstad, Magne and Setzler, Bradley},
  year={2020},
  institution={National Bureau of Economic Research}
}
Thibaut Lamadon, Assistant Professor in Economics, University of Chicago, lamadon@uchicago.edu
Adam A. Oppenheimer, Research Professional, University of Chicago, oppenheimer@uchicago.edu