This is an investigation into the yield-rate disparity between women and men in college admissions. It is a personal project I started to tie together my use of Python for web scraping, data cleaning, data visualization, hypothesis testing, statistical modeling, machine learning, and more. I appreciate feedback!
I had been exposed to college admissions data for years while managing the team at Newton, my education startup in Shanghai. But only after building a primitive shell-script scraper to pull college data for a “match” tool did I notice what looked like a systemic gender gap in yield rates: the ratio of students who accept offers of admission to the total number of offers a school extends.
After brushing up on probability, statistics, and linear algebra, I started learning the fundamental tools of modern data science and felt that analyzing the old admissions data from Newton would be a perfect candidate for a personal project to cement my learning. This project has given me hundreds of hours of experience with the following technologies, and I now feel confident in my ability to:
- with Jupyter notebooks (soon upgrading to JupyterLab),
  - run Jupyter notebooks remotely (and securely) from a server,
  - work on notebooks from my iPad at the cafe!
- with BeautifulSoup (sketched after this list),
  - along with requests, retrieve large numbers of pages without supervision,
  - isolate and extract targeted values from web pages.
- with PostgreSQL / SQLite (sketched after this list),
  - install database software on a server,
  - use complex SQL queries to select specific data.
- with pandas (sketched after this list),
  - perform advanced manipulation of data,
  - comfortably use MultiIndexing,
  - efficiently clean text data using string methods and regular expressions.
- with difflib / fuzzywuzzy (sketched after this list),
  - join tables on close but not identical string keys.
- with matplotlib / seaborn / bokeh (sketched after this list),
  - visualize distributions with boxplots, ECDFs, etc.,
  - finely control the customization of figures,
  - create interactive visualization tools.
- with scipy.stats (sketched after this list),
  - conduct statistical hypothesis tests,
  - normalize data with Box-Cox transforms.
- with scikit-learn (sketched after this list),
  - split data into training and test sets and cross-validate,
  - build linear, lasso, ridge, and isotonic regression models,
  - use logistic regression models for classification.
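
The unsupervised scraping loop looked roughly like this. The URL pattern and page structure here are invented for illustration, not the actual site I scraped:

```python
import time

import requests
from bs4 import BeautifulSoup

# Hypothetical list of school profile URLs; the real scraper walked
# pages discovered from an index page.
urls = [f"https://example.com/colleges/{i}" for i in range(1, 4)]

for url in urls:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Isolate a targeted value, e.g. the cell next to a "Yield" label
    # in a stats table (label is made up for this sketch).
    cell = soup.find("td", string="Yield")
    value = cell.find_next_sibling("td") if cell else None
    print(url, value.get_text(strip=True) if value else "not found")
    time.sleep(1)  # be polite: throttle unsupervised crawls
```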
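A toy version of the kind of query I mean, using an in-memory SQLite database with a made-up schema standing in for the real admissions tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE admissions (
        school TEXT, year INTEGER, sex TEXT,
        admitted INTEGER, enrolled INTEGER
    )
""")
conn.executemany(
    "INSERT INTO admissions VALUES (?, ?, ?, ?, ?)",
    [("A", 2017, "F", 1000, 350), ("A", 2017, "M", 900, 380),
     ("B", 2017, "F", 500, 120), ("B", 2017, "M", 450, 140)],
)

# Yield rate per school and sex, keeping only schools that report both sexes.
query = """
    SELECT school, sex, ROUND(1.0 * enrolled / admitted, 3) AS yield_rate
    FROM admissions
    WHERE school IN (
        SELECT school FROM admissions
        GROUP BY school
        HAVING COUNT(DISTINCT sex) = 2
    )
    ORDER BY school, sex
"""
for row in conn.execute(query):
    print(row)
```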
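A small sketch of the cleaning-plus-MultiIndex pattern in pandas, on invented rows shaped like the messy scraped fields:

```python
import pandas as pd

# Toy rows mimicking scraped text (names and rates arrive as messy strings).
df = pd.DataFrame({
    "school": ["  Example University ", "Example University\n(Main Campus)"],
    "sex": ["F", "M"],
    "yield": ["35.0%", "42 %"],
})

# Clean text columns with vectorized string methods and regular expressions.
df["school"] = (df["school"]
                .str.strip()
                .str.replace(r"\s+", " ", regex=True))
df["yield"] = (df["yield"]
               .str.extract(r"([\d.]+)")[0]   # pull the numeric part
               .astype(float) / 100)

# Index by (school, sex) so either level can be sliced cleanly.
df = df.set_index(["school", "sex"]).sort_index()
print(df.xs("F", level="sex"))
```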
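The fuzzy joins relied on closest-match lookups between string keys. A minimal version with the standard library's difflib, on made-up school names; the 0.6 cutoff is a judgment call:

```python
import difflib

# Two sources spell school names slightly differently; map each left key
# to its closest right key before joining the tables on that mapping.
left = ["Example State University", "Sample College of Arts"]
right = ["Example State Univ", "Sample Coll. of Arts", "Unrelated Institute"]

matches = {
    name: (difflib.get_close_matches(name, right, n=1, cutoff=0.6) or [None])[0]
    for name in left
}
print(matches)
```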
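A rough sketch of the boxplot/ECDF comparison, on synthetic numbers standing in for the real yield rates:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
# Synthetic yield rates; the real data comes from the scraped tables.
df = pd.DataFrame({
    "women": rng.normal(0.36, 0.08, 200),
    "men": rng.normal(0.40, 0.08, 200),
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(data=df, ax=ax1)
ax1.set_ylabel("yield rate")

# ECDF: sorted values vs. cumulative fraction of observations.
for label in df.columns:
    x = np.sort(df[label])
    y = np.arange(1, len(x) + 1) / len(x)
    ax2.plot(x, y, marker=".", linestyle="none", label=label)
ax2.set_xlabel("yield rate")
ax2.set_ylabel("ECDF")
ax2.legend()
plt.tight_layout()
plt.show()
```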
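A minimal example of the Box-Cox-then-test workflow with scipy.stats, again on synthetic right-skewed data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic, skewed samples for the two groups.
women = rng.lognormal(mean=-1.0, sigma=0.4, size=200)
men = rng.lognormal(mean=-0.9, sigma=0.4, size=200)

# Box-Cox requires strictly positive data; it returns the transformed
# values along with the fitted lambda.
women_t, lam_w = stats.boxcox(women)
men_t, lam_m = stats.boxcox(men)

# Two-sample t-test on the (more nearly normal) transformed data.
t_stat, p_value = stats.ttest_ind(women_t, men_t, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```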
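And a condensed sketch of the scikit-learn workflow: split, cross-validate, then fit a logistic regression classifier. The features and target here are simulated, not the project's actual variables:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
# Made-up features and a binary target standing in for "enrolled or not".
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = LogisticRegression()
scores = cross_val_score(clf, X_train, y_train, cv=5)  # 5-fold CV on the train set
clf.fit(X_train, y_train)
print(f"CV accuracy: {scores.mean():.3f}, "
      f"test accuracy: {clf.score(X_test, y_test):.3f}")
```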
In addition, I have a solid grasp of how multiple imputation by chained equations (MICE) works and have used the iterative-imputation approach to handle missing data in this project (see the sketch below).
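
scikit-learn's IterativeImputer implements a single-imputation flavor of the chained-equations idea: each feature with missing values is modeled as a function of the others, cycling until the imputations stabilize. A minimal sketch on a toy matrix (the columns and values are invented):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix with missing entries; columns could be SAT midpoint,
# admit rate, and yield rate (hypothetical here).
X = np.array([
    [1200.0, 0.45, 0.32],
    [1350.0, np.nan, 0.38],
    [np.nan, 0.30, 0.41],
    [1100.0, 0.55, np.nan],
])

# Round-robin regression of each incomplete feature on the rest.
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```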