This is an investigation into the yield-rate disparity between women and men in college admissions. It is a personal project I started to tie together my use of Python for web scraping, data cleaning, data visualization, hypothesis testing, statistical modeling, machine learning, and more. I appreciate feedback!
I had been exposed to college admissions data for years while managing the team at Newton, my education startup in Shanghai. But only after building a primitive shell-script scraper to pull college data for a “match” tool did I notice what looked like a systemic gender gap in yield rates: the ratio of students who accept offers of admission to the total number of offers a school extends.
After brushing up on probability, statistics, and linear algebra, I started learning the fundamental tools of modern data science and felt that analyzing the old admissions data from Newton would be a perfect candidate for a personal project to cement my learning. This project has given me hundreds of hours of experience with the following technologies, and I now feel confident in my ability to:
- with Jupyter notebooks (soon upgrading to JupyterLab),
  - run Jupyter notebooks remotely (and securely) from a server,
  - work on notebooks from my iPad at the cafe!
- with BeautifulSoup (sketched after this list),
  - along with requests, retrieve large numbers of pages without supervision,
  - isolate and extract targeted values from web pages.
- with PostgreSQL / SQLite (sketched after this list),
  - install database software on a server,
  - use complex SQL queries to select specific data.
- with pandas (sketched after this list),
  - perform advanced manipulation of data,
  - comfortably use MultiIndexing,
  - efficiently clean text data using string methods and regular expressions.
- with difflib / fuzzywuzzy (sketched after this list),
  - join tables on close but not identical string keys.
- with matplotlib / seaborn / bokeh (sketched after this list),
  - visualize distributions with boxplots, ECDFs, etc.,
  - finely control the customization of figures,
  - create interactive visualization tools.
- with scipy.stats (sketched after this list),
  - conduct statistical hypothesis tests,
  - normalize data with Box-Cox transforms.
- with scikit-learn (sketched after this list),
  - split data into training and test sets and cross-validate,
  - build linear, lasso, ridge, and isotonic regression models,
  - use logistic regression models for classification.
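
The unsupervised scraping loop looked roughly like this. The URL pattern and page structure here are invented for illustration, not the actual site I scraped:

```python
import time

import requests
from bs4 import BeautifulSoup

# Hypothetical list of school profile URLs; the real scraper walked
# pages discovered from an index page.
urls = [f"https://example.com/colleges/{i}" for i in range(1, 4)]

for url in urls:
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Isolate a targeted value, e.g. the cell next to a "Yield" label
    # in a stats table (label is made up for this sketch).
    cell = soup.find("td", string="Yield")
    value = cell.find_next_sibling("td") if cell else None
    print(url, value.get_text(strip=True) if value else "not found")
    time.sleep(1)  # be polite: throttle unsupervised crawls
```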
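A toy version of the kind of query I mean, using an in-memory SQLite database with a made-up schema standing in for the real admissions tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE admissions (
        school TEXT, year INTEGER, sex TEXT,
        admitted INTEGER, enrolled INTEGER
    )
""")
conn.executemany(
    "INSERT INTO admissions VALUES (?, ?, ?, ?, ?)",
    [("A", 2017, "F", 1000, 350), ("A", 2017, "M", 900, 380),
     ("B", 2017, "F", 500, 120), ("B", 2017, "M", 450, 140)],
)

# Yield rate per school and sex, keeping only schools that report both sexes.
query = """
    SELECT school, sex, ROUND(1.0 * enrolled / admitted, 3) AS yield_rate
    FROM admissions
    WHERE school IN (
        SELECT school FROM admissions
        GROUP BY school
        HAVING COUNT(DISTINCT sex) = 2
    )
    ORDER BY school, sex
"""
for row in conn.execute(query):
    print(row)
```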
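A small sketch of the cleaning-plus-MultiIndex pattern in pandas, on invented rows shaped like the messy scraped fields:

```python
import pandas as pd

# Toy rows mimicking scraped text (names and rates arrive as messy strings).
df = pd.DataFrame({
    "school": ["  Example University ", "Example University\n(Main Campus)"],
    "sex": ["F", "M"],
    "yield": ["35.0%", "42 %"],
})

# Clean text columns with vectorized string methods and regular expressions.
df["school"] = (df["school"]
                .str.strip()
                .str.replace(r"\s+", " ", regex=True))
df["yield"] = (df["yield"]
               .str.extract(r"([\d.]+)")[0]   # pull the numeric part
               .astype(float) / 100)

# Index by (school, sex) so either level can be sliced cleanly.
df = df.set_index(["school", "sex"]).sort_index()
print(df.xs("F", level="sex"))
```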
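The fuzzy joins relied on closest-match lookups between string keys. A minimal version with the standard library's difflib, on made-up school names; the 0.6 cutoff is a judgment call:

```python
import difflib

# Two sources spell school names slightly differently; map each left key
# to its closest right key before joining the tables on that mapping.
left = ["Example State University", "Sample College of Arts"]
right = ["Example State Univ", "Sample Coll. of Arts", "Unrelated Institute"]

matches = {
    name: (difflib.get_close_matches(name, right, n=1, cutoff=0.6) or [None])[0]
    for name in left
}
print(matches)
```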
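A rough sketch of the boxplot/ECDF comparison, on synthetic numbers standing in for the real yield rates:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
# Synthetic yield rates; the real data comes from the scraped tables.
df = pd.DataFrame({
    "women": rng.normal(0.36, 0.08, 200),
    "men": rng.normal(0.40, 0.08, 200),
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(data=df, ax=ax1)
ax1.set_ylabel("yield rate")

# ECDF: sorted values vs. cumulative fraction of observations.
for label in df.columns:
    x = np.sort(df[label])
    y = np.arange(1, len(x) + 1) / len(x)
    ax2.plot(x, y, marker=".", linestyle="none", label=label)
ax2.set_xlabel("yield rate")
ax2.set_ylabel("ECDF")
ax2.legend()
plt.tight_layout()
plt.show()
```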
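A minimal example of the Box-Cox-then-test workflow with scipy.stats, again on synthetic right-skewed data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic, skewed samples for the two groups.
women = rng.lognormal(mean=-1.0, sigma=0.4, size=200)
men = rng.lognormal(mean=-0.9, sigma=0.4, size=200)

# Box-Cox requires strictly positive data; it returns the transformed
# values along with the fitted lambda.
women_t, lam_w = stats.boxcox(women)
men_t, lam_m = stats.boxcox(men)

# Two-sample t-test on the (more nearly normal) transformed data.
t_stat, p_value = stats.ttest_ind(women_t, men_t, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```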
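And a condensed sketch of the scikit-learn workflow: split, cross-validate, then fit a logistic regression classifier. The features and target here are simulated, not the project's actual variables:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
# Made-up features and a binary target standing in for "enrolled or not".
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = LogisticRegression()
scores = cross_val_score(clf, X_train, y_train, cv=5)  # 5-fold CV on the train set
clf.fit(X_train, y_train)
print(f"CV accuracy: {scores.mean():.3f}, "
      f"test accuracy: {clf.score(X_test, y_test):.3f}")
```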
In addition, I have a solid grasp of how multiple imputation by chained equations (MICE) works and have used the iterative-imputation approach to handle missing data in this project (see the sketch below).
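
scikit-learn's IterativeImputer implements a single-imputation flavor of the chained-equations idea: each feature with missing values is modeled as a function of the others, cycling until the imputations stabilize. A minimal sketch on a toy matrix (the columns and values are invented):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix with missing entries; columns could be SAT midpoint,
# admit rate, and yield rate (hypothetical here).
X = np.array([
    [1200.0, 0.45, 0.32],
    [1350.0, np.nan, 0.38],
    [np.nan, 0.30, 0.41],
    [1100.0, 0.55, np.nan],
])

# Round-robin regression of each incomplete feature on the rest.
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```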