Supporting Code and Data for "Sex-biased reduction in reproductive success drives selective constraint on human genes"
This repository contains the supporting code for our manuscript on the relationship between rare genetic variant burden as measured by shet and fertility. This repository consists of a few different resources necessary to replicate our findings:
-
An RStudio project consisting of three RMarkdown documents in
RMarkdown/
:SNVCalling_Filtering.Rmd
- Examples on how CNV QC and annotation was performed.CNVCalling_Filtering.Rmd
- Examples on how SNV annotation was performed.PhenotypeTesting.Rmd
- Code to replicate all main text figures, supplementary figures, and findings of the manuscript.
These documents are easily loadable into RStudio by simply doing File -> Open Project and selecting the
UKBBFertility.Rproj
file. The first two documents (i.e.*Calling_Filtering.Rmd
are not intended to be actually runnable, but are provided as examples of how we processed data as part of the project.PhenotypeTesting.Rmd
on the other hand, is intended to be run, but requires the user to download UK Biobank participant protected data using their own UK Biobank access. While our manuscript is undergoing peer review, UK Biobank will not provide required files to run this document. We will update this document when data is made available thru data access with UK Biobank. We also provide in the directorycompiled_html/
html documents produced by knitter which represent the data as we ran it on our system. Please view these documents in your browser if you want better formating with the following links: -
Scripts used as part of Rmarkdown and other data processing in
scripts/
. Please see individual RMarkdown documents for more details. -
Rawdata that is used as input for RMarkdown. This is provided as a tarball to save space. To use it please do the following:
tar -zxf rawdata.tar.gz
You should then be ready to use at least
PhenotypeTesting.Rmd
pending acquisition of UK Biobank data. -
Java source code for tools that we created to do CNV and SNV/InDel annotation and QC in
src/
. This source code is provided with an Eclipse IDE project file to enable easy loading into eclipse. Both of these projects require external jars to compile:Compiled jars which are runnable with a distribution of the java14 JRE/JDK are also provided in
scripts/
. Please see CNV/SNV Calling and Filtering RMarkdowns for more information.
This project requires the following packages/dependencies:
- R:
- biomaRt - Get gene lists we need (Need to install via bioconductor)
- readxl - Read Supplemental Excel tables
- data.table - Better than data.frame
- patchwork - Arranging ggplots
- broom - Makes getting odds ratios/betas/errors/p.values out of
glm()
much easier - meta - For doing meta analysis
- mratios - Need this to calculate 95% CIs for ratios of two means
- svglite - Need to create main text figures properly (ggsave doesn't like anything with an alpha shading)
- tidyverse - Loads ggplot, tidyr, dplyr, and stringr
- randomForest - R randomForest implementation
- ROCR - Builds ROC curves
- cvAUC - Determines CI for AUC from ROCR
- gdata - Allows for accurate random sampling when making training sets for Random Forest
- rcompanion - Used for calculating Naglekerke's pseudo r-squared for logistic regression
- lubridate - For measuring time between patient birth/ICD-code incidence
- Perl:
- None
- Java:
- Apache Commons Math
- Apache Commons CLI
- Apache Commons Exec
- HTSJDK
- Python
- scipy
- pandas
If you use any of the materials in the repository, we would appreciate it if you cited our manuscript:
(https://www.biorxiv.org/content/10.1101/2020.05.26.116111v2)