---
title: "CNV Calling and Filtering"
author: "Eugene Gardner"
date: "03 December 2020"
output:
  html_document:
    toc: true
    toc_depth: 2
---
# 1. Startup and Introduction
This notebook handles calling, annotation, and building filtering model(s) of CNV calls.
This proceeds in the following order:
1. Array CNVs are called in batches of ~5K samples.
2. Whole-exome CNVs (for INTERVAL only) are called using XHMM, CANOES, and CLAMMS.
3. Raw CNVs from each of the array CNV calling batches are concatenated into a final (raw) file, covariate values are calculated for each CNV, and CNVs are merged into the final file using the "CNVPolisher":
    + Attachment of individual-level data to each CNV and intersection with WES CNV calls from XHMM, CANOES, and CLAMMS. Sites/individuals **are not** filtered except for light filtering based on:
        + PASS status for the actual array based on Affymetrix standards (i.e. at the individual level)
        + Duplicated arrays (keep the highest quality score array per individual)
4. Filter CNVs according to a Random Forest model.
5. Merging of DEL/DUP calls independently, at first using `bedtools merge` (to capture all possible overlaps) and then 75% reciprocal overlap (see the sketch after this list).
    + Printing of a VCF file containing "genotyped" calls for each individual at every site
6. Annotation with VEP and generation of final files for input into `PhenotypeTesting.Rmd`.
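The 75% reciprocal-overlap criterion used in step 5 is easiest to see on a single pair of intervals. Below is a minimal sketch of the test two calls must pass to be merged (the helper name is mine, for illustration only, and is not part of the pipeline):
```{r reciprocal overlap sketch, eval=FALSE}
## Hypothetical helper: TRUE if two intervals on the same chromosome overlap
## each other by at least `frac` of BOTH of their lengths
reciprocal.overlap <- function(s1, e1, s2, e2, frac = 0.75) {
  ov <- min(e1, e2) - max(s1, s2)  # width of the shared interval (<= 0 if disjoint)
  ov > 0 && (ov / (e1 - s1)) >= frac && (ov / (e2 - s2)) >= frac
}
reciprocal.overlap(100, 200, 120, 210)  # TRUE: 80/100 and 80/90 are both >= 0.75
reciprocal.overlap(100, 200, 180, 400)  # FALSE: only 20/100 of the first interval
```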
The intent of this file is _NOT_ to perfectly regenerate the files that are provided as part of our UK Biobank analysis; it is merely intended to document the code that we used to perform the QC itself. This document also includes some comparisons of CNV QC approaches that were not shown in our manuscript.
**Note**: Example input files for some steps are given in the directory `rawdata/cnvexamples/`
You can view a compiled html version of this document with all code run either within this repository at `compiled_html/CNVCalling_Filtering.html` or on [github](https://htmlpreview.github.io/?https://github.com/eugenegardner/UKBBFertility/blob/master/compiled_html/CNVCalling_Filtering.html).
```{r setup}
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
## Silently load packages
load.package <- function(name) {
suppressMessages(suppressWarnings(library(name, quietly = T, warn.conflicts = F, character.only = T)))
}
load.package("data.table") ## Better than data.frame
load.package("tidyverse") ## provides DPLYR/ggplot2
load.package("patchwork") ## Makes combining plots easily
load.package("randomForest") ## randomForest implementation
load.package("ROCR") ## Builds ROC curves
load.package("cvAUC") ## Determines CI for AUC from ROCR
load.package("gdata") ## Allows for accurate random sampling when making training sets
theme <- theme(panel.background=element_rect(fill="white"),
               line=element_line(size=1,colour="black",lineend="round"),
               axis.line=element_line(size=1),
               text=element_text(size=10,face="bold",colour="black"),
               axis.text=element_text(colour="black"),
               axis.ticks=element_line(size=1,colour="black"),
               axis.ticks.length=unit(.1,"cm"),
               strip.background=element_rect(fill="white"),
               axis.text.x=element_text(angle=45,hjust=1),
               legend.position="blank",
               panel.grid.major=element_line(colour="grey",size=0.5))
## Default theme w/legend
theme.legend <- theme + theme(legend.position="right")
## Colour Scheme:
male <- "#38BCA0"
female <- "#7B06F8"
sex.colours <- c(male, female)
names(sex.colours) <- c("Male","Female")
sex.colours.fill <- scale_fill_manual(name = "Sex",values=sex.colours)
sex.colours <- scale_colour_manual(name = "Sex",values=sex.colours)
##Functions for Analysis:
#remove NAs from a data.table
fix.na<-function(x) {
if(is.na(x)) {
return(0)
} else {
return(x)
}
}
##Convert boolean T/F to binary 1/0
convert.log<-function(x){
if (x==F) {
return(0)
} else {
return(1)
}
}
del.line <- "#4F7942"
del.fill <- "#77DD77"
dup.line <- "#0000FF"
dup.fill <- "#AEC6CF"
```
# 2. Generation of Array-based CNV Calls
This section gives example command lines to generate PennCNV CNV calls. They *are not* intended to be directly reproducible on a separate system and will have to be modified for whichever system PennCNV is run on.
Required Files:
* Axiom reference files for the PennCNV pipeline:
    + GC content: `Axiom.gcmodel`
    + Population frequency of B allele (PFB) file: `Axiom.pfb` (used by the commands below)
    + Array-specific analysis files: `/Axiom_UKB_WCSG_Analysis-r3/`
        + These are specific to the UK Biobank and INTERVAL arrays, and should not be used for other arrays (including BiLEVE)
    + All SNVs: `mapfileAX.dat`
    + PennCNV-specific Axiom HMM: `Axiom_trained.hmm`
* Raw array .CEL files (downloaded from INTERVAL or UKB; need to apply for separate access, see [this document](http://biobank.ndph.ox.ac.uk/showcase/showcase/docs/ukbfetch_instruct.html) on how to download UK Biobank data)
* [PennCNV Scripts](http://penncnv.openbioinformatics.org/en/latest/)
All of the input CNVs from array-based calls are generated by PennCNV using the following pipeline:
```{bash PennCNV pipeline, eval=FALSE}
# For the purposes of this section '$BATCH' is the batch number!
BATCH=1
## 1. Make CEL file list:
ls batch${BATCH}/CEL_FILES/*.CEL > batch${BATCH}/cel.list_batch${BATCH}.txt
# Below adds the header 'cel_files' to the file list:
echo 'cel_files' | cat - batch${BATCH}/cel.list_batch${BATCH}.txt > batch${BATCH}/cel.list_batch${BATCH}.header.txt
## 2. Do affy genotyping:
apt/bin/apt-probeset-genotype --analysis-files-path Axiom_UKB_WCSG_Analysis-r3/ --xml-file Axiom_UKB_WCSG_Analysis-r3/Axiom_UKB_WCSG_96orMore_Step2_Bi-allelic.r3.apt-probeset-genotype.AxiomGT1.xml --out-dir batch${BATCH}/ --summaries --cel-files batch${BATCH}/cel.list_batch${BATCH}.header.txt
## 3. Generate sex file:
grep '.CEL' batch${BATCH}/AxiomGT1.report.txt | grep -v '#' | awk '{print $1"\t"$2}' | sed 's_unknown_male_g' > batch${BATCH}/sex_batch${BATCH}.txt
## 4. Generate geno clusters:
PennCNV-1.0.4/penn-affy/bin/generate_affy_geno_cluster.pl batch${BATCH}/AxiomGT1.calls.txt batch${BATCH}/AxiomGT1.confidences.txt batch${BATCH}/AxiomGT1.summary.txt --nopower2 -locfile reference/mapfileAX.dat -sexfile batch${BATCH}/sex_batch${BATCH}.txt -out batch${BATCH}/batch${BATCH}.genocluster
## 5. Normalize the geno clusters:
PennCNV-1.0.4/penn-affy/bin/normalize_affy_geno_cluster.pl -nopower2 -locfile mapfileAX.dat -out batch${BATCH}/batch${BATCH}_lrr_baf.txt batch${BATCH}/batch${BATCH}.genocluster batch${BATCH}/AxiomGT1.summary.txt
## 6. Generate lrr_baf file for each individual:
# make dir for split files:
mkdir batch${BATCH}/batch${BATCH}_split/
# This generates the script to create and run kcolumn.pl in sets of 1000 samples
# kcolumn.pl doesn't really work for more 1000 samples at a time:
echo $BATCH | perl -ne 'chomp $_; print "wc -l batch$_/cel.list_batch$_.txt \| perl -ane '\''chomp \$_; print \"\#/bin/bash\\n\\n\"; for \(\$y=1,\$z=1000;\$y<=\$F\[0\];\$y+=1000,\$z+=1000\) \{if \(\$z \> \$F\[0\]\) \{print \"PennCNV-1.0.4/kcolumn.pl batch$_/batch$_\" . \"_lrr_baf.txt split 2 -start \$y -end \$F\[0\] -tab -head 3 -name -out batch$_/batch$_" . "_split/split$_;\\n\";\} else \{print \"PennCNV-1.0.4/kcolumn.pl batch$_/batch$_\" . \"_lrr_baf.txt split 2 -start \$y -end \$z -tab -head 3 -name -out batch$_/batch$_" . "_split/split$_;\\n\";\}\}'\'' > batch$_/kcolumn.sh;\n"; print "chmod +x batch$_/kcolumn.sh;\n";' | bash
./batch${BATCH}/kcolumn.sh
## 7. Generate signal file list:
ls batch${BATCH}/batch${BATCH}_split/split${BATCH}.* > batch${BATCH}/batch${BATCH}_split/batch${BATCH}_signalfile.lst;
## 8. Run CNV detection (needs a long-running queue, as it seems to take ~4 days):
PennCNV-1.0.4/detect_cnv.pl --test --hmmfile Axiom_trained.hmm --pfbfile Axiom.pfb --listfile batch${BATCH}/batch${BATCH}_split/batch${BATCH}_signalfile.lst --output batch${BATCH}/batch${BATCH}.rawcnv --confidence --logfile batch${BATCH}/batch${BATCH}.rawcnv.log --gcmodelfile Axiom.gcmodel
## 9. Run cnv cleaning to merge CNVs based on distance from one another:
PennCNV-1.0.4/clean_cnv.pl -signalfile Axiom.pfb -fraction 0.25 -bp combineseg batch${BATCH}/batch${BATCH}.rawcnv > batch${BATCH}/join_batch${BATCH}.rawcnv
## 10. Convert ALL split files in a batch to pseudo-bed format, sort, bgzip, and index; like:
tail -n +2 batch${BATCH}/batch${BATCH}_split/split${BATCH}.a550484 | perl -ane 'chomp $_; shift(@F); splice(@F, 1, 0, $F[1]); print join("\t", @F) . "\n";' | bedtools sort -i stdin > split1.a550484.bed
bgzip split1.a550484.bed
tabix -p bed split1.a550484.bed.gz
```
# 3. WES CNV Calling:
CNVs were then called from WES using:
* [XHMM](https://atgu.mgh.harvard.edu/xhmm/tutorial.shtml)
* [CLAMMS](https://github.com/rgcgithub/clamms)
* [CANOES](http://www.columbia.edu/~ys2411/canoes/)
I am not going to specifically document how each of these tools was used to call CNVs, as instructions for performing this analysis are available on their respective GitHub pages or, in the case of XHMM, in the excellent [XHMM methods paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4065038/). All tools were run with default settings.
For each caller, final calls are converted to a single concatenated file with the following columns for use with the downstream QC pipeline:
1. Chromosome, e.g. 1-22, X, Y (without 'chr')
2. start
3. end
4. Number of affected WES probes
5. Intentionally left blank
6. Copy type (either DEL or DUP)
7. sample ID
This file must then be sorted, zipped, and indexed with `bgzip` and `tabix`. An example of what this file looks like is available at `rawdata/cnvexamples/wes_input.tsv.gz`
**Note**: WES ID *MUST* be consistent between all of the files, or else the subsequent processing pipeline will not work properly. It does not matter what this ID is, other than containing only alphanumeric characters or underscores.
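As a concrete illustration, below is a minimal R sketch of writing this seven-column file (the `calls` object and file names are hypothetical; `bgzip` and `tabix` are then run in the shell):
```{r write WES call file, eval=FALSE}
library(data.table)

## Hypothetical caller output: one row per WES CNV call
calls <- data.table(chr    = c("1", "X"),
                    start  = c(100000L, 500000L),
                    end    = c(105000L, 560000L),
                    probes = c(4L, 12L),     # number of affected WES probes
                    blank  = "",             # column 5 is intentionally left blank
                    ct     = c("DEL", "DUP"),
                    id     = c("sample_1", "sample_2"))

## Sort, then write tab-delimited without a header
setorder(calls, chr, start)
fwrite(calls, "wes_input.tsv", sep = "\t", col.names = FALSE, quote = FALSE)
## Then in the shell: bgzip wes_input.tsv && tabix -p bed wes_input.tsv.gz
```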
# 4. Covariate Annotation:
This section attaches sample-level information via a custom Java pipeline (included in this repo at `src/CNVQC/` and as a compiled jar -- **runnable with Java 14** -- at `scripts/CNVAnnotator.jar`). There are two possible ways to annotate CNVs: one with exome data (INTERVAL) and one without (UKBB). To annotate CNVs without WES, simply omit the canoes/clamms/xhmm files. The file provided to `-samples` looks like:
```
a550484-4209466-021515-320_A10.CEL /calling/batch1/batch1_split/split1.a550484_R1 sampl_1 FEMALE 0.9963 1234567
```
Where the columns are tab-delimited and are:
1. The CEL file name
2. The split file without the `.bed.gz` suffix. This is due to how PennCNV sets file names in the *.rawcnv file, and allows us to match information in the QCsum file
3. The sample ID linking to WES data as described above. This _must_ be the same ID used in the WES call files. Can be set to 'null' if no WES data is available.
4. Sample gender in ALL CAPS
5. Quality score from the Affymetrix raw data (can be 'NaN' if unknown)
6. 7-digit eid of the participant from your UKBB fam file
The files provided to `-axiom` and `-wes` are tab-delimited BED files of Axiom and WES probes, respectively. For WES probes, I have merged adjacent probes within 500bp of each other to represent 'exons'. Example files are available at `rawdata/cnvexamples/axiom_baits.example.bed.gz` and `rawdata/cnvexamples/exome_baits.example.bed.gz`.
**Note**: It is up to the user to create this file.
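For illustration, a minimal sketch of building a `-samples` file in R is below (all values are hypothetical placeholders matching the example row and column descriptions above):
```{r build samples file, eval=FALSE}
library(data.table)

samples <- data.table(cel.file   = "a550484-4209466-021515-320_A10.CEL",
                      split.file = "/calling/batch1/batch1_split/split1.a550484_R1",
                      wes.id     = "sampl_1",  # 'null' if no WES data is available
                      gender     = "FEMALE",   # ALL CAPS
                      qc.score   = 0.9963,     # 'NaN' if unknown
                      eid        = 1234567L)   # 7-digit UKBB eid
fwrite(samples, "samples_file.txt", sep = "\t", col.names = FALSE, quote = FALSE)
```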
```{bash CNV file merging, eval=FALSE}
## Merge calls across batches:
cat batch*/join_batch*.rawcnv > merging/batchAll.rawcnv
## Merge QC values across batches:
cat batch*/batch*.qcsum > merging/batchAll.qcsum
# In the below example, -s is the "chunk" of the file to annotate, in 500-line chunks.
# For example, if you have 2,000 CNVs to annotate, you would run the java code below
# four times, with -s set to 1, 2, 3, 4 in each run. This allows easy parallelization
# on a compute cluster. If -s is not given, ALL lines of the file are run, which can
# take A _LONG_ time.
## This annotates each CNV with necessary information for the random forest model used for QC purposes:
CHUNK=1
java -Xmx2G -Xms2G -XX:+UseSerialGC -jar /nfs/ddd0/eg15/CNVAnnotator.jar Annotate \
-h hs37d5.fa \
-o merging/raw_files/batchAll.${CHUNK}.annotated \
-p merging/batchAll.rawcnv \
-tmp tmp/ \
-s ${CHUNK} \
-axiom axiom_baits.bed.gz \
-wes exome_baits.bed.gz \
-canoes CANOES.cnvs.bed.gz \
-clamms CLAMMS.cnvs.bed.gz \
-xhmm XHMM.sorted.bed.gz \
-samples samples_file.txt \
-qcsum merging/batchAll.qcsum
## Primary output will be a file named batchAll.<chunk_number>.annotated.txt,
## where chunk number will be the value provided to -s
## Cat together individual annotations:
# Have to grab the header since we remove it in the next step
grep '#' merging/raw_files/batchAll.1.annotated.txt > merging/header.txt
grep -h -v "#" merging/raw_files/batchAll.*.annotated.txt | cat merging/header.txt - > merging/batchAll.annotated.txt
```
# 5. Filtering CNVs
The above Java code for `Annotate` performs **no** filtering based on any model. The code below uses pre-merge information to build a Random Forest QC model trained using only variants found in individuals with WES data. This model can then be applied across any dataset to filter variants that do not "look" (based on various covariates) like a CNV that should have WES evidence.
## 5A. Load Annotated INTERVAL CNV calls
Calls added here are for all individuals in INTERVAL. An additional column (has.wes) indicates whether the individual has WES data; only these individuals are used to train our model. An example head of a complete batchAll.annotated.txt can be found at `rawdata/cnvexamples/example.annotated.txt`.
```{r Load merged INTERVAL CNV calls}
annotated.cnvs<-fread("rawdata/cnvexamples/batchAll.annotated.for_testing.txt")
setnames(annotated.cnvs,"#chr","chr")
## Store column names from original file to print final QCd file:
original.names <- names(annotated.cnvs)
## Define which callers I am using for the Random Forest
callers <- c("wes.xhmm.int","wes.clamms.int","wes.canoes.int")
int.string <- paste(callers,collapse = " + ")
annotated.cnvs[,wes.support.num:=(eval(parse(text=int.string)))]
## Define wes.support TRUTH as any call with >= 1 overlap (this is due to the binary nature of ROCR)
annotated.cnvs[,wes.support:=wes.support.num>0]
##Assign a factor to copy type rather than a number
assign.ct<-function(x) {
if (x < 2) {
return("DEL")
} else {
return("DUP")
}
}
annotated.cnvs[,ct:=mapply(assign.ct,Copy_Number)]
annotated.cnvs[,ct:=as.factor(ct)]
##Add dummy column for summing using standard data.table functions:
annotated.cnvs[,dummy:=1]
paste("Total number of CNVs loaded:", nrow(annotated.cnvs))
```
## 5B. Load Annotated UKBB CNV calls
This file is identical in format to the one loaded above for INTERVAL; see the example head file referenced there.
```{r Load merged UKBB CNV calls}
ukbb.annotated.cnvs <- fread("rawdata/cnvexamples/UKBB.with_bileve.with_batch18.annotated.txt")
setnames(ukbb.annotated.cnvs,"#chr","chr")
ukbb.annotated.cnvs[,ct:=mapply(assign.ct,Copy_Number)]
ukbb.annotated.cnvs[,ct:=as.factor(ct)]
##Add dummy column for summing using standard data.table functions:
ukbb.annotated.cnvs[,dummy:=1]
paste("Total number of CNVs loaded:", nrow(ukbb.annotated.cnvs))
```
### Building The "Good Array" Covariate
This code creates a metric for each probe to see how often it is deleted in good versus bad arrays. 'Bad' arrays are classified as meeting any of:
LRR_SD ≥ 0.35 | NumCNV ≥ 30 | |WF| ≥ 0.03
'Good' arrays pass all three of these thresholds, and the 'best' arrays (used below) are the good arrays that fall below the 75th percentile of the good arrays for each metric.
```{r defining cutoffs}
bad.arrays <- annotated.cnvs[LRR_SD >= 0.35 | NumCNV >= 30 | abs(WF) >= 0.03]
good.arrays <- annotated.cnvs[LRR_SD < 0.35 & NumCNV < 30 & abs(WF) < 0.03]
## Define cutoffs as 75%ile of each respective category
LRR_SD.quant <- quantile(good.arrays[,LRR_SD])
numCNV.quant <- quantile(good.arrays[,NumCNV])
wf.quant <- quantile(good.arrays[,abs(WF)])
## get best arrays based on these cutoffs. No need to run this portion again since I have already calculated the metric (see below)
best.arrays <- good.arrays[LRR_SD < LRR_SD.quant[[4]] & NumCNV < numCNV.quant[[4]] & abs(WF) < wf.quant[[4]]]
write.table(bad.arrays[ct=="DEL",c("chr","start","end")],"rawdata/cnvexamples/probe_badness/bad_arrays.del.bed",sep="\t",col.names = F,row.names = F)
write.table(best.arrays[ct=="DEL",c("chr","start","end")],"rawdata/cnvexamples/probe_badness/best_arrays.del.bed",sep="\t",col.names = F,row.names = F)
write.table(bad.arrays[ct=="DUP",c("chr","start","end")],"rawdata/cnvexamples/probe_badness/bad_arrays.dup.bed",sep="\t",col.names = F,row.names = F)
write.table(best.arrays[ct=="DUP",c("chr","start","end")],"rawdata/cnvexamples/probe_badness/best_arrays.dup.bed",sep="\t",col.names = F,row.names = F)
write.table(annotated.cnvs[ct=="DEL",c("chr","start","end","cel.file","ct")],"rawdata/cnvexamples/probe_badness/to_annotate.INTERVAL.del.bed",sep="\t",col.names = F,row.names = F,quote=F)
write.table(annotated.cnvs[ct=="DUP",c("chr","start","end","cel.file","ct")],"rawdata/cnvexamples/probe_badness/to_annotate.INTERVAL.dup.bed",sep="\t",col.names = F,row.names = F,quote=F)
write.table(ukbb.annotated.cnvs[ct=="DEL",c("chr","start","end","split.file","ct")],"rawdata/cnvexamples/probe_badness/to_annotate.UKBB.del.bed",sep="\t",col.names = F,row.names = F,quote=F)
write.table(ukbb.annotated.cnvs[ct=="DUP",c("chr","start","end","split.file","ct")],"rawdata/cnvexamples/probe_badness/to_annotate.UKBB.dup.bed",sep="\t",col.names = F,row.names = F,quote=F)
```
We then use bedtools (which must be on `$PATH`) to count the number of times each Axiom probe is independently deleted. bedtools emits a warning in some instances (`[W::sam_read1] Parse error at line 1`) that does not affect the result.
```{bash generate per-probe scores}
## Might need to change to ~/.bashrc depending on system (this is for macos)
source ~/.bash_profile
bedtools intersect -c -a rawdata/cnvexamples/axiom_probes.bed.gz -b rawdata/cnvexamples/probe_badness/best_arrays.del.bed > rawdata/cnvexamples/probe_badness/best_arrays.overlap.del.bed
bedtools intersect -c -a rawdata/cnvexamples/axiom_probes.bed.gz -b rawdata/cnvexamples/probe_badness/bad_arrays.del.bed > rawdata/cnvexamples/probe_badness/bad_arrays.overlap.del.bed
bedtools intersect -c -a rawdata/cnvexamples/axiom_probes.bed.gz -b rawdata/cnvexamples/probe_badness/best_arrays.dup.bed > rawdata/cnvexamples/probe_badness/best_arrays.overlap.dup.bed
bedtools intersect -c -a rawdata/cnvexamples/axiom_probes.bed.gz -b rawdata/cnvexamples/probe_badness/bad_arrays.dup.bed > rawdata/cnvexamples/probe_badness/bad_arrays.overlap.dup.bed
paste rawdata/cnvexamples/probe_badness/best_arrays.overlap.del.bed rawdata/cnvexamples/probe_badness/bad_arrays.overlap.del.bed | perl -ane 'chomp $_; my $ratio = log(($F[3] + 1) / ($F[7] + 1)); print "$F[0]\t$F[1]\t$F[2]\t$ratio\n";' > rawdata/cnvexamples/probe_badness/del_badness_mask.bed
paste rawdata/cnvexamples/probe_badness/best_arrays.overlap.dup.bed rawdata/cnvexamples/probe_badness/bad_arrays.overlap.dup.bed | perl -ane 'chomp $_; my $ratio = log(($F[3] + 1) / ($F[7] + 1)); print "$F[0]\t$F[1]\t$F[2]\t$ratio\n";' > rawdata/cnvexamples/probe_badness/dup_badness_mask.bed
## INTERVAL
bedtools sort -i rawdata/cnvexamples/probe_badness/to_annotate.INTERVAL.del.bed | bedtools map -a stdin -b rawdata/cnvexamples/probe_badness/del_badness_mask.bed -c 4 -o mean > rawdata/cnvexamples/probe_badness/annotated.INTERVAL.del.bed
bedtools sort -i rawdata/cnvexamples/probe_badness/to_annotate.INTERVAL.dup.bed | bedtools map -a stdin -b rawdata/cnvexamples/probe_badness/dup_badness_mask.bed -c 4 -o mean > rawdata/cnvexamples/probe_badness/annotated.INTERVAL.dup.bed
## UKBB
bedtools sort -i rawdata/cnvexamples/probe_badness/to_annotate.UKBB.del.bed | bedtools map -a stdin -b rawdata/cnvexamples/probe_badness/del_badness_mask.bed -c 4 -o mean > rawdata/cnvexamples/probe_badness/annotated.UKBB.del.bed
bedtools sort -i rawdata/cnvexamples/probe_badness/to_annotate.UKBB.dup.bed | bedtools map -a stdin -b rawdata/cnvexamples/probe_badness/dup_badness_mask.bed -c 4 -o mean > rawdata/cnvexamples/probe_badness/annotated.UKBB.dup.bed
cat rawdata/cnvexamples/probe_badness/annotated.INTERVAL.del.bed rawdata/cnvexamples/probe_badness/annotated.INTERVAL.dup.bed > rawdata/cnvexamples/probe_badness/annotated.INTERVAL.bed
cat rawdata/cnvexamples/probe_badness/annotated.UKBB.del.bed rawdata/cnvexamples/probe_badness/annotated.UKBB.dup.bed > rawdata/cnvexamples/probe_badness/annotated.UKBB.bed
```
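For reference, the per-probe score computed by the `paste | perl` step above is just an add-one-smoothed log-ratio of per-probe deletion counts in best versus bad arrays. A minimal R equivalent for the DEL mask (assuming, as above, that both count files come from `bedtools intersect -c` against the same probe file, so rows align) would be:
```{r badness score in R, eval=FALSE}
library(data.table)

## Columns V1-V4: chr, start, end, overlap count (no header)
best <- fread("rawdata/cnvexamples/probe_badness/best_arrays.overlap.del.bed")
bad  <- fread("rawdata/cnvexamples/probe_badness/bad_arrays.overlap.del.bed")

## Add-one smoothed log-ratio; higher = deleted more often in the best arrays
mask <- best[, .(chr = V1, start = V2, end = V3,
                 ratio = log((V4 + 1) / (bad[, V4] + 1)))]
fwrite(mask, "rawdata/cnvexamples/probe_badness/del_badness_mask.bed",
       sep = "\t", col.names = FALSE)
```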
And then just read back in and attach to INTERVAL/UKBB annotated.cnvs:
```{r add probe ratio, fig.height=4, fig.width=8}
## INTERVAL
bad.arrays.bed <- fread("rawdata/cnvexamples/probe_badness/annotated.INTERVAL.bed")
setnames(bad.arrays.bed,names(bad.arrays.bed),c("chr","start","end","cel.file","ct","badness.score"))
annotated.cnvs <- merge(annotated.cnvs,bad.arrays.bed,by=c("chr","start","end","cel.file","ct"),all=F)
## UKBB
bad.arrays.UKBB.bed <- fread("rawdata/cnvexamples/probe_badness/annotated.UKBB.bed")
setnames(bad.arrays.UKBB.bed,names(bad.arrays.UKBB.bed),c("chr","start","end","split.file","ct","badness.score"))
bad.arrays.UKBB.bed <- bad.arrays.UKBB.bed[badness.score!="."]
bad.arrays.UKBB.bed[,badness.score:=as.double(badness.score)]
ukbb.annotated.cnvs <- merge(ukbb.annotated.cnvs,bad.arrays.UKBB.bed,by=c("chr","start","end","split.file","ct"),all=F)
```
## 5C. Filtering based on already published methods
### Filtering based on Macé et al.
This code filters on a step-wise logistic regression using data provided by [Macé et al.](https://academic.oup.com/bioinformatics/article/32/21/3298/2415363). The covariates and methods below are taken from the R package that they developed, which is available online at [PennCNV Pipeline](http://goo.gl/T6yuFM) (a direct link to the tarball download). This pipeline is described in [Fang et al.](https://link.springer.com/protocol/10.1007/978-1-4939-8666-8_1), a protocol on how to perform PennCNV-based array calls.
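Concretely, the quality score computed below for each CNV is a logistic function of the z-scaled covariates weighted by the published coefficients:
$$QS = \frac{1}{1 + e^{-\left(\beta_0 + \sum_i \beta_i \tilde{x}_i\right)}}$$
where $\beta_0$ is the intercept, the $\beta_i$ are the 'Estimate' values defined below, and $\tilde{x}_i$ is covariate $i$ scaled across all CNVs of that copy type. Deletion scores are then negated so that DEL and DUP calls occupy opposite ends of the score range.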
```{r CNV Filtering in Macé, fig.height=4, fig.width=12}
## Coefficients are from R_script_calculate_quality_score.R in the Penn CNV pipeline
build.coefficients<-function(copy.type) {
if (copy.type=="DUP") {
QS_coefficient <- cbind(c(-0.81529185, 3.90248885, -0.90184846, 0.34376765, -0.06351849, -0.01497046, -0.30665878, -0.10354156, -0.44829901, 0.06039380, 0.03645913), rep(0, 11))
rownames(QS_coefficient) <- c('(Intercept)', 'Max_Log_BF', 'No_Probes', 'LRR_SD', 'BAF_drift', 'WF', 'LRR_mean', 'NumCNV', 'BAF_SD', 'Length_bp', 'BAF_mean')
colnames(QS_coefficient) <- c('Estimate', 'Pr(>|z|)')
} else {
QS_coefficient <- cbind(c(-1.75648377, 4.64898485, -2.50150285, -0.47552224, 0.28876320, 0.10205302, 0.14363692, 0.02959571, 0.00000000, -0.00000000), rep(0, 10))
rownames(QS_coefficient) <- c('(Intercept)', 'Max_Log_BF', 'No_Probes', 'NumCNV', 'LRR_SD', 'Length_bp', 'LRR_mean', 'WF', 'BAF_drift', 'BAF_SD')
colnames(QS_coefficient) <- c('Estimate', 'Pr(>|z|)')
}
return(QS_coefficient)
}
## Build coefficients separately for DUPs and DELs.
## This just makes a table with the coefficients for DEL/DUP
QS_coefficient_duplication<-build.coefficients("DUP")
QS_coefficient_deletion<-build.coefficients("DEL")
## Function takes my CNV data and the model and generates a quality score (QS) for each CNV
calculate_QS <- function(cnv_data, model_data, copy.type) {
cnv_data<-cnv_data[ct==copy.type]
QS_val <- data.table(QS=rep(model_data['(Intercept)', 'Estimate'], nrow(cnv_data)),split.file=cnv_data[,split.file],location=cnv_data[,location],ct=rep(copy.type,nrow(cnv_data)))
## This for loop works by taking each covariate, scaling it (scale) across all CNVs and then multiplying it by the Estimate for that covariate
for(para in 2:nrow(model_data)){
para_name <- rownames(model_data)[para]
para.est <- model_data[para_name, 'Estimate']
para.calcd <- scale(cnv_data[,..para_name])
QS_val[,QS:=QS + (para.est * para.calcd)]
}
QS_val[,QS:= (1 / (1 + exp(-QS)))]
if (copy.type == "DEL") {
QS_val[,QS:=QS*-1] ##Deletion scores are scaled negatively
}
return(QS_val)
}
## First INTERVAL
QS_deletion <- data.table(calculate_QS(cnv_data = annotated.cnvs, model_data = QS_coefficient_deletion, copy.type="DEL"))
QS_duplication <- data.table(calculate_QS(cnv_data = annotated.cnvs, model_data = QS_coefficient_duplication, copy.type="DUP"))
## Take QS scores and make one table and then merge them with the original annotated.cnvs table
QS.table<-data.table(QS.mace=c(QS_deletion[,QS],QS_duplication[,QS]),split.file=c(QS_deletion[,split.file],QS_duplication[,split.file]),location=c(QS_deletion[,location],QS_duplication[,location]),ct=c(QS_deletion[,ct],QS_duplication[,ct]))
QS_deletion<-NULL
QS_duplication<-NULL
annotated.cnvs<-merge(annotated.cnvs,QS.table,by=c("split.file","location","ct"),all=T)
thresh<-0.8 ## Set Threshold for filtering CNVs
totals<-merge(annotated.cnvs[abs(QS.mace)>=thresh,sum(dummy),by=c("ct")],annotated.cnvs[,sum(dummy),by=c("ct")],by="ct")
totals[,pos.x:=if_else(ct == "DEL",-0.6,0.6)]
totals[,pos.y:=2500]
totals[,pct:=(V1.x/V1.y)*100]
ggplot(annotated.cnvs,aes(QS.mace,fill=ct,colour=ct)) + scale_fill_manual(values=c(del.fill,dup.fill)) + scale_colour_manual(values=c(del.line,dup.line)) + geom_histogram(binwidth = 0.01) + geom_vline(aes(xintercept=0.8),colour=dup.line,size=1.5) + geom_vline(aes(xintercept=-0.8),colour=del.line,size=1.5) + geom_text(inherit.aes=F,data=totals,aes(x=pos.x,y=pos.y,label=paste("Total Pass ",ct,":\n",V1.x,"\n(",sprintf("%.1f",pct),"%)",sep="")),size=5) + xlab("Macé-QS") + ylab("Total CNV sites per bin") + theme
rm(QS.table)
## Now UKBB
QS_deletion <- data.table(calculate_QS(cnv_data = ukbb.annotated.cnvs, model_data = QS_coefficient_deletion, copy.type="DEL"))
QS_duplication <- data.table(calculate_QS(cnv_data = ukbb.annotated.cnvs, model_data = QS_coefficient_duplication, copy.type="DUP"))
## Take QS scores and make one table and then merge them with the original annotated.cnvs table
QS.table<-data.table(QS.mace=c(QS_deletion[,QS],QS_duplication[,QS]),split.file=c(QS_deletion[,split.file],QS_duplication[,split.file]),location=c(QS_deletion[,location],QS_duplication[,location]),ct=c(QS_deletion[,ct],QS_duplication[,ct]))
QS_deletion<-NULL
QS_duplication<-NULL
ukbb.annotated.cnvs<-merge(ukbb.annotated.cnvs,QS.table,by=c("split.file","location","ct"),all=T)
thresh<-0.8 ## Set Threshold for filtering CNVs
totals<-merge(ukbb.annotated.cnvs[abs(QS.mace)>=thresh,sum(dummy),by=c("ct")],ukbb.annotated.cnvs[,sum(dummy),by=c("ct")],by="ct")
totals[,pos.x:=if_else(ct == "DEL",-0.6,0.6)]
totals[,pos.y:=25000]
totals[,pct:=(V1.x/V1.y)*100]
ggplot(ukbb.annotated.cnvs,aes(QS.mace,fill=ct,colour=ct)) + scale_fill_manual(values=c(del.fill,dup.fill)) + scale_colour_manual(values=c(del.line,dup.line)) + geom_histogram(binwidth = 0.01) + geom_vline(aes(xintercept=0.8),colour=dup.line,size=1.5) + geom_vline(aes(xintercept=-0.8),colour=del.line,size=1.5) + geom_text(inherit.aes=F,data=totals,aes(x=pos.x,y=pos.y,label=paste("Total Pass ",ct,":\n",V1.x,"\n(",sprintf("%.1f",pct),"%)",sep="")),size=5) + xlab("Macé-QS") + ylab("Total CNV sites per bin") + theme
rm(QS.table)
```
### Filtering using hard-cutoff methods
Using the filters presented as 'canonical' in Macé et al., and adding them, for comparison's sake, as filters on annotated.cnvs:
**Note**: I have dropped filter A, as the BAF_SD filter of ≤0.05 filters everything (see plot below)
I have also added George Kirov's default filters (taken from [this](https://jmg.bmj.com/content/early/2018/10/20/jmedgenet-2018-105477.long) paper) as filter.gk.
```{r additional filters}
## TRUE means true call, false means FALSE call
## Filter B
annotated.cnvs[,filter.b:=(LRR_SD < 0.3 & BAF_drift < 0.01 & abs(WF) < 0.05 & NumCNV < 100)]
ukbb.annotated.cnvs[,filter.b:=(LRR_SD < 0.3 & BAF_drift < 0.01 & abs(WF) < 0.05 & NumCNV < 100)]
print("Filter B:")
table(annotated.cnvs[,filter.b])
## Filter C
annotated.cnvs[,filter.c:=(LRR_SD < 0.24 & abs(WF) < 0.05 & callrate > 0.99 & No_Probes >= 10)]
ukbb.annotated.cnvs[,filter.c:=(LRR_SD < 0.24 & abs(WF) < 0.05 & callrate > 0.99 & No_Probes >= 10)]
print("Filter C:")
table(annotated.cnvs[,filter.c])
## Filter D
annotated.cnvs[,filter.d:=(LRR_SD < 0.3 & abs(WF) < 0.05 & NumCNV < 100 & callrate > 0.98)]
ukbb.annotated.cnvs[,filter.d:=(LRR_SD < 0.3 & abs(WF) < 0.05 & NumCNV < 100 & callrate > 0.98)]
print("Filter D:")
table(annotated.cnvs[,filter.d])
## Filter E
annotated.cnvs[,filter.e:=(LRR_SD <= 0.27 & BAF_SD <= 0.13 & callrate >= 0.98 & No_Probes >= 5 & Length_bp >= 1000)]
ukbb.annotated.cnvs[,filter.e:=(LRR_SD <= 0.27 & BAF_SD <= 0.13 & callrate >= 0.98 & No_Probes >= 5 & Length_bp >= 1000)]
print("Filter E:")
table(annotated.cnvs[,filter.e])
## Filter GK
annotated.cnvs[,filter.gk:=(LRR_SD <= 0.35 & abs(WF) <= 0.03 & NumCNV <= 30 & callrate >= 0.96 & No_Probes >= 10 & density <= 20000)]
ukbb.annotated.cnvs[,filter.gk:=(LRR_SD <= 0.35 & abs(WF) <= 0.03 & NumCNV <= 30 & callrate >= 0.96 & No_Probes >= 10 & density <= 20000)]
print("George Kirov default filter:")
table(annotated.cnvs[,filter.gk])
## Macé suggested filters
annotated.cnvs[,filter.mace.8:=abs(QS.mace) > 0.8]
annotated.cnvs[,filter.mace.5:=abs(QS.mace) > 0.5]
ukbb.annotated.cnvs[,filter.mace.8:=abs(QS.mace) > 0.8]
ukbb.annotated.cnvs[,filter.mace.5:=abs(QS.mace) > 0.5]
print("Macé suggested filters")
table(annotated.cnvs[,filter.mace.8])
table(annotated.cnvs[,filter.mace.5])
## Add a "null" filter -- e.g. keep all sites to see what happens:
annotated.cnvs[,filter.null:=TRUE]
ukbb.annotated.cnvs[,filter.null:=TRUE]
```
## 5D. Novel Random Forest Model
We use the R package 'randomForest' to develop an ML model for declaring whether a CNV "looks like" a WES CNV.
Covariates included in my model are listed below in the Formula section. I used the tutorial [here](https://trevorstephens.com/kaggle-titanic-tutorial/r-part-5-random-forests/) to help set up randomForest.
All covariates are included in the output from the Java `Annotate` tool. The model currently includes the following covariates ("-" indicates covariates also included in Macé et al.):
* WF -
* BAF_SD -
* BAF_mean -
* No_Probes -
* Length_bp -
* BAF_drift -
* LRR_mean
* density
* LRR_SD
* site.LRR_mean
* site.LRR_SD
* site.BAF_mean
* site.BAF_SD
* badness.score (my bad name for the score calculated in the above section on "Good Array" Covariate).
I have excluded the Max_Log_BF covariate from the Macé model.
### Formula for ML Models
The below code defines the formula used by *all* of my machine learning models as the input to 'randomForest'. Any modification here will change all downstream code:
```{r ML Formula}
covariates<-c("LRR_SD","BAF_drift","WF","LRR_mean","NumCNV","BAF_SD","BAF_mean","site.BAF_mean","site.BAF_SD","site.LRR_mean","site.LRR_SD","density","No_Probes","Length_bp","badness.score")
cov.string <- paste(covariates, collapse=" + ")
form <- as.formula(paste("wes.support.num",cov.string,sep=" ~ "))
```
### Testing Random Forest with cross validation
Below builds the random forest model by splitting CNVs with > 2 WES probes into training and test sets. The training set is used to generate a model, and then this model is used to predict on the test set. As part of this validation, I do two randomization experiments:
1. Randomize the column "wes.support" right after I select CNVs with > 2 WES probes
2. Randomize the column "wes.support" after I build the model
This code makes extensive use of the package ROCR combined with cross-validating and generating confidence intervals for the AUC with [cvAUC](https://cran.r-project.org/web/packages/cvAUC/cvAUC.pdf).
**Note**: cvAUC simply calls ROCR to generate ROC curves, which I extract to make my own plots. I extract the X & Y variables from the ROCR performance objects generated by cvAUC, but under the hood it is all ROCR. I did write a function (get.avg) based on the "avg" logic in .plot.performance, which is supplied by ROCR. A good tutorial for basic ROCR package use is [here](https://www.r-bloggers.com/a-small-introduction-to-the-rocr-package/).
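For orientation, here is a toy sketch of the ROCR/cvAUC calls used below (with simulated scores and labels, not pipeline data):
```{r ROCR usage sketch, eval=FALSE}
library(ROCR)
library(cvAUC)

set.seed(42)
scores <- runif(200)                       # simulated predictor scores
labels <- rbinom(200, 1, scores)           # simulated truth, correlated with scores

pred <- prediction(scores, labels)         # ROCR prediction object
perf <- performance(pred, "sens", "spec")  # sensitivity vs. specificity curve
roc  <- data.frame(spec = unlist(perf@x.values), sens = unlist(perf@y.values))

ci <- ci.cvAUC(scores, labels, confidence = 0.95)
ci$cvAUC                                   # AUC point estimate
ci$ci                                      # 95% confidence interval
```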
```{r leave one out, fig.height=7, fig.width=10, warning=FALSE}
## This function just runs the various models through RandomForest and Logistic Regression
run.models <- function(test,training) {
output.forest <- randomForest(form, data = training, importance = T, ntree = 500)
forest.result <- data.table(predict(output.forest,test))
test[,prob.wes:=forest.result[,V1]]
return(test)
}
## This function runs each model and returns the results as a list. This just enables cross-validation
create.cross.val <- function(copy.type) {
basic.result <- train.basic(copy.type)
random.result <- train.random(copy.type)
## Randomize wes.support after the model has been decided
basic.result[,wes.support.random:=resample(basic.result[,wes.support])]
## Add old school cutoffs:
predictor.os <- data.table(training.type=c("filter.b","filter.c","filter.d","filter.e","filter.gk","filter.mace.8","filter.mace.5","filter.null"))
return(list(list(basic.result[,prob.wes]),list(basic.result[,wes.support]),list(basic.result[,wes.support.random]),
list(random.result[,prob.wes]),list(random.result[,wes.support]),
spec("filter.b",basic.result),sens("filter.b",basic.result),
spec("filter.c",basic.result),sens("filter.c",basic.result),
spec("filter.d",basic.result),sens("filter.d",basic.result),
spec("filter.e",basic.result),sens("filter.e",basic.result),
spec("filter.gk",basic.result),sens("filter.gk",basic.result),
spec("filter.mace.8",basic.result),sens("filter.mace.8",basic.result),
spec("filter.mace.5",basic.result),sens("filter.mace.5",basic.result),
spec("filter.null",basic.result),sens("filter.null",basic.result)
))
}
## Gets an actual ROC curve that can be plotted as well as confidence intervals for display
build.plottable <- function(pred,labels) {
avg.roc<-get.avg(performance(prediction(pred,labels),"sens","spec"))
avg.roc.cis <- ci.cvAUC(pred,labels,confidence=0.95)
return(list(avg.roc,avg.roc.cis$cvAUC,avg.roc.cis$ci[1],avg.roc.cis$ci[2]))
}
## This function averages ROC curves across cross-validation runs (adapted from the "avg" code in ROCR's internal .plot.performance)
get.avg<-function(perf) {
if (length(perf@alpha.values) != 0)
perf@alpha.values <- lapply(perf@alpha.values, function(x) {
isfin <- is.finite(x)
x[is.infinite(x)] <- (max(x[isfin]) + mean(abs(x[isfin][-1] -
x[isfin][-length(x[isfin])])))
x
})
for (i in 1:length(perf@x.values)) {
ind.bool <- (is.finite(perf@x.values[[i]]) & is.finite(perf@y.values[[i]]))
if (length(perf@alpha.values) > 0)
perf@alpha.values[[i]] <- perf@alpha.values[[i]][ind.bool]
perf@x.values[[i]] <- perf@x.values[[i]][ind.bool]
perf@y.values[[i]] <- perf@y.values[[i]][ind.bool]
}
perf.sampled <- perf
alpha.values <- rev(seq(min(unlist(perf@alpha.values)), max(unlist(perf@alpha.values)),
length = max(sapply(perf@alpha.values, length))))
for (i in 1:length(perf.sampled@y.values)) {
perf.sampled@x.values[[i]] <- approxfun(perf@alpha.values[[i]],
perf@x.values[[i]], rule = 2, ties = "mean")(alpha.values)
perf.sampled@y.values[[i]] <- approxfun(perf@alpha.values[[i]],
perf@y.values[[i]], rule = 2, ties = "mean")(alpha.values)
}
perf.avg <- perf.sampled
perf.avg@x.values <- list(rowMeans(data.frame(perf.avg@x.values)))
perf.avg@y.values <- list(rowMeans(data.frame(perf.avg@y.values)))
perf.avg@alpha.values <- list(alpha.values)
return(perf.avg)
}
## Do training selection based on all possible CNVs (so overlapping sites can be considered)
train.basic <- function(copy.type) {
annotated.cnvs.ct<-annotated.cnvs[ct==copy.type & !is.na(WESID) & wes.probe.count>2 & Copy_Number > 0 & wes.support.num %in% 0:3]
annotated.cnvs.ct[,row.num:=.I]
training.set.size<-as.numeric(nrow(annotated.cnvs.ct) / 2)
training<-sample_n(annotated.cnvs.ct,training.set.size)
training.rows<-training[,row.num]
test<-annotated.cnvs.ct[!row.num %in% training.rows][Max_Log_BF!=0]
test<-run.models(test,training)
return(test)
}
## Randomized wes.support at beginning
train.random <- function(copy.type) {
annotated.cnvs.ct<-annotated.cnvs[ct==copy.type & !is.na(WESID) & wes.probe.count>2 & Copy_Number > 0 & wes.support.num %in% 0:3]
annotated.cnvs.ct[,row.num:=.I]
annotated.cnvs.ct[,wes.support:=sample(annotated.cnvs.ct[,wes.support])]
training.set.size<-as.numeric(nrow(annotated.cnvs.ct) / 2)
training<-sample_n(annotated.cnvs.ct,training.set.size)
training.rows<-training[,row.num]
test<-annotated.cnvs.ct[!row.num %in% training.rows][Max_Log_BF!=0]
test<-run.models(test,training)
return(test)
}
## Helper functions for calculating Sens/Spec of old-school filters
sens<-function(x,to.use) {
TPs <- nrow(to.use[get(x) == T][wes.support == T])
FNs <- nrow(to.use[get(x) == F][wes.support == T])
return(TPs / (TPs + FNs))
}
spec<-function(x,to.use) {
FPs <- nrow(to.use[get(x) == T][wes.support == F])
TNs <- nrow(to.use[get(x) == F][wes.support == F])
return(TNs / (TNs + FPs))
}
## this function just makes it so the code is a bit cleaner.
make.ROC <- function(copy.type) {
## Cross validate the models:
cross.val <- data.table(run=c(1:10))
cross.val[,c("basic.forest","wes.support.basic","randomized","random.forest","wes.support.random","filter.b.spec","filter.b.sens","filter.c.spec","filter.c.sens","filter.d.spec","filter.d.sens","filter.e.spec","filter.e.sens","filter.gk.spec","filter.gk.sens","filter.mace.8.spec","filter.mace.8.sens","filter.mace.5.spec","filter.mace.5.sens","filter.null.spec","filter.null.sens"):=create.cross.val(copy.type),by=run]
basic.forest.plot <- build.plottable(cross.val[,basic.forest],cross.val[,wes.support.basic])
random.i.forest.plot <- build.plottable(cross.val[,random.forest],cross.val[,wes.support.random])
random.p.forest.plot <- build.plottable(cross.val[,basic.forest],cross.val[,randomized])
if (copy.type == "DEL") {
mult <- -1
} else {
mult <- 1
}
plot.info <- data.table(AUC=c(basic.forest.plot[[2]],random.i.forest.plot[[2]],random.p.forest.plot[[2]]),
ci.lower=c(basic.forest.plot[[3]],random.i.forest.plot[[3]],random.p.forest.plot[[3]]),
ci.upper=c(basic.forest.plot[[4]],random.i.forest.plot[[4]],random.p.forest.plot[[4]]),
model=c("RandomForest","RandomForest","RandomForest"),
training.type=c("basic","random.initial","random.post"),
colour=c("#619CFF","#619CFF","#619CFF"))
perf.dt <- data.table(x = c(unlist(basic.forest.plot[[1]]@x.values),
unlist(random.i.forest.plot[[1]]@x.values),
unlist(random.p.forest.plot[[1]]@x.values)),
y = c(unlist(basic.forest.plot[[1]]@y.values),
unlist(random.i.forest.plot[[1]]@y.values),
unlist(random.p.forest.plot[[1]]@y.values)),
a = c(unlist(basic.forest.plot[[1]]@alpha.values)*mult,
unlist(random.i.forest.plot[[1]]@alpha.values)*mult,
unlist(random.p.forest.plot[[1]]@alpha.values)*mult),
training.type=c(rep("basic",length(unlist(basic.forest.plot[[1]]@x.values))),
rep("random.initial",length(unlist(random.i.forest.plot[[1]]@x.values))),
rep("random.post",length(unlist(random.p.forest.plot[[1]]@x.values)))),
model=c(rep("RandomForest",length(unlist(basic.forest.plot[[1]]@x.values))),
rep("RandomForest",length(unlist(random.i.forest.plot[[1]]@x.values))),
rep("RandomForest",length(unlist(random.p.forest.plot[[1]]@x.values)))))
perf.dt <- merge(perf.dt,plot.info,by=c("training.type","model"))
plot.info[,formated.AUC:=paste(sprintf("%0.3f",AUC),"[",sprintf("%0.3f",ci.lower),"-",sprintf("%0.3f",ci.upper),"]",sep="")]
## Add filters based on hard cutoffs:
predictor.os <- data.table(training.type=c("filter.b","filter.c","filter.d","filter.e","filter.gk","filter.mace.8","filter.mace.5","filter.null"),
sens.mean=c(mean(cross.val[,filter.b.sens]),mean(cross.val[,filter.c.sens]),mean(cross.val[,filter.d.sens]),mean(cross.val[,filter.e.sens]),mean(cross.val[,filter.gk.sens]),mean(cross.val[,filter.mace.8.sens]),mean(cross.val[,filter.mace.5.sens]),mean(cross.val[,filter.null.sens])),
spec.mean=c(mean(cross.val[,filter.b.spec]),mean(cross.val[,filter.c.spec]),mean(cross.val[,filter.d.spec]),mean(cross.val[,filter.e.spec]),mean(cross.val[,filter.gk.spec]),mean(cross.val[,filter.mace.8.spec]),mean(cross.val[,filter.mace.5.spec]),mean(cross.val[,filter.null.spec])),
sens.sd=c(sd(cross.val[,filter.b.sens]),sd(cross.val[,filter.c.sens]),sd(cross.val[,filter.d.sens]),sd(cross.val[,filter.e.sens]),sd(cross.val[,filter.gk.sens]),sd(cross.val[,filter.mace.8.sens]),sd(cross.val[,filter.mace.5.sens]),sd(cross.val[,filter.null.sens])),
spec.sd=c(sd(cross.val[,filter.b.spec]),sd(cross.val[,filter.c.spec]),sd(cross.val[,filter.d.spec]),sd(cross.val[,filter.e.spec]),sd(cross.val[,filter.gk.spec]),sd(cross.val[,filter.mace.8.spec]),sd(cross.val[,filter.mace.5.spec]),sd(cross.val[,filter.null.spec])))
predictor.os[,spec.upper:=spec.mean+spec.sd]
predictor.os[,spec.lower:=spec.mean-spec.sd]
predictor.os[,sens.upper:=sens.mean+sens.sd]
predictor.os[,sens.lower:=sens.mean-sens.sd]
if (copy.type == "DEL") {
predictor.os[,a:=-0.5]
} else {
predictor.os[,a:=0.5]
}
predictor.os[,model:="OS"]
predictor.os[,colour:="#000000"]
if (copy.type == "DEL") {
return(list(perf.dt,predictor.os,plot.info,del.line))
} else {
return(list(perf.dt,predictor.os,plot.info,dup.line))
}
}
DEL.plot <- make.ROC("DEL")
DUP.plot <- make.ROC("DUP")
## Add actual paper names...
DEL.plot[[2]][,training.type.papers:=factor(training.type,levels=c("filter.null","filter.b","filter.c","filter.d","filter.e","filter.gk","filter.mace.5","filter.mace.8"),labels=c("None","Wang et al.","Chettier et al.","Glessner et al.","Pinto et al.","Crawford et al.","Macé et al. (QS = 0.5)","Macé et al. (QS = 0.8)"))]
DUP.plot[[2]][,training.type.papers:=factor(training.type,levels=c("filter.null","filter.b","filter.c","filter.d","filter.e","filter.gk","filter.mace.5","filter.mace.8"),labels=c("None","Wang et al.","Chettier et al.","Glessner et al.","Pinto et al.","Crawford et al.","Macé et al. (QS = 0.5)","Macé et al. (QS = 0.8)"))]
## Everything from here down is just to plot
plot.data <- function(data.curve,data.point,col.to.use) {
plot <- ggplot() +
geom_line(data=data.curve,aes(x,y,group=interaction(model,training.type),linetype=training.type),size=1.5,colour=col.to.use) +
scale_colour_identity(labels=unique(data.curve[,model])) +
geom_point(data=data.point,aes(spec.mean,sens.mean),colour="black") +
geom_errorbar(data=data.point,aes(x=spec.mean,ymin=sens.lower,ymax=sens.upper),colour="black") +
geom_errorbarh(data=data.point,aes(y=sens.mean,xmin=spec.lower,xmax=spec.upper),colour="black") +
geom_text(data=data.point,aes(spec.mean-0.03,sens.mean,label=training.type.papers,colour=colour),size=5,hjust=0,position=position_dodge(width=0.1)) +
scale_x_reverse(name="Specificity",lim=c(1,-0.1),breaks=c(1,.75,.5,.25,0)) +
scale_y_continuous(name="Sensitivity",lim=c(0,1)) +
geom_abline(intercept = 1) +
theme.legend
return(plot)
}
for (a in list(DEL.plot,DUP.plot)) {
i.curve <- a[[1]]
i.point <- a[[2]]
colour <- a[[4]]
## Plot RF basic & overlap models
print(plot.data(i.curve[model=="RandomForest"][training.type=="basic" | training.type == "random.initial" | is.na(AUC)],i.point,colour))
}
rm(a)
```
### Full RandomForest to Filter UK Biobank CNVs
Now that we know that our model is valid, we take _all_ of the available WES overlap data, train a random forest, and then predict on the non-exome subset for UK Biobank.
We generate our 'wes.support.score' for UK Biobank CNV calls by using _all_ individuals with WES from INTERVAL to train our RandomForest model.
```{r UKBB filtering, fig.height=6, fig.width=6, warning=FALSE}
make.my.own.model<-function(copy.type) {
## Pull correct CN CNVs from samples that have WES data with potential probe number > 2
training<-annotated.cnvs[ct==copy.type & !is.na(WESID) & wes.probe.count>2 & Copy_Number > 0 & wes.support.num %in% 0:3]
output.forest <- randomForest(form, data = training, importance = T)
forest.result <- data.table(predict(output.forest,ukbb.annotated.cnvs[ct==copy.type ]))
forest.result[,split.file:=ukbb.annotated.cnvs[ct==copy.type,split.file]]
forest.result[,location:=ukbb.annotated.cnvs[ct==copy.type,location]]
forest.result[,ct:=copy.type]
setnames(forest.result,"V1","wes.support.score")
if (copy.type == "DEL") {
forest.result[,wes.support.score:=wes.support.score * -1]
}
return(list(forest.result,output.forest))
}
Forest_deletion<-make.my.own.model("DEL")
Forest_duplication<-make.my.own.model("DUP")
forest.table <- bind_rows(Forest_deletion[[1]],Forest_duplication[[1]])
Forest_deletion<-Forest_deletion[[2]]
Forest_duplication<-Forest_duplication[[2]]
ukbb.annotated.cnvs<-merge(ukbb.annotated.cnvs,forest.table,by=c("split.file","location","ct"),all=T)
## Homozygous DEL CNVs are almost uniformly good.
ukbb.annotated.cnvs[,wes.support.score:=mapply(ifelse,Copy_Number==0,-3,wes.support.score)]
##Print forest results:
print(Forest_deletion)
print(Forest_duplication)
## Plots covariate weights:
del.imp<-data.table(Forest_deletion$importance)
del.imp[,covar:=rownames(Forest_deletion$importance)]
del.imp.sd <- data.table(Forest_deletion$importanceSD)
del.imp.sd <- del.imp.sd[,covar:=names(Forest_deletion$importanceSD)]
setnames(del.imp.sd,"V1","std.err")
del.imp <- merge(del.imp,del.imp.sd,by="covar")
rm(del.imp.sd)
del.imp[,IncNodePurity:=NULL]
del.imp[,IncMSE.final.del:=`%IncMSE` / std.err]
del.imp[,`%IncMSE`:=NULL]
del.imp[,std.err:=NULL]
dup.imp<-data.table(Forest_duplication$importance)
dup.imp[,covar:=rownames(Forest_duplication$importance)]
dup.imp.sd <- data.table(Forest_duplication$importanceSD)
dup.imp.sd <- dup.imp.sd[,covar:=names(Forest_duplication$importanceSD)]
setnames(dup.imp.sd,"V1","std.err")
dup.imp <- merge(dup.imp,dup.imp.sd,by="covar")
rm(dup.imp.sd)
dup.imp[,IncNodePurity:=NULL]
dup.imp[,IncMSE.final.dup:=`%IncMSE` / std.err]
dup.imp[,`%IncMSE`:=NULL]
dup.imp[,std.err:=NULL]
importance <- merge(dup.imp,del.imp,by="covar")
importance <- data.table(gather(importance,IncMSE.final.del,IncMSE.final.dup,key="Copy Type",value="%IncMSE"))
setkey(importance,"%IncMSE")
importance[,covar:=factor(covar,levels=importance[`Copy Type`=="IncMSE.final.del",covar])]
## Actual Plots - Sorted by importance for Deletions.
ggplot(importance,aes(covar,`%IncMSE`,group=`Copy Type`,colour=`Copy Type`)) + geom_point(position=position_dodge(width=1),size=3) + scale_colour_manual(values = c(del.line,dup.line)) + scale_x_discrete(labels=c("Num CNVs","BAF Mean","BAF SD","BAF Drift","LRR Mean", "site - BAF SD","site - LRR SD","LRR SD","WF","site - BAF Mean","CNV Length","No. Probes","Probe Density","Probe Error","site - LRR Mean"),name="") + coord_flip() + theme.legend
```
## 5E. Cutoff application and final file creation
### Hard Cutoff
First, we define hard filters based on set parameters to remove really bad individuals that are not dropped by our RandomForest. We hard filter on:
1. > 19 CNVs
2. $|WF|$ > 0.03
This also creates a reference file of filtered individuals so that they can be provided to `PhenotypeTesting.Rmd`. This file contains split_file IDs, which the user needs to translate into UK Biobank eids; that functionality is not encoded here.
```{r Hard Filtering on Num WES}
## Collect Counts Per Individual and some other stats and filter based on the Poisson distribution of total CNVs
get.filtered.indvs <- function(data) {
count.per.indv <- data[,sum(dummy),by=c("split.file")]
count.per.indv <- merge(count.per.indv,unique(data[,c("split.file","WF")]),by="split.file", all.y=F, all.x = T)
setnames(count.per.indv,"V1","NumCNV")
pois<-data.table(table(rpois(nrow(count.per.indv),median(count.per.indv[,NumCNV]))))
NumCNV.filt <- 19
WF.filt <- 0.03
# Plot of the above
print(ggplot(count.per.indv,aes(NumCNV)) + geom_histogram(binwidth=1) + geom_vline(aes(xintercept=NumCNV.filt)) + xlim(0,50) + geom_point(data=pois,aes(as.numeric(V1),as.numeric(N)),colour="blue") + theme)
filtered.indvs <- count.per.indv[NumCNV > NumCNV.filt | abs(WF) > WF.filt,split.file]
return(filtered.indvs)
}
ukbb.filtered.indvs <- get.filtered.indvs(ukbb.annotated.cnvs)
write.table(data.table(split.file=ukbb.filtered.indvs),"rawdata/cnvexamples/ukbb.filtered_cnv_indvs.txt",sep="\t",quote=F,row.names=F)
```
### ML and Binary Filter Cutoffs
Currently using 95% specificity in the leave-one-out model performed in section 5D to determine appropriate cutoffs.
```{r Calculate Cutoff, fig.height=6, fig.width=8}
calculate.cutoff <- function(cutoff, data) {
## Overall Sens/Specificity from leave one out
combined.DUP.DEL <- bind_rows(DUP.plot[[1]][training.type == "basic"],DEL.plot[[1]][training.type == "basic"])
combined.DUP.DEL <- combined.DUP.DEL[,c("a","x","y")]
setnames(combined.DUP.DEL,c("a","x","y"),c("score","spec","sens"))
combined.DUP.DEL[,metric:="rf.wes.binary"]
# combined.DUP.DEL <- data.table(score=c(combined.DUP.DEL[training.type=="basic" & model == "RandomForest",a],
# combined.DUP.DEL[model=="OS",a]),
# metric=c(rep("rf.wes.binary",length(combined.DUP.DEL[training.type=="basic" & model == "RandomForest",a])),
# combined.DUP.DEL[model=="OS",training.type]),
# spec=c(combined.DUP.DEL[training.type=="basic" & model == "RandomForest",x],
# combined.DUP.DEL[model=="OS",x]),
# sens=c(combined.DUP.DEL[training.type=="basic" & model == "RandomForest",y],
# combined.DUP.DEL[model=="OS",y]))
## Calculate cutoffs based on the target specificity in the LOO model for both DEL and DUP in the Random Forest and generate sensitivity/specificity values
calcd.cutoffs <- data.table(ct=c(rep(c("DEL","DUP"),1)),
metric=c("rf.wes.binary","rf.wes.binary"),
linetype=c(1,1))
## calc.cutoff and calc.sens.spec find the nearest point by subtracting the target from each score (either sensitivity to DGV or the respective metric) and taking the minimum absolute difference
# Setting a higher cutoff simply lets all variants through, and removes the specificity to DGV (which we can't really calculate)
calc.cutoff <- function(ct,m) {
if (ct == "DEL") {
return(combined.DUP.DEL[metric==m][score<=0,score][which.min(abs(combined.DUP.DEL[metric==m][score<=0,spec] - cutoff))])
} else {
return(combined.DUP.DEL[metric==m][score>=0,score][which.min(abs(combined.DUP.DEL[metric==m][score>=0,spec] - cutoff))])
}
}
calcd.cutoffs[,co:=mapply(calc.cutoff,ct,metric)]
calc.sens.spec <- function(ct,m,co,returnable) {
if (returnable == "DGV") {
if (ct == "DEL") {
return(unlist(sensitivity.DGV[metric==m][score<=0,sens.DGV][which.min(abs(sensitivity.DGV[metric==m][score<=0,score] - co))]))
} else {
return(unlist(sensitivity.DGV[metric==m][score>=0,sens.DGV][which.min(abs(sensitivity.DGV[metric==m][score>=0,score] - co))]))
}
} else {
if (ct == "DEL") {
return(unlist(combined.DUP.DEL[metric==m][score<=0,..returnable][which.min(abs(combined.DUP.DEL[metric==m][score<=0,score] - co))]))
} else {
return(unlist(combined.DUP.DEL[metric==m][score>=0,..returnable][which.min(abs(combined.DUP.DEL[metric==m][score>=0,score] - co))]))
}
}
}
calcd.cutoffs[,sens:=mapply(calc.sens.spec,ct,metric,co,"sens")]
calcd.cutoffs[,spec:=mapply(calc.sens.spec,ct,metric,co,"spec")]
calcd.cutoffs.os <- bind_rows(DEL.plot[[2]],DUP.plot[[2]])
calcd.cutoffs.os[,ct:=ifelse(a<=0,"DEL","DUP")]
setnames(calcd.cutoffs.os,"a","co")
## Apply the actual cutoff
apply.cutoff <- function(m, co.val) {
DEL.cutoff <- calcd.cutoffs[metric==m][ct=="DEL",co]
DUP.cutoff <- calcd.cutoffs[metric==m][ct=="DUP",co]
if (m == "rf.wes.binary") {
m <- "wes.support.score"
}
data[,paste("filter",eval(co.val),eval(m),sep="."):= (get(m) < DEL.cutoff | get(m) > DUP.cutoff) & !split.file %in% ukbb.filtered.indvs]
}