---
title: "CNV Calling and Filtering"
author: "Eugene Gardner"
date: "03 December 2020"
output:
  html_document:
    toc: true
    toc_depth: 2
---
# 1. Startup and Introduction
This notebook handles calling, annotation, and building filtering model(s) of CNV calls.
This proceeds in the following order:
1. Array CNVs are called in batches of ~5K samples.
2. Whole-exome CNVs (for INTERVAL only) are called using XHMM, CANOES, and CLAMMS.
3. Raw CNVs from each of the array CNV calling batches are concatenated into a final (raw) file, covariate values are calculated for each CNV, and CNVs are merged into the final file using the "CNVPolisher":
    + Attachment of individual-level data to each CNV and intersection with WES CNV calls from XHMM, CANOES, and CLAMMS. Sites/individuals **are not** filtered except for light filtering based on:
        + PASS status for the actual array based on Affymetrix standards (i.e. at the individual level)
        + Duplicated arrays (keep the highest quality score array per individual)
4. Filter CNVs according to a Random Forest model.
5. Merging of DEL/DUP calls independently, at first using `bedtools merge` (to capture all possible overlaps) and then 75% reciprocal overlap (see the sketch after this list).
    + Printing of a VCF file containing "genotyped" calls for each individual at every site
6. Annotation with VEP and generation of final files for input into `PhenotypeTesting.Rmd`.
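The 75% reciprocal-overlap criterion used in step 5 is easiest to see on a single pair of intervals. Below is a minimal sketch of the test two calls must pass to be merged (the helper name is mine, for illustration only, and is not part of the pipeline):
```{r reciprocal overlap sketch, eval=FALSE}
## Hypothetical helper: TRUE if two intervals on the same chromosome overlap
## each other by at least `frac` of BOTH of their lengths
reciprocal.overlap <- function(s1, e1, s2, e2, frac = 0.75) {
  ov <- min(e1, e2) - max(s1, s2)  # width of the shared interval (<= 0 if disjoint)
  ov > 0 && (ov / (e1 - s1)) >= frac && (ov / (e2 - s2)) >= frac
}
reciprocal.overlap(100, 200, 120, 210)  # TRUE: 80/100 and 80/90 are both >= 0.75
reciprocal.overlap(100, 200, 180, 400)  # FALSE: only 20/100 of the first interval
```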
The intent of this file is _NOT_ to perfectly regenerate the files that are provided as part of our UK Biobank analysis; it is merely intended to document the code that we used to perform the QC itself. This document also includes some comparisons of CNV QC approaches that were not shown in our manuscript.
**Note**: Example input files for some steps are given in the directory `rawdata/cnvexamples/`
You can view a compiled html version of this document with all code run either within this repository at `compiled_html/CNVCalling_Filtering.html` or on [github](https://htmlpreview.github.io/?https://github.com/eugenegardner/UKBBFertility/blob/master/compiled_html/CNVCalling_Filtering.html).
```{r setup}
knitr::opts_chunk$set(
echo = TRUE,
message = FALSE,
warning = FALSE
)
## Silently load packages
load.package <- function(name) {
suppressMessages(suppressWarnings(library(name, quietly = T, warn.conflicts = F, character.only = T)))
}
load.package("data.table") ## Better than data.frame
load.package("tidyverse") ## provides DPLYR/ggplot2
load.package("patchwork") ## Makes combining plots easily
load.package("randomForest") ## randomForest implementation
load.package("ROCR") ## Builds ROC curves
load.package("cvAUC") ## Determines CI for AUC from ROCR
load.package("gdata") ## Allows for accurate random sampling when making training sets
theme <- theme(panel.background=element_rect(fill="white"),
               line=element_line(size=1,colour="black",lineend="round"),
               axis.line=element_line(size=1),
               text=element_text(size=10,face="bold",colour="black"),
               axis.text=element_text(colour="black"),
               axis.ticks=element_line(size=1,colour="black"),
               axis.ticks.length=unit(.1,"cm"),
               strip.background=element_rect(fill="white"),
               axis.text.x=element_text(angle=45,hjust=1),
               legend.position="blank",
               panel.grid.major=element_line(colour="grey",size=0.5))
## Default theme w/legend
theme.legend <- theme + theme(legend.position="right")
## Colour Scheme:
male <- "#38BCA0"
female <- "#7B06F8"
sex.colours <- c(male, female)
names(sex.colours) <- c("Male","Female")
sex.colours.fill <- scale_fill_manual(name = "Sex",values=sex.colours)
sex.colours <- scale_colour_manual(name = "Sex",values=sex.colours)
##Functions for Analysis:
#remove NAs from a data.table
fix.na<-function(x) {
if(is.na(x)) {
return(0)
} else {
return(x)
}
}
##Convert boolean T/F to binary 1/0
convert.log<-function(x){
if (x==F) {
return(0)
} else {
return(1)
}
}
del.line <- "#4F7942"
del.fill <- "#77DD77"
dup.line <- "#0000FF"
dup.fill <- "#AEC6CF"
```
# 2. Generation of Array-based CNV Calls
This section gives example command lines to generate PennCNV CNV calls. They *are not* intended to be directly reproducible on a separate system and will have to be modified for whichever system PennCNV is run on.
Required Files:
* Axiom reference files for the PennCNV pipeline:
    + GC content: `Axiom.gcmodel`
    + Population frequency of B allele (PFB) file: `Axiom.pfb` (used by the commands below)
    + Array-specific analysis files: `/Axiom_UKB_WCSG_Analysis-r3/`
        + These are specific to the UK Biobank and INTERVAL arrays, and should not be used for other arrays (including BiLEVE)
    + All SNVs: `mapfileAX.dat`
    + PennCNV-specific Axiom HMM: `Axiom_trained.hmm`
* Raw array .CEL files (downloaded from INTERVAL or UKB; need to apply for separate access, see [this document](http://biobank.ndph.ox.ac.uk/showcase/showcase/docs/ukbfetch_instruct.html) on how to download UK Biobank data)
* [PennCNV Scripts](http://penncnv.openbioinformatics.org/en/latest/)
All of the input CNVs from array-based calls are generated by PennCNV using the following pipeline:
```{bash PennCNV pipeline, eval=FALSE}
# For the purposes of this section '$BATCH' is the batch number!
BATCH=1
## 1. Make CEL file list:
ls batch${BATCH}/CEL_FILES/*.CEL > batch${BATCH}/cel.list_batch${BATCH}.txt
# Below adds the header 'cel_files' to the file list:
echo 'cel_files' | cat - batch${BATCH}/cel.list_batch${BATCH}.txt > batch${BATCH}/cel.list_batch${BATCH}.header.txt
## 2. Do affy genotyping:
apt/bin/apt-probeset-genotype --analysis-files-path Axiom_UKB_WCSG_Analysis-r3/ --xml-file Axiom_UKB_WCSG_Analysis-r3/Axiom_UKB_WCSG_96orMore_Step2_Bi-allelic.r3.apt-probeset-genotype.AxiomGT1.xml --out-dir batch${BATCH}/ --summaries --cel-files batch${BATCH}/cel.list_batch${BATCH}.header.txt
## 3. Generate sex file:
grep '.CEL' batch${BATCH}/AxiomGT1.report.txt | grep -v '#' | awk '{print $1"\t"$2}' | sed 's_unknown_male_g' > batch${BATCH}/sex_batch${BATCH}.txt
## 4. Generate geno clusters:
PennCNV-1.0.4/penn-affy/bin/generate_affy_geno_cluster.pl batch${BATCH}/AxiomGT1.calls.txt batch${BATCH}/AxiomGT1.confidences.txt batch${BATCH}/AxiomGT1.summary.txt --nopower2 -locfile reference/mapfileAX.dat -sexfile batch${BATCH}/sex_batch${BATCH}.txt -out batch${BATCH}/batch${BATCH}.genocluster
## 5. Normalize the geno clusters:
PennCNV-1.0.4/penn-affy/bin/normalize_affy_geno_cluster.pl -nopower2 -locfile mapfileAX.dat -out batch${BATCH}/batch${BATCH}_lrr_baf.txt batch${BATCH}/batch${BATCH}.genocluster batch${BATCH}/AxiomGT1.summary.txt
## 6. Generate lrr_baf file for each individual:
# make dir for split files:
mkdir batch${BATCH}/batch${BATCH}_split/
# This generates the script to create and run kcolumn.pl in sets of 1000 samples
# kcolumn.pl doesn't really work for more 1000 samples at a time:
echo $BATCH | perl -ne 'chomp $_; print "wc -l batch$_/cel.list_batch$_.txt \| perl -ane '\''chomp \$_; print \"\#/bin/bash\\n\\n\"; for \(\$y=1,\$z=1000;\$y<=\$F\[0\];\$y+=1000,\$z+=1000\) \{if \(\$z \> \$F\[0\]\) \{print \"PennCNV-1.0.4/kcolumn.pl batch$_/batch$_\" . \"_lrr_baf.txt split 2 -start \$y -end \$F\[0\] -tab -head 3 -name -out batch$_/batch$_" . "_split/split$_;\\n\";\} else \{print \"PennCNV-1.0.4/kcolumn.pl batch$_/batch$_\" . \"_lrr_baf.txt split 2 -start \$y -end \$z -tab -head 3 -name -out batch$_/batch$_" . "_split/split$_;\\n\";\}\}'\'' > batch$_/kcolumn.sh;\n"; print "chmod +x batch$_/kcolumn.sh;\n";' | bash
./batch${BATCH}/kcolumn.sh
## 7. Generate signal file list:
ls batch${BATCH}/batch${BATCH}_split/split${BATCH}.* > batch${BATCH}/batch${BATCH}_split/batch${BATCH}_signalfile.lst;
## 8. Run CNV detection (needs a long-running queue, as it seems to take ~4 days):
PennCNV-1.0.4/detect_cnv.pl --test --hmmfile Axiom_trained.hmm --pfbfile Axiom.pfb --listfile batch${BATCH}/batch${BATCH}_split/batch${BATCH}_signalfile.lst --output batch${BATCH}/batch${BATCH}.rawcnv --confidence --logfile batch${BATCH}/batch${BATCH}.rawcnv.log --gcmodelfile Axiom.gcmodel
## 9. Run cnv cleaning to merge CNVs based on distance from one another:
PennCNV-1.0.4/clean_cnv.pl -signalfile Axiom.pfb -fraction 0.25 -bp combineseg batch${BATCH}/batch${BATCH}.rawcnv > batch${BATCH}/join_batch${BATCH}.rawcnv
## 10. Convert ALL split files in a batch to pseudo-bed format, sort, bgzip, and index; like:
tail -n +2 batch${BATCH}/batch${BATCH}_split/split${BATCH}.a550484 | perl -ane 'chomp $_; shift(@F); splice(@F, 1, 0, $F[1]); print join("\t", @F) . "\n";' | bedtools sort -i stdin > split1.a550484.bed
bgzip split1.a550484.bed
tabix -p bed split1.a550484.bed.gz
```
# 3. WES CNV Calling:
CNVs were then called from WES using:
* [XHMM](https://atgu.mgh.harvard.edu/xhmm/tutorial.shtml)
* [CLAMMS](https://github.com/rgcgithub/clamms)
* [CANOES](http://www.columbia.edu/~ys2411/canoes/)
I am not going to specifically document how each of these tools was used to call CNVs, as instructions for performing this analysis are available on their respective GitHub pages or, in the case of XHMM, in the excellent [XHMM methods paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4065038/). All tools were run with default settings.
For each caller, final calls are converted to a single concatenated file with the following columns for use with the downstream QC pipeline:
1. Chromosome, e.g. 1-22, X, Y (without 'chr')
2. start
3. end
4. Number of affected WES probes
5. Intentionally left blank
6. Copy type (either DEL or DUP)
7. sample ID
This file must then be sorted, zipped, and indexed with `bgzip` and `tabix`. An example of what this file looks like is available at `rawdata/cnvexamples/wes_input.tsv.gz`
**Note**: WES ID *MUST* be consistent between all of the files, or else the subsequent processing pipeline will not work properly. It does not matter what this ID is, other than containing only alphanumeric characters or underscores.
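As a concrete illustration, below is a minimal R sketch of writing this seven-column file (the `calls` object and file names are hypothetical; `bgzip` and `tabix` are then run in the shell):
```{r write WES call file, eval=FALSE}
library(data.table)

## Hypothetical caller output: one row per WES CNV call
calls <- data.table(chr    = c("1", "X"),
                    start  = c(100000L, 500000L),
                    end    = c(105000L, 560000L),
                    probes = c(4L, 12L),     # number of affected WES probes
                    blank  = "",             # column 5 is intentionally left blank
                    ct     = c("DEL", "DUP"),
                    id     = c("sample_1", "sample_2"))

## Sort, then write tab-delimited without a header
setorder(calls, chr, start)
fwrite(calls, "wes_input.tsv", sep = "\t", col.names = FALSE, quote = FALSE)
## Then in the shell: bgzip wes_input.tsv && tabix -p bed wes_input.tsv.gz
```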
# 4. Covariate Annotation:
This section attaches sample-level information via a custom Java pipeline (included in this repo at `src/CNVQC/` and as a compiled jar -- **runnable with Java 14** -- at `scripts/CNVAnnotator.jar`). There are two possible ways to annotate CNVs: one with exome data (INTERVAL) and one without (UKBB). To annotate CNVs without WES, simply omit the canoes/clamms/xhmm files. The file provided to `-samples` looks like:
```
a550484-4209466-021515-320_A10.CEL /calling/batch1/batch1_split/split1.a550484_R1 sampl_1 FEMALE 0.9963 1234567
```
Where the columns are tab-delimited and are:
1. The CEL file name
2. The split file without the `.bed.gz` suffix. This is due to how PennCNV sets file names in the *.rawcnv file, and allows us to match information in the QCsum file
3. The sample ID linking to WES data as described above. This _must_ be the same ID used in the WES call files. Can be set to 'null' if no WES data is available.
4. Sample gender in ALL CAPS
5. Quality score from the Affymetrix raw data (can be 'NaN' if unknown)
6. 7-digit eid of the participant from your UKBB fam file
The files provided to `-axiom` and `-wes` are tab-delimited BED files of Axiom and WES probes, respectively. For WES probes, I have merged adjacent probes within 500bp of each other to represent 'exons'. Example files are available at `rawdata/cnvexamples/axiom_baits.example.bed.gz` and `rawdata/cnvexamples/exome_baits.example.bed.gz`.
**Note**: It is up to the user to create this file.
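For illustration, a minimal sketch of building a `-samples` file in R is below (all values are hypothetical placeholders matching the example row and column descriptions above):
```{r build samples file, eval=FALSE}
library(data.table)

samples <- data.table(cel.file   = "a550484-4209466-021515-320_A10.CEL",
                      split.file = "/calling/batch1/batch1_split/split1.a550484_R1",
                      wes.id     = "sampl_1",  # 'null' if no WES data is available
                      gender     = "FEMALE",   # ALL CAPS
                      qc.score   = 0.9963,     # 'NaN' if unknown
                      eid        = 1234567L)   # 7-digit UKBB eid
fwrite(samples, "samples_file.txt", sep = "\t", col.names = FALSE, quote = FALSE)
```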
```{bash CNV file merging, eval=FALSE}
## Merge calls across batches:
cat batch*/join_batch*.rawcnv > merging/batchAll.rawcnv
## Merge QC values across batches:
cat batch*/batch*.qcsum > merging/batchAll.qcsum
# In the below example, -s is the "chunk" of the file to annotate, in 500-line chunks.
# For example, if you have 2,000 CNVs to annotate, you would run the java code below
# four times, with -s set to 1, 2, 3, 4 in each run. This allows easy parallelization
# on a compute cluster. If -s is not given, ALL lines of the file are run, which can
# take A _LONG_ time.
## This annotates each CNV with necessary information for the random forest model used for QC purposes:
CHUNK=1
java -Xmx2G -Xms2G -XX:+UseSerialGC -jar /nfs/ddd0/eg15/CNVAnnotator.jar Annotate \
-h hs37d5.fa \
-o merging/raw_files/batchAll.${CHUNK}.annotated \
-p merging/batchAll.rawcnv \
-tmp tmp/ \
-s ${CHUNK} \
-axiom axiom_baits.bed.gz \
-wes exome_baits.bed.gz \
-canoes CANOES.cnvs.bed.gz \
-clamms CLAMMS.cnvs.bed.gz \
-xhmm XHMM.sorted.bed.gz \
-samples samples_file.txt \
-qcsum merging/batchAll.qcsum
## Primary output will be a file named batchAll.<chunk_number>.annotated.txt,
## where chunk number will be the value provided to -s
## Cat together individual annotations:
# Have to grab the header since we remove it in the next step
grep '#' merging/raw_files/batchAll.1.annotated.txt > merging/header.txt
grep -h -v "#" merging/raw_files/batchAll.*.annotated.txt | cat merging/header.txt - > merging/batchAll.annotated.txt
```
# 5. Filtering CNVs
The above Java code for `Annotate` performs **no** filtering based on any model. The code below uses pre-merge information to build a Random Forest QC model trained using only variants found in individuals with WES data. This model can then be applied across any dataset to filter variants that do not "look" (based on various covariates) like a CNV that should have WES evidence.
## 5A. Load Annotated INTERVAL CNV calls
Calls added here are for all individuals in INTERVAL. An additional column (has.wes) indicates whether the individual has WES data; only these individuals are used to train our model. An example head of a complete batchAll.annotated.txt can be found at `rawdata/cnvexamples/example.annotated.txt`.
```{r Load merged INTERVAL CNV calls}
annotated.cnvs<-fread("rawdata/cnvexamples/batchAll.annotated.for_testing.txt")
setnames(annotated.cnvs,"#chr","chr")
## Store column names from original file to print final QCd file:
original.names <- names(annotated.cnvs)
## Define which callers I am using for the Random Forest
callers <- c("wes.xhmm.int","wes.clamms.int","wes.canoes.int")
int.string <- paste(callers,collapse = " + ")
annotated.cnvs[,wes.support.num:=(eval(parse(text=int.string)))]
## Define wes.support TRUTH as any call with >= 1 overlap (this is due to the binary nature of ROCR)
annotated.cnvs[,wes.support:=wes.support.num>0]
##Assign a factor to copy type rather than a number
assign.ct<-function(x) {
if (x < 2) {
return("DEL")
} else {
return("DUP")
}
}
annotated.cnvs[,ct:=mapply(assign.ct,Copy_Number)]
annotated.cnvs[,ct:=as.factor(ct)]
##Add dummy column for summing using standard data.table functions:
annotated.cnvs[,dummy:=1]
paste("Total number of CNVs loaded:", nrow(annotated.cnvs))
```
## 5B. Load Annotated UKBB CNV calls
This file is identical in format to the one loaded above for INTERVAL; see the example head file referenced there.
```{r Load merged UKBB CNV calls}
ukbb.annotated.cnvs <- fread("rawdata/cnvexamples/UKBB.with_bileve.with_batch18.annotated.txt")
setnames(ukbb.annotated.cnvs,"#chr","chr")
ukbb.annotated.cnvs[,ct:=mapply(assign.ct,Copy_Number)]
ukbb.annotated.cnvs[,ct:=as.factor(ct)]
##Add dummy column for summing using standard data.table functions:
ukbb.annotated.cnvs[,dummy:=1]
paste("Total number of CNVs loaded:", nrow(ukbb.annotated.cnvs))
```
### Building The "Good Array" Covariate
This code creates a metric for each probe to see how often it is deleted in good versus bad arrays. 'Bad' arrays are classified as meeting any of:
LRR_SD ≥ 0.35 | NumCNV ≥ 30 | |WF| ≥ 0.03
'Good' arrays pass all three of these thresholds, and the 'best' arrays (used below) are the good arrays that fall below the 75th percentile of the good arrays for each metric.
```{r defining cutoffs}
bad.arrays <- annotated.cnvs[LRR_SD >= 0.35 | NumCNV >= 30 | abs(WF) >= 0.03]
good.arrays <- annotated.cnvs[LRR_SD < 0.35 & NumCNV < 30 & abs(WF) < 0.03]
## Define cutoffs as 75%ile of each respective category
LRR_SD.quant <- quantile(good.arrays[,LRR_SD])
numCNV.quant <- quantile(good.arrays[,NumCNV])
wf.quant <- quantile(good.arrays[,abs(WF)])
## get best arrays based on these cutoffs. No need to run this portion again since I have already calculated the metric (see below)
best.arrays <- good.arrays[LRR_SD < LRR_SD.quant[[4]] & NumCNV < numCNV.quant[[4]] & abs(WF) < wf.quant[[4]]]
write.table(bad.arrays[ct=="DEL",c("chr","start","end")],"rawdata/cnvexamples/probe_badness/bad_arrays.del.bed",sep="\t",col.names = F,row.names = F)
write.table(best.arrays[ct=="DEL",c("chr","start","end")],"rawdata/cnvexamples/probe_badness/best_arrays.del.bed",sep="\t",col.names = F,row.names = F)
write.table(bad.arrays[ct=="DUP",c("chr","start","end")],"rawdata/cnvexamples/probe_badness/bad_arrays.dup.bed",sep="\t",col.names = F,row.names = F)
write.table(best.arrays[ct=="DUP",c("chr","start","end")],"rawdata/cnvexamples/probe_badness/best_arrays.dup.bed",sep="\t",col.names = F,row.names = F)
write.table(annotated.cnvs[ct=="DEL",c("chr","start","end","cel.file","ct")],"rawdata/cnvexamples/probe_badness/to_annotate.INTERVAL.del.bed",sep="\t",col.names = F,row.names = F,quote=F)
write.table(annotated.cnvs[ct=="DUP",c("chr","start","end","cel.file","ct")],"rawdata/cnvexamples/probe_badness/to_annotate.INTERVAL.dup.bed",sep="\t",col.names = F,row.names = F,quote=F)
write.table(ukbb.annotated.cnvs[ct=="DEL",c("chr","start","end","split.file","ct")],"rawdata/cnvexamples/probe_badness/to_annotate.UKBB.del.bed",sep="\t",col.names = F,row.names = F,quote=F)
write.table(ukbb.annotated.cnvs[ct=="DUP",c("chr","start","end","split.file","ct")],"rawdata/cnvexamples/probe_badness/to_annotate.UKBB.dup.bed",sep="\t",col.names = F,row.names = F,quote=F)
```
We then use bedtools (which must be on `$PATH`) to count the number of times each Axiom probe is independently deleted. bedtools emits a warning in some instances (`[W::sam_read1] Parse error at line 1`) that does not affect the result.
```{bash generate per-probe scores}
## Might need to change to ~/.bashrc depending on system (this is for macos)
source ~/.bash_profile
bedtools intersect -c -a rawdata/cnvexamples/axiom_probes.bed.gz -b rawdata/cnvexamples/probe_badness/best_arrays.del.bed > rawdata/cnvexamples/probe_badness/best_arrays.overlap.del.bed
bedtools intersect -c -a rawdata/cnvexamples/axiom_probes.bed.gz -b rawdata/cnvexamples/probe_badness/bad_arrays.del.bed > rawdata/cnvexamples/probe_badness/bad_arrays.overlap.del.bed
bedtools intersect -c -a rawdata/cnvexamples/axiom_probes.bed.gz -b rawdata/cnvexamples/probe_badness/best_arrays.dup.bed > rawdata/cnvexamples/probe_badness/best_arrays.overlap.dup.bed
bedtools intersect -c -a rawdata/cnvexamples/axiom_probes.bed.gz -b rawdata/cnvexamples/probe_badness/bad_arrays.dup.bed > rawdata/cnvexamples/probe_badness/bad_arrays.overlap.dup.bed
paste rawdata/cnvexamples/probe_badness/best_arrays.overlap.del.bed rawdata/cnvexamples/probe_badness/bad_arrays.overlap.del.bed | perl -ane 'chomp $_; my $ratio = log(($F[3] + 1) / ($F[7] + 1)); print "$F[0]\t$F[1]\t$F[2]\t$ratio\n";' > rawdata/cnvexamples/probe_badness/del_badness_mask.bed
paste rawdata/cnvexamples/probe_badness/best_arrays.overlap.dup.bed rawdata/cnvexamples/probe_badness/bad_arrays.overlap.dup.bed | perl -ane 'chomp $_; my $ratio = log(($F[3] + 1) / ($F[7] + 1)); print "$F[0]\t$F[1]\t$F[2]\t$ratio\n";' > rawdata/cnvexamples/probe_badness/dup_badness_mask.bed
## INTERVAL
bedtools sort -i rawdata/cnvexamples/probe_badness/to_annotate.INTERVAL.del.bed | bedtools map -a stdin -b rawdata/cnvexamples/probe_badness/del_badness_mask.bed -c 4 -o mean > rawdata/cnvexamples/probe_badness/annotated.INTERVAL.del.bed
bedtools sort -i rawdata/cnvexamples/probe_badness/to_annotate.INTERVAL.dup.bed | bedtools map -a stdin -b rawdata/cnvexamples/probe_badness/dup_badness_mask.bed -c 4 -o mean > rawdata/cnvexamples/probe_badness/annotated.INTERVAL.dup.bed
## UKBB
bedtools sort -i rawdata/cnvexamples/probe_badness/to_annotate.UKBB.del.bed | bedtools map -a stdin -b rawdata/cnvexamples/probe_badness/del_badness_mask.bed -c 4 -o mean > rawdata/cnvexamples/probe_badness/annotated.UKBB.del.bed
bedtools sort -i rawdata/cnvexamples/probe_badness/to_annotate.UKBB.dup.bed | bedtools map -a stdin -b rawdata/cnvexamples/probe_badness/dup_badness_mask.bed -c 4 -o mean > rawdata/cnvexamples/probe_badness/annotated.UKBB.dup.bed
cat rawdata/cnvexamples/probe_badness/annotated.INTERVAL.del.bed rawdata/cnvexamples/probe_badness/annotated.INTERVAL.dup.bed > rawdata/cnvexamples/probe_badness/annotated.INTERVAL.bed
cat rawdata/cnvexamples/probe_badness/annotated.UKBB.del.bed rawdata/cnvexamples/probe_badness/annotated.UKBB.dup.bed > rawdata/cnvexamples/probe_badness/annotated.UKBB.bed
```
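For reference, the per-probe score computed by the `paste | perl` step above is just an add-one-smoothed log-ratio of per-probe deletion counts in best versus bad arrays. A minimal R equivalent for the DEL mask (assuming, as above, that both count files come from `bedtools intersect -c` against the same probe file, so rows align) would be:
```{r badness score in R, eval=FALSE}
library(data.table)

## Columns V1-V4: chr, start, end, overlap count (no header)
best <- fread("rawdata/cnvexamples/probe_badness/best_arrays.overlap.del.bed")
bad  <- fread("rawdata/cnvexamples/probe_badness/bad_arrays.overlap.del.bed")

## Add-one smoothed log-ratio; higher = deleted more often in the best arrays
mask <- best[, .(chr = V1, start = V2, end = V3,
                 ratio = log((V4 + 1) / (bad[, V4] + 1)))]
fwrite(mask, "rawdata/cnvexamples/probe_badness/del_badness_mask.bed",
       sep = "\t", col.names = FALSE)
```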
And then just read back in and attach to INTERVAL/UKBB annotated.cnvs:
```{r add probe ratio, fig.height=4, fig.width=8}
## INTERVAL
bad.arrays.bed <- fread("rawdata/cnvexamples/probe_badness/annotated.INTERVAL.bed")
setnames(bad.arrays.bed,names(bad.arrays.bed),c("chr","start","end","cel.file","ct","badness.score"))
annotated.cnvs <- merge(annotated.cnvs,bad.arrays.bed,by=c("chr","start","end","cel.file","ct"),all=F)
## UKBB
bad.arrays.UKBB.bed <- fread("rawdata/cnvexamples/probe_badness/annotated.UKBB.bed")
setnames(bad.arrays.UKBB.bed,names(bad.arrays.UKBB.bed),c("chr","start","end","split.file","ct","badness.score"))
bad.arrays.UKBB.bed <- bad.arrays.UKBB.bed[badness.score!="."]
bad.arrays.UKBB.bed[,badness.score:=as.double(badness.score)]
ukbb.annotated.cnvs <- merge(ukbb.annotated.cnvs,bad.arrays.UKBB.bed,by=c("chr","start","end","split.file","ct"),all=F)
```
## 5C. Filtering based on already published methods
### Filtering based on Macé et al.
This code filters on a step-wise logistic regression using data provided by [Macé et al.](https://academic.oup.com/bioinformatics/article/32/21/3298/2415363). The covariates and methods below are taken from the R package that they developed, which is available online at [PennCNV Pipeline](http://goo.gl/T6yuFM) (a direct link to the tarball download). This pipeline is described in [Fang et al.](https://link.springer.com/protocol/10.1007/978-1-4939-8666-8_1), a protocol on how to perform PennCNV-based array calls.
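Concretely, the quality score computed below for each CNV is a logistic function of the z-scaled covariates weighted by the published coefficients:
$$QS = \frac{1}{1 + e^{-\left(\beta_0 + \sum_i \beta_i \tilde{x}_i\right)}}$$
where $\beta_0$ is the intercept, the $\beta_i$ are the 'Estimate' values defined below, and $\tilde{x}_i$ is covariate $i$ scaled across all CNVs of that copy type. Deletion scores are then negated so that DEL and DUP calls occupy opposite ends of the score range.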
```{r CNV Filtering in Macé, fig.height=4, fig.width=12}
## Coefficients are from R_script_calculate_quality_score.R in the Penn CNV pipeline
build.coefficients<-function(copy.type) {
if (copy.type=="DUP") {
QS_coefficient <- cbind(c(-0.81529185, 3.90248885, -0.90184846, 0.34376765, -0.06351849, -0.01497046, -0.30665878, -0.10354156, -0.44829901, 0.06039380, 0.03645913), rep(0, 11))
rownames(QS_coefficient) <- c('(Intercept)', 'Max_Log_BF', 'No_Probes', 'LRR_SD', 'BAF_drift', 'WF', 'LRR_mean', 'NumCNV', 'BAF_SD', 'Length_bp', 'BAF_mean')
colnames(QS_coefficient) <- c('Estimate', 'Pr(>|z|)')
} else {
QS_coefficient <- cbind(c(-1.75648377, 4.64898485, -2.50150285, -0.47552224, 0.28876320, 0.10205302, 0.14363692, 0.02959571, 0.00000000, -0.00000000), rep(0, 10))
rownames(QS_coefficient) <- c('(Intercept)', 'Max_Log_BF', 'No_Probes', 'NumCNV', 'LRR_SD', 'Length_bp', 'LRR_mean', 'WF', 'BAF_drift', 'BAF_SD')
colnames(QS_coefficient) <- c('Estimate', 'Pr(>|z|)')
}
return(QS_coefficient)
}
## Build coefficients separately for DUPs and DELs.
## This just makes a table with the coefficients for DEL/DUP
QS_coefficient_duplication<-build.coefficients("DUP")
QS_coefficient_deletion<-build.coefficients("DEL")
## Function takes my CNV data and the model and generates a quality score (QS) for each CNV
calculate_QS <- function(cnv_data, model_data, copy.type) {
cnv_data<-cnv_data[ct==copy.type]
QS_val <- data.table(QS=rep(model_data['(Intercept)', 'Estimate'], nrow(cnv_data)),split.file=cnv_data[,split.file],location=cnv_data[,location],ct=rep(copy.type,nrow(cnv_data)))
## This for loop works by taking each covariate, scaling it (scale) across all CNVs and then multiplying it by the Estimate for that covariate
for(para in 2:nrow(model_data)){
para_name <- rownames(model_data)[para]
para.est <- model_data[para_name, 'Estimate']
para.calcd <- scale(cnv_data[,..para_name])
QS_val[,QS:=QS + (para.est * para.calcd)]
}
QS_val[,QS:= (1 / (1 + exp(-QS)))]
if (copy.type == "DEL") {
QS_val[,QS:=QS*-1] ##Deletion scores are scaled negatively
}
return(QS_val)
}
## First INTERVAL
QS_deletion <- data.table(calculate_QS(cnv_data = annotated.cnvs, model_data = QS_coefficient_deletion, copy.type="DEL"))
QS_duplication <- data.table(calculate_QS(cnv_data = annotated.cnvs, model_data = QS_coefficient_duplication, copy.type="DUP"))
## Take QS scores and make one table and then merge them with the original annotated.cnvs table
QS.table<-data.table(QS.mace=c(QS_deletion[,QS],QS_duplication[,QS]),split.file=c(QS_deletion[,split.file],QS_duplication[,split.file]),location=c(QS_deletion[,location],QS_duplication[,location]),ct=c(QS_deletion[,ct],QS_duplication[,ct]))
QS_deletion<-NULL
QS_duplication<-NULL
annotated.cnvs<-merge(annotated.cnvs,QS.table,by=c("split.file","location","ct"),all=T)
thresh<-0.8 ## Set Threshold for filtering CNVs
totals<-merge(annotated.cnvs[abs(QS.mace)>=thresh,sum(dummy),by=c("ct")],annotated.cnvs[,sum(dummy),by=c("ct")],by="ct")
totals[,pos.x:=if_else(ct == "DEL",-0.6,0.6)]
totals[,pos.y:=2500]
totals[,pct:=(V1.x/V1.y)*100]
ggplot(annotated.cnvs,aes(QS.mace,fill=ct,colour=ct)) + scale_fill_manual(values=c(del.fill,dup.fill)) + scale_colour_manual(values=c(del.line,dup.line)) + geom_histogram(binwidth = 0.01) + geom_vline(aes(xintercept=0.8),colour=dup.line,size=1.5) + geom_vline(aes(xintercept=-0.8),colour=del.line,size=1.5) + geom_text(inherit.aes=F,data=totals,aes(x=pos.x,y=pos.y,label=paste("Total Pass ",ct,":\n",V1.x,"\n(",sprintf("%.1f",pct),"%)",sep="")),size=5) + xlab("Macé-QS") + ylab("Total CNV sites per bin") + theme
rm(QS.table)
## Now UKBB
QS_deletion <- data.table(calculate_QS(cnv_data = ukbb.annotated.cnvs, model_data = QS_coefficient_deletion, copy.type="DEL"))
QS_duplication <- data.table(calculate_QS(cnv_data = ukbb.annotated.cnvs, model_data = QS_coefficient_duplication, copy.type="DUP"))
## Take QS scores and make one table and then merge them with the original annotated.cnvs table
QS.table<-data.table(QS.mace=c(QS_deletion[,QS],QS_duplication[,QS]),split.file=c(QS_deletion[,split.file],QS_duplication[,split.file]),location=c(QS_deletion[,location],QS_duplication[,location]),ct=c(QS_deletion[,ct],QS_duplication[,ct]))
QS_deletion<-NULL
QS_duplication<-NULL
ukbb.annotated.cnvs<-merge(ukbb.annotated.cnvs,QS.table,by=c("split.file","location","ct"),all=T)
thresh<-0.8 ## Set Threshold for filtering CNVs
totals<-merge(ukbb.annotated.cnvs[abs(QS.mace)>=thresh,sum(dummy),by=c("ct")],ukbb.annotated.cnvs[,sum(dummy),by=c("ct")],by="ct")
totals[,pos.x:=if_else(ct == "DEL",-0.6,0.6)]
totals[,pos.y:=25000]
totals[,pct:=(V1.x/V1.y)*100]
ggplot(ukbb.annotated.cnvs,aes(QS.mace,fill=ct,colour=ct)) + scale_fill_manual(values=c(del.fill,dup.fill)) + scale_colour_manual(values=c(del.line,dup.line)) + geom_histogram(binwidth = 0.01) + geom_vline(aes(xintercept=0.8),colour=dup.line,size=1.5) + geom_vline(aes(xintercept=-0.8),colour=del.line,size=1.5) + geom_text(inherit.aes=F,data=totals,aes(x=pos.x,y=pos.y,label=paste("Total Pass ",ct,":\n",V1.x,"\n(",sprintf("%.1f",pct),"%)",sep="")),size=5) + xlab("Macé-QS") + ylab("Total CNV sites per bin") + theme
rm(QS.table)
```
### Filtering using hard-cutoff methods
Using the filters presented as 'canonical' in Macé et al., and adding them, for comparison's sake, as filters on annotated.cnvs:
**Note**: I have dropped filter A, as the BAF_SD filter of ≤0.05 filters everything (see plot below)
I have also added George Kirov's default filters (taken from [this](https://jmg.bmj.com/content/early/2018/10/20/jmedgenet-2018-105477.long) paper) as filter.gk.
```{r additional filters}
## TRUE means true call, false means FALSE call
## Filter B
annotated.cnvs[,filter.b:=(LRR_SD < 0.3 & BAF_drift < 0.01 & abs(WF) < 0.05 & NumCNV < 100)]
ukbb.annotated.cnvs[,filter.b:=(LRR_SD < 0.3 & BAF_drift < 0.01 & abs(WF) < 0.05 & NumCNV < 100)]
print("Filter B:")
table(annotated.cnvs[,filter.b])
## Filter C
annotated.cnvs[,filter.c:=(LRR_SD < 0.24 & abs(WF) < 0.05 & callrate > 0.99 & No_Probes >= 10)]
ukbb.annotated.cnvs[,filter.c:=(LRR_SD < 0.24 & abs(WF) < 0.05 & callrate > 0.99 & No_Probes >= 10)]
print("Filter C:")
table(annotated.cnvs[,filter.c])
## Filter D
annotated.cnvs[,filter.d:=(LRR_SD < 0.3 & abs(WF) < 0.05 & NumCNV < 100 & callrate > 0.98)]
ukbb.annotated.cnvs[,filter.d:=(LRR_SD < 0.3 & abs(WF) < 0.05 & NumCNV < 100 & callrate > 0.98)]
print("Filter D:")
table(annotated.cnvs[,filter.d])
## Filter E
annotated.cnvs[,filter.e:=(LRR_SD <= 0.27 & BAF_SD <= 0.13 & callrate >= 0.98 & No_Probes >= 5 & Length_bp >= 1000)]
ukbb.annotated.cnvs[,filter.e:=(LRR_SD <= 0.27 & BAF_SD <= 0.13 & callrate >= 0.98 & No_Probes >= 5 & Length_bp >= 1000)]
print("Filter E:")
table(annotated.cnvs[,filter.e])
## Filter GK
annotated.cnvs[,filter.gk:=(LRR_SD <= 0.35 & abs(WF) <= 0.03 & NumCNV <= 30 & callrate >= 0.96 & No_Probes >= 10 & density <= 20000)]
ukbb.annotated.cnvs[,filter.gk:=(LRR_SD <= 0.35 & abs(WF) <= 0.03 & NumCNV <= 30 & callrate >= 0.96 & No_Probes >= 10 & density <= 20000)]
print("George Kirov default filter:")
table(annotated.cnvs[,filter.gk])
## Macé suggested filters
annotated.cnvs[,filter.mace.8:=abs(QS.mace) > 0.8]
annotated.cnvs[,filter.mace.5:=abs(QS.mace) > 0.5]
ukbb.annotated.cnvs[,filter.mace.8:=abs(QS.mace) > 0.8]
ukbb.annotated.cnvs[,filter.mace.5:=abs(QS.mace) > 0.5]
print("Macé suggested filters")
table(annotated.cnvs[,filter.mace.8])
table(annotated.cnvs[,filter.mace.5])
## Add a "null" filter -- e.g. keep all sites to see what happens:
annotated.cnvs[,filter.null:=TRUE]
ukbb.annotated.cnvs[,filter.null:=TRUE]
```
## 5D. Novel Random Forest Model
We use the R package 'randomForest' to develop an ML model for declaring whether a CNV "looks like" a WES CNV.
Covariates included in my model are listed below in the Formula section. I used the tutorial [here](https://trevorstephens.com/kaggle-titanic-tutorial/r-part-5-random-forests/) to help set up randomForest.
All covariates are included in the output from the Java `Annotate` tool. The model currently includes the following covariates ("-" indicates covariates also included in Macé et al.):
* WF -
* BAF_SD -
* BAF_mean -
* No_Probes -
* Length_bp -
* BAF_drift -
* LRR_mean
* density
* LRR_SD
* site.LRR_mean
* site.LRR_SD
* site.BAF_mean
* site.BAF_SD
* badness.score (my bad name for the score calculated in the above section on "Good Array" Covariate).
I have excluded the Max_Log_BF covariate from the Macé model.
### Formula for ML Models
The below code defines the formula used by *all* of my machine learning models as the input to 'randomForest'. Any modification here will change all downstream code:
```{r ML Formula}
covariates<-c("LRR_SD","BAF_drift","WF","LRR_mean","NumCNV","BAF_SD","BAF_mean","site.BAF_mean","site.BAF_SD","site.LRR_mean","site.LRR_SD","density","No_Probes","Length_bp","badness.score")
cov.string <- paste(covariates, collapse=" + ")
form <- as.formula(paste("wes.support.num",cov.string,sep=" ~ "))
```
### Testing Random Forest with cross validation
Below builds the random forest model by splitting CNVs with > 2 WES probes into training and test sets. The training set is used to generate a model, and then this model is used to predict on the test set. As part of this validation, I do two randomization experiments:
1. Randomize the column "wes.support" right after I select CNVs with > 2 WES probes
2. Randomize the column "wes.support" after I build the model
This code makes extensive use of the package ROCR combined with cross-validating and generating confidence intervals for the AUC with [cvAUC](https://cran.r-project.org/web/packages/cvAUC/cvAUC.pdf).
**Note**: cvAUC simply calls ROCR to generate ROC curves, which I extract to make my own plots. I extract the X & Y variables from the ROCR performance objects generated by cvAUC, but under the hood it is all ROCR. I did write a function (get.avg) based on the "avg" logic in .plot.performance, which is supplied by ROCR. A good tutorial for basic ROCR package use is [here](https://www.r-bloggers.com/a-small-introduction-to-the-rocr-package/).
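For orientation, here is a toy sketch of the ROCR/cvAUC calls used below (with simulated scores and labels, not pipeline data):
```{r ROCR usage sketch, eval=FALSE}
library(ROCR)
library(cvAUC)

set.seed(42)
scores <- runif(200)                       # simulated predictor scores
labels <- rbinom(200, 1, scores)           # simulated truth, correlated with scores

pred <- prediction(scores, labels)         # ROCR prediction object
perf <- performance(pred, "sens", "spec")  # sensitivity vs. specificity curve
roc  <- data.frame(spec = unlist(perf@x.values), sens = unlist(perf@y.values))

ci <- ci.cvAUC(scores, labels, confidence = 0.95)
ci$cvAUC                                   # AUC point estimate
ci$ci                                      # 95% confidence interval
```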
```{r leave one out, fig.height=7, fig.width=10, warning=FALSE}
## This function just runs the various models through RandomForest and Logistic Regression
run.models <- function(test,training) {
output.forest <- randomForest(form, data = training, importance = T, ntree = 500)
forest.result <- data.table(predict(output.forest,test))
test[,prob.wes:=forest.result[,V1]]
return(test)
}
## This function runs each model and returns the results as a list. This just enables cross-validation
create.cross.val <- function(copy.type) {
basic.result <- train.basic(copy.type)
random.result <- train.random(copy.type)
## Randomize wes.support after the model has been decided
basic.result[,wes.support.random:=resample(basic.result[,wes.support])]
## Add old school cutoffs:
predictor.os <- data.table(training.type=c("filter.b","filter.c","filter.d","filter.e","filter.gk","filter.mace.8","filter.mace.5","filter.null"))
return(list(list(basic.result[,prob.wes]),list(basic.result[,wes.support]),list(basic.result[,wes.support.random]),
list(random.result[,prob.wes]),list(random.result[,wes.support]),
spec("filter.b",basic.result),sens("filter.b",basic.result),
spec("filter.c",basic.result),sens("filter.c",basic.result),
spec("filter.d",basic.result),sens("filter.d",basic.result),
spec("filter.e",basic.result),sens("filter.e",basic.result),
spec("filter.gk",basic.result),sens("filter.gk",basic.result),
spec("filter.mace.8",basic.result),sens("filter.mace.8",basic.result),
spec("filter.mace.5",basic.result),sens("filter.mace.5",basic.result),
spec("filter.null",basic.result),sens("filter.null",basic.result)
))
}
## Gets an actual ROC curve that can be plotted as well as confidence intervals for display
build.plottable <- function(pred,labels) {
avg.roc<-get.avg(performance(prediction(pred,labels),"sens","spec"))
avg.roc.cis <- ci.cvAUC(pred,labels,confidence=0.95)
return(list(avg.roc,avg.roc.cis$cvAUC,avg.roc.cis$ci[1],avg.roc.cis$ci[2]))
}
## This function averages ROC curves across cross-validation runs (adapted from the "avg" code in ROCR's internal .plot.performance)
get.avg<-function(perf) {
if (length(perf@alpha.values) != 0)
perf@alpha.values <- lapply(perf@alpha.values, function(x) {
isfin <- is.finite(x)
x[is.infinite(x)] <- (max(x[isfin]) + mean(abs(x[isfin][-1] -
x[isfin][-length(x[isfin])])))
x
})
for (i in 1:length(perf@x.values)) {
ind.bool <- (is.finite(perf@x.values[[i]]) & is.finite(perf@y.values[[i]]))
if (length(perf@alpha.values) > 0)
perf@alpha.values[[i]] <- perf@alpha.values[[i]][ind.bool]
perf@x.values[[i]] <- perf@x.values[[i]][ind.bool]
perf@y.values[[i]] <- perf@y.values[[i]][ind.bool]
}
perf.sampled <- perf
alpha.values <- rev(seq(min(unlist(perf@alpha.values)), max(unlist(perf@alpha.values)),
length = max(sapply(perf@alpha.values, length))))
for (i in 1:length(perf.sampled@y.values)) {
perf.sampled@x.values[[i]] <- approxfun(perf@alpha.values[[i]],
perf@x.values[[i]], rule = 2, ties = "mean")(alpha.values)
perf.sampled@y.values[[i]] <- approxfun(perf@alpha.values[[i]],
perf@y.values[[i]], rule = 2, ties = "mean")(alpha.values)
}
perf.avg <- perf.sampled
perf.avg@x.values <- list(rowMeans(data.frame(perf.avg@x.values)))
perf.avg@y.values <- list(rowMeans(data.frame(perf.avg@y.values)))
perf.avg@alpha.values <- list(alpha.values)
return(perf.avg)
}
## Do training selection based on all possible CNVs (so overlapping sites can be considered)
train.basic <- function(copy.type) {
annotated.cnvs.ct<-annotated.cnvs[ct==copy.type & !is.na(WESID) & wes.probe.count>2 & Copy_Number > 0 & wes.support.num %in% 0:3]
annotated.cnvs.ct[,row.num:=.I]
training.set.size<-as.numeric(nrow(annotated.cnvs.ct) / 2)
training<-sample_n(annotated.cnvs.ct,training.set.size)
training.rows<-training[,row.num]
test<-annotated.cnvs.ct[!row.num %in% training.rows][Max_Log_BF!=0]
test<-run.models(test,training)
return(test)
}
## Randomized wes.support at beginning
train.random <- function(copy.type) {
annotated.cnvs.ct<-annotated.cnvs[ct==copy.type & !is.na(WESID) & wes.probe.count>2 & Copy_Number > 0 & wes.support.num %in% 0:3]
annotated.cnvs.ct[,row.num:=.I]
annotated.cnvs.ct[,wes.support:=sample(annotated.cnvs.ct[,wes.support])]
training.set.size<-as.numeric(nrow(annotated.cnvs.ct) / 2)
training<-sample_n(annotated.cnvs.ct,training.set.size)
training.rows<-training[,row.num]
test<-annotated.cnvs.ct[!row.num %in% training.rows][Max_Log_BF!=0]
test<-run.models(test,training)
return(test)
}
## Helper functions for calculating Sens/Spec of old-school filters
sens<-function(x,to.use) {
TPs <- nrow(to.use[get(x) == T][wes.support == T])
FNs <- nrow(to.use[get(x) == F][wes.support == T])
return(TPs / (TPs + FNs))
}
spec<-function(x,to.use) {
FPs <- nrow(to.use[get(x) == T][wes.support == F])
TNs <- nrow(to.use[get(x) == F][wes.support == F])
return(TNs / (TNs + FPs))
}
## this function just makes it so the code is a bit cleaner.
make.ROC <- function(copy.type) {
## Cross validate the models:
cross.val <- data.table(run=c(1:10))
cross.val[,c("basic.forest","wes.support.basic","randomized","random.forest","wes.support.random","filter.b.spec","filter.b.sens","filter.c.spec","filter.c.sens","filter.d.spec","filter.d.sens","filter.e.spec","filter.e.sens","filter.gk.spec","filter.gk.sens","filter.mace.8.spec","filter.mace.8.sens","filter.mace.5.spec","filter.mace.5.sens","filter.null.spec","filter.null.sens"):=create.cross.val(copy.type),by=run]
basic.forest.plot <- build.plottable(cross.val[,basic.forest],cross.val[,wes.support.basic])
random.i.forest.plot <- build.plottable(cross.val[,random.forest],cross.val[,wes.support.random])
random.p.forest.plot <- build.plottable(cross.val[,basic.forest],cross.val[,randomized])
if (copy.type == "DEL") {
mult <- -1
} else {
mult <- 1
}
plot.info <- data.table(AUC=c(basic.forest.plot[[2]],random.i.forest.plot[[2]],random.p.forest.plot[[2]]),
ci.lower=c(basic.forest.plot[[3]],random.i.forest.plot[[3]],random.p.forest.plot[[3]]),
ci.upper=c(basic.forest.plot[[4]],random.i.forest.plot[[4]],random.p.forest.plot[[4]]),
model=c("RandomForest","RandomForest","RandomForest"),
training.type=c("basic","random.initial","random.post"),
colour=c("#619CFF","#619CFF","#619CFF"))
perf.dt <- data.table(x = c(unlist(basic.forest.plot[[1]]@x.values),
unlist(random.i.forest.plot[[1]]@x.values),
unlist(random.p.forest.plot[[1]]@x.values)),
y = c(unlist(basic.forest.plot[[1]]@y.values),
unlist(random.i.forest.plot[[1]]@y.values),
unlist(random.p.forest.plot[[1]]@y.values)),
a = c(unlist(basic.forest.plot[[1]]@alpha.values)*mult,
unlist(random.i.forest.plot[[1]]@alpha.values)*mult,
unlist(random.p.forest.plot[[1]]@alpha.values)*mult),
training.type=c(rep("basic",length(unlist(basic.forest.plot[[1]]@x.values))),
rep("random.initial",length(unlist(random.i.forest.plot[[1]]@x.values))),
rep("random.post",length(unlist(random.p.forest.plot[[1]]@x.values)))),
model=c(rep("RandomForest",length(unlist(basic.forest.plot[[1]]@x.values))),
rep("RandomForest",length(unlist(random.i.forest.plot[[1]]@x.values))),
rep("RandomForest",length(unlist(random.p.forest.plot[[1]]@x.values)))))
perf.dt <- merge(perf.dt,plot.info,by=c("training.type","model"))
plot.info[,formated.AUC:=paste(sprintf("%0.3f",AUC),"[",sprintf("%0.3f",ci.lower),"-",sprintf("%0.3f",ci.upper),"]",sep="")]
## Add filters based on hard cutoffs:
predictor.os <- data.table(training.type=c("filter.b","filter.c","filter.d","filter.e","filter.gk","filter.mace.8","filter.mace.5","filter.null"),
sens.mean=c(mean(cross.val[,filter.b.sens]),mean(cross.val[,filter.c.sens]),mean(cross.val[,filter.d.sens]),mean(cross.val[,filter.e.sens]),mean(cross.val[,filter.gk.sens]),mean(cross.val[,filter.mace.8.sens]),mean(cross.val[,filter.mace.5.sens]),mean(cross.val[,filter.null.sens])),
spec.mean=c(mean(cross.val[,filter.b.spec]),mean(cross.val[,filter.c.spec]),mean(cross.val[,filter.d.spec]),mean(cross.val[,filter.e.spec]),mean(cross.val[,filter.gk.spec]),mean(cross.val[,filter.mace.8.spec]),mean(cross.val[,filter.mace.5.spec]),mean(cross.val[,filter.null.spec])),
sens.sd=c(sd(cross.val[,filter.b.sens]),sd(cross.val[,filter.c.sens]),sd(cross.val[,filter.d.sens]),sd(cross.val[,filter.e.sens]),sd(cross.val[,filter.gk.sens]),sd(cross.val[,filter.mace.8.sens]),sd(cross.val[,filter.mace.5.sens]),sd(cross.val[,filter.null.sens])),
spec.sd=c(sd(cross.val[,filter.b.spec]),sd(cross.val[,filter.c.spec]),sd(cross.val[,filter.d.spec]),sd(cross.val[,filter.e.spec]),sd(cross.val[,filter.gk.spec]),sd(cross.val[,filter.mace.8.spec]),sd(cross.val[,filter.mace.5.spec]),sd(cross.val[,filter.null.spec])))
predictor.os[,spec.upper:=spec.mean+spec.sd]
predictor.os[,spec.lower:=spec.mean-spec.sd]
predictor.os[,sens.upper:=sens.mean+sens.sd]
predictor.os[,sens.lower:=sens.mean-sens.sd]
if (copy.type == "DEL") {
predictor.os[,a:=-0.5]
} else {
predictor.os[,a:=0.5]
}
predictor.os[,model:="OS"]
predictor.os[,colour:="#000000"]
if (copy.type == "DEL") {
return(list(perf.dt,predictor.os,plot.info,del.line))
} else {
return(list(perf.dt,predictor.os,plot.info,dup.line))
}
}
DEL.plot <- make.ROC("DEL")
DUP.plot <- make.ROC("DUP")
## Add actual paper names...
DEL.plot[[2]][,training.type.papers:=factor(training.type,levels=c("filter.null","filter.b","filter.c","filter.d","filter.e","filter.gk","filter.mace.5","filter.mace.8"),labels=c("None","Wang et al.","Chettier et al.","Glessner et al.","Pinto et al.","Crawford et al.","Macé et al. (QS = 0.5)","Macé et al. (QS = 0.8)"))]
DUP.plot[[2]][,training.type.papers:=factor(training.type,levels=c("filter.null","filter.b","filter.c","filter.d","filter.e","filter.gk","filter.mace.5","filter.mace.8"),labels=c("None","Wang et al.","Chettier et al.","Glessner et al.","Pinto et al.","Crawford et al.","Macé et al. (QS = 0.5)","Macé et al. (QS = 0.8)"))]
## Everything from here down is just to plot
plot.data <- function(data.curve,data.point,col.to.use) {
plot <- ggplot() +
geom_line(data=data.curve,aes(x,y,group=interaction(model,training.type),linetype=training.type),size=1.5,colour=col.to.use) +
scale_colour_identity(labels=unique(data.curve[,model])) +
geom_point(data=data.point,aes(spec.mean,sens.mean),colour="black") +
geom_errorbar(data=data.point,aes(x=spec.mean,ymin=sens.lower,ymax=sens.upper),colour="black") +
geom_errorbarh(data=data.point,aes(y=sens.mean,xmin=spec.lower,xmax=spec.upper),colour="black") +
geom_text(data=data.point,aes(spec.mean-0.03,sens.mean,label=training.type.papers,colour=colour),size=5,hjust=0,position=position_dodge(width=0.1)) +
scale_x_reverse(name="Specificity",lim=c(1,-0.1),breaks=c(1,.75,.5,.25,0)) +
scale_y_continuous(name="Sensitivity",lim=c(0,1)) +
geom_abline(intercept = 1) +
theme.legend
return(plot)
}
for (a in list(DEL.plot,DUP.plot)) {
i.curve <- a[[1]]
i.point <- a[[2]]
colour <- a[[4]]
## Plot RF basic & overlap models
print(plot.data(i.curve[model=="RandomForest"][training.type=="basic" | training.type == "random.initial" | is.na(AUC)],i.point,colour))
}
rm(a)
```
### Full RandomForest to Filter UK Biobank CNVs
Now that we know that our model is valid, we take _all_ of the available WES overlap data, train a random forest, and then predict on the non-exome subset for UK Biobank.
We generate our 'wes.support.score' for UK Biobank CNV calls by using _all_ individuals with WES from INTERVAL to train our RandomForest model.
```{r UKBB filtering, fig.height=6, fig.width=6, warning=FALSE}
make.my.own.model<-function(copy.type) {
## Pull correct CN CNVs from samples that have WES data with potential probe number > 2
training<-annotated.cnvs[ct==copy.type & !is.na(WESID) & wes.probe.count>2 & Copy_Number > 0 & wes.support.num %in% 0:3]
output.forest <- randomForest(form, data = training, importance = T)
forest.result <- data.table(predict(output.forest,ukbb.annotated.cnvs[ct==copy.type ]))
forest.result[,split.file:=ukbb.annotated.cnvs[ct==copy.type,split.file]]
forest.result[,location:=ukbb.annotated.cnvs[ct==copy.type,location]]
forest.result[,ct:=copy.type]
setnames(forest.result,"V1","wes.support.score")
if (copy.type == "DEL") {
forest.result[,wes.support.score:=wes.support.score * -1]
}
return(list(forest.result,output.forest))
}
Forest_deletion<-make.my.own.model("DEL")
Forest_duplication<-make.my.own.model("DUP")
forest.table <- bind_rows(Forest_deletion[[1]],Forest_duplication[[1]])
Forest_deletion<-Forest_deletion[[2]]
Forest_duplication<-Forest_duplication[[2]]
ukbb.annotated.cnvs<-merge(ukbb.annotated.cnvs,forest.table,by=c("split.file","location","ct"),all=T)
## Homozygous DEL CNVs are almost uniformly good.
ukbb.annotated.cnvs[,wes.support.score:=mapply(ifelse,Copy_Number==0,-3,wes.support.score)]
##Print forest results:
print(Forest_deletion)
print(Forest_duplication)
## Plots covariate weights:
del.imp<-data.table(Forest_deletion$importance)
del.imp[,covar:=rownames(Forest_deletion$importance)]
del.imp.sd <- data.table(Forest_deletion$importanceSD)
del.imp.sd <- del.imp.sd[,covar:=names(Forest_deletion$importanceSD)]
setnames(del.imp.sd,"V1","std.err")
del.imp <- merge(del.imp,del.imp.sd,by="covar")
rm(del.imp.sd)
del.imp[,IncNodePurity:=NULL]
del.imp[,IncMSE.final.del:=`%IncMSE` / std.err]
del.imp[,`%IncMSE`:=NULL]
del.imp[,std.err:=NULL]
dup.imp<-data.table(Forest_duplication$importance)
dup.imp[,covar:=rownames(Forest_duplication$importance)]
dup.imp.sd <- data.table(Forest_duplication$importanceSD)
dup.imp.sd <- dup.imp.sd[,covar:=names(Forest_duplication$importanceSD)]
setnames(dup.imp.sd,"V1","std.err")
dup.imp <- merge(dup.imp,dup.imp.sd,by="covar")
rm(dup.imp.sd)
dup.imp[,IncNodePurity:=NULL]
dup.imp[,IncMSE.final.dup:=`%IncMSE` / std.err]
dup.imp[,`%IncMSE`:=NULL]
dup.imp[,std.err:=NULL]
importance <- merge(dup.imp,del.imp,by="covar")
importance <- data.table(gather(importance,IncMSE.final.del,IncMSE.final.dup,key="Copy Type",value="%IncMSE"))
setkey(importance,"%IncMSE")
importance[,covar:=factor(covar,levels=importance[`Copy Type`=="IncMSE.final.del",covar])]
## Actual Plots - Sorted by importance for Deletions.
ggplot(importance,aes(covar,`%IncMSE`,group=`Copy Type`,colour=`Copy Type`)) + geom_point(position=position_dodge(width=1),size=3) + scale_colour_manual(values = c(del.line,dup.line)) + scale_x_discrete(labels=c("Num CNVs","BAF Mean","BAF SD","BAF Drift","LRR Mean", "site - BAF SD","site - LRR SD","LRR SD","WF","site - BAF Mean","CNV Length","No. Probes","Probe Density","Probe Error","site - LRR Mean"),name="") + coord_flip() + theme.legend
```
## 5E. Cutoff application and final file creation
### Hard Cutoff
First, we define hard filters based on set parameters to remove really bad individuals that are not dropped by our RandomForest. We hard filter on:
1. > 19 CNVs
2. $|WF|$ > 0.03
This also creates a reference file of filtered individuals so that they can be provided to `PhenotypeTesting.Rmd`. This file contains split_file IDs, which the user needs to translate into UK Biobank eids; that functionality is not encoded here.
```{r Hard Filtering on Num WES}
## Collect Counts Per Individual and some other stats and filter based on the Poisson distribution of total CNVs
get.filtered.indvs <- function(data) {
count.per.indv <- data[,sum(dummy),by=c("split.file")]
count.per.indv <- merge(count.per.indv,unique(data[,c("split.file","WF")]),by="split.file", all.y=F, all.x = T)
setnames(count.per.indv,"V1","NumCNV")
pois<-data.table(table(rpois(nrow(count.per.indv),median(count.per.indv[,NumCNV]))))
NumCNV.filt <- 19
WF.filt <- 0.03
# Plot of the above
print(ggplot(count.per.indv,aes(NumCNV)) + geom_histogram(binwidth=1) + geom_vline(aes(xintercept=NumCNV.filt)) + xlim(0,50) + geom_point(data=pois,aes(as.numeric(V1),as.numeric(N)),colour="blue") + theme)
filtered.indvs <- count.per.indv[NumCNV > NumCNV.filt | abs(WF) > WF.filt,split.file]
return(filtered.indvs)
}
ukbb.filtered.indvs <- get.filtered.indvs(ukbb.annotated.cnvs)
write.table(data.table(split.file=ukbb.filtered.indvs),"rawdata/cnvexamples/ukbb.filtered_cnv_indvs.txt",sep="\t",quote=F,row.names=F)
```
### ML and Binary Filter Cutoffs
Currently using 95% specificity in the leave-one-out model performed in section 5D to determine appropriate cutoffs.
```{r Calculate Cutoff, fig.height=6, fig.width=8}
calculate.cutoff <- function(cutoff, data) {
## Overall Sens/Specificity from leave one out
combined.DUP.DEL <- bind_rows(DUP.plot[[1]][training.type == "basic"],DEL.plot[[1]][training.type == "basic"])
combined.DUP.DEL <- combined.DUP.DEL[,c("a","x","y")]
setnames(combined.DUP.DEL,c("a","x","y"),c("score","spec","sens"))
combined.DUP.DEL[,metric:="rf.wes.binary"]
# combined.DUP.DEL <- data.table(score=c(combined.DUP.DEL[training.type=="basic" & model == "RandomForest",a],
# combined.DUP.DEL[model=="OS",a]),
# metric=c(rep("rf.wes.binary",length(combined.DUP.DEL[training.type=="basic" & model == "RandomForest",a])),
# combined.DUP.DEL[model=="OS",training.type]),
# spec=c(combined.DUP.DEL[training.type=="basic" & model == "RandomForest",x],
# combined.DUP.DEL[model=="OS",x]),
# sens=c(combined.DUP.DEL[training.type=="basic" & model == "RandomForest",y],
# combined.DUP.DEL[model=="OS",y]))
## Calculate cutoffs based on the target specificity in the LOO model for both DEL and DUP in the Random Forest and generate sensitivity/specificity values
calcd.cutoffs <- data.table(ct=c(rep(c("DEL","DUP"),1)),
metric=c("rf.wes.binary","rf.wes.binary"),
linetype=c(1,1))
## calc.cutoff and calc.sens.spec find the nearest point by subtracting the target from each score (either sensitivity to DGV or the respective metric) and taking the minimum absolute difference
# Setting a higher cutoff simply lets all variants through, and removes the specificity to DGV (which we can't really calculate)
calc.cutoff <- function(ct,m) {
if (ct == "DEL") {
return(combined.DUP.DEL[metric==m][score<=0,score][which.min(abs(combined.DUP.DEL[metric==m][score<=0,spec] - cutoff))])
} else {
return(combined.DUP.DEL[metric==m][score>=0,score][which.min(abs(combined.DUP.DEL[metric==m][score>=0,spec] - cutoff))])
}
}
calcd.cutoffs[,co:=mapply(calc.cutoff,ct,metric)]
calc.sens.spec <- function(ct,m,co,returnable) {
if (returnable == "DGV") {
if (ct == "DEL") {
return(unlist(sensitivity.DGV[metric==m][score<=0,sens.DGV][which.min(abs(sensitivity.DGV[metric==m][score<=0,score] - co))]))
} else {
return(unlist(sensitivity.DGV[metric==m][score>=0,sens.DGV][which.min(abs(sensitivity.DGV[metric==m][score>=0,score] - co))]))
}
} else {
if (ct == "DEL") {
return(unlist(combined.DUP.DEL[metric==m][score<=0,..returnable][which.min(abs(combined.DUP.DEL[metric==m][score<=0,score] - co))]))
} else {
return(unlist(combined.DUP.DEL[metric==m][score>=0,..returnable][which.min(abs(combined.DUP.DEL[metric==m][score>=0,score] - co))]))
}
}
}
calcd.cutoffs[,sens:=mapply(calc.sens.spec,ct,metric,co,"sens")]
calcd.cutoffs[,spec:=mapply(calc.sens.spec,ct,metric,co,"spec")]
calcd.cutoffs.os <- bind_rows(DEL.plot[[2]],DUP.plot[[2]])
calcd.cutoffs.os[,ct:=ifelse(a<=0,"DEL","DUP")]
setnames(calcd.cutoffs.os,"a","co")
## Apply the actual cutoff
apply.cutoff <- function(m, co.val) {
DEL.cutoff <- calcd.cutoffs[metric==m][ct=="DEL",co]
DUP.cutoff <- calcd.cutoffs[metric==m][ct=="DUP",co]
if (m == "rf.wes.binary") {
m <- "wes.support.score"
}
data[,paste("filter",eval(co.val),eval(m),sep="."):= (get(m) < DEL.cutoff | get(m) > DUP.cutoff) & !split.file %in% ukbb.filtered.indvs]
}