AI4All@Princeton is a summer camp that aims to promote diversity in computer science by teaching AI to young students of diverse backgrounds. In this module, we investigate our genomic diversity by exploring natural genomic variation between world populations.
Relevant Links:
- IGSR home: Geno
- Phase 3 1000 Genomes Results:
- Phase 3 1000 Genomes Data:
- Phase 3 1000 Genomes Processing Example:
- *** see gwas_sv_ld_filt_af.txt
Data Preprocessing
- Run GATK software to subset SNPs:
- run ./gatk/gatk SelectVariants -V input.vcf -O output.vcf --keep-ids gwas_sv_ld_RSIDs.list
- example output.vcf: chrX_filtered.txt
- concatenate the filtered per-chromosome VCF files using VCFtools (e.g. vcf-concat)
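The preprocessing steps above could be scripted roughly as follows. This is only a sketch: the per-chromosome file names (`chr{}.vcf`, `chr{}_filtered.vcf`) and the helper names are hypothetical, and the functions only build the command lines rather than running them.

```python
# Assumption: GATK is invoked as ./gatk/gatk, as in the command above.
GATK = "./gatk/gatk"
# rsID list used to subset SNPs, as named in the command above.
RSID_LIST = "gwas_sv_ld_RSIDs.list"


def select_variants_cmd(input_vcf, output_vcf, rsid_list=RSID_LIST):
    """Build the SelectVariants command for one VCF file (does not run it)."""
    return [
        GATK, "SelectVariants",
        "-V", str(input_vcf),
        "-O", str(output_vcf),
        "--keep-ids", str(rsid_list),
    ]


def build_commands(chromosomes=range(1, 23),
                   in_pattern="chr{}.vcf",             # hypothetical input naming
                   out_pattern="chr{}_filtered.vcf"):  # hypothetical output naming
    """Return (commands, output_files) for the given autosomes."""
    commands, outputs = [], []
    for chrom in chromosomes:
        out_file = out_pattern.format(chrom)
        commands.append(select_variants_cmd(in_pattern.format(chrom), out_file))
        outputs.append(out_file)
    return commands, outputs


def concat_cmd(filtered_vcfs):
    """Build the vcf-concat command; its output goes to stdout."""
    return ["vcf-concat", *filtered_vcfs]
```

Each command list can be passed to `subprocess.run(cmd, check=True)`. Because `vcf-concat` writes to standard output, the concatenation step would need its output redirected to a file such as `chr01-22_filtered.vcf`.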
Input Data:
- "chr01-22_filtered.vcf"
- to download:
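As a starting point for working with the filtered VCF in Python, the sketch below reads genotype calls into per-variant lists of non-reference allele counts (0, 1, or 2 for diploid calls). It assumes an uncompressed, tab-separated VCF with a `GT` field in the FORMAT column; the function names are hypothetical and the demo text is a toy example, not real 1000 Genomes data.

```python
import io


def genotype_to_dosage(gt):
    """Convert a GT string such as '0|1' or '1/1' to a non-reference allele count.

    Returns None for missing calls such as './.'.
    """
    alleles = gt.replace("|", "/").split("/")
    if any(a in (".", "") for a in alleles):
        return None
    return sum(1 for a in alleles if a != "0")


def read_vcf_dosages(handle):
    """Return (sample_names, variant_ids, rows); rows[i][j] is variant i, sample j."""
    samples, variant_ids, rows = [], [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            continue
        gt_index = fields[8].split(":").index("GT")
        variant_ids.append(fields[2])
        rows.append([genotype_to_dosage(call.split(":")[gt_index])
                     for call in fields[9:]])
    return samples, variant_ids, rows


# Toy two-variant, three-sample VCF for illustration only.
demo = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n"
    "1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1\n"
    "1\t200\trs2\tC\tT\t.\tPASS\t.\tGT\t./.\t1|0\t0|0\n"
)
demo_samples, demo_ids, demo_rows = read_vcf_dosages(io.StringIO(demo))
```

For the real file, the same function can be called on `open("chr01-22_filtered.vcf")`. For large inputs a dedicated VCF library would likely be faster than this line-by-line parser.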
To Do:
- Outline the first two weeks
- Implement the first two weeks' notebooks (goals)
- Start the first 3 mini lectures
- Plan (and test) which ML algorithms to introduce to students (clustering, standard prediction, etc.)
- Brainstorm intro slides on SNPs
- Explore with Python and benchmark runtimes
- 1000genomes_dataExploration.ipynb: preliminary exploration of 1000 Genomes data (PCA, SVM, data cleanup and filtering)
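The PCA step mentioned for the exploration notebook might look something like the sketch below. It is not the notebook's actual code: it assumes NumPy is available, takes a samples-by-variants matrix of allele counts with no missing values, and uses simulated toy genotypes (two groups with different allele frequencies) purely for illustration.

```python
import numpy as np


def pca(dosages, n_components=2):
    """PCA via SVD on a (samples x variants) allele-count matrix.

    Returns (scores, explained_variance_ratio) for the leading components.
    Assumes no missing values and at least one variant that varies.
    """
    X = np.asarray(dosages, dtype=float)
    X = X - X.mean(axis=0)                       # center each variant
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]


# Toy data: 10 samples per group, 50 variants, simulated allele counts.
rng = np.random.default_rng(0)
group_a = rng.binomial(2, 0.1, size=(10, 50))    # low alternate-allele frequency
group_b = rng.binomial(2, 0.9, size=(10, 50))    # high alternate-allele frequency
toy = np.vstack([group_a, group_b])

toy_scores, toy_explained = pca(toy)
```

Plotting `toy_scores[:, 0]` against `toy_scores[:, 1]` should show the two simulated groups separating along the first component, which is the kind of population structure the notebook looks for in the real data.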