Merge pull request #1 from hackseq/master
Update fork from original
yxblee authored Oct 15, 2016
2 parents b2b6fc9 + 47cd5e0 commit 18fd91a
Showing 4 changed files with 137 additions and 37 deletions.
62 changes: 41 additions & 21 deletions Abyss-Blackbox-optimization.Rmd
@@ -1,5 +1,5 @@
---
title: "Abyss Blackbox Optimization"
title: "ABySS Blackbox Optimization"
author: "Nivretta Thatra"
date: "October 15, 2016"
output: github_document
@@ -20,35 +20,55 @@ output: github_document
knitr::opts_chunk$set(echo = TRUE)
```

# Project 2 goals
## Evaluate a bunch of parameter optimization tools
- Response metric
- Solution function is unknown
- Know min and max, scale of values (continuous or discrete?)
- Approach 1: Start halfway and try a point to the left and to the right, find the next best, repeat; *exploitation* (see the sketch after this list)
- Approach 2: Try a bunch of random ranges, *exploration*
- Approach 3: Grid search
- Current problem evaluation
- Human genome assembly
- Shotgun sequencing to contigs is ~24 hours
- Metrics: contiguity (how long the contigs are) and correctness (how accurate the contigs are)
- We are optimizing for *contiguity*
- For genome assembly, there is one key parameter we can optimize, and then various other weaker ones.
- **Pareto Frontier**
- Setting the boundaries for the metrics
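
A minimal sketch of Approach 1 (exploitation) in R, assuming a hypothetical `assemble_n50(k)` helper that runs the assembler for a given k-mer size and returns the resulting N50; the bounds and step size below are placeholder values:

```r
# Approach 1 (exploitation): start halfway along the k range, probe the
# neighbouring values, and move towards whichever neighbour improves the
# metric; stop when neither neighbour is better.
# assemble_n50() is a hypothetical black-box wrapper around the assembler.
hill_climb_k <- function(assemble_n50, k_min = 16, k_max = 64, step = 2) {
  k <- round((k_min + k_max) / 2)
  best <- assemble_n50(k)
  repeat {
    candidates <- c(k - step, k + step)
    candidates <- candidates[candidates >= k_min & candidates <= k_max]
    scores <- vapply(candidates, assemble_n50, numeric(1))
    if (length(scores) == 0 || max(scores) <= best) break
    k <- candidates[which.max(scores)]
    best <- max(scores)
  }
  list(k = k, n50 = best)
}
```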

# Abstract
- When tackling optimization of parameters, the process is manual and tedious: submitting jobs to a scheduler, rerunning failed jobs, inspecting outputs, tweaking parameters, and repeating. In genome sequence assembly, for example, there are a variety of parameters related to expected coverage of the reads and heuristics to remove read errors and collapse heterozygous variation.

- Black-box parameter optimization tools do exist, but their usability and speed need to be compared and evaluated.
- Approaches taken by optimization tools:
- *Exploitation Approach*: Start halfway and try a point to the left and to the right, find next best, iterate
- *Exploration Approach*: Try a bunch of random ranges
- *Grid Search Approach*: Try every value on a fixed grid (the exploration and grid-search approaches are sketched after this list)
- Here we evaluate three optimization tools (OPAL, Spearmint, and ParOpt) on a dataset of a human bacterial artificial chromosome (BAC), using the assembler [ABySS](https://github.com/bcgsc/abyss#readme).

- We find that: ~blah blah SPEARMINT is real fresh~
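
The exploration and grid-search approaches, sketched in the same style (again assuming a hypothetical `assemble_n50(k)` black-box wrapper; the ranges are placeholders):

```r
# Grid search: evaluate every k on a fixed grid.
grid_search_k <- function(assemble_n50, ks = seq(16, 64, by = 4)) {
  data.frame(k = ks, n50 = vapply(ks, assemble_n50, numeric(1)))
}

# Exploration: sample k at random within the known bounds.
random_search_k <- function(assemble_n50, k_min = 16, k_max = 64, n = 10) {
  ks <- sample(seq(k_min, k_max, by = 2), n)
  data.frame(k = ks, n50 = vapply(ks, assemble_n50, numeric(1)))
}
```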

# Goals/Methods
## Evaluate 3 parameter optimization tools for usability and speed
- Divvy up 3 tools for testing
- Download ABySS
- Access dataset from ORCA computing machine

- Backend
- What types of input parameters (discrete with large/small ranges, continuous, binary)
- Make it portable to other command-line tools, so the optimizer can be told how to launch the tool
- Results output
- Generate metrics vs parameters plot
- Generate metrics vs metrics plot (Pareto frontier)
- Generate target metrics vs parameters plot
- Generate the Pareto frontier of the target metric and a second metric of interest (contiguity and correctness), likely in R using ggplot (see the sketch after this list)
- Generate a report of the results of the optimization
- Write a short report of our experience
- Post on GitHub pages
- Possibly submit to a preprint server (bioRxiv, PeerJ, Figshare)
- Possibly submit for peer review, such as F1000Research Hackathons
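
A rough ggplot2 sketch of the planned plots, assuming a hypothetical results table with one row per run and placeholder columns `k`, `NG50`, and `breakpoints` (the values below are made up for illustration):

```r
library(ggplot2)

# Hypothetical results: one row per assembly run (made-up example values).
results <- data.frame(
  k           = c(24, 28, 32, 36),
  NG50        = c(12000, 23000, 29000, 21000),
  breakpoints = c(3, 4, 6, 5)
)

# Target metric vs parameter.
ggplot(results, aes(x = k, y = NG50)) +
  geom_line() +
  geom_point()

# Pareto frontier of contiguity (maximize NG50) vs correctness (minimize
# breakpoints): a run is on the frontier if no other run dominates it.
dominated <- vapply(seq_len(nrow(results)), function(i) {
  any((results$NG50 >= results$NG50[i] & results$breakpoints < results$breakpoints[i]) |
      (results$NG50 >  results$NG50[i] & results$breakpoints <= results$breakpoints[i]))
}, logical(1))
results$pareto <- !dominated

ggplot(results, aes(x = breakpoints, y = NG50, colour = pareto, label = k)) +
  geom_point() +
  geom_text(nudge_y = 800)
```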

## Dataset(s) and Optimizers
- Dataset
- a human bacterial artificial chromosome (BAC), assembled with [ABySS](https://github.com/bcgsc/abyss#readme)

- Metrics
- The key metrics are *contiguity* (a.) and *correctness* (b. through d.).
1. contiguity (NG50, N50) and aligned contiguity (NGA50, NA50)
2. number of breakpoints when aligned to the reference as a proxy for misassemblies
3. number of mismatched nucleotides when aligned to the reference, expressed as Q = -10 * log10(mismatches / total_aligned)
4. completeness, number of reference bases covered by aligned contigs divided by number of reference bases
- We'll be optimizing the NG50 metric (or NGA50 when a reference genome is available) and reporting (but probably not optimizing) the correctness metrics; a short sketch of computing these metrics follows this list.
- The primary parameter we'll be optimizing is k (a fundamental parameter of nearly all de Bruijn graph assemblers), and there are a number of other parameters we can play with (typically thresholds related to expected coverage).
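
A small sketch of how these metrics can be computed from a vector of aligned contig lengths (in practice `abyss-fac` reports the contiguity numbers for us); the lengths, reference size, mismatch count, and covered bases below are made-up placeholder values:

```r
# Contig lengths of an assembly (bp) and the reference size (placeholder values).
lengths     <- c(40000, 25000, 12000, 8000, 3000)
genome_size <- 100000

# N50: the length L such that contigs of length >= L cover at least half of
# the total assembly size; NG50 uses half of the reference size instead.
nx50 <- function(lengths, target) {
  sorted <- sort(lengths, decreasing = TRUE)
  sorted[which(cumsum(sorted) >= target / 2)[1]]
}
n50  <- nx50(lengths, sum(lengths))  # half of the assembly size
ng50 <- nx50(lengths, genome_size)   # half of the reference size

# Correctness: Phred-scaled mismatch rate, Q = -10 * log10(mismatches / total_aligned).
mismatches    <- 120
total_aligned <- sum(lengths)
q <- -10 * log10(mismatches / total_aligned)

# Completeness: reference bases covered by aligned contigs / reference bases.
covered_bases <- 88000  # placeholder
completeness  <- covered_bases / genome_size
```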

- Optimizers being evaluated
- [OPAL](http://pythonoptimizers.github.io/opal/) by @dpo — [Optimization of algorithms with OPAL](http://link.springer.com/article/10.1007/s12532-014-0067-x)
- [SpearMint](https://github.com/hips/spearmint) by @mgelbart — [Predictive Entropy Search for Multi-objective Bayesian Optimization](https://arxiv.org/abs/1511.05467)
- [ParOpt](https://github.com/sseemayer/ParOpt) by @sseemayer which uses scipy.optimize
- Possibly R packages, [a long list](https://cran.r-project.org/web/views/Optimization.html); a minimal sketch of this route follows this list
- Possibly Python packages, [like scikit-optimize](https://scikit-optimize.github.io/)
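
For the R-packages route, a minimal sketch of handing the black box to a stock R optimizer; `assemble_n50(k)` is again a hypothetical wrapper around `abyss-pe`, and because k must be an integer, a continuous optimizer needs rounding (or a discrete grid instead):

```r
# One-dimensional optimization of k with base R's optimize().
# assemble_n50() is a hypothetical black-box wrapper that runs the assembler
# for a given k and returns the assembly's N50. optimize() assumes a fairly
# smooth, unimodal objective, so this is only a starting point.
best <- optimize(function(k) assemble_n50(round(k)),
                 interval = c(16, 64),
                 maximum  = TRUE)
best$maximum    # the (unrounded) k that gave the best N50
best$objective  # the best N50 seen
```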

## Results
50 changes: 34 additions & 16 deletions scripts/R_optimizers/common/AbyssWrapper.R
@@ -1,33 +1,51 @@
runAbyss<-function(input, k) {
name = paste("test_k", k, sep="")
cmd = paste("abyss-pe",
" k=", k,
#" np=8"
" name=", name,
" in='", input, "'",
sep = "")
library("testthat")

#' Runs ABySS
#'
#' Runs ABySS (abyss-pe) with the specified parameters.
#' @param input Paths to the input fastq files, separated by spaces
#' @param name The name of this assembly
#' @param k size of a single k-mer in a k-mer pair (bp)
#' @export
runAbyss<-function(input, name, k) {
outdir = paste(name, "_abyss_k", k, sep="")
dir.create(file.path(".", "runs"), showWarnings = FALSE)
dir.create(file.path("runs", outdir), showWarnings = FALSE)
outdir <- paste("runs/", outdir, sep="")

cmd <- paste("abyss-pe",
" -C ", outdir,
" k=", k,
" name=", name,
" in=\"", input, "\"",
sep = "")
print("Running:")
print(cmd)

t1 <- try(system(cmd,
intern = TRUE,
ignore.stderr = TRUE,
ignore.stdout = TRUE),
#ignore.stderr = TRUE,
#ignore.stdout = TRUE
),
silent = TRUE)

if (inherits(t1, "try-error")) {
print("[FAILED]")
return()
}
print("[DONE]")
}


testRunAbyss <- function() {
#setwd("Hackseq2016/abyss")
test_input = "../../../data/test-data/reads1.fastq ../../../data/test-data/reads2.fastq"
runAbyss(test_input, 22)
#' Runs Abyss for the test data
#'
#' @param k size of a single k-mer in a k-mer pair (bp)
#' @export
runAbyssTest <- function(k) {
runAbyss("$PWD/data/test-data/reads1.fastq $PWD/data/test-data/reads2.fastq",
"test",
k)
}

testRunAbyss()
runAbyssTest(k=22)


48 changes: 48 additions & 0 deletions scripts/manual/Makefile
@@ -0,0 +1,48 @@
k=32

# Report run time and memory usage
export SHELL=zsh -opipefail
export REPORTTIME=1
export TIMEFMT=time user=%U system=%S elapsed=%E cpu=%P memory=%M job=%J

all: \
results/200k/k$k/hsapiens-scaffolds.fac.tsv \
results/200k.fac.tsv

.DELETE_ON_ERROR:
.SECONDARY:

# Download the complete data.
data/30CJCAAXX_4.fq.gz:
mkdir -p $(@D)
curl -o $@ ftp://ftp.bcgsc.ca/public/sjackman/$(@F)

# Download a subset of the data.
data/200k.fq.gz:
mkdir -p $(@D)
curl -o $@ ftp://ftp.bcgsc.ca/public/sjackman/$(@F)

# Take a subset of the data.
data/400k.fq.gz: data/30CJCAAXX_4.fq.gz
gunzip -c $< | head -n1600000 | gzip >$@

# Unzip the data.
%.fq: %.fq.gz
gunzip -c $< >$@

# Assemble the data with ABySS.
results/200k/k$k/hsapiens-scaffolds.fa: data/200k.fq
mkdir -p $(@D)
abyss-pe -C $(@D) name=hsapiens k=$k in=$(realpath $<)

# Calculate the assembly contiguity metrics.
%.fac.tsv: %.fa
abyss-fac $< >$@

# Concatenate the per-k TSV files and report the run with the largest N50.
results/200k.fac.tsv: \
results/200k/k*/hsapiens-scaffolds.fac.tsv \
results/200k/k$k/hsapiens-scaffolds.fac.tsv
mlr --tsvlite cat $^ >$@
datamash -H max N50 <$@
head -n1 $@; grep $$(datamash -H max N50 <$@ | tail -n1) $@
14 changes: 14 additions & 0 deletions scripts/manual/README.md
@@ -0,0 +1,14 @@
# Manual Optimization

Run ABySS multiple times, manually, for multiple values of *k*, and determine which assembly has the largest N50.

# Usage

```sh
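# Assemble the 200k read subset with a few values of k, then combine the results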
make k=24
make k=28
make k=32
make
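# Report the k whose assembly has the largest N50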
datamash -H max N50 <results/200k.fac.tsv
head -n1 results/200k.fac.tsv; grep $(datamash -H max N50 <results/200k.fac.tsv | tail -n1) results/200k.fac.tsv
```
