The computational Seq-Well TCR alignment pipeline identifies and quantifies the TCR regions associated with each cell barcode in the sequencing data resulting from TCR enrichment protocol described in Tu et al. 2019. We have reengineered the Seq-Well TCR recovery pipeline originally presented by Tu et al. to make significant enhancements to the user and developer experience while delivering comparable results. Chiefly, we have improved the performance, reporting, tunability, portability, and maintainability of the computational pipeline in a new Python program, TCRGO. This pipeline consists of four main steps, preprocessing and alignment, filtering using alignment information, CDR3 recovery, and summarizing of the data. The preprocessing and alignment step consists of conversion to single-end BAM format, barcode tagging, read trimming, multimap alignment to TRA and TRB gene segments, and barcode error correction. Multimap alignment to V, J, and C TRA and TRB gene segments is performed via Bowtie2 while all other preprocessing tasks are now handled by trusted tools such as Samtools, Picard, and Picard extension Drop-Seq Tools. The filtering step iterates over the alignment data and distributes reads that multimap to both a V and J segment to lists which can be used in parallel by the CDR3 recovery step. The CDR3 recovery step quantifies the alignment identity data for each unique transcript and determines the unique transcript’s most likely CDR3 sequence using a greedy, optimized, heuristic-driven algorithm. The alignment identity and CDR3 information is gathered and tabulated during the summary step. The filtering, recovery, and analysis steps, originally scripted in Bourne Again Shell and Matlab, have been refactored in Python to afford better maintainability of the codebase, reporting for easier troubleshooting, and boosted performance via harnessing highly efficient and powerful packages such as pysam, biopython, numpy, and pandas. Furthermore, the pipeline has been fitted for cloud computing through wrapping it in Docker and the Workflow Description Language. Cloud deployability will increase accessibility to scientists on cloud research platforms such as Terra and enable parallelization of the pipeline’s data processing jobs.
Hardware: macOS (tested), linux (untested), or Windows (untested). Recommended 16GB RAM and at least 64 GB free space on hard drive.
Software: Python ≈ 3.8.5, pysam ≈ 0.16.0.1, pandas ≈ 1.1.2, bowtie2 >= 2.4.1, samtools ≈ 1.3.1 python-igraph ≈ 0.8.3, biopython = 1.77, Java 12.0.1 (recommended), Drop-Seq Tools >= 2.4.0
This program has been wrapped as a Workflow Description Language script, TCRGO.wdl. The latest version of this workflow can be found here on Dockstore, where it can be exported to an analysis platform of your choice. The Docker image itself is hosted here on Docker Hub.
On Terra, scattering of samples is possible through using the Terra Data Table.
Please see the Hardware requirements before considering a local installation.
It's strong recommended you create a new virtual environment. Example using conda
on a Unix-like system:
conda config --add channels conda-forge --add channels bioconda \
&& conda create -y -n tcrgo python=3.8.5 samtools=1.3.1 bowtie2=2.4.1 pysam=0.16.0.1 pandas=1.1.2 python-igraph=0.8.3 biopython=1.77 \
&& conda activate tcrgo
Then git clone
this repository to your machine.
For alignment, you must install Java and git clone
Drop-Seq Tools by the Broad Institute. The Picard and Java jar files must be entered as arguments to the alignment script.
Change the present working directory (cd
on Unix-like systems) to the repository root folder.
For help with any of the scripts, use only the -h
flag for help! Ex: python -m alignment -h
On a Unix-like system, if there are spaces in any of your pathnames please escape them using \
! Example: /path/to/my data/my sample.bam
becomes /paht/to/my\ data/my\ sample.bam
.
See the bottom section for a full list of inputs.
Call alignment.py, a wrapper for Drop-Seq Tools and bowtie2, to tag, trim, align, and barcode-repair a BAM, single-end, or .
python -m alignment -d path/to/dropseq.jar -p path/to/picard.jar -f path/to/VJ_reference.fasta -b mysamplename path/to/raw.bam
Or first convert paired-end FASTQs to single-end BAM then preprocess and align:
python -m alignment -d path/to/dropseq.jar -p path/to/picard.jar -f path/to/VJ_reference.fasta -b mysamplename path/to/barcode.fastq.gz,path/to/biological.fastq.gz
Here is an overview of the preprocessing/alignment pipeline:
Note: `<basename>` is inputted using the `--basename` argument.
0. Input Data -> Unmapped BAM, starting from any of these substeps depending on the input:
a. Raw BAM (`<basename>`.bam)
b. Single-end FASTQ (`<basename>`.fastq)
c. Pair-end FASTQs (`<basename>`_R1.fastq,`<basename>`_TCR.fastq)
1. Unmapped BAM (`<basename>`_unmapped.bam) -> aligned and tagged BAM
a. Tag cell barcodes (`<basename>_celltagged.bam`)
b. Tag molecular barcodes (`<basename>_idtagged.bam`)
c. Filter out reads with low quality cell/molecular barcodes (`<basename>`_filtered.bam)
d. Trim 5’ primer sequence (`<basename>`_adaptertrimmed.bam)
e. Trim 3’ polyA sequence (`<basename>`_trimmed.bam)
f. SAM -> Fastq (`<basename>`_trimmed.fastq)
g. Bowtie2 alignment (`<basename>`_aligned.sam)
h. Sort STAR alignment in queryname order (`<basename>`_alignedsorted.bam)
i. Merge STAR alignment tagged SAM to recover cell/molecular barcodes (`<basename>`_merged.bam)
j. Add gene/exon and other annotation tags (`<basename>`_exontagged.bam)
k. Barcode Repair
i. Repair substitution errors (`<basename>_synthrepaired.bam`)
ii. Repair indel errors (`<basename>`_repaired.bam)
iii. Sort by coordinates (`<basename>`_repairedsorted.bam)
Call filter_queries.py to write a list of BAM queries (reads) which contain alignments to V and J subregions.
python -m filter_queries -lt 5 -gt 1000 /path/to/alignment.bam
where alignment.bam is a BAM produced by the alignment workflow.
The -w
flag can be used to divide the list of queries up into w files for processing by workers on an HPC or the cloud.
Call recover_cdr3s.py to find the top-scoring V and J alignments per read, the VJ pair alignment consensus per UMI, and the consensus CDR3 sequence per UMI.
python -m recover_cdr3s -f path/to/VJ_reference.fasta -mf 0.3 -mc 5 -q out/queries1.tsv /path/to/alignment.bam
where alignment.bam is a BAM produced by the alignment workflow.
The --cdr3-positions-file
argument can be used to input a 2-column TSV file containing the names of FASTA TR(A|B)(V|J) segments and that entry's corresponding CDR3 start or ending codon position in the nucleotide sequence. Each entry name in the CDR3 positions file must exactly match its respective entry in the FASTA. If the argument is not specified, the program will attempt to identify the most likely CDR3 start and end positions.
Call summary.py to combine all cdr3_infow.tsv files produced by recover_cdr3s.py into one large tsv file. The --collapse
mode will attempt to collapse UMIs by their cell and molecular barcodes; an intensive process which consumes a considerable amount of time.
python -m summary -i path/to/cdr3_info_folder/ --collapse ALL
Task | Workflow Description Language | Python Commandline | Description |
---|---|---|---|
Global | Boolean run_alignment = true | N/A | Whether or not to run the alignment task. If false, sequence_data must the address of repairedsorted BAM from alignment written as an array of length 1; ["gs://path/to/repairedsortedbam"]. |
Global | String sample_name | [See --basename] | Becomes the prefix for many output files as to differentiate them from outputs from other samples. |
Global | Array[File] sequence_data | [See data, bam] | An array of length 1 or 2 containing the addresses to sequence data. Options include the ["gs://path/to/repairedsorted.bam"], ["gs://path/to/singleend.fastq"], or ["gs://path/to/barcode.fastq","gs://path/to/biological.fastq"]. |
Global | File fasta | [See --fasta] | The reference FASTA file containing entries for TR(A|B)(V|J|C) gene segments. |
Global | String docker = "shaleklab/tcrgo:latest" | N/A | The address to the container image used to run the program. |
Global | String zones = "us-central1-a" ... | N/A | The Google zone preferences for hosting and running the program. |
Global | Int preemptible = 2 | N/A | The number of tries the program will attempt to use a preemptible VM, which is ~5 times cheaper but can be restarted at any time. If exceeded |
All | N/A | --verbosity, -v = "INFO" | The scope of verbosity to which the log will output. See logging Python module for more details. Possible options include: "INFO", "VERBOSE", "WARNING", "SUCCESS", "ERROR". "VERBOSE" may be useful in debugging. |
alignment | Array[File] data | data (positional argument) | Path to single-end BAM/FASTQ or paths (delim. by comma) Read1-index-1 FASTQs |
alignment | N/A (/software/dropseq/jar/dropseq.jar) | --dropseq, -d | Path to the Drop-Seq Tools jar file. Recommended you use a version 2.3.0 or 2.4.0. |
alignment | N/A (/software/dropseq/3rdParty/picard/picard.jar) | --picard, -p | Path to the Picard jar file. Recommended you use a version of Picard that is bundled with/compatible with your version of Drop-Seq Tools |
alignment | File fasta | --fasta, -f | The reference FASTA file containing entries for TR(A|B)(V|J|C) gene segments. |
alignment | String sample_name | --basename, -b = None | The prefix to affix to all output files. If not specified, Python's os module will be used to ascertain. |
alignment | N/A (/cromwell_root/out/) | --output-path, -o = "./out/alignment/" | The path to which outputted files will be sent. |
alignment | Int number_cpu_threads = 4 | N/A | The number of CPUs to request for the task. |
alignment | Int task_memory_GB = 16 | N/A | The amount of RAM in GB to request for the task. |
alignment | String disks = "local-disk 256 HDD" | N/A | The amount of diskspace in GB to request for the task. |
alignment | Int boot_disk_size_GB = 15 | N/A | The size of the boot disk to request for the task. |
alignment | Boolean filter_barcodes = true | N/A | Run filter barcodes based on quality score step |
filter_queries | File bam | bam (positional argument) | Path to single-end tagged, trimmed, aligned, repaired BAM. IMPORTANT: It is strongly recommended you use the alignment script to create this BAM." |
filter_queries | Int minimum_reads = 5 | --minimum-reads, -lt = 5 | UMIs with less than the minimum number of reads will be filtered out. |
filter_queries | Int maximum_reads = 1000 | --maximum-reads, -gt = 1000 | UMIs with more than the maximum number of reads will have their reads randomly subsampled down to the maximum. |
filter_queries | Int seed = 2020 | --seed, -s = 2020 | The seed for pseudo-random subsampling of reads for UMIs that exceed the maximum number of reads. |
filter_queries | Int workers | --workers, -w = 1 | Divide the list of filtered reads by this number for processing by multiple HPCC tasks or VMs |
filter_queries | Int number_cpu_threads = 1 | N/A | The number of CPUs to request for the task. |
filter_queries | Int task_memory_GB = 16 | N/A | The amount of RAM in GB to request for the task. |
filter_queries | String disks = "local-disk 128 HDD" | N/A | The amount of diskspace in GB to request for the task. |
filter_queries | Int boot_disk_size_GB = 10 | N/A | The size of the boot disk to request for the task. |
filter_queries | N/A | --umi-tag, -u = "RX" | The BAM tag which represents the unique molecular barcode sequence. |
filter_queries | N/A | --barcode-tag, -b = "CR" | The BAM tag which represents the cell barcode sequence. |
filter_queries | N/A (/cromwell_root/out/) | --output-path, -o = "./out/queries/" | The path to which the files from this program will output. |
recover_cdr3s | File bam | bam (positional argument) | Path to sorted, single-end BAM |
recover_cdr3s | File fasta | --fasta, -f | Path to reference FASTA containing TRA/TRB gene segments (V,J,C) |
recover_cdr3s | File? cdr3_positions | --cdr3-positions-file = None | Path to a two-column TSV containing reference names and the start V or end J base position. NOTE: If these positions are zero-indexed as opposed to one-indexed, also specify the --zero-indexed flag! If not provided, program attempts to identify the best ORF and positions. |
recover_cdr3s | Boolean is_zero_indexed = false | --zero-indexed = False | If this and the CDR3 positions file are specified, the positions entered are treated as indices from 0 to length(seq)-1 |
recover_cdr3s | Int workers = 8 | --workers = 1 | The number ID of the worker which will read queries.tsv |
recover_cdr3s | File query_list | --query-list, -q = None | The path to the query list file, outputted by filter_queries, to read. |
recover_cdr3s | Float minimum_frequency = 0.3 | --minimum-frequency, -mf = 0.0 | UMIs whose top VJ combination frequency is beneath the minimum frequency will not undergo CDR3 sequence reconstruction. |
recover_cdr3s | Int minimum_cdr3s = 5 | --minimum-cdr3s = 1 | UMIs with less than the minimum number of reads which have recovered CDR3s will not have their CDR3 statistics reported. |
recover_cdr3s | Boolean exclude_relatives = false | --exclude-relatives, -r = False | When identifying the VJ combination for a UMI, only take for CDR3 recovery reads whose top VJ is exactly equal to the highest counted segment variants. If not enabled, all reads are taken that match the highest counted segment's root (ex: TRAV24-1 is the top V for the UMI, but TRAV24-2 reads are also gathered for CDR3 recovery). Regardless the top segment's specific variant will be reported in the output information. |
recover_cdr3s | N/A (/cromwell/out/) | --output-path, -o = "./out/cdr3/" | The path to which the files from this program will output. |
recover_cdr3s | Int number_cpu_threads = 1 | N/A | The number of CPUs to request for the task. |
recover_cdr3s | Int task_memory_GB = 16 | N/A | The amount of RAM in GB to request for the task. |
recover_cdr3s | String disks = "local-disk 128 HDD" | N/A | The amount of diskspace in GB to request for the task. |
recover_cdr3s | Int boot_disk_size_GB = 10 | N/A | The size of the boot disk to request for the task. |
summary | Array[File] cdr3_infos | [See --workers] | An array of CDR3 info files output by recover_cdr3s |
summary | Array[File] tiebreaks_alignments | N/A | An array of alignment tiebreak files output by recover_cdr3s |
summary | String sample_name | N/A | The prefix which will distinguish the output files from this pipeline from outputs of other samples. |
summary | Boolean collapse = true | --collapse, -c = False | Perform BCUMI collapsing. BCUMIs are first binned by topVJ combination, then subgrouped by greater/equal to threshold (collapsers) vs. less than threshold (collapsees). For each collapser BCUMI, hamming distance comparisons are done with each collapsee BCUMI. If HD == 1, then CDR3 sequences of equal lengths are compared. If HD <= 1, then collapsees are collapsed into the collapser with the highest number of reads (or number of edges if there is a tie)."aggrcollapsed_cdr3_info.tsv is created. |
summary | Int threshold = 10 | --threshold, -t = 10 | When BCUMI collapsing, BCUMIs with greater than or equal to this threshold will be considered collapsers and the rest collapsees. Collapsees are merged into collapsers if the collapsing criteria is met. A lower number may result in more collapsing while a higher number may be more strict and result in less collapsing. |
summary | Boolean plot = false | --plot, -p = False | If BCUMI collapsing, plot Collapsers-Collapsees graphs for each reference combination. If enabled, expect hundreds of graphs in --output-path /dgraphs/ with cumulative size exceeding 100MB. The program will take longer to finish |
summary | Boolean string_index = false | --string-index, -s = False | Output the aggregated CDR3 info .tsv file with the BC and UMI strings as the indices instead of numbering them for easier reading |
summary | N/A ("/cromwell_root/out/") | --output-path, -o = "./out/cdr3/" | The path to which outputted files will be sent. |
summary | N/A ("/cromwell_root/out/") | --input-path, -i = "./out/cdr3/" | Path to folder containing files outputted by recover_cdr3s |
summary | Int number_cpu_threads = 1 | N/A | The number of CPUs to request for the task. |
summary | Int task_memory_GB = 16 | N/A | The amount of RAM in GB to request for the task. |
summary | String disks = "local-disk 64 HDD" | N/A | The amount of RAM in GB to request for the task. |
summary | Int boot_disk_size_GB = 10 | N/A | The size of the boot disk to request for the task. |