Repository describing the computational workflow employed for the single-cell Alternative Splicing (AS) analysis from Costea et al., 2024.
The raw data can be accessed at EGAD50000000831.
Patient-derived xenograft (PDX) samples from pediatric T-cell acute lymphoblastic leukemia (T-ALL) patients were sequenced via VASA-seq, a plate-based total-transcriptome (single-end) scRNA-seq method developed by Salmen et al., 2022. This repository contains the workflow adapted from the original VASA-seq GitHub repository for conducting Alternative Splicing (AS) analysis, starting from the raw sequencing output. The whole pipeline is organized into four independent modules:
This snakemake
pipeline is designed to be run individually for each patient/sample.
Firstly, raw FASTQs are demultiplexed into single-cell files, then reads 3'-end are trimmed as per the original VASA-seq workflow. The depletion of ribosomal RNA is performed by mapping reads to a reference FASTA file containing both mouse and human rRNA genes (rRNA_ref/rRNA_human_mouse.fa
). The last step consists in discarding mouse reads that contaminated the human T-ALL sample.
Move to your working directory and clone this repository (change the part between angle brackets):
git clone https://github.com/Zaffe24/AS_VASAseq_sc_pipeline.git
cd pre_process/ # move to the subfolder for this first module
Note
Run the pipeline from the pre_process
subfolder and set it as your working directory in profile/variables.yaml
.
Then create a conda environment named pre_process
using as template pre_process.yaml
:
conda env create -f envs/pre_process.yaml -p </path/to/conda>/pre_process
conda activate pre_process
Before running the pipeline edit profile/variables.yaml
. This file allows the user to specify the inputs for the pipeline. Each parameter provides a thorough explanation of its purpose. Moreover, profile/config.yaml
contains parameters specific to Snakemake and how job submission to the cluster is handled.
Important
The ribo-depletion step requires indexing the concatenated human-mouse (hg38+mm38) genome. To build the index, execute scripts/BBsplit.sh
manually (BBMap
module required).
Then run:
snakemake --profile profile/ filtering_cleaned
logs/<patient_id>/
contains log reports of each job run on the cluster, organized by rule.- All the output files are stored in the
<patient_id>/cleaned
directory:- List of demultiplexed and mouse-depleted FASTQs.
units.tsv
contains the label prefix of all FASTQs generated (sample
field) and their location (fq1
). This and the following .tsv files will be needed for modules II, III and IV.samples.tsv
is also required for the following modules, it contains the label prefix of each FASTQ (sample
) and the respective cluster of origin (condition
).locals.tsv
contains the path to each FASTQ (path
), their label prefix (sample
), and the cluster label to which that cell belongs (cluster
).fractions.tsv
contains the following order: label prefix for each FASTQ, Proportion of reads mapping to the mouse genome, and proportion of ambiguous reads (mapping to both human and mouse genomes).full_metadata.tsv
merges the information relative to each FASTQ defined in the previous tables.Mouse_reads.png
andAmbigous_reads.png
show boxplots for the proportion of mouse and ambiguous reads in processed and discarded (if a Thr < 1 was set invariables.yaml
) cells.
This module allows the user to extend the (human) genome annotation file (GTF) with novel transcripts inferred directly from the patient/sample data. The single-cell FASTQs are collapsed into pseudo-bulks corresponding to the cell clusters previously defined, and novel transcript annotations are generated for each one of them. The newly annotated transcripts are then appended to the original GTF file that will be used to quantify AS events. For this step, we relied on the pipeline from the original VASA-seq publication.
For the installation of this module, please refer to the original documentation.
Note
To avoid confusion between Snakemake configuration files, we suggest creating a directory specific to this module.
We ran the pipeline from a dedicated conda environment whose template can be found in transcriptome_extension/
:
conda env create -f transcriptome_extension/module_2.yaml -p </path/to/conda>/module_2.yaml
conda activate module_2
As before, the parameters in variables.yaml
and config.yaml
should be defined according to the user specifics. Our customized templates can be found here, otherwise the templates from the original GitHub repository work fine.
Important
We adapted the original Snakefile
to satisfy our requirements. Replace the original with transcriptome_extension/Snakefile
in your working directory.
To run the pipeline:
snakemake --profile profile/ <patient_id>/gffcompare/extended_ref_annotation.gtf
The only relevant output for the next steps is the original GTF file expanded with novel transcript annotations: <patient_id>/gffcompare/extended_ref_annotation.gtf
.
This step allows the user to further expand the transcriptome annotation by inferring novel micro-exons (length < 30nt) directly from the data.
We suggest consulting the original documentation for further information about this step. We created another dedicated conda environment:
conda env create -f microexon_discovery/module_3.yaml -p </path/to/conda>/module_3
conda activate module_3
This and the following modules, rely on the Snakemake pipeline MicroExonator
. For further information about the inputs required and the rationale behind the workflow, consult the original MicroExonator page.
We provide our customized configuration files for Snakemake in microexon_discovery/profile.
To run this module with our specifics, set the following parameter in the header of MicroExonator.smk
:
## header of MicroExonator.smk
configfile : "profile/variables_discovery.yaml"
then:
snakemake -s MicroExonator.smk --profile profile/ Report/out.high_quality.txt
Report/out.high_quality.txt
lists all the high-quality micro-exons discovered. To append their annotations to the GTF file obtained in the previous module we ran the following script:
microexon_discovery/add_ME_to_ref.sh "Report/out.high_quality.txt" "</path/module2/patient_id>/gffcompare/extended_ref_annotation.gtf" "Homo_sapiens.GRCh38.dna.primary_assembly.fa"
# arg_1 : ouput from module III
# arg 2 : output from module II
# arg 3 : human genome FASTA file
This will generate Report/out.high_quality_calibrated.gtf
, the final GTF file incorporating both the annotation of novel transcripts and microexons.
Tip
The pipeline will generate all output folders in the current working directory. To avoid the risk of overwriting them when re-running the pipeline on a different patient, we suggest moving the outputs to patient-labeled subfolders after completion.
In this step, we perform pairwise comparisons across cell pseudo-bulks to compute the list of AS events statistically significant. Moreover, we can compute the percentage-spliced-in (PSI) value for each splicing node at the single-cell level. For further information regarding these outputs, consult the original pipeline repository.
Similarly to the previous module, we defined variables_pairwise_comparisons.yaml
as config file for MicroExonator.smk
:
## MicroExonator.smk header
configfile : "profile/variables_pairwise_comparisons.yaml"
Having activated the conda environment from the previous module, run the following code:
snakemake -s MicroExonator.smk --profile profile/ snakepool
As output, a .tsv file for each pairwise comparison will be generated in the subfolder Whippet/Delta/Single_Cell/Single_nodes/
. These tables contain information for all AS events detected, regardless if they are statistically significant or not.
To compute PSI values at the single-cell level instead, define variables_psi_matrix.yaml
as config file for MicroExonator.smk
:
## MicroExonator.smk header
configfile : "profile/variables_psi_matrix.yaml"
Within the same conda environment, run the following code:
snakemake -s MicroExonator.smk --profile profile/ quant_unpool_single_cell
For each cell, a .tsv file with PSI values for all splicing nodes will be generated and stored in Whippet/Quant/Single_Cell/Unpooled
.
Tip
Again, after completion of this module we suggest moving all the output files to a patient/sample-labeled subfolder.