- If you are working with an IGVF dataset, I recommend you just generate a new repository from the https://github.com/IGVF-UCSD/dataset_template repo and clone that to wherever you are working
- Create a new folder on the cluster that will house your data
- Create a GitHub repo and link this to the new folder
- Follow the directory structure outlined here: https://github.com/IGVF-UCSD/dataset_template
- Name your dataset something descriptive and machine readable (no spaces!)
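If it helps, here is a minimal sketch (in Python, with an illustrative helper name) for scaffolding the subdirectories used throughout this guide inside a freshly cloned repo; adjust the list if your template differs:

```python
# Minimal sketch: create the empty subdirectories this guide expects inside a
# freshly cloned dataset repo. Directory names follow the layout described in
# this guide; the helper name is illustrative.
from pathlib import Path

def scaffold_dataset(root: str) -> None:
    base = Path(root)
    for subdir in ["fastq", "processed", "annotation", "analysis", "metadata", "bin"]:
        (base / subdir).mkdir(parents=True, exist_ok=True)
        (base / subdir / ".gitkeep").touch()  # keep empty dirs under version control

if __name__ == "__main__":
    scaffold_dataset("my_descriptive_dataset_name")  # no spaces!
```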
The data you are about to start working with inevitably exists somewhere other than our cluster, so we will need to move it over from where it currently lives. This is sometimes non-trivial, so I’ve put together some code to help with different scenarios depending on where the data starts out (https://github.com/IGVF-UCSD/single_cell_utilities/tree/main/data_acquisition). You will likely need to write some custom scripts based on these to transfer some combination of the following (one common scenario is sketched after this list):
- Raw fastqs —> these should be deposited within a `fastq/` subdirectory in your main dataset directory. Include the date of acquisition.
- Processed —> the exact form of this can vary substantially, but the outputs should be placed in a `processed/` subdirectory within the main directory. Include the date of acquisition.
- Annotated —> the exact form of this can vary substantially, but the outputs should be placed in an `annotation/` subdirectory within the main directory. Include the date of acquisition.
- Analysis products —> the exact form of this can vary substantially, but the outputs should be placed in an `analysis/` subdirectory within the main directory. Include the date of acquisition.
- Metadata —> the exact form of this can vary substantially, but the outputs should be placed in a `metadata/` subdirectory within the main directory. Include the date of acquisition.
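For example, one common scenario is pulling raw fastqs from a collaborator’s server into a dated `fastq/` subdirectory. The host, remote path, and layout below are placeholders; see the data_acquisition utilities linked above for other cases:

```python
# Sketch of one common acquisition scenario: rsync raw fastqs from a remote
# server into a dated fastq/ subdirectory. Host and remote path are
# placeholders -- adapt to where your data actually lives.
import subprocess
from datetime import date
from pathlib import Path

dest = Path("fastq") / date.today().strftime("%d%b%y")  # e.g. fastq/08Sep23
dest.mkdir(parents=True, exist_ok=True)

subprocess.run(
    ["rsync", "-avP", "user@collaborator.host:/path/to/fastqs/", str(dest) + "/"],
    check=True,
)
```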
Before you move on, add a short, well-documented notebook to the `bin/` directory describing how the data was acquired.
Data will usually come in either as raw sequencing fastq files, or in already processed and analyzed formats. If it is the former, you will need to process the data yourself. There are many ways to do this, and I have provided some code and pipelines you can run to get to the processed state. A couple things to note:
- Make sure that you version your processing pipeline, similar to data acquisition. Stick the results in a subdirectory within the `processed/` subdirectory with a date (e.g. `08Sep23`).
- Make sure the processing is reproducible in some way…TODO
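As a rough illustration only (the pipeline you actually run will depend on the assay, and the paths, sample name, and reference below are placeholders; flags may also need adjusting for your Cell Ranger version), a dated processing run for 10x scRNA-seq might look something like:

```python
# Sketch: run one processing tool (here Cell Ranger count, purely as an
# example) and land the outputs in a dated processed/ subdirectory.
import subprocess
from pathlib import Path

fastq_dir = Path("fastq/08Sep23").resolve()
out_dir = Path("processed") / "08Sep23"
out_dir.mkdir(parents=True, exist_ok=True)

subprocess.run(
    [
        "cellranger", "count",
        "--id=sample1",
        f"--fastqs={fastq_dir}",
        "--sample=sample1",
        "--transcriptome=/path/to/refdata-gex-GRCh38",  # placeholder reference
    ],
    cwd=out_dir,  # Cell Ranger writes its output pipeline directory here
    check=True,
)
```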
Most of the current downstream analyses performed on single cell data rely on being able to group cells together by some sort of biologically relevant property. This is most commonly cell type, but can be the result of a more nuanced clustering of the data as well (or a known covariate). The goal of this section is to get to those annotations through a sort of preliminary analysis of the data. This usually includes the following (a rough sketch of these steps follows the list):
- Computing quality control metrics on each sample within a dataset
- Filtering low quality cells based on the computed metrics
- Normalizing the data to deal with unwanted properties of the dataset (e.g. unstable variance across features)
- Integrating samples to get all cells in the same feature space and to correct for unwanted technical variation (horizontal integration)
- Integrating modalities (vertical or diagonal integration) to project a cell into a single feature space that captures multiple experimental observations
- Clustering based on these feature spaces
- Automatic annotation of clusters as a first pass
- Manual refinement of annotation by experts
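Here is a rough, Scanpy-based sketch of the first several of these steps (QC, filtering, normalization, reduction, clustering). Paths and thresholds are placeholders, and the integration and annotation steps are omitted:

```python
# One possible Scanpy pass through QC -> filtering -> normalization ->
# clustering. Thresholds are placeholders and should be chosen per dataset.
import scanpy as sc

adata = sc.read_h5ad("processed/08Sep23/sample1.h5ad")  # placeholder path

# QC metrics (flag mitochondrial genes first)
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# Filter low-quality cells (placeholder thresholds)
adata = adata[adata.obs["total_counts"] > 500]
adata = adata[adata.obs["pct_counts_mt"] < 20].copy()

# Normalize and log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Dimensionality reduction, neighbors graph, clustering
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata)
sc.tl.leiden(adata, key_added="CellClusterID")
```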
It is also possible that the data received has already undergone one or more of these steps. If this is the case, I recommend that you put together a few notebooks where you skeptically investigate the data in as many ways as possible.
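A few quick, skeptical checks you might start with (the file path and column name are placeholders, and the raw-count heuristic is only a rough one):

```python
# Quick sanity checks on data that arrives already processed -- verify your
# assumptions (raw vs. normalized counts, which annotations exist) before
# trusting any provided labels.
import scanpy as sc

adata = sc.read_h5ad("processed/provided_object.h5ad")  # placeholder path

print(adata)                       # shapes, layers, obs/var columns, obsm keys
print(adata.obs.columns.tolist())  # which cell-level annotations were provided?

# Does .X hold raw counts (integers) or already-normalized values?
x = adata.X[:100].toarray() if hasattr(adata.X, "toarray") else adata.X[:100]
print("looks like raw counts:", float((x % 1 != 0).sum()) == 0.0)

# If cluster/cell-type labels were provided, how big are the groups?
if "cell_type" in adata.obs:       # column name will vary by dataset
    print(adata.obs["cell_type"].value_counts())
```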
Whatever route you end up taking, create the following in a new `annotation/` subdirectory:

- `{DatasetID}_CellClusterAssignment.tsv` — a metadata table with one row per barcode that includes:
    - `bc`
    - `CellClusterID`
    - `RNAUMI`
    - `RNAGenes`
    - `ATACFragments` (if multiome)
    - Any other additional columns you want (e.g. %mito, %ribo etc.)
- `{DatasetID}_ClusterMetadata.tsv` — a separate table with cluster-level metadata. This one should be one row per cluster and include:
    - `CellClusterID` — should match the barcode table above (i.e. the unique IDs here should exactly match the unique IDs in the `CellClusterID` column above)
    - `ManualAnnotationLabel`
    - `nCells`
    - `MeanRNAUMIsPerCell`
    - `MeanATACFragmentsPerCell`
- `{DatasetID}_thresholds.tsv` — a separate table of per-sample thresholds used for filtering cells (if different ones were used per sample)
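A minimal sketch of writing the first two tables, assuming you already have an annotated AnnData whose `obs` carries the columns listed above (the dataset ID and path are placeholders):

```python
# Sketch: write the per-barcode and per-cluster tables specified above from an
# annotated AnnData. Assumes obs already carries the listed columns.
import pandas as pd
import scanpy as sc

adata = sc.read_h5ad("annotation/annotated.h5ad")  # placeholder path
dataset_id = "MyDataset"                           # placeholder DatasetID

# Per-barcode table
cell_cols = ["CellClusterID", "RNAUMI", "RNAGenes"]  # add ATACFragments etc. if present
cell_tbl = adata.obs[cell_cols].copy()
cell_tbl.insert(0, "bc", adata.obs_names.to_numpy())
cell_tbl.to_csv(f"annotation/{dataset_id}_CellClusterAssignment.tsv", sep="\t", index=False)

# Per-cluster table
cluster_tbl = (
    cell_tbl.groupby("CellClusterID")
    .agg(nCells=("bc", "size"), MeanRNAUMIsPerCell=("RNAUMI", "mean"))
    .reset_index()
)
cluster_tbl["ManualAnnotationLabel"] = ""  # fill in after expert review
cluster_tbl.to_csv(f"annotation/{dataset_id}_ClusterMetadata.tsv", sep="\t", index=False)
```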
Other useful files to generate during this stage include:
- `reductions/` —> any dimensionality reductions
- `loadings/` —> any feature loadings
- `qc/` —> if per-sample QC was done first
Once you have a confident set of annotations, there is a set of relatively general analysis steps you can perform. These are distinguished from more specific downstream analysis tasks that will depend on the experimental design (i.e. these will likely be applicable to almost any single cell dataset). These include:
- Identifying cluster markers for features at single cell resolution
- Creating pseudobulk representations of the dataset for each metadata field of interest. Put these in a subdirectory called `pseudobulk/`. Note that this will depend on the assay, but the concept should be the same. Some examples:
    - A `tagAlign` for ATAC
    - A `tsv` for RNA
- Summarizing features at pseudobulk resolution:
    - Identifying cluster markers
    - Calling peaks for ATAC
- Creating feature matrices from the summarized features:
    - fragment x cell matrices for ATAC
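As one possible sketch for the RNA case, assuming raw counts in `adata.X` and a `CellClusterID` column as defined in the annotation tables above (the input path and output file name are placeholders):

```python
# Sketch: pseudobulk RNA counts per cluster, written to a pseudobulk/
# subdirectory as a gene x cluster tsv.
from pathlib import Path

import numpy as np
import pandas as pd
import scanpy as sc

adata = sc.read_h5ad("annotation/annotated.h5ad")  # placeholder path
Path("pseudobulk").mkdir(exist_ok=True)

pseudobulk = {}
for cluster, idx in adata.obs.groupby("CellClusterID").indices.items():
    counts = adata.X[idx].sum(axis=0)              # sum raw counts over cells
    pseudobulk[cluster] = np.asarray(counts).ravel()

pd.DataFrame(pseudobulk, index=adata.var_names).to_csv(
    "pseudobulk/RNA_pseudobulk_counts.tsv", sep="\t"
)
```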
A couple reminders:
- Version this data as well! You may end up modifying cluster assignments, or using a different tool for generating pseudobulks
Finally, if the dataset gets published, or you are a part of a consortium, you will need to submit the data. This usually involves creating a metadata table that points to all relevant files and running some kind of command line submission script. I will be adding resources to this as the pipeline for it becomes more robust, but a couple tips:
- Make sure you fully understand the experimental set-up prior to generating a metadata table. You should understand the nuances, including but not limited to the number of samples, the number of biological replicates per sample, the number of technical replicates per sample, the sequencer used, the library protocol used, etc.
- Come up with a unique identifier per file that captures the following:
- biosample type (e.g. condition, timepoint etc.) — 0hr_control (if relevant)
- the sample ID (e.g. patient ID or internal sample ID) — DM0b
- the modality used — scRNA, scATAC
- the technical replicate number — _1 (shallow), _3 (deep)
- the lane/plate of sequencing - 1, 2, 3, 4, 13G etc.
- the read type - R1, R2, I1, R3
An example could be `scATAC_24hr_3-cyt_dm21a_fastq_2_1_R1`, which is the modality, biosample type, sample ID, file type (optional), lane, technical replicate number, and read type.
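A tiny, purely illustrative helper for building such identifiers (the field names and ordering here are just one reasonable choice and should match whatever you document for the dataset):

```python
# Illustrative only: assemble a per-file identifier from its metadata fields.
def file_identifier(modality, biosample, sample_id, file_type, lane, tech_rep, read):
    return f"{modality}_{biosample}_{sample_id}_{file_type}_{lane}_{tech_rep}_{read}"

print(file_identifier("scATAC", "24hr_3-cyt", "dm21a", "fastq", 2, 1, "R1"))
# scATAC_24hr_3-cyt_dm21a_fastq_2_1_R1
```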
- Create a README that documents the major steps followed for this dataset, and highlights any nuances worth noting
- You likely won’t be able to host most of the actual data files on GitHub due to their size, so make sure that you provide a link to the publicly available data files prior to publishing this repo. Also provide some code so someone can download the data and hit the ground running. For example:
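Something as simple as the following (with placeholder URLs and paths) is usually enough for the README:

```python
# Example "hit the ground running" snippet to ship with the README: download
# the publicly deposited files into the expected directory layout. URLs are
# placeholders for wherever the data ends up being hosted.
import urllib.request
from pathlib import Path

files = {
    "processed/08Sep23/matrix.h5ad": "https://example.org/my_dataset/matrix.h5ad",
    "annotation/MyDataset_CellClusterAssignment.tsv": "https://example.org/my_dataset/CellClusterAssignment.tsv",
}

for local, url in files.items():
    Path(local).parent.mkdir(parents=True, exist_ok=True)
    print(f"downloading {url} -> {local}")
    urllib.request.urlretrieve(url, local)
```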
Inevitably, analysis tools require data to be written to disk in different formats to be usable. If you started from raw data, this usually isn’t a problem. If you were given some processed version of the data, it will likely be written to disk in a specific format (e.g. a special HDF5 file called `h5ad` for Scanpy). If you want to do an analysis in R, you’ll have to convert things. Unfortunately, this is very non-trivial, and improper conversions can lead to a lot of confusion and improper downstream analysis. To mitigate this, I’ve put together some code that covers many data wrangling cases in a reproducible fashion: https://github.com/IGVF-UCSD/single_cell_utilities/tree/main/data_wrangling. Note that because many formats are tool specific, this code is mainly organized by tool.
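If you just need something quick and conservative (rather than the data_wrangling utilities above), one option is to dump the AnnData to plain `mtx`/`tsv` files that R can read with `Matrix::readMM` and `read.delim`; the paths below are placeholders:

```python
# Conservative conversion sketch: export an AnnData to plain mtx/tsv files so
# R can load the counts and metadata without lossy in-memory converters.
from pathlib import Path

import anndata as ad
import scipy.io
import scipy.sparse

adata = ad.read_h5ad("processed/08Sep23/sample1.h5ad")  # placeholder path
out = Path("processed/08Sep23/for_R")
out.mkdir(parents=True, exist_ok=True)

scipy.io.mmwrite(str(out / "counts.mtx"), scipy.sparse.csr_matrix(adata.X))
adata.obs.to_csv(out / "obs.tsv", sep="\t")  # cell-level metadata
adata.var.to_csv(out / "var.tsv", sep="\t")  # feature-level metadata
```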