GoekeLab/sg-nex-data

The Singapore Nanopore-Expression Project

The SG-NEx project is an international collaboration initiated at the Genome Institute of Singapore to provide reference transcriptomes for 5 of the most commonly used cancer cell lines using Nanopore long-read RNA-Seq data:

[Figure: The Singapore Nanopore-Expression Project - Design]

Transcriptome profiling is done using PCR-cDNA sequencing ("PCR-cDNA"), amplification-free cDNA sequencing ("direct cDNA"), direct sequencing of native RNA ("direct RNA"), and short-read RNA-Seq. All samples are sequenced with at least 3 high-quality replicates. For a subset of samples, spike-in RNAs are included and matched m6A profiling is available.

The raw, aligned, and processed data is hosted on the AWS Open Data Registry (see below for data access and analysis tutorials).

Sign up for data release notifications and updates

You can sign up for the sg-nex-updates email list to receive notifications about upcoming data releases:

https://groups.google.com/forum/#!forum/sg-nex-updates/join

Data Release and Access

Latest Release (v0.6)

This release includes 113 samples from 13 different cell lines.

Data Access

You can access the following data through the AWS Open Data Registry:

  • raw files (fast5)
  • raw files (blow5)
  • basecalled files (fastq)
  • aligned reads (genome and transcriptome) (bam)
  • tracks for visualisation (bigwig and bigbed)
  • processed data for differential RNA modification analysis (json, for use with xPore)
  • processed data for identification of m6A (json, for use with m6Anet)
  • annotation files
  • detailed sample and experiment information

You can browse the S3 data here: 1) fast5, fastq, and bam and 2) blow5.

Please refer to the data access tutorial, which describes the S3 data structure and how to access files with the AWS CLI. The direct links to the data are listed in the sample spreadsheet.
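
As a rough illustration of AWS CLI access, the sketch below builds the list and copy commands in Python. The bucket name `sg-nex-data` is an assumption taken from the registry address (registry.opendata.aws/sg-nex-data), and the helper names are hypothetical; the data access tutorial remains the reference for the actual bucket layout and prefixes. `--no-sign-request` is the AWS CLI option for anonymous requests, which open data buckets usually allow.

```python
import shlex
import subprocess
from typing import List

# Assumption: bucket name inferred from registry.opendata.aws/sg-nex-data.
# Check the data access tutorial for the real bucket and prefixes.
ASSUMED_BUCKET = "sg-nex-data"


def s3_uri(prefix: str = "", bucket: str = ASSUMED_BUCKET) -> str:
    """Build an s3:// URI for a prefix or key inside the bucket."""
    return f"s3://{bucket}/{prefix.lstrip('/')}"


def list_command(prefix: str = "", bucket: str = ASSUMED_BUCKET) -> List[str]:
    """AWS CLI command that lists objects under a prefix without credentials."""
    return ["aws", "s3", "ls", "--no-sign-request", s3_uri(prefix, bucket)]


def download_command(key: str, dest: str, bucket: str = ASSUMED_BUCKET) -> List[str]:
    """AWS CLI command that copies one object to a local path."""
    return ["aws", "s3", "cp", "--no-sign-request", s3_uri(key, bucket), dest]


def run(command: List[str]) -> str:
    """Run a command (needs the AWS CLI installed) and return its stdout."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch works offline.
    print(shlex.join(list_command()))
```

Calling `run(list_command())` would execute the listing; the key passed to `download_command` should be copied from the sample spreadsheet rather than guessed.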

Here are the locations for the spike-in concentrations used in SG-NEx samples:

Citation: Please cite the pre-print describing the SG-NEx data resource when using these data, and add the following details: "The SG-NEx data was accessed on [DATE] at registry.opendata.aws/sg-nex-data".

Chen, Y. et al. "A systematic benchmark of Nanopore long read RNA sequencing for transcript level analysis in human cell lines." bioRxiv (2021). doi: https://doi.org/10.1101/2021.04.21.440736

Release Note & Updates

Version Number: V0.6.0
Date: 2024-11-21
Replacement of fastq and bam files

  • fastq files basecalled with Guppy 6.4.2 from fast5-converted blow5 files
  • bam files generated from the updated fastq files with Minimap2 2.22

Added code for the SG-NEx manuscript

Version Number: V0.5.1
Date: 2024-04-15
Release of new sample

  • new RNA004 sample of Hek293T (SGNex_Hek293T_directRNA_replicate5_run1)
  • pod5, fastq, genome and transcriptome aligned bam files are included in this release
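
Sample names in these notes appear to follow the pattern `SGNex_<cell line>_<protocol>_replicate<N>_run<M>`. The sketch below parses that pattern; the regular expression is inferred from names such as SGNex_Hek293T_directRNA_replicate5_run1 and is an assumption, not an official naming specification. The function name `parse_sample_name` is hypothetical.

```python
import re
from typing import Dict, Optional, Union

# Assumed pattern, inferred from the sample names quoted in the release notes.
SAMPLE_NAME = re.compile(
    r"^SGNex_(?P<cell_line>[^_]+)_(?P<protocol>[^_]+)"
    r"_replicate(?P<replicate>\d+)_run(?P<run>\d+)$"
)


def parse_sample_name(name: str) -> Optional[Dict[str, Union[str, int]]]:
    """Split a sample name into its fields, or return None if it does not match."""
    match = SAMPLE_NAME.match(name)
    if match is None:
        return None
    return {
        "cell_line": match.group("cell_line"),
        "protocol": match.group("protocol"),
        "replicate": int(match.group("replicate")),
        "run": int(match.group("run")),
    }


if __name__ == "__main__":
    print(parse_sample_name("SGNex_Hek293T_directRNA_replicate5_run1"))
```

File extensions such as `.fastq.gz` are not part of the pattern, so strip them before parsing.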

Version Number: V0.5.0
Date: 2024-03-08
Release of new samples

  • direct RNA data for H9 and HEYA8 samples
  • cDNA and direct cDNA samples for H9 and HEYA8
  • PromethION cDNA samples of Hct116 using SQK-PCS110 (100 million reads on average)
  • cDNA sample of Hct116 using SQK-PCS111

Update of existing sample files

  • SGNex_MCF7_cDNAStranded_replicate2_run1.fastq.gz: removed additional characters before the @ of the first read
  • SGNex_K562_cDNAStranded_replicate3_run3.fastq.gz: added one “ character to the quality string on line 48000 so that it matches the sequence length
  • SGNex_A549_directRNA_replicate5_run1.tar.gz: updated, as the previous version was incomplete
  • SGNex_MCF7-EV_directRNA_replicate1_run1.fastq.gz: updated on ENA, as the previous file was a duplicate
  • SGNex_MCF7_directRNA_replicate2_run2: fixed with the command "zcat SGNex_MCF7_directRNA_replicate2_run2.fastq.gz | sed 's/.*@/@/g' | sed '$d' | gzip > SGNex_MCF7_directRNA_replicate2_run2_fixed.fastq.gz" (thanks to Alex)
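
The fixes above touch two kinds of FASTQ problems: stray characters before the @ of a header, and a quality string whose length differs from its sequence. The sketch below mirrors the `sed 's/.*@/@/g' | sed '$d'` step in Python and adds a simple length check. It is an illustration, not the procedure used for the release; the function names are hypothetical, and the check assumes plain four-line FASTQ records.

```python
import re
from typing import Iterable, List, Tuple


def strip_before_at(lines: Iterable[str]) -> List[str]:
    """Mirror `sed 's/.*@/@/g' | sed '$d'` on lines without trailing newlines.

    On every line, the text up to and including the last '@' is replaced
    by a single '@'; the final line is then dropped.
    """
    cleaned = [re.sub(r".*@", "@", line) for line in lines]
    return cleaned[:-1]


def length_mismatches(lines: List[str]) -> List[Tuple[int, str]]:
    """Find records whose quality length differs from the sequence length.

    Returns (1-based line number of the header, header text) per bad record.
    Assumes four lines per record: header, sequence, '+', quality.
    """
    problems = []
    for start in range(0, len(lines) - 3, 4):
        header, sequence, _plus, quality = lines[start:start + 4]
        if len(sequence) != len(quality):
            problems.append((start + 1, header))
    return problems


if __name__ == "__main__":
    raw = ["xx@read1", "ACGT", "+", "IIII", "@read2", "ACGT", "+", "III", "junk"]
    fixed = strip_before_at(raw)
    print(fixed)
    print(length_mismatches(fixed))
```

As in the original command, the substitution runs on every line. Quality strings can legitimately contain @, so a variant that only rewrites header lines may be safer for other files.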

Version Number: V0.4.0
Date: 2023-03-06
Update of the SG-NEx data on AWS. Includes raw signal data in blow5 format.

Version Number: V0.3.0
Date: 2022-07-28
Initial release of the SG-NEx data on AWS. Includes Nanopore direct RNA, cDNA, direct cDNA-Seq, short read RNA-Seq and m6ACE-Seq.

Release History

You can find previous releases in the release history.

Browse the data

You can now browse the data using the UCSC genome browser:

View the SG-NEx data in the UCSC Genome Browser

By default only selected tracks are shown, but you can visualise all reads (bigbed tracks) and their coverage tracks (bigwig) from each individual sample.

Data Processing

All data was aligned against the human genome version GRCh38 (please refer to the data access tutorial for reference files). We collaborated with nf-core to develop nanoseq, a standardized pipeline for Nanopore RNA-Seq data processing.

Use Cases and Applications

You can browse a list of articles that review or use the SG-NEx data here. If you have used the data for your own research, feel free to add a publication entry.

Data Analysis Tutorials and Workflows

The following short tutorials are available that demonstrate how to analyse the SG-NEx data:

Additional, more detailed workflows can be found here:

Contributing

We welcome contributions from all long read RNA-seq tool developers! You may follow the steps below to contribute:

  • Fork this repository
  • Add your tutorial document to the docs folder
  • Add your tutorial workflow link to the Data Analysis Tutorials and Workflows section in README.md in this format: tutorial title
  • Submit a pull request.

Acknowledgements

GIS Sequencing Platform and Data Generation
Hwee Meng Low, Yao Fei, Sarah Ng, Wendy Soon, CC Khor

Cancer Genomics and RNA Modifications
Viktoriia Iakovleva, Puay Leng Lee, Lixia Xin, Hui En Vanessa Ng, Jia Min Loo, Xuewen Ong, Hui Qi Amanda Ng, Suk Yeah Polly Poon, Hoang-Dai Tran, Kok Hao Edwin Lim, Huck Hui Ng, Boon Ooi Patrick Tan, Huck-Hui Ng, N.Gopalakrishna Iyer, Wai Leong Tam, Wee Joo Chng, Leilei Chen, Ramanuj DasGupta, Yun Shen Winston Chan, Qiang Yu, Torsten Wüstefeld, Wee Siong Sho Goh

Statistical Modeling and Data Analytics
Ying Chen, Nadia M. Davidson, Yuk Kei Wan, Hasindu Gamaarachchi, Andre Sim, Harshil Patel, Min Hao Ling, Yu Song Chuah, Naruemon Pratanwanich, Christopher Hendra, Laura Watten, Chelsea Sawyer, Dominik Stanojevic, Philip Andrew Ewels, Andreas Wilm, Mile Sikic, Alexandre Thiery, Michael I. Love, Alicia Oshlack, Jonathan Göke

Citing the SG-NEx project

The SG-NEx resource is described in:

Chen, Ying, et al. "A systematic benchmark of Nanopore long read RNA sequencing for transcript level analysis in human cell lines." bioRxiv (2021). doi: https://doi.org/10.1101/2021.04.21.440736

Please cite this pre-print when using these data, and add the following details: "The SG-NEx data was accessed on [DATE] at registry.opendata.aws/sg-nex-data".

Contact

Questions about SG-NEx? Please add an entry in the Discussions Forum. You can also contact Jonathan Göke.