Skip to content

High-resolution strain-level microbiome composition analysis tool based on reference genomes and k-mers

License

Notifications You must be signed in to change notification settings

liaoherui/StrainScan

Repository files navigation

install with bioconda

StrainScan

One efficient, accurate and high-resolution strain-level microbiome composition analysis tool based on reference genomes and k-mers. StrainScan takes reference database and sequencing data as input, outputs strain-level microbiome compistion analysis report.

Contributor: Liao Herui and Ji Yongxin (Ph.D of City University of Hong Kong, EE), Nick Youngblut

Version: V1.0.14 (update at 2023-10-13)

Click here to check the log of all updates

[Update - 2022 - 05 - 01] :

  • V1.0.3: StrainScan can be installed via bioconda now!

[Update - 2022 - 06 - 07] :

  • V1.0.10: Add multuple threads to the reference database constrcution!

[Update - 2023 - 02 - 07] :

  • Two new intra-cluster searching modes are updated: plasmid_mode and extraRegion_mode.

[Update - 2023 - 04 - 22] :

  • StrainScan is able to take gzipped and PE FASTQs as input now!

[Update - 2023 - 05 - 04] :

  • StrainScan is able to take the custom clustering file generated by other clustering methods (e.g. PopPunk)! Additionally, we have made updates to the script (StrainScan_subsample.py) which now enables users to subsample their strains using hierarchical clustering.

[Update - 2023 - 05 - 15] :

  • Add a parameter "-b" that enables StrainScan to output the probability of detecting a strain in samples with low sequencing depth (e.g. <1X).

[Update - 2023 - 09 - 23] :

  • One database containing 1627 Staphylococcus aureus strains is publicly available!

[Update - 2023 - 09 - 29] :

  • V1.0.13: Update Bioconda version to latest GitHub version, which has more and new functions!!

[Update - 2023 - 10 - 03] :

  • One database containing 1124 Lactobacillus crispatus strains is publicly available!

[Update - 2023 - 10 - 13] :

  • V1.0.14: Fix a bug in identify.py about the identification of low-depth strains!

Overview of StrainScan:

strainscan_overview_new

Dependencies:

  • Python ==3.7.x
  • R
  • Sibeliaz ==1.2.2 (https://github.com/medvedevgroup/SibeliaZ)
  • Required python package: numpy==1.17.3, pandas==1.0.1, biopython==1.74, scipy==1.3.1, scikit-learn==0.23.1, bidict==0.21.3, treelib==1.6.1

Make sure these programs have been installed before using StrainScan.

Install (Linux or ubuntu only)

Option 1 - The first way to install StrainScan, is to use bioconda. Once you have bioconda environment installed, install package strainscan:

conda install -c bioconda strainscan

It should be noted that some commands have been replaced if you install StrainScan using bioconda. (See below)

Command (Not bioconda) Command (bioconda)
python StrainScan.py -h strainscan -h
python StrainScan_build.py -h strainscan_build -h

Option 2 - Also, yon can install StrainScan via Anaconda using the commands below:

git clone https://github.com/liaoherui/StrainScan.git
cd StrainScan

conda env create -f environment_candidate.yaml
conda activate strainscan

chmod 755 library/jellyfish-linux
chmod 755 library/dashing_s128

Option 3 - Or, you can install all dependencies of StrainScan mannually and then run the commands below.

git clone https://github.com/liaoherui/StrainScan.git
cd StrainScan

chmod 755 library/jellyfish-linux
chmod 755 library/dashing_s128

Pre-built databases download

The table below offers information about the pre-built databases of 6 bacterial species used in the paper and 2 additional bacterial species. Users can download these databases and use them to identify strains directly.

Species Source Number of Strains Number of Clusters Download link
Akkermansia muciniphila NCBI 157 53 Google drive
Cutibacterium acnes NCBI 275 28 Google drive
Prevotella copri NCBI 112 51 Google drive
Escherichia coli NCBI 1433 823 Google drive
Mycobacterium tuberculosis NCBI 792 25 Google drive
Staphylococcus epidermidis NCBI 995 378 Google drive
Staphylococcus aureus NCBI 1627 202 Google drive
Lactobacillus crispatus NCBI 1124 311 Google drive

You can also use bash scripts in the folder "Download_DB_script" to download the pre-built databases from Google drive. For example,

cd Download_DB_script
sh ecoli_db.sh

If you can not download databases from the google drive, you may try to download databases from the Baidu Netdisk.
Baidu Netdisk link: https://pan.baidu.com/s/1YFtHv2weJEBdwTz4LmTKOQ
Extraction code: ASDF

Usage

One example about database construction and identification commands can be found in "test_run.sh".

Use StrainScan to build your own custom database.

python StrainScan_build.py -i <Input_Genomes> -o <Database_Dir>

eg: python StrainScan_build.py -i Test_genomes -o DB_Small

(Note: input fasta can be gzipped format)

Use StrainScan to build your own custom database with custom clustering file.

python StrainScan_build.py -i <Input_Genomes> -c <Cluster_file> -o <Database_Dir>

The data format of the input clustering file can be found in the demo file Custom_cluster_demo/custom_cls.txt, where the first column is the cluster ID, the second column is the cluster size, and the last column is the prefix of the reference genomes in the cluster.

Use StrainScan_subsample to subsample your large-scale strains with high redundancy.

python StrainScan_subsample.py -i <Input_Genomes> -o <Output_Dir>

If you have large-scale strains with high redundancy and you want to remove the redundancy. Then you can use this script, which utilizes dashing and hierarchical clustering to subsample strains efficiently. The subsampled strains can be found in <Output_Dir>/Rep_ref and clustering information can be found in <Output_Dir>/Cls_res. More parameters can be found using python StrainScan_subsample.py -h.


Use StrainScan to identify bacterial strains in short reads.

python StrainScan.py -i <Input_reads> -d <Database_Dir> -o <Output_Dir>

eg: python StrainScan.py -i Sim_Data/GCF_003812785.fq -d DB_Small -o Test_Sim/GCF_003812785
or python StrainScan.py -i Sim_Data_mul/GCA_000144385_5X_GCF_008868325_5X.fq -d DB_Small -o Test_Sim/GCA_000144385_5X_GCF_008868325_5X

PE reads (can be gzipped FASTQ format)
python StrainScan.py -i GCF_003812785_1.fq.gz -j GCF_003812785_2.fq.gz -d DB_Small -o Test_Sim/GCF_003812785

Note: if the sequencing depth of targeted strains is super low (e.g., <1X), then you may get "Warning: No clusters can be detected!". In this case, you can use parameter "-b" to output the probability of detecting a strain (cluster) in low-depth samples. The higher the probability, the more likely the strain (cluster) is to be present.
eg: python StrainScan.py -i <Input_reads> -d <Database_Dir> -b 1 -o <Output_Dir>

Use StrainScan to identify plasmids of bacterial strains in short reads.

option-1: identify possible plasmids by using contigs <100000 bp:
python StrainScan.py -i <Input_reads> -d <Database_Dir> -p 1 -r <Ref_genome_Dir> -o <Output_Dir>

option-2: identify possible plasmids (or possible strains) using reference genomes provided by "-r".
python StrainScan.py -i <Input_reads> -d <Database_Dir> -p 2 -r <Ref_genome_Dir> -o <Output_Dir>

<Ref_genome_Dir> refer to the dir of reference genomes of identified clusters or all strains used to build the database.

Use StrainScan to identify bacterial strains in short reads under extraRegion_mode.

This mode will search possible strains and return strains with extra regions (could be different genes, SNVs or SVs to the possible strains) covered. If there is a novel strain not in the database, then its closest relative can be one specific strain while its partial regions (we call them "extraRegion" ) in the genome can be similar to other strains. In this case, this mode can search its closest relative and return strains with "extraRegion" covered for downstream analysis.

python StrainScan.py -i <Input_reads> -d <Database_Dir> -e 1 -o <Output_Dir>

Full command-line options

Identification - StrainScan.py (Default k-mer size: 31)

StrainScan - A kmer-based strain-level identification tool.

Example: python StrainScan.py -i  <Input_reads> -d <Database_Dir> -o <Output_Dir>

required arguments:
    -i, --input_fastq             Input fastq data.
    -j, --input_fastq_2		  Input fastq data (for pair-end data).
    -d, --database_dir            Path of StrainScan database.

optional arguments:
    -h, --help                    Show help message and exit.
    -o, --output_dir              The output directory. (Default: ./StrainScan_Result)
    -k, --kmer_size               The size of k-mer, should be odd number. (Default: k=31)
    -l, --low_dep                 This parameter can be set to "1" if the sequencing depth of input data is very low (e.g. < 5x). For super low depth ( < 1x ), you can use "-l 2" (default: -l 0)
    -b, --strain_prob		  If this parameter is set to 1, then the algorithm will output the probabolity of detecting a strain (or cluster) in low-depth (e.g. <1x) samples.  (default: -b 0)
    -p,	--plasmid_mode		  If this parameter is set to 1, the intra-cluster searching process will search possible plasmids using short contigs (<100000 bp) in strain genomes, which are likely to be plasmids. 
                                  If this parameter is set to 2, the intra-cluster searching process will search possible plasmids or strains using given reference genomes by "-r".
    				  Reference genome sequences (-r) are required if this mode is used. (default: -p 0)
    -r, --ref_genome		  The dir of reference genomes of identified cluster or all strains. If plasmid_mode is used, then this parameter is required.
    -e, --extraRegion_mode	  If this parameter is set to 1, the intra-cluster searching process will search possible strains and return strains with extra regions (could be different genes, SNVs or SVs to the possible strains) covered. (default: -e 0)
    -s, --minimum_snv_num         The minimum number of SNVs during the iterative matrix multiplication at Layer-2 identification. (Default: s=40)

Build database - StrainScan_build.py (Default k-mer size: 31)

StrainScan - A kmer-based strain-level identification tool.

Example:  python StrainScan_build.py -i <Input_genomes> -o <Database_Dir>

required arguments:
     -i, --input_fasta             The path of input genomes. ("fasta" format)
     
optional arguments:
     -o, --output_dir              The output directory of constructed database. (Default: ./StrainScan_DB)
     -c, --cls_file		   The custom clustering file provided by user.
     -k, --kmer_size               The size of k-mer, should be odd number. (Default: k=31)
     -t, --threads		   The threads used to build the database. (default: t=1)
     -u, --uk_num                  The maximum number of unique k-mers in each genome to extract. (Default: 100000)
     -g, --gk_ratio                The ratio of group-specific k-mers to extract. (Default: g=1.0)        
     -m, --strainest_sample        If this parameter is 1, then the program will search joint kmer sets from msa generated by Strainest. To use this parameter, you have to make sure Strainest can run normally. (Default: 0)
     -e, --memory_efficient	   If this parameter is 1, then the program will build the database with the memory efficient mode.
     -n, --mink_cutoff             Minimum k-mer number cutoff in a node of the cluster search tree (CST). (Default: n=1000)
     -x, --maxk_cutoff             Maximum k-mer number cutoff in a node of the cluster search tree (CST). (Default: x=30000)
     -r, --maxn_cutoff             Maximum cluster number for node reconstruction of the cluster search tree(CST). (Default: r=3000)

Output Format

The output of StrainScan contains two parts. The first part is the final identification report file in text format. This file contains all identified strains and their predicted depth and relative abundance, etc. The second part is the strain identification report files inside each cluster.

For your reference, two output files are given as example in the folder "Output_Example" in this repository. These files contain identification results of one single-strain and one two-strain (depth: 5X and 5X) simulated datasets, respectively.

Explaination about the columns in the final identification report file (E.g. "Output_Example/GCA_000144385_5X_GCF_008868325_5X/final_report.txt") of StrainScan.

Column_name Description
Strain_ID The numerical id of identified strains in the ascending order.
Strain_Name The name of identified strains. (In the example output, the name refers to the NCBI RefSeq accession id)
Cluster_ID The cluster id of identified strains. (For cluster information, users can check "<Database_Dir>/Cluster_Result/hclsMap_95_recls.txt")
Relative_Abundance_Inside_Cluster The predicted relative abundance of identified strains inside the cluster.
Predicted_Depth (Enet) The predicted sequencing depth of identified strains inside the cluster using elastic net model.
Predicted_Depth (Ab*cls_depth) The final predicted sequencing depth of identified strains.
Coverage The estimated k-mer-based coverage of identified strains.
Coverd/Total_kmr The number of "covered" and "total" k-mers of identified strains.
Valid_kmr The valid k-mer refers to the k-mer belonging to the identified strain during the iterative matrix multiplication. More valid k-mers there are, more likely this strain exist.
Remain_Coverage The coverage calculated by "covered" / "total" k-mers during the iterative matrix multiplication.
CV The number of "covered" and "valid" k-mers of identified strains.
Exist evidence By default, identified strains with "relative abundance > 0.02 and coverage >0.7" will be marked as "*". Strains with "*" are more likely to exist. However, for low-depth samples, this parameter may be not useful.

References:

how to cite this tool:

Liao, H., Ji, Y. & Sun, Y. High-resolution strain-level microbiome composition analysis from short reads. Microbiome 11, 183 (2023). https://doi.org/10.1186/s40168-023-01615-w