Python script designed to streamline bioinformatics analysis and facilitate data extraction from EzBioCloud
EzBioCloud is a public bioscience data and analytics portal focusing on the taxonomy, ecology, genomics, metagenomics, and microbiome of Bacteria and Archaea.
Unfortunately, EzBioCloud does not provide any API keys. This script therefore automates large-scale microbiome analysis with a different approach: an automated Chrome WebDriver controlled through Selenium.
The program downloads the key data from EzBioCloud in an ordered way and extracts specific values, e.g. total valid reads, percentage of valid reads, species and their percentages.
First, the script asks the user for:
- the path where the experiment folder should be created, and the experiment name,
- the login and password for EzBioCloud,
- all sample IDs.

The fastq files of all samples must already be uploaded to EzBioCloud.
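
The input step above could be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the names `RunConfig`, `parse_sample_ids` and `ask_user`, the prompt texts, and the comma/space-separated ID format are all assumptions made for the example.

```python
from dataclasses import dataclass
from getpass import getpass
from pathlib import Path


@dataclass
class RunConfig:
    """Hypothetical container for everything the user is asked for."""
    experiment_dir: Path      # <given path>/<experiment name>
    login: str
    password: str
    sample_ids: list[str]


def parse_sample_ids(raw: str) -> list[str]:
    """Split a comma- or whitespace-separated string into unique sample IDs."""
    ids: list[str] = []
    for token in raw.replace(",", " ").split():
        if token not in ids:  # drop duplicates, keep the original order
            ids.append(token)
    return ids


def ask_user(ask=input, ask_secret=getpass) -> RunConfig:
    """Collect the run parameters; `ask` and `ask_secret` can be swapped out in tests."""
    base_path = Path(ask("Path for the experiment folder: ").strip())
    name = ask("Experiment name: ").strip()
    login = ask("EzBioCloud login: ").strip()
    password = ask_secret("EzBioCloud password: ")  # not echoed to the terminal
    sample_ids = parse_sample_ids(ask("Sample IDs (comma or space separated): "))
    if not sample_ids:
        raise ValueError("at least one sample ID is required")
    return RunConfig(base_path / name, login, password, sample_ids)
```

Reading the password with `getpass` keeps it out of the terminal history; whether the original script does this is not stated.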
The WebDriver opens EzBioCloud, logs in and searches for the first given sample ID.
The `.xlsx` files and `.png` charts for genus and species are downloaded and moved into the given folder.
Because changing the download folder location in Chrome through the Selenium WebDriver is problematic, the files are first downloaded into the user's default Downloads folder and then renamed and moved into the given folder. You can change the path of the download folder here:
`source_folder = r'C:\Users\Asus\Downloads'`
Remember that the download folder MUST be empty.
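
A minimal sketch of this download-then-move step is shown below. It also illustrates why the folder has to be empty: the freshly downloaded file is identified as the only file present. The function names, the timeout values and the naming scheme (`new_stem`) are assumptions for the example, not the script's real interface.

```python
import shutil
import time
from pathlib import Path


def wait_for_download(source_folder: Path, timeout: float = 60.0, poll: float = 0.5) -> Path:
    """Return the single finished file in source_folder, waiting up to `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        files = [p for p in source_folder.iterdir() if p.is_file()]
        # Chrome writes to a *.crdownload file while a download is still in progress.
        finished = [p for p in files if p.suffix != ".crdownload"]
        if len(files) == 1 and len(finished) == 1:
            return finished[0]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no single finished download in {source_folder}")
        time.sleep(poll)


def move_download(source_folder: Path, target_folder: Path, new_stem: str) -> Path:
    """Move the downloaded file into target_folder as <new_stem><original extension>."""
    downloaded = wait_for_download(source_folder)
    target_folder.mkdir(parents=True, exist_ok=True)
    destination = target_folder / f"{new_stem}{downloaded.suffix}"
    shutil.move(str(downloaded), str(destination))
    return destination
```

For example, `move_download(Path(source_folder), experiment_dir / sample_id, f"{sample_id}_genus")` would leave the download folder empty again for the next file (`experiment_dir` and `sample_id` are placeholders here).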
Sample file after this step:
The total valid reads and percentage of valid reads values are read and an `INFO.txt` file is created.
The main goal is to create a single `details.xlsx` file for every sample, based on the downloaded files and the EzBioCloud app. The Excel sheet lists all microbiome genera sorted by percentage and adds a separate `Details` column with the species detected in the sample for each genus.
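
The grouping behind the `Details` column might look like the sketch below. It works on plain dictionaries rather than on the downloaded `.xlsx` files, and it assumes that a species belongs to the genus named by the first word of its name; both are simplifications for illustration, and `build_details` is a hypothetical name. The `threshold` parameter is the percentage cut-off, defaulting to the script's 1%.

```python
def build_details(genus_pct: dict[str, float],
                  species_pct: dict[str, float],
                  threshold: float = 1.0) -> list[tuple[str, float, str]]:
    """Return (genus, percentage, details) rows sorted by descending genus percentage."""
    rows = []
    for genus, pct in sorted(genus_pct.items(), key=lambda item: item[1], reverse=True):
        if pct <= threshold:
            continue  # genera at or below the cut-off are not reported
        # Assumption: the first word of a species name is its genus.
        members = [(name, p) for name, p in species_pct.items()
                   if name.partition(" ")[0] == genus and p > threshold]
        members.sort(key=lambda item: item[1], reverse=True)
        details = "; ".join(f"{name} ({p:.2f}%)" for name, p in members)
        rows.append((genus, pct, details))
    return rows
```

Each returned tuple corresponds to one row of the final sheet; writing the rows to Excel is left out of the sketch.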
The threshold is set to 1%: only genera and species with a percentage above 1% are processed and shown in the final Excel sheet.

#BEFORE
- Genus file example:
...
- Species file example:
...

#AFTER
- Output `details.xlsx` file example (final Excel sheet):
During alignment, EzBioCloud sometimes assigns reads to a taxonomic group instead of a specific species. A taxonomic group is defined as a group of taxa (species/subspecies) that cannot be differentiated solely by 16S rRNA sequences. A typical example is the case of Escherichia coli and Shigella spp., which show almost identical 16S rRNA sequences. It is safer to identify such 16S rRNA sequences as a member of a species group that contains very similar 16S rRNA sequences, rather than to potentially wrongly assign them as E. coli. For example:
In this situation, contig data is used to show the most likely species (a contig is a set of identical or overlapping sequences that together represent a consensus region of DNA). The WebDriver performs the following steps:
- Find the taxonomic group in the EzBioCloud `Taxonomic hierarchy`:
- Take the top hit of the first contig
- Compare the similarity percentages of all 5 `Hit Species Name` entries:
In the example above, the first four species names are taken, written in an organized way together with the taxonomic group percentage, and added to the `details.xlsx` file:
Rules for extracting `Hit Species Name` entries:
- Take all `Hit Species Name` entries with 100% `Similarity`
- If there is no `Hit Species Name` with 100% `Similarity`, take the `Hit Species Name` entries with `Similarity` above 99%
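
The two rules above can be expressed as a small function. This is a sketch that assumes the hits have already been scraped into `(hit species name, similarity in percent)` pairs; `select_hit_species` is a hypothetical name and the similarity values in the usage example are made up.

```python
def select_hit_species(hits: list[tuple[str, float]]) -> list[str]:
    """Apply the extraction rules to (hit species name, similarity %) pairs."""
    # Rule 1: take every hit with exactly 100% similarity.
    exact = [name for name, similarity in hits if similarity == 100.0]
    if exact:
        return exact
    # Rule 2: no 100% hit exists, so fall back to hits above 99%.
    return [name for name, similarity in hits if similarity > 99.0]


# Illustrative values only, not real EzBioCloud output.
example_hits = [
    ("Escherichia coli", 100.0),
    ("Shigella flexneri", 100.0),
    ("Shigella sonnei", 99.8),
    ("Escherichia fergusonii", 99.1),
    ("Escherichia albertii", 98.0),
]
selected = select_hit_species(example_hits)
```

If no hit reaches either level, the function returns an empty list; how the original script handles that case is not described.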