We performed deep metagenomic sequencing of seven sediment samples collected adjacent to ferromanganese nodules from the Clarion–Clipperton Fracture Zone (CCFZ) in the eastern Pacific Ocean. Here we provide every step of the analysis in this study, including the code and the generated tables and figures. The files are described below in the order of the analysis workflow.
- URL where sequencing data can be downloaded.
- Bash script for metagenomic assembly.
- Directory containing the assembled results for each sample. (The files are not uploaded due to their size, but you can run the above bash script to get them.)
- Bash script to evaluate assembly quality.
- Directory containing the results of the assembly quality for each sample.
- Merged file of the results of the assembly quality.
- Bash script to filter out contigs shorter than 1,000 bp (1 kb) from each assembly.
- Directory containing filtered assemblies. (The files are not uploaded due to their size, but you can run the above bash script to get them.)
- Bash script for predicting ORFs on filtered contigs.
- Directory containing predicted results for ORFs. (The files are not uploaded due to their size, but you can run the above bash script to get them.)
- The HMM for the alignment of 2249 rpS3 marker sequences.
- Bash script for identifying rpS3 genes.
- Python script for extracting rpS3 protein sequences.
- Directory containing the results of the identification of the rpS3 gene.
- Merged fasta file of all rpS3 protein sequences from 7 samples.
- Bash script for clustering all rpS3 protein sequences.
- Directory containing the results of rpS3 clustering. (These files are not uploaded because there are too many of them, but you can run the above bash script to get them.)
- Representative sequences for each rpS3 protein cluster.
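
The contig length filter listed above can be illustrated with a minimal Python sketch. This is not the Bash script from this repository; the function names `read_fasta` and `filter_contigs` are hypothetical, and the length threshold is left as a parameter.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(lines: Iterable[str], min_length: int) -> Dict[str, str]:
    """Keep only contigs whose sequence is at least min_length bases long."""
    return {h: s for h, s in read_fasta(lines) if len(s) >= min_length}


# Toy input (made up for illustration): contig_1 has 8 bases, contig_2 has 2.
example = [">contig_1", "ACGT", "ACGT", ">contig_2", "AC"]
kept = filter_contigs(example, min_length=5)
print(sorted(kept))  # ['contig_1']
```

For real assemblies the same function can be fed an open file handle, since it only iterates over lines.
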
-
- Merged fasta file containing our 2267 rpS3 representative sequences and the 2249 rpS3 reference sequences.
- Bash script to build a phylogenetic tree.
- The tree in newick format.
- Taxonomic classification of rpS3 SGs at the phylum level.
-
- Python script to extract the longest contig in each rpS3 protein cluster.
- Names of all contigs encoding rpS3 proteins.
- Information about the longest contig in each rpS3 protein cluster.
- Fasta file of all longest contigs.
- Directory containing indexed files for 'long_contig.fasta'.
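
Selecting the longest contig in each rpS3 protein cluster could look roughly like the sketch below. It assumes the cluster membership and the contig lengths have already been parsed into dictionaries; the function name, cluster IDs and contig names are hypothetical, and this is not the script from this repository.

```python
from typing import Dict, List


def longest_contig_per_cluster(
    clusters: Dict[str, List[str]], contig_lengths: Dict[str, int]
) -> Dict[str, str]:
    """Return, for each cluster, the name of its longest member contig."""
    return {
        cluster: max(contigs, key=lambda name: contig_lengths[name])
        for cluster, contigs in clusters.items()
        if contigs  # skip empty clusters
    }


# Toy data (made up for illustration).
clusters = {"SG_1": ["c1", "c2"], "SG_2": ["c3"]}
contig_lengths = {"c1": 1500, "c2": 4200, "c3": 2300}
longest = longest_contig_per_cluster(clusters, contig_lengths)
print(longest)  # {'SG_1': 'c2', 'SG_2': 'c3'}
```
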
-
- Bash script for mapping reads to the longest contigs.
- Bash script to calculate the coverage of each contig.
- Directory containing the results of the coverage calculation.
- Merged file of the coverage of each contig in each sample.
- Python script for mapping the coverage of each contig to the corresponding rpS3 protein cluster.
- The coverage of each rpS3 protein cluster in each sample.
- Relative abundance of each rpS3 protein cluster, calculated from the coverage data in Excel.
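
The relative abundance calculation was done in Excel and the exact formula is not given here. As an assumption, the sketch below takes the relative abundance of a cluster in a sample to be its coverage divided by the summed coverage of all clusters in that sample, expressed as a percentage. The function name and the toy values are hypothetical.

```python
from typing import Dict


def relative_abundance(
    coverage: Dict[str, Dict[str, float]]
) -> Dict[str, Dict[str, float]]:
    """Convert coverage[sample][cluster] into percentages within each sample.

    Assumption: relative abundance = cluster coverage / total coverage * 100.
    """
    result = {}
    for sample, per_cluster in coverage.items():
        total = sum(per_cluster.values())
        result[sample] = {
            cluster: (100.0 * value / total if total > 0 else 0.0)
            for cluster, value in per_cluster.items()
        }
    return result


# Toy coverage table (made up for illustration).
abundance = relative_abundance({"S1": {"SG_1": 30.0, "SG_2": 10.0}})
print(abundance)  # {'S1': {'SG_1': 75.0, 'SG_2': 25.0}}
```
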
-
- R script for visualization of Fig. S2a.
- Input file for the R script above.
- PDF file of Fig. S2a.
- R script for visualization of Fig. S2b.
- Input file for the R script above.
- PDF file of Fig. S2b.
- The combined figure generated using Adobe Illustrator.
- The TIFF format of Fig. S2.
-
- Directory containing indexed files for metagenomic assemblies. (The files are not uploaded due to their size, but you can run the bash script in the directory to get them.)
- Bash script to generate sequences map files.
- Directory containing sequences map files. (The files are not uploaded due to their size, but you can run the above bash script to get them.)
- Bash script for metagenome binning.
- Directory containing the results of coverage profiles. (The files are not uploaded due to their size, but you can run the above bash script to get them.)
- Directory containing MAGs for each sample. (The files are not uploaded due to their size, but you can run the above bash script to get them.)
- Bash script for refining MAGs.
- Directory containing the results of refining MAGs. (The files are not uploaded due to their size, but you can run the above bash script to get them.)
- Directory containing all refined MAGs. (The files are not uploaded due to their size, but you can put all refined MAGs in one directory based on the above results.)
- Bash script to deduplicate all refined MAGs.
-
- Directory containing 179 high-quality nonredundant MAGs. (The files are not uploaded due to their size, but you can run the above bash script to get them.)
- Bash script to rename the 179 MAGs.
- Directory containing 179 MAGs after changing the names. (The files are not uploaded due to their size, but you can run the above bash script to get them.)
- Python script to change the contig names of MAGs.
- Directory containing 179 MAGs after changing the contig names.
- Bash script for taxonomic classification of 179 MAGs.
- Directory containing results of classification.
-
- Information of rpS3 protein clusters.
- Taxonomic classification of MAGs at the phylum level.
- Python script to check the GTDB-based classification against the rpS3-based classification.
- The result of running the Python script above.
- R script for visualization of Fig. S3.
- Input file for the R script above.
- PDF file of Fig. S3. (The image is the output of the R script, retouched using Adobe Illustrator.)
- The TIFF format of Fig. S3.
- Bash script to evaluate assembly quality of 179 MAGs.
- Directory containing the results of the assembly quality.
- Information of all nonredundant MAGs.
- Directory containing indexed files for merged 179 MAGs. (The files are not uploaded due to their size, but you can run the bash script in the directory to get them.)
- Bash script for mapping reads from each individual metagenome to the 179 MAGs.
- Directory containing the results of mapping. (The files are not uploaded due to their size, but you can run the above bash script to get them.)
- Bash script to calculate the relative abundance of each MAG in each sample.
- Directory containing the results of the abundance calculation.
- Merged file of the relative abundance of each MAG in each sample.
-
- R script for visualization of the relative abundance of these MAGs.
- Input file for the R script above.
- Stacked bar chart showing the relative abundance of these MAGs in the seven samples.
- Sequence size of each sample.
- Visualization of the sequence size of each sample, made with Tableau.
- Number of MAGs for each sample.
- Visualization of the number of MAGs for each sample, made with Tableau.
- Bash script to construct the archaeal and bacterial phylogenetic trees.
- Archaeal tree in newick format.
- Bacterial tree in newick format.
- Visualization of the archaeal tree using iTOL.
- Visualization of the bacterial tree using iTOL.
- PDF file of Fig. 1. (The image was generated by combining the above pictures using Adobe Illustrator.)
- The TIFF format of Fig. 1.
-
- Bash script to predict genes of the 179 MAGs.
- Directory containing the results of gene prediction. (The files are not uploaded due to their size, but you can run the above bash script to get them.)
-
- Bash script to identify rRNAs of MAGs.
- Directory containing the results of rRNA identification.
- Bash script to count the number of rRNAs per MAG.
- Table of the number of rRNAs per MAG.
-
- Bash script to identify tRNAs of MAGs.
- Directory containing the results of tRNA identification.
- Bash script to count the number of tRNAs per MAG.
- Table of the number of tRNAs per MAG.
-
- Taxonomy at the phylum level of each MAG.
- The mean relative abundance of each MAG.
- Bash script to identify proteins for iron redox reactions.
- Annotation results, filtered from the output of the above script.
- Custom blastp databases for redox reactions of other metals.
- Bash script to identify proteins for redox reactions of other metals.
- Directory containing the results of running the bash script above.
- Python script for collating the above results.
- Statistical result for annotations on redox reactions of other metals. (Because some query proteins matched both metal redox proteins and metal transport proteins, we manually corrected the output of the Python script above.)
- The Transporter Classification Database (TCDB).
- Bash script to annotate membrane metal transport proteins.
- Directory containing the results of running the bash script above.
- Python script for collating the above results.
- Statistical result for annotations on metal transport. (Because some query proteins matched both metal redox proteins and metal transport proteins, we manually corrected the output of the Python script above.)
- Families of metal transporters.
-
- IDs of all proteins involved in metal transport and redox reactions.
- Python script to get the fasta files of each protein above.
- Bash script for re-annotating these proteins using blastp against the NR database. (The NR database can be downloaded from NCBI.)
- Directory containing the fasta files of each protein and the corresponding annotation results. (These files are not uploaded because there are too many of them, but you can run the above Python and Bash scripts to get them.)
- Python script for collating the annotation results from the NR database.
- The result of the above Python script.
- Checked annotations involved in metal transport and redox reactions.
- Correctly annotated gene IDs.
- Information of MAGs containing genes before annotation checking.
- Python script to get information of MAGs containing correctly annotated genes.
- Information of MAGs containing correctly annotated genes.
- Information of TCDB's annotation in the 179 MAGs.
- Information of metal redox annotation in the 179 MAGs.
- The list of proteins involved in iron oxidation and iron reduction.
-
- List-form annotation indicating whether each MAG has each function.
- Visualization, made with Tableau, of the composition of MAGs that contain the genes encoding the related proteins.
- PDF file output by Tableau.
- PDF file of Fig. 2. (These images were generated by combining the pictures output by Tableau using Adobe Illustrator.)
- The TIFF format of Fig. 2.
- PDF file of Fig. S5. (These images were generated by combining the pictures output by Tableau using Adobe Illustrator.)
- The TIFF format of Fig. S5.
-
- Bash script to identify carbohydrate degradation enzymes (CAZymes). (The 'dbCAN-HMMdb-V8.txt' can be downloaded from https://bcb.unl.edu/dbCAN2/download/.)
- Directory containing the results of running the bash script above.
- Python script for collating the above results.
- Statistical result for annotations of CAZymes.
- The above statistical result with the classification and abundance of each MAG added.
- Python script to convert the above result from matrix form to list form.
- Statistical result for annotations of carbohydrate degradation enzymes in list form.
- Taxonomic classification of MAGs at the phylum level.
- Python script to summarize the percentage of genomes within each phylum that contain CAZymes.
- The result of running the Python script above.
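
Summarizing the percentage of genomes within each phylum that contain a given CAZyme class can be sketched as follows. This is only an illustration under assumed inputs (a MAG-to-phylum mapping and a MAG-to-classes mapping); the function name and the toy data are hypothetical and do not come from this repository.

```python
from collections import defaultdict
from typing import Dict, Set


def percent_by_phylum(
    mag_phylum: Dict[str, str], mag_features: Dict[str, Set[str]]
) -> Dict[str, Dict[str, float]]:
    """For each phylum, give the percentage of its MAGs carrying each feature."""
    totals = defaultdict(int)  # number of MAGs per phylum
    hits = defaultdict(lambda: defaultdict(int))  # MAGs per phylum per feature
    for mag, phylum in mag_phylum.items():
        totals[phylum] += 1
        for feature in mag_features.get(mag, set()):
            hits[phylum][feature] += 1
    return {
        phylum: {f: 100.0 * n / totals[phylum] for f, n in hits[phylum].items()}
        for phylum in totals
    }


# Toy data (made up for illustration).
mag_phylum = {"MAG_1": "Phylum_A", "MAG_2": "Phylum_A", "MAG_3": "Phylum_B"}
mag_features = {"MAG_1": {"GH", "GT"}, "MAG_2": {"GT"}, "MAG_3": set()}
result = percent_by_phylum(mag_phylum, mag_features)
print(result["Phylum_A"]["GH"])  # 50.0
```

Each MAG is counted once per class regardless of how many genes of that class it carries, because the per-MAG input is a set.
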
-
- R script to convert the file 'statistic_phylum_CAZy.csv' to list form.
- The result of running the R script above. (This file is the input file for Tableau.)
- Visualization of the above result using Tableau.
- PDF file output by Tableau.
- Number of CAZymes per MAG.
- R script for visualization of Fig. 3b.
- PDF file of Fig. 3b.
- PDF file of Fig. 3. (The image was generated by combining the above pictures using Adobe Illustrator.)
- The TIFF format of Fig. 3.
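
The matrix-to-list conversions mentioned above (done with Python and R scripts in this repository) amount to reshaping a wide table into one row per (row ID, column, value). A minimal Python sketch, assuming a comma-separated matrix whose first column holds the row IDs; the function name and the toy table are hypothetical:

```python
import csv
import io
from typing import List, Tuple


def matrix_to_list(matrix_text: str) -> List[Tuple[str, str, str]]:
    """Reshape a CSV matrix into (row_id, column_name, value) records."""
    reader = csv.reader(io.StringIO(matrix_text))
    header = next(reader)
    columns = header[1:]  # the first header cell labels the row IDs
    records = []
    for row in reader:
        if not row:
            continue
        row_id, values = row[0], row[1:]
        for column, value in zip(columns, values):
            records.append((row_id, column, value))
    return records


# Toy matrix (made up for illustration).
matrix = "phylum,GH,GT\nPhylum_A,50,100\nPhylum_B,0,25\n"
for record in matrix_to_list(matrix):
    print(record)
```

The list form is convenient as input for plotting tools that expect one observation per row.
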
-
- The annotations for each functional protein encoded by the MAGs were obtained from the KEGG Orthology (KO) annotations predicted using eggNOG-mapper, as described in the methodology.
- Python script to summarize the number of genomes within each phylum that contain proteins involved in the metabolism of small carbon compounds.
- The list of the presence or absence of enzymes involved in the metabolism of small carbon compounds in the 179 MAGs. (The input file of the above Python script.)
- The result of the above Python script.
- Sorted result in Excel.
- The color corresponding to each phylum in the visualization.
- Python script to get R scripts that visualize the number of MAGs containing each gene.
- Directory containing all R scripts, input and output results.
- PDF file of Fig. S6. (The images were generated by combining pictures from the above directory using Adobe Illustrator.)
- The TIFF format of Fig. S6.
- Python script to generate Table S11.
- The matrix for the presence or absence of enzymes involved in the metabolism of small carbon compounds in the 179 MAGs.
-
- NCycDB and SCycDB. (The files are not uploaded due to their size, but you can download them from GitHub and run the Python and Bash scripts in the directory to organize and merge them.)
- Bash script to annotate proteins for nitrogen and sulfur cycling.
- Directory containing the results of running the bash script above.
- Bash script for re-annotating these proteins using eggNOG-mapper v2.1.2.
- Directory containing the results of running the bash script above.
- Merged file of the above results. (The files are not uploaded due to their size, but you can use the 'cat' command to merge the above results.)
- Python script for collating the results annotated using NCycDB, SCycDB and eggNOG-mapper.
- The result of the above Python script.
- Checked annotations involved in nitrogen and sulfur metabolism.
- Correct annotations.
-
- Python script to change the format of the annotation file. We count only whether a MAG contains a functional protein, regardless of copy number. Because MAGs are often incomplete owing to the limitations of metagenomic technology, a MAG is considered to have the function of an operon if it is predicted to encode at least one of that operon's proteins. It is worth noting that in many cases the MAGs contain complete operons.
- The result of the above Python script.
- R script to convert the file 'ncyc_scyc_bin_table_all.txt' to list form.
- The result of running the R script above.
- Python script to summarize the number of genomes within each phylum that contain nitrogen and sulfur metabolic proteins.
- The result of running the Python script above.
- Sorted result in Excel.
- The color corresponding to each phylum in the visualization.
- Python script to get R scripts that visualize the number of MAGs containing each gene.
- Directory containing all R scripts, input and output results.
- PDF file of Fig. 4. (The images were generated by combining pictures from the above directory using Adobe Illustrator.)
- The TIFF format of Fig. 4.
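
The counting rule described above for nitrogen and sulfur metabolism (presence only, copy number ignored, and an operon counted as present when at least one of its proteins is found) can be sketched in Python as follows. The function name, the operon groupings and the toy annotations are assumptions made for illustration, not the contents of the script in this repository.

```python
from typing import Dict, Iterable, Set, Tuple


def function_presence(
    annotations: Iterable[Tuple[str, str]], operons: Dict[str, Set[str]]
) -> Dict[str, Dict[str, int]]:
    """Return 1/0 per MAG and function; any operon member counts as present."""
    genes_by_mag: Dict[str, Set[str]] = {}
    for mag, gene in annotations:
        # A set discards repeated hits, so copy number is ignored.
        genes_by_mag.setdefault(mag, set()).add(gene)
    return {
        mag: {
            function: int(bool(genes & members))
            for function, members in operons.items()
        }
        for mag, genes in genes_by_mag.items()
    }


# Toy annotations (made up for illustration); narG appears twice in MAG_1.
annotations = [
    ("MAG_1", "narG"),
    ("MAG_1", "narG"),
    ("MAG_1", "narH"),
    ("MAG_2", "dsrA"),
]
operons = {"narGHI": {"narG", "narH", "narI"}, "dsrAB": {"dsrA", "dsrB"}}
presence = function_presence(annotations, operons)
print(presence["MAG_1"])  # {'narGHI': 1, 'dsrAB': 0}
```
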
-
- Sequences of the MnxG and MoxA proteins.
- Conserved domains of the MnxG and MoxA proteins detected using NCBI's CD-Search tool.
- Visualization of the conserved domains using IBS.
- PDF file output by IBS and retouched using Adobe Illustrator.
- The TIFF format of Fig. S4.
-
- IDs of all proteins annotated in our study.
- Python script to build a dataset of 179 directories, one per MAG. Each directory contains a FASTA file of genome sequences (_genome.fna), a FASTA file of predicted protein sequences (_proteins.faa), and a tab-delimited file of protein functional annotations (*_proteins_annotation.txt). (These files were uploaded to https://doi.org/10.5281/zenodo.7699702)
- IDs of 18 high relative abundance MAGs.
- Python script to get the metal transport and redox profile of these 18 MAGs.
- The result of running the Python script above.
- Python script to get the nitrogen and sulfur metabolic profile of these 18 MAGs.
- The result of running the Python script above.
- Python script to get the carbohydrate degradation profile of these 18 MAGs.
- The result of running the Python script above.
-
- The mean relative abundance and taxonomy of each MAG.
- R script for visualization of Fig. 1c.
- PDF file of Fig. 1c.
- Information of 18 high relative abundance MAGs.
-
- All functional profiles of these 18 MAGs (manually checked).
- Overview visualization of the metabolic functions of these MAGs using Cytoscape.
- PDF file output by Cytoscape.
- Directory containing the histograms that show the total number of detected and undetected genes for each functional category in the 18 MAGs.
- PDF file of Fig. 5. (The images were generated by combining the above pictures using Adobe Illustrator.)
- The TIFF format of Fig. 5.
-
- IDs of the remaining MAGs other than the dominant MAGs.
- Organized annotations involved in metal transport.
- Organized annotations involved in metal redox.
- Organized annotations involved in nitrogen metabolism.
- Organized annotations involved in sulfur metabolism.
- Organized annotations involved in carbohydrate degradation.
- Python script to randomly select 18 MAGs from the remaining MAGs in each of 100 replicates and calculate the number of genes for each functional category.
- The result of running the Python script above.
- R script for visualization of the number of genes for each functional category among the remaining MAGs.
- The result of running the R script above.
- The table comparing the number of genes in each functional profile of dominant MAGs and remaining MAGs.
- PDF file of Fig. S7. (The images were generated by combining the above picture and table using Adobe Illustrator.)
- The TIFF format of Fig. S7.
- All microbial-dominated ecological functions.
- Python script to calculate the width of the lines representing the total relative abundance of key and other MAGs performing each function.
- The result of running the Python script above.
- PDF file of Fig. 6. (The elements of the image were drawn using Adobe Illustrator based on the above data.)
- The TIFF format of Fig. 6.
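
The resampling step listed above for Fig. S7 (18 MAGs drawn from the remaining MAGs, repeated 100 times) can be sketched as below. This is an illustration only: the function name is hypothetical, sampling without replacement within each replicate is an assumption, and the per-MAG gene counts are made up.

```python
import random
from typing import Dict, List


def resample_gene_counts(
    gene_counts: Dict[str, int],
    sample_size: int = 18,
    replicates: int = 100,
    seed: int = 0,
) -> List[int]:
    """Sum gene counts over random MAG subsets, one total per replicate."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    mags = sorted(gene_counts)
    totals = []
    for _ in range(replicates):
        # Assumption: MAGs are drawn without replacement within a replicate.
        chosen = rng.sample(mags, sample_size)
        totals.append(sum(gene_counts[mag] for mag in chosen))
    return totals


# Toy counts for 161 MAGs (179 minus the 18 dominant MAGs); values are made up.
gene_counts = {f"MAG_{i}": i % 5 for i in range(1, 162)}
totals = resample_gene_counts(gene_counts)
print(len(totals), min(totals), max(totals))
```

Running this once per functional category gives a distribution of totals that can be compared with the count observed in the 18 dominant MAGs.
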