COSMO

Overview

COSMO: COrrection of Sample Mislabeling by Omics

Multi-omics Enabled Sample Mislabeling Correction

Installation

Download COSMO:

git clone https://github.com/bzhanglab/cosmo

Install Docker (>=19.03).
Install Nextflow. More information can be found on the Nextflow "Get started" page.

All other tools used by COSMO have been dockerized and will be installed automatically the first time COSMO is run on a computer.

Usage

nextflow run bzhanglab/COSMO --help
N E X T F L O W  ~  version 21.02.0-edge
Launching `cosmo.nf` [high_noyce] - revision: 46dc5c6c96
=========================================
COSMO => COrrection of Sample Mislabeling by Omics
=========================================
Usage:
nextflow run bzhanglab/COSMO
 Arguments:
   --d1_file               Dataset with quantification data at gene level.
   --d2_file               Dataset with quantification data at gene level.
   --cli_file              Sample annotation data.
   --cli_attribute         Sample attribute(s) for prediction. Multiple attributes.
                           must be separated by ",".
   --outdir                Output folder.
   --help                  Print help message.

Input

Both datasets (--d1_file, --d2_file) use the same format. An example quantification dataset (--d1_file or --d2_file) is shown below. The first column is the gene ID; every other column holds the expression of proteins at gene level for one sample. Missing values are written as NA.

	Testing_1	Testing_2	Testing_3	Testing_4	Testing_5	Testing_6	Testing_7	Testing_8	Testing_9	Testing_10
A1BG	1.5963	2.8484	2.1092	2.7922	2.4444	3.9907	3.6792	3.7321	3.6123	3.1739
A2M	5.9429	5.0089	6.0823	6.0093	6.4553	6.0097	6.014	6.9721	4.4766	6.481
AAAS	1.9337	2.951	3.5984	2.0419	2.1217	0.9662	1.0086	NA	2.4936	2.2399
AACS	1.7549	NA	2.3948	NA	0.9946	2.5969	NA	NA	1.6488	NA
AAGAB	NA	NA	0.9982	NA	1.0282	1.6296	NA	NA	1.8141	NA
AAK1	1.0459	2.5435	1.7449	NA	1.0653	0.9855	2.0395	1.1588	NA	NA
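
As a rough illustration of the format described above, the following sketch reads a quantification table with Python's standard csv module. The function name read_quant_table and the inlined excerpt are illustrative only and are not part of COSMO; the sketch assumes a tab-separated file with an empty first header cell and NA for missing values, as in the example.

```python
import csv
import io

# A few rows and columns from the example above, inlined so the sketch runs on its own.
EXAMPLE_TSV = (
    "\tTesting_1\tTesting_2\tTesting_3\n"
    "A1BG\t1.5963\t2.8484\t2.1092\n"
    "AACS\t1.7549\tNA\t2.3948\n"
    "AAGAB\tNA\tNA\t0.9982\n"
)


def read_quant_table(handle):
    """Return (sample_names, {gene_id: [float or None, ...]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # the first header cell (gene ID column) is empty
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene, values = row[0], row[1:]
        if len(values) != len(samples):
            raise ValueError(
                f"{gene}: expected {len(samples)} values, got {len(values)}"
            )
        # "NA" marks a missing measurement; keep it as None
        table[gene] = [None if v == "NA" else float(v) for v in values]
    return samples, table


samples, table = read_quant_table(io.StringIO(EXAMPLE_TSV))
# Number of missing values per gene
missing = {gene: sum(v is None for v in vals) for gene, vals in table.items()}
print(samples)
print(missing)
```

Counting missing values per gene before a run can help to judge how much imputation the preprocessing step will have to do.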

The input for parameter --cli_file is the sample annotation file and an example is shown below:

sample	age	gender	stage	colon_rectum	msi	tumor_normal
Testing_1	47	Female	High	Colon	MSI-Low/MSS	Tumor
Testing_2	68	Female	High	Rectum	MSI-Low/MSS	Tumor
Testing_3	52	Male	Low	Colon	MSI-Low/MSS	Tumor
Testing_4	54	Female	Low	Colon	MSI-High	Tumor
Testing_5	72	Male	High	Colon	MSI-Low/MSS	Tumor
Testing_6	61	Male	High	Colon	MSI-Low/MSS	Tumor
Testing_7	58	Female	High	Colon	MSI-High	Tumor
Testing_8	73	Male	Low	Colon	MSI-Low/MSS	Tumor
Testing_9	68	Male	Low	Colon	MSI-Low/MSS	Tumor
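
Before launching a run, it may be useful to check that the attributes passed to --cli_attribute exist in the annotation file and that every sample in the datasets is annotated. The sketch below is one possible way to do that; check_annotation and the sample name Extra_1 are hypothetical and not part of COSMO.

```python
import csv
import io

# First rows of the annotation example above, inlined so the sketch runs on its own.
CLI_TSV = (
    "sample\tage\tgender\tstage\tcolon_rectum\tmsi\ttumor_normal\n"
    "Testing_1\t47\tFemale\tHigh\tColon\tMSI-Low/MSS\tTumor\n"
    "Testing_2\t68\tFemale\tHigh\tRectum\tMSI-Low/MSS\tTumor\n"
    "Testing_3\t52\tMale\tLow\tColon\tMSI-Low/MSS\tTumor\n"
    "Testing_4\t54\tFemale\tLow\tColon\tMSI-High\tTumor\n"
)


def check_annotation(handle, cli_attribute, data_samples):
    """Compare a --cli_attribute string and dataset sample names with the annotation."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = list(reader)
    columns = reader.fieldnames or []
    # --cli_attribute is a comma-separated list of column names
    requested = [a.strip() for a in cli_attribute.split(",")]
    annotated = {row["sample"] for row in rows}
    return {
        "unknown_attributes": [a for a in requested if a not in columns],
        "samples_without_annotation": sorted(set(data_samples) - annotated),
        # distinct values observed for each requested attribute
        "levels": {
            a: sorted({row[a] for row in rows}) for a in requested if a in columns
        },
    }


# "Extra_1" is a made-up sample name used to show a missing annotation.
report = check_annotation(
    io.StringIO(CLI_TSV),
    "gender,msi",
    ["Testing_1", "Testing_2", "Testing_3", "Testing_4", "Extra_1"],
)
print(report)
```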

Below is an example of running COSMO:

nextflow run bzhanglab/COSMO --d1_file example_data/test_pro.tsv \
    --d2_file example_data/test_rna.tsv \
    --cli_file example_data/test_cli.tsv \
    --cli_attribute "gender,msi" \
    --outdir ./results

The data to run the above example can be found in this folder: "example_data".

Output

Below are the folders and files generated by COSMO.

data_use

This directory contains all the input data files:

test_pro.tsv
test_rna.tsv
test_cli.tsv

results/method1

Output files from method_1

genes.tsv: Chromosome annotation of genes
cleaned_data1.tsv: Preprocessed data from the first dataset, d1_file (missing values imputed, if any)
cleaned_data2.tsv: Preprocessed data from the second dataset, d2_file (missing values imputed, if any)
sample_correlation.csv: Pearson correlation between samples from the first dataset (rows) and the second dataset (columns)
sample_correlation.png: Heatmap image of 'sample_correlation.csv'
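
The layout of sample_correlation.csv (first dataset in rows, second dataset in columns) can be illustrated with a small pure-Python sketch. It assumes both datasets have already been restricted to the same ordered set of genes with no missing values; which genes COSMO actually uses and how it normalizes them is not described here, and the toy values below are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def sample_correlation(d1, d2):
    """d1, d2: {sample: [values over the same ordered genes]}.

    Returns {d1_sample: {d2_sample: r}}, i.e. d1 in rows and d2 in columns.
    """
    return {
        s1: {s2: pearson(v1, v2) for s2, v2 in d2.items()}
        for s1, v1 in d1.items()
    }


# Hypothetical toy values for two samples measured over four genes.
d1 = {"S1": [1.0, 2.0, 3.0, 4.0], "S2": [4.0, 3.0, 2.0, 1.0]}
d2 = {"S1": [2.0, 4.0, 6.0, 8.0], "S2": [8.0, 6.0, 4.0, 2.0]}
corr = sample_correlation(d1, d2)
for s1, row in corr.items():
    print(s1, row)
```

Correctly labeled samples are expected to show their highest correlation on the diagonal of this matrix.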

pairwise_matching.tsv: Matching generated by stable marriage matching on the sample correlations. Each row is one matched pair: a sample from the first dataset (d1_label) and a sample from the second dataset (d2_label). The column d1rank is the preferential rank of the d1 sample matched to the d2 sample; d2rank is the preferential rank of the d2 sample matched to the d1 sample.

d1	d1_label	d2	d2_label	d1rank	d2rank	distance	correlation
1	Testing_1	1	Testing_1	1	1	2	0.60889
2	Testing_2	2	Testing_2	1	1	2	0.59604
3	Testing_3	3	Testing_3	1	1	2	0.64042
4	Testing_4	4	Testing_4	1	1	2	0.76045
5	Testing_5	5	Testing_5	1	2	3	0.66900
6	Testing_6	7	Testing_7	1	1	2	0.77152
7	Testing_7	6	Testing_6	1	1	2	0.70996
8	Testing_8	8	Testing_8	1	1	2	0.69767
9	Testing_9	9	Testing_9	1	1	2	0.75336
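
To make the idea of a stable matching concrete, here is a minimal Gale-Shapley sketch over a square correlation matrix. It is not COSMO's implementation. The rank definitions are an assumption (d1rank is taken as the position of the matched d2 sample in the d1 sample's preference list, and d2rank the reverse), and distance is computed as d1rank + d2rank because that is what the example rows above show. The matrix values are made up.

```python
def stable_matching(corr):
    """Match d1 samples (rows) to d2 samples (columns) of a square matrix."""
    n = len(corr)
    # Each sample ranks the samples of the other dataset by decreasing correlation.
    d1_pref = [sorted(range(n), key=lambda j: -corr[i][j]) for i in range(n)]
    d2_pref = [sorted(range(n), key=lambda i: -corr[i][j]) for j in range(n)]
    d1_rank = [{j: r + 1 for r, j in enumerate(p)} for p in d1_pref]
    d2_rank = [{i: r + 1 for r, i in enumerate(p)} for p in d2_pref]

    next_choice = [0] * n        # next preference each d1 sample will propose to
    partner_of_d2 = [None] * n   # current d1 partner of each d2 sample
    free = list(range(n))        # d1 samples that are still unmatched
    while free:
        i = free.pop()
        j = d1_pref[i][next_choice[i]]
        next_choice[i] += 1
        current = partner_of_d2[j]
        if current is None:
            partner_of_d2[j] = i
        elif d2_rank[j][i] < d2_rank[j][current]:
            partner_of_d2[j] = i     # d2 sample j prefers the new proposer
            free.append(current)
        else:
            free.append(i)           # rejected; i proposes to its next choice later

    rows = []
    for j, i in enumerate(partner_of_d2):
        rows.append({
            "d1": i + 1,
            "d2": j + 1,
            "d1rank": d1_rank[i][j],
            "d2rank": d2_rank[j][i],
            "distance": d1_rank[i][j] + d2_rank[j][i],
            "correlation": corr[i][j],
        })
    return sorted(rows, key=lambda r: r["d1"])


# Hypothetical correlations in which samples 2 and 3 are swapped in the second dataset.
corr = [
    [0.9, 0.2, 0.1],
    [0.3, 0.1, 0.8],
    [0.2, 0.7, 0.3],
]
matches = stable_matching(corr)
for row in matches:
    print(row)
```

A matched pair whose d1 and d2 indices differ, as for samples 6 and 7 in the table above, points to a possible mislabeling.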

clinical_attributes_pred.tsv: Classification results for every sample in both datasets, using the method of winning team 1. Column gender_prob is the annotated binary label; d1gender_prob is the predicted probability for the sample from the first dataset, and d2gender_prob is the predicted probability for the sample from the second dataset. More columns are generated if there are more clinical attributes.

sample	gender	gender_prob	d1gender	d1gender_prob	d2gender	d2gender_prob	pred_gender
Testing_1	Female	0	Female	0.01724	Female	0.32446	0.17085
Testing_2	Female	0	Female	0.00930	Female	0.17867	0.09398
Testing_3	Male	1	Male	0.97656	Male	0.78810	0.88233
Testing_4	Female	0	Female	0.00489	Female	0.25205	0.12847
Testing_5	Male	1	Male	0.99710	Male	0.58199	0.78955
Testing_6	Male	1	Male	0.99831	Female	0.41568	0.70699
Testing_7	Female	0	Female	0.02782	Male	0.57772	0.30277
Testing_8	Male	1	Male	0.99377	Male	0.76312	0.87844
Testing_9	Male	1	Male	0.99856	Male	0.69589	0.84722
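
One way to read this table is to list the samples whose predicted label in either dataset disagrees with the annotation. The sketch below does that for three rows copied from the example. The helper names are illustrative, and the column pattern for other attributes (for example d1msi, pred_msi) is an assumption extrapolated from the gender columns. In the rows shown, pred_gender equals the mean of d1gender_prob and d2gender_prob up to rounding, so the sketch checks that as well; whether COSMO always combines the two probabilities this way is not stated here.

```python
import csv
import io

# Three rows copied from the example table above.
PRED_TSV = (
    "sample\tgender\tgender_prob\td1gender\td1gender_prob"
    "\td2gender\td2gender_prob\tpred_gender\n"
    "Testing_5\tMale\t1\tMale\t0.99710\tMale\t0.58199\t0.78955\n"
    "Testing_6\tMale\t1\tMale\t0.99831\tFemale\t0.41568\t0.70699\n"
    "Testing_7\tFemale\t0\tFemale\t0.02782\tMale\t0.57772\t0.30277\n"
)


def flag_disagreements(rows, attribute):
    """Return (sample, dataset, annotated, predicted) for every mismatch."""
    flagged = []
    for row in rows:
        annotated = row[attribute]
        for dataset in ("d1", "d2"):
            predicted = row[f"{dataset}{attribute}"]
            if predicted != annotated:
                flagged.append((row["sample"], dataset, annotated, predicted))
    return flagged


def mean_matches(row, attribute, tol=1e-4):
    """Check whether pred_<attribute> is the mean of the two dataset probabilities."""
    d1 = float(row[f"d1{attribute}_prob"])
    d2 = float(row[f"d2{attribute}_prob"])
    return abs((d1 + d2) / 2 - float(row[f"pred_{attribute}"])) < tol


rows = list(csv.DictReader(io.StringIO(PRED_TSV), delimiter="\t"))
flagged = flag_disagreements(rows, "gender")
all_means_match = all(mean_matches(row, "gender") for row in rows)
print(flagged)
print(all_means_match)
```

Here the second dataset disagrees with the annotation for both Testing_6 and Testing_7, which is consistent with the two samples being swapped in that dataset.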

errors.tsv: Count of the different types of mislabeling errors
final.tsv: Table of corrected labels. Any inconsistency of ids within a row indicates a mislabeling error. The table is generated using only the classification results of method_1. It is interpreted in the same way as 'cosmo_final_result.tsv' in 'results/final'.

results/method2

test_ModelA_results.csv: Classification results for every sample of the first dataset, using the method of winning team 2.
test_ModelB_results.csv: Classification results for every sample of the second dataset, using the method of winning team 2.

results/final

cosmo_final_result.tsv: Table of corrected labels. The table is generated using the integrated classification results of both method_1 and method_2. Each sample is assigned a number as a unique id. A row with the same id in every column indicates that all the data belong to the same patient and there is no mislabeling.

sample	Clinical	Data1	Data2
Testing_1	1	1	1
Testing_2	2	2	2
Testing_3	3	3	3
Testing_4	4	4	4
Testing_5	5	5	5

A row with differing ids indicates a mislabeling error. Below is an example of a swapping error, in which samples Testing_6 and Testing_7 are swapped in Data2.

sample	Clinical	Data1	Data2
Testing_6	6	6	7
Testing_7	7	7	6

The same id occurring twice in a column indicates a duplication error. Below is an example in which sample Testing_8 is duplicated in Data1.

sample	Clinical	Data1	Data2
Testing_8	8	8	8
Testing_9	9	8	9

Shifting errors appear as a run of consecutive rows in which each id is displaced by one position. The table below shows an example of a shifting error, where samples Testing_10, Testing_11 and Testing_12 are shifted consecutively in Data2.

sample	Clinical	Data1	Data2
Testing_10	10	10	10
Testing_11	11	11	10
Testing_12	12	12	11
Testing_13	13	13	12
Testing_14	14	14	14
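
The reading rules above can be sketched in a few lines of Python. This is only an illustration of how the table might be interpreted, not COSMO's own error classification (that is reported in errors.tsv). The function names are hypothetical, and the sketch assumes the Clinical column is the reference against which Data1 and Data2 are compared. The rows are taken from the swapping and duplication examples above.

```python
from collections import Counter

# (sample, Clinical, Data1, Data2), taken from the examples above.
ROWS = [
    ("Testing_5", 5, 5, 5),
    ("Testing_6", 6, 6, 7),
    ("Testing_7", 7, 7, 6),
    ("Testing_8", 8, 8, 8),
    ("Testing_9", 9, 8, 9),
]
COLUMNS = {"Clinical": 1, "Data1": 2, "Data2": 3}


def inconsistent_rows(rows):
    """Samples whose ids are not identical across all columns."""
    return [r[0] for r in rows if len(set(r[1:])) > 1]


def duplicated_ids(rows):
    """Ids that occur more than once within a column, keyed by column name."""
    result = {}
    for name, col in COLUMNS.items():
        counts = Counter(r[col] for r in rows)
        repeated = sorted(i for i, c in counts.items() if c > 1)
        if repeated:
            result[name] = repeated
    return result


def swapped_pairs(rows, column):
    """Pairs of samples whose ids in `column` are exchanged relative to Clinical."""
    col = COLUMNS[column]
    by_clinical = {r[1]: r for r in rows}
    pairs = []
    for r in rows:
        other = by_clinical.get(r[col])
        if other is None or other[0] == r[0]:
            continue  # id unknown, or the row points to itself
        # report each pair once, from the row with the smaller Clinical id
        if other[col] == r[1] and r[1] < other[1]:
            pairs.append((r[0], other[0]))
    return pairs


print(inconsistent_rows(ROWS))
print(duplicated_ids(ROWS))
print(swapped_pairs(ROWS, "Data2"))
```

Shifting errors are not detected by this sketch; they would show up as several consecutive inconsistent rows combined with one duplicated id in the affected column.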

Datasets

The datasets used in the publication of COSMO are available at cosmo_datasets.
