COSMO: COrrection of Sample Mislabeling by Omics

COSMO

Overview

COSMO: COrrection of Sample Mislabeling by Omics

Multi-omics Enabled Sample Mislabeling Correction

Installation

  1. Download COSMO:
git clone https://github.com/bzhanglab/cosmo
  2. Install Docker (>=19.03).

  3. Install Nextflow. More information can be found in the Nextflow get started page.

All other tools used by COSMO have been dockerized and will be installed automatically the first time COSMO is run on a computer.

Usage

○ → nextflow run bzhanglab/COSMO --help
N E X T F L O W  ~  version 21.02.0-edge
Launching `cosmo.nf` [high_noyce] - revision: 46dc5c6c96
=========================================
COSMO => COrrection of Sample Mislabeling by Omics
=========================================
Usage:
nextflow run bzhanglab/COSMO
 Arguments:
   --d1_file               Dataset with quantification data at gene level.
   --d2_file               Dataset with quantification data at gene level.
   --cli_file              Sample annotation data.
   --cli_attribute         Sample attribute(s) for prediction. Multiple attributes.
                           must be separated by ",".
   --outdir                Output folder.
   --help                  Print help message.

Input

Both datasets (--d1_file, --d2_file) use the same format. An example quantification dataset (--d1_file or --d2_file) is shown below. The first column is the gene ID and every other column holds the gene-level expression values for one sample. Missing values are written as NA.

Testing_1 Testing_2 Testing_3 Testing_4 Testing_5 Testing_6 Testing_7 Testing_8 Testing_9 Testing_10
A1BG 1.5963 2.8484 2.1092 2.7922 2.4444 3.9907 3.6792 3.7321 3.6123 3.1739
A2M 5.9429 5.0089 6.0823 6.0093 6.4553 6.0097 6.014 6.9721 4.4766 6.481
AAAS 1.9337 2.951 3.5984 2.0419 2.1217 0.9662 1.0086 NA 2.4936 2.2399
AACS 1.7549 NA 2.3948 NA 0.9946 2.5969 NA NA 1.6488 NA
AAGAB NA NA 0.9982 NA 1.0282 1.6296 NA NA 1.8141 NA
AAK1 1.0459 2.5435 1.7449 NA 1.0653 0.9855 2.0395 1.1588 NA NA
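
Before starting a run it can help to confirm that a quantification file parses as expected. The following is a minimal sketch, not part of COSMO. It assumes the file is tab-separated (as the .tsv extension of the example data suggests), that the header row lists only sample names as in the example above, and that missing values are written as NA. The function names read_matrix and missing_fraction are hypothetical, and the data is a trimmed copy of the example table.

```python
import csv
import io


def read_matrix(handle):
    """Parse a gene-by-sample table; 'NA' entries become None."""
    rows = list(csv.reader(handle, delimiter="\t"))
    header, body = rows[0], rows[1:]
    if not body:
        raise ValueError("no data rows found")
    # In the example the header lists only sample names, so it is one field
    # shorter than the data rows. A header with a gene-ID label is also accepted.
    width = len(body[0])
    samples = header if len(header) == width - 1 else header[1:]
    matrix = {}
    for row in body:
        if len(row) != len(samples) + 1:
            raise ValueError(f"unexpected number of fields for gene {row[0]!r}")
        matrix[row[0]] = [None if v == "NA" else float(v) for v in row[1:]]
    return samples, matrix


def missing_fraction(matrix):
    """Fraction of missing values per gene."""
    return {gene: sum(v is None for v in vals) / len(vals)
            for gene, vals in matrix.items()}


# Trimmed copy of the example above, converted to tab-separated text.
example = """Testing_1 Testing_2 Testing_3
A1BG 1.5963 2.8484 2.1092
AACS 1.7549 NA 2.3948
AAGAB NA NA 0.9982"""
tsv = "\n".join("\t".join(line.split()) for line in example.splitlines())

samples, matrix = read_matrix(io.StringIO(tsv))
frac = missing_fraction(matrix)
print(samples)
print(frac)
```

Genes with a high missing fraction are the ones most affected by the imputation step described in the Output section.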

The input for parameter --cli_file is the sample annotation file and an example is shown below:

sample age gender stage colon_rectum msi tumor_normal
Testing_1 47 Female High Colon MSI-Low/MSS Tumor
Testing_2 68 Female High Rectum MSI-Low/MSS Tumor
Testing_3 52 Male Low Colon MSI-Low/MSS Tumor
Testing_4 54 Female Low Colon MSI-High Tumor
Testing_5 72 Male High Colon MSI-Low/MSS Tumor
Testing_6 61 Male High Colon MSI-Low/MSS Tumor
Testing_7 58 Female High Colon MSI-High Tumor
Testing_8 73 Male Low Colon MSI-Low/MSS Tumor
Testing_9 68 Male Low Colon MSI-Low/MSS Tumor
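
The annotation file can be checked against the data files and the requested attributes before a run. The sketch below is an illustration under assumptions: it treats the file as tab-separated with a sample column as in the example, and the function check_annotation and the extra sample name passed to it are hypothetical. Whether COSMO itself requires identical sample sets is not stated here, so the report is informational only.

```python
import csv
import io


def check_annotation(cli_handle, data_samples, attributes):
    """Compare annotation samples and columns with what a run would need."""
    reader = csv.DictReader(cli_handle, delimiter="\t")
    columns = reader.fieldnames or []
    missing_attrs = [a for a in attributes if a not in columns]
    cli_samples = [row["sample"] for row in reader]
    return {
        "missing_attributes": missing_attrs,
        "only_in_annotation": sorted(set(cli_samples) - set(data_samples)),
        "only_in_data": sorted(set(data_samples) - set(cli_samples)),
    }


# First rows of the example annotation, converted to tab-separated text.
example = """sample age gender stage colon_rectum msi tumor_normal
Testing_1 47 Female High Colon MSI-Low/MSS Tumor
Testing_2 68 Female High Rectum MSI-Low/MSS Tumor
Testing_3 52 Male Low Colon MSI-Low/MSS Tumor"""
tsv = "\n".join("\t".join(line.split()) for line in example.splitlines())

# Sample names as they would be read from the header of a data file.
data_samples = ["Testing_1", "Testing_2", "Testing_3", "Testing_10"]

# Attributes are split the same way as the --cli_attribute value "gender,msi".
report = check_annotation(io.StringIO(tsv), data_samples, "gender,msi".split(","))
print(report)
```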

Below is an example of running COSMO:

nextflow run bzhanglab/COSMO --d1_file example_data/test_pro.tsv \
    --d2_file example_data/test_rna.tsv \
    --cli_file example_data/test_cli.tsv \
    --cli_attribute "gender,msi" \
    --outdir ./results

The data to run the above example can be found in this folder: "example_data".

Output

Below are the folders and files generated by COSMO.

data_use This directory contains all the input data files
  • test_pro.tsv
  • test_rna.tsv
  • test_cli.tsv
results/method1 Output files from method_1
  • genes.tsv Chromosome annotation of genes

  • cleaned_data1.tsv Preprocessed data from the first dataset, d1_file (missing values imputed, if any)

  • cleaned_data2.tsv Preprocessed data from the second dataset, d2_file (missing values imputed, if any)

  • sample_correlation.csv Pearson correlation between samples from the first dataset (rows) and the second dataset (columns)

  • sample_correlation.png Heatmap image of 'sample_correlation.csv'

  • pairwise_matching.tsv Matching generated by the stable marriage algorithm based on sample correlation. Each row is one matched pair: a sample from the first dataset (d1_label) and a sample from the second dataset (d2_label). The column d1rank is the preferential rank of the d1 sample matched to the d2 sample; d2rank is the preferential rank of the d2 sample matched to the d1 sample.

    d1 d1_label d2 d2_label d1rank d2rank distance correlation
    1 Testing_1 1 Testing_1 1 1 2 0.60889
    2 Testing_2 2 Testing_2 1 1 2 0.59604
    3 Testing_3 3 Testing_3 1 1 2 0.64042
    4 Testing_4 4 Testing_4 1 1 2 0.76045
    5 Testing_5 5 Testing_5 1 2 3 0.66900
    6 Testing_6 7 Testing_7 1 1 2 0.77152
    7 Testing_7 6 Testing_6 1 1 2 0.70996
    8 Testing_8 8 Testing_8 1 1 2 0.69767
    9 Testing_9 9 Testing_9 1 1 2 0.75336
  • clinical_attributes_pred.tsv Classification results for every sample in both datasets, using the method of winning team 1. Column gender_prob is the annotated binary label, d1gender_prob is the predicted probability for the sample from the first dataset, and d2gender_prob is the predicted probability for the sample from the second dataset. More columns are generated when more clinical attributes are given.

    sample gender gender_prob d1gender d1gender_prob d2gender d2gender_prob pred_gender
    Testing_1 Female 0 Female 0.01724 Female 0.32446 0.17085
    Testing_2 Female 0 Female 0.00930 Female 0.17867 0.09398
    Testing_3 Male 1 Male 0.97656 Male 0.78810 0.88233
    Testing_4 Female 0 Female 0.00489 Female 0.25205 0.12847
    Testing_5 Male 1 Male 0.99710 Male 0.58199 0.78955
    Testing_6 Male 1 Male 0.99831 Female 0.41568 0.70699
    Testing_7 Female 0 Female 0.02782 Male 0.57772 0.30277
    Testing_8 Male 1 Male 0.99377 Male 0.76312 0.87844
    Testing_9 Male 1 Male 0.99856 Male 0.69589 0.84722
  • errors.tsv Counts of the different types of mislabeling errors

  • final.tsv Table of corrected labels. Any inconsistency of ids within a row indicates a mislabeling error. The table is generated using only the classification results of method_1. It is interpreted the same way as 'cosmo_final_result.tsv' in 'results/final'.
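
The method_1 tables can be screened for suspicious samples with a few lines of code. The sketch below is an illustration, not part of COSMO. It assumes both files are tab-separated, and the helper names to_tsv, mismatched_pairs and label_disagreements are hypothetical. The matching table is read by column position (index, label, index, label), so the exact header names do not matter. The rows are copied from the examples above.

```python
import csv
import io


def to_tsv(text):
    """Turn a whitespace-aligned excerpt into tab-separated text."""
    return "\n".join("\t".join(line.split()) for line in text.strip().splitlines())


def mismatched_pairs(handle):
    """Label pairs whose d1 index differs from the matched d2 index."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    return [(row[1], row[3]) for row in reader if row[0] != row[2]]


def label_disagreements(handle, attribute):
    """(sample, dataset, predicted label) where a prediction differs from the annotation."""
    found = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for dataset in ("d1", "d2"):
            predicted = row[dataset + attribute]
            if predicted != row[attribute]:
                found.append((row["sample"], dataset, predicted))
    return found


matching = to_tsv("""
d1 d1_label d2 d2_label d1rank d2rank distance correlation
5 Testing_5 5 Testing_5 1 2 3 0.66900
6 Testing_6 7 Testing_7 1 1 2 0.77152
7 Testing_7 6 Testing_6 1 1 2 0.70996
""")

predictions = to_tsv("""
sample gender gender_prob d1gender d1gender_prob d2gender d2gender_prob pred_gender
Testing_5 Male 1 Male 0.99710 Male 0.58199 0.78955
Testing_6 Male 1 Male 0.99831 Female 0.41568 0.70699
Testing_7 Female 0 Female 0.02782 Male 0.57772 0.30277
""")

pairs = mismatched_pairs(io.StringIO(matching))
flags = label_disagreements(io.StringIO(predictions), "gender")
print(pairs)
print(flags)
```

In these example rows both checks point at the same two samples, Testing_6 and Testing_7, and only in the second dataset.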

results/method2
  • test_ModelA_results.csv Classification results for every sample of the first dataset, using the method of winning team 2.

  • test_ModelB_results.csv Classification results for every sample of the second dataset, using the method of winning team 2.

results/final
  • cosmo_final_result.tsv Table of corrected labels. The table is generated using the integrated classification results of both method_1 and method_2. Each sample is assigned a number as a unique id. A row with the same id in every column indicates that all the data belong to the same patient and there is no mislabeling.

    sample Clinical Data1 Data2
    Testing_1 1 1 1
    Testing_2 2 2 2
    Testing_3 3 3 3
    Testing_4 4 4 4
    Testing_5 5 5 5

    A row with differing ids indicates a mislabeling error. Below is an example of a swapping error, in which samples Testing_6 and Testing_7 are swapped in Data2.

    sample Clinical Data1 Data2
    Testing_6 6 6 7
    Testing_7 7 7 6

    An id that occurs twice in the same column indicates a duplication error. Below is an example in which sample Testing_8 is duplicated in Data1.

    sample Clinical Data1 Data2
    Testing_8 8 8 8
    Testing_9 9 8 9

    A shifting error appears as a run of consecutively offset ids. The table below shows an example in which samples Testing_10, Testing_11 and Testing_12 are shifted consecutively in Data2.

    sample Clinical Data1 Data2
    Testing_10 10 10 10
    Testing_11 11 11 10
    Testing_12 12 12 11
    Testing_13 13 13 12
    Testing_14 14 14 14
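
The reading rules above can be applied programmatically. The following sketch is an illustration under assumptions: it treats the file as tab-separated with the columns sample, Clinical, Data1 and Data2 shown above, and it uses the Clinical column as the reference id. The function names are hypothetical and the rows are copied from the example tables.

```python
import csv
import io
from collections import Counter


def inconsistent_rows(rows):
    """Samples whose ids are not identical across all three columns."""
    return [r["sample"] for r in rows
            if len({r["Clinical"], r["Data1"], r["Data2"]}) > 1]


def duplicated_ids(rows, column):
    """Ids that occur more than once in the given column."""
    counts = Counter(r[column] for r in rows)
    return sorted(i for i, n in counts.items() if n > 1)


def swapped_pairs(rows, column):
    """Pairs of samples that carry each other's id in the given column."""
    by_id = {r["Clinical"]: r for r in rows}
    pairs = set()
    for r in rows:
        own, seen = r["Clinical"], r[column]
        if own != seen and seen in by_id and by_id[seen][column] == own:
            pairs.add(tuple(sorted((r["sample"], by_id[seen]["sample"]))))
    return sorted(pairs)


example = """sample Clinical Data1 Data2
Testing_5 5 5 5
Testing_6 6 6 7
Testing_7 7 7 6
Testing_8 8 8 8
Testing_9 9 8 9"""
tsv = "\n".join("\t".join(line.split()) for line in example.splitlines())
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

flagged = inconsistent_rows(rows)
dup_data1 = duplicated_ids(rows, "Data1")
swap_data2 = swapped_pairs(rows, "Data2")
print(flagged)
print(dup_data1)
print(swap_data2)
```

The errors.tsv file produced by method_1 reports error counts by type; this sketch only helps locate the affected rows and makes no claim to reproduce those counts.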

Datasets

The datasets used in the publication of COSMO are available at cosmo_datasets.
