Which types of mutation effects should be ignored? #2
Comments
Are we interested in preserving mutation type as a data field? If I recall correctly, we were talking about having mutation as a binary outcome variable. If this is still the case, I think there are several ways we could get there.

The first would be to parse the above set of effects, potentially eliminating some. I think it would be fine to eliminate the silent mutations category, but am unsure about the others. In the UCSC Xena documentation, they've grouped the mutation effects into four color-coded groups. It seems like this might be based on severity, although I am not familiar enough with the topic to be sure. The groups (from here) are:

Red --> Nonsense_Mutation, frameshift_variant, stop_gained, splice_acceptor_variant, splice_acceptor_variant&intron_variant, splice_donor_variant, splice_donor_variant&intron_variant, Splice_Site, Frame_Shift_Del, Frame_Shift_Ins

Blue --> splice_region_variant, splice_region_variant&intron_variant, missense, non_coding_exon_variant, missense_variant, Missense_Mutation, exon_variant, RNA, Indel, start_lost, start_gained, De_novo_Start_OutOfFrame, Translation_Start_Site, De_novo_Start_InFrame, stop_lost, Nonstop_Mutation, initiator_codon_variant, 5_prime_UTR_premature_start_codon_gain_variant, disruptive_inframe_deletion, inframe_deletion, inframe_insertion, In_Frame_Del, In_Frame_Ins

Green --> synonymous_variant, 5_prime_UTR_variant, 3_prime_UTR_variant, 5'Flank, 3'Flank, 3'UTR, 5'UTR, Silent, stop_retained_variant

Orange --> others, SV, upstream_gene_variant, downstream_gene_variant, intron_variant, intergenic_region

A second option would be using the somatic mutation data that is already called at the gene level. Positive mutation calls reflect the effects: nonsense, missense, frameshift indels, splice site mutations, stop codon readthroughs, change of start codon, and inframe indels. We could also implement this same calling procedure ourselves.
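To make the first option concrete, here is a minimal sketch of how the four Xena color groups could be turned into a lookup and then into a binary call. The effect names are transcribed from the list in this comment; the names `XENA_GROUPS`, `effect_group`, and `to_binary` are hypothetical, and which groups count as "mutated" is left as a parameter because that is the open question here.

```python
# Effect names transcribed from the Xena color groups listed in this comment.
# None of the names contain whitespace, so str.split() is safe here.
XENA_GROUPS = {
    "red": """Nonsense_Mutation frameshift_variant stop_gained
        splice_acceptor_variant splice_acceptor_variant&intron_variant
        splice_donor_variant splice_donor_variant&intron_variant
        Splice_Site Frame_Shift_Del Frame_Shift_Ins""".split(),
    "blue": """splice_region_variant splice_region_variant&intron_variant
        missense non_coding_exon_variant missense_variant Missense_Mutation
        exon_variant RNA Indel start_lost start_gained
        De_novo_Start_OutOfFrame Translation_Start_Site De_novo_Start_InFrame
        stop_lost Nonstop_Mutation initiator_codon_variant
        5_prime_UTR_premature_start_codon_gain_variant
        disruptive_inframe_deletion inframe_deletion inframe_insertion
        In_Frame_Del In_Frame_Ins""".split(),
    "green": """synonymous_variant 5_prime_UTR_variant 3_prime_UTR_variant
        5'Flank 3'Flank 3'UTR 5'UTR Silent stop_retained_variant""".split(),
    # upstream_gene_variant: spelling assumed by analogy with
    # downstream_gene_variant.
    "orange": """others SV upstream_gene_variant downstream_gene_variant
        intron_variant intergenic_region""".split(),
}

# Invert the grouping into an effect -> color lookup.
EFFECT_TO_GROUP = {
    effect: color
    for color, effects in XENA_GROUPS.items()
    for effect in effects
}


def effect_group(effect):
    """Return the color group for an effect, or "unknown" if it is unlisted."""
    return EFFECT_TO_GROUP.get(effect, "unknown")


def to_binary(effect, positive_groups):
    """Return 1 if the effect falls in one of positive_groups, else 0."""
    return int(effect_group(effect) in positive_groups)
```

For example, `to_binary("Silent", {"red", "blue"})` gives 0 because "Silent" is in the green group, while any effect not in the list at all falls through to "unknown" and is also called 0.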
Yes, I agree - I think we can toss Silent mutations. I also think that keeping it simple would be the way to go. There are other resources that are cleaner/simpler than the data available from TCGA Firehose and that may be worth exploring.
@clairemcleod & @gwaygenomics : If you wanted to provide simple groups that would get people started, how would you combine them? We can always provide the option to drill down to a greater level of detail (e.g. any KRAS G12V mutation), but I agree with you both that a simple initial interface is optimal. The very granular items will only be useful for mutations that are particularly common.
Changed `base_url` for downloading data from the Xena browser from https://genome-cancer.ucsc.edu/download/public/xena/TCGA/TCGA.PANCAN.sampleMap/ to https://tcga.xenahubs.net/download/TCGA.PANCAN.sampleMap/. This new location seems to have resolved the unstandardized mutation effects reported in cognoma#2. Added json metadata files to `download` providing version info at time of download. Thanks @jingchunzhu for the suggestion. See https://groups.google.com/forum/#!msg/ucsc-cancer-genomics-browser/eg6nJOFSefw/wO0wNrMeAgAJ
In dhimmel/cancer-data@0239cba, I changed the download location for UCSC Xena data (and added version tracking). This resolved the unstandardized mutation effect types. The updated version of the frequency table is below (color refers to the Xena characterizations mentioned above):
@clairemcleod, nice find with the |
Addresses cognoma#2 -- add additional mutation effects. Added all red & blue mutations from http://xena.ucsc.edu/how-we-characterize-mutations/ that were present in the data.
I went with a simple solution. In dhimmel/cancer-data@ffe66ab, I retained only red and blue mutations (according to Xena), meaning orange and green mutations were removed. The only removed mutation effect category that was an appreciable portion of the data was "Silent", which I think we're all in agreement should be excluded. I posted the mutation and expression datasets from this commit to figshare. Mutations were retained for 8,508 samples, 7,706 of which had corresponding expression data.
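For anyone who wants to see the shape of this filtering step, the following is a rough sketch, not the code from the commit. It assumes mutation records arrive as `(sample, gene, effect)` tuples, and `RETAINED_EFFECTS` is only an illustrative subset of the red and blue effects; the function name and sample IDs are made up.

```python
# Illustrative subset of red and blue effects (an assumption, not the
# exact set used in the commit).
RETAINED_EFFECTS = frozenset({
    "Nonsense_Mutation", "Frame_Shift_Del", "Frame_Shift_Ins", "Splice_Site",
    "Missense_Mutation", "In_Frame_Del", "In_Frame_Ins", "Nonstop_Mutation",
})


def mutation_matrix(records, retained=RETAINED_EFFECTS):
    """Build {sample: {gene: 0 or 1}} from (sample, gene, effect) records.

    A cell is 1 if the sample has at least one retained effect in that gene.
    Samples and genes that only carry removed effects still appear, with 0s.
    """
    samples = sorted({sample for sample, _, _ in records})
    genes = sorted({gene for _, gene, _ in records})
    matrix = {sample: {gene: 0 for gene in genes} for sample in samples}
    for sample, gene, effect in records:
        if effect in retained:
            matrix[sample][gene] = 1
    return matrix


# Hypothetical records for illustration only.
example = [
    ("sample_A", "KRAS", "Missense_Mutation"),
    ("sample_A", "TP53", "Silent"),
    ("sample_B", "TP53", "Nonsense_Mutation"),
]
calls = mutation_matrix(example)
```

Here `calls["sample_A"]["TP53"]` is 0 because the only TP53 record for that sample is "Silent", which is not retained.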
Unstandardized mutation types were resolved. See cognoma#2 (comment) Addressed this reviewer comment: cognoma#7 (comment)
The `PANCAN_mutation` dataset (online doc) contains several types of mutations under the `effect` column. My processing of the dataset (notebook) yielded the following mutation effects and frequencies (as counts and percentages):

It appears that certain effects are duplicates, such as `5_prime_UTR_variant`, `5'UTR`, and `UTR_5_PRIME`, which, if true, represents a poor case of standardization. If we want to improve the standardization, we can create our own mapping, or we can report the issue to the upstream creators (although these fixes usually take a long time).

Anyways, we'll have to decide which types of effects to consider as functionally relevant mutations. For example, a "Silent" mutation generally does not have a biological effect. We could also let users decide for themselves, but that adds complexity.
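If we did create our own mapping, it could look roughly like the sketch below. It only covers the three 5' UTR labels mentioned above, and picking `5_prime_UTR_variant` as the canonical label is an assumption; `EFFECT_SYNONYMS`, `standardize_effect`, and `merge_effect_counts` are hypothetical names.

```python
from collections import Counter

# Map duplicate labels onto one canonical label. Only the 5' UTR example
# from this issue is covered; the choice of canonical label is an assumption.
EFFECT_SYNONYMS = {
    "5'UTR": "5_prime_UTR_variant",
    "UTR_5_PRIME": "5_prime_UTR_variant",
}


def standardize_effect(effect):
    """Return the canonical label for an effect, or the effect unchanged."""
    return EFFECT_SYNONYMS.get(effect, effect)


def merge_effect_counts(counts):
    """Re-tally an {effect: count} mapping after merging duplicate labels."""
    merged = Counter()
    for effect, n in counts.items():
        merged[standardize_effect(effect)] += n
    return merged
```

Labels without an entry in the mapping pass through untouched, so the table can be extended one confirmed duplicate at a time.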
@clairemcleod, @mp8, @DCousminer, @gwaygenomics, @cgreene, @stephenshank — I thought you may have a better understanding than I do of the biology here. Can any of these categories be discarded as irrelevant to a tumor's function and classification? Are you interested in creating a consolidated set of effects with duplicates merged?