Write specification for representing translocations and fusions/junctions #28
Comments
C/P from #51, closed as duplicate: A goal for this issue is to write a document that includes fusion use cases and a proposed model. This should also serve as an introductory exercise for handling ambiguous representations of fusions (e.g. only one fusion partner specified or only gene names specified) alongside particular representations of fusions (defined transcript regions present / absent). |
NTRK Fusions curation elements example (initial draft, WIP, from Angshumoy Roy and Gordana Raca, ClinGen Somatic WG): https://drive.google.com/file/d/18EEeIadChFwh79vEBz2knphKsYsfpONu/view?usp=sharing |
From Subha (ClinGen Somatic WG), a paper with a few interesting data elements to capture on fusions: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6329466/ |
From Marilyn Li (of AMP/ASCO/CAP guidelines) on fusions in HGVS:
|
I've some time to review the proposal and I can see a number of issues with SV support, many of which are also problems for VCF. Here is my take on (DNA) SV support:
There are three fundamental primitives required to fully support arbitrary genomic rearrangements:
Other primitives are possible but unnecessary. Whilst VCF does support promiscuous breakpoints, in practice, nobody uses these. If there is any demand for the representation of such ambiguities, additional annotation on a set of breakends is able to represent this (single breakends were a late addition to VCF).
One critical limitation of VCF is the lack of definition of what a symbolic allele actually represents. If I have a VCF, I don't know whether the Unlike SNVs and small indels, where the primitive corresponds directly to the events, there is a 1 to many relationship between events and primitives for SVs. A simple deletion event requires both a breakpoint and a DNA loss/CNV to co-occur on the same chromatid. Chromothripsis results in many deletion-like events, as do retrocopied processed transcripts. The lack of separation between the detection of the basic primitives and the biological interpretation of the events producing those primitives is a critical flaw in the current VCF specifications. I've been working a lot on this lately and, if your CN and SV calls are good enough, path traversal and event classification can be done, even for events composed of 50+ breakpoints. I'll provide a link to our bioRxiv preprint as soon as it goes up (~1-2 weeks). Even something as simple as a gene fusion can (spoiler: and frequently does) have multiple underlying breakpoints in the DNA.
This is extremely important in the cancer setting. Phasing events to paternal/maternal is of marginal utility to event determination and downstream interpretation when aneuploidy is ubiquitous. To be useful in a cancer setting, we need to be able to specify whether or not two events are adjacent on the same chromatid. I'm not even sure what to call this. It's not quite phasing but, as far as I know, the community hasn't come up with a term for this. This capability is important even for SNVs and small indels, as determining whether a gene has any functional copies when aneuploidy is present requires finding a deleterious event on every copy of that gene. Fortunately, this is almost handled by the
A recommended SV/CNV normalisation scheme should be included. I'm a big fan of centre-aligning and calling the middle of any interval of ambiguity or micro-homology as a) it maximises the chance of SV and CNV calls actually matching coordinates, and b) depending on the orientation, left/right aligning a breakpoint will result in the other side being aligned the opposite way. Note that NCBI’s Variant Overprecision Correction Algorithm only works for isolated SNVs and indels and fails to resolve ambiguity on more complex events. For example, a del-with-ins event with a 50bp interval replaced by 5bp of random sequence will be incorrectly represented as S/W alignment of such events almost universally results in them being represented as a set of SNVs and smaller indels and the overprecision correction algorithm does not enforce a valid haplotype interpretation. The SNVs can be adjusted into one of the deletion calls which, given they are phased, doesn't make sense. Similarly, tandem duplications can be represented as insertions. Benchmarking SV callers on the CHM1/CHM13 data set was problematic due to the different representations. For example, when a 3xSINE sequence was expanded into a 4xSINE sequence, the long read callers reported an INS at the first SINE element, and the short read callers reported a DUP of the final SINE element. They both result in the same sequence, but they're reported as different event types and their positions are nowhere near overlapping.
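To make the centre-aligning idea concrete, here is a minimal sketch for the tandem-repeat case described above. It assumes interbase coordinates, exact inserted sequence, and a toy 4-base repeat unit standing in for a SINE; the function names are hypothetical and this is not the algorithm of any existing caller or of NCBI's VOCA. It rolls an insertion left and right through its interval of ambiguity and reports the midpoint, so an INS called at the first repeat copy and a DUP called at the last copy normalise to the same position.

```python
def apply_insertion(ref: str, pos: int, ins: str) -> str:
    """Insert `ins` at interbase position `pos` of `ref`."""
    return ref[:pos] + ins + ref[pos:]


def ambiguity_interval(ref: str, pos: int, ins: str) -> tuple:
    """Leftmost and rightmost interbase positions giving the same sequence."""
    left, rolled = pos, ins
    # Roll left while the base before the insertion matches its last base.
    while left > 0 and ref[left - 1] == rolled[-1]:
        rolled = rolled[-1] + rolled[:-1]
        left -= 1
    right, rolled = pos, ins
    # Roll right while the base after the insertion matches its first base.
    while right < len(ref) and ref[right] == rolled[0]:
        rolled = rolled[1:] + rolled[0]
        right += 1
    return left, right


def centre_align(ref: str, pos: int, ins: str) -> tuple:
    """Move the insertion to the middle of its interval of ambiguity."""
    left, right = ambiguity_interval(ref, pos, ins)
    centre = (left + right) // 2
    k = (centre - pos) % len(ins)  # rotation needed after shifting
    return centre, ins[k:] + ins[:k]


# Toy stand-in for the 3xSINE -> 4xSINE example (hypothetical sequence).
unit = "ACGT"
ref = "GG" + unit * 3 + "CC"
ins_call = (2, unit)    # "INS at the first element"
dup_call = (14, unit)   # "DUP of the final element", written as an insertion

print(centre_align(ref, *ins_call))
print(centre_align(ref, *dup_call))
```

Both calls land on the same centre-aligned position and rotated inserted sequence, which is the property that makes matching SV and CNV coordinates easier. Left or right alignment would also agree here; the centre is preferred in the comment above because it behaves symmetrically for both sides of a breakpoint.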
As with aneuploidy-aware phasing, sub-clonality complicates the model but is absolutely required for the specifications to be useful in a somatic setting. The proposed On the plus side, the decision to use an interbase coordinate system was absolutely the correct decision and it avoids a whole host of problems that are a pain to deal with in VCF. How is group membership of this specification determined? I'm more than happy to join and draft up the changes required to support SVs. |
@d-cameron Thanks for the interest and willingness to contribute! We'd love to have you join the calls (Mon 1600 UTC). I realize that this is a terrible time for you. We currently have 10-20 people on the call reliably, most in the US and Europe. We can pick up the substance of your proposals at a later date (and after some of the current construction dust has settled). |
We recently discussed how structural variants are annotated on knowledgebases at one of our oncology annotation meetings. Here's what we concluded:
Thank you for all the work you do! |
From a conversation with CIViC team today on fusion representation, capturing a few points shared by @obigriffith:
However, it sounded like if CIViC needed to represent a specific exon-exon junction between two transcripts (versus one of a set of exon-exon junctions that are collectively the subject of an evidence statement), this would be captured in the variant name, e.g. |
I'm a bit confused as to what is actually being represented by a gene fusion variant. Are we referring to the transcript-level products, or the breakpoint(s) in the underlying DNA enabling one or more fusion transcripts? If we're talking about both (or have use cases for both), then we need to distinguish between these two claims as there is a many to one mapping between fusion transcripts, and underlying rearranged DNA.
Note how I did not say breakpoints. The majority of TMPRSS2-ERG fusions are chromothriptically derived gene fusions with potentially hundreds of breakpoints involved in the originating event - it's not a simple translocation/deletion dichotomy. Complicating things further, a non-trivial percentage (30%+ for TMPRSS2-ERG) of these involve more than one breakpoint in the fusion itself. That is, TMPRSS2-(elsewhere)-ERG. It's not just single DNA fragments either - I've seen multiple cases of driver gene fusions involving 3+ underlying breakpoints. |
@d-cameron I encourage you (and other interested participants on this thread) to join in on our VR call tomorrow, where we will be discussing the varying levels of specificity our model will need to capture, including your comments above. |
Things to be aware of when designing model:
|
Living slide deck with notes on curation and strawman model: http://bit.ly/VR-SVdeck |
Per @d-cameron's statement above regarding complex rearrangements: If we want to further clarify chromothripsis, I think there is potential under this model:
I'm open to criticism on this. This is something our group put together for potentially classifying chromothripsis for mate pair sequencing and then abandoned since we are not planning on doing clinical exploratory whole genome analysis using that assay anytime soon. |
Hi Daniel- Can you please describe what these figures are showing? I
*think* I understand, but it seems to me that we should capture this
background info for a future requirements doc (a Google doc).
Generally, VR does anticipate the two-level model that you describe. I've
historically (in VMC) typically described these as observed and
representative variation, where observed variation is precise (to the
limits of the assay), whereas representative variation generalizes
observations. We're also considering rule-based variation, which is a kind
of representative variation, so this language likely needs to be revisited
(or at least agreed upon).
I'm glad to have your experience guiding our SV modeling.
…-Reece
On Thu, Aug 15, 2019 at 7:07 AM Daniel Cameron ***@***.***> wrote:
Apologies that I was not able to make the call.
From my perspective, we need a two-level model (I'm working on one for
VCFv4.3) that separates the low-level claims made by the callers (ie
break-junctions and CN segments), from the higher level claims that
associate them into events.
Take the following relatively simple example of chromothripsis:
[image: image]
<https://user-images.githubusercontent.com/6036536/63098082-6706de80-bfb5-11e9-8913-00b66b02db31.png>
Here we can fully resolve the chromothriptic event. There's a whole lot of
regions of CN gain/loss and a whole lot of SVs (NB: all SVs are 'important'
when reconstructing derivative chromosomes), but it's really just a single
event happening.
Another relatively simple event is the breakage-fusion-bridge cycle on the
COLO829T cell line that looks like the following:
[image: image]
<https://user-images.githubusercontent.com/6036536/63098342-0b892080-bfb6-11e9-8eb0-a76cd6af23a0.png>
It's just a few fold-back inversions amplifying chr3 but the fold-back
breakpoints involve 6, 10 and 12 as well. Again, this is a relatively
simple event with just a handful of breakpoints and CN segments involved. I
have samples where we have chromothripsis with 500+ breakpoints and can't
fully reconstruct the derivative chromosome but I still need to be able to
represent the partial reconstructions that we can do.
Another example of why it's important to separate the breakpoint/CN claims
from the event claims is for retrocopied genes. They look like the
following:
[image: image]
<https://user-images.githubusercontent.com/6036536/63098654-b00b6280-bfb6-11e9-8aa9-04643184beca.png>
Note how the SV breakpoints and CN changes all line up with the
intron/exon boundaries. There are about 15 of these prevalent in the
population but not in the reference and if you look at the variant
databases such as dbVar, they claim that the gene in question has
intronic deletions at almost every intron. At one level this is correct, as
the extra copy lacks introns, but at the same time, those DEL claims are
all incorrect as there are no actual deletions in the gene itself and having
a database full of them is completely misleading to users who want
to check the prevalence of mutations in their gene of interest.
@bpitel12 <https://github.com/bpitel12>
a) List genomic regions of copy number loss within complex segment
I don't see the need for a separate definition of a complex segment in the
specifications themselves.
i) potential to call out tumor suppressor genes within regions of loss
Biallelic inactivation is really what's important for most tumor
suppressors and that's commonly a combination of a SNV/indel and a
(partial) CN loss. I do see value in annotating the impact of variants
but it's not always as straightforward as a simple CN loss.
b) List genomic regions of copy number gain within complex segment
Again, I don't see the need for a complex segment definition. We're
possibly talking about similar things with different terminology. I'm
proposing the grouping of a set of SV/CN into an 'event'. Such a model
would not need to differentiate between simple and complex events but it
definitely needs to be more complex than a 'start/end complex region' style
of interval. A simple deletion would merely have a single breakpoint and CN
associated with it, whereas chromothripsis would have many of both.
The temporal association and overlaps between events would have to be
clarified though. For example, if a simple DEL is just a SV+CN segment,
nested deletions would be problematic as there would be 3 CN segments
relevant to the outer deleted region.
ii) potential to call out oncogenes in regions of copy number gain if
potentially significant.
Driver gene annotation is indeed useful, and amplified drivers are usually
contained within a single CN segment so that would usually work.
c) List coordinates/fusion partners signifying important structural
variants within chromothriptic region.
A non-trivial number of expressed gene fusions involve multiple
breakpoints. It's not just geneA->SV->geneB. It's rearrangements such as
geneA->SV->other location->SV->another location->SV->geneB that still
produce functional fusion products. Again, we're looking at a set of SV/CN
that are responsible for a gene fusion (the complex ones typically form
part of larger chromothriptic rearrangements).
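As an illustration of the two-level model argued for in the message above, the following is a rough sketch only. All class names, field names, and coordinates are hypothetical and are not taken from VCF, VR, or any existing tool. Low-level caller claims (break junctions and CN segments) are stored on their own, and an event merely groups them by identifier, so a simple deletion and a chromothripsis differ only in how many claims they reference.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class BreakJunction:
    """Low-level claim: two breakends joined (interbase coordinates)."""
    id: str
    chrom_a: str
    pos_a: int
    orient_a: int  # +1 or -1, hypothetical orientation encoding
    chrom_b: str
    pos_b: int
    orient_b: int


@dataclass(frozen=True)
class CopyNumberSegment:
    """Low-level claim: copy number over an interval."""
    id: str
    chrom: str
    start: int
    end: int
    copy_number: float


@dataclass
class Event:
    """Higher-level claim: a set of junctions and segments forming one event."""
    id: str
    label: str  # free text; no closed vocabulary is assumed
    junction_ids: List[str] = field(default_factory=list)
    segment_ids: List[str] = field(default_factory=list)


def unassigned(junctions, segments, events):
    """Ids of low-level claims that no event has interpreted yet."""
    used = {i for e in events for i in e.junction_ids + e.segment_ids}
    return sorted(c.id for c in list(junctions) + list(segments) if c.id not in used)


# A simple deletion: one junction plus one CN segment (made-up coordinates).
j1 = BreakJunction("bj1", "chr1", 1000, +1, "chr1", 2000, -1)
s1 = CopyNumberSegment("cn1", "chr1", 1000, 2000, 1.0)
# An additional junction that has not been assigned to any event.
j2 = BreakJunction("bj2", "chr3", 500, +1, "chr6", 900, +1)

deletion = Event("ev1", "deletion", ["bj1"], ["cn1"])
print(unassigned([j1, j2], [s1], [deletion]))
```

Keeping the event as a separate record means the caller's observations stay valid even when the interpretation changes, for example when a retrocopied gene is reclassified from a series of intronic deletions to a single insertion event.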
|
We now have our preprint up: https://www.biorxiv.org/content/10.1101/781013v1 @reece we now have proper documentation for our circos plots https://github.com/hartwigmedical/hmftools/blob/master/sv-linx/README_VIS.md#circos-panel as well as our event classifier/DNA-Seq fusion caller https://github.com/hartwigmedical/hmftools/tree/master/sv-linx @ahwagner The NCI fusion specifications only work for RNA. They don't work for DNA for the following reasons:
Here is a simple real-world example of what I'm talking about: TMPRSS2 exon 1 -> TMPRSS2 intron 1 in the wrong orientation -> ERG intron 1 = TMPRSS2 exon 1 to ERG exon 2 gene fusion. Either of these breakpoints in isolation does not give a gene fusion product, but combined (we also do phasing of SVs so we can tell they are cis SVs), you get a gene fusion. NB: It might be even messier than we have found, as n-way fusions involving more than two genes are theoretically possible (ie the other genome locations contain exons). We haven't found any in our data set but that's only because we haven't looked since our software only supports 2-way fusions. Another edge case we did not investigate is read-through transcription resulting from two genes being brought in close proximity to each other (but not fused per se). |
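The compound fusion described above can be sketched as a walk along a phased derivative chromosome. This is a deliberately simplified illustration, not the LINX algorithm: the `Segment` type and `predict_fusion` function are hypothetical, and it assumes that segments in the antisense orientation or without exons contribute nothing to the transcript and are simply spliced over.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Segment:
    """One piece of a derivative chromosome between consecutive breakpoints."""
    gene: Optional[str]      # gene overlapping the segment, if any
    exons: Tuple[int, ...]   # exon numbers fully contained in the segment
    sense: bool              # True if the gene is in transcribed orientation


def predict_fusion(path: List[Segment]) -> Optional[Tuple[str, int, str, int]]:
    """Return (5' gene, last 5' exon, 3' gene, first 3' exon) or None."""
    usable = [s for s in path if s.gene and s.sense and s.exons]
    if not usable:
        return None
    five = usable[0]
    for seg in usable[1:]:
        if seg.gene == five.gene:
            five = seg  # still inside the 5' partner
        else:
            return (five.gene, max(five.exons), seg.gene, min(seg.exons))
    return None


# TMPRSS2 exon 1 -> inverted TMPRSS2 intron 1 fragment -> ERG from intron 1 on.
# Exon numbers for ERG beyond exon 2 are placeholders for illustration.
tmprss2_start = Segment("TMPRSS2", (1,), True)
inverted_intron = Segment("TMPRSS2", (), False)
erg_rest = Segment("ERG", (2, 3, 4), True)

print(predict_fusion([tmprss2_start, inverted_intron, erg_rest]))
print(predict_fusion([tmprss2_start, inverted_intron]))
```

The second call shows why the breakpoints must be known to be in cis: the first junction on its own leads into an inverted intronic fragment and yields no fusion product in this toy model.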
Here's another real-world example that I don't think the proposed design handles: In this sample, we see a chromoplex event causing a TMPRSS2-ERG fusion via PTEN intron 1 (follow the blue line) whilst also breaking PTEN (since it was put back together partly on the yellow and partly on the purple chromatid), as well as a loss of PPP2R2A. We have a single event resulting in ~20 breakpoints with three outcomes, all clinically relevant. At minimum, a usable genomic rearrangement (I use this term due to the fundamentally interconnected relationship between SVs and CNVs) representation format needs to unambiguously allow (partial) derivative chromosomes to be defined and, preferably, be able to explicitly represent complex events. We are currently using custom csv files to represent all this. We could do it in VCF but we'd have to define a whole lot of custom fields to do it since the spec-defined fields don't cover any of this. |
These are beautiful illustrations, @d-cameron! I wanted to chime in and say that the last example above is what we see fairly frequently in our lab through mate pair sequencing. To report these variants, we have been using a hybrid of HGVS nomenclature and cytogenetic ISCN detailed system nomenclature, similar to (but not exactly) what is described here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4067557/ Here are a few publications from our group that may add to the wealth of information above on how complex variants can be described in human-readable(ish) form. I think we're all looking for ways to describe this a little more easily: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6822454/pdf/13039_2019_Article_455.pdf Thanks, team. |
@bpitel12 how do you find that notation scaling? From the looks of it, the detailed notation requires ~40-60 characters per breakpoint (rearrangements using the nomenclature of the linked paper). A 'simple' 100-break chromothripsis event requires ~5000 characters to specify and, if you follow it by breakage-fusion-bridge, it looks like you'd have to define the full derivative chromatid structure and specify the 10-fold amplified regions 10 times since you need to walk all the way through the derivative chromosome. |
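The scaling estimate above is simple arithmetic, sketched here with the figures quoted in the comment (40-60 characters per breakpoint, 100 breakpoints). The function name is hypothetical and the model is only a linear approximation.

```python
def notation_chars(n_breakpoints: int, chars_per_breakpoint: int) -> int:
    """Approximate length of a text description that lists every breakpoint."""
    return n_breakpoints * chars_per_breakpoint


low = notation_chars(100, 40)
mid = notation_chars(100, 50)
high = notation_chars(100, 60)
print(low, mid, high)
```

The midpoint of the range gives the ~5000 characters mentioned above. Regions that must be walked repeatedly, such as amplified segments after breakage-fusion-bridge, would add to this total.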
@d-cameron I appreciate your comment. We have never tried to annotate chromothripsis in this manner for that very reason. It is very difficult to be concise enough yet descriptive enough to effectively communicate the structure in text. I suppose if we had to we could provide the 5000 character long version as supplemental - but not sure who has the time/energy to sift through 5000 characters of structural description. In this case, we would likely supply a visual like you so nicely did in the string above. For communicating complex SVs, we often will provide some nomenclature for just the parts of the rearrangement that are clinically relevant. For example - imagine you have IGH inserted into the MYC region as part of a chromothriptic event on chromosome 8 in addition to MYC amplification. Maybe we can provide a visual similar to what you have shown above as well as some nomenclature that looks something like this (I apologize for not providing exact coordinates - in a time crunch for some other projects today, but wanted to respond and see if maybe you or others had some ideas on making this better): ins(14q32.33)(q24.1)(MYC+) ; 8pter-->cth8(q23qter)...(8q24.1(123,456,789)::14q32.33(89,123,456-->90,234,456)::8q24,1(123,567,890)... Thank you! |
I have trouble understanding the value - but also the consistency - of describing CTLPs* completely, as single events; especially since they won’t ever happen again, in the exact form. But then, one can provide the elements in a format that allows reconstruction, right? I mean, such a Circos plot is drawn from individual events, isn’t it? A representation of a CTLP has to:
So in a variant DB you would use a reference to the same callset/sample/experiment, whatever your flavour, with each fusion, CNV. In an evidence DB, you’d list the individual events with your ID’d variant. ————- |
It's from individual breakpoints and (allele-specific) copy numbers. 'event' is a loaded term. In my work, I define an 'event' as a transformation from one stable genomic configuration to another. Using this definition, a genome can have zero or more chromothripsis events, each of which will have many breakpoints. It also results in chromothripsis + breakage fusion bridge to be defined as a single event and, given that BFB occurs over multiple cell divisions, others may disagree with our event definition.
That approach won't handle compound fusion events where a fusion is the result of multiple breakpoints. Each breakpoint on its own won't make the fusion, but traversing the derivative chromosome across multiple breakpoints results in a functional fusion product. It is also possible to have 3-way fusions in which part of a third gene is between the 5' and 3' fusion partners.
The CNVs are the aggregate result across all derivative chromosomes. The CN and CN delta from a chromothripsis event is the sample copy number. Are you proposing deconvolution of the copy number into the constituent 'events', or breaking down by derivative chromosome? AFAIK, there is currently no software that does either of these things.
Phasing of SVs is extremely important (~15% of driver fusions are compound fusions - mostly TMPRSS2-*-ERG fusions).
I am indeed advocating for this and am currently working on incorporating this into VCF vNext. We're not at the point where we can use a closed vocabulary for the events, but being able to tie related breakpoints/CN changes together is important. |
Relevant meta-thread from hts-specs: samtools/hts-specs#465 |
Hi All! As a medical scientist I would find it beneficial to have two forms of fusion nomenclature: a short one and a long one. I think the short one should include information that is crucial for the clinical utility of the fusion. The longer form would include all details that help to identify the exact location of the fusion in the genome. I would expect the longer form to be included in the supplementary data of the medical report, and the shorter version to be included in the main report comment together with the clinical utility of the fusion. Whether we should comment on whether the fusion was “in-frame” or not depends on the context. From my experience, when I was dealing with DNA-seq results I would not report a structural variant as a fusion (especially for new fusion genes) if it was out of frame and I could not confirm it by RNA methods. However, reading frame information could be beneficial, for example, when the DNA-seq result was out of frame but RNA results showed that the final product was in frame (for instance in an exon skipping situation). Moving forward on this, we should distinguish DNA- and RNA-focused nomenclature. Some labs haven’t got RNA sequencing methods in place. When it comes to RNA nomenclature, Li's proposal looks nice and practical and I would definitely like to add the gene name to both forms. |
Shouldn't the downstream impacts be within the scope of Variant Annotation and not Variant Representation? |
@d-cameron +1 |
I think that @TanskaAnnna intended to write these comments in the context of some of the comments earlier in this thread (1, 2, 3, 4) pertaining to data elements relevant to the characterization of gene fusions. Like me, Ania is thinking about how categorical notions of fusions (e.g. fusions that are defined by hyperactivity of a specified functional domain) should be represented as subjects of annotations. This is a fuzzy line between VR and VA that we are still trying to work through, but categorical variation is a key use case for the VICC driver project (which primarily works with aggregated concepts like this) which we are working to support in VRS. As a However, as meta-threads like this can get a little busy, I am going to capture Ania's comments over at cancervariants/fusions to continue the discussion on short-form vs. long-form nomenclatures, and bring back the key findings from that effort to inform building of these concepts in VRS and VA. |
Apologies for confusion folks and thanks @ahwagner - that's exactly what I had in mind. Will continue on Salient elements of gene fusions issue. |
Shirley Li mentioned COSMIC on the VICC General Call today; not seeing the link on this thread, so adding it here to track: https://cancer.sanger.ac.uk/cosmic/fusion Also mentioned gnomAD fusions, which we have not yet looked at. |
This issue was marked stale due to inactivity. |
VR needs to have a path for representing translocations.
Allele is currently defined as a contiguous sequence change at a single location. Translocations and junctions are unlikely to fit in that model.
See also #23 and #51.