Implement support for transcript-based locations #199
Comments
This issue was marked stale due to inactivity. |
During the VR call on October 5, several ideas related to this topic were reviewed. Slides were posted here. At the end of the call, feedback was requested. This very long comment is my attempt to capture some of my thinking on this topic. Guiding principles:
I recognize those principles may appear academic to some. There is a necessary balance to be struck with the pragmatic need to move things forward. Comments on proposed models
Model 2 (Alex):
My thoughts
I believe the counting system defined by the From the perspective of the abstract parent classes: Following that pattern, we can then create: Note that Similarly, Sidebar: A
The clarification of Since Given this definition, and the one above for My struggle with I am uncertain of the difference between AnnotatedSequence (Model 2) and a Summary (TL;DR)
|
After 2 breaks and several national holidays, I finally worked my way through Bob's comment. :-) Let's please drop discussion of imaginary sequences. It's incidental to the model. I will remove it from future discussions. I care (very strongly) about one design goal: There should be only one notion of a collection of exons on a sequence. Just as Allele can be defined on any sequence, and just as one Allele might be used to derive another, a collection of exons on a specific sequence should have only one representation irrespective of sequence, and might be related to a similarly-structured set of exons on another sequence. |
Thanks for sharing your thoughts @rrfreimuth. I'll put my responses here in the hope it will facilitate additional feedback. I appreciate the call for semantic consistency, particularly as it relates to I like the idea of As to As to @reece's response to your comment, your point about imaginary intronic coordinates was similar to earlier comments I made in a leads call earlier this week. To explain his comment about it being incidental a little further, the notion of offset-based There is an unresolved sticking point between the two models, however: can the Transcript class be either a contiguous set of exons/cds on an RNA sequence or the discontiguous set of template exons/cds on a genomic sequence? Or is the Transcript class only the exons/cds on an RNA sequence, and the other is a distinct concept? The discussion on this point has gone quite deep, and while I won't rehash it here, suffice it to say that we're still working it through. However, I am beginning to think that our struggles to converge on a common model are deeper than that. I think we have been glossing over the issue that However, I am beginning to see a path where we model these complex entities as The good news is that the VR leads have collectively identified that resolving this issue is non-blocking, and that we can advance in parallel on our efforts to model |
This issue was marked stale due to inactivity. |
If there is a transcript representation that knows about the CDS, it would be great to be able to indicate ribosomal frameshifting sites (translational slippage). Here is an example of how NCBI currently represents this: https://www.ncbi.nlm.nih.gov/nuccore/NM_001301302.1:
|
Glad you raised this. How do you imagine that this would be used in practice? Would you report variation in the shifted or unshifted coordinates? Is the shifting a property of the transcript, or a property of the coordinates on the transcript? If shifting is a property of the transcript, that would suggest an information model where a transcript consists of: a sequence, a set of exons on that sequence, an optional CDS <start,end>, and a new shift (signed int?). Is that what you had in mind? |
Ribosomal slippage happens in the middle of the mRNA, during translation. I would like to describe variants relative to their protein effect. To get this right, it will be important to be precise about exactly where the variant is located. The variant could be in the first part of the protein (the unshifted region), in the second part (the shifted region), or perhaps even overlap the site where the shift happens. (Note: I believe there are also examples with multiple shifts in one mRNA, though I am not sure whether they have been observed in human.) I agree: if shifting is represented as a property of a transcript, the transcript would have a sequence, exons, and a list of CDS <start, end> segments. I don't think it needs a signed int, since the CDS <start, end> positions would already include this. The example above refers to this as |
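To make the "list of CDS <start, end>" idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the class name `Transcript`, its fields, the helper `cds_segments_containing`, the 0-based half-open interval convention, and the example coordinates are assumptions, not part of any existing VRS schema. It shows how a position could be classified as falling in the unshifted segment, the shifted segment, or the slippage site itself (where adjacent segments overlap).

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: intervals are 0-based, half-open (start, end) on the transcript sequence.
Interval = Tuple[int, int]


@dataclass(frozen=True)
class Transcript:
    """Hypothetical transcript model: a sequence, its exons, and a list of CDS segments."""
    sequence_id: str                    # identifier of the transcript sequence
    exons: Tuple[Interval, ...]         # exon intervals on that sequence
    cds_segments: Tuple[Interval, ...]  # one segment normally; several when slippage occurs


def cds_segments_containing(tx: Transcript, pos: int) -> List[int]:
    """Return the indices of every CDS segment that contains position `pos`.

    An empty list means the position is outside the CDS. Two indices mean the
    position lies where adjacent segments overlap, i.e. at a slippage site
    where a base is read in both frames.
    """
    return [i for i, (start, end) in enumerate(tx.cds_segments) if start <= pos < end]


# Made-up coordinates: the second segment re-reads the last base of the first,
# which is how a -1 slip could be encoded without a separate signed shift field.
tx = Transcript(
    sequence_id="example:tx1",
    exons=((0, 500),),
    cds_segments=((10, 100), (99, 300)),
)
```

With this encoding, the signed shift does not need to be stored: it can be derived from how consecutive segments overlap or leave a gap.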
This issue was marked stale due to inactivity. |
This should be revisited with the SA team with the upcoming gene and transcript models. |
Transcript-based coordinates depend on at least the exon structure. Coding positions also require the CDS start and end.
Proposal: Implement TranscriptLocation, à la HGVS, by creating a class that stores coordinates on a specific transcript. The transcript reference should probably be a CURIE.
Ideally, the transcript reference will also depend on a computed digest for the transcript (as UTA does), to de-dupe based on reference sequence, exon structure, and CDS start/end.
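A rough sketch of what the proposal could look like in Python follows. It is not an actual VRS or UTA implementation: the names `TranscriptLocation` and `transcript_digest`, the field layout, the JSON serialization, and the choice of a truncated SHA-512 digest are all assumptions made for illustration. UTA's real digest algorithm may differ.

```python
import base64
import hashlib
import json
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Assumption: intervals are 0-based, half-open (start, end).
Interval = Tuple[int, int]


def transcript_digest(
    sequence_id: str,
    exons: Sequence[Interval],
    cds: Optional[Interval] = None,
) -> str:
    """Hypothetical digest over reference sequence, exon structure, and CDS start/end.

    Exons are sorted before hashing so that two descriptions of the same
    structure produce the same digest regardless of input order.
    """
    payload = {
        "sequence_id": sequence_id,
        "exons": sorted(list(e) for e in exons),
        "cds": list(cds) if cds is not None else None,
    }
    # Canonical serialization: sorted keys, no insignificant whitespace.
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    # Truncate SHA-512 to 24 bytes and base64url-encode (32 characters, no padding).
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


@dataclass(frozen=True)
class TranscriptLocation:
    """Hypothetical location class: coordinates on a specific transcript."""
    transcript: str  # CURIE-style reference, e.g. "refseq:NM_001301302.1"
    start: int
    end: int

    def __post_init__(self) -> None:
        if ":" not in self.transcript:
            raise ValueError("transcript must be a CURIE of the form prefix:reference")
        if not 0 <= self.start <= self.end:
            raise ValueError("expected 0 <= start <= end")


# Made-up exon and CDS coordinates, used only to exercise the functions.
d1 = transcript_digest("example:seq1", [(0, 100), (200, 350)], cds=(20, 320))
d2 = transcript_digest("example:seq1", [(200, 350), (0, 100)], cds=(20, 320))
loc = TranscriptLocation("refseq:NM_001301302.1", 10, 20)
```

Here `d1` and `d2` are equal because the exon order is normalized before hashing, which is the de-duplication behavior described above. Changing the CDS bounds or the reference sequence yields a different digest.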