VRS 2.x and Derivative Molecules #461

ahwagner · 2024-01-31T22:07:09Z

ahwagner
Jan 31, 2024
Maintainer

The current Haplotype proposal for VRS 2.0 allows Adjacencies to accompany Alleles in the Haplotype members. Since Adjacency order is meaningful, the Haplotype members array is now ordered. However, this requires some additional rules around the expected order of Alleles.

@rrfreimuth recommended that we consider a model that does not impose ordering constraints on Alleles, perhaps as separate models. We are opening discussion on this point to collect opinions and ideas.

d-cameron · 2024-02-01T12:54:13Z

d-cameron
Feb 1, 2024
Collaborator

(Partial) derivative chromosome (i.e. haplotype) reconstruction does require variant ordering to be unambiguous. We could support both ordered and unordered haplotypes, but then we're changing our definition of a haplotype.

From my perspective, unordered is just an implicit linear ordering of the Alleles with respect to the reference. Is this not the case? When SVs are present, you can't assume that the standard intuitive textbook definition of one maternal + one paternal copy = 2 haplotypes. So what does haplotype actually mean then? For example, are de novo mutations on 2 different maternal copies of a trisomy chromosome considered the same haplotype? (No) What if there's one maternal chromosome but the region has been duplicated? (Yes?) Do the variants in the same duplicated region belong on the same haplotype or on different haplotypes? (same). None of this is made explicit in VRS and, given we have SV support, we need to define haplotype in more detail.

> Haplotypes are a specific combination of Alleles that are in-cis: occurring on the same physical molecule.

Note that our existing definition of haplotype as based on the same physical molecule means that all VRS haplotypes implicitly require that the caller has verified the absence of any structural rearrangement that could separate the variants onto different physical molecules. For example, if you have fully phased (T2T) SNVs & a balanced chromosomal translocation is present, you need to break up your haplotype blocks because they're on different physical molecules. Is this what we intend? There are other definitions out there. If I say I've got fully phased maternal/paternal haplotypes then I'm making a claim about the inheritance pattern of those variants, not about the relationship between the physical molecules they occur on. Without SVs, these two definitions mean the same thing. With SVs, they do not.

1 reply

ahwagner Mar 4, 2024
Maintainer Author

Hey @d-cameron, I agree with most of what you're saying here and appreciate you taking the time to write this all out. I apologize for the delay in my response, but this was a lengthy post and I wanted to wait until I had a large block of time to respond to all of it.

> From my perspective, unordered is just an implicit linear ordering of the Alleles with respect to the reference. Is this not the case?

Yes, Allele order is not meaningful in the current model, because of this implication. This means fewer constraints on use of the data model, and reduced potential for implementation error.
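The implicit ordering discussed here can be sketched in a few lines. This is only an illustration, not VRS code: the `Allele` tuple and its `sequence`, `start`, `end`, and `state` fields are hypothetical stand-ins for a VRS Allele and its location, and the sketch assumes the members do not overlap.

```python
from typing import NamedTuple


class Allele(NamedTuple):
    """Hypothetical stand-in for a VRS Allele: a state on a reference interval."""
    sequence: str  # reference sequence accession (illustrative value below)
    start: int     # interbase start coordinate
    end: int       # interbase end coordinate
    state: str     # observed sequence at the location


def reference_order(members):
    """Return members in the order implied by their reference coordinates."""
    return sorted(members, key=lambda a: (a.sequence, a.start, a.end))


a = Allele("chr1-example", 100, 101, "T")
b = Allele("chr1-example", 250, 251, "G")

# Either input order yields the same linear order along the reference.
print(reference_order([b, a]) == reference_order([a, b]))
```

Because the order can always be recovered from the coordinates, the order in which members are listed carries no extra information in this case.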

> When SVs are present, you can't assume that the standard intuitive textbook definition of one maternal + one paternal copy = 2 haplotypes. So what does haplotype actually mean then? For example, are de novo mutations on 2 different maternal copies of a trisomy chromosome considered the same haplotype? (No) What if there's one maternal chromosome but the region has been duplicated? (Yes?) Do the variants in the same duplicated region belong on the same haplotype or on different haplotypes? (same).

Fully agreed; I know we see the problem, and the anticipated use of the Haplotype class with SVs, the same way.

> Note that our existing definition of haplotype as based on the same physical molecule means that all VRS haplotypes implicitly require that the caller has verified the absence of any structural rearrangement that could separate the variants onto different physical molecules. For example, if you have fully phased (T2T) SNVs & a balanced chromosomal translocation is present, you need to break up your haplotype blocks because they're on different physical molecules. Is this what we intend?

VRS is used to describe the state of a sequence; if you perform an assay and find that SNPs A, B, and C are all co-located on the same molecule (what I think of as in-cis), that's a Haplotype. If A, B, and C are typically found in-cis on a fully-phased reference assembly, that does not imply they are in-cis when present in an assayed sample. IMO, calls made about that sample should use the Haplotype model only to convey a message about A, B, and C being in-cis in that sample. Along that line of reasoning, if there is insufficient evidence to support them being in-cis, and especially if there is evidence to the contrary (e.g. a translocation that is known to separate them), then they should simply be reported as Alleles (or alternate Haplotype blocks as informed by the assay). My opinion is moot, though, as VRS does not convey how the message was made or the context it is used in, only what the message says. In VRS, the message conveyed is the computational definition of Haplotype:

> A set of non-overlapping Allele members that co-occur on the same molecule.

> There are other definitions out there.

The computational definition was derived from the biological definitions at ISOGG, SequenceOntology, and GENO. The recurring themes of how it is used in those definitions include in-cis presence on chromosomes and genetic linkage, but I think you have rightly pointed out through example that the VRS information model only conveys information about the in-cis part. It is also relaxed enough to consider derived chromosome "haplotypes" that conflict with the referenced source definitions. I wonder if the use of a biological term as a computational data structure is part of the issue here.

I'm going to make a separate thread with a model proposal that addresses this.

Mrinal-Thomas-Epic · 2024-02-01T14:30:52Z

Mrinal-Thomas-Epic
Feb 1, 2024
Maintainer

I'm in favor of a single, ordered haplotype model, but am interested in hearing more about the perceived benefits of creating a separate, unordered haplotype model. My primary concerns are that it would create unnecessary complexity and more situations where there are multiple ways to represent the same haplotype.

Even if we defined an unordered haplotype model, we can't escape the need for an implicit ordering if we want unordered haplotypes to have consistent identifiers, regardless of the order in which their alleles are specified. I think in VRS 1.3 this was done by ordering arrays of digests by Unicode character set values.
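A minimal sketch of that idea follows, under loud assumptions: the serialized form here is just a JSON array of sorted member digests, which is a simplification and not the full VRS serialization of a Haplotype, and the member digests are placeholder strings. The `sha512t24u` helper follows the GA4GH truncated-digest convention (first 24 bytes of SHA-512, base64url-encoded); `unordered_identifier` is a hypothetical name.

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """Truncated SHA-512 digest (first 24 bytes), base64url-encoded."""
    digest = hashlib.sha512(blob).digest()[:24]
    return base64.urlsafe_b64encode(digest).decode("ascii")


def unordered_identifier(member_digests):
    """Digest a collection of member digests independently of input order.

    Simplified assumption: the blob is a JSON array of the sorted digests,
    not the full VRS serialization.
    """
    # Python compares str values code point by code point, so sorted()
    # orders the digests by Unicode character values.
    canonical = sorted(member_digests)
    blob = json.dumps(canonical, separators=(",", ":")).encode("utf-8")
    return sha512t24u(blob)


# Placeholder digests, not real VRS identifiers.
first = unordered_identifier(["digestB", "digestA", "digestC"])
second = unordered_identifier(["digestC", "digestA", "digestB"])
print(first == second)
```

The sort is the implicit ordering: callers may list members in any order, but the identifier is always computed over one canonical order.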

1 reply

ahwagner Mar 4, 2024
Maintainer Author

Hey Mrinal. I agree that we should not have two models with a shared class name, e.g. Haplotype (ordered) and Haplotype (unordered). I'm not 100% sure that is what @rrfreimuth meant, though I will leave it to him to comment further on that.

I do think he had a good point, though, in that ordering Alleles in a Haplotype will add unnecessary implementation complexity and increase the chance of errors. I think there is an obvious solution here, though, and will craft a separate thread about it.

ahwagner · 2024-03-18T21:59:46Z

ahwagner
Mar 18, 2024
Maintainer Author

A draft model of 4 proposed changes from VRS 1.x to 2.x to support structural variation:

1. The Haplotype class gets renamed to CisPhasedBlock to differentiate this concept from the notions of ancestry and linkage typically associated with Haplotype (which led to some of the concerns stated in this comment).
2. A new DerivativeSequence class is developed that allows construction of novel contiguous sequences from multiple Adjacencies and CisPhasedBlocks.
3. The previously-described Adjacency class is added, which allows for the representation of two adjacent sequences and an optional intervening linker sequence expression.
4. A new SequenceTerminus class is added, which allows for the specification of a single sequence location representing the end of a sequence.

These changes address the concerns about ordered / unordered variant members by creating separate data classes for cis-phased variants and for derivative sequences.
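For readers who find a data-structure view easier to follow, the four classes might be sketched roughly as below. This is an illustrative reading of the draft, not the VRS schema: the field names `members` and `components` appear elsewhere in this thread, while `adjoined`, `linker`, `location`, the string-typed locations, and the placeholder `Allele` class are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple, Union


@dataclass(frozen=True)
class Allele:
    """Placeholder: identified only by an opaque id in this sketch."""
    id: str


@dataclass(frozen=True)
class CisPhasedBlock:
    """Alleles that co-occur on one molecule; member order carries no meaning."""
    members: FrozenSet[Allele]


@dataclass(frozen=True)
class SequenceTerminus:
    """A single sequence location marking the end of a sequence."""
    location: str


@dataclass(frozen=True)
class Adjacency:
    """Two adjoined sequence locations with an optional linker sequence."""
    adjoined: Tuple[str, str]
    linker: Optional[str] = None


@dataclass(frozen=True)
class DerivativeSequence:
    """An ordered construction of a novel contiguous sequence."""
    components: Tuple[Union[Adjacency, CisPhasedBlock], ...]


block = CisPhasedBlock(members=frozenset({Allele("allele-1"), Allele("allele-2")}))
derived = DerivativeSequence(components=(
    Adjacency(adjoined=("loc-A", "loc-B")),
    block,
    Adjacency(adjoined=("loc-C", "loc-D"), linker="ACGT"),
))
print(len(derived.components))
```

The unordered container is modeled as a `frozenset` and the ordered one as a `tuple`, so the distinction between the two lives in the type rather than in a convention the caller has to follow.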

4 replies

ahwagner Mar 18, 2024
Maintainer Author

@larrybabb DerivativeSequence object should support Allele as well (for when there is only one Allele and not a cis-phased block).

cmprocknow Mar 22, 2024

Would this mean a CisPhasedBlock should have at least 2 Allele members?

Mrinal-Thomas-Epic Mar 22, 2024
Maintainer

What would be the advantage of using CPB with multiple alleles in DerivativeSequence, rather than putting multiple Alleles in a row in DerivativeSequence.components?

What is the "Block" in CisPhasedBlock? Since the class only contains Alleles, could we call it CisPhasedAlleles?

ahwagner Mar 23, 2024
Maintainer Author

@cmprocknow yes, a CisPhasedBlock should have at least 2 Allele members.

@Mrinal-Thomas-Epic:

> What would be the advantage of using CPB with multiple alleles in DerivativeSequence, rather than putting multiple Alleles in a row in DerivativeSequence.components?

There are three advantages to this approach. The first of these is semantics; the members of a Haplotype (VRS 1.3) / CPB (Draft VRS 2.0) represent allele co-occurrence on a molecule, and the components of a DerivativeSequence represent a derivative molecule composed from molecular adjacencies, which may include variants that reside on those component molecules. Separating out these classes allows us to semantically represent these distinct scenarios explicitly, instead of implicitly by the components used. As I write this, I wonder if we should be calling this class a DerivativeMolecule instead. 🤔

The second advantage is that by separating these classes out, implementations have no need to validate that the order of elements matches a convention (e.g. ordering Alleles in increasing or decreasing coordinate space to agree with coordinates used by Adjacency components). Instead, order is always meaningful (or not meaningful) as defined by the containing object. This addresses the concern raised by @rrfreimuth.

The third advantage is that multiple Alleles in a CPB can be referenced as a unit in a way that is identical to the Haplotype model from VRS 1.3, meaning that the identifier for those CPBs can be used as a drop-in replacement when describing CPBs that exist on DerivativeSequences.
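The second and third points can be illustrated with a small sketch. It is not VRS code: the two helper functions and all identifier strings are hypothetical, and members and components are reduced to opaque ids.

```python
def cis_phased_block_key(allele_ids):
    """Unordered container: the same alleles in any order compare equal."""
    return frozenset(allele_ids)


def derivative_sequence_key(component_ids):
    """Ordered container: component order is part of the meaning."""
    return tuple(component_ids)


# Member order does not matter for a cis-phased block.
cpb_a = cis_phased_block_key(["allele-1", "allele-2"])
cpb_b = cis_phased_block_key(["allele-2", "allele-1"])
print(cpb_a == cpb_b)

# Component order does matter for a derivative sequence. The block is
# referenced as a single unit ("cpb-1"), so its members need no ordering.
seq_a = derivative_sequence_key(["adjacency-1", "cpb-1", "adjacency-2"])
seq_b = derivative_sequence_key(["adjacency-2", "cpb-1", "adjacency-1"])
print(seq_a == seq_b)
```

No validation step is needed in either case, since each container defines on its own whether order is meaningful.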

I also put together an illustration of these second and third points; hopefully this helps:

[illustration]

> What is the "Block" in CisPhasedBlock? Since the class only contains Alleles, could we call it CisPhasedAlleles?

I chose "Block" since an instance of the class should represent a singular entity (e.g. "A haplotype" / "A cis phased block", not "A cis phased alleles"). Block is used elsewhere in the community to refer to similar concepts, e.g. haplotype blocks [1], [2] and phase blocks [3]. Other possible terms for this unit could include "Set" (e.g. VCF Phase Set) or "Group" (generic).

If we think it is useful to forward-constrain this class to only Alleles, we can consider adding "Allele" to the class name; this might look like CisPhasedAlleleBlock or something like that. The tradeoff for this approach is the extra message space it would use in the type field when representing these concepts as JSON, so I think we should make sure doing so would help disambiguate it from other envisioned VRS classes. My initial feeling is that we should not change the name to include "Allele", as I do not see enough benefit to justify the tradeoff, but if you (or anyone else reading this) have reason to think we should add Allele to the class name (or make any other changes to it), please do indicate so here.

larrybabb · 2024-06-13T17:58:52Z

larrybabb
Jun 13, 2024
Maintainer

If you are looking at the 2.x readthedocs specification at vrs.ga4gh.org, the DerivativeMolecule concept under MolecularVariation will point you back to this discussion.

0 replies

larrybabb · 2024-07-08T21:06:52Z

larrybabb
Jul 8, 2024
Maintainer

@ahwagner As I'm converting the 1.3 sv_haplotype test, I noticed that we do not have Allele as one of the oneOf options for the DerivativeSequence.components array. Based on my reading/understanding of the above, we must allow Allele to be a component when the ordering of more than one Allele in a row is important or when there is only one Allele between Adjacencies. We require at least 2 members in a CisPhasedBlock, so a single Allele can NOT be described by a CisPhasedBlock. Please verify if this is correct.

1 reply

ahwagner Jul 31, 2024
Maintainer Author

Consensus from the 7/30 call was to also allow Allele as a DerivativeSequence component.
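The two rules settled in this thread (a CisPhasedBlock needs at least 2 Allele members, and a single Allele may appear directly as a component) could be checked along these lines. This is a sketch under assumptions, not the VRS schema or its validation code: the classes are reduced to opaque ids, and the error messages and the `Component` alias are illustrative.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Allele:
    id: str


@dataclass(frozen=True)
class Adjacency:
    id: str


@dataclass(frozen=True)
class CisPhasedBlock:
    members: Tuple[Allele, ...]

    def __post_init__(self):
        # A block describes co-occurrence, so it needs at least 2 Alleles.
        if len(self.members) < 2:
            raise ValueError("CisPhasedBlock requires at least 2 Allele members")


Component = Union[Adjacency, Allele, CisPhasedBlock]


@dataclass(frozen=True)
class DerivativeSequence:
    components: Tuple[Component, ...]

    def __post_init__(self):
        # Allele is accepted alongside Adjacency and CisPhasedBlock.
        for component in self.components:
            if not isinstance(component, (Adjacency, Allele, CisPhasedBlock)):
                raise TypeError(f"unsupported component: {component!r}")


# A single Allele between two Adjacencies is placed directly in components.
derived = DerivativeSequence(components=(
    Adjacency("adjacency-1"), Allele("allele-1"), Adjacency("adjacency-2"),
))
print(len(derived.components))
```

Wrapping that single Allele in a CisPhasedBlock would raise `ValueError` here, which is why the components array has to admit Allele directly.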
