Commit

Merge pull request #418 from ga4gh/main
v1.3 release candidate
ahwagner authored Apr 16, 2023
2 parents 8a02415 + c4595ae commit ca83d84
Showing 23 changed files with 1,341 additions and 896 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -20,7 +20,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip setuptools
pip install -r .requirements.txt
pip install --pre -r .requirements.txt
- name: Test with pytest
run: |
7 changes: 4 additions & 3 deletions .requirements.txt
@@ -1,7 +1,8 @@
pytest
python-jsonschema-objects>=0.3,<=0.3.10
python-jsonschema-objects>=0.4.0
jsonschema==3.2.0
ipython
pyyaml
ga4gh.gks.metaschema>=0.1.1
sphinx ~= 3.5
ga4gh.gks.metaschema==0.2.0rc4
sphinx ~= 4.5
sphinx-rtd-theme ~= 1.2
96 changes: 53 additions & 43 deletions docs/source/appendices/design_decisions.rst
@@ -32,11 +32,11 @@ Allele Rather than Variant
The most primitive sequence assertion in VRS is the :ref:`Allele`
entity. Colloquially, the words "allele" and "variant" have similar
meanings and they are often used interchangeably. However, the VR
contributors believe that it is essential to distinguish the state of
the sequence from the change between states of a sequence. It is
contributors assert that it is essential to distinguish between the *state of*
a reference sequence from the *change from* a reference sequence. It is
imperative that precise terms are used when modelling data. Therefore,
within VRS, Allele refers to a state and "variant" refers to the change
from one Allele to another.
within VRS, "allele" refers to a state of a reference sequence and "variant" refers to a change
from a reference sequence.

The word "variant", which implies change, makes it awkward to refer to
the (unchanged) reference allele. Some systems will use an HGVS-like
@@ -45,45 +45,6 @@ when referring to an unchanged residue. In some cases, such "variants"
are even associated with allele frequencies. Similarly, a predicted
consequence is better associated with an allele than with a variant.

.. _should-normalize:

Implementations should normalize
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

VRS STRONGLY RECOMMENDS that Alleles be :ref:`normalized
<normalization>` when generating :ref:`computed identifiers
<computed-identifiers>`. The rationale for recommending, rather than
requiring, normalization is grounded in dual views of Allele objects
with distinct interpretations:

* Allele as minimal representation of a change in sequence. In this
view, normalization is a process that makes the representation
minimal and unambiguous.

* Allele as an assertion of state. In this view, it is reasonable to
want to assert state that may include (or be composed entirely of)
reference bases, for which the normalization process would alter the
intent.

Although this rationale applies only to Alleles, it may have have
parallels with other VRS types. In addition, it is desirable for all
VRS types to be treated similarly.

Furthermore, if normalization were required in order to generate
:ref:`computed-identifiers`, but did not apply to certain instances of
VRS Variation, implementations would likely require secondary
identifier mechanisms, which would undermine the intent of a global
computed identifier.

The primary downside of not requiring normalization is that Variation
objects might be written in non-canonical forms, thereby creating
unintended degeneracy.

Therefore, normalization of all VRS Variation classes is optional in
order to support the view of Allele as an assertion of state on a
sequence.



.. _fully-justified:

@@ -113,6 +74,55 @@ occurs in a low-complexity region, but rather describes the final and
unambiguous state of the resultant sequence.


.. _should-normalize:

Implementations should normalize Alleles
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

VRS STRONGLY RECOMMENDS that Alleles be :ref:`normalized
<normalization>` when generating :ref:`computed identifiers
<computed-identifiers>` unless there is compelling reason to do
otherwise. Those reasons are the subject of this section.

:ref:`Allele Normalization <normalization>` is the process of
comparing a span of reference sequence to a sequence state (often the
alternative sequence) and resolving that span to an unambiguous form. The fully-justified Allele normalization in VRS consists of two steps: trimming
and shuffling. In the trimming step, common flanking prefix and
suffix sequences are removed. For example, a CAG-to-CTG Allele would
be trimmed to merely A-to-T, with the position adjusted accordingly.
The resulting sequences fall into one of four cases:

1. Both trimmed sequences are empty: The Allele refers to reference
state.
2. Both trimmed sequences are non-empty: The Allele is a substitution
(perhaps multi-residue).
3. Only the reference sequence is empty: The Allele is a net insertion.
4. Only the state sequence is empty: The Allele is a net deletion.
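
The trimming step and the four cases can be sketched in Python as
follows. This is only an illustration, not the VRS reference
implementation: the function names ``trim`` and ``classify`` are
hypothetical, ``start`` is assumed to be an inter-residue start
position, and trimming the prefix before the suffix is an arbitrary
choice made for this sketch.

```python
from typing import Tuple


def trim(ref: str, state: str, start: int) -> Tuple[str, str, int]:
    """Remove common flanking prefix and suffix from two sequences.

    Hypothetical helper: `start` is assumed to be the inter-residue
    start of `ref` and is advanced by the length of the trimmed prefix.
    """
    # Count the residues shared at the left end.
    prefix = 0
    while prefix < len(ref) and prefix < len(state) and ref[prefix] == state[prefix]:
        prefix += 1
    ref, state, start = ref[prefix:], state[prefix:], start + prefix

    # Count the residues shared at the right end of what remains.
    suffix = 0
    while suffix < len(ref) and suffix < len(state) and ref[-1 - suffix] == state[-1 - suffix]:
        suffix += 1
    if suffix:
        ref, state = ref[:-suffix], state[:-suffix]
    return ref, state, start


def classify(ref: str, state: str) -> str:
    """Map trimmed sequences onto the four cases listed above."""
    if not ref and not state:
        return "reference agreement"  # case 1
    if ref and state:
        return "substitution"         # case 2
    if not ref:
        return "insertion"            # case 3
    return "deletion"                 # case 4


# The CAG-to-CTG example from the text, placed at an arbitrary position 10.
trimmed_ref, trimmed_state, new_start = trim("CAG", "CTG", 10)
print(trimmed_ref, trimmed_state, new_start, classify(trimmed_ref, trimmed_state))
```

Only cases 3 and 4 would go on to a shuffling step, which this sketch
does not attempt.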

When the Allele refers to a reference state (case 1), trimming would
reduce the variant to a null change. However, reduction to a null
state would make it impossible to refer to a specific span of
reference sequence. In order to permit users to refer to spans of
reference sequence, VRS does not require normalizing reference
agreement Alleles.

The shuffling step applies only when the reference or the state
sequences are empty (cases 3 and 4). When these occur in the context
of repeating reference sequence that matches the inserted or deleted
sequence, the Allele may be shuffled left and right to identify the
fully-justified location of the variation. (See :ref:`normalization`
for details.)

In rare cases, data originators might have reason to associate an
annotation with a specific repeating unit in the context of repeated
sequence. In order to support this case, normalization is not
strictly required.

Most users will normalize most Alleles. Normalization should be
skipped only when normalizing would decrease the intended precision
of an Allele.


.. _inter-residue-coordinates-design:

Inter-residue Coordinates
123 changes: 0 additions & 123 deletions docs/source/appendices/future_plans.rst
@@ -96,129 +96,6 @@ Under consideration. See https://github.com/ga4gh/vrs/issues/28.
t(9;22)(q34;q11) in BCR-ABL


.. _genotype:

Genotype
########

The genetic state of an organism, whether complete (defined over the
whole genome) or incomplete (defined over a subset of the genome).

**Computational definition**

A list of Haplotypes.

**Information model**

.. list-table::
:class: reece-wrap
:header-rows: 1
:align: left
:widths: auto

* - Field
- Type
- Limits
- Description
* - _id
- :ref:`CURIE`
- 0..1
- Variation Id; MUST be unique within document
* - type
- string
- 1..1
- Variation type; MUST be set to '**Genotype**'
* - completeness
- enum
- 1..1
- Declaration of completeness of the Haplotype definition.
Values are:

* UNKNOWN: Other Haplotypes may exist.
* PARTIAL: Other Haplotypes exist but are unspecified.
* COMPLETE: The Genotype declares a complete set of Haplotypes.

* - members
- :ref:`Haplotype`\[] or :ref:`CURIE`\[]
- 0..*
- List of Haplotypes or Haplotype identifiers; length MUST agree
with ploidy of genomic region


**Implementation guidance**

* Haplotypes in a Genotype MAY occur at different locations or on
different reference sequences. For example, an individual may have
haplotypes on two population-specific references.
* Haplotypes in a Genotype MAY contain differing numbers of Alleles or
Alleles at different Locations.

**Notes**

* The term "genotype" has two, related definitions in common use. The
narrower definition is a set of alleles observed at a single
location and with a ploidy of two, such as a pair of single residue
variants on an autosome. The broader, generalized definition is a
set of alleles at multiple locations and/or with ploidy other than
two.The VRS Genotype entity is based on this broader definition.
* The term "diplotype" is often used to refer to two haplotypes. The
VRS Genotype entity subsumes the conventional definition of
diplotype. Therefore, the VRS model does not include an explicit
entity for diplotypes. See :ref:`this note
<genotypes-represent-haplotypes-with-arbitrary-ploidy>` for a
discussion.
* The VRS model makes no assumptions about ploidy of an organism or
individual. The number of Haplotypes in a Genotype is the observed
ploidy of the individual.
* In diploid organisms, there are typically two instances of each
autosomal chromosome, and therefore two instances of sequence at a
particular location. Thus, Genotypes will often list two
Haplotypes. In the case of haploid chromosomes or
haploinsufficiency, the Genotype consists of a single Haplotype.
* A consequence of the computational definition is that Haplotypes at
overlapping or adjacent intervals MUST NOT be included in the same
Genotype. However, two or more Alleles MAY always be rewritten as an
equivalent Allele with a common sequence and interval context.
* The rationale for permitting Genotypes with Haplotypes defined on
different reference sequences is to enable the accurate
representation of segments of DNA with the most appropriate
population-specific reference sequence.

**Sources**

SO: `Genotype (SO:0001027)
<http://www.sequenceontology.org/browser/current_svn/term/SO:0001027>`__
— A genotype is a variant genome, complete or incomplete.

.. _genotypes-represent-haplotypes-with-arbitrary-ploidy:

.. note:: Genotypes represent Haplotypes with arbitrary ploidy
The VRS defines Haplotypes as a list of Alleles, and Genotypes as
a list of Haplotypes. In essence, Haplotypes and Genotypes represent
two distinct dimensions of containment: Haplotypes represent the "in
phase" relationship of Alleles while Genotypes represents sets of
Haplotypes of arbitrary ploidy.

There are two important consequences of these definitions: There is no
single-location Genotype. Users of SNP data will be familiar with
representations like rs7412 C/C, which indicates the diploid state at
a position. In the VRS, this is merely a special case of a
Genotype with two Haplotypes, each of which is defined with only one
Allele (the same Allele in this case). The VRS does not define a
diplotype type. A diplotype is a special case of a VRS Genotype
with exactly two Haplotypes. In practice, software data types that
assume a ploidy of 2 make it very difficult to represent haploid
states, copy number loss, and copy number gain, all of which occur
when representing human data. In addition, assuming ploidy=2 makes
software incompatible with organisms with other ploidy. The VRS
makes no assumptions about "normal" ploidy.

In other words, the VRS does not represent single-position
Genotypes or diplotypes because both concepts are subsumed by the
Allele, Haplotype, and Genotypes entities.



.. _GitHub issue: https://github.com/ga4gh/vrs/issues
.. _genetic variation: https://en.wikipedia.org/wiki/Genetic_variation

6 changes: 2 additions & 4 deletions docs/source/impl-guide/computed_identifiers.rst
@@ -119,9 +119,7 @@ If the object is an instance of a VRS class, implementations MUST:
* ensure that objects are referenced with identifiers in the
``ga4gh`` namespace
* replace each nested :term:`identifiable object` with their
corresponding *digests*. (Note: Attributes of some objects, such
as :ref:`CopyNumber`, permit a mix of identifiable and
non-identifiable values.)
corresponding *digests*.
* order arrays of digests and ids by Unicode Character Set values
* filter out fields that start with underscore (e.g., `_id`)
* filter out fields with null values
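
Three of the steps listed above (ordering arrays of digests and ids,
dropping underscore-prefixed fields, and dropping null fields) could
look roughly like this in Python. It is a partial sketch only: the
function name ``prune_and_order`` and the ``unordered_keys`` parameter
are hypothetical, it handles only the top level of a dict, and it does
not replace nested objects with digests or cover any step outside this
excerpt.

```python
from typing import Any, Dict, Set


def prune_and_order(obj: Dict[str, Any], unordered_keys: Set[str]) -> Dict[str, Any]:
    """Apply three of the pre-serialization steps to one dict.

    `unordered_keys` is an assumption of this sketch: the caller names
    the fields that hold arrays of digests or ids and may be reordered.
    """
    result: Dict[str, Any] = {}
    for key, value in obj.items():
        if key.startswith("_"):
            continue  # filter out fields that start with underscore
        if value is None:
            continue  # filter out fields with null values
        if key in unordered_keys:
            # sorted() compares Python str values by Unicode code point.
            value = sorted(value)
        result[key] = value
    return result


example = {"_id": "x", "type": "Haplotype", "members": ["b", "a"], "note": None}
print(prune_and_order(example, {"members"}))
```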
@@ -193,7 +191,7 @@ Truncated Digest (sha512t24u)
The sha512t24u truncated digest algorithm [Hart2020]_ computes an ASCII digest
from binary data. The method uses two well-established standard
algorithms, the `SHA-512`_ hash function, which generates a binary
digest from binary data, and `Base64`_ URL encoding, which encodes
digest from binary data, and a URL-safe variant of `Base64`_ encoding, which encodes
binary data using printable characters.
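
As a minimal sketch of how the two algorithms combine, the following
Python uses only the standard library. The truncation to 24 bytes is
inferred from the ``t24`` in the algorithm name, since the full step
list is cut off in this excerpt.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """Return a truncated, URL-safe Base64 digest of binary data."""
    # SHA-512 yields a 64-byte binary digest.
    digest = hashlib.sha512(blob).digest()
    # Keep the first 24 bytes (assumed from the "t24" in the name), then
    # encode with the URL-safe Base64 alphabet ("-" and "_" replace "+" and "/").
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


print(sha512t24u(b""))  # z4PhNX7vuL3xVChQ1m2AB9Yg5AULVxXc
```

Because 24 bytes is a multiple of three, the encoded result is always
32 characters long and carries no ``=`` padding.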

Computing the sha512t24u truncated digest for binary data consists of
10 changes: 5 additions & 5 deletions docs/source/releases/1.3.rst
@@ -15,15 +15,15 @@ Major Changes
#############

* :ref:`CopyNumberChange` introduced for relative copy number calls
* :ref:`CopyNumberCount` replaces `CopyNumber`
* :ref:`Genotype` introduced for describing genotypes
* :ref:`ComposedSequenceExpression` introduced for composing expressions
from multiple other sequence expressions
* :ref:`CopyNumberCount` replaces `CopyNumber (v1.2) <https://vrs.ga4gh.org/en/1.2.1/terms_and_model.html#copynumber>`_
* :ref:`Genotype` introduced as a new systemic variation concept
* :ref:`ComposedSequenceExpression` introduced for composing expressions from multiple other sequence expressions

Minor Changes
#############

* Clarifying updates for :ref:`Allele normalization guidance <>`
* Clarifying updates for :ref:`Allele normalization guidance
<should-normalize>`
* :ref:`Haplotype` allele member minimum was revised from 1 to 2
* Updated metaschema processor version
* Introduced ordered / unordered attribute in array declarations
1 change: 1 addition & 0 deletions docs/source/releases/index.rst
@@ -23,6 +23,7 @@ Releases
:maxdepth: 2
:includehidden:

1.3.rst
1.2.rst
1.1.rst
1.0.rst