What does . mean for an A, R, or G indexed field? #737

Open
danking opened this issue Jul 31, 2023 · 6 comments

@danking
Contributor

danking commented Jul 31, 2023

Hey all,

I'm trying to pin down exactly what . means for an A, R, or G indexed field. In particular, does . mean:

  1. This field is missing.
  2. This field is a non-missing array containing a single missing value.

Hail requires the user to explicitly acknowledge that their VCF has arrays of missing values and that they know how to interpret them. This is becoming an issue because folks are showing up to the support forum with INFO fields like AS_VQSLOD=.,.;AS_YNG=.,. in their VCFs. I'd like to pin down the interpretation of ;FOO=.; so that Hail can be less pedantic about these VQSLOD and YNG annotations.

To be clear, .,. and longer arrays are clear to us: an array of missing values. So is intermingling: 3,. is an array with two values, one present and one missing.
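
To make the two readings concrete, here is a minimal Python sketch. It is not Hail's or htslib's parser: the function name `parse_array_field` and the `lone_dot` switch are hypothetical, and `None` stands in for a missing value.

```python
from typing import List, Optional


def parse_array_field(raw: str, lone_dot: str = "missing_field") -> Optional[List[Optional[str]]]:
    """Parse a comma-separated VCF array value; None stands for missing.

    lone_dot selects the reading of a bare "." (hypothetical switch):
      "missing_field"  -> the whole field is missing (reading 1)
      "single_missing" -> a one-element array holding a missing value (reading 2)
    """
    if raw == ".":
        if lone_dot == "missing_field":
            return None
        if lone_dot == "single_missing":
            return [None]
        raise ValueError(f"unknown lone_dot mode: {lone_dot!r}")
    # With two or more elements there is no ambiguity: each "." is a missing element.
    return [None if item == "." else item for item in raw.split(",")]
```

Only the bare `.` case depends on the switch; `.,.` and `3,.` parse the same way under either reading.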

@danking
Contributor Author

danking commented Jul 31, 2023

Vaguely related issue: #419

@pd3
Member

pd3 commented Aug 1, 2023

The meaning of a single dot is ambiguous, it can mean both.

@danking
Contributor Author

danking commented Aug 2, 2023

Is there appetite for modifying the spec to be unambiguous? I appreciate folks may be loath to add more complexity.

We had a quick spitball over here and think trailing commas could resolve the ambiguity with a bit of backwards-incompatibility.

Specifically: a non-missing array-type field value should end with a comma (,). A non-missing array-type field value whose last element is not the empty string may elide the trailing comma. If the last element is the empty string, the final comma is required.

The table below uses a hopefully intuitive JSON-inspired syntax for the meaning column, in which . represents N/A or missing.

VCF value    meaning
.            .
.,           [.]
.,,          [.,""]
.,,,         [.,"",""]
.,3          [.,"3"]
.,3,         [.,"3"]
.,3,,        [.,"3",""]
.,abc        [.,"abc"]
.,abc,       [.,"abc"]
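
The proposed rule can be sketched in a few lines of Python. This is only an illustration of the proposal as stated in this comment, not an existing parser; `parse_proposed` is a hypothetical name, `None` stands in for the missing marker, and edge cases the thread does not discuss (such as an empty value) are not handled specially.

```python
from typing import List, Optional


def parse_proposed(raw: str) -> Optional[List[Optional[str]]]:
    """Parse an array-typed value under the trailing-comma proposal.

    A bare "." is the missing field. Any other value is an array, and a
    single trailing comma (if present) is treated as a terminator.
    """
    if raw == ".":
        return None
    # Drop one terminating comma, then split on the remaining commas.
    body = raw[:-1] if raw.endswith(",") else raw
    return [None if item == "." else item for item in body.split(",")]


# Rows from the table above, with None written for the missing marker.
TABLE = {
    ".": None,
    ".,": [None],
    ".,,": [None, ""],
    ".,,,": [None, "", ""],
    ".,3": [None, "3"],
    ".,3,": [None, "3"],
    ".,3,,": [None, "3", ""],
    ".,abc": [None, "abc"],
    ".,abc,": [None, "abc"],
}
```

Because only one trailing comma is stripped, an array whose last element is the empty string needs the extra comma, which matches the `.,,` and `.,3,,` rows.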

Existing VCFs with an empty string as the last element of an array-type field value would now be interpreted incorrectly: the array comes out one element shorter than expected. For A, R, and G indexed fields we can error or fix the size automatically. For . indexed fields this is an undetectable change of semantics.

VCFs using this hypothetical new spec (4.5?) would confuse existing tools; again, the particularly bad case is . indexed fields, which have no "checkbit".
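
As a rough sketch of the "error or fix the size automatically" idea for A, R, and G indexed fields, the expected length can be computed from the number of ALT alleles and compared with what was parsed. The function names and the repair policy (append one empty string when exactly one element is missing) are assumptions made for illustration, and the G case assumes the ploidy is known (diploid by default).

```python
from math import comb
from typing import List, Optional


def expected_length(number: str, n_alt: int, ploidy: int = 2) -> int:
    """Expected element count for a Number=A, R, or G field."""
    n_alleles = n_alt + 1  # ALT alleles plus the reference allele
    if number == "A":
        return n_alt
    if number == "R":
        return n_alleles
    if number == "G":
        # Number of unordered genotypes for the given ploidy.
        return comb(n_alleles + ploidy - 1, ploidy)
    raise ValueError(f"no length check available for Number={number}")


def repair_length(values: List[Optional[str]], number: str, n_alt: int,
                  ploidy: int = 2) -> List[Optional[str]]:
    """Return values unchanged, pad a lost trailing "", or raise."""
    want = expected_length(number, n_alt, ploidy)
    if len(values) == want:
        return values
    if len(values) == want - 1:
        # Assumed repair: the dropped element was a trailing empty string.
        return values + [""]
    raise ValueError(f"expected {want} values, got {len(values)}")
```

A Number=. field has no expected length to compare against, so no check of this kind is possible there.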

@d-cameron
Contributor

My reading of the specs is that the first option isn't actually valid:

Section 1.4.2: The Number entry is an Integer that describes the number of values that can be included with the INFO field. For example, if the INFO field contains a single number, then this value must be 1; if the INFO field describes a pair of numbers, then this value must be 2 and so on.

The "must" doesn't leave a lot of room for missing values, and the missing value allowed in Section 1.6 applies to a record that is missing all INFO/FORMAT fields.

Section 1.6.2 seems to disagree with this and allow it: If a field contains a list of missing values, it can be represented either as a single MISSING value (‘.’) or as a list of missing values (e.g. ‘.,.,.’ if the field was Number=3).
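
Taking the quoted Section 1.6.2 sentence at face value, a reader that knows the expected count could treat a lone . as shorthand for a full list of missing values. The sketch below shows that one reading only; it is not a statement of what any existing tool does, and `expand_lone_dot` is a hypothetical helper with `None` standing in for missing.

```python
from typing import List, Optional


def expand_lone_dot(raw: str, count: int) -> List[Optional[str]]:
    """Normalize a field value to exactly `count` elements.

    A bare "." is expanded to `count` missing values, so "." and ".,.,."
    become the same list when count is 3.
    """
    if raw == ".":
        return [None] * count
    values = [None if item == "." else item for item in raw.split(",")]
    if len(values) != count:
        raise ValueError(f"expected {count} values, got {len(values)}")
    return values
```

Under this reading the ambiguity disappears whenever the count is known, but it remains for Number=. fields, where there is no count to expand to.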

@jkbonfield jkbonfield moved this from New items to Progressing in GA4GH File Formats Aug 22, 2023
@d-cameron
Contributor

Is there appetite for modifying the spec to be unambiguous?

Yes, but not in a backwards incompatible manner.

@danking
Contributor Author

danking commented Aug 22, 2023

Just so I'm clear on what is acceptable: VCFv4.4 is backwards compatible if and only if, for any VCFv4.3 file, changing the fileformat line to VCFv4.4 does not change the interpretation of its field values?
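
One way to read this definition operationally is as a property check: parse the same raw values under the old and the new rules and list any that come out differently. Both parsers below are hypothetical stand-ins (a plain comma split for the current behaviour, and the trailing-comma rule from earlier in the thread), and the old reading of a bare . as a missing field is an assumption.

```python
from typing import Iterable, List, Optional

Parsed = Optional[List[Optional[str]]]


def parse_legacy(raw: str) -> Parsed:
    """Plain comma split; a bare "." is read as a missing field (assumed)."""
    if raw == ".":
        return None
    return [None if item == "." else item for item in raw.split(",")]


def parse_proposed(raw: str) -> Parsed:
    """Trailing-comma rule: one final comma is a terminator, not a separator."""
    if raw == ".":
        return None
    body = raw[:-1] if raw.endswith(",") else raw
    return [None if item == "." else item for item in body.split(",")]


def changed_values(samples: Iterable[str]) -> List[str]:
    """Return the raw values whose interpretation differs between the rules."""
    return [raw for raw in samples if parse_legacy(raw) != parse_proposed(raw)]
```

For the sample values `.`, `.,.`, `3,.`, and `a,`, only `a,` changes meaning (two elements under the plain split, one under the proposal), which is the incompatibility described earlier for values ending in an empty string.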
