[query] bad error message when user needs to use array_elements_required=False #13346
Comments
I asked for some clarity from the VCF spec on what |
CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO or FORMAT fields with missing elements because the meaning of "." could be ambiguous. Hail now resolves the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details. See hail-is/hail-rfcs#8 for details on the problem and the solution.

I assessed the effect of removing the `array_elements_required=True` fast path by evaluating the following code against this PR's tip commit `cd06c248e4` and `0.2.120` (`f00f916faf`). I ran it three times per commit and report each individual time as well as the average.

```
In [1]: import hail as hl

In [2]: %%time
   ...: mt = hl.import_vcf(
   ...:     '/Users/dking/projects/hail-data/ALL.chr21.raw.HC.vcf.bgz'
   ...: )
   ...: mt._force_count_rows()
```

| commit                   | run 1 (s) | run 2 (s) | run 3 (s) | average (s) | warm average (s) |
|--------------------------|-----------|-----------|-----------|-------------|------------------|
| `cd06c248e4` (this PR)   | 116       | 80        | 77        | 91 +- 18    | 78.5 +- 1.5      |
| `f00f916faf` (`0.2.120`) | 112       | 80        | 79        | 90 +- 15    | 79.5 +- 0.5      |

This is what I expected. For a VCF with no ambiguity and few instances of ".", we've added a very minor amount of new work.
I have an RFC proposal to just handle the ambiguity: https://github.com/hail-is/hail-rfcs/blob/main/rfc/0008-handle-vcf-array-field-ambiguity

I proposed a PR to fix this: #13465

However, I missed a key issue: many VCFs elide fields to indicate missingness. That is not ambiguous: a field that is entirely elided is clearly missing, not an array of one missing value. You can't do this in a FORMAT (aka entry aka genotype) field, but you can do this in an INFO field a la:
I don't think we can fix this problem entirely from Python. We need to use Scala-side logic because, after we parse in Scala, we lose the knowledge of whether a field was entirely elided or was a single missing dot.
…ts (#14105) Fixes #13346. Another user was confused by this: #14102. Unfortunately, the world appears to have embraced missing values in VCF array fields even though the single element case is ambiguous. In #13346, I proposed a scheme by which we can disambiguate many of the cases, but implementing it ran into challenges because LoadVCF.scala does not expose whether or not an INFO field was a literal "." or elided entirely from that line. Anyway, this error message actually points users to the fix. I also changed some method names such that every method is ArrayType and never TypeArray.
If you're encountering this issue, the quick fix is to use `array_elements_required=False`.
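A minimal sketch of that quick fix, assuming the parameter is passed straight to `hl.import_vcf`. The wrapper name `import_vcf_allowing_missing_elements` and the path in the usage note are hypothetical, for illustration only.

```python
def import_vcf_allowing_missing_elements(path):
    """Import a VCF whose array fields may contain missing elements.

    Hypothetical helper, not part of Hail. It only forwards `path` to
    hl.import_vcf with array_elements_required=False.
    """
    # Imported inside the function so that defining the helper does not
    # itself require Hail to be installed.
    import hail as hl

    return hl.import_vcf(path, array_elements_required=False)
```

Usage would look like `mt = import_vcf_allowing_missing_elements('path/to/your.vcf.bgz')`, where the path is a placeholder for your own file.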
What happened?
https://hail.zulipchat.com/#narrow/stream/123010-Hail-Query-0.2E2-support/topic/checkpoint.20with.20missing.20fields
Notice in particular:
These fields are array fields containing missing values. By default, Hail errors when parsing these due to the inherent ambiguity of a single dot: is it a missing array or an array with one missing element?
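To make the ambiguity concrete, here is a small pure-Python sketch. It is not Hail's parser; the function `parse_array_field` and its flag `lone_dot_means_missing_array` are made up for illustration. It assumes integer-valued, comma-separated array fields.

```python
from typing import List, Optional


def parse_array_field(
    raw: str, lone_dot_means_missing_array: bool = True
) -> Optional[List[Optional[int]]]:
    """Parse the text of one integer array field such as "1,2", "1,.,5", or ".".

    Illustrative only. The flag picks one of the two readings of a lone ".".
    """
    if raw == ".":
        # The ambiguous case: both readings are consistent with the text.
        if lone_dot_means_missing_array:
            return None  # the whole array is missing
        return [None]  # an array with one missing element
    # With more than one token there is no ambiguity: each "." is a
    # missing element inside a present array.
    return [None if token == "." else int(token) for token in raw.split(",")]


print(parse_array_field("1,.,5"))
print(parse_array_field("."))
print(parse_array_field(".", lone_dot_means_missing_array=False))
```

Only the lone `.` changes meaning with the flag; `1,.,5` parses the same way under either reading.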
The error message should suggest that the user try using `array_elements_required`. The docs for `import_vcf` should provide enough information for the user to understand what this does. We should also consider making this the default.
Version
0.2.120
Relevant log output
No response