Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[query] bad error message when user needs to use array_elements_required=False #13346

Closed
danking opened this issue Jul 31, 2023 · 2 comments · Fixed by #14105
Closed

[query] bad error message when user needs to use array_elements_required=False #13346

danking opened this issue Jul 31, 2023 · 2 comments · Fixed by #14105

Comments

@danking
Copy link
Contributor

danking commented Jul 31, 2023

If you're encountering this issue the quick fix is to use array_elements_required=False

hl.import_vcf(..., array_elements_required=False)

What happened?

https://hail.zulipchat.com/#narrow/stream/123010-Hail-Query-0.2E2-support/topic/checkpoint.20with.20missing.20fields

is.hail.utils.HailException: gs://jn-vcf-cleanup-central1/McCarroll-Macosko-UM1-BICAN-Express-WGS-2023-0626/McCarroll-Macosko-UM1-BICAN-Express-WGS-2023-0626.vcf.gz:offset 1344376382: error while parsing line
chr1	10403	.	ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC	A,ACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC	.	LowQual	AC=1,1;AF=0.250,0.250;AN=4;AS_QUALapprox=0|23|45;AS_VQSLOD=.,.;AS_YNG=.,.;QUALapprox=45	GT:AD:GQ:RGQ	./.	0/1:23,7,0:20:23	./.	./.	./.	0/2:6,0,4:35:45	./.	./.	./.	./.	./.	./.	./.	./.	./.	./.	./.	./.

	at is.hail.utils.ErrorHandling.fatal(ErrorHandling.scala:21)
	at is.hail.utils.ErrorHandling.fatal$(ErrorHandling.scala:21)
	at is.hail.utils.package$.fatal(package.scala:78)
	at is.hail.io.vcf.MatrixVCFReader.$anonfun$executeGeneric$7(LoadVCF.scala:1934)
	at is.hail.io.vcf.MatrixVCFReader.$anonfun$executeGeneric$7$adapted(LoadVCF.scala:1922)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:515)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at __C2005collect_distributed_array_matrix_native_writer.apply_region1_27(Unknown Source)
	at __C2005collect_distributed_array_matrix_native_writer.apply(Unknown Source)
	at __C2005collect_distributed_array_matrix_native_writer.apply(Unknown Source)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$6(BackendUtils.scala:52)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.annotations.RegionPool.scopedRegion(RegionPool.scala:162)
	at is.hail.backend.BackendUtils.$anonfun$collectDArray$5(BackendUtils.scala:51)
	at is.hail.backend.spark.SparkBackendComputeRDD.compute(SparkBackend.scala:751)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:136)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

is.hail.utils.HailException: cannot set missing field for required type +PCString

Notice in particular:

AS_VQSLOD=.,.;AS_YNG=.,.

These fields are array fields containing missing values. By default, Hail errors when parsing these due to the inherent ambiguity of a single dot: is it a missing array or an array with one, missing, element.

The error message should suggest that the user try using array_elements_required. The docs for import_vcf should provide enough information for the user to understand what this does.

We should also consider making this the default.

Version

0.2.120

Relevant log output

No response

@danking danking added the bug label Jul 31, 2023
@danking
Copy link
Contributor Author

danking commented Jul 31, 2023

I asked for some clarity from the VCF spec on what . means so that we can possibly be less pedantic.

samtools/hts-specs#737

danking pushed a commit to danking/hail that referenced this issue Aug 21, 2023
CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details.

See hail-is/hail-rfcs#8 for details on the problem and the solution.
danking pushed a commit to danking/hail that referenced this issue Aug 21, 2023
CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details.

See hail-is/hail-rfcs#8 for details on the problem and the solution.
danking pushed a commit to danking/hail that referenced this issue Aug 21, 2023
CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details.

See hail-is/hail-rfcs#8 for details on the problem and the solution.

I assessed the effect of removing the `array_elements_required=True` fast path by evaluating the
following code against this PR's tip commit `cd06c248e4` and `0.2.120` (`f00f916faf`). I ran it
three times per commit and report each individual time as well as the average.

```
In [1]: import hail as hl

In [2]: %%time
   ...: mt = hl.import_vcf(
   ...:     '/Users/dking/projects/hail-data/ALL.chr21.raw.HC.vcf.bgz'
   ...: )
   ...: mt._force_count_rows()
```

| commit                   | run 1 (s) | run 2 (s) | run 3 (s) | average (s) | warm average (s) |
| `cd06c248e4` (this PR)   | 116s      | 80s       | 77s       | 91+-18      | 78.5 +- 1.5      |
| `f00f916faf` (`0.2.120`) | 112s      | 80s       | 79s       | 90+-15      | 79.5 +- 0.5      |

This is what I expected. For a VCF with no ambiguity and few instances of ".", we've added a very
minor amount of new work.
danking pushed a commit to danking/hail that referenced this issue Aug 21, 2023
CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details.

See hail-is/hail-rfcs#8 for details on the problem and the solution.

I assessed the effect of removing the `array_elements_required=True` fast path by evaluating the
following code against this PR's tip commit `cd06c248e4` and `0.2.120` (`f00f916faf`). I ran it
three times per commit and report each individual time as well as the average.

```
In [1]: import hail as hl

In [2]: %%time
   ...: mt = hl.import_vcf(
   ...:     '/Users/dking/projects/hail-data/ALL.chr21.raw.HC.vcf.bgz'
   ...: )
   ...: mt._force_count_rows()
```

| commit                   | run 1 (s) | run 2 (s) | run 3 (s) | average (s) | warm average (s) |
|--------------------------|-----------|-----------|-----------|-------------|------------------|
| `cd06c248e4` (this PR)   | 116s      | 80s       | 77s       | 91+-18      | 78.5 +- 1.5      |
| `f00f916faf` (`0.2.120`) | 112s      | 80s       | 79s       | 90+-15      | 79.5 +- 0.5      |

This is what I expected. For a VCF with no ambiguity and few instances of ".", we've added a very
minor amount of new work.
danking pushed a commit to danking/hail that referenced this issue Aug 21, 2023
CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO or FORMAT fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details.

See hail-is/hail-rfcs#8 for details on the problem and the solution.

I assessed the effect of removing the `array_elements_required=True` fast path by evaluating the
following code against this PR's tip commit `cd06c248e4` and `0.2.120` (`f00f916faf`). I ran it
three times per commit and report each individual time as well as the average.

```
In [1]: import hail as hl

In [2]: %%time
   ...: mt = hl.import_vcf(
   ...:     '/Users/dking/projects/hail-data/ALL.chr21.raw.HC.vcf.bgz'
   ...: )
   ...: mt._force_count_rows()
```

| commit                   | run 1 (s) | run 2 (s) | run 3 (s) | average (s) | warm average (s) |
|--------------------------|-----------|-----------|-----------|-------------|------------------|
| `cd06c248e4` (this PR)   | 116s      | 80s       | 77s       | 91+-18      | 78.5 +- 1.5      |
| `f00f916faf` (`0.2.120`) | 112s      | 80s       | 79s       | 90+-15      | 79.5 +- 0.5      |

This is what I expected. For a VCF with no ambiguity and few instances of ".", we've added a very
minor amount of new work.
danking pushed a commit to danking/hail that referenced this issue Aug 21, 2023
CHANGELOG: Fixed hail-is#13346. Previously, when parsing VCFs, Hail failed on INFO or FORMAT fields with missing elements because the meaning of "." could be ambiguous. Hail now resovles the ambiguity, when possible, using the number of alleles. If the meaning is still ambiguous after considering the number of alleles, Hail uses a new `hl.import_vcf` parameter to resolve the ambiguity. See the `hl.import_vcf` docs for details.

See hail-is/hail-rfcs#8 for details on the problem and the solution.

I assessed the effect of removing the `array_elements_required=True` fast path by evaluating the
following code against this PR's tip commit `cd06c248e4` and `0.2.120` (`f00f916faf`). I ran it
three times per commit and report each individual time as well as the average.

```
In [1]: import hail as hl

In [2]: %%time
   ...: mt = hl.import_vcf(
   ...:     '/Users/dking/projects/hail-data/ALL.chr21.raw.HC.vcf.bgz'
   ...: )
   ...: mt._force_count_rows()
```

| commit                   | run 1 (s) | run 2 (s) | run 3 (s) | average (s) | warm average (s) |
|--------------------------|-----------|-----------|-----------|-------------|------------------|
| `cd06c248e4` (this PR)   | 116s      | 80s       | 77s       | 91+-18      | 78.5 +- 1.5      |
| `f00f916faf` (`0.2.120`) | 112s      | 80s       | 79s       | 90+-15      | 79.5 +- 0.5      |

This is what I expected. For a VCF with no ambiguity and few instances of ".", we've added a very
minor amount of new work.
@danking danking self-assigned this Oct 21, 2023
@danking
Copy link
Contributor Author

danking commented Oct 21, 2023

I have an RFC proposal to just handle the ambiguity: https://github.com/hail-is/hail-rfcs/blob/main/rfc/0008-handle-vcf-array-field-ambiguity

I proposed a PR to fix this: #13465 However, I missed a key issue: many VCF's elide fields to indicate missingness. That is not ambiguous: a field that is entirely elided is clearly missing, not an array of one missing value. You can't do this in a FORMAT (aka entry aka genotype) field, but you can do this in an INFO field a la:

##fileformat=VCFv4.2
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=NUMS,Number=*,Type=Float,Description="some numbers">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT ...
... AC=1,1;AN=1 ...

the NUMS field should be read as missing. My PR considered it unacceptably ambiguous because it thought it had been NUMS=..

I don't think we can fix this problem entirely from Python. We need to use Scala-side logic because after we parse in Scala, we lose the knowledge that a field was entirely elided versus a single missing dot.

@danking danking added the query label Oct 23, 2023
@danking danking removed their assignment Oct 30, 2023
danking pushed a commit to danking/hail that referenced this issue Dec 18, 2023
Fixes hail-is#13346.

Another user was confused by this: hail-is#14102. Unfortunately, the world
appears to have embraced missing values in VCF array fields even though the single element case is
ambiguous. In hail-is#13346, I proposed a scheme by which we can disambiguate many of the cases, but implementing
it ran into challenges because LoadVCF.scala does not expose whether or not an INFO field was a literal
"." or elided entirely from that line.

Anyway, this error message actually points users to the fix. I also changed some method names such that
every method is ArrayType and never TypeArray.
danking pushed a commit to danking/hail that referenced this issue Dec 18, 2023
Fixes hail-is#13346.

Another user was confused by this: hail-is#14102. Unfortunately, the world
appears to have embraced missing values in VCF array fields even though the single element case is
ambiguous. In hail-is#13346, I proposed a scheme by which we can disambiguate many of the cases, but implementing
it ran into challenges because LoadVCF.scala does not expose whether or not an INFO field was a literal
"." or elided entirely from that line.

Anyway, this error message actually points users to the fix. I also changed some method names such that
every method is ArrayType and never TypeArray.
danking added a commit that referenced this issue Jan 11, 2024
…ts (#14105)

Fixes #13346.

Another user was confused by this:
#14102. Unfortunately, the world
appears to have embraced missing values in VCF array fields even though
the single element case is ambiguous. In #13346, I proposed a scheme by
which we can disambiguate many of the cases, but implementing it ran
into challenges because LoadVCF.scala does not expose whether or not an
INFO field was a literal "." or elided entirely from that line.

Anyway, this error message actually points users to the fix. I also
changed some method names such that every method is ArrayType and never
TypeArray.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant