'GT' are present in MT but missing from VDS for same samples #3693

iris-garden · 2024-05-11T01:31:53Z

iris-garden
May 11, 2024
Maintainer

Note

The following post was exported from discuss.hail.is, a forum for asking questions about Hail which has since been deprecated.

(Sep 19, 2023 at 18:55) akhattab said:

Hi, hope all is well.
I’m working on the AllofUs platform to calculate some PRS. I’m using a custom code and Hail MT for calculations, which is working but is time-consuming and expensive.
A part of the code is subsetting the variants in the PRS weights df from the MT, and I came across tpoterba post about filtering VDS using hail.vds.filter_intervals which is SUPER fast compared to MT. The issue is that most samples are missing ‘GT’ even though they are present in the MT.

My code:

PGS_sites = hl.import_locus_intervals(interval_file, reference_genome=‘GRCh38’, skip_invalid_intervals=True)
vds_prs = hl.vds.filter_intervals(vds_subset, PGS_sites, keep=True)

The number of variants from a PRS score present in MT:

The number of variants from the same PRS score present in VDS for the same cohort:

Am I missing something here? Any thoughts?

Thank you!
Ahmed

iris-garden · 2024-05-11T01:31:54Z

iris-garden
May 11, 2024
Maintainer Author

Note

The following post was exported from discuss.hail.is, a forum for asking questions about Hail which has since been deprecated.

(Sep 20, 2023 at 20:31) danking said:

What code are you using to show the genotypes? vds.variant_data.GT.show() shows only the variant data. The VDS is a split representation: reference data and variant data are stored separately, each on in its own Hail Matrix Table: vds.reference_data and vds.variant_data.

If, for your analysis, you need reference information (in particular, if you need to decide if a given sample is a confident homozygous reference call instead of a no call), you need to densify the VDS. There’s some more information about this at the AoU VDS page.

I recommend you seriously consider using the smaller, pre-densified callsets unless you absolutely need genome-wide data. Densification is not a cheap process, it’s converting from a very sparse representation to a very large and dense representation. This is particularly true if you only need GTs! The majority of the VDS is quality metadata like PL and AD.

0 replies

iris-garden · 2024-05-11T01:31:54Z

iris-garden
May 11, 2024
Maintainer Author

Note

The following post was exported from discuss.hail.is, a forum for asking questions about Hail which has since been deprecated.

(Sep 21, 2023 at 00:44) akhattab said:

Hi danking, thank you for your response!
Yes, I used vds.variant_data.GT.show() to show the genotypes.

If, for your analysis, you need reference information (in particular, if you need to decide if a given sample is a confident homozygous reference call instead of a no call)

Does that mean an NA could either be a nocall or a homozygous reference? I queried vds.reference_data for some of the variants with NA in the vds.variant_data and they also have NA also in the reference_data

Is it safe to assume that the variants with NA are homozygous reference?

0 replies

iris-garden · 2024-05-11T01:31:55Z

iris-garden
May 11, 2024
Maintainer Author

Note

The following post was exported from discuss.hail.is, a forum for asking questions about Hail which has since been deprecated.

(Sep 21, 2023 at 14:40) danking said:

Does that mean an NA could either be a nocall or a homozygous reference?

Yes, that’s exactly what NA means.

The reference data is not stored row-by-row. The reference data is stored like a GVCF: there are “reference blocks” which implicitly span multiple rows.

Is it safe to assume that the variants with NA are homozygous reference?

No it is not.

If you need to confidently know if a sample is homozygous reference or no call you must use hl.vds.to_dense_mt:

mt = hl.vds.to_dense_mt(vds)
mt.GT.show()  # if you see NA here they are no calls

Note that densification is an expensive and time-consuming process! You probably want to filter to a small set of intervals-of-interest first using hl.vds.filter_intervals:

vds = hl.vds.filter_intervals(
    vds,
    hl.literal([
        hl.locus_interval('chr1', 123, 456)
    ])
)

If possible, I recommend using the pre-densified datasets because the All of Us project has already paid the cost of producing dense data at those loci!

0 replies

iris-garden · 2024-05-11T01:31:55Z

iris-garden
May 11, 2024
Maintainer Author

Note

The following post was exported from discuss.hail.is, a forum for asking questions about Hail which has since been deprecated.

(Sep 22, 2023 at 17:06) akhattab said:

Great! Thank you so much, Dan. I appreciate the help!

0 replies

iris-garden · 2024-05-11T01:31:56Z

iris-garden
May 11, 2024
Maintainer Author

Note

The following post was exported from discuss.hail.is, a forum for asking questions about Hail which has since been deprecated.

(Sep 27, 2023 at 17:36) akhattab said:

Hi danking, sorry to keep bugging you with this but I have a follow-up question.
I’m trying to understand how Hail treats the missing calls.
When using mt.GT.show I could see the missing calls as NA:

But when using hl.is_missing, they are still NA instead of True I guess:

Same when using is_defined:

My question is, why are these calls ‘ignored’?
Thank you!

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

'GT' are present in MT but missing from VDS for same samples #3693

{{title}}

Replies: 5 comments

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

'GT' are present in MT but missing from VDS for same samples #3693

iris-garden May 11, 2024 Maintainer

(Sep 19, 2023 at 18:55) akhattab said:

Replies: 5 comments

iris-garden May 11, 2024 Maintainer Author

(Sep 20, 2023 at 20:31) danking said:

iris-garden May 11, 2024 Maintainer Author

(Sep 21, 2023 at 00:44) akhattab said:

iris-garden May 11, 2024 Maintainer Author

(Sep 21, 2023 at 14:40) danking said:

iris-garden May 11, 2024 Maintainer Author

(Sep 22, 2023 at 17:06) akhattab said:

iris-garden May 11, 2024 Maintainer Author

(Sep 27, 2023 at 17:36) akhattab said:

iris-garden
May 11, 2024
Maintainer

iris-garden
May 11, 2024
Maintainer Author

iris-garden
May 11, 2024
Maintainer Author

iris-garden
May 11, 2024
Maintainer Author

iris-garden
May 11, 2024
Maintainer Author

iris-garden
May 11, 2024
Maintainer Author