'GT' are present in MT but missing from VDS for same samples #3693
Replies: 5 comments
-
Note
The following post was exported from discuss.hail.is, a forum for asking questions about Hail which has since been deprecated.
(Sep 20, 2023 at 20:31) danking said:
What code are you using to
If, for your analysis, you need reference information (in particular, if you need to decide whether a given sample has a confident homozygous reference call rather than a no call), you need to densify the VDS. There’s some more information about this at the AoU VDS page. I recommend you seriously consider using the smaller, pre-densified callsets unless you absolutely need genome-wide data. Densification is not a cheap process: it converts a very sparse representation into a very large, dense one. This is particularly true if you only need GTs! The majority of the VDS is quality metadata like PL and AD.
-
Note The following post was exported from discuss.hail.is, a forum for asking questions about Hail which has since been deprecated. (Sep 21, 2023 at 14:40) danking said:
Yes, that’s exactly what NA means. The reference data is not stored row-by-row. The reference data is stored like a GVCF: there are “reference blocks” which implicitly span multiple rows.
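To make the reference-block idea concrete, here is a toy, pure-Python model. It is only an illustration and not how Hail stores or densifies data: the `RefBlock` class, the `sparse_gt` and `dense_gt` helpers, and the sample names are all made up for this sketch. A sparse lookup returns `None` (standing in for NA) wherever a sample has no variant call, while a dense lookup also consults the reference blocks covering that position.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class RefBlock:
    """Hypothetical stand-in for a GVCF-style reference block."""
    sample: str
    start: int  # first position covered (inclusive)
    end: int    # last position covered (inclusive)


def sparse_gt(variant_calls: Dict[Tuple[str, int], str],
              sample: str, pos: int) -> Optional[str]:
    """Look only at the sparse variant calls; None plays the role of NA."""
    return variant_calls.get((sample, pos))


def dense_gt(variant_calls: Dict[Tuple[str, int], str],
             ref_blocks: List[RefBlock],
             sample: str, pos: int) -> Optional[str]:
    """Fill in 0/0 when a reference block for this sample covers pos."""
    gt = variant_calls.get((sample, pos))
    if gt is not None:
        return gt
    for block in ref_blocks:
        if block.sample == sample and block.start <= pos <= block.end:
            return "0/0"
    return None  # no variant call and no reference block: a true no call


# s1 carries a variant at position 105; s2 is covered by a reference block.
variant_calls = {("s1", 105): "0/1"}
ref_blocks = [RefBlock("s2", 100, 110)]
```

In this toy data, `s2` looks missing at position 105 in the sparse view but resolves to homozygous reference once the block is taken into account, whereas `s3` has no data at all and stays missing in both views.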
No, it is not. If you need to confidently know whether a sample is homozygous reference or a no call, you must use
mt = hl.vds.to_dense_mt(vds)
mt.GT.show() # if you see NA here they are no calls
Note that densification is an expensive and time-consuming process! You probably want to filter to a small set of intervals-of-interest first using
vds = hl.vds.filter_intervals(
    vds,
    hl.literal([
        hl.locus_interval('chr1', 123, 456)
    ])
)
If possible, I recommend using the pre-densified datasets because the All of Us project has already paid the cost of producing dense data at those loci!
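Putting the two steps together for a PRS use case might look like the sketch below. It is an outline under assumptions, not a tested recipe: the helper names `prs_sites_to_intervals` and `dense_gt_at_sites`, the `(contig, position)` input format, and the `GRCh38` default are all made up for illustration, and the Hail import is deferred so the pure-Python helper can be used without a Hail installation. It filters the VDS to one-base intervals at the PRS sites first, then densifies only what remains.

```python
from typing import List, Tuple


def prs_sites_to_intervals(sites: List[Tuple[str, int]]) -> List[str]:
    """Turn (contig, position) pairs into 'contig:start-end' strings.

    Each interval is one base wide; Hail parses such strings as
    left-inclusive, right-exclusive by default.
    """
    return [f"{contig}:{pos}-{pos + 1}" for contig, pos in sites]


def dense_gt_at_sites(vds_path: str,
                      sites: List[Tuple[str, int]],
                      reference_genome: str = "GRCh38"):
    """Filter a VDS to the given sites, then densify the small remainder.

    vds_path and sites are placeholders supplied by the caller.
    """
    import hail as hl  # deferred so the helper above works without Hail

    intervals = hl.literal([
        hl.parse_locus_interval(s, reference_genome=reference_genome)
        for s in prs_sites_to_intervals(sites)
    ])
    vds = hl.vds.read_vds(vds_path)
    vds = hl.vds.filter_intervals(vds, intervals)  # cheap: still sparse
    return hl.vds.to_dense_mt(vds)                 # expensive: dense rows
```

On the returned MatrixTable, `mt.GT.show()` can then be used as described above: any NA that remains after densification is a no call rather than an implicit homozygous reference.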
-
Note
The following post was exported from discuss.hail.is, a forum for asking questions about Hail which has since been deprecated.
(Sep 22, 2023 at 17:06) akhattab said:
Great! Thank you so much, Dan. I appreciate the help!
-
Note
The following post was exported from discuss.hail.is, a forum for asking questions about Hail which has since been deprecated.
(Sep 19, 2023 at 18:55) akhattab said:
Hi, hope all is well.
I’m working on the All of Us platform to calculate some PRS. I’m using custom code and a Hail MT for the calculations, which works but is time-consuming and expensive.
Part of the code subsets the MT to the variants in the PRS weights df, and I came across tpoterba’s post about filtering a VDS using hail.vds.filter_intervals, which is SUPER fast compared to the MT. The issue is that most samples are missing ‘GT’ in the VDS even though they are present in the MT.
My code:
The number of variants from a PRS score present in MT:
The number of variants from the same PRS score present in VDS for the same cohort:
Am I missing something here? Any thoughts?
Thank you!
Ahmed