gdsSubset and other gds object interactions are not practial for loops/parallel #37

rbutleriii · 2024-07-31T20:47:57Z

If you want to gdsSubset your object, you cannot have it open, which is incredibly impractical for downstream use if I need information from it, for example sample IDs.

# prune to the relevant biomarker overlap
gds <- GdsGenotypeReader(gds_fn)
sample.sel <- getScanID(gds)

# for each biomarker make grm
lapply(bios, function(i) {
  # intersection for variable of interest
  both <- intersect(a[DS1V == 0 & !is.na(get(i)), PIDN], sample.sel)
  x <- tempfile()
  gdsSubset(gds_fn, x, sample.include = both)

  # run PC-Relate
  x <- GenotypeBlockIterator(x)
  mypcrelate <- pcrelate(x, 
    pcs = mypcair$vectors[both, 1:8], 
    training.set = intersect(mypcair$unrels, both),
    BPPARAM=BiocParallel::SerialParam()
  )

  # write pcrelate RDS and GRM to file
  saveRDS(mypcrelate, file = paste("mypcrelate", i, "rds", sep = "."))
})

This will fail unless I close the object first, and would be a major chokepoint if I had to access elements inside the gds object in the lapply, as it would open and close the gds file for each iteration. Not to mention impossible to use with something like future_lapply trying to parallelize the process since it only allows the file to be open once. Is there a reason it has to lock the file for use?

The text was updated successfully, but these errors were encountered:

smgogarten · 2024-07-31T21:14:01Z

This should really be an issue in the GWASTools package, since gdsSubset is a GWASTools function. But this is not the intended use of gdsSubset. You don't need to write a new GDS file for every sample set; instead, use the sample.include argument to pcrelate to select samples for every iteration.

If you really want to do it this way, set allow.fork=TRUE in your call to gdsSubset.

rbutleriii · 2024-08-01T01:40:53Z

If you do pass it sample.include, it will error without the training.set being subset, but it runs with the full set of pcs got the gds. Does the function handle the pc subset, or are the rows mismatched without pcs = mypcair$vectors[both, 1:8]?

plan("multicore", workers = 8)
options(future.globals.maxSize = 4 * 1e9)
# prune to the relevant biomarker overlap
gds <- GenotypeBlockIterator(GenotypeData(GdsGenotypeReader(gds_fn, allow.fork = TRUE)))
sample.sel <- getScanID(gds)

# for each biomarker make grm
future_lapply(bios, function(i) {
  # intersection for variable of interest
  both <- intersect(a[DS1V == 0 & !is.na(get(i)), PIDN], sample.sel)

  # run PC-Relate
  mypcrelate <- pcrelate(gds, 
    pcs = mypcair$vectors[, 1:8], 
    training.set = intersect(mypcair$unrels, both),
    BPPARAM = BiocParallel::SerialParam(),
    sample.include = both
  )

  # write pcrelate RDS and GRM to file
  saveRDS(mypcrelate, file = paste("mypcrelate", i, "rds", sep = "."))
}, future.seed = TRUE)

smgogarten · 2024-08-01T19:58:04Z

I don't entirely understand your question. If you're getting an error or unexpected results using sample.include with pcrelate, please post a reproducible example on the GENESIS issues page.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gdsSubset and other gds object interactions are not practial for loops/parallel #37

gdsSubset and other gds object interactions are not practial for loops/parallel #37

rbutleriii commented Jul 31, 2024

smgogarten commented Jul 31, 2024 •

edited

Loading

rbutleriii commented Aug 1, 2024

smgogarten commented Aug 1, 2024

gdsSubset and other gds object interactions are not practial for loops/parallel #37

gdsSubset and other gds object interactions are not practial for loops/parallel #37

Comments

rbutleriii commented Jul 31, 2024

smgogarten commented Jul 31, 2024 • edited Loading

rbutleriii commented Aug 1, 2024

smgogarten commented Aug 1, 2024

smgogarten commented Jul 31, 2024 •

edited

Loading