Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gdsSubset and other gds object interactions are not practial for loops/parallel #37

Open
rbutleriii opened this issue Jul 31, 2024 · 3 comments

Comments

@rbutleriii
Copy link

If you want to gdsSubset your object, you cannot have it open, which is incredibly impractical for downstream use if I need information from it, for example sample IDs.

# prune to the relevant biomarker overlap
gds <- GdsGenotypeReader(gds_fn)
sample.sel <- getScanID(gds)

# for each biomarker make grm
lapply(bios, function(i) {
  # intersection for variable of interest
  both <- intersect(a[DS1V == 0 & !is.na(get(i)), PIDN], sample.sel)
  x <- tempfile()
  gdsSubset(gds_fn, x, sample.include = both)

  # run PC-Relate
  x <- GenotypeBlockIterator(x)
  mypcrelate <- pcrelate(x, 
    pcs = mypcair$vectors[both, 1:8], 
    training.set = intersect(mypcair$unrels, both),
    BPPARAM=BiocParallel::SerialParam()
  )

  # write pcrelate RDS and GRM to file
  saveRDS(mypcrelate, file = paste("mypcrelate", i, "rds", sep = "."))
})

This will fail unless I close the object first, and would be a major chokepoint if I had to access elements inside the gds object in the lapply, as it would open and close the gds file for each iteration. Not to mention impossible to use with something like future_lapply trying to parallelize the process since it only allows the file to be open once. Is there a reason it has to lock the file for use?

@smgogarten
Copy link

smgogarten commented Jul 31, 2024

This should really be an issue in the GWASTools package, since gdsSubset is a GWASTools function. But this is not the intended use of gdsSubset. You don't need to write a new GDS file for every sample set; instead, use the sample.include argument to pcrelate to select samples for every iteration.

If you really want to do it this way, set allow.fork=TRUE in your call to gdsSubset.

@rbutleriii
Copy link
Author

If you do pass it sample.include, it will error without the training.set being subset, but it runs with the full set of pcs got the gds. Does the function handle the pc subset, or are the rows mismatched without pcs = mypcair$vectors[both, 1:8]?

plan("multicore", workers = 8)
options(future.globals.maxSize = 4 * 1e9)
# prune to the relevant biomarker overlap
gds <- GenotypeBlockIterator(GenotypeData(GdsGenotypeReader(gds_fn, allow.fork = TRUE)))
sample.sel <- getScanID(gds)

# for each biomarker make grm
future_lapply(bios, function(i) {
  # intersection for variable of interest
  both <- intersect(a[DS1V == 0 & !is.na(get(i)), PIDN], sample.sel)

  # run PC-Relate
  mypcrelate <- pcrelate(gds, 
    pcs = mypcair$vectors[, 1:8], 
    training.set = intersect(mypcair$unrels, both),
    BPPARAM = BiocParallel::SerialParam(),
    sample.include = both
  )

  # write pcrelate RDS and GRM to file
  saveRDS(mypcrelate, file = paste("mypcrelate", i, "rds", sep = "."))
}, future.seed = TRUE)

@smgogarten
Copy link

I don't entirely understand your question. If you're getting an error or unexpected results using sample.include with pcrelate, please post a reproducible example on the GENESIS issues page.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants