Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subsetting a SimpleRleList object by a GRanges object 'x' should work, even when 'sum(width(x))' >= 2^31 #44

Open
hpages opened this issue Dec 1, 2021 · 1 comment

Comments

@hpages
Copy link
Contributor

hpages commented Dec 1, 2021

This is a spin-off from issue #43:

library(BSgenome.Hsapiens.UCSC.hg38)
genome <- BSgenome.Hsapiens.UCSC.hg38

x <- GRanges(c("chr1:1-1000", "chr1:11-1010", "chr1:1000001-1500000", "chrM:1-16569"), seqinfo=seqinfo(genome))
x_cvg <- coverage(x)  # SimpleRleList object

bins <- tileGenome(seqinfo(genome), tilewidth=1e6, cut.last.tile=TRUE)
sum(width(bins))  # >= 2^31
x_cvg[bins]
# Error in .numeric2end(x, NG) : 
#   when 'x' is an integer vector, it cannot contain NAs or negative values
# In addition: Warning message:
# In bindROWS(x, list(...), ignore.mcols = ignore.mcols) :
#   integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'

sessionInfo():

R Under development (unstable) (2021-10-25 r81105)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.10

Matrix products: default
BLAS:   /home/hpages/R/R-4.2.r81105/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.2.r81105/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg38_1.4.4 BSgenome_1.63.3                  
 [3] rtracklayer_1.55.2                Biostrings_2.63.0                
 [5] XVector_0.35.0                    GenomicRanges_1.47.5             
 [7] GenomeInfoDb_1.31.1               IRanges_2.29.1                   
 [9] S4Vectors_0.33.5                  BiocGenerics_0.41.2              

loaded via a namespace (and not attached):
 [1] zlibbioc_1.41.0             GenomicAlignments_1.31.2   
 [3] BiocParallel_1.29.4         lattice_0.20-45            
 [5] rjson_0.2.20                tools_4.2.0                
 [7] grid_4.2.0                  SummarizedExperiment_1.25.2
 [9] parallel_4.2.0              Biobase_2.55.0             
[11] matrixStats_0.61.0          yaml_2.2.1                 
[13] crayon_1.4.2                BiocIO_1.5.0               
[15] Matrix_1.3-4                GenomeInfoDbData_1.2.7     
[17] restfulr_0.0.13             bitops_1.0-7               
[19] RCurl_1.98-1.5              DelayedArray_0.21.2        
[21] compiler_4.2.0              MatrixGenerics_1.7.0       
[23] Rsamtools_2.11.0            XML_3.99-0.8               
@Roleren
Copy link

Roleren commented Oct 10, 2024

Yes, I have been working a lot with tweaks for this, what I ended up doing is using a while loop to split up only chromosomes that results in sums > 2^31-1, such that I only need to do it for those I know will fail.

I then split it that chromosome/bin set in two, and check if any of the subsets are still > 2^31-1, repeat if true (I set to max 20 split rounds to avoid infinite loop, if some crazy number is used etc)

This works for me, but the optimal would be if it is possible to catch this internally and allow some numeric types in the coverage subsetting function directly ?

Btw, this problem is also true for calculating the coverage() of GRanges object using weight = "score", where any score is > 2^31-1 if score column is integer (this happens if the same coordinate repeats over multiple ranges (and those scores are all < 2^31-1, but the sum is bigger than 2^31-1, if they are merged together before calling coverage() GRanges handles this internally and converts "score" to numeric), I use similar method then to what I described above, I first run coverage(), if any sum(runValues(cov)) are NA, I rerun the subset of the GRanges chromosomes with score as numeric.

If I set the entire "score" to numeric, my output object is 1GB, if I set it to numeric only for failing chromosomes my object is 580MB, and it is faster too. So a no brainer to fix that one at least for me on big data for human genome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants