Subsetting a SimpleRleList object by a GRanges object 'x' should work, even when 'sum(width(x))' >= 2^31 #44

hpages · 2021-12-01T21:21:42Z

This is a spin-off from issue #43:

library(BSgenome.Hsapiens.UCSC.hg38)
genome <- BSgenome.Hsapiens.UCSC.hg38

x <- GRanges(c("chr1:1-1000", "chr1:11-1010", "chr1:1000001-1500000", "chrM:1-16569"), seqinfo=seqinfo(genome))
x_cvg <- coverage(x)  # SimpleRleList object

bins <- tileGenome(seqinfo(genome), tilewidth=1e6, cut.last.tile=TRUE)
sum(width(bins))  # >= 2^31
x_cvg[bins]
# Error in .numeric2end(x, NG) : 
#   when 'x' is an integer vector, it cannot contain NAs or negative values
# In addition: Warning message:
# In bindROWS(x, list(...), ignore.mcols = ignore.mcols) :
#   integer overflow in 'cumsum'; use 'cumsum(as.numeric(.))'

sessionInfo():

R Under development (unstable) (2021-10-25 r81105)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.10

Matrix products: default
BLAS:   /home/hpages/R/R-4.2.r81105/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.2.r81105/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg38_1.4.4 BSgenome_1.63.3                  
 [3] rtracklayer_1.55.2                Biostrings_2.63.0                
 [5] XVector_0.35.0                    GenomicRanges_1.47.5             
 [7] GenomeInfoDb_1.31.1               IRanges_2.29.1                   
 [9] S4Vectors_0.33.5                  BiocGenerics_0.41.2              

loaded via a namespace (and not attached):
 [1] zlibbioc_1.41.0             GenomicAlignments_1.31.2   
 [3] BiocParallel_1.29.4         lattice_0.20-45            
 [5] rjson_0.2.20                tools_4.2.0                
 [7] grid_4.2.0                  SummarizedExperiment_1.25.2
 [9] parallel_4.2.0              Biobase_2.55.0             
[11] matrixStats_0.61.0          yaml_2.2.1                 
[13] crayon_1.4.2                BiocIO_1.5.0               
[15] Matrix_1.3-4                GenomeInfoDbData_1.2.7     
[17] restfulr_0.0.13             bitops_1.0-7               
[19] RCurl_1.98-1.5              DelayedArray_0.21.2        
[21] compiler_4.2.0              MatrixGenerics_1.7.0       
[23] Rsamtools_2.11.0            XML_3.99-0.8

The text was updated successfully, but these errors were encountered:

Roleren · 2024-10-10T09:06:23Z

Yes, I have been working a lot with tweaks for this, what I ended up doing is using a while loop to split up only chromosomes that results in sums > 2^31-1, such that I only need to do it for those I know will fail.

I then split it that chromosome/bin set in two, and check if any of the subsets are still > 2^31-1, repeat if true (I set to max 20 split rounds to avoid infinite loop, if some crazy number is used etc)

This works for me, but the optimal would be if it is possible to catch this internally and allow some numeric types in the coverage subsetting function directly ?

Btw, this problem is also true for calculating the coverage() of GRanges object using weight = "score", where any score is > 2^31-1 if score column is integer (this happens if the same coordinate repeats over multiple ranges (and those scores are all < 2^31-1, but the sum is bigger than 2^31-1, if they are merged together before calling coverage() GRanges handles this internally and converts "score" to numeric), I use similar method then to what I described above, I first run coverage(), if any sum(runValues(cov)) are NA, I rerun the subset of the GRanges chromosomes with score as numeric.

If I set the entire "score" to numeric, my output object is 1GB, if I set it to numeric only for failing chromosomes my object is 580MB, and it is faster too. So a no brainer to fix that one at least for me on big data for human genome.

hpages mentioned this issue Dec 1, 2021

Some methods expect no more than 2^31 elements #43

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Subsetting a SimpleRleList object by a GRanges object 'x' should work, even when 'sum(width(x))' >= 2^31 #44

Subsetting a SimpleRleList object by a GRanges object 'x' should work, even when 'sum(width(x))' >= 2^31 #44

hpages commented Dec 1, 2021

Roleren commented Oct 10, 2024 •

edited

Loading

Subsetting a SimpleRleList object by a GRanges object 'x' should work, even when 'sum(width(x))' >= 2^31 #44

Subsetting a SimpleRleList object by a GRanges object 'x' should work, even when 'sum(width(x))' >= 2^31 #44

Comments

hpages commented Dec 1, 2021

Roleren commented Oct 10, 2024 • edited Loading

Roleren commented Oct 10, 2024 •

edited

Loading