Order of chromosomes in BigBed.Writer #2

Open
jonathanBieler opened this issue Nov 5, 2019 · 5 comments

@jonathanBieler

I'm trying to write data for each human chromosome to a BigBed file. I defined my writer like this:

writer = BigBed.Writer(file, [(chr, length(genome[chr])) for chr in chrs])

With chrs being ["1", "2", ...]. I then loop over chrs and do some write operations, but I get:

ArgumentError: disordered intervals

This happens because the writer reorders the chromosomes as ["1", "10", "11"].

I managed to get around it by getting the chromosomes in the right order with:

ochrs = collect(values(writer.chromnames))[sortperm(collect(keys(writer.chromnames)))]

Would it be possible to have the writer keep the ordering it's given (maybe using an OrderedDict), or is there a good reason why it gets reordered?

Alternatively, a chromlist method that returns the chromosomes in the writer's order would help.

@CiaranOMara
Member

CiaranOMara commented Nov 8, 2019

@jonathanBieler, does the following cover your idea of a chromlist method?

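# Order numbers before letters when comparing mixed name components.
# Note: these methods extend Base.isless for types this code does not own.
function Base.isless(::Int, ::Char)
    return true
end

function Base.isless(::Char, ::Int)
    return false
end

# Compare two sequence names, treating the first run of digits as a number.
function seqname_isless(str1::String, str2::String) :: Bool

    # Split a name into characters, replacing the first run of digits with its integer value.
    function parse_seqname(str::String) :: Vector{Union{Char, Int}}
        arr = Vector{Char}(str)

        arr = convert(Vector{Union{Char, Int}}, arr)

        m = match(r"(\d+)", str)

        if m !== nothing
            for (capture, offset) in zip(reverse(m.captures), reverse(m.offsets))
                splice!(arr, UnitRange(offset, offset + length(capture) -1), parse(Int, capture))
            end
        end

        return arr

    end

    return parse_seqname(str1) < parse_seqname(str2)

end

julia> seqname_isless("chr1", "chrM")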
true

julia> seqname_isless("chrM", "chr1")
false

julia> seqnames = ["chr11", "chr10", "chr01", "chr100", "chr1" , "chrM", "chr010"];

julia> sort(seqnames)
7-element Array{String,1}:
 "chr01" 
 "chr010"
 "chr1"  
 "chr10" 
 "chr100"
 "chr11" 
 "chrM"  

julia> sort(seqnames,lt=seqname_isless)
7-element Array{String,1}:
 "chr01" 
 "chr1"  
 "chr10" 
 "chr010"
 "chr11" 
 "chr100"
 "chrM"  

julia> seqnames = string.(1:11);

julia> sort(seqnames)
11-element Array{String,1}:
 "1" 
 "10"
 "11"
 "2" 
 "3" 
 "4" 
 "5" 
 "6" 
 "7" 
 "8" 
 "9" 

julia> sort(seqnames,lt=seqname_isless)
11-element Array{String,1}:
 "1" 
 "2" 
 "3" 
 "4" 
 "5" 
 "6" 
 "7" 
 "8" 
 "9" 
 "10"
 "11"
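
A similar natural ordering can probably be obtained without adding methods to Base.isless, by sorting on a key instead. The following is only a sketch: natural_key is a hypothetical helper (not part of any BioJulia package), and it assumes names have the shape "prefix + optional number + optional suffix".

```julia
# Hypothetical helper: build a sort key (prefix, number, suffix) from a sequence name.
# Names without any digits get typemax(Int) as their number.
function natural_key(name::AbstractString)
    m = match(r"^(\D*)(\d+)(.*)$", name)
    m === nothing && return (String(name), typemax(Int), "")
    return (String(m.captures[1]), parse(Int, m.captures[2]), String(m.captures[3]))
end

# Sort by the key rather than by a custom `lt` function.
sorted_numeric = sort(string.(1:11), by = natural_key)
sorted_chr = sort(["chr11", "chr10", "chr1", "chrM"], by = natural_key)
```

Sorting with by = natural_key keeps the comparison logic local to the key function, so Base's comparison methods are left untouched.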

@jonathanBieler
Author

jonathanBieler commented Nov 8, 2019

My issue isn't which particular order the chromosome list is in. It's that you need to know (or be able to set) the order in which chromosomes are stored internally in the Writer, so that you can write without getting a disordered intervals error. So it would need to be something like what I wrote above:

chromlist(writer) = 
    collect(values(writer.chromnames))[sortperm(collect(keys(writer.chromnames)))]
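
To illustrate what this idiom does, here is a small sketch on a plain Dict. The id => name layout and the ids themselves are assumptions made for the example; the actual contents of writer.chromnames are not shown in this thread.

```julia
# Hypothetical stand-in for writer.chromnames: internal id => chromosome name.
chromnames = Dict(0 => "1", 1 => "10", 2 => "2")

ids = collect(keys(chromnames))
chrom_names = collect(values(chromnames))

# keys and values of a Dict iterate in the same order, so sorting the ids
# gives a permutation that puts the names in internal-id order.
ordered = chrom_names[sortperm(ids)]
```

Here ordered lists the names by increasing id, which is the order a caller would have to follow when writing.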

@jonathanBieler
Author

jonathanBieler commented Nov 8, 2019

Here's an MWE:

    output = open("data.bb", "w")
    writer = BigBed.Writer(output, [("1", 12345), ("2", 9100), ("10", 123)])
    
    write(writer, ("1", 101, 150, "gene 1"))
    write(writer, ("2", 211, 250, "gene 2"))
    write(writer, ("10", 211, 250, "gene 3"))
    close(writer)
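
As long as the writer reorders chromosomes, one possible workaround is to sort the records into the order the writer expects before writing them. The sketch below uses plain tuples only and assumes the writer orders chromosome names with Julia's default string sort, which is what the reordering to ["1", "10", "11"] suggests. The variable names are illustrative.

```julia
chromlist = [("1", 12345), ("2", 9100), ("10", 123)]
records = [
    ("1", 101, 150, "gene 1"),
    ("2", 211, 250, "gene 2"),
    ("10", 211, 250, "gene 3"),
]

# Assumption: the writer sorts chromosome names lexicographically.
rank = Dict(name => i for (i, name) in enumerate(sort(first.(chromlist))))

# Order records by chromosome rank, then by start and end position.
ordered = sort(records, by = r -> (rank[r[1]], r[2], r[3]))
```

With these inputs, the record on chromosome "10" comes before the one on chromosome "2". This requires holding the records in memory, so it does not help when streaming a large file in a fixed order.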

@jonathanBieler jonathanBieler changed the title Order in chromosome for BigBed.Write Order in chromosome for BigBed.Writer Nov 8, 2019
@jonathanBieler jonathanBieler changed the title Order in chromosome for BigBed.Writer Order of chromosomes in BigBed.Writer Nov 8, 2019
@jonathanBieler
Author

jonathanBieler commented Nov 8, 2019

Actually, it would be better if I could specify the order in the Writer. I have a large text file that I want to read line by line and write into a BigBed file, and I don't want to reorder it to match the writer's internal order.

Just commenting out the sort here solves the problem (though it might create others, I'm not sure):

https://github.com/BioJulia/GenomicFeatures.jl/blob/8fc34ff680f5e742e25d2bd3d4722cb33fbe3cd5/src/bbi/btree.jl#L116

But conceptually I would leave the choice of ordering to the user, since it might be imposed by external constraints, like a file you got from another tool.

@CiaranOMara
Member

@jonathanBieler, I agree that this is an issue/annoyance and have made a note of it in https://github.com/BioJulia/BigBed.jl/projects/1.

@CiaranOMara CiaranOMara transferred this issue from BioJulia/GenomicFeatures.jl May 3, 2020