N-D array conventions for variation data #702
Replies: 10 comments
-
(Posted by @alimanfoo) I'll start off with some info on storing genomic positions. Typically we have genome variation data at some set of positions (a.k.a. sites) in a reference genome. When working with variation data it is common to have data only for the genomic sites where variation has been observed, i.e., we don't have data for every genomic site, only a subset. In order to know where those data are located within the reference genome, we store the coordinates of the genomic sites where we have data.

A reference genome is usually composed of a set of contiguous nucleotide sequences, where each sequence has some identifier. For reference genomes that are well assembled, each sequence is actually a full chromosome, and the sequence IDs are chromosome names. In less complete reference genomes, each sequence is only part of a chromosome (called a contig), and sequences have fairly opaque alphanumerical IDs. Here for convenience I'll use "chromosome" to refer to a contiguous piece of a reference genome, and "seqid" to refer to its ID. Each chromosome is a linear sequence of nucleotides which can be indexed by position. Usually 1-based indexing is used, so, e.g., position 1 is the first nucleotide in the sequence. Each site in a reference genome can thus be uniquely identified by a pair (seqid, position).

For example, we might have data for three sites at positions 2, 7 and 12 within a given chromosome. These positions we typically store as a 1-dimensional array of 32-bit integers. A 32-bit int is usually sufficient to cover the largest chromosome in most reference genomes, although a few organisms with very large genomes might require 64-bit ints. These positions are always positive, so either signed or unsigned int could be used. Usually the data have been sorted by position within each chromosome, i.e., positions within a chromosome are usually monotonically increasing, and usually there is no repetition of positions.
But this is not always the case. When we have data from multiple chromosomes, we typically group the data by chromosome, i.e., we have a set of 1D integer arrays, each of which stores the positions for a single chromosome. A natural data structure for these genomic coordinates would be something like a pandas MultiIndex, where the first level is the seqids (strings) and the second level is the positions (ints). However, we don't typically use a pandas MultiIndex in practice because of excessive memory use. Note that there is no inherent ordering to the seqids. They are sometimes presented in lexical order, sometimes in order of chromosome size, but both are arbitrary.

A common use case for genomic coordinates is to find the coordinates within some genome region, also known as a region query. A region query consists of a seqid and a start and stop position (1-based, stop inclusive). Given the above conventions for storing genomic positions, a region query is easy to satisfy, because it simply requires a binary search within the positions for the given chromosome. In scikit-allel we implemented a few classes to make this kind of querying more convenient.
It would be nice to use some more general classes here, e.g., from pandas, but memory usage has always been an issue.
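To make the convention concrete, here is a minimal numpy sketch of a region query over per-chromosome sorted position arrays. The seqids and positions are toy values for illustration, and `region_query` is a hypothetical helper, not scikit-allel's API:

```python
import numpy as np

# Hypothetical per-chromosome coordinate store: positions are sorted,
# 1-based, int32 arrays keyed by seqid (toy data for illustration).
positions = {
    "2L": np.array([2, 7, 12, 21, 35], dtype=np.int32),
    "2R": np.array([5, 9, 14], dtype=np.int32),
}

def region_query(seqid, start, stop):
    """Return the (lo, hi) slice bounds of sites within [start, stop]
    (1-based, stop inclusive), found by binary search."""
    pos = positions[seqid]
    lo = np.searchsorted(pos, start, side="left")
    hi = np.searchsorted(pos, stop, side="right")
    return lo, hi

lo, hi = region_query("2L", 5, 21)
print(positions["2L"][lo:hi])  # -> [ 7 12 21]
```

Because the positions within each chromosome are sorted, each query is O(log n) regardless of how many sites fall inside the region.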
-
(Posted by @alimanfoo) A short follow-up on genomic intervals. In some cases we have data for a set of genetic variants or features, each of which spans some genomic interval, e.g., a deletion of 2 or more nucleotides. These features are located with respect to a reference genome by a seqid together with start and stop coordinates. For a set of such variants we may also want to make a query for all variants overlapping some query region; in other words, given a query region, find all variants whose intervals intersect it.

If working with data from a single chromosome, the start and stop coordinates could each be stored as a 1D array of ints. Usually the data are sorted by start coordinate. To enable interval queries, if the dataset is not too large we just do brute-force scans. If the dataset is larger we usually build an interval tree, e.g., using the intervaltree package. However, we haven't built anything special to make this more convenient.

Fairly recently, pandas added IntervalIndex which could potentially be used to implement genomic interval queries. It looks useful, but we haven't tried it yet, and suspect memory usage may again be an issue for larger datasets. Also, currently there is no way to use an IntervalIndex as a level within a MultiIndex. Ideally what we would want is something like a MultiIndex, where the first level is seqids and the second level is a set of IntervalIndexes, and which is memory efficient.
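The brute-force scan mentioned above can be sketched in a few lines of numpy. The data and the `overlap_query` helper are hypothetical, but the overlap test itself is the standard one for closed intervals:

```python
import numpy as np

# Toy feature intervals on one chromosome, 1-based, stop inclusive,
# sorted by start coordinate.
starts = np.array([2, 10, 18, 30], dtype=np.int32)
stops = np.array([6, 14, 25, 31], dtype=np.int32)

def overlap_query(qstart, qstop):
    """Brute-force scan: a feature overlaps the query iff it starts
    no later than qstop and stops no earlier than qstart."""
    mask = (starts <= qstop) & (stops >= qstart)
    return np.nonzero(mask)[0]

print(overlap_query(12, 20))  # -> [1 2]
```

This is O(n) per query, which is fine for smaller datasets; an interval tree brings that down to O(log n + k) at the cost of building and holding the tree.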
-
(Posted by @alimanfoo) Here's some info on conventions we use to represent genotype calls. We typically represent a set of genotype calls as a 3D array of signed integers, where the first dimension corresponds to a set of genetic variants, the second dimension corresponds to a set of biological samples, and the third dimension is the ploidy of the organism. E.g., for a set of genotype calls at 1000000 variants in 1000 humans, this gives an array of shape (1000000, 1000, 2). The integer values in this array correspond to alleles, where 0 is the reference allele (i.e., whatever allele is present in the reference genome), values greater than 0 are alternate alleles, and values less than 0 (typically -1) are used to indicate a missing call. The actual alternate alleles (i.e., the nucleotide changes) themselves are stored elsewhere.

In human SNP data it is uncommon to find variants with more than one alternate allele (i.e., most variants are "biallelic": one reference plus one alternate allele), so these arrays are composed almost entirely of zeros and ones. In data on indel variation, or SNPs from more diverse organisms such as mosquitoes, it can be much more common to find variants with multiple alternate alleles, and hence higher integer values may be more common. However, it is rare to find variants with more than a handful of alternate alleles, and so 8-bit signed integers usually provide more than enough range.

A benefit of this representation is that it can be used to store genotype calls for polyploid organisms, e.g., many plants have ploidy greater than 2. A limitation of this representation is that it cannot be used for organisms with variable ploidy. E.g., in many organisms ploidy effectively varies across the genome due to regions of copy number variation. There are also organisms (e.g., Leishmania) where different individuals have different ploidies for the same chromosome.

In scikit-allel we created a wrapper class called GenotypeArray which is based on this convention.
I.e., it wraps a numpy array that follows this convention and provides some domain-specific methods to transform and operate on these data.
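A minimal sketch of the convention with plain numpy (toy data; the derived masks are illustrative, not scikit-allel method names):

```python
import numpy as np

# Toy genotype calls: 3 variants x 2 samples x ploidy 2, int8.
# 0 = reference allele, >0 = alternate alleles, -1 = missing call.
gt = np.array(
    [[[0, 0], [0, 1]],
     [[1, 1], [-1, -1]],   # second sample: missing call
     [[0, 2], [1, 1]]],    # multiallelic site: alternate allele 2
    dtype=np.int8,
)
assert gt.shape == (3, 2, 2)  # (variants, samples, ploidy)

# Per-call missingness: any negative allele index.
is_missing = np.any(gt < 0, axis=2)
# Diploid heterozygosity: the two allele calls differ and are not missing.
is_het = (gt[..., 0] != gt[..., 1]) & ~is_missing
print(is_missing.sum())  # -> 1
print(is_het)
```

Transforms like these reduce over or index into the ploidy dimension, which is why a fixed third dimension makes the representation so convenient for vectorised computation.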
-
(Posted by @alimanfoo) A follow-up on genotype calls: sometimes we use a different approach where a set of genotype calls is represented as a 3D array, where as above the first dimension corresponds to a set of genetic variants and the second dimension corresponds to a set of biological samples, but the third dimension corresponds to the set of alleles. The integer values then store the allele counts for each variant and each sample. This approach has the benefit that it can cope with ploidy that varies within and/or between samples. It is also sometimes more convenient for certain computations, e.g., genetic distance between samples. A limitation of this approach is that you have to assume a fixed number of alleles for all variants. If you are working with SNP data then there are always at most 4 alleles (one reference, three alternate), and so this is not a problem. E.g., for a set of genotype calls at 1000000 SNPs in 1000 samples of any ploidy, this gives an array of shape (1000000, 1000, 4). In scikit-allel we created a wrapper class called GenotypeAlleleCountsArray which is based on this convention and is useful (but is a total mouthful to say!).
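A small numpy sketch of this convention (toy data), showing how per-call ploidy falls out of the representation rather than being fixed by the array shape:

```python
import numpy as np

# Toy allele-counts-per-call array: 2 variants x 2 samples x 4 alleles.
# Entry [i, j, k] = number of copies of allele k carried by sample j at
# variant i; each (i, j) row sums to that call's ploidy.
gac = np.array(
    [[[2, 0, 0, 0],    # hom ref, diploid
      [1, 1, 0, 0]],   # het ref/alt1, diploid
     [[0, 3, 0, 0],    # hom alt1, triploid (variable ploidy is fine)
      [1, 0, 1, 0]]],  # het ref/alt2, diploid
    dtype=np.int8,
)

# Ploidy is recoverable per call by summing over the allele dimension.
ploidy = gac.sum(axis=2)
print(ploidy)  # -> [[2 2], [3 2]]
```

Note the trade-off: the allele identity of each chromosome copy is lost (no phasing information), which is exactly what makes variable ploidy representable.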
-
(Posted by @alimanfoo) Our convention for representing a set of haplotypes (i.e., phased genotype calls) is as a 2D array of integers, where the first dimension corresponds to variants and the second dimension corresponds to haplotypes. As with genotype calls, the integer values represent different alleles, with 0 representing the reference allele, positive integers representing alternative alleles, and -1 used to represent missing data. Usually 8-bit signed int is used. E.g., for a set of haplotypes at 1000000 variants from 1000 diploid samples, this gives an array of shape (1000000, 2000). In scikit-allel we created a wrapper class called HaplotypeArray based on this convention.
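When the calls are fully phased, the haplotype array is just the genotype array with the samples and ploidy dimensions collapsed. A sketch with toy data:

```python
import numpy as np

# Toy phased diploid genotype calls: 2 variants x 3 samples x ploidy 2.
gt = np.array(
    [[[0, 1], [1, 1], [0, 0]],
     [[0, 0], [0, 1], [1, 1]]],
    dtype=np.int8,
)

# Collapse (samples, ploidy) so each column is one haplotype:
# shape becomes (variants, samples * ploidy).
ht = gt.reshape(gt.shape[0], -1)
assert ht.shape == (2, 6)
print(ht)
```

Columns 2*j and 2*j + 1 are the two haplotypes of sample j, so the mapping back to samples is preserved by position.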
-
(Posted by @alimanfoo) One more data structure we use often is a set of allele counts. This is a 2D array of integers where the first dimension corresponds to variants and the second dimension corresponds to alleles. The integer values are the number of observations of each allele at each variant site. An allele counts array can be derived from an array of genotype calls (or haplotypes) by counting the number of occurrences of each integer value within each row of the array, i.e., it's a kind of summarisation. In scikit-allel we created a wrapper class called AlleleCountsArray. Note that in reality different variants will have different numbers of alleles. To accommodate this we choose the length of the second dimension to be the largest number of alleles observed in a given dataset. If working with SNPs, we assume this is 4 (including the reference allele).
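The summarisation step can be sketched with `np.bincount` over each row of a haplotype array (toy data; a real implementation would vectorise rather than loop over variants):

```python
import numpy as np

# Toy haplotypes: 3 variants x 4 haplotypes; -1 = missing.
ht = np.array(
    [[0, 0, 1, 1],
     [0, 1, 2, -1],
     [0, 0, 0, 0]],
    dtype=np.int8,
)
max_allele = 3  # SNP assumption: at most 4 alleles including reference

# Count occurrences of each allele per variant; missing (-1) is ignored.
ac = np.zeros((ht.shape[0], max_allele + 1), dtype=np.int32)
for i, row in enumerate(ht):
    called = row[row >= 0]  # drop missing calls before counting
    ac[i] = np.bincount(called, minlength=max_allele + 1)
print(ac)
# -> [[2 2 0 0]
#     [1 1 1 0]
#     [4 0 0 0]]
```

Note that because missing calls are dropped, rows need not all sum to the same total; the row sum is the number of called alleles at that variant.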
-
(Posted by @alimanfoo) One type of data that is often used is a set of genotype likelihoods. These are the likelihoods for all possible genotypes at a given set of variants and samples. A set of genotype calls can be derived from a set of genotype likelihoods via several processes, e.g., by choosing the genotypes with the greatest likelihood. Other metrics such as "genotype qualities" can also be derived, e.g., by taking the ratio between the first and second most likely genotypes. Many downstream analyses choose to work with genotype likelihoods rather than genotype calls in order to propagate uncertainty. In our mosquito population genomics work in MalariaGEN we have deep sequence data (~30X) which means that we usually have high confidence genotype calls, and so we don't work directly with genotype likelihoods. This is why we haven't built any data structures or convenience functions for working with them. But in many other situations they are important.

A challenge with representing genotype likelihoods is dealing with multiallelic data. For any given variant the number of possible genotypes depends on the number of alleles. In general, the number of possible genotypes is (n multichoose p) where n is the number of alleles and p is the ploidy. E.g., for diploid samples and a biallelic variant (2 alleles), there are 3 possible genotypes. One way of representing genotype likelihoods would be as a 3D array of floats, where the first dimension is variants, the second dimension is samples, and the third dimension is the number of possible genotypes. E.g., for 1000000 biallelic variants and 1000 diploid samples this would give an array of shape (1000000, 1000, 3). For data that includes multiallelic variants, the size of the third dimension would need to be set large enough to accommodate the number of possible genotypes at the variant with the largest number of alleles.
E.g., if working with SNP data that includes quadri-allelic sites, there can be up to 10 possible genotypes, so an array of 1000000 multiallelic SNPs and 1000 samples would have shape (1000000, 1000, 10). If there are variants with even larger numbers of alleles (as can occur, e.g., with indels) the size of the third dimension can become unwieldy. So for SNP data it may still be useful to define a convention based on N-D arrays with fixed shape as described above, but for other datasets it may be worth exploring array-like data conventions that allow dimensions with variable size, a.k.a. "ragged" or "jagged" arrays, e.g., as in ndtypes or awkward array.
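The (n multichoose p) formula above can be checked directly: multichoose(n, p) = C(n + p - 1, p). A small sketch (the helper name is illustrative):

```python
from math import comb

def n_genotypes(n_alleles, ploidy):
    """Number of possible unordered genotypes: (n multichoose p),
    i.e. comb(n + p - 1, p)."""
    return comb(n_alleles + ploidy - 1, ploidy)

print(n_genotypes(2, 2))  # biallelic, diploid -> 3
print(n_genotypes(4, 2))  # quadri-allelic SNP, diploid -> 10
print(n_genotypes(2, 4))  # biallelic, tetraploid -> 5
```

This grows quickly with ploidy and allele count, which is what makes a fixed-size third dimension awkward for indel-rich or polyploid datasets.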
-
(Posted by @eric-czech) Hey @alimanfoo, thanks for sharing these! Very helpful.

re: genotype likelihoods: What do you think about an alternative representation for likelihoods? Perhaps inverting some overdetermined nonlinear system to get an arbitrary solution for allele probabilities on each chromosome (from genotype probabilities in vcf/bgen) could be worth it for organisms/sites with high enough ploidy and polymorphism, particularly if the analysis is only going to be focused on specific alleles (presumably the most common vs the rest), where you could just calculate products of likelihoods along the ploidy dimension for a specific allele when needed. I don't at all understand the story behind what led to working with genotype probabilities being so much more common, but I'd be curious to hear if you think that's an artifact of people working mostly with human data where it's easier to just assume bi-allelic sites.
-
(Posted by @eric-czech) re: genomic positions, a few thoughts:
-
(Posted by @eric-czech) re: genotype calls -> variable ploidy When you say that the ploidy varies between or possibly within samples, are you referring to CNVs that may have different alleles, datasets with different mosquito species, or individual samples that actually have different ploidy for certain chromosomes (or maybe some combination of these)? I'm struggling to understand the biology behind this one.
-
(Posted by @alimanfoo)
I thought it might be useful to start a topic to share existing conventions around how data on genetic variation is represented using N-dimensional typed arrays, i.e., using numpy-like data structures. Identifying some common data conventions could be a useful first step towards greater interoperability. I'll post some info on the approaches we've taken in MalariaGEN and scikit-allel, but any other approaches welcome.
P.S., what I'm trying to capture here are some general conventions for representing genetic variation data using array-like data structures, i.e., things that have a number of dimensions, a shape and a data type. I'm not making any commitment about the actual concrete physical storage of these arrays. I.e., they could be stored in CPU memory or GPU memory or on-disk or elsewhere. They could be dense or sparse. They could be contiguous or chunked. They could have row-major or column-major memory layout. Those are all implementation options, interesting to discuss but somewhat orthogonal to what I thought I'd try to capture here.