
File Types

Adam Novak edited this page Nov 22, 2024 · 16 revisions

Glossary of vg-related File Types

The vg ecosystem uses a lot of file formats. Some are new and not yet used consistently; others are old but still required for a few less-popular operations.

Some of these are described in more detail at File Formats and Index Types.

Reference Formats

These formats store genome references that define spaces in which genomics can be done.

| Name | Description | Extension | Status | Notes |
| --- | --- | --- | --- | --- |
| VG Protobuf | The original vg graph format | .vg | Conditionally useful | Can also be used to store paths without the nodes and edges they belong to. Default output format of vg construct in vg v1.40.0, since it can be generated incrementally. Can be concatenated with cat. Usually block-GZIP compressed, but some old files aren't. Consists of count-prefixed groups of length-prefixed Protobuf messages, where a string type tag takes the place of the first message in each group. |
| GBZ | "GBZ" graph, a compressed format storing a graph as traversed by sample haplotypes | .gbz | Recommended | Stores not only the graph but also large numbers of haplotypes, so you don't need an additional GBWT file. Internally, stores a GBWT and a GBWTGraph. Can't store edges that are not followed. |
| GFA | Graphical Fragment Assembly: a text-based format for storing graphs and their embedded paths. | .gfa | Recommended for interchange | vg uses GFA 1.x and doesn't really support GFA 2. |
| HashGraph | Graph format based on a hashtable, from libbdsg. | .hg, .vg | Recommended | Default output format of many vg subcommands as of v1.40.0. |
| PackedGraph | Graph format based on succinct data structures, from libbdsg. | .pg, .vg | Recommended for large graphs | This format can store graphs in less space than HashGraph, but is also slower and more complicated. |
| Memory-Mapped PackedGraph | A version of PackedGraph that can be incrementally read from disk | .mpg? | Experimental | Might not actually be adopted; GBZ solves a different but often more important problem |
| ODGI (vg flavor) | "Optimized Dynamic Genome/Graph Implementation" format. | .odgi | Removed | vg used to support a version implemented in libbdsg, which was NOT wire-compatible with the version implemented in the odgi project, and so was removed. |
| XG | Compressed, immutable graph format. Doesn't really stand for anything. | .xg | Conditionally useful | PackedGraph may be better, but many tools reference "xg files" as historically this was the only practical format for whole-genome graphs. |
| GBWTGraph | Supplemental information to turn a GBWT into a graph. | .gg | Deprecated | Stores node sequences only; defines a graph when used together with a GBWT file of haplotypes. |
| VG JSON | This is the VG Protobuf format, with the Protobuf Graph objects represented as JSON. | .json | Conditionally useful | Useful for exporting small graphs for analysis with jq, or importing graphs from tools that can't use libbdsg or libvgio. Generally GFA should be used instead. |
| Indexed VG Protobuf | This is the VG Protobuf format, stored in a sorted order with an auxiliary index file for random access. | .sorted.vg | Deprecated | Was never very popular, and Memory-Mapped PackedGraph is intended as a replacement. |
| FASTA | "FASTA" format for storing DNA sequences | .fa, .fasta, .fna | Recommended for linear references | This is a linear genome reference format that vg construct can consume. |
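
The framing described in the VG Protobuf notes (count-prefixed groups of length-prefixed messages, with a string type tag in the first slot of each group) can be sketched in Python as follows. This is a minimal illustration and not libvgio's reader. It assumes that counts and lengths are Protobuf-style base-128 varints, that the group count includes the tag slot, and that the input has already been decompressed. The helper names and the tag and payload bytes used with them are made up for illustration; real files should be read with vg or libvgio.

```python
import io


def read_varint(stream):
    """Read one base-128 varint. Return None at a clean end of stream."""
    result = 0
    shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            if shift == 0:
                return None
            raise EOFError("stream ended inside a varint")
        result |= (byte[0] & 0x7F) << shift
        if not (byte[0] & 0x80):
            return result
        shift += 7


def write_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low = value & 0x7F
        value >>= 7
        if value:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)
            return bytes(out)


def encode_group(tag, messages):
    """Build one group: item count, then length-prefixed tag and messages."""
    items = [tag.encode("utf-8")] + list(messages)
    out = write_varint(len(items))  # assumption: count includes the tag
    for item in items:
        out += write_varint(len(item)) + item
    return out


def read_groups(stream):
    """Yield (tag, [message_bytes, ...]) for each group in the stream."""
    while True:
        count = read_varint(stream)
        if count is None:
            return
        items = []
        for _ in range(count):
            length = read_varint(stream)
            if length is None:
                raise EOFError("stream ended inside a group")
            data = stream.read(length)
            if len(data) != length:
                raise EOFError("stream ended inside a message")
            items.append(data)
        if items:
            # The first item is the type tag; the rest are serialized
            # Protobuf messages, left undecoded in this sketch.
            yield items[0].decode("utf-8"), items[1:]
```

Because each group is self-delimiting, two encoded streams placed back to back parse as one longer stream, which is consistent with the note that these files can be concatenated with cat.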

Read and Alignment Formats

These formats store short or long reads from DNA sequencing machines, and can describe how they fit into references.

| Name | Description | Extension | Status | Notes |
| --- | --- | --- | --- | --- |
| GAM Protobuf | Graph Alignment/Map, vg's main format for aligned reads | .gam | Recommended | |
| GAF | Graph Alignment Format, a text-based format for aligned reads | .gaf | Recommended for interchange | See Graph Alignment Format in vg for notes on tag meanings. In particular, vg uses the cs tag "difference string" instead of a CIGAR. |
| Sorted GAM | GAM file with reads sorted by graph node ID. Useful for random access with an index. | .sorted.gam | Recommended | |
| GAM JSON | JSON version of the Protobuf GAM format. | .json | Conditionally useful | Used for analyzing reads with jq. |
| GAMP Protobuf | Multi-path alignment version of GAM | .gamp | Recommended | |
| GAMP JSON | JSON version of GAMP format | .json | Conditionally useful | |
| BAM | Binary Alignment/Map format for alignments against a linear reference | .bam | Recommended | |
| SAM | Sequence/Alignment Map format, a text-based version of BAM | .sam | Recommended | |
| FASTQ | Version of FASTA with per-base quality scores. Used for unaligned reads. | .fq, .fastq | Recommended | |
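
Since GAF records from vg carry a cs "difference string" rather than a CIGAR, a small parser can help when inspecting alignments by hand. The sketch below assumes the operator set documented for minimap2's cs tag (`:N` for a run of matches, `=SEQ` for matches in long form, `*xy` for a substitution, `+seq` for an insertion, `-seq` for a deletion). Whether vg emits every one of these operators is not confirmed here, so treat the Graph Alignment Format in vg page as the authority. The function names are hypothetical.

```python
import re

# One alternative per cs operation, tried at the current offset.
CS_TOKEN = re.compile(
    r":(\d+)"                    # run of matching bases, short form
    r"|=([A-Za-z]+)"             # run of matching bases, long form
    r"|\*([A-Za-z])([A-Za-z])"   # substitution: target base, then read base
    r"|\+([A-Za-z]+)"            # bases present in the read only
    r"|-([A-Za-z]+)"             # bases present in the target only
)


def parse_cs(cs):
    """Split a cs difference string into a list of operation tuples."""
    ops = []
    pos = 0
    while pos < len(cs):
        m = CS_TOKEN.match(cs, pos)
        if m is None:
            raise ValueError(f"unrecognized cs operation at offset {pos}")
        if m.group(1) is not None:
            ops.append(("match", int(m.group(1))))
        elif m.group(2) is not None:
            ops.append(("match", len(m.group(2))))
        elif m.group(3) is not None:
            ops.append(("substitution", m.group(3), m.group(4)))
        elif m.group(5) is not None:
            ops.append(("insertion", m.group(5)))
        else:
            ops.append(("deletion", m.group(6)))
        pos = m.end()
    return ops


def aligned_lengths(ops):
    """Return (target_length, read_length) consumed by the operations."""
    target = 0
    read = 0
    for op in ops:
        if op[0] == "match":
            target += op[1]
            read += op[1]
        elif op[0] == "substitution":
            target += 1
            read += 1
        elif op[0] == "insertion":
            read += len(op[1])
        else:  # deletion
            target += len(op[1])
    return target, read
```

For example, `parse_cs(":5+ac:2")` yields a match of 5, an insertion of `ac`, and a match of 2, covering 7 bases of the target path and 9 bases of the read.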

Sample Information Formats

These formats can describe individual people or other organisms and how their genomes fit into or differ from references.

| Name | Description | Extension | Status | Notes |
| --- | --- | --- | --- | --- |
| GBWT | Graph Burrows-Wheeler Transform file, storing haplotypes for samples | .gbwt | Conditionally useful | It sometimes makes more sense to use a GBZ. |
| GBZ | See above under Reference Formats | | | |
| VCF | Variant Call Format file, storing sample genotypes and haplotypes against a linear reference | .vcf, .vcf.gz | Recommended | Not all VCF 4.3 features are supported by vg |
| Pack File | Stores read information as counts of visited graph elements | .cx | Recommended | |
| Pileup Protobuf | Stores read information as counts of visited graph elements | .pileup? | Deprecated | |
| Pileup JSON | JSON version of the Pileup Protobuf format | .json | Deprecated | |
| Locus Protobuf | Stores genotypes against a graph reference | .loci | Experimental | |
| Locus JSON | JSON version of the Locus Protobuf format | .json | Deprecated | |

Miscellaneous Formats

These formats store other kinds of information, or are precomputed indexes to speed up operations on other data.

| Name | Description | Extension | Status | Notes |
| --- | --- | --- | --- | --- |
| Distance Index (v1) | Index for computing distances between points in a graph | .dist | Deprecated | Used in vg giraffe |
| Distance Index (v2) | Index for computing distances between points in a graph | .dist | Recommended | |
| GCSA | Generalized Compressed Suffix Array, version 2, for finding substrings in a graph | .gcsa | Recommended | Used in vg map and vg mpmap |
| Minimizer Index | Used to find "minimizer" substrings in a graph | .min | Recommended | Used in vg giraffe |
| BED | Browser Extensible Data format, used for defining regions | .bed | Recommended | |
| Dot | GraphViz input format | .dot | Conditionally useful | vg view -d can export graphs in Dot format for visualization with GraphViz's dot tool. |
| Snarl Protobuf | Hierarchical decomposition of a graph into variable sites, called "snarls" | .snarls | Recommended | |
| Snarl JSON | JSON representation of Protobuf snarl data | .json | Conditionally useful | |
| SnarlTraversal Protobuf | Binary representation of possible paths through snarls | .trav? | Conditionally useful | |
| SnarlTraversal JSON | Text representation of possible paths through snarls | .json | Conditionally useful | |
| Node ID Translation | Recorded information about changes made to nodes while modifying a graph | .trans | Conditionally useful | |
| VG Protobuf Index | Index over a sorted VG Protobuf file | .vgi | Experimental | |
| GAM Index | Index over a sorted GAM Protobuf file | .gai, .gam.index | Recommended | Useful for vg chunk to fetch out reads for a particular region |
| FASTA Index | Index over a FASTA file for random access | .fai | Recommended | |
| BAM Index | Index over a sorted BAM file for random access | .bai | Recommended | |
| Tabix VCF Index | Index over a sorted, compressed VCF file for random access | .tbi | Recommended | |
| Zipcodes | Store supplemental distance information for positions on a graph | .zipcodes | Experimental | |
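
Several extensions in these tables are shared by more than one format (.vg, .json, and .dist in particular), so an extension alone does not always identify a file. The lookup below is a rough sketch built from a subset of the rows above. `EXTENSION_FORMATS` and `candidate_formats` are hypothetical names for illustration, not something vg provides, and the function only reports which glossary entries a name could correspond to.

```python
# Subset of the glossary, keyed by extension. Values list every format
# that the tables above associate with that extension.
EXTENSION_FORMATS = {
    ".vg": ["VG Protobuf", "HashGraph", "PackedGraph"],
    ".sorted.vg": ["Indexed VG Protobuf"],
    ".hg": ["HashGraph"],
    ".pg": ["PackedGraph"],
    ".gbz": ["GBZ"],
    ".gfa": ["GFA"],
    ".xg": ["XG"],
    ".gam": ["GAM Protobuf"],
    ".sorted.gam": ["Sorted GAM"],
    ".gaf": ["GAF"],
    ".vcf": ["VCF"],
    ".vcf.gz": ["VCF"],
    ".dist": ["Distance Index (v1)", "Distance Index (v2)"],
    ".json": [
        "VG JSON", "GAM JSON", "GAMP JSON", "Pileup JSON",
        "Locus JSON", "Snarl JSON", "SnarlTraversal JSON",
    ],
}


def candidate_formats(filename):
    """Return the glossary formats a file name could correspond to."""
    name = filename.lower()
    matches = [ext for ext in EXTENSION_FORMATS if name.endswith(ext)]
    if not matches:
        return []
    # Prefer the longest suffix so ".sorted.vg" wins over ".vg".
    return EXTENSION_FORMATS[max(matches, key=len)]
```

A name such as `graph.vg` comes back with three candidates, so telling them apart requires inspecting the file contents rather than the name.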