Reference Sequence Library (RSL)

Robert J. Gifford edited this page Jun 23, 2024 · 5 revisions

DIGS is a project-based framework in which investigations are centered around a genome feature of interest. Any genome feature can be investigated in principle, so long as it contains sufficient sequence conservation to be reliably detected in a similarity search.

DIGS requires a library of FASTA-formatted reference sequences for the gene or genetic element under investigation. Probes for screening are selected from the reference library.

The 'reference sequence library' is a curated set of sequences relevant to the genome feature under investigation. Usually, this will consist of:

  • A set of conserved DNA or polypeptide sequences derived from the genome feature of interest.

However, depending on the kind of investigation being performed, it may also contain:

  • Sequences that do not derive from the genome feature under investigation, but can provide useful information about the locus in which it occurs.
  • Sequences representing genome features that are not relevant to the investigation, but are sufficiently similar to the feature of interest to generate 'false positive' matches.

The DIGS tool uses a simple rule to capture data from the headers of FASTA-formatted reference (and probe) sequences. Headers should be structured so as to define two hierarchical name elements, 'name' and 'gene_name', separated by an underscore. For example, these could be a virus name and a gene name. Other two-level hierarchical naming schemes (e.g. species and gene name, or gene and subdomain name) can also be used, provided the same scheme is used consistently throughout the project. Reference sequences should be stored in a single file, the path to which is specified in the DIGS control file.
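
The header rule described above can be illustrated with a short Python sketch. This is not the DIGS tool's own code: the function names `parse_header` and `read_fasta` and the example header `HIV-1_pol` are hypothetical, and the sketch assumes that when a header contains more than one underscore, the split is made at the last one. Check how DIGS handles such headers before relying on that behaviour.

```python
from typing import Iterator, Tuple


def parse_header(header: str) -> Tuple[str, str]:
    """Split a FASTA header into (name, gene_name) at an underscore.

    Assumption (not confirmed by the DIGS documentation): if the header
    contains several underscores, the LAST one is used as the separator.
    """
    header = header.lstrip(">").strip()
    name, sep, gene_name = header.rpartition("_")
    if not sep or not name or not gene_name:
        raise ValueError(f"Header does not follow name_gene_name: {header!r}")
    return name, gene_name


def read_fasta(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, gene_name, sequence) for each record in FASTA text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield (*parse_header(header), "".join(chunks))
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield (*parse_header(header), "".join(chunks))
```

With these definitions, `parse_header(">HIV-1_pol")` returns `("HIV-1", "pol")`, and a header with no underscore raises a `ValueError`. Running a check like this over a library before screening is one way to catch headers that break the naming scheme.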