Reference Sequence Library (RSL)
DIGS is a project-based framework in which investigations are centered around a genome feature of interest. Any genome feature can be investigated in principle, so long as it contains sufficient sequence conservation to be reliably detected in a similarity search.
DIGS requires a library of FASTA-formatted reference sequences for the gene or genetic element under investigation. Probes for screening are selected from the reference library.
The 'reference sequence library' is a curated set of sequences relevant to the genome feature under investigation. Usually, this will consist of:
- A set of conserved DNA or polypeptide sequences derived from the genome feature of interest.
However, depending on the kind of investigation being performed, it may also contain:
- Sequences that do not derive from the genome feature under investigation, but can provide useful information about the locus in which it occurs.
- Sequences representing genome features that are not relevant to the investigation, but are similar enough to the feature of interest to generate 'false positive' matches.
The DIGS tool uses a simple rule to capture data from the headers of FASTA-formatted reference (and probe) sequences. Headers should be structured so as to define two hierarchical name elements, ‘name’ and ‘gene_name’, separated by an underscore. In the example shown above, these are a virus name and a gene name. Other two-level hierarchical naming schemes (e.g. species and gene name, or gene and subdomain name) can also be used, provided that the same scheme is used consistently throughout the project. Reference sequences should be stored in a single file, the path to which is specified in the DIGS control file.
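The header rule can be illustrated with a short Python sketch. This is not the DIGS tool's own code. The function names `parse_header` and `read_reference_headers` and the sample headers are hypothetical. Splitting at the last underscore is also an assumption, made here so that a ‘name’ element may itself contain underscores; check how the tool treats such headers before relying on it.

```python
from typing import Iterable, List, Tuple


def parse_header(header: str) -> Tuple[str, str]:
    """Split a FASTA header into (name, gene_name).

    Assumption: the split is made at the LAST underscore, so the
    'name' element may itself contain underscores.
    """
    text = header.lstrip(">").strip()
    name, sep, gene_name = text.rpartition("_")
    if not sep or not name or not gene_name:
        raise ValueError(f"Header lacks the name_gene_name structure: {header!r}")
    return name, gene_name


def read_reference_headers(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Collect (name, gene_name) pairs from the header lines of FASTA text."""
    return [parse_header(line) for line in lines if line.startswith(">")]


# Hypothetical reference library content (illustrative sequences only).
example = [
    ">HIV-1_gag",
    "ATGGGTGCGAGAGCG",
    ">HIV-1_pol",
    "TTTTTTAGGGAAGAT",
]
parsed = read_reference_headers(example)
print(parsed)
```

Running a check like this over a reference file before screening can flag headers that lack the two-part structure, which helps keep the naming scheme consistent across the project.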