A repository accompanying the manuscript "Measuring the invisible – The sequences causal of genome size differences in eyebrights (Euphrasia) revealed by k-mers".
Intraspecific genome size (GS) variation is due to presence/absence variation, which may affect single-copy regions or genomic repeats. To date, studies targeting the sequences underpinning GS variation have commonly used low-pass sequencing ("genome skimming") data analysed with the RepeatExplorer pipeline. Such studies have convincingly identified repeats involved in GS variation, but they necessarily paint an incomplete picture: with genome skimming data, it is not possible to assess the contribution of low- and single-copy sequences to GS variation.
Here, we implement an alternative approach. We use k-mers (short sub-sequences of length 21 generated from sequencing reads) from high-coverage sequencing data sets. We compare k-mer inventories between individuals, which allows us to assess the role of all genomic copy-number classes, from single-copy sequences to highly repetitive satellite DNAs.
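To make the notion of a k-mer inventory concrete, here is a toy sketch that lists and counts the k-mers of a single short sequence. The `kmers` helper and the example sequence are made up for illustration, and k is set to 4 only to keep the output readable. Unlike KMC3, which by default counts canonical k-mers (a k-mer and its reverse complement are treated as one), this toy counts each k-mer as written.

```shell
#!/bin/sh
# Toy illustration only: print every k-mer of a sequence, one per line.
# $1 = sequence, $2 = k-mer length (the pipeline itself uses k = 21).
kmers() {
    printf '%s\n' "$1" |
        awk -v k="$2" '{ for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k) }'
}

# A 10-bp sequence contains 10 - 4 + 1 = 7 overlapping 4-mers.
# Counting them gives a miniature k-mer inventory:
# ACGT, CGTA and GTAC occur twice, TACG occurs once.
kmers "ACGTACGTAC" 4 | sort | uniq -c
```

Comparing such count tables between individuals is the basic idea behind the approach; real data sets need a dedicated k-mer counter because of their size.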
- K-mer tool kit. You will need to have KMC3 installed. KMC3 can be set up with anaconda, for instance by running `conda install -c bioconda kmc` or (generating a new environment) `conda create -n kmc -c bioconda kmc`.
- File links. You should rename (or generate links to) your sequencing data files so that each sample has a unique prefix that can be used to easily select all of an individual's files.
- Quality filtering/trimming. Optional but recommended: trim and clean your sequencing data. Sequencing errors do not matter much, because they generate unique k-mers that do not significantly affect estimates. Sequencing adapter contamination, however, can show up as high-copy-number k-mers, biasing genome size estimates. We used fastp.
- Organellar assemblies. You need reference sequences for the plastid and mitochondrial genomes. You may choose to assemble them de novo from your data using GetOrganelle or download something suitable from a repository. These assemblies are then used to remove organelle k-mers from your data, which would otherwise bias genome size estimates.
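The preparation steps above could look roughly like the following sketch. All sample names, directories and file names are hypothetical, and the tool options shown are a minimal selection rather than the settings used for the manuscript. Commands are only printed unless `DRY_RUN=0` is set, so the sketch can be read and run without the tools installed.

```shell
#!/bin/sh
# Sketch only: sample prefix, directory layout and file names are hypothetical.
# Commands are printed rather than executed unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

sample=E001      # hypothetical unique per-sample prefix
raw=raw_data     # hypothetical directory holding the original files

# File links: give every file of this individual the same prefix.
run ln -s "$raw/run42_L001_R1.fastq.gz" "${sample}_R1.fastq.gz"
run ln -s "$raw/run42_L001_R2.fastq.gz" "${sample}_R2.fastq.gz"

# Trimming: fastp with adapter auto-detection for paired-end reads.
run fastp \
    -i "${sample}_R1.fastq.gz" -I "${sample}_R2.fastq.gz" \
    -o "${sample}_trimmed_R1.fastq.gz" -O "${sample}_trimmed_R2.fastq.gz" \
    --detect_adapter_for_pe

# Organelles: de novo plastid assembly with GetOrganelle (one option;
# downloading a suitable reference is the alternative named above).
run get_organelle_from_reads.py \
    -1 "${sample}_trimmed_R1.fastq.gz" -2 "${sample}_trimmed_R2.fastq.gz" \
    -F embplant_pt -o "${sample}_plastid"
```

With a shared prefix in place, a glob such as `E001_*` selects all files of one individual in later steps.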
The pipeline has two steps:
- Generation of k-mer databases and k-mer spectra. These need to be analysed manually to assess the sequencing coverage (for instance using Tetmer).
- Generation of the scaled and binned joint k-mer spectra
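As a rough orientation, the two steps might be sketched as follows. This is not the repository's own code: file names, thread and memory settings, the coverage values, and the helper `joint_spectrum` are hypothetical. In particular, scaling counts by a per-sample coverage and rounding to the nearest integer is only one assumed reading of "scaled and binned". KMC commands are printed rather than executed unless `DRY_RUN=0` is set; the second part runs on two tiny made-up count tables in the `k-mer<TAB>count` layout of a KMC text dump.

```shell
#!/bin/sh
# Sketch only, not the repository's scripts. All names and values are hypothetical.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

sample=E001   # hypothetical sample prefix

# Step 1: count 21-mers, drop organelle k-mers, export the k-mer spectrum.
# (The tmp directory must exist before KMC is actually run.)
run kmc -k21 -t4 -m8 -ci1 -cs1000000 "@${sample}_files.txt" "${sample}_k21" tmp
run kmc -k21 -fm -ci1 organelles.fasta organelles_k21 tmp
run kmc_tools simple "${sample}_k21" organelles_k21 kmers_subtract "${sample}_k21_nuc"
run kmc_tools transform "${sample}_k21_nuc" histogram "${sample}_k21.hist" -cx1000000

# Step 2 (toy version): join two count tables, scale each count by that
# sample's coverage, round to an integer bin, and tally k-mers per bin pair.
# $1/$2: tables (k-mer <TAB> count); $3/$4: per-sample coverage (assumed known).
joint_spectrum() {
    awk -v ca="$3" -v cb="$4" '
        FNR == NR { a[$1] = $2; seen[$1] = 1; next }
                  { b[$1] = $2; seen[$1] = 1 }
        END {
            for (k in seen) {
                x = int(a[k] / ca + 0.5)   # scaled count in sample A (0 if absent)
                y = int(b[k] / cb + 0.5)   # scaled count in sample B (0 if absent)
                n[x "\t" y]++
            }
            for (bin in n) print bin "\t" n[bin]
        }' "$1" "$2" | sort -k1,1n -k2,2n
}

tmpdir=$(mktemp -d)
printf 'AAAC\t20\nAACG\t10\nAAGT\t20\nACGT\t40\n' > "$tmpdir/A.txt"
printf 'AAAC\t15\nAAGT\t15\nACGT\t30\nCGTT\t15\n' > "$tmpdir/B.txt"

# Columns: bin in A, bin in B, number of distinct k-mers in that bin pair.
joint_spectrum "$tmpdir/A.txt" "$tmpdir/B.txt" 10 15
rm -rf "$tmpdir"
```

Rows with a zero in one column hold k-mers present in only one individual, which is where presence/absence variation would show up. The coverage values passed to `joint_spectrum` stand in for the estimates obtained from the manual inspection of the spectra in step 1.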