Compound formation is very common in the Swedish language, and there are many combinations of words that can be reversed to form another correct word, sometimes with a completely different meaning. For some reason @HerrLantz thinks these are really exciting, so I wanted to help him find more of them 🔍
But I don't like to think for myself so I made this program to do that for me.
$ make # defaults to `make all`
This command runs the following operations:
- Download dictionaries from source
- Combine and sanitize the data into a complete dataset
- Divide the dataset into partitions
- Find reversed compound words
Or run them separately:
$ make dictionary # download the dictionaries from source
$ make prepare # build the combined dataset from source
$ make divide # divide the dataset into partitions for each letter
$ make conquer # find reversed compound words using the partitions
The combined dictionary ends up in `dump/dictionary.txt` and the partitioned version is created in `dump/lexicon/`. The list of possible reversed compound words is written to `dump/compounds.txt`.
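To make the divide and conquer steps a bit more concrete, here is a minimal Python sketch of the idea. It is not the code this repository runs: the function names, the in-memory partitions (instead of files in `dump/lexicon/`), and the `min_part` cutoff are all assumptions made for illustration. It also only handles plain concatenation and ignores linking letters between the parts.

```python
from collections import defaultdict


def partition_by_first_letter(words):
    """Divide: group words into sets keyed by their first letter (assumed layout)."""
    partitions = defaultdict(set)
    for word in words:
        if word:
            partitions[word[0]].add(word)
    return partitions


def contains(partitions, word):
    """Look a word up in the partition for its first letter only."""
    return bool(word) and word in partitions.get(word[0], ())


def reversed_compounds(partitions, min_part=2):
    """Conquer: find pairs (AB, BA) where A, B, AB and BA are all known words.

    min_part is a hypothetical cutoff that skips very short parts.
    """
    found = set()
    for bucket in partitions.values():
        for word in bucket:
            # Try every split point that leaves at least min_part letters on each side.
            for i in range(min_part, len(word) - min_part + 1):
                head, tail = word[:i], word[i:]
                swapped = tail + head
                if (
                    swapped != word
                    and contains(partitions, head)
                    and contains(partitions, tail)
                    and contains(partitions, swapped)
                ):
                    # Store each pair once, regardless of which half was seen first.
                    found.add(tuple(sorted((word, swapped))))
    return sorted(found)
```

With a toy word list such as `["hus", "båt", "husbåt", "båthus", "katt"]`, the only pair found is `husbåt` / `båthus`. Keying the partitions on the first letter means each lookup only touches one small set, which is presumably the point of splitting the dataset per letter.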
All dictionaries are drawn from Nordic Words, where they are in the public domain as of 2018-03-19.
Specifically, the following dictionaries are incorporated:
- Svenska by Projekt Runeberg
- ss100.txt by Lars Aronsson
- ord.niklas.frykholm by Niklas Frykholm
- ord.stava by swnet & Niklas Frykholm
- ord.swnet by swnet
The combined dataset has 306164 entries.
The files stored in the `dictionaries/` subdirectory are encoded as ISO-8859-1, which makes åäö behave a little weird in some contexts. The dictionary generated by `make prepare` is, however, encoded as UTF-8.
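For the curious, the re-encoding can be sketched in a few lines of Python. This is only an illustration under assumptions, not what `make prepare` actually does: the function names are made up, and the "sanitizing" shown here (trimming whitespace, lowercasing, dropping blank lines and duplicates) is a guess at a reasonable cleanup.

```python
from pathlib import Path


def read_latin1_words(path):
    """Read one source dictionary, decoding it explicitly as ISO-8859-1."""
    text = Path(path).read_text(encoding="iso-8859-1")
    # Assumed cleanup: trim whitespace, lowercase, skip empty lines.
    return [line.strip().lower() for line in text.splitlines() if line.strip()]


def combine(paths):
    """Merge several source dictionaries into one sorted, de-duplicated list."""
    words = set()
    for path in paths:
        words.update(read_latin1_words(path))
    return sorted(words)


def write_utf8(words, path):
    """Write the combined word list as UTF-8, one word per line."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")
```

Decoding with the right source encoding up front is what keeps åäö intact; reading an ISO-8859-1 file as if it were UTF-8 is the usual cause of the weird behaviour.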