A basic programme that filters out poor-quality data for Neural Machine Translation systems. It is compatible with languages that use the Roman alphabet (including diacritics such as á).
It has the following filters:
Copy - Checks the source and target sentences to make sure they don't match each other.
Duplicate - Checks to make sure the sentence isn't repeated anywhere else.
Sentence Ratio - Compares the lengths of the corresponding source and target sentences.
Character Ratio - Compares the number of non-alphabetical symbols to the number of alphabetical symbols.
Length - Checks to make sure the sentences aren't too short or too long.
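
As a rough illustration of how the filters above could fit together, here is a minimal Python sketch. It is not this programme's actual code: the function names, the threshold values, the choice to measure length in whitespace-separated words, and the choice to detect duplicates on the whole (source, target) pair are all assumptions made for the example.

```python
from typing import Iterable, Iterator, Set, Tuple

# All thresholds are illustrative assumptions, not the programme's real defaults.
MIN_WORDS = 1           # Length filter: fewest words allowed
MAX_WORDS = 100         # Length filter: most words allowed
MAX_LENGTH_RATIO = 2.0  # Sentence Ratio filter: longer side / shorter side
MAX_SYMBOL_RATIO = 0.5  # Character Ratio filter: non-letters / letters


def is_copy(src: str, tgt: str) -> bool:
    """Copy filter: True when source and target are the same text."""
    return src.strip().lower() == tgt.strip().lower()


def length_ok(sentence: str) -> bool:
    """Length filter: word count must sit inside [MIN_WORDS, MAX_WORDS]."""
    return MIN_WORDS <= len(sentence.split()) <= MAX_WORDS


def sentence_ratio_ok(src: str, tgt: str) -> bool:
    """Sentence Ratio filter: the two word counts must be comparable."""
    a, b = len(src.split()), len(tgt.split())
    if a == 0 or b == 0:
        return False
    return max(a, b) / min(a, b) <= MAX_LENGTH_RATIO


def char_ratio_ok(sentence: str) -> bool:
    """Character Ratio filter: limit non-alphabetical symbols per letter.

    str.isalpha() is Unicode-aware, so letters with diacritics such as
    'á' count as alphabetical. Whitespace is ignored; digits count as
    non-alphabetical.
    """
    letters = sum(ch.isalpha() for ch in sentence)
    symbols = sum(not ch.isalpha() and not ch.isspace() for ch in sentence)
    if letters == 0:
        return False
    return symbols / letters <= MAX_SYMBOL_RATIO


def filter_pairs(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield only the (source, target) pairs that pass every filter."""
    seen: Set[Tuple[str, str]] = set()
    for src, tgt in pairs:
        key = (src.strip(), tgt.strip())
        if key in seen:  # Duplicate filter
            continue
        seen.add(key)
        if is_copy(src, tgt):
            continue
        if not (length_ok(src) and length_ok(tgt)):
            continue
        if not sentence_ratio_ok(src, tgt):
            continue
        if not (char_ratio_ok(src) and char_ratio_ok(tgt)):
            continue
        yield src, tgt
```

For example, passing a list such as [("The cat sleeps.", "Le chat dort."), ("Hello", "Hello")] to filter_pairs would keep the first pair and drop the second, since the second is a copy. Cheap checks run first so that obviously bad pairs are discarded before the character counting.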