Corpus Filter

A basic programme that filters out poor-quality sentence pairs from parallel corpora used to train Neural Machine Translation systems. It is compatible with languages that use the Roman alphabet (including diacritics such as á).

It has the following filters:

Copy - Checks the source and target sentences to make sure they don't match each other.

Duplicate - Checks to make sure the sentence isn't repeated anywhere else.

Sentence Ratio - Compares the lengths of the corresponding sentences.

Character Ratio - Compares the number of non-alphabetical symbols to the number of alphabetical symbols.

Length - Checks to make sure the sentences aren't too short or too long.
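
The five filters above can be sketched roughly as follows. This is an illustrative outline only, not the code in Filter.py: every function name, the thresholds (`min_words`, `max_words`, `max_ratio`), the choice to measure length in whitespace-separated words, and the choice to treat a duplicate as a repeated source/target pair are assumptions made for the example.

```python
# Illustrative sketch only; names and thresholds are assumptions,
# not taken from Filter.py.

def is_copy(src, tgt):
    """Copy filter: True if source and target are the same text."""
    return src.strip().lower() == tgt.strip().lower()


def length_ok(sentence, min_words=1, max_words=100):
    """Length filter: word count must fall inside an assumed range."""
    n = len(sentence.split())
    return min_words <= n <= max_words


def sentence_ratio_ok(src, tgt, max_ratio=2.0):
    """Sentence ratio filter: the longer side may be at most
    max_ratio times the shorter side (measured in words)."""
    a, b = len(src.split()), len(tgt.split())
    if a == 0 or b == 0:
        return False
    return max(a, b) / min(a, b) <= max_ratio


def char_ratio_ok(sentence, max_ratio=0.5):
    """Character ratio filter: non-alphabetical symbols (spaces excluded)
    divided by alphabetical ones. str.isalpha() accepts letters with
    diacritics such as 'á'."""
    alpha = sum(ch.isalpha() for ch in sentence)
    other = sum(not ch.isalpha() and not ch.isspace() for ch in sentence)
    if alpha == 0:
        return False
    return other / alpha <= max_ratio


def filter_corpus(pairs):
    """Apply all five filters to an iterable of (source, target) pairs."""
    seen = set()  # pairs already kept, used by the duplicate filter
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if is_copy(src, tgt):
            continue
        if (src, tgt) in seen:  # duplicate filter
            continue
        if not (length_ok(src) and length_ok(tgt)):
            continue
        if not sentence_ratio_ok(src, tgt):
            continue
        if not (char_ratio_ok(src) and char_ratio_ok(tgt)):
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept
```

Calling `filter_corpus` on a list of `(source, target)` tuples returns only the pairs that pass every check. The thresholds would need tuning for each language pair, since typical length ratios differ between languages.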

JustCunn/CorpusFilter
