Small, easy-to-use programme dedicated to corpus filtering

JustCunn/CorpusFilter

Corpus Filter

A basic programme that filters poor-quality data out of parallel corpora for Neural Machine Translation systems. It is compatible with languages that use the Roman alphabet (including diacritics such as á).

It has the following filters:

Copy - Checks the source and target sentences to make sure they don't match each other.

Duplicate - Checks to make sure the sentence isn't repeated anywhere else.

Sentence Ratio - Compares the lengths of the corresponding source and target sentences.

Character Ratio - Compares the number of non-alphabetical symbols to the number of alphabetical symbols.

Length - Checks to make sure the sentences aren't too short or too long.
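
The five filters above could be sketched roughly as follows. This is a minimal illustration and not the programme's actual code: the function names, the thresholds (`MAX_LENGTH_RATIO`, `MAX_SYMBOL_RATIO`, `MIN_WORDS`, `MAX_WORDS`), the choice to measure length in words, and the choice to treat a repeated source/target pair as a duplicate are all assumptions made for the example.

```python
# Hypothetical thresholds; the README does not state the real values.
MAX_LENGTH_RATIO = 2.0   # longest side may be at most 2x the shortest (in words)
MAX_SYMBOL_RATIO = 0.5   # non-alphabetical symbols per alphabetical symbol
MIN_WORDS, MAX_WORDS = 1, 100


def is_copy(src: str, tgt: str) -> bool:
    """Copy filter: source and target are the same text."""
    return src.strip().lower() == tgt.strip().lower()


def bad_sentence_ratio(src: str, tgt: str) -> bool:
    """Sentence ratio filter: one side is much longer than the other."""
    a, b = len(src.split()), len(tgt.split())
    if min(a, b) == 0:
        return True
    return max(a, b) / min(a, b) > MAX_LENGTH_RATIO


def bad_character_ratio(sentence: str) -> bool:
    """Character ratio filter: too many non-alphabetical symbols.

    str.isalpha() is True for letters with diacritics such as 'á'.
    """
    chars = [c for c in sentence if not c.isspace()]
    letters = sum(c.isalpha() for c in chars)
    if letters == 0:
        return True
    return (len(chars) - letters) / letters > MAX_SYMBOL_RATIO


def bad_length(sentence: str) -> bool:
    """Length filter: sentence is too short or too long (in words)."""
    n = len(sentence.split())
    return n < MIN_WORDS or n > MAX_WORDS


def filter_corpus(pairs):
    """Return the (source, target) pairs that pass all five filters."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        # Duplicate filter: skip a pair that has already been seen.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        if is_copy(src, tgt) or bad_sentence_ratio(src, tgt):
            continue
        if any(bad_character_ratio(s) or bad_length(s) for s in (src, tgt)):
            continue
        kept.append((src, tgt))
    return kept
```

For example, calling `filter_corpus` on a list of `(source, target)` tuples returns only the pairs that survive every check, in their original order.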
