Corpora exclusion rules #940

Open
ZJaume opened this issue Nov 25, 2024 · 0 comments
Labels
language-coverage Issues related to covering specific languages

ZJaume commented Nov 25, 2024

I think the skip_datasets list in the config could be improved.

  • The Multi* versions should not be excluded; they are still valuable, especially if the language pair is not English-centric.
  • If SPC is no longer failing, it should be removed from the list. For the two or three language pairs I have looked at in this corpus, it was quite clean. It might be a good resource.
  • There should definitely be a separate skip_datasets per language pair. For example, Ubuntu and PHP are full of garbage for Chinese but are good for other language pairs.
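
To make the last point concrete, here is a rough sketch of how per-pair exclusions could be layered on top of a global list. Everything in it is an assumption for illustration: the names `GLOBAL_SKIP`, `PAIR_SKIP`, `skip_patterns` and `is_skipped` are hypothetical, the dataset identifiers and glob patterns are made up, and the pair ("en", "zh") is only one example of a pair involving Chinese. It is not how the config currently works.

```python
from fnmatch import fnmatchcase

# Hypothetical global exclusions that apply to every language pair.
GLOBAL_SKIP = ["opus_ExampleNoisy/*"]

# Hypothetical per-pair exclusions. Keys are unordered language pairs,
# so ("en", "zh") and ("zh", "en") share the same rules.
PAIR_SKIP = {
    frozenset(("en", "zh")): ["opus_Ubuntu/*", "opus_PHP/*"],
}


def skip_patterns(src: str, trg: str) -> list[str]:
    """Return the global patterns plus any registered for this pair."""
    return GLOBAL_SKIP + PAIR_SKIP.get(frozenset((src, trg)), [])


def is_skipped(dataset: str, src: str, trg: str) -> bool:
    """True if the dataset name matches any applicable glob pattern."""
    return any(fnmatchcase(dataset, p) for p in skip_patterns(src, trg))


if __name__ == "__main__":
    for pair in [("en", "zh"), ("en", "es")]:
        print(pair, is_skipped("opus_Ubuntu/v14.10", *pair))
```

With rules like these, an Ubuntu corpus would be dropped for a Chinese pair but kept for the others, while the global list stays short.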
@eu9ene eu9ene added the language-coverage Issues related to covering specific languages label Nov 25, 2024