Create an `analyze-datasets` step in the pipeline #924
Labels: quality (Improving robustness and translation quality)
When training languages at scale, it's difficult to analyze datasets while creating the configs, since the person managing the training most likely won't speak the language. However, many datasets have problems that are obvious when looking at the data. OpusCleaner provides a UI tool to look at the datasets, but it only shows the `head` of the file. For NLLB, for instance, this is not representative because the corpus is sorted alphabetically: the first sentences are usually quite bad, as they are mostly punctuation, and they don't reflect the larger set of data.

The work here would be to create a single merged task that generates this analysis, similar to the `merge-corpus` task. The config generator could even suggest running new languages up to the `analyze-datasets` point, and then excluding any datasets that have issues. We can always write new cleaners if we feel like it, but that is labor intensive, and it's hard to scale the team's resources to train many languages.

We already have analysis tasks that count the length of words in a sentence, but I haven't found those useful. I would prefer a merged step with everything in one place so it is easy to trigger and look at manually.
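As a minimal sketch of the sampling problem described above: since alphabetically sorted corpora like NLLB make `head` unrepresentative, the analysis task could draw a uniform random sample instead. Reservoir sampling does this in one pass without loading the corpus into memory. The function name `sample_lines` and its signature are hypothetical, not part of OpusCleaner or the existing pipeline:

```python
import random

def sample_lines(path, k=100, seed=0):
    """Uniformly sample k lines from a (possibly huge) corpus file.

    Uses reservoir sampling: a single pass, O(k) memory, so it works
    even when the corpus is far too large to fit in RAM. A fixed seed
    keeps the sample reproducible across pipeline runs.
    """
    rng = random.Random(seed)
    sample = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < k:
                # Fill the reservoir with the first k lines.
                sample.append(line.rstrip("\n"))
            else:
                # Replace an existing entry with probability k / (i + 1),
                # which keeps every line equally likely to be selected.
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = line.rstrip("\n")
    return sample
```

The analysis task could then print such a sample (plus any aggregate statistics) into one artifact per dataset, so a non-speaker can skim it before deciding which datasets to exclude.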