
Create an analyze-datasets step in the pipeline #924

Open
gregtatum opened this issue Nov 6, 2024 · 3 comments
Labels
quality Improving robustness and translation quality

Comments

@gregtatum
Member

When training languages at scale, it's difficult to analyze datasets while creating the configs, since the person managing the training most likely won't speak the language. However, many datasets have problems that are obvious just from looking at the data. OpusCleaner provides a UI tool for inspecting datasets, but it only shows the head of the file. For instance, this is not representative for NLLB, as the NLLB dataset is sorted alphabetically: the first sentences are mostly punctuation and don't reflect the larger body of data.
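For comparison, a uniform random sample avoids that head-of-file bias entirely. A minimal sketch of reservoir sampling (the function name and `sample_size` default are hypothetical, not anything from our pipeline):

```python
import random

def reservoir_sample(path, sample_size=1000, seed=38):
    """Draw a uniform random sample of lines from a large corpus file.

    Unlike reading the head, every line has an equal chance of being
    selected, so alphabetically sorted datasets like NLLB are sampled
    representatively.
    """
    rng = random.Random(seed)
    sample = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < sample_size:
                sample.append(line)
            else:
                # Replace an existing element with decreasing probability
                # so the reservoir stays a uniform sample of lines 0..i.
                j = rng.randint(0, i)
                if j < sample_size:
                    sample[j] = line
    return sample
```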

The work here would be to create a single merged task that generates this analysis, similar to the merge-corpus task. The config generator could even suggest running new languages up to the analyze-datasets point, and then excluding any datasets that have issues. We can always write new cleaners if we feel like it, but that is labor intensive, and it's hard to scale the team's resources to train many languages.

We already have analysis tasks that count the lengths of words in sentences. However, I haven't found those useful, and would prefer a merged step with everything in one place so it is easy to trigger and inspect manually. A rough sketch of what that could look like is below.
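To make that concrete, here is a sketch of a merged analyze-datasets step: sample each configured dataset, compute a few simple statistics, and write one report artifact for manual review. Everything here is hypothetical (the function name, report layout, and choice of stats), and it reuses the `reservoir_sample` sketch above:

```python
import json
from pathlib import Path
from statistics import mean

def analyze_datasets(dataset_paths, out_path, sample_size=1000):
    """Write a single merged analysis report across all datasets,
    analogous to how merge-corpus produces one merged artifact."""
    report = {}
    for path in dataset_paths:
        # reservoir_sample is the sketch from the issue description above.
        lines = [line.rstrip("\n") for line in reservoir_sample(path, sample_size)]
        lengths = [len(line.split()) for line in lines]
        report[Path(path).name] = {
            # A human-readable random sample for eyeballing the data.
            "sample": lines[:50],
            "avg_tokens": round(mean(lengths), 2) if lengths else 0,
            "max_tokens": max(lengths, default=0),
            "empty_lines": sum(1 for line in lines if not line.strip()),
        }
    Path(out_path).write_text(
        json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```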

@gregtatum gregtatum added the quality Improving robustness and translation quality label Nov 6, 2024
@ZJaume
Collaborator

ZJaume commented Nov 7, 2024

I think we mixed things up a little bit yesterday. The samples on the Opus website are heads, but OpusCleaner randomly samples all of the datasets for its viewer.

@gregtatum
Member Author

Ah, well that's good at least. There's still a mismatch between the OpusCleaner workflow and our data ingestion pipeline. Basically, we don't really work locally, so viewing and downloading files locally is quite a bit harder, especially as we are automating the process as much as possible. I believe OpusCleaner runs its own local web server. I ran into a few issues where larger datasets failed to download and the failure was never reported, so it was slow to go from our config to manually downloading things in OpusCleaner. Regardless, it seems like we need this capability one way or another.

@ZJaume
Collaborator

ZJaume commented Nov 11, 2024

I think there was a way of running OpusCleaner on a server and using the UI locally via SSH port forwarding.
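Something along these lines should work (the port and host are placeholders; I don't remember which port OpusCleaner serves on, so adjust to match):

```sh
# Forward local port 8000 to the OpusCleaner server on the remote machine,
# then browse to http://localhost:8000 locally.
ssh -N -L 8000:localhost:8000 user@remote-host
```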

@gregtatum gregtatum changed the title Create an analayze-datasets step in the pipeline Create an analyze-datasets step in the pipeline Nov 12, 2024