
Create an analyze-datasets step in the pipeline #924

Open
gregtatum opened this issue Nov 6, 2024 · 3 comments
Labels
quality Improving robustness and translation quality

Comments

@gregtatum
Member

When training languages at scale, it's difficult to analyze datasets while creating the configs, since the person managing the training most likely won't speak the language. However, many datasets have problems that are obvious just from looking at the data. OpusCleaner provides a UI tool for inspecting datasets, but it only shows the head of the file. For instance, this is not representative for NLLB, as the NLLB dataset is sorted alphabetically: the first sentences are mostly punctuation and don't reflect the larger body of data.
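For comparison, a uniform random sample avoids that head-of-file bias entirely. A minimal sketch of reservoir sampling (the function name and `sample_size` default are hypothetical, not anything from our pipeline):

```python
import random

def reservoir_sample(path, sample_size=1000, seed=38):
    """Draw a uniform random sample of lines from a large corpus file.

    Unlike reading the head, every line has an equal chance of being
    selected, so alphabetically sorted datasets like NLLB are sampled
    representatively.
    """
    rng = random.Random(seed)
    sample = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < sample_size:
                sample.append(line)
            else:
                # Replace an existing element with decreasing probability
                # so the reservoir stays a uniform sample of lines 0..i.
                j = rng.randint(0, i)
                if j < sample_size:
                    sample[j] = line
    return sample
```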

The work here would be to create a single merged task that generates this analysis, similar to the merge-corpus task. The config generator could even suggest running new languages up to the analyze-datasets point, and then excluding any datasets that have issues. We can always write new cleaners if we feel like it, but that is labor intensive, and it's hard to scale the team's resources to train many languages.

We already have analysis tasks that count the lengths of words in sentences. However, I haven't found those useful, and would prefer a merged step with everything in one place so it is easy to trigger and inspect manually. A rough sketch of what that could look like is below.
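To make that concrete, here is a sketch of a merged analyze-datasets step: sample each configured dataset, compute a few simple statistics, and write one report artifact for manual review. Everything here is hypothetical (the function name, report layout, and choice of stats), and it reuses the `reservoir_sample` sketch above:

```python
import json
from pathlib import Path
from statistics import mean

def analyze_datasets(dataset_paths, out_path, sample_size=1000):
    """Write a single merged analysis report across all datasets,
    analogous to how merge-corpus produces one merged artifact."""
    report = {}
    for path in dataset_paths:
        # reservoir_sample is the sketch from the issue description above.
        lines = [line.rstrip("\n") for line in reservoir_sample(path, sample_size)]
        lengths = [len(line.split()) for line in lines]
        report[Path(path).name] = {
            # A human-readable random sample for eyeballing the data.
            "sample": lines[:50],
            "avg_tokens": round(mean(lengths), 2) if lengths else 0,
            "max_tokens": max(lengths, default=0),
            "empty_lines": sum(1 for line in lines if not line.strip()),
        }
    Path(out_path).write_text(
        json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```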

@gregtatum gregtatum added the quality Improving robustness and translation quality label Nov 6, 2024
@ZJaume
Collaborator

ZJaume commented Nov 7, 2024

I think we mixed things up a little bit yesterday. The samples on the Opus website are heads, but OpusCleaner randomly samples all of the datasets for its viewer.

@gregtatum
Member Author

Ah, well that's good at least. There's still a mismatch between the OpusCleaner workflow and our data ingestion pipeline. Basically, we don't really work locally, so viewing and downloading files locally is quite a bit harder, especially as we are automating the process as much as possible. I believe OpusCleaner runs its own local web server. I ran into a few issues where larger datasets failed to download and the failure was never reported, so it was slow to go from our config to manually downloading things in OpusCleaner. Regardless, it seems like we need this capability one way or another.

@ZJaume
Collaborator

ZJaume commented Nov 11, 2024

I think there was a way of running OpusCleaner on a server and using the UI locally via SSH port forwarding.
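Something along these lines should work (the port and host are placeholders; I don't remember which port OpusCleaner serves on, so adjust to match):

```sh
# Forward local port 8000 to the OpusCleaner server on the remote machine,
# then browse to http://localhost:8000 locally.
ssh -N -L 8000:localhost:8000 user@remote-host
```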

@gregtatum gregtatum changed the title Create an analayze-datasets step in the pipeline Create an analyze-datasets step in the pipeline Nov 12, 2024