Reduce monolingual data for en-lt to investigate distillation performance #915

Open
gregtatum opened this issue Oct 31, 2024 · 1 comment
@gregtatum
Member

In #771 I tested the effects of reducing the distillation data to understand that expensive part of our pipeline. However, we should repeat the experiment for the base student model, as the previous one was done with a tiny model, to see if there is a difference. I also want to test it on a morphologically more complex language like Lithuanian.

@gregtatum gregtatum added the experiment A training experiment with hypothesis and results label Oct 31, 2024
@gregtatum gregtatum self-assigned this Oct 31, 2024
@gregtatum
Member Author

gregtatum commented Nov 6, 2024

In #771 @ZJaume commented:

These results seem very interesting to me. I believe the fact that NLLB and Paracrawl are full of redundant and repetitive data has something to do with this. If there is interest in finding a better way to sample, I think n-gram saturation (ranking lower the sentences that have a significant portion of their 2-grams or 3-grams already present in the corpus) could be something worth exploring.

After some light searching I found a paper with an approach we could use if we wanted to go this route, and which seems like a reasonable balance of cost vs. quality: STACC, OOV Density and N-gram Saturation: Vicomtech’s Participation in the WMT 2018 Shared Task on Parallel Corpus Filtering
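
To make the idea concrete, here is a minimal sketch of what n-gram saturation scoring could look like. The function names and the exact scoring are hypothetical, not taken from the paper or our pipeline; the actual Vicomtech system combines saturation with STACC alignment similarity and OOV density.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield the n-grams of a token list as tuples."""
    return zip(*(tokens[i:] for i in range(n)))


def saturation_scores(sentences, n=3):
    """Score each sentence by the fraction of its n-grams already seen
    earlier in the corpus. Higher scores mean more redundant sentences,
    which could be ranked lower (or dropped) when sampling.

    This is only a rough sketch of the n-gram saturation idea discussed
    above, not the full WMT18 filtering system.
    """
    seen = Counter()
    scores = []
    for sentence in sentences:
        tokens = sentence.split()
        grams = list(ngrams(tokens, n))
        if not grams:
            scores.append(0.0)
            continue
        redundant = sum(1 for gram in grams if seen[gram] > 0)
        scores.append(redundant / len(grams))
        seen.update(grams)
    return scores


if __name__ == "__main__":
    corpus = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the fence",
        "a completely different sentence appears here",
    ]
    # The second sentence shares most of its 3-grams with the first,
    # so it receives a higher saturation score.
    print(saturation_scores(corpus, n=3))
```

One design question with this kind of pass is that the scores depend on corpus order, so the input would probably need to be shuffled or sorted by some quality signal first so that the "kept" sentences are not just whichever ones happened to come earliest.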
