
Data priority, incremental training? #22

Open · JanCizmar opened this issue Oct 29, 2022 · 2 comments
Labels: enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@JanCizmar

Hi there!

  1. I would like to use the data currently provided in data-index.json, but at the same time I would like to use my own custom data. Can I tell the script to treat my custom data as more relevant / higher priority when generating a model?

  2. Let's say I have one large dataset I use all the time, plus multiple smaller datasets, and I want to train a different model for each of the smaller ones. Is something like an incremental build possible, so I could reuse some previous output and just "append" my custom data to save training time and resources?

Thanks!

@PJ-Finlay Collaborator

There's no direct support for this, but you can accomplish it by modifying `argostrain/train.py`.

I would add `input("Downloaded Argos Data")` after the data has been downloaded here (the `input()` call pauses the script so you can edit the files before training continues), and then append your custom data to `run/source` and `run/target`.
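A minimal sketch of that append step, assuming `run/source` and `run/target` are plain line-aligned text files as described above; the helper name and the `copies` oversampling parameter (one simple way to give custom data higher priority, addressing question 1) are my own additions, not part of argos-train:

```python
from pathlib import Path

def append_custom_data(custom_source, custom_target, run_dir="run", copies=1):
    """Append line-aligned sentence pairs to run/source and run/target.

    copies > 1 oversamples the custom data so it carries more weight
    relative to the downloaded corpora.
    """
    src_lines = Path(custom_source).read_text(encoding="utf-8").splitlines()
    tgt_lines = Path(custom_target).read_text(encoding="utf-8").splitlines()
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files must be line-aligned")
    run = Path(run_dir)
    # Open in append mode so the downloaded data is kept and ours is added.
    with open(run / "source", "a", encoding="utf-8") as src_out, \
         open(run / "target", "a", encoding="utf-8") as tgt_out:
        for _ in range(copies):
            src_out.write("\n".join(src_lines) + "\n")
            tgt_out.write("\n".join(tgt_lines) + "\n")
```

You would run this (or do the same by hand) while the script is paused at the `input()` call, then press Enter to continue training.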

You could also train one base model and then fine-tune it on your custom data. However, this also requires custom code.
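As a hypothetical sketch of what that custom fine-tuning code might look like: Argos Translate models are trained with OpenNMT-py, whose `-train_from` option resumes training from a saved checkpoint. The config file name, checkpoint path, and step counts below are assumptions for illustration:

```python
def finetune_command(config_path, checkpoint_path, current_step, extra_steps):
    """Build an onmt_train invocation that continues from a checkpoint.

    OpenNMT-py's -train_from option loads the saved model weights, so
    training resumes from them instead of starting from scratch; pointing
    the config's data at the custom corpus fine-tunes on it.
    """
    return [
        "onmt_train",
        "-config", config_path,
        "-train_from", checkpoint_path,
        # train_steps is a total, so extend it past the checkpoint's step.
        "-train_steps", str(current_step + extra_steps),
    ]

# e.g. subprocess.run(finetune_command("config.yml",
#                                      "run/model_step_50000.pt", 50000, 5000))
```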

I want to improve support for using custom data and fine-tuning, so suggestions or pull requests are appreciated.

@PJ-Finlay PJ-Finlay added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers labels Nov 4, 2022
@martin-leoorg

Would incremental training also be possible using the suggestions from LibreTranslate? I think the available base models are already quite good, but incorporating the feedback from LibreTranslate might make corner cases even better. This might depend on the actual use case (e.g. a medical use case might need different fine-tuning than a scuba-diving one, to pick random examples).

It would be great to be able to quickly improve the base model without having to retrain the complete model on a high-powered machine with 99.9% of the same input data!
