
update train_sts_seed_optimization with SentenceTransformerTrainer #3092

Merged
4 commits merged on Dec 2, 2024

Conversation

JINO-ROHIT
Contributor

This PR updates the example script train_sts_seed_optimization.py with SentenceTransformerTrainer.

I also noticed the documentation was quite outdated when I was referring to it for some args; should we look into updating that too?

@tomaarsen

Comment on lines 96 to 97
  # Configure the training. We skip evaluation in this example
- warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1)  # 10% of train data for warm-up
+ warmup_steps = math.ceil(len(train_dataset) * num_epochs * 0.1)  # 10% of train data for warm-up
Collaborator

The SentenceTransformerTrainingArguments has a warmup_ratio=0.1 that we can use instead.
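For reference, a minimal sketch of what that could look like (illustrative only; the output_dir is just a placeholder):

```python
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="output/sts_seed_optimization",  # placeholder path
    num_train_epochs=1,
    # 10% of the total training steps are used for warm-up,
    # so there is no need to compute warmup_steps by hand.
    warmup_ratio=0.1,
)
```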


# Stopping and Evaluating after 30% of training data (less than 1 epoch)
# We find from (Dodge et al.) that 20-30% is often ideal for convergence of random seed
- steps_per_epoch = math.ceil(len(train_dataloader) * stop_after)
+ steps_per_epoch = math.ceil(len(train_dataset) * stop_after)
Collaborator

I don't think this is used right now.

- steps_per_epoch=steps_per_epoch,
- evaluation_steps=1000,
+ # 5. Define the training arguments
+ args = SentenceTransformerTrainingArguments(
Collaborator

I think stop_after isn't actually making training stop after that many steps.
Normally you could use max_steps, but I think that interferes with the scheduler: ideally we want the scheduler to behave "normally" but still stop training after stop_after steps. I'm not sure whether that was the old behaviour either.

@tomaarsen
Collaborator

I'm also curious what you mean by the outdated docs - I'd like that to be fixed as well if possible.

@JINO-ROHIT
Contributor Author

Ahh yeah, I'll change those.

Sorry about the docs, I was accidentally referring to the old fit method here - https://www.sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#:~:text=steps_per_epoch%20%E2%80%93%20Number%20of%20training%20steps,the%20DataLoader%20size%20from%20train_objectives.&text=warmup_steps%20%E2%80%93%20Behavior%20depends%20on%20the,to%20the%20maximal%20learning%20rate. - and saw args like steps_per_epoch and warmup_steps that aren't there in the Trainer.

@JINO-ROHIT
Contributor Author

Also, I don't quite understand the stop_after bit; is a custom callback expected?

@tomaarsen
Collaborator

> Also, I don't quite understand the stop_after bit; is a custom callback expected?

Makes sense, this is a little confusing. I think the idea is that we create 1 epoch of e.g. 100k steps. The seed (e.g. for data sampling) has been shown to be fairly important for training embedding models, so we want to train e.g. the first 30k steps out of the 100k and then see where we're at. Then we can pick the seed that performed the best after just a bit of training.

But if we use max_steps=0.3 * total_steps, then our scheduler will also recognize that we're only doing 30k steps, and update accordingly.

Instead, we want the scheduler to think that we're doing 100k steps, but indeed we want the training to stop after 30k (or stop_after * total_steps). I think a custom callback is indeed a good solution, I'll whip something up and add it to this PR.
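For reference, one possible shape for such a callback (a rough sketch, not necessarily what ended up in the PR; the StopAfterCallback name and the 0.3 default are illustrative):

```python
from transformers import TrainerCallback


class StopAfterCallback(TrainerCallback):
    """Stops training after a fraction of the total steps, while the
    scheduler remains configured for the full number of steps."""

    def __init__(self, stop_after: float = 0.3):
        self.stop_after = stop_after

    def on_step_end(self, args, state, control, **kwargs):
        # state.max_steps reflects the full schedule, so warm-up and decay
        # behave as if training were going to run to completion.
        if state.global_step >= self.stop_after * state.max_steps:
            control.should_training_stop = True
        return control
```

The callback could then be passed to the trainer via its callbacks argument, e.g. SentenceTransformerTrainer(..., callbacks=[StopAfterCallback(0.3)]).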


@tomaarsen
Collaborator

This was my final log with this script at the default parameters:

Current sts-dev_spearman_cosine Scores per Seed: {7: 0.8510304677377857,
 9: 0.8487004831766769,
 5: 0.8486879350254634,
 1: 0.8446502257325709,
 2: 0.8422240179194527,
 4: 0.8402418940176778,
 6: 0.8381279059862979,
 0: 0.8367594493128546,
 3: 0.831016545559169,
 8: 0.8271094833035064}

So there's indeed a pretty big difference: 0.827 vs. 0.851.


@JINO-ROHIT
Contributor Author

Ah, thanks for explaining, this makes sense.
Whoops, there still seems to be a difference with your custom callback? Is that possibly because of the randomness of the seed itself?

@tomaarsen
Collaborator

A difference? Between the evaluation scores, you mean?
Indeed, those are because of the different seeds (which impact a lot of the training process, from data sampling to dropout, and also weight initialization if you're training fully from scratch).


@JINO-ROHIT
Contributor Author

ahh okay okay

@tomaarsen merged commit 39b6eae into UKPLab:master on Dec 2, 2024
9 checks passed
@tomaarsen
Collaborator

Thanks for tackling this!

