
Allow for using other Learning Rate Schedulers and Optimizers #76

Open
PonteIneptique opened this issue Dec 6, 2020 · 5 comments

@PonteIneptique
Contributor

PonteIneptique commented Dec 6, 2020

Hey !
I started reading about some other optimizers, as a few things came up in my news feed (stuff like this or that).

I ended up trying to implement them in pie, but first wanted to see what the results would be. The tests were done as follows: same training set (~500k words), same learning rate, same test set (~63k tokens), CUDA, 10 runs per configuration. No hyperparameter optimization was done.

For optimizers, Ranger and Adam were tested; I did not try anything else.
For learning rate schedulers, ReduceLROnPlateau, CosineAnnealing, and Delayed(CosineAnnealing) were tested.
Patience overall is 15 steps without improvement. CosineAnnealing T0 is 40, Delay is 10.

Basically, Ranger does not outperform Adam (maybe with other parameters, who knows, as its betas differ from Adam's), but Delay(CosineAnnealing) reaches the same results in 40% less time.
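
A minimal sketch of what such a delayed cosine schedule could look like with stock PyTorch (the model, learning rate, and epoch counts below are placeholders, not pie's actual implementation):

```python
from torch import nn, optim
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(10, 2)                      # placeholder model
optimizer = optim.Adam(model.parameters(), lr=1e-3)

delay = 10                                    # epochs at a flat LR before annealing
cosine = CosineAnnealingLR(optimizer, T_max=40)

for epoch in range(60):
    # ... one epoch of training and evaluation would go here ...
    if epoch >= delay:
        # Before the delay elapses the optimizer keeps its initial LR;
        # afterwards the cosine schedule takes over.
        cosine.step()
```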

If you are okay with it, a PR will be under way.

Results:

[Six result plots comparing the optimizer and scheduler configurations]

@emanjavacas
Owner

We could include an option to select the lr scheduler. That's easy since it's just swapping the pytorch lr scheduler and adapting the step call. If you have the code around feel free to push a PR and we can see how to include it!
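
If it helps, a rough sketch of what the swap could look like (hypothetical factory and step helper, not pie's actual code):

```python
from torch.optim.lr_scheduler import ReduceLROnPlateau, CosineAnnealingLR

def make_scheduler(optimizer, name, **kwargs):
    # Hypothetical factory keyed on a settings value; pie's actual
    # configuration handling may look different.
    if name == "ReduceLROnPlateau":
        return ReduceLROnPlateau(optimizer, mode="max", **kwargs)
    if name == "CosineAnnealing":
        return CosineAnnealingLR(optimizer, **kwargs)
    raise ValueError("unknown lr scheduler: {}".format(name))

def scheduler_step(scheduler, dev_score):
    # ReduceLROnPlateau expects the monitored metric; the other
    # schedulers are stepped without arguments.
    if isinstance(scheduler, ReduceLROnPlateau):
        scheduler.step(dev_score)
    else:
        scheduler.step()
```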

@PonteIneptique
Contributor Author

PonteIneptique commented Dec 9, 2020

So, a small update with my old branch, regarding Flat(Cosine)(Delay=10, CosineTmax=40, patience=11): I can definitely recommend it. On a corpus of 1.5M tokens (3 times the previous one), it is not only faster, it also scores higher with less deviation:

[Five result plots for the 1.5M-token corpus]

@PonteIneptique
Contributor Author

Hey @emanjavacas :)
I was very bugged by the Ranger results in the first batch of experiments, because I remembered running small trainings and getting better results than with Adam. Then I recalled reading that Ranger wants a higher starting learning rate, and that I did use a higher one in my preliminary tests.
So I did the same with the LASLA corpus, and with a LR 10x higher than my Adam one I scored better results (note that my Adam LR is fine-tuned, after close to 100 runs to find the best hyperparameters):

[Two result plots on the LASLA corpus]
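
For reference, a minimal sketch of that kind of setup (the model, the base LR, and the Ranger import path are assumptions; the class name and import depend on which Ranger implementation is installed, e.g. the lessw2020 reference one):

```python
from torch import nn, optim
# Assumes a Ranger implementation exposing a torch-style optimizer class;
# the import path below may differ depending on the package used.
from ranger import Ranger

model = nn.Linear(10, 2)        # placeholder model, not the pie tagger

adam_lr = 1e-3                  # hypothetical tuned Adam learning rate
adam = optim.Adam(model.parameters(), lr=adam_lr)

# Ranger seems to want a noticeably higher starting LR than Adam;
# here simply 10x the Adam value, as in the runs above.
ranger = Ranger(model.parameters(), lr=10 * adam_lr)
```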

@PonteIneptique
Contributor Author

I also found out I have been using CosineAnnealing the wrong way, but it still performs better than Adam: instead of using T_max as the length of the cycle over which the LR follows a cosine curve, I have been using it as a single downward slope (the LR curve below is badly offset; it should be shifted 10 epochs to the right):
[Plot of the learning rate schedule]
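
To make the T_max semantics concrete, a small stand-alone sketch (placeholder model and LR) that just prints the schedule:

```python
from torch import nn, optim
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(10, 2)                       # placeholder model
optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=40, eta_min=0.0)

# T_max is the half-period of the cosine: the LR goes from the base LR down
# to eta_min over T_max steps, then climbs back up if stepping continues
# (CosineAnnealingLR implements SGDR's annealing without the restarts).
for epoch in range(80):
    print(epoch, scheduler.get_last_lr())
    scheduler.step()
```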

@PonteIneptique
Contributor Author

PonteIneptique commented Mar 1, 2021

Coming back with new experiments regarding Ranger vs Adam.

I have been playing with single-task models (which indeed improve when fine-tuned correctly), and Ranger clearly yields more stable results:

[Results figure for the single-task experiments]

The second-to-last and the second are the same config; only the optimizer changes (without fine-tuning the optimizer hyperparameters).
