
Questions about CorrDiff example #706

Open
MyGitHub-G opened this issue Nov 6, 2024 · 2 comments

@MyGitHub-G

My machine got an "out of memory" error.

[Screenshot of the out-of-memory error]

I have changed "total_batch_size" to 2, used "amp-fp16", and used ten days of data ('2018-01-01' to '2018-01-10').

Are there any other settings I can modify to reduce the model's memory usage?
What is the meaning of "training_duration: 200000000"?

@MyGitHub-G
Author

I have solved the problem: my machine only supports a "batch_size_per_gpu" of 1.
However, what is the meaning of "training_duration: 200000000"? What does it primarily affect?

@mnabian
Collaborator

mnabian commented Nov 8, 2024

@MyGitHub-G training_duration is the total number of (repeated) samples/images the model sees during training. If you divide it by the number of unique samples in the dataset, you get the number of epochs.
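
For illustration, a minimal sketch of that calculation (the unique-sample count below is a hypothetical placeholder, not a value from the CorrDiff example):

```python
# Epochs implied by training_duration, per the explanation above:
# training_duration counts (repeated) samples seen during training,
# so epochs = training_duration / number of unique samples in the dataset.
training_duration = 200_000_000   # value from the config discussed in this issue
num_unique_samples = 3_650        # hypothetical dataset size (placeholder)

epochs = training_duration / num_unique_samples
print(f"Approximate number of epochs: {epochs:,.0f}")
```

In other words, a larger training_duration mainly means the model is trained for more steps over the same dataset, i.e. more epochs.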

@mnabian mnabian self-assigned this Nov 8, 2024