Question about NanLossDuringTrainingError with AMP #10

Open
Jimojimojimo opened this issue Jun 21, 2021 · 1 comment

Comments

@Jimojimojimo

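Hello, and thank you very much for making this useful material public.

When I use automatic mixed precision (AMP), training fails with a NanLossDuringTrainingError. Is there anything I need to change separately, such as the learning rate or other settings?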

[1,5]<stderr>:ERROR:tensorflow:Error recorded from training_loop: NaN loss during training.
[1,5]<stderr>:INFO:tensorflow:training_loop marked as finished
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
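For context, `NanLossDuringTrainingError` is defined in TensorFlow's `basic_session_run_hooks` module and is raised by `NanTensorHook` when `fail_on_nan_loss=True`: after each training step the hook checks whether the loss is NaN and, if so, stops the loop with this error. The pure-Python sketch below only mimics that check to show when the error fires; `check_loss` and the stand-in exception class are hypothetical names for illustration, not TensorFlow code.

```python
import math


class NanLossDuringTrainingError(RuntimeError):
    """Stand-in for TensorFlow's exception of the same name (illustration only)."""

    def __str__(self):
        return "NaN loss during training."


def check_loss(step, loss, fail_on_nan_loss=True):
    """Mimic the per-step NaN check; return True if training may continue."""
    if math.isnan(loss):
        if fail_on_nan_loss:
            # Abort the training loop, as in the log above.
            raise NanLossDuringTrainingError()
        # Otherwise just report the divergence and signal a stop.
        print(f"step {step}: model diverged with loss = NaN")
        return False
    return True


losses = [2.31, 1.87, float("nan"), 1.5]  # made-up loss values
try:
    for step, loss in enumerate(losses):
        check_loss(step, loss)
except NanLossDuringTrainingError as err:
    print(f"stopped at step {step}: {err}")  # stopped at step 2: NaN loss during training.
```

The check only reports that the loss became NaN; it does not say why, so the cause still has to be found elsewhere (for example, in the checkpoint state being restored, as suggested in the reply).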
@krevas
Contributor

krevas commented Jun 21, 2021

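Hello,
Judging from the error message you posted, it looks like a previous checkpoint of the model is conflicting with the current training session.
When you run run_pretraining.py with the same --model_name, checkpoints are saved to the same location, which causes the conflict.
I recommend trying again with a different --model_name.

Thank you.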
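The advice above amounts to "don't reuse a directory that already holds checkpoints from an earlier run." A minimal sketch of that idea follows, assuming (hypothetically) that run_pretraining.py writes checkpoints to `<output_root>/<model_name>`; the real path layout is not stated in this thread. It relies on the fact that TensorFlow records saved checkpoints in a state file named `checkpoint` inside the checkpoint directory. `checkpoint_dir`, `has_checkpoint`, and `fresh_model_name` are hypothetical helpers, not part of the project.

```python
import os
import tempfile


def checkpoint_dir(output_root, model_name):
    # Assumption: a run's checkpoints live under <output_root>/<model_name>.
    return os.path.join(output_root, model_name)


def has_checkpoint(directory):
    # TensorFlow tracks saved checkpoints in a state file named "checkpoint".
    return os.path.isfile(os.path.join(directory, "checkpoint"))


def fresh_model_name(output_root, model_name):
    """Return model_name, or model_name_2, _3, ... if its directory already has a checkpoint."""
    candidate, n = model_name, 1
    while has_checkpoint(checkpoint_dir(output_root, candidate)):
        n += 1
        candidate = f"{model_name}_{n}"
    return candidate


with tempfile.TemporaryDirectory() as root:
    # Simulate a leftover checkpoint from an earlier run named "my_bert".
    old = checkpoint_dir(root, "my_bert")
    os.makedirs(old)
    with open(os.path.join(old, "checkpoint"), "w") as f:
        f.write('model_checkpoint_path: "model.ckpt-1000"\n')
    print(fresh_model_name(root, "my_bert"))  # my_bert_2
```

The returned name would then be passed as --model_name, so the new run starts from an empty directory instead of picking up the old checkpoint.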
