Hello, I'm working in a conda environment and trying to reproduce the training results.
I installed the necessary packages, so the code runs. TensorFlow also detects my GPU (NVIDIA GeForce RTX 3090).
However, it takes a very long time for training to start, and I keep getting nan values for both loss and val_loss.
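For reference, this is roughly how I confirm that TensorFlow sees the GPU (a minimal sketch against the TF 1.x API that the logs below come from; it is not the project's own code):

```python
# Sketch: check GPU visibility under TensorFlow 1.x (assumption: TF 1.x API).
import tensorflow as tf
from tensorflow.python.client import device_lib

# True if TF was built with CUDA support and can see at least one GPU.
print(tf.test.is_gpu_available(cuda_only=True))

# Lists every device TF can use; the RTX 3090 shows up as a /device:GPU:0 entry.
print([d.name for d in device_lib.list_local_devices()])
```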
Here is the output I get when the training DOES proceed:
dict_keys(['CAMERA', 'Real', 'coco'])
Epoch 1/100
2024-01-25 13:49:04.065612: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
2024-01-25 13:49:04.279901: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2024-01-25 13:49:57.467798: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
196/1000 [====>.........................] - ETA: 41:01 - loss: nan
/home/midea/miniconda3/envs/nocs/lib/python3.7/site-packages/scipy/ndimage/interpolation.py:605: UserWarning: From scipy 0.13.0, the output shape of zoom() is calculated with round() instead of int() - for these inputs the size of the returned array has changed.
  "the returned array has changed.", UserWarning)
1000/1000 [==============================] - 767s 767ms/step - loss: nan - val_loss: nan
WARNING:tensorflow:From /home/midea/miniconda3/envs/nocs/lib/python3.7/site-packages/keras/callbacks/tensorboard_v1.py:343: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.
Epoch 2/100
1000/1000 [==============================] - 184s 184ms/step - loss: nan - val_loss: nan
Epoch 3/100
1000/1000 [==============================] - 187s 187ms/step - loss: nan - val_loss: nan
Epoch 4/100
1000/1000 [==============================] - 188s 188ms/step - loss: nan - val_loss: nan
Epoch 5/100
1000/1000 [==============================] - 189s 189ms/step - loss: nan - val_loss: nan
Epoch 6/100
1000/1000 [==============================] - 186s 186ms/step - loss: nan - val_loss: nan
Epoch 7/100
1000/1000 [==============================] - 185s 185ms/step - loss: nan - val_loss: nan
Epoch 8/100
1000/1000 [==============================] - 189s 189ms/step - loss: nan - val_loss: nan
Epoch 9/100
1000/1000 [==============================] - 187s 187ms/step - loss: nan - val_loss: nan
Epoch 10/100
1000/1000 [==============================] - 188s 188ms/step - loss: nan - val_loss: nan
Epoch 11/100
1000/1000 [==============================] - 189s 189ms/step - loss: nan - val_loss: nan
Epoch 12/100
1000/1000 [==============================] - 185s 185ms/step - loss: nan - val_loss: nan
Epoch 13/100
540/1000 [===============>..............] - ETA: 1:24 - loss: nan
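Since the loss is already nan at the first progress update, a callback like Keras's TerminateOnNaN could be passed to the fit call so the run stops at the first nan batch instead of running all 100 epochs. This is only a sketch assuming the standalone Keras 2.x API (which the tensorboard_v1 warning above points to); `model` and the generators are placeholders, not names from this repository:

```python
# Sketch: stop training as soon as the loss becomes nan or inf.
# Assumes standalone Keras 2.x; `model`, `train_generator`, `val_generator` are placeholders.
from keras.callbacks import TerminateOnNaN

callbacks = [TerminateOnNaN()]

# model.fit_generator(train_generator,
#                     steps_per_epoch=1000,
#                     epochs=100,
#                     validation_data=val_generator,
#                     callbacks=callbacks)
```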
When the training DOES NOT proceed, this is the error I get instead:
dict_keys(['CAMERA', 'Real', 'coco'])
Epoch 1/100
2024-01-25 15:31:03.464801: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
2024-01-25 15:31:03.658949: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2024-01-25 15:31:46.257837: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2024-01-25 15:39:51.069436: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2024-01-25 15:39:51.069476: I tensorflow/stream_executor/stream.cc:4838] [stream=0xe336d00,impl=0xe4f0b40] did not memzero GPU location; source: 0x7f53567fad20
2024-01-25 15:39:51.069481: I tensorflow/stream_executor/stream.cc:315] did not allocate timer: 0x7f53567fad30
2024-01-25 15:39:51.069484: I tensorflow/stream_executor/stream.cc:1839] [stream=0xe336d00,impl=0xe4f0b40] did not enqueue 'start timer': 0x7f53567fad30
2024-01-25 15:39:51.069493: I tensorflow/stream_executor/stream.cc:1851] [stream=0xe336d00,impl=0xe4f0b40] did not enqueue 'stop timer': 0x7f53567fad30
2024-01-25 15:39:51.069496: F tensorflow/stream_executor/gpu/gpu_timer.cc:65] Check failed: start_event_ != nullptr && stop_event_ != nullptr
Aborted (core dumped)
The following are the settings of my conda environment:
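Alongside the full conda package list, the versions of the key libraries the logs reference can be printed from inside the environment with something like this sketch (no specific version numbers are asserted here):

```python
# Sketch: print versions of the packages the logs above reference,
# run from inside the conda environment. Purely illustrative.
import tensorflow as tf
import keras
import numpy as np
import scipy

print("tensorflow:", tf.__version__)
print("keras:", keras.__version__)
print("numpy:", np.__version__)
print("scipy:", scipy.__version__)
print("built with CUDA:", tf.test.is_built_with_cuda())
```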
What might I have done wrong?