Getting 'nan' results #80

Open
jysh1023 opened this issue Jan 25, 2024 · 0 comments
jysh1023 commented Jan 25, 2024

Hello, I'm working in a conda environment, trying to reproduce the training results.
I installed the necessary packages, so the code runs. TensorFlow also detects my GPU (NVIDIA GeForce RTX 3090).
However, training takes a very long time to start, and I keep getting nan for both loss and val_loss.

Here is the output I get when the training DOES proceed:

dict_keys(['CAMERA', 'Real', 'coco'])
Epoch 1/100
2024-01-25 13:49:04.065612: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
2024-01-25 13:49:04.279901: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2024-01-25 13:49:57.467798: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
 196/1000 [====>.........................] - ETA: 41:01 - loss: nan/home/midea/miniconda3/envs/nocs/lib/python3.7/site-packages/scipy/ndimage/interpolation.py:605: UserWarning: From scipy 0.13.0, the output shape of zoom() is calculated with round() instead of int() - for these inputs the size of the returned array has changed.
  "the returned array has changed.", UserWarning)
1000/1000 [==============================] - 767s 767ms/step - loss: nan - val_loss: nan
WARNING:tensorflow:From /home/midea/miniconda3/envs/nocs/lib/python3.7/site-packages/keras/callbacks/tensorboard_v1.py:343: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

Epoch 2/100
1000/1000 [==============================] - 184s 184ms/step - loss: nan - val_loss: nan
Epoch 3/100
1000/1000 [==============================] - 187s 187ms/step - loss: nan - val_loss: nan
Epoch 4/100
1000/1000 [==============================] - 188s 188ms/step - loss: nan - val_loss: nan
Epoch 5/100
1000/1000 [==============================] - 189s 189ms/step - loss: nan - val_loss: nan
Epoch 6/100
1000/1000 [==============================] - 186s 186ms/step - loss: nan - val_loss: nan
Epoch 7/100
1000/1000 [==============================] - 185s 185ms/step - loss: nan - val_loss: nan
Epoch 8/100
1000/1000 [==============================] - 189s 189ms/step - loss: nan - val_loss: nan
Epoch 9/100
1000/1000 [==============================] - 187s 187ms/step - loss: nan - val_loss: nan
Epoch 10/100
1000/1000 [==============================] - 188s 188ms/step - loss: nan - val_loss: nan
Epoch 11/100
1000/1000 [==============================] - 189s 189ms/step - loss: nan - val_loss: nan
Epoch 12/100
1000/1000 [==============================] - 185s 185ms/step - loss: nan - val_loss: nan
Epoch 13/100
 540/1000 [===============>..............] - ETA: 1:24 - loss: nan

Otherwise, when the training DOES NOT proceed, this is the error I get:

dict_keys(['CAMERA', 'Real', 'coco'])
Epoch 1/100
2024-01-25 15:31:03.464801: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
2024-01-25 15:31:03.658949: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2024-01-25 15:31:46.257837: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2024-01-25 15:39:51.069436: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2024-01-25 15:39:51.069476: I tensorflow/stream_executor/stream.cc:4838] [stream=0xe336d00,impl=0xe4f0b40] did not memzero GPU location; source: 0x7f53567fad20
2024-01-25 15:39:51.069481: I tensorflow/stream_executor/stream.cc:315] did not allocate timer: 0x7f53567fad30
2024-01-25 15:39:51.069484: I tensorflow/stream_executor/stream.cc:1839] [stream=0xe336d00,impl=0xe4f0b40] did not enqueue 'start timer': 0x7f53567fad30
2024-01-25 15:39:51.069493: I tensorflow/stream_executor/stream.cc:1851] [stream=0xe336d00,impl=0xe4f0b40] did not enqueue 'stop timer': 0x7f53567fad30
2024-01-25 15:39:51.069496: F tensorflow/stream_executor/gpu/gpu_timer.cc:65] Check failed: start_event_ != nullptr && stop_event_ != nullptr 
Aborted (core dumped)

Here are the packages in my conda environment:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
_tflow_select             2.1.0                       gpu  
absl-py                   0.15.0             pyhd3eb1b0_0  
astor                     0.8.1            py37h06a4308_0  
blas                      1.1                    openblas    conda-forge
blosc                     1.21.3               h6a678d5_0  
bottleneck                1.3.5            py37h7deecbd_0  
brotli                    1.0.9                h5eee18b_7  
brotli-bin                1.0.9                h5eee18b_7  
brunsli                   0.1                  h2531618_0  
bzip2                     1.0.8                h7b6447c_0  
c-ares                    1.19.1               h5eee18b_0  
ca-certificates           2023.12.12           h06a4308_0  
cairo                     1.16.0               hb05425b_5  
certifi                   2022.12.7        py37h06a4308_0  
cfitsio                   3.470                h5893167_7  
charls                    2.2.0                h2531618_0  
cloudpickle               2.0.0              pyhd3eb1b0_0  
cudatoolkit               10.0.130                      0  
cudnn                     7.6.5                cuda10.0_0  
cupti                     10.0.130                      0  
cycler                    0.11.0             pyhd3eb1b0_0  
cython                    0.29.33          py37h6a678d5_0  
cytoolz                   0.12.0           py37h5eee18b_0  
dask-core                 2021.10.0          pyhd3eb1b0_0  
dbus                      1.13.18              hb2f20db_0  
expat                     2.5.0                h6a678d5_0  
ffmpeg                    4.3.2                h37c90e5_3    conda-forge
fftw                      3.3.9                h27cfd23_1  
flit-core                 3.6.0              pyhd3eb1b0_0  
fontconfig                2.14.2               h14ed4e7_0    conda-forge
fonttools                 4.25.0             pyhd3eb1b0_0  
freetype                  2.12.1               h4a9f257_0  
fsspec                    2022.11.0        py37h06a4308_0  
gast                      0.2.2                    py37_0  
gettext                   0.21.0               hf68c758_0  
giflib                    5.2.1                h5eee18b_3  
glib                      2.70.2               h780b84a_4    conda-forge
glib-tools                2.70.2               h780b84a_4    conda-forge
gmp                       6.2.1                h295c915_3  
gnutls                    3.6.15               he1e5248_0  
google-pasta              0.2.0              pyhd3eb1b0_0  
graphite2                 1.3.14               h295c915_1  
grpcio                    1.42.0           py37hce63b2e_0  
gst-plugins-base          1.14.5               h0935bb2_2    conda-forge
gstreamer                 1.18.5               ha1a6a79_0  
h5py                      2.10.0           py37hd6299e0_1  
harfbuzz                  2.9.1                h83ec7ef_1    conda-forge
hdf5                      1.10.6               h3ffc7dd_1  
icu                       68.1                 h2531618_0  
imagecodecs               2021.8.26        py37hf0132c2_1  
imageio                   2.19.3           py37h06a4308_0  
importlib-metadata        4.11.3           py37h06a4308_0  
jasper                    1.900.1              hd497a04_4  
joblib                    1.1.1            py37h06a4308_0  
jpeg                      9e                   h5eee18b_1  
jxrlib                    1.1                  h7b6447c_2  
keras                     2.3.1                         0  
keras-applications        1.0.8                      py_1  
keras-base                2.3.1                    py37_0  
keras-preprocessing       1.1.2              pyhd3eb1b0_0  
kiwisolver                1.4.4            py37h6a678d5_0  
krb5                      1.20.1               h568e23c_1  
lame                      3.100                h7b6447c_0  
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.38                 h1181459_1  
lerc                      3.0                  h295c915_0  
libaec                    1.0.4                he6710b0_1  
libblas                   3.9.0           16_linux64_openblas    conda-forge
libbrotlicommon           1.0.9                h5eee18b_7  
libbrotlidec              1.0.9                h5eee18b_7  
libbrotlienc              1.0.9                h5eee18b_7  
libcblas                  3.9.0           16_linux64_openblas    conda-forge
libclang                  11.1.0          default_ha53f305_1    conda-forge
libcurl                   8.2.1                h91b91d3_0  
libdeflate                1.8                  h7f8727e_5  
libedit                   3.1.20230828         h5eee18b_0  
libev                     4.33                 h7f8727e_1  
libevent                  2.1.10               h9b69904_4    conda-forge
libffi                    3.4.4                h6a678d5_0  
libgcc-ng                 13.2.0               h807b86a_3    conda-forge
libgfortran               3.0.0                         1    conda-forge
libgfortran-ng            11.2.0               h00389a5_1  
libgfortran5              11.2.0               h1234567_1  
libglib                   2.70.2               h174f98d_4    conda-forge
libgomp                   13.2.0               h807b86a_3    conda-forge
libiconv                  1.16                 h7f8727e_2  
libidn2                   2.3.4                h5eee18b_0  
liblapack                 3.9.0           16_linux64_openblas    conda-forge
liblapacke                3.9.0           16_linux64_openblas    conda-forge
libllvm11                 11.1.0               h9e868ea_6  
libnghttp2                1.52.0               ha637b67_1  
libnsl                    2.0.0                h5eee18b_0  
libopenblas               0.3.21               h043d6bf_0  
libopencv                 4.5.3            py37h25009ff_1    conda-forge
libpng                    1.6.39               h5eee18b_0  
libpq                     12.15                h37d81fd_1  
libprotobuf               3.16.0               h780b84a_0    conda-forge
libssh2                   1.10.0               h37d81fd_2  
libstdcxx-ng              11.2.0               h1234567_1  
libtasn1                  4.19.0               h5eee18b_0  
libtiff                   4.4.0                hecacb30_2  
libunistring              0.9.10               h27cfd23_0  
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libwebp                   1.2.4                h11a3e52_1  
libwebp-base              1.2.4                h5eee18b_1  
libxcb                    1.15                 h7f8727e_0  
libxkbcommon              1.0.3                he3ba5ed_0    conda-forge
libxml2                   2.9.12               h72842e0_0    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
libzopfli                 1.0.3                he6710b0_0  
locket                    1.0.0            py37h06a4308_0  
lz4-c                     1.9.4                h6a678d5_0  
markdown                  3.4.1            py37h06a4308_0  
markupsafe                2.1.1            py37h7f8727e_0  
matplotlib-base           3.5.3            py37hf590b9c_0  
munkres                   1.1.4                      py_0  
mysql-common              8.0.29               haf5c9bc_1    conda-forge
mysql-libs                8.0.29               h28c427c_1    conda-forge
ncurses                   6.4                  h6a678d5_0  
nettle                    3.7.3                hbbd107a_1  
networkx                  2.6.3              pyhd3eb1b0_0  
nspr                      4.35                 h6a678d5_0  
nss                       3.89.1               h6a678d5_0  
numexpr                   2.8.4            py37hd2a5715_0  
numpy                     1.21.5           py37hf838250_3  
numpy-base                1.21.5           py37h1e6e340_3  
openblas                  0.3.3                ha44fe06_1    conda-forge
opencv                    4.5.3            py37h89c1867_1    conda-forge
openh264                  2.1.1                h4ff587b_0  
openjpeg                  2.4.0                h3ad879b_0  
openssl                   1.1.1w               h7f8727e_0  
opt_einsum                3.3.0              pyhd3eb1b0_1  
packaging                 22.0             py37h06a4308_0  
pandas                    1.3.5            py37h8c16a72_0  
partd                     1.2.0              pyhd3eb1b0_1  
pcre                      8.45                 h295c915_0  
pillow                    9.4.0            py37h6a678d5_0  
pip                       22.3.1           py37h06a4308_0  
pixman                    0.40.0               h7f8727e_1  
protobuf                  3.16.0           py37hcd2ae1e_0    conda-forge
py-opencv                 4.5.3            py37h6531663_1    conda-forge
pycocotools               2.0.4            py37hda87dfa_2    conda-forge
pyparsing                 3.0.9            py37h06a4308_0  
python                    3.7.16               h7a1cb2a_0  
python-dateutil           2.8.2              pyhd3eb1b0_0  
python_abi                3.7                     2_cp37m    conda-forge
pytz                      2022.7           py37h06a4308_0  
pywavelets                1.3.0            py37h7f8727e_0  
pyyaml                    6.0              py37h5eee18b_1  
qt                        5.12.9               h9d6b050_2    conda-forge
readline                  8.2                  h5eee18b_0  
scikit-image              0.18.3           py37h51133e4_0  
scikit-learn              1.0.2            py37h51133e4_1  
scipy                     1.2.0           py37_blas_openblashb06ca3d_200    conda-forge
setuptools                65.6.3           py37h06a4308_0  
six                       1.16.0             pyhd3eb1b0_1  
snappy                    1.1.10               h6a678d5_1  
sqlite                    3.41.2               h5eee18b_0  
tensorboard               1.14.0           py37hf484d3e_0  
tensorflow                1.14.0          gpu_py37h4491b45_0  
tensorflow-base           1.14.0          gpu_py37h8d69cac_0  
tensorflow-estimator      1.14.0                     py_0  
tensorflow-gpu            1.14.0               h0d30ee6_0  
termcolor                 1.1.0            py37h06a4308_1  
threadpoolctl             2.2.0              pyh0d69192_0  
tifffile                  2021.7.2           pyhd3eb1b0_2  
tk                        8.6.12               h1ccaba5_0  
toolz                     0.12.0           py37h06a4308_0  
typing_extensions         4.4.0            py37h06a4308_0  
webencodings              0.5.1                    py37_1  
werkzeug                  0.16.1                     py_0  
wheel                     0.38.4           py37h06a4308_0  
wrapt                     1.14.1           py37h5eee18b_0  
x264                      1!161.3030           h7f98852_1    conda-forge
xz                        5.4.5                h5eee18b_0  
yaml                      0.2.5                h7b6447c_0  
zfp                       0.5.5                h295c915_6  
zipp                      3.11.0           py37h06a4308_0  
zlib                      1.2.13               hd590300_5    conda-forge
zstd                      1.5.5                hc292b87_0  

What might I have done wrong?
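
One thing that might be worth checking is whether the CUDA toolkit in this environment is recent enough for the GPU. The sketch below pulls the GPU-related versions out of a `conda list` excerpt and compares the toolkit version against a small lookup table. The table values, the compute capability assumed for the RTX 3090 (8.6), and all function names are assumptions made for illustration; they are not taken from this repository and should be verified against NVIDIA's documentation.

```python
# Sketch only: helper names and the lookup table are illustrative assumptions.

# ASSUMPTION: first CUDA toolkit release with native support for each
# compute capability (verify against NVIDIA's release notes).
MIN_CUDA_FOR_CAPABILITY = {
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (data center)
    (8, 6): (11, 1),  # Ampere (assumed to include the RTX 3090)
}


def parse_conda_list(text):
    """Return {package_name: version_string} from `conda list` output."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header row
        fields = line.split()
        if len(fields) >= 2:
            packages[fields[0]] = fields[1]
    return packages


def version_tuple(version, parts=2):
    """Turn a version such as '10.0.130' into (10, 0)."""
    return tuple(int(p) for p in version.split(".")[:parts])


def cuda_supports(cuda_version, capability):
    """True if the toolkit version meets the table's minimum for the capability."""
    return version_tuple(cuda_version) >= MIN_CUDA_FOR_CAPABILITY[capability]


# Excerpt copied from the environment listing above.
CONDA_LIST_EXCERPT = """
# Name                    Version                   Build  Channel
cudatoolkit               10.0.130                      0
cudnn                     7.6.5                cuda10.0_0
tensorflow-gpu            1.14.0               h0d30ee6_0
"""

packages = parse_conda_list(CONDA_LIST_EXCERPT)
cuda = packages["cudatoolkit"]
rtx_3090_capability = (8, 6)  # ASSUMPTION: compute capability of the RTX 3090
ok = cuda_supports(cuda, rtx_3090_capability)
print(f"cudatoolkit {cuda} natively supports capability 8.6: {ok}")
```

If the check prints `False`, a mismatch between the toolkit and the GPU would be one candidate to rule out for both the slow start and the nan losses. This is a guess based on the versions listed above, not a confirmed diagnosis.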

jysh1023 closed this as not planned Feb 2, 2024
jysh1023 reopened this Feb 19, 2024