Google Colab: GPUs: None detected #1644

Open
talmo opened this issue Dec 19, 2023 Discussed in #1642 · 5 comments

Comments

@talmo
Collaborator

talmo commented Dec 19, 2023

TLDR: Google Colab no longer works with TensorFlow <2.15.

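This is a problem because some of our dependencies break with TensorFlow versions newer than roughly 2.11.

This is likely because of the CUDA/cuDNN versions: older TensorFlow builds look for CUDA 11 libraries (e.g., libcudart.so.11.0), which the current Colab runtime no longer provides. As of Dec 19, 2023, nvidia-smi reports: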

  • Driver Version: 535.104.05
  • CUDA Version: 12.2
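
To record these two values programmatically when reporting an environment, something like the following sketch could be used. It only assumes that the nvidia-smi header contains the "Driver Version:" and "CUDA Version:" labels shown above; the helper names are hypothetical. Note that the CUDA version nvidia-smi prints is the highest version the driver supports, not necessarily the toolkit that is installed.

```python
import re
import subprocess


def parse_nvidia_smi_header(text):
    """Return (driver_version, cuda_version) from nvidia-smi output.

    A field that cannot be found is returned as None.
    """
    driver = re.search(r"Driver Version:\s*([\d.]+)", text)
    cuda = re.search(r"CUDA Version:\s*([\d.]+)", text)
    return (
        driver.group(1) if driver else None,
        cuda.group(1) if cuda else None,
    )


def read_nvidia_smi():
    """Run nvidia-smi and return its stdout, or "" if it cannot be run."""
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, check=False
        )
    except OSError:
        # nvidia-smi is not installed or not executable (e.g. CPU-only runtime).
        return ""
    return result.stdout


if __name__ == "__main__":
    print(parse_nvidia_smi_header(read_nvidia_smi()))
```

On a runtime without a GPU driver this prints (None, None) rather than raising.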

Here's a notebook for testing.

Potential workarounds:

  • Use Paperspace as an alternative to Google Colab
  • Run !apt update && apt install cuda-11-8 before installing SLEAP [source]. Note: tested to work with SLEAP v1.3.3, but the install takes ~5-10 minutes.
  • Tools -> Command palette -> type in and select 'use fallback runtime'. Unfortunately, this will only work until early Jan 2024 [source]
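
After trying one of these workarounds, a quick way to check whether the CUDA 11 libraries are visible to the dynamic loader is sketched below. The library names are copied from the "Could not load dynamic library" warnings in the log quoted later in this issue; the function name is hypothetical, and whether the libraries resolve also depends on the loader search path (e.g. LD_LIBRARY_PATH), so treat this as a diagnostic aid rather than a guarantee that TensorFlow will see the GPU.

```python
import ctypes

# Library names taken from the TensorFlow 2.8.4 warnings quoted in this issue.
CUDA11_LIBS = [
    "libcudart.so.11.0",
    "libcublas.so.11",
    "libcublasLt.so.11",
    "libcufft.so.10",
    "libcusparse.so.11",
]


def find_missing_libs(names, loader=ctypes.CDLL):
    """Return the library names that the given loader fails to open."""
    missing = []
    for name in names:
        try:
            loader(name)
        except OSError:
            # ctypes raises OSError when the shared object cannot be found.
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_libs(CUDA11_LIBS)
    if missing:
        print(f"Still missing: {missing}")
    else:
        print("All listed CUDA 11 libraries could be loaded.")
```

The loader is passed in as a parameter so the check can be exercised without a GPU machine.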

Proper fix: Update usage of dependencies to work with Python 3.10 + TensorFlow 2.15 while maintaining backwards compatibility with at least TF 2.10 for Windows support.

Discussed in #1642

Originally posted by delaroob December 17, 2023
Hi everyone,

I'm trying to continue training a SLEAP network in Colab. I've done the process (importing the same stuff, running the same code blocks etc.) several times in the past few days without any problems. However, it now seems like I can't connect to any GPUs.

As a matter of fact, I can't run anything in Colab right now except for things like saving variables, importing packages and stuff that doesn't really require much compute power. DeepLabCut doesn't work either: the runtime collapses and restarts without further information.

In the runtime settings, Python 3 with a V100 GPU is selected, and I still have 122 compute units available.

Thanks in advance for any help and let me know if additional information is required to solve the issue!

Here is the stuff I run (it's basically the demo notebook):

!pip uninstall -qqq -y opencv-python opencv-contrib-python
!pip install -qqq "sleap[pypi]>=1.3.3"
from google.colab import drive
drive.mount('/content/drive/')

(I've already done the next "iteration" of training yesterday, so I skipped the unzip and training part, since I just wanted to run inference and predict instances)

!sleap-track "/content/drive/MyDrive/sleap/colab2/male.mp4" -m "/content/drive/MyDrive/sleap/colab2/models/231213_081111.single_instance"

output:

INFO:numexpr.utils:NumExpr defaulting to 8 threads.
2023-12-17 16:30:34.863435: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.10/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-12-17 16:30:34.863471: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Started inference at: 2023-12-17 16:30:37.969681
Args:
{
│   'data_path': '/content/drive/MyDrive/sleap/colab2/male.mp4',
│   'models': ['/content/drive/MyDrive/sleap/colab2/models/231213_081111.single_instance'],
│   'frames': '',
│   'only_labeled_frames': False,
│   'only_suggested_frames': False,
│   'output': None,
│   'no_empty_frames': False,
│   'verbosity': 'rich',
│   'video.dataset': None,
│   'video.input_format': 'channels_last',
│   'video.index': '',
│   'cpu': False,
│   'first_gpu': False,
│   'last_gpu': False,
│   'gpu': 'auto',
│   'max_edge_length_ratio': 0.25,
│   'dist_penalty_weight': 1.0,
│   'batch_size': 4,
│   'open_in_gui': False,
│   'peak_threshold': 0.2,
│   'max_instances': None,
│   'tracking.tracker': None,
│   'tracking.max_tracking': None,
│   'tracking.max_tracks': None,
│   'tracking.target_instance_count': None,
│   'tracking.pre_cull_to_target': None,
│   'tracking.pre_cull_iou_threshold': None,
│   'tracking.post_connect_single_breaks': None,
│   'tracking.clean_instance_count': None,
│   'tracking.clean_iou_threshold': None,
│   'tracking.similarity': None,
│   'tracking.match': None,
│   'tracking.robust': None,
│   'tracking.track_window': None,
│   'tracking.min_new_track_points': None,
│   'tracking.min_match_points': None,
│   'tracking.img_scale': None,
│   'tracking.of_window_size': None,
│   'tracking.of_max_levels': None,
│   'tracking.save_shifted_instances': None,
│   'tracking.kf_node_indices': None,
│   'tracking.kf_init_frame_count': None
}

2023-12-17 16:30:37.999611: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-12-17 16:30:37.999983: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.10/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-12-17 16:30:38.000129: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.10/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-12-17 16:30:38.000255: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.10/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-12-17 16:30:38.000375: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.10/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-12-17 16:30:38.045719: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.10/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-12-17 16:30:38.046198: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Versions:
SLEAP: 1.3.3
TensorFlow: 2.8.4
Numpy: 1.22.4
Python: 3.10.12
OS: Linux-6.1.58+-x86_64-with-glibc2.35

System:
GPUs: None detected.

Video: /content/drive/MyDrive/sleap/colab2/male.mp4
2023-12-17 16:30:38.122476: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Predicting... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% ETA: -:--:-- ?2023-12-17 16:30:41.717931: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -36 } dim { size: -37 } dim { size: -38 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -18 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -18 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "GenuineIntel" model: "101" frequency: 2000 num_cores: 8 environment { key: "cpu_instruction_set" value: "AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 32768 l2_cache_size: 1048576 l3_cache_size: 40370176 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -18 } dim { size: -40 } dim { size: -41 } dim { size: 1 } } }
Predicting... ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   5% ETA: 0:49:26 4.0 FPS
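
For long logs like the one above, the missing libraries can be pulled out automatically. This is a minimal sketch that assumes the warning text has the same "Could not load dynamic library '<name>'" form seen in this output; the function name is hypothetical.

```python
import re

# Matches the library name inside TensorFlow's dso_loader warning.
_MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the unique library names that failed to load, in first-seen order."""
    seen = []
    for name in _MISSING_LIB.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen
```

Applied to the log above, this would list the CUDA 11 libraries (libcudart.so.11.0, libcublas.so.11, and so on) that TensorFlow 2.8.4 could not find.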
@talmo talmo pinned this issue Dec 19, 2023
@roomrys roomrys assigned roomrys and unassigned roomrys Jan 5, 2024
@NeuTTH

NeuTTH commented Mar 26, 2024

Following up on this. I'm facing the same issue when using SLEAP on Google Colab.

@amblypatty

This comment was marked as resolved.

@talmo
Collaborator Author

talmo commented Apr 10, 2024

Hi @amblypatty,

Did you try installing the older version of CUDA first with !apt update && apt install cuda-11-8?

Thanks!

Talmo

@fangyuanlin2002

I'm using Paperspace to do the sample project. Steps 1 and 2 didn't error, but at step 3 (train the model) I get sleap-train: command not found. This shouldn't happen because we installed sleap at the top. Could you please help?
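
A "command not found" error generally means the shell cannot find the executable on PATH, which can happen if the install failed or if the scripts were installed somewhere that is not on PATH. A small sketch for checking this from a notebook cell, using the two command names that appear in this thread (the helper name is hypothetical):

```python
import shutil


def missing_commands(names, which=shutil.which):
    """Return the command names that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    # An empty list means both entry points are visible to the shell.
    print(missing_commands(["sleap-train", "sleap-track"]))
```

If the commands are reported as missing, the output of the pip install step is the next place to look.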

@talmo
Collaborator Author

talmo commented Jun 30, 2024

Hi @FangyuanLinGoBears2024,

Are you seeing any errors when you do pip install sleap[pypi] at the top?

Thanks!

Talmo
