EfficientViT-GazeSAM is a gaze-prompted image segmentation model capable of running in real time with TensorRT on an NVIDIA RTX 4070. GazeSAM consists of a face detection component (ProxylessGaze), a gaze estimation component (L2CS-Net), an object detection component (YOLO-NAS), a depth estimation component (Depth-Anything), and an image segmentation component (EfficientViT).
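At a high level, each frame flows through the five components in sequence: detect the face, estimate the gaze direction, detect candidate objects, use depth to pick the object the gaze ray lands on, and segment it. The sketch below illustrates that flow only; every function in it is a placeholder, not the repo's actual API.

```python
# Hypothetical sketch of how GazeSAM's five stages chain together per
# frame. All stage functions are placeholder stubs, not the repo's API.
import numpy as np

def detect_face(frame):          # ProxylessGaze stub
    return (0, 0, 64, 64)

def estimate_gaze(frame, face):  # L2CS-Net stub: (pitch, yaw)
    return np.array([0.0, 0.0])

def detect_objects(frame):       # YOLO-NAS stub: candidate boxes
    return [(10, 10, 50, 50)]

def estimate_depth(frame):       # Depth-Anything stub
    return np.zeros(frame.shape[:2])

def segment(frame, box):         # EfficientViT-SAM stub: binary mask
    return np.zeros(frame.shape[:2], dtype=bool)

def gazesam_frame(frame):
    face = detect_face(frame)
    gaze = estimate_gaze(frame, face)   # where is the user looking?
    boxes = detect_objects(frame)       # candidate objects in view
    depth = estimate_depth(frame)       # disambiguates along the gaze ray
    target = boxes[0]                   # stub for gaze/depth box selection
    return segment(frame, target)       # mask of the gazed-at object

mask = gazesam_frame(np.zeros((480, 640, 3), dtype=np.uint8))
```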
Before following the runtime-specific instructions below, make sure you have followed this repo's conda environment creation and package installation instructions, then install the extra packages:

```bash
# install extra packages
pip install -r extra_requirements.txt
```
**TensorRT**

1. Ensure the following packages are installed:
   a. TensorRT
   b. torch2trt
   c. cuda-python (`python -m pip install cuda-python`)
2. Follow the engine creation instructions within the `models` directory here. You can choose between the default version (FP32 + FP16 engines) and the optimized version (FP32, FP16, and INT8 engines). The optimized version is approximately 5 ms faster per frame (on an RTX 4070), but both run in real time. A generic build sketch follows this list.
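If you are curious what engine creation involves, the sketch below shows a generic ONNX-to-TensorRT FP16 build using the TensorRT Python API. The file paths and the FP16 choice are illustrative assumptions; the repo's scripts in `models` are the source of truth.

```python
# Generic ONNX -> TensorRT engine build (illustrative; the actual
# engine-creation scripts live in the `models` directory).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("models/onnx/model.onnx", "rb") as f:  # hypothetical path
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # omit for FP32; INT8 additionally needs calibration

engine = builder.build_serialized_network(network, config)
with open("model_fp16.engine", "wb") as f:  # hypothetical output path
    f.write(engine)
```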
**ONNX Runtime**

1. Install ONNX Runtime:
   `python -m pip install onnxruntime-gpu`
   Note: if you run into ONNX Runtime issues, try uninstalling `onnxruntime` and `onnxruntime-gpu`, then reinstalling `onnxruntime-gpu`.
2. Download the ONNX model components here and save them to the `models/onnx` directory (make sure to create the `onnx` subfolder). A quick load check follows this list.
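As a sanity check that `onnxruntime-gpu` can see your GPU and load one of the downloaded components (the filename below is a placeholder, not necessarily a real component name):

```python
# Verify that ONNX Runtime loads a model on the CUDA execution provider.
import onnxruntime as ort

session = ort.InferenceSession(
    "models/onnx/face_detection.onnx",  # placeholder filename
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())                     # CUDAExecutionProvider should come first
print([inp.name for inp in session.get_inputs()])  # model input names
```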
**PyTorch**

1. Set up the EfficientViT-SAM model [guide].
2. Set up the depth estimation model:
   a. Download the Depth-Anything repo and save it as a subfolder within this current directory.
   b. Run `cp models/create_pytorch/dpt_replacement.py Depth-Anything/depth_anything/dpt.py`. This prepends the torchhub local download path with "Depth-Anything".
   c. Download the Depth-Anything-Base checkpoint here and save it within the `models/pytorch` directory (make sure to create the `pytorch` subfolder).
3. Set up the gaze estimation model:
   a. Download the L2CS-Net pickle file here and save it within the `models/pytorch` directory (make sure to create the `pytorch` subfolder).
4. Download the ONNX model components here and save them within the `models/onnx` directory (make sure to create the `onnx` subfolder; a snippet that creates both checkpoint subfolders follows this list).
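Several steps above require creating the checkpoint subfolders by hand; this small snippet creates both at once:

```python
# Create the checkpoint subfolders expected by the setup steps above.
from pathlib import Path

for sub in ("models/onnx", "models/pytorch"):
    Path(sub).mkdir(parents=True, exist_ok=True)
```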
GazeSAM can process webcam and video file inputs. To run with a webcam, run `python gazesam_demo.py --webcam`. To run on an input video, run `python gazesam_demo.py --video <path>`.
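The two input modes differ only in which capture source the demo opens. A minimal OpenCV sketch of that distinction (illustrative, not the demo's actual code):

```python
# Webcam vs. video-file input both reduce to a cv2.VideoCapture source
# (illustrative; not necessarily how gazesam_demo.py is structured).
import cv2

use_webcam = False
cap = cv2.VideoCapture(0 if use_webcam else "input_videos/example.mp4")

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # ... run the GazeSAM pipeline on `frame` here ...

cap.release()
```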
By default, we run with TensorRT (use the `--runtime` flag to change this, but note that only TensorRT mode produces results in real time). Results are saved by default to the `output_videos` directory (modifiable via the `--output-dir` flag).
If you generated engines using the optimized script, set `--precision-mode optimized`. The modes are described here. You can download the example video here.
Input video + default engines example: `python gazesam_demo.py --video input_videos/example.mp4 --precision-mode default`

Webcam + optimized engines example: `python gazesam_demo.py --webcam --precision-mode optimized`
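Taken together, the flags described above correspond to a CLI surface roughly like the sketch below (flag names and the stated defaults come from this README; anything else is an assumption):

```python
# Sketch of the demo's CLI surface as described in this README
# (any defaults or choices not stated above are assumptions).
import argparse

parser = argparse.ArgumentParser(description="GazeSAM demo")
source = parser.add_mutually_exclusive_group(required=True)
source.add_argument("--webcam", action="store_true", help="use webcam input")
source.add_argument("--video", type=str, help="path to an input video file")
parser.add_argument("--runtime", default="tensorrt",
                    help="inference runtime; only TensorRT runs in real time")
parser.add_argument("--precision-mode", default="default",
                    choices=["default", "optimized"],
                    help="match the engines you built (FP32/FP16 vs. +INT8)")
parser.add_argument("--output-dir", default="output_videos",
                    help="directory where result videos are saved")
args = parser.parse_args()
```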
If EfficientViT is useful or relevant to your research, please recognize our contributions by citing our paper:
```bibtex
@inproceedings{cai2023efficientvit,
  title={Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction},
  author={Cai, Han and Li, Junyan and Hu, Muyan and Gan, Chuang and Han, Song},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={17302--17313},
  year={2023}
}
```