This README shows how to perform hardware-aware optimization of the Ultra-Fast-Lane-Detection-V2 ResNet-18 model on the TuSimple dataset.
The repository code was tested with Python 3.8. To get started, install `torch==1.13.1` and `torchvision==0.14.1` builds compatible with your CUDA version, following the instructions on the official PyTorch site.
The repository is based on two main packages:
- ENOT Framework — a flexible tool for Deep Learning developers which automates neural architecture optimization.
- ENOT Latency Server — a small open-source package that provides a simple API for latency measurement on a remote device.
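For context, the typical latency-measurement flow looks like the sketch below: the client serializes a model (e.g., to ONNX), sends it to the latency server running on the target device, and receives the measured latency back. The import path, function name, and result format in this sketch are assumptions based on the enot-latency-server documentation; consult the package docs for the exact API.
# A minimal client-side sketch; the names below are assumptions, not the verified API.
from enot_latency_server.client import measure_latency  # assumed import path

with open('model_best.onnx', 'rb') as onnx_file:
    model_bytes = onnx_file.read()

# <target-device-host> and the port are placeholders for your own server.
result = measure_latency(model_bytes, host='<target-device-host>', port=15003)
print(result)  # expected to contain the measured latency (e.g., in milliseconds)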
Follow the installation guide to install `enot-autodl==3.3.3`.
To install `enot-latency-server`, simply run:
pip install enot-latency-server==1.2.0
Install other requirements:
NOTE: You must have the same CUDA version on your system as PyTorch's CUDA version. We built `my_interp` using CUDA 11.7.
pip install -r requirements.txt
# Install NVIDIA DALI, a very fast data loading library:
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist --upgrade nvidia-dali-cuda110
cd my_interp
# If the following command fails, you might need to add path to your cuda to PATH:
# PATH=/usr/local/cuda-11/bin:$PATH bash build.sh
bash build.sh
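If the build fails, a quick way to confirm that your system CUDA toolkit matches PyTorch's CUDA build is the small optional check below (not part of the repository):
# Compare PyTorch's CUDA build with the system CUDA toolkit used to build my_interp.
import subprocess
import torch

print('PyTorch CUDA version:', torch.version.cuda)  # expected: 11.7
print(subprocess.run(['nvcc', '--version'], capture_output=True, text=True).stdout)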
NOTE: All pruning/training procedures are performed on an x86-64 computer; ONLY latency measurements are performed on the remote target device. You therefore do not need to install the `enot-autodl` package on the target device; only the `enot-latency-server` package is required there for latency measurements.
Download the preprocessed TuSimple dataset from Google Drive and unzip it into the repository root:
unzip dataset.zip
The dataset should have the following structure:
└── ultra-fast-lane-detector-v2 (repository root)
    └── dataset
        ├── clips
        │   ├── 0313-1
        │   ├── 0313-2
        │   ├── 0530
        │   ├── 0531
        │   └── 0601
        ├── label_data_0313.json
        ├── label_data_0531.json
        ├── label_data_0601.json
        ├── test_label.json
        ├── test_tasks_0627.json
        ├── test.txt
        ├── train_gt.txt
        └── tusimple_anno_cache.json
If you want to use your own dataset path, change the `data_root` parameter in `configs/tusimple_res18.py`.
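The parameter is a plain Python assignment in the config; an illustrative excerpt (other config fields omitted):
# configs/tusimple_res18.py (excerpt)
data_root = '/absolute/path/to/your/dataset'  # point this at your dataset location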
To train the baseline model, run:
bash commands/baseline/train.sh
The result of this command is the `model_best.pth` checkpoint in the `runs/baseline` directory.
Use this command to verify baseline accuracy:
bash commands/baseline/test.sh
To optimize a model by latency for Jetson, first start our latency server on the Jetson device (see the instruction).
NOTE: Substitute `--host` and `--port` in the commands and `.sh` scripts below with the host and port of your server on Jetson.
To optimize a model by latency for Jetson, run the corresponding script (x2/x3 means latency acceleration):
bash commands/x2_jetson/prune.sh
bash commands/x3_jetson/prune.sh
After pruning, tune the model with the corresponding command:
bash commands/x2_jetson/tune.sh
bash commands/x3_jetson/tune.sh
To use the INT8 data type for model inference, follow our quantization pipeline:
bash commands/x3_jetson/quant.sh
Use the corresponding command to verify the optimized model accuracy:
bash commands/x2_jetson/test.sh
bash commands/x3_jetson/test.sh
Use the corresponding command to verify the optimized model latency:
bash commands/x2_jetson/measure.sh
bash commands/x3_jetson/measure.sh
Download our checkpoints from Google Drive.
To extract the checkpoints, use the following command:
unzip ufld_ckpt_with_onnx_quant.zip
To check their metrics, run the following commands:
python test.py configs/tusimple_res18.py --model_ckpt checkpoints/baseline/model_best.pth
python test.py configs/tusimple_res18.py --model_ckpt checkpoints/x2_jetson/model_best.pth
python test.py configs/tusimple_res18.py --model_ckpt checkpoints/x3_jetson/model_best.pth
To check metrics of the ONNX models, run:
python test.py configs/tusimple_res18.py --onnx_path checkpoints/baseline/model_best.onnx --batch_size 1
python test.py configs/tusimple_res18.py --onnx_path checkpoints/x2_jetson/model_best.onnx --batch_size 1
python test.py configs/tusimple_res18.py --onnx_path checkpoints/x3_jetson/model_best.onnx --batch_size 1
NOTE: We recommend checking the metric for `quantized_model.onnx` on a target device (see our instructions in Validation on Jetson AGX Orin device).
To check their latency, run the following commands:
python measure.py --model_ckpt checkpoints/baseline/model_best.pth --host <jetson-server-host> --port 15003
python measure.py --model_ckpt checkpoints/x2_jetson/model_best.pth --host <jetson-server-host> --port 15003
python measure.py --model_ckpt checkpoints/x3_jetson/model_best.pth --host <jetson-server-host> --port 15003
python measure.py --onnx checkpoints/x3_jetson/quantized_model.onnx --host <jetson-server-host> --port 15003
To make sure that your model accuracy is not affected by computations in FP16 or INT8 on the Jetson device, follow this validation pipeline:
- Create a dataset in the pickle format:
  python pickle_dataset.py configs/tusimple_res18.py --pickle_data_path pickle_data
- Send an ONNX model, `pickle_data`, and `inference_on_device.py` to the Jetson device using `scp`:
  scp -P <jetson-port> -r path/to/model.onnx pickle_data inference_on_device.py <user-name>@<jetson-host>:/your/location/
- Install the OnnxRuntime package with the TensorRT backend using the following commands (a quick provider check is sketched after this list):
  wget https://nvidia.box.com/shared/static/mvdcltm9ewdy2d5nurkiqorofz1s53ww.whl -O onnxruntime_gpu-1.15.1-cp38-cp38-linux_aarch64.whl
  pip3 install onnxruntime_gpu-1.15.1-cp38-cp38-linux_aarch64.whl
- Run inference on the pickled dataset from the directory with the previously copied ONNX model, `pickle_data`, and `inference_on_device.py`:
  python3 inference_on_device.py -m your/model.onnx -i pickle_data -o out_pickle --device jetson
- Send the resulting `out_pickle` directory to the `ultra-fast-lane-detector-v2` repository root on your PC using `scp`, then check the metrics:
  scp -P <pc-port> -r out_pickle <user-name>@<pc-host>:/path/to/ultra-fast-lane-detector-v2/
  python test_on_pickles.py configs/tusimple_res18.py --batch_size 1 --pickled_inference_results out_pickle
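To confirm that the OnnxRuntime build installed above actually exposes the TensorRT backend on the Jetson device, you can run this small optional check (the provider names are standard onnxruntime identifiers):
# Run on the Jetson device after installing the onnxruntime_gpu wheel.
import onnxruntime as ort

print(ort.get_available_providers())  # 'TensorrtExecutionProvider' should appear in this list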
To optimize a model by latency for Texas Instruments (TI), you need to run a latency server on the TI device and a compile server on an x86 PC (Linux OS). The compile server creates binaries for a model and sends them to the latency server, which measures the model latency using these binaries. Use our instructions to run the latency server and the compile server.
NOTE: Substitute `--host` and `--port` in the commands and `.sh` scripts below with the host and port of your compile server on the x86 PC.
To optimize a model by latency for TI, run the corresponding script (x4 means latency acceleration):
bash commands/x4_ti/prune.sh
After pruning, the model should be tuned with the following command:
bash commands/x4_ti/tune.sh
Use this command to verify the optimized model accuracy:
bash commands/x4_ti/test.sh
Use this command to verify the optimized model latency:
bash commands/x4_ti/measure.sh
Download our checkpoints from Google Drive.
To extract the checkpoints, use the following command:
unzip ufld_ckpt_with_onnx_quant.zip
To check their metrics, run the following commands:
python test.py configs/tusimple_res18.py --model_ckpt checkpoints/baseline/model_best.pth
python test.py configs/tusimple_res18.py --model_ckpt checkpoints/x3_ti/model_best.pth
python test.py configs/tusimple_res18.py --model_ckpt checkpoints/x4_ti/model_best.pth
NOTE: The model `checkpoints/x3_ti/model_best.pth` was obtained on Jetson (it is `checkpoints/x2_jetson/model_best.pth`) and achieves x3 acceleration on the TI device.
To check metrics of the ONNX models, run:
python test.py configs/tusimple_res18.py --onnx_path checkpoints/baseline/model_best.onnx --batch_size 1
python test.py configs/tusimple_res18.py --onnx_path checkpoints/x3_ti/model_best.onnx --batch_size 1
python test.py configs/tusimple_res18.py --onnx_path checkpoints/x4_ti/model_best.onnx --batch_size 1
To check their latency on TI, run the following commands:
python measure.py --model_ckpt checkpoints/baseline/model_best.pth --host <compile-server-host> --port 15003 --ti_server
python measure.py --model_ckpt checkpoints/x3_ti/model_best.pth --host <compile-server-host> --port 15003 --ti_server
python measure.py --model_ckpt checkpoints/x4_ti/model_best.pth --host <compile-server-host> --port 15003 --ti_server
The TI NPU performs computations in the FX8 data type (8-bit fixed-point numbers). To make sure that your model accuracy is not affected by computations in FX8, follow this validation pipeline:
- Create a dataset in the pickle format:
  python pickle_dataset.py configs/tusimple_res18.py --pickle_data_path pickle_data
- Download the calibration data from Google Drive to the repository root.
- Create model artifacts for the TI NPU using this calibration data:
  python compile_model.py -m <your-checkpoint.pth> -c ufldv2_calibration.zip -o compiled_artifacts --host <compilation-server-host> --port <compilation-server-port>
  NOTE: Make sure that your compilation server is up to date with the latest version from the repository.
  NOTE: It takes about 60 minutes to calibrate and compile the baseline model on our x86 PC.
- Send `compiled_artifacts`, `pickle_data`, and `inference_on_device.py` to the TI device using `scp`:
  scp -P <ti-port> -r compiled_artifacts pickle_data inference_on_device.py <user-name>@<ti-host>:/your/location/
- Run inference on the pickled dataset from the directory with the previously copied `compiled_artifacts`, `pickle_data`, and `inference_on_device.py`:
  TIDL_TOOLS_PATH=/opt/latency_server/tidl_tools python3 inference_on_device.py -m compiled_artifacts -i pickle_data -o out_pickle --device ti
- Send the resulting `out_pickle` directory to the `ultra-fast-lane-detector-v2` repository root on your PC using `scp`, then check the metrics:
  scp -P <pc-port> -r out_pickle <user-name>@<pc-host>:/path/to/ultra-fast-lane-detector-v2/
  python test_on_pickles.py configs/tusimple_res18.py --batch_size 1 --pickled_inference_results out_pickle