Performance Optimizations

Moved the final scaling and uint8 quantization to the GPU, reducing CPU load and main-memory bandwidth consumption (2.5x speed-up).
Instruct FFMPEG to use RGB frames instead of BGR, so there is no need to swap channels.
Batched inference, controlled by the --batch & --batches parameter (default: 4).
Instruct torch to make tensors contiguous after the BCHW -> BHWC transform on the GPU, so there is no need to copy the buffer before writing to FFMPEG. This reduced output IO time by 10x.
Use the NVENC pipeline, when available, to decode and encode the images when piping inputs.
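
The GPU-side post-processing described above (scaling, uint8 quantization, the BCHW -> BHWC transform and the contiguous layout) can be illustrated roughly as follows. This is a minimal sketch, not the project's actual code: the function name `frames_to_rgb_buffer` is hypothetical, and it assumes the model output is a float tensor in RGB channel order with values in [0, 1].

```python
import torch


def frames_to_rgb_buffer(batch: torch.Tensor) -> memoryview:
    """Turn a float BCHW batch into a flat, packed RGB24 byte buffer.

    Hypothetical helper: assumes RGB channel order and values in [0, 1].
    """
    # Scale and quantize on the tensor's own device (ideally the GPU), so
    # one byte per channel is transferred to the host instead of four.
    quantized = batch.detach().clamp(0.0, 1.0).mul(255.0).round().to(torch.uint8)

    # permute() only changes strides; contiguous() materializes the packed
    # BHWC layout on the device, before the device-to-host transfer.
    packed = quantized.permute(0, 2, 3, 1).contiguous()

    # The NumPy array shares memory with the CPU tensor, and reshape(-1) on
    # a contiguous array is a view, so no extra host-side copy is made here.
    return memoryview(packed.cpu().numpy().reshape(-1))


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    demo = torch.rand(4, 3, 8, 8, device=device)  # stand-in for a model output batch
    buffer = frames_to_rgb_buffer(demo)
    print(len(buffer), "bytes ready to write")
```

The returned buffer can be passed directly to the `write()` method of a pipe opened in binary mode, for example the stdin of an FFMPEG subprocess that was told to expect raw RGB frames.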
