feat: use whisper as ML model #26

Draft
wants to merge 2 commits into base: main
Conversation

FineFindus

Replaces vosk with faster_whisper, resulting in much faster audio transcriptions. On my system the same audio file took multiple hours with vosk and less than 10 minutes with the patch.
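
For context, a minimal sketch of what a faster_whisper transcription loop looks like (this is not the code from the patch; the model name and audio path are placeholders):

```python
# Hedged sketch, not the actual patch: basic faster_whisper usage.
from faster_whisper import WhisperModel

# "small" and the audio path are placeholders; the model is downloaded
# automatically on first use.
model = WhisperModel("small", device="auto")
segments, info = model.transcribe("audiobook.mp3")

for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```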

faster_whisper is much faster than vosk (less than 10 minutes, as
opposed to hours). It also handles downloading the required models,
allowing us to drop the download option.
Whisper also correctly capitalizes words, so the chapter markers needed
to be updated to account for that, as well as for it transcribing
"Prologue" incorrectly.

fix: captilize words, add prolog

GNO21 commented Nov 24, 2024

It does appear faster, though there's a lot of work that would need to be done to adopt this.

  1. Option changes need to be made for whisper, including model selection. The model is hard-coded at the moment.
  2. Installation instructions for whisper, which can be a pain depending on what versions of Python and libraries you have installed.
  3. Faster-whisper is different than whisper and needs to be installed separately.
  4. Instructions/options for CPU vs GPU usage of whisper.
  5. It appears to do transcription in roughly 30-second chunks, which isn't high enough resolution to find a chapter boundary. The current vosk solution transcribes in 2-3 second chunks. I found https://github.com/linto-ai/whisper-timestamped, which might work, though it's not clear if it works with faster-whisper (see the word-timestamp sketch after this comment).

This has a lot of potential. Whisper is pretty awesome.
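
Regarding point 5, faster_whisper can emit per-word timestamps via the word_timestamps option, which might make whisper-timestamped unnecessary. A hedged sketch (untested here; model name, audio path, and the "chapter" keyword are placeholders):

```python
from faster_whisper import WhisperModel

model = WhisperModel("small", device="auto")
# word_timestamps=True yields per-word start/end times inside each segment,
# which should be fine-grained enough to locate a chapter-marker keyword.
segments, _ = model.transcribe("audiobook.mp3", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        if "chapter" in word.word.lower():
            print(f"possible chapter marker at {word.start:.2f}s: {word.word}")
```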

@FineFindus (Author)

  1. Help is welcome :)
  2. faster_whisper handles the download of the models (see the sketch below).
  3. See 2.
  4. This should also be handled by faster_whisper, but I can't verify it.
  5. I'm not sure what you mean by this. I've been running it like this for about half a year and have had no problems with timestamp accuracy compared to vosk.
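
On the model download, a small hedged sketch: faster_whisper fetches and caches the converted model on first use, and the download_root parameter (part of WhisperModel's constructor) only controls where it is stored; the path below is a placeholder.

```python
from faster_whisper import WhisperModel

# The model is downloaded automatically on first use and cached;
# download_root changes the cache location (placeholder path).
model = WhisperModel("small", download_root="/tmp/whisper-models")
```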

FineFindus marked this pull request as a draft on November 24, 2024 at 20:22

GNO21 commented Nov 24, 2024

  1. I need to learn more about how faster_whisper works and how it modifies what whisper does :)

  2. There are different inputs to the WhisperModel() function to enable CUDA. Whisper itself has different flags to install for GPU support. I don't know enough about faster_whisper. I get 0% GPU usage with the current flags. I can get around 90% GPU usage with WhisperModel(model_size, device="cuda", compute_type="float32") on an NVIDIA GPU (see the sketch after this list).

  3. If you look at the text output (.srt file), it puts a single timestamp every 30s or more with faster-whisper. Vosk puts a timestamp every 2-3s or so. This means that when the program goes through and searches for keywords, the closest timestamp when using the whisper model can be around 30s from the actual found "chapter marker" word.
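
For point 2, a sketch of the call GNO21 describes, which reportedly drives GPU usage to around 90% on an NVIDIA card (the model size and audio path are placeholders; float16 would likely also work and use less VRAM):

```python
from faster_whisper import WhisperModel

model_size = "small"  # placeholder; the PR currently hard-codes the model

# Force CUDA instead of relying on auto-detection; float32 is what was
# reported to work, float16 is usually sufficient.
model = WhisperModel(model_size, device="cuda", compute_type="float32")
segments, _ = model.transcribe("audiobook.mp3")
```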

@FineFindus (Author)

> There are different inputs to the WhisperModel() function to enable CUDA. Whisper itself has different flags to install for GPU support. I don't know enough about faster_whisper. I get 0% GPU usage with the current flags. I can get around 90% GPU usage with WhisperModel(model_size, device="cuda", compute_type="float32") on an NVIDIA GPU.

The default device is set to auto, so I would've assumed it used CUDA when available?
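
One hedged way to check what "auto" actually resolves to: CTranslate2 (the backend faster_whisper runs on) reports how many CUDA devices it can see; if that count is 0 (for example because cuDNN/cuBLAS are missing), "auto" silently falls back to the CPU.

```python
import ctranslate2

# 0 means CTranslate2 cannot use the GPU, so device="auto" resolves to CPU.
print(ctranslate2.get_cuda_device_count())
```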
