feat: use whisper as ML model #26

Draft
wants to merge 2 commits into base: main
Conversation

FineFindus

Replaces vosk with faster_whisper, resulting in much faster audio transcriptions. On my system the same audio file took multiple hours with vosk and less than 10 minutes with the patch.
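
For context, a minimal sketch of what a faster_whisper transcription loop looks like (this is not the code from the patch; the model name and audio path are placeholders):

```python
# Hedged sketch, not the actual patch: basic faster_whisper usage.
from faster_whisper import WhisperModel

# "small" and the audio path are placeholders; the model is downloaded
# automatically on first use.
model = WhisperModel("small", device="auto")
segments, info = model.transcribe("audiobook.mp3")

for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```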

faster_whisper is much faster than vosk (less than 10 minutes, as
opposed to hours). It also handles downloading the required models,
allowing us to drop the download option.
Whisper also correctly capitalizes words, so the chapter markers needed
to be updated to account for that, as well as for it transcribing
"Prologue" incorrectly.

fix: captilize words, add prolog

GNO21 commented Nov 24, 2024

It does appear faster, though there's a lot of work that would need to be done to adopt this.

  1. Option changes need to be made for whisper, including model selection. The model is hard-coded at the moment.
  2. Installation instructions for whisper, which can be a pain depending on what versions of Python and libraries you have installed.
  3. Faster-whisper is different than whisper and needs to be installed separately.
  4. Instructions/options for CPU vs GPU usage of whisper.
  5. It appears to do transcription in roughly 30-second chunks, which isn't high enough resolution to find a chapter boundary. The current vosk solution transcribes in 2-3 second chunks. I found https://github.com/linto-ai/whisper-timestamped, which might work, though it's not clear if it works with faster-whisper (see the word-timestamp sketch after this comment).

This has a lot of potential. Whisper is pretty awesome.
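
Regarding point 5, faster_whisper can emit per-word timestamps via the word_timestamps option, which might make whisper-timestamped unnecessary. A hedged sketch (untested here; model name, audio path, and the "chapter" keyword are placeholders):

```python
from faster_whisper import WhisperModel

model = WhisperModel("small", device="auto")
# word_timestamps=True yields per-word start/end times inside each segment,
# which should be fine-grained enough to locate a chapter-marker keyword.
segments, _ = model.transcribe("audiobook.mp3", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        if "chapter" in word.word.lower():
            print(f"possible chapter marker at {word.start:.2f}s: {word.word}")
```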

@FineFindus (Author)

  1. Help is welcome :)
  2. faster_whisper handles the download of the models (see the sketch below).
  3. See 2.
  4. This should also be handled by faster_whisper, but I can't verify it.
  5. I'm not sure what you mean by this. I've been running it like this for about half a year and have had no problems with timestamp accuracy compared to vosk.
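
On the model download, a small hedged sketch: faster_whisper fetches and caches the converted model on first use, and the download_root parameter (part of WhisperModel's constructor) only controls where it is stored; the path below is a placeholder.

```python
from faster_whisper import WhisperModel

# The model is downloaded automatically on first use and cached;
# download_root changes the cache location (placeholder path).
model = WhisperModel("small", download_root="/tmp/whisper-models")
```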

FineFindus marked this pull request as a draft on November 24, 2024 at 20:22

GNO21 commented Nov 24, 2024

  1. I need to learn more about how faster_whisper works and how it modifies what whisper does :)

  2. There are different inputs to the WhisperModel() function to enable CUDA. Whisper itself has different flags to install for GPU support. I don't know enough about faster_whisper. I get 0% GPU usage with the current flags. I can get around 90% GPU usage with WhisperModel(model_size, device="cuda", compute_type="float32") on an NVIDIA GPU (see the sketch after this list).

  3. If you look at the text output (.srt file), it puts a single timestamp every 30s or more with faster-whisper. Vosk puts a timestamp every 2-3s or so. This means that when the program goes through and searches for keywords, the closest timestamp when using the whisper model can be around 30s from the actual found "chapter marker" word.
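
For point 2, a sketch of the call GNO21 describes, which reportedly drives GPU usage to around 90% on an NVIDIA card (the model size and audio path are placeholders; float16 would likely also work and use less VRAM):

```python
from faster_whisper import WhisperModel

model_size = "small"  # placeholder; the PR currently hard-codes the model

# Force CUDA instead of relying on auto-detection; float32 is what was
# reported to work, float16 is usually sufficient.
model = WhisperModel(model_size, device="cuda", compute_type="float32")
segments, _ = model.transcribe("audiobook.mp3")
```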

@FineFindus (Author)

> There are different inputs to the WhisperModel() function to enable CUDA. Whisper itself has different flags to install for GPU support. I don't know enough about faster_whisper. I get 0% GPU usage with the current flags. I can get around 90% GPU usage with WhisperModel(model_size, device="cuda", compute_type="float32") on an NVIDIA GPU.

The default device is set to auto, so I would've assumed it used CUDA when available?
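
One hedged way to check what "auto" actually resolves to: CTranslate2 (the backend faster_whisper runs on) reports how many CUDA devices it can see; if that count is 0 (for example because cuDNN/cuBLAS are missing), "auto" silently falls back to the CPU.

```python
import ctranslate2

# 0 means CTranslate2 cannot use the GPU, so device="auto" resolves to CPU.
print(ctranslate2.get_cuda_device_count())
```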
