blya_bot

Blya - Russian expletive "shit" (figuratively). Pronounced like "bla", but with a softer "L". It's what you might say if your car won't start in the morning, and you're going to be late for work. (Source: Urban Dictionary)

This is a Telegram bot that transcribes voice and video notes into text using automatic speech recognition (ASR) models, with one purpose: to decide who uses more "curse words"...

How it works?

This bot has 3 main parts:

Speech recognition engine (vosk and whisper)
Dictionary generator with a custom DSL and morphological expansion (pymorphy2), which is used to generate all variations of "abusive language" words
Pattern searching (based on the Aho-Corasick algorithm, via ahocorapy), which is used to find all "abusive language" words in the transcribed text

When the bot starts, it loads a dictionary file, which can be specified by the user, and generates all word variants by parsing the dictionary's DSL.

Then morphological analysis is applied to each word. After this step, all morphological variants of the word are added to the dictionary.

Then the dictionary is converted into an Aho-Corasick automaton, which is used for pattern matching.
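
To illustrate the pattern-matching step, here is a minimal pure-Python sketch of an Aho-Corasick automaton. It is not the bot's code and not the ahocorapy implementation; the class name `AhoCorasick` and the method `find_all` are hypothetical names chosen for this example.

```python
from collections import deque


class AhoCorasick:
    """Minimal Aho-Corasick automaton (illustrative, hypothetical names)."""

    def __init__(self, words):
        # The trie is stored as parallel lists; node 0 is the root.
        self.goto = [{}]   # per-node outgoing edges: char -> node id
        self.fail = [0]    # per-node failure link
        self.out = [[]]    # per-node words that end at this node
        for word in words:
            self._add(word)
        self._build_failure_links()

    def _add(self, word):
        node = 0
        for ch in word:
            nxt = self.goto[node].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[node][ch] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            node = nxt
        self.out[node].append(word)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 nodes keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                queue.append(child)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit the words that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]

    def find_all(self, text):
        """Yield (word, start_index) for every dictionary word found in text."""
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for word in self.out[node]:
                yield word, i - len(word) + 1


if __name__ == "__main__":
    automaton = AhoCorasick(["he", "she", "his", "hers"])
    print(list(automaton.find_all("ushers")))
```

The text is scanned once regardless of dictionary size, which is why this family of algorithms suits a dictionary that grows large after DSL and morphological expansion. Note that this sketch also reports matches inside longer words.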

In the next step, the speech recognition engine is initialized. The application loads the model into memory and is then ready to process requests.

You can send a voice or video note to this bot, and it will perform the following steps:

Transcribe the voice into text
Search for all "abusive language" words in the transcribed text
Generate a summary of the "abusive language" words used
Reply with the "abusive language" summary
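
The summary step above can be sketched as a simple frequency count over the matched words. The function name `summarize` and the reply wording are assumptions made for this example; the bot's actual reply format may differ.

```python
from collections import Counter


def summarize(matched_words):
    """Build a plain-text summary from the words found in a transcript.

    `matched_words` is any iterable of matched dictionary words
    (hypothetical input shape, one entry per occurrence).
    """
    counts = Counter(matched_words)
    if not counts:
        return "No abusive language found."
    # most_common() sorts by count, highest first.
    lines = [f"{word}: {n}" for word, n in counts.most_common()]
    return f"Abusive words found: {sum(counts.values())}\n" + "\n".join(lines)


if __name__ == "__main__":
    print(summarize(["damn", "heck", "damn"]))
```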

You can also add the bot to a group, and it will automatically respond to all voice and video notes with an "abusive language" summary.

Limitations

Started as a simple joke, this project was written specifically for the Russian language.

Although the speech recognition models support many languages, the bot's source code has some internal limitations. At this stage, only Russian has been tested. Check the TODO section for more information about multi-language support progress.

For example, morphological analysis is done by pymorphy2, which is installed here with Russian dictionaries only (pymorphy2-dicts-ru).

If you want to try this bot with another language, just remove the pymorphy2 and pymorphy2-dicts-ru packages from the installation. This disables morphological analysis for dictionaries, but you will be able to try the bot with different speech recognition models.

Dictionary DSL

#<anything> - comment
!<word> - disables morphing for this word. This token is global and can be placed anywhere in the word.
  Expansions will also contain this token, so morphing is disabled for the variants too
~<word> - excludes the word from the dictionary. Applied after variant generation and morphing. Must be placed at the start of the word.
[...|...] - expands to the word with extra variants (suffixes, prefixes).
  This also includes the word without these elements, e.g.: he[llo|ll] -> he, hello, hell
  Can be used with a single variant, e.g.: bad[ass] -> bad, badass
{...|...} - expands to variants
  This does not include the word without these elements, e.g.: he{llo|ll} -> hello, hell
  Using a single element in a {} group is pointless, e.g.: he{llo} -> hello
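
The expansion rules above can be illustrated with a short sketch. This is not the bot's parser: the names `expand` and `build_word_set` are hypothetical, comments are assumed to occupy a whole line, nested groups are not handled, and morphing is not modelled, so the `!` token is simply dropped.

```python
import itertools
import re

# Matches either an optional group "[a|b]" or a required group "{a|b}".
_GROUP = re.compile(r"\[([^\]]*)\]|\{([^}]*)\}")


def expand(pattern: str) -> list[str]:
    """Expand one DSL word into its variants, keeping first-seen order."""
    parts: list[list[str]] = []
    pos = 0
    for match in _GROUP.finditer(pattern):
        if match.start() > pos:
            parts.append([pattern[pos:match.start()]])  # literal text
        if match.group(1) is not None:
            # [...] also allows the word without any of the listed elements
            parts.append([""] + match.group(1).split("|"))
        else:
            # {...} requires exactly one of the listed elements
            parts.append(match.group(2).split("|"))
        pos = match.end()
    if pos < len(pattern):
        parts.append([pattern[pos:]])
    variants = ("".join(combo) for combo in itertools.product(*parts))
    return list(dict.fromkeys(variants))  # de-duplicate, keep order


def build_word_set(lines: list[str]) -> set[str]:
    """Apply comments, exclusions and expansion to a list of DSL lines."""
    included: set[str] = set()
    excluded: set[str] = set()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment (assumed to span the whole line)
        target = included
        if line.startswith("~"):
            target, line = excluded, line[1:]
        # "!" only disables morphing, which this sketch does not model
        target.update(expand(line.replace("!", "")))
    # Exclusions are applied after all variants have been generated
    return included - excluded


if __name__ == "__main__":
    print(expand("he[llo|ll]"))
    print(sorted(build_word_set(["# greeting", "he[llo|ll]", "~hell"])))
```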

Build & Run

Install app dependencies

You can install blya-bot locally with Poetry:

$ # Virtualenv recommended:
$ python3 -m venv ./.venv
$ source ./.venv/bin/activate
$ # Install Poetry, if you don't have it
$ pip install poetry
$ # Install blya-bot dependencies
$ poetry install --no-root --all-extras

Obtaining speech recognition models

Next, you need to obtain the speech recognition model files.

The current "blya_bot" implementation supports multiple speech recognition engines:

vosk
whisper via faster-whisper
whisper via pywhispercpp bindings for whisper.cpp

Vosk

You can download models for vosk from https://alphacephei.com/vosk/models

Place the model in any folder you like and specify the path to it during configuration.

Faster-Whisper

Models are downloaded automatically on the first application start.

The default models folder is ~/.cache/huggingface/hub.

Pywhispercpp

Models are downloaded automatically on the first application start.

The default models folder is ~/.local/share/pywhispercpp/models.

Configuration

When all dependencies are installed and the model files are obtained, you need to configure the settings.

Create a .env file in this directory and populate the required options:

Vosk

TELEGRAM_BOT_TOKEN="<YOUR TOKEN>"

RECOGNITION_ENGINE="vosk"
# Currently there is only one option, `model_path`: the path to the downloaded model
RECOGNITION_ENGINE_OPTIONS='{"model_path": "/path/to/vosk-model"}'

Faster-Whisper

TELEGRAM_BOT_TOKEN="<YOUR TOKEN>"

RECOGNITION_ENGINE="faster-whisper"
# Required fields: `model` and `language`
RECOGNITION_ENGINE_OPTIONS='{"model": "small", "language": "ru", "device": "cpu", "compute_type": "int8", "beam_size": 5}'

Pywhispercpp

TELEGRAM_BOT_TOKEN="<YOUR TOKEN>"

RECOGNITION_ENGINE="pywhispercpp"
# Required fields: `model` and `language`
RECOGNITION_ENGINE_OPTIONS='{"model": "small", "language": "ru"}'
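
In all three examples RECOGNITION_ENGINE_OPTIONS holds a JSON object. As a rough sketch of how such settings could be read and validated (not the bot's actual settings code; the function `load_engine_settings` and the `REQUIRED_OPTIONS` table are hypothetical, with the required fields taken from the comments above):

```python
import json
import os

# Required option names per engine, as listed in the examples above.
REQUIRED_OPTIONS = {
    "vosk": {"model_path"},
    "faster-whisper": {"model", "language"},
    "pywhispercpp": {"model", "language"},
}


def load_engine_settings(env=os.environ):
    """Return (engine_name, options_dict) from a mapping of settings."""
    engine = env.get("RECOGNITION_ENGINE")
    if engine not in REQUIRED_OPTIONS:
        raise ValueError(f"Unsupported RECOGNITION_ENGINE: {engine!r}")

    # The options are stored as a JSON object in a single variable.
    options = json.loads(env.get("RECOGNITION_ENGINE_OPTIONS", "{}"))
    if not isinstance(options, dict):
        raise ValueError("RECOGNITION_ENGINE_OPTIONS must be a JSON object")

    missing = REQUIRED_OPTIONS[engine] - options.keys()
    if missing:
        raise ValueError(f"Missing options for {engine}: {sorted(missing)}")
    return engine, options


if __name__ == "__main__":
    example = {
        "RECOGNITION_ENGINE": "faster-whisper",
        "RECOGNITION_ENGINE_OPTIONS": '{"model": "small", "language": "ru"}',
    }
    print(load_engine_settings(example))
```

Validating the options at startup surfaces a malformed JSON string or a missing field before any model is loaded.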

When all required fields are configured, you can run the application:

$ python main.py # or python -m blya_bot

Docker images

This repository also includes Dockerfiles that can be used to build all-in-one blya-bot images. These images contain the bot sources and the desired speech recognition model.

The images can be distributed and work without extra volumes.

Vosk - based

Build:

Pass the URL of the vosk model as the MODEL_URL build arg.

docker build --build-arg MODEL_URL="https://alphacephei.com/vosk/models/vosk-model-small-ru-0.22.zip" -t blya_bot:vosk -f dockerfiles/vosk.Dockerfile .

Run:

docker run --env TELEGRAM_BOT_TOKEN="..." blya_bot:vosk

Faster-Whisper - based

Build:

Pass MODEL and LANG build args.

docker build --build-arg MODEL=small --build-arg LANG=ru -t blya_bot:faster-whisper -f dockerfiles/faster-whisper.Dockerfile .

Run:

docker run --env TELEGRAM_BOT_TOKEN="..." blya_bot:faster-whisper

Pywhispercpp - based

Build:

Pass MODEL and LANG build args.

docker build --build-arg MODEL=small --build-arg LANG=ru -t blya_bot:pywhispercpp -f dockerfiles/pywhispercpp.Dockerfile .

Run:

docker run --env TELEGRAM_BOT_TOKEN="..." blya_bot:pywhispercpp

TODO

Contributing

Fork it
Clone it: git clone https://github.com/dokzlo13/blya_bot.git
Create your feature branch: git checkout -b my-new-feature
Make changes and add them: git add .
Commit: git commit -m 'My awesome feature'
Push: git push origin my-new-feature
Open a pull request
