This repository contains several cleaned Malayalam ASR (Automatic Speech Recognition) corpora and points to other openly available datasets that can be used directly to train speech recognition models.
All credit goes to the respective owners who put in the effort to create them.
Please get in touch about any ownership-related issues or if you wish to contribute a new dataset.
Datasets that contain audio chunks and their corresponding transcriptions.
- Crowd-sourced
- ~25 speakers
- ~2+ Hours
- ~2200 utterances
- Audio has unnecessary pauses at the beginning
- Repo contains a cleaned version of the original dataset found here
- Expect minor mismatches in a few audio files, even after cleanup
- Studio recorded
- 2 speakers
- 8601 utterances
- 13 hours 58 minutes 20 seconds
- 48 kHz sampling rate
- Studio Recorded HQ
- ~40 speakers
- ~5 Hours
- 4100 utterances
- 48 kHz sampling rate
- 1 speaker
- 1000 utterances
- 1 hour 38 minutes
- 16 kHz sampling rate
- Recorded by volunteers on mobile devices in natural home, office, and street environments
- 75 speakers
- 1541 utterances
- 1 hour 38 minutes 16 seconds
- 48 kHz sampling rate
- Available on request
- Studio Recorded
- 1 Speaker
- Paid dataset ( sample available for free )
- ~18+ hours
- English content included
- Paid dataset ( sample available for free )
- ~17+ hours
- English content included
- Paid Dataset ( sample available for free )
- 164+ Hours
- Complete metadata available
- 458 speakers (231 Female and 227 Male)
- 43670 utterances
- Not openly available and unreachable
- Sample is not available
- Domain specific - Agriculture
- Claims to have 250+ hours of data with proper metadata
Chunks of meaningful Malayalam audio that have no text transcriptions.
- 600+ Hours
- Wide variety of speakers
- Machine-generated
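
Some entries above note clip-level issues, such as unnecessary pauses at the beginning of the crowd-sourced recordings, and the listed datasets use different sampling rates (16 kHz and 48 kHz). The following is a minimal sketch, not the cleanup procedure used for this repository, of how one might check a clip's sampling rate and drop its leading silence using only the Python standard library. The function names and the amplitude threshold of 500 are hypothetical choices for illustration, and the sketch assumes 16-bit mono PCM WAV input.

```python
import array
import sys
import wave


def first_loud_sample(samples, threshold=500):
    """Return the index of the first sample whose magnitude exceeds threshold.

    Returns len(samples) if every sample is at or below the threshold.
    """
    for index, value in enumerate(samples):
        if abs(value) > threshold:
            return index
    return len(samples)


def trim_leading_silence(src, dst, threshold=500):
    """Copy a 16-bit mono PCM WAV from src to dst without its leading silence.

    src and dst may be file paths or binary file-like objects.
    Returns (sampling_rate_hz, seconds_removed).
    """
    with wave.open(src, "rb") as reader:
        if reader.getsampwidth() != 2 or reader.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        rate = reader.getframerate()
        samples = array.array("h", reader.readframes(reader.getnframes()))

    # WAV data is little-endian; convert on big-endian machines.
    if sys.byteorder == "big":
        samples.byteswap()

    start = first_loud_sample(samples, threshold)
    kept = samples[start:]

    # Convert back to little-endian before writing, if needed.
    if sys.byteorder == "big":
        kept.byteswap()

    with wave.open(dst, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(rate)
        writer.writeframes(kept.tobytes())

    return rate, start / rate
```

The returned sampling rate makes it easy to spot clips that would need resampling before being mixed with data from another dataset. A fixed amplitude threshold is a crude silence detector; noisy recordings would likely need an energy-based or voice-activity approach instead.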