Malayalam Speech Datasets

This repository contains cleaned Malayalam ASR (Automatic Speech Recognition) corpora and points to several other openly available datasets that are directly usable for training speech recognition models.

All credit goes to the respective owners who put in the effort to create them.

Please get in touch about any ownership-related issues or if you wish to contribute a new dataset.

Labelled Datasets

Datasets that contain audio chunks and their corresponding transcriptions.

NISP DATASET

Crowd-sourced
~25 speakers
~2+ Hours
~2200 utterances
Audio has unnecessary pauses at the beginning
This repo contains a cleaned version of the original dataset (in the NISP_MALAYALAM_CLEANED folder)
A minor mismatch can still be expected in a very small number of audio files, even after cleanup.
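
The leading pauses noted above can be removed with a simple amplitude threshold. The following is a minimal sketch under stated assumptions, not the procedure used to produce the cleaned copy in this repository: the function name `trim_leading_silence`, the threshold of 500, and the idea that a clip is already loaded as a list of signed 16-bit integer samples are all illustrative.

```python
def trim_leading_silence(samples, threshold=500):
    """Drop every sample before the first one whose magnitude exceeds threshold.

    `samples` is assumed to be a list of signed 16-bit PCM values;
    `threshold` is a hypothetical value that should be tuned per corpus.
    """
    for index, value in enumerate(samples):
        if abs(value) > threshold:
            return samples[index:]
    return []  # the whole clip stayed below the threshold


# Toy clip: four near-silent samples followed by speech-level samples.
clip = [0, 3, -2, 10, 1200, -900, 40, 0]
trimmed = trim_leading_silence(clip)
print(trimmed)
```

A fixed threshold is the simplest choice; noisy recordings may need an energy measure over short windows instead of a per-sample check.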

Indic TTS Malayalam corpus

Studio recorded
2 speakers
8601 utterances
13 hours 58 minutes 20 seconds
48 kHz sampling rate

OpenSLR Malayalam

Studio Recorded HQ
~40 speakers
~5 Hours
4100 utterances
48 kHz sampling rate

Festvox IIITH Malayalam database

1 speaker
1000 utterances
1 hour 38 minutes
16 kHz sampling rate

MSC Reviewed speech

Recorded by volunteers in natural home/office/street environment with mobile devices:
75 speakers
1541 utterances
1 hour 38 minutes 16 seconds
48 kHz sampling rate
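
For the corpora above that list both an exact duration and an utterance count, the average utterance length follows by simple division. The sketch below only restates the figures given in this document; the helper names `to_seconds` and `average_utterance_seconds` are illustrative.

```python
def to_seconds(hours, minutes=0, seconds=0):
    """Convert an h/m/s duration into a total number of seconds."""
    return hours * 3600 + minutes * 60 + seconds


def average_utterance_seconds(total_seconds, utterances):
    """Mean utterance length in seconds."""
    return total_seconds / utterances


# (total duration in seconds, utterance count), as listed in this document.
corpora = {
    "Indic TTS Malayalam corpus": (to_seconds(13, 58, 20), 8601),
    "Festvox IIITH Malayalam database": (to_seconds(1, 38), 1000),
    "MSC Reviewed speech": (to_seconds(1, 38, 16), 1541),
}

for name, (total, count) in corpora.items():
    average = average_utterance_seconds(total, count)
    print(f"{name}: {average:.2f} s per utterance")
```

This gives roughly 5.85 s, 5.88 s and 3.83 s per utterance respectively, which can help when choosing a maximum clip length for training.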

IIIT-H

Available on request
Studio Recorded
1 Speaker

NPLT-MALE

Paid dataset (free sample available)
18+ hours
English content included

NPLT-FEMALE

Paid dataset (free sample available)
17+ hours
English content included

Malayalam Raw Speech Corpus

Paid dataset (free sample available)
164+ hours
Complete metadata available
458 speakers (231 female and 227 male)
43670 utterances

IIITM-K

Not openly available, and unreachable
No sample available
Domain-specific: agriculture
Claims to have 250+ hours of data with proper metadata

Unlabelled Datasets

Chunks of meaningful Malayalam audio that have no text transcriptions.

ULCA

600+ Hours
Wide variety of speakers
Machine-generated

aswinpradeep/malayalam-asr-datasets

Folders and files

Latest commit

History

Repository files navigation

Malayalam Speech Datasets

Labelled Datasets

NISP DATASET

Un Labelled Datasets

About

Resources

Stars

Watchers

Forks