
Repository contains various Malayalam ASR based resources curated from multiple sources


aswinpradeep/malayalam-asr-datasets


Malayalam Speech Datasets

The repository contains several cleaned Malayalam ASR (Automatic Speech Recognition) corpora and points to other openly available datasets that are directly usable for training speech recognition models.

All credit goes to the respective owners who put in the effort to create them.

Please get in touch about any ownership-related issues or if you wish to contribute a new dataset.

Labelled Datasets

Datasets that contain audio chunks and their corresponding transcriptions.

NISP DATASET

  • Crowd-sourced
  • ~25 speakers
  • ~2+ Hours
  • ~2200 utterances
  • Audio has unnecessary pauses at the beginning
  • Repo contains a cleaned version of the original dataset found here
  • Expect minor mismatches in a very small number of audio files, even after cleanup
  • Studio recorded
  • 2 speakers
  • 8601 utterances
  • 13 hours 58 minutes 20 seconds
  • 48 kHz sampling rate
  • Studio Recorded HQ
  • ~40 speakers
  • ~5 Hours
  • 4100 utterances
  • 48 kHz sampling rate
  • 1 speaker
  • 1000 utterances
  • 1 hour 38 minutes
  • 16 kHz sampling rate
  • Recorded by volunteers in natural home/office/street environments with mobile devices
  • 75 speakers
  • 1541 utterances
  • 1 hour 38 minutes 16 seconds
  • 48 kHz sampling rate
  • Available on request
  • Studio Recorded
  • 1 Speaker
  • Paid dataset ( sample available for free )
  • ~18+ hours
  • English content included
  • Paid dataset ( sample available for free )
  • ~17+ hours
  • English content included
  • Paid Dataset ( sample available for free )
  • 164+ Hours
  • Complete metadata available
  • 458 speakers (231 Female and 227 Male)
  • 43670 utterances
  • Not openly available and unreachable
  • Sample is not available
  • Domain specific - Agriculture
  • Claims to have 250+ hours of data with proper metadata
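
One of the entries above notes that its audio has unnecessary pauses at the beginning. As a rough illustration of how such leading silence could be removed, here is a minimal Python sketch using only the standard library. It is not the procedure used to produce the cleaned data in this repository. The function names, the amplitude threshold of 500, and the 100 ms of retained padding are all assumptions chosen for the example, and the file helper only handles mono 16-bit PCM WAV files.

```python
import sys
import wave
from array import array


def trim_leading_silence(samples, threshold=500, keep=0):
    """Drop samples before the first one whose absolute value exceeds threshold.

    `keep` samples of the preceding quiet region are retained as padding.
    Both defaults are illustrative assumptions, not tuned values.
    """
    for index, value in enumerate(samples):
        if abs(value) > threshold:
            return samples[max(0, index - keep):]
    return samples[:0]  # nothing exceeded the threshold


def trim_wav(src_path, dst_path, threshold=500, keep_ms=100):
    """Trim leading silence from a mono 16-bit PCM WAV file (hypothetical helper).

    Returns the number of samples removed.
    """
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 1 or src.getsampwidth() != 2:
            raise ValueError("this sketch only handles mono 16-bit PCM")
        rate = src.getframerate()
        samples = array("h")
        samples.frombytes(src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    trimmed = trim_leading_silence(samples, threshold, keep=rate * keep_ms // 1000)
    removed = len(samples) - len(trimmed)

    if sys.byteorder == "big":
        trimmed.byteswap()  # convert back before writing
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(trimmed.tobytes())
    return removed
```

A fixed amplitude threshold is the simplest possible choice and will misfire on noisy recordings; an energy-based or voice-activity-detection approach is likely a better fit for audio recorded outside a studio.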

Unlabelled Datasets

Chunks of meaningful Malayalam audio that have no text transcriptions.

  • 600+ Hours
  • Wide variety of speakers
  • Machine-generated
