CKB Sentences Corpus for TTS and ASR

Overview

The CKB Sentences Corpus is a text dataset for natural language processing (NLP), aimed primarily at Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) systems. It contains 1000 sentences in Central Kurdish (CKB) drawn from 16 topics, giving varied linguistic content for training and evaluating TTS and ASR models.

Corpus Details

The sentences are distributed across the following topics:

Topic	Number of Sentences
History	50
Geography	50
Sports	100
Religious	50
General	50
News	50
Health	50
Weather	50
Arts	50
Science and Technology	50
Poetry	100
Economy	50
Very Common	50
Facebook Comment	50
Government	50
Normal	100
Total	1000

Usage

Text-to-Speech (TTS)

The corpus can be used to train TTS systems by supplying diverse, phonetically rich sentences as recording prompts and training text. Because it spans many topics, a voice trained on it is exposed to varied vocabulary and sentence structures, which helps produce more natural and intelligible synthetic speech for CKB.
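One rough way to check how well a set of prompts covers the writing system is to count character frequencies and look for characters that appear rarely. The sketch below is only an illustration: it treats written characters as a crude proxy for sounds, the function names `char_coverage` and `rare_chars` are hypothetical, and the sample strings are placeholders rather than sentences from the corpus.

```python
from collections import Counter


def char_coverage(sentences):
    """Count how often each non-whitespace character occurs across the sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ch for ch in sentence if not ch.isspace())
    return counts


def rare_chars(counts, min_count=2):
    """Return the characters seen fewer than min_count times, in sorted order."""
    return sorted(ch for ch, n in counts.items() if n < min_count)


# Placeholder strings for illustration only; they are not taken from the corpus.
sample = ["سڵاو", "چۆنی"]
coverage = char_coverage(sample)
print(len(coverage), "distinct characters")
print("rarely seen:", rare_chars(coverage))
```

Characters that show up in `rare_chars` point to places where extra prompts may be worth adding before recording.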

Automatic Speech Recognition (ASR)

For ASR, the corpus can serve as prompt and transcript text for training and evaluating models. The sentences span a wide range of phonetic and syntactic structures, which matters when building ASR systems that must stay robust across different accents and speaking styles in CKB.
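When the corpus is used for both training and evaluation, it can help to hold out a fixed share of every topic so the test set reflects the same topic mix. The following is a minimal sketch under stated assumptions: it assumes each sentence is available as a `(topic, sentence)` pair, which the repository may or may not provide directly, and `stratified_split` is a hypothetical helper, not part of the corpus.

```python
import random
from collections import defaultdict


def stratified_split(items, test_fraction=0.1, seed=0):
    """Split (topic, sentence) pairs into train/test, sampling each topic separately."""
    by_topic = defaultdict(list)
    for topic, sentence in items:
        by_topic[topic].append(sentence)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for topic in sorted(by_topic):
        sentences = by_topic[topic][:]
        rng.shuffle(sentences)
        # Hold out at least one sentence per topic.
        n_test = max(1, round(len(sentences) * test_fraction))
        test += [(topic, s) for s in sentences[:n_test]]
        train += [(topic, s) for s in sentences[n_test:]]
    return train, test


# Placeholder items for illustration; real sentences would come from the corpus.
items = [("History", f"h{i}") for i in range(50)]
items += [("Sports", f"s{i}") for i in range(100)]
train, test = stratified_split(items)
print(len(train), "train /", len(test), "test")
```

Splitting per topic rather than over the whole list avoids a test set that happens to contain, say, no Poetry sentences at all.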

Other Applications

In addition to TTS and ASR, the CKB Sentences Corpus can be utilized for:

Language Modeling: Developing models that can predict the next word or sentence in a sequence.
Speech Translation: Training models to translate spoken CKB into other languages.
Voice Conversion: Converting one speaker's voice to another within the CKB language.
Speech Synthesis Research: Analyzing and improving the quality of synthetic speech.
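As an illustration of the language modeling use case, next-word prediction can be sketched with simple bigram counts. This is a toy example and not a method used by the corpus authors: it assumes whitespace tokenization, which is a simplification for CKB, and the names `train_bigrams` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for each word, which words follow it across the sentences."""
    follow = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the sentence start and end.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            follow[prev][nxt] += 1
    return follow


def predict_next(follow, word):
    """Return the most frequent follower of word, or None if word was never seen."""
    if word not in follow:
        return None
    return follow[word].most_common(1)[0][0]


# Placeholder token sequences for illustration; not sentences from the corpus.
model = train_bigrams(["a b c", "a b d", "a b c"])
print(predict_next(model, "b"))
```

With only 1000 sentences, counts like these will be sparse, so in practice the corpus is more likely to serve as evaluation or fine-tuning text for a model trained on larger data.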

How to Access

You can access the corpus by cloning this repository or downloading the dataset directly from the provided links. Please adhere to the data usage policies and cite this repository if you use the data in your research.

git clone https://github.com/AsoSoft/CKB-Sentence-Dataset.git
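Once the repository is cloned, the sentences can be read into Python. The sketch below assumes the sentences are stored as UTF-8 plain text with one sentence per line. That layout is an assumption, as are the helper name `load_sentences` and whatever path is passed to it, so check the actual file in the repository first.

```python
from pathlib import Path


def load_sentences(path):
    """Read a UTF-8 text file and return its non-empty lines, stripped of whitespace."""
    # "utf-8-sig" also accepts files that start with a byte order mark.
    text = Path(path).read_text(encoding="utf-8-sig")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

A call such as `load_sentences("path/to/sentences.txt")` would then return a list of strings, where the path shown is a placeholder for the real file location.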

Contribution

We welcome contributions to enhance the quality and scope of this corpus. If you have suggestions for new sentences, corrections, or additional topics, please submit a pull request or open an issue.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Citation

If you use this corpus in your research, please cite the following paper:

Abdullah, A.A., Veisi, H. and Rashid, T., 2024. Breaking Walls: Pioneering Automatic Speech Recognition for Central Kurdish: End-to-End Transformer Paradigm. arXiv preprint arXiv:2406.02561.

Contact

For any questions or additional information, please contact us at info@asosoft.com.
