The CKB Sentences Corpus is a text dataset for natural language processing (NLP) applications, primarily Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) systems. It contains 1000 sentences in Central Kurdish (CKB) drawn from 16 topics, ranging from history and sports to poetry and Facebook comments. This topical spread provides diverse linguistic content for training and evaluating TTS and ASR models.
The corpus draws its sentences from the following topics:
Topic | Number of Sentences |
---|---|
History | 50 |
Geography | 50 |
Sports | 100 |
Religious | 50 |
General | 50 |
News | 50 |
Health | 50 |
Weather | 50 |
Arts | 50 |
Science and Technology | 50 |
Poetry | 100 |
Economy | 50 |
Very Common | 50 |
Facebook Comment | 50 |
Government | 50 |
Normal | 100 |
Total | 1000 |
The corpus supports TTS training by supplying diverse, phonetically rich sentences. Its topical range exposes a model to varied vocabulary and sentence structures, which helps produce a more natural and intelligible synthetic voice for CKB.
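One rough way to gauge how much variety a set of sentences offers is to count the distinct character bigrams it covers and to pick sentences greedily by how many new bigrams each one adds. The sketch below is only an illustration: character bigrams are a cheap stand-in for phonetic units (a real TTS pipeline would typically apply a grapheme-to-phoneme step first), and the function names `char_bigrams` and `greedy_select` are hypothetical, not part of this repository.

```python
def char_bigrams(sentence):
    """Return the set of adjacent character pairs found inside each word."""
    pairs = set()
    for word in sentence.split():
        pairs.update(word[i:i + 2] for i in range(len(word) - 1))
    return pairs


def greedy_select(sentences, k):
    """Pick up to k sentences, each time taking the one that adds the most unseen bigrams."""
    covered = set()
    chosen = []
    remaining = list(sentences)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda s: len(char_bigrams(s) - covered))
        if not char_bigrams(best) - covered:
            break  # no remaining sentence adds anything new
        chosen.append(best)
        covered |= char_bigrams(best)
        remaining.remove(best)
    return chosen, covered
```

Calling `greedy_select(sentences, 200)` on the loaded corpus would return a recording script of at most 200 sentences together with the set of bigrams it covers; the value 200 is arbitrary.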
For ASR, the corpus serves as a resource for both training and evaluating models. The sentences span a wide range of phonetic and syntactic structures, which helps in developing ASR systems that stay robust across different accents and speaking styles in CKB.
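ASR output on sentences like these is commonly scored with word error rate (WER): the word-level edit distance between the reference sentence and the recognized text, divided by the number of reference words. A minimal sketch follows, assuming plain whitespace tokenization (Central Kurdish text would usually be normalized first); the function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Averaging over a test set is usually done by summing edit distances and reference lengths separately rather than averaging per-sentence rates, so that long sentences are weighted correctly.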
Beyond TTS and ASR, the CKB Sentences Corpus can be used for:
- Language Modeling: Developing models that can predict the next word or sentence in a sequence.
- Speech Translation: Training models to translate spoken CKB into other languages.
- Voice Conversion: Converting one speaker's voice to another within the CKB language.
- Speech Synthesis Research: Analyzing and improving the quality of synthetic speech.
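As a small illustration of the language modeling use case, the sketch below builds a bigram model that predicts the most frequent next word. It assumes the sentences are available as a list of strings and uses whitespace tokenization; with only 1000 sentences such a model is a toy, and the names `train_bigram_model` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count, for each word, which words follow it in the training sentences."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            followers[current][nxt] += 1
    return followers


def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if it never had one."""
    if word not in followers:
        return None
    return followers[word].most_common(1)[0][0]
```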
You can access the corpus by cloning this repository or downloading the dataset directly from the provided links. Please adhere to the data usage policies and cite this repository if you use the data in your research.
git clone https://github.com/yourusername/ckb-sentences-corpus.git
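Once the repository is cloned, the sentences can be loaded and tallied per topic. The sketch below assumes a layout that this README does not confirm: one UTF-8 `.txt` file per topic in a single directory, with one sentence per line. Adjust the glob pattern and parsing to match the actual files; `load_corpus` and `topic_counts` are hypothetical helper names.

```python
from pathlib import Path


def load_corpus(root):
    """Map each topic (taken from the file stem) to its list of non-empty sentences."""
    corpus = {}
    # Assumed layout: <root>/<topic>.txt, one sentence per line.
    for path in sorted(Path(root).glob("*.txt")):
        with path.open(encoding="utf-8") as handle:
            corpus[path.stem] = [line.strip() for line in handle if line.strip()]
    return corpus


def topic_counts(corpus):
    """Return the number of sentences per topic plus the overall total."""
    counts = {topic: len(sentences) for topic, sentences in corpus.items()}
    return counts, sum(counts.values())
```

Comparing the result of `topic_counts` with the topic table is a quick sanity check that a local copy is complete.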
We welcome contributions to enhance the quality and scope of this corpus. If you have suggestions for new sentences, corrections, or additional topics, please submit a pull request or open an issue.
This project is licensed under the MIT License. See the LICENSE file for more details.
If you use this corpus in your research, please cite the following paper:
Abdullah, A.A., Veisi, H. and Rashid, T., 2024. Breaking Walls: Pioneering Automatic Speech Recognition for Central Kurdish: End-to-End Transformer Paradigm. arXiv preprint arXiv:2406.02561.
For any questions or additional information, please contact us at info@asosoft.com.