Code Switch Language Modeling for English and Malay

Code-switching refers to the practice of alternating between two or more languages or dialects within a conversation or communication context. It is a common linguistic phenomenon observed in multilingual societies or among bilingual individuals. Code-switching can occur for various reasons, such as cultural identity, social group membership, or convenience.

This repository contains code and resources for generating code-switched data and for training language models on that data. The work was published at the Asian Conference on Intelligent Information and Database Systems (ACIIDS) 2023.

Introduction

Code-switched data generation refers to the process of creating or generating text or speech data that exhibits code-switching patterns. This task is often approached in natural language processing (NLP) and computational linguistics research to develop models and systems that can understand, generate, or analyze code-switched language. We use the NSC (National Speech Corpus) dataset to generate code-switched data for English and Malay. The generated data is then used to train and evaluate language models for code-switched data.
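As a rough illustration of what code-switched generation can look like, the sketch below swaps some English words for Malay equivalents taken from a tiny hand-written lexicon. It is not the method used in this repository: the lexicon, the `switch_prob` parameter, and the function name `generate_code_switched` are all assumptions made for this example.

```python
import random

# Hypothetical toy English -> Malay lexicon (illustration only, not from the repository).
TOY_LEXICON = {
    "eat": "makan",
    "food": "makanan",
    "house": "rumah",
    "friend": "kawan",
    "already": "sudah",
}


def generate_code_switched(sentence, lexicon=TOY_LEXICON, switch_prob=0.5, seed=None):
    """Replace each word found in the lexicon with its Malay entry with probability switch_prob."""
    rng = random.Random(seed)
    out = []
    for token in sentence.split():
        key = token.lower()
        # rng.random() returns a float in [0, 1), so switch_prob=1.0 always switches
        # and switch_prob=0.0 never does.
        if key in lexicon and rng.random() < switch_prob:
            out.append(lexicon[key])
        else:
            out.append(token)
    return " ".join(out)


if __name__ == "__main__":
    print(generate_code_switched("I already eat at my friend house", switch_prob=1.0))
```

With `switch_prob=1.0` every lexicon word is replaced; lower values leave a random subset in English, which gives a mix of the two languages within one sentence.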

Installation

To use the code and scripts in this repository, please follow these steps:

Clone the repository:

```
git clone https://github.com/kjgpta/Code-Switch-Language-Modeling-for-English-and-Malay.git
```

Install the required dependencies. You can use pip:
```
pip install -r requirements.txt
```
Set up any additional configuration or environment variables as necessary.

Usage

This section describes how to use the code and scripts provided in this repository.

Data Generation:

The Data Generation directory contains scripts for cleaning and normalizing the source text, translating it, and generating code-switched data.

Language Modeling:

The Language Modeling directory contains code for training and evaluating language models for code-switched data.

Please refer to the individual directories for detailed instructions on how to run each script or module.
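To give a sense of what training and evaluating a language model on code-switched text involves, here is a minimal sketch of a bigram model with add-one smoothing, scored by perplexity. It is a generic textbook baseline and an assumption for illustration; the models in the Language Modeling directory may be entirely different. The function names `train_bigram` and `perplexity` and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def perplexity(sentences, unigrams, bigrams):
    """Perplexity of the sentences under add-one (Laplace) smoothed bigram probabilities."""
    vocab_size = len(unigrams) + 1  # +1 reserves probability mass for unseen words
    log_prob, n = 0.0, 0
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)


if __name__ == "__main__":
    # Toy code-switched training sentences (made up for this example).
    uni, bi = train_bigram(["i want to makan", "i want nasi"])
    print(perplexity(["i want to makan"], uni, bi))
    print(perplexity(["makan to want i"], uni, bi))
```

A sentence whose word order matches the training data gets a lower perplexity than the same words in an unseen order, which is the basic signal used to compare language models on held-out code-switched text.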

Data

The NSC dataset used in this project is not public, so it is not included in this repository. You can request access to it from IMDA Singapore. Once you have obtained the dataset, follow the preprocessing steps described in this repository to prepare the data for analysis.

Acknowledgement

This research is supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (MOE2019-T2-1-084). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the Ministry of Education, Singapore.

License

This project is licensed under the CC0-1.0 License.
