This repository contains Python scripts and a workflow for (a) taking an MFA forced-alignment model that was trained on one language, and (b) running that model on a different language.
I tested the code with some Armenian data by aligning with an English model (and some other high-resource models). The alignment seems to work well.
The rationale is that for low-resource languages, it takes a lot of data (sound files, transcriptions, pronunciation dictionaries) to create a high-quality alignment model. As a stepping stone, you can run a model from a high-resource language (like English) on your low-resource language (like Armenian). The generated alignments seem to be quite sensible. In my anecdotal experience, the alignments I get from an English-based model (trained on over 1,000 hours) are better than the alignments from a custom-made model (based on 1-20 hours of data).
The following workflow explains the steps for running the scripts alongside MFA. There are example files in `Examples`. A lot of the background work was done thanks to TextGridTools.
- Ensure that MFA is installed and working on your system.
- Ensure you have a high-resource acoustic model like `english_mfa`.
- Ensure you have an original pronunciation dictionary, called `pronDictOriginal.txt`. The dictionary should have the format of `word IPA`.
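For illustration, a couple of hypothetical entries (Armenian words with space-separated IPA phones; see the files in `Examples` for the real format):

```
բարի b ɑ ɾ i
ռադիո r ɑ d i o
```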
- Review the list of phones in the `english_mfa` model.
- Create a phone-mapping file like `phoneMapping.txt`. This file will map phones that exist in the low-resource language's pronunciation dictionary `pronDictOriginal.txt` but which are absent in `english_mfa`. For every such non-English phone, write an approximate English phone: for example, a non-English trill /r/ can be approximated by the English flap /ɾ/. For an example, see here.
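As a sketch, and assuming the file lists one `sourcePhone targetPhone` pair per line (check `phoneMapping.txt` in `Examples` for the exact format), the mapping might look like this, with illustrative pairs:

```
r ɾ
χ h
```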
- Convert your original dictionary `pronDictOriginal.txt` into an intermediate dictionary `pronDictIntermediate.txt` by running the following command:

```
python convertPronDict.py pronDictOriginal.txt phoneMapping.txt pronDictIntermediate.txt
```

This command will replace the non-English phones with English phones. The script should return errors if there are any issues in your original dictionary or phone-mapping file.
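For intuition, the conversion amounts to a per-phone substitution over each dictionary entry. Here is a minimal sketch of that idea (not the actual `convertPronDict.py`; the file formats are the assumptions described above):

```python
# Minimal sketch of the per-phone substitution behind the dictionary
# conversion. Assumes "sourcePhone targetPhone" per line in the mapping file
# and "word phone phone ..." per line in the dictionary. The real script
# also writes wordTranscriptions.pkl and reports errors in the input files.

def load_mapping(path):
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                mapping[parts[0]] = parts[1]
    return mapping

def convert_dictionary(dict_in, mapping_path, dict_out):
    mapping = load_mapping(mapping_path)
    with open(dict_in, encoding="utf-8") as fin, \
         open(dict_out, "w", encoding="utf-8") as fout:
        for line in fin:
            word, *phones = line.split()
            # Replace non-English phones; phones already in the English
            # model's inventory pass through unchanged.
            converted = [mapping.get(p, p) for p in phones]
            fout.write(word + " " + " ".join(converted) + "\n")

convert_dictionary("pronDictOriginal.txt", "phoneMapping.txt",
                   "pronDictIntermediate.txt")
```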
- Keep note of the generated file `wordTranscriptions.pkl`, which will be used to transfer information about the dictionary across the Python files.
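If you want to inspect that file, it is a regular pickle; its exact structure is an implementation detail of the scripts, so treat this as a quick peek rather than a stable interface:

```python
import pickle

# Load and skim the dictionary information passed between the scripts.
with open("wordTranscriptions.pkl", "rb") as f:
    word_transcriptions = pickle.load(f)
print(type(word_transcriptions))
```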
- Validate MFA on your dictionary and corpus to make sure there are no non-English phones:

```
mfa validate $CORPUS_DIRECTORY pronDictIntermediate.txt english_mfa --ignore_acoustics
```
- Run the MFA aligner on your corpus with the intermediate dictionary:

```
mfa align $CORPUS_DIRECTORY pronDictIntermediate.txt english_mfa $OUTPUT_DIRECTORY --clean --overwrite
```
- Convert the generated alignments from English phones back to non-English phones:

```
python convertAlignments.py wordTranscriptions.pkl $OUTPUT_DIRECTORY
```
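For intuition, the back-conversion boils down to relabeling phone intervals in the output TextGrids. Here is a simplified sketch using TextGridTools; the real `convertAlignments.py` recovers the per-word original phones from `wordTranscriptions.pkl`, and the tier name, file name, and one-to-one mapping here are assumptions:

```python
import tgt  # TextGridTools: pip install tgt

# Illustrative reverse mapping (English phone -> original phone).
reverse_mapping = {"ɾ": "r"}

tg = tgt.io.read_textgrid("utterance1.TextGrid")  # hypothetical file name
phone_tier = tg.get_tier_by_name("phones")
for interval in phone_tier.intervals:
    # Relabel each aligned phone back to its original symbol.
    interval.text = reverse_mapping.get(interval.text, interval.text)
tgt.io.write_to_file(tg, "utterance1_converted.TextGrid", format="long")
```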
There are measures in place to minimize variation in the data, but I haven't yet incorporated fixes for some likely common errors.
- You can have words with multiple possible pronunciations. However, the conversion code currently cannot handle converting an alignment where a segment was deleted. To allow this level of flexibility, the conversion would likely need to incorporate a type of shortest-edit-distance algorithm (see the sketch after this list).
- I haven't tested the conversion scripts on edge cases like case sensitivity.
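For concreteness, here is one way such an alignment could work: a standard Levenshtein-style dynamic program that pairs two phone sequences and flags deletions. This is a sketch of the general idea, not code from this repository:

```python
# Sketch: align two phone sequences with edit distance so that deleted
# segments (paired with None) can be detected during back-conversion.

def align_phones(source, target):
    n, m = len(source), len(target)
    # cost[i][j] = edit distance between source[:i] and target[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (source[i - 1] != target[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back to recover the pairing.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (source[i - 1] != target[j - 1])):
            pairs.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((source[i - 1], None))  # segment deleted
            i -= 1
        else:
            pairs.append((None, target[j - 1]))  # segment inserted
            j -= 1
    return pairs[::-1]

# The aligned /ɾ/ has no counterpart, i.e. it was deleted:
print(align_phones(["b", "ɑ", "ɾ", "i"], ["b", "ɑ", "i"]))
```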
It would be interesting to also use this workflow to examine how different high-resource language models handle the same data from different languages. Feel free to contact me if you have any ideas for collaboration or fixes.