chardistance

Would you rather watch a Catalan movie at la Cinémathèque française subtitled in French or a French movie at la Filmoteca de Catalunya subtitled in Catalan?

Motivation

As a spanish speaker, being in Barcelona is not an issue... unless you dare to go to La Filmoteca de Catalunya and watch a French movie subtitled in Catalan.
Given that I don't know any of both languages beyond a small set of words (not even half A1), I wondered how should I watch the movie if I wanted to understand the dialogues.
Should I focus on the speech sounds or the subtitles? Does it make any difference?
My intuition says that Catalan should be easier to understand -at least that's what I think when I read it-. However, it resembles French when I hear it.
In order to find out the difference, I decided to measure two different but related language aspects using the same metric: Levinshtein Distance on the ortographic transcription of the dialogues (i.e. subtitles) and on the phonemic transcription of the dialogues (i.e. speech sounds).

Research Question

What is the Levenshtein Distance betweeen Spanish, French, and Catalan, considering the ortographic and phonemic transcriptions?

Data

In order to compare these three languages, a small corpus was created using a fraction of Harry Potter: The Philosopher's stone.

A parallel corpus was manually created considering coincident sentence boundaries. Due to translation, this is not always possible. The main criteria when setting boundaries was to keep sentences short enough to make the comparison reliable, and long enough to convey a full idea. Examples of the alignment:

Spanish	French	Catalan
El señor Dursley era el director de una empresa llamada Grunnings, que fabricaba taladros.	Mr Dursley dirigeait la Grunnings, une entreprise qui fabriquait des perceuses.	El senyor Dursley era director d’una fàbrica anomenada Grunnings, que feia broques.
Nuestra historia comienza cuando el señor y la señora Dursley se despertaron un martes, con un cielo cubierto de nubes grises que amenazaban tormenta. Pero nada había en aquel nublado cielo que sugiriera los acontecimientos extraños y misteriosos que poco después tendrían lugar en toda la región.	Lorsque Mr et Mrs Dursley s'éveillèrent, au matin du mardi où commence cette histoire, il faisait gris et triste et rien dans le ciel nuageux ne laissait prévoir que des choses étranges et mystérieuses allaient bientôt se produire dans tout le pays.	Quan el senyor i la senyora Dursley es van aixecar el dimarts gris i trist en què comença la nostra història, no hi havia res al cel ennuvolat que pogués fer pensar que molt aviat succeirien coses estranyes i misterioses per tot el país.

The corpus comprises 45 sentences. Each language has roughly 950 tokens and each sentence is 20 tokens long on average.

Levenshtein Distance

Levenshtein Distance is a number that counts how (dis)similar two strings are. It does so by counting how many times you would need to insert, replace, or delete a character to get a match.
For example, the distance between hello and yellow will be of 2:

replace h by y
insert w or delete w

This metric considers the best way of going from one string to the other by doing the least possible changes.
The following table shows the same sentence translated to all three languages, and then the Levenshtein Distance applied to the ortographic transcription:

Language	Sentence
Spanish	El señor Dursley era el director de una empresa llamada Grunnings, que fabricaba taladros.
French	Mr Dursley dirigeait la Grunnings, une entreprise qui fabriquait des perceuses.
Catalan	El senyor Dursley era director d’una fàbrica anomenada Grunnings, que feia broques.

Comparison	Levenshtein Distance	Score
Spanish & French	`_r dursley __r__e_it __ ________s_ ___ ___r___is_ qu_ fabri__a__ ___ __r___s__.`	62
French & Catalan	`__ _____r dursley ___ _i_e_t__ __un_ _______ _n__e____ __u_______ qu_ _e__ _r__ues.`	60
Catalan & Spanish	`el se__or dursley era director d_una ___r__a ___m__ada grunnings, que f_ia ______s.`	30

Each score means that, in order to match the sentence between two languages, you will need to apply N changes.

Ortographic transcription

The books used to obtain ortographic transcriptions were translated by:

Catalan: Laura Escorihuela Martínez - Editorial Empúries (1999)
French: Jean-François Ménard - Gallimard (2007)
Spanish: Alicia Dellepiane - Emecé Editores España, S.A. (1999)

Phonemic transcription

Two resources were used to obtain standard phonemic transcriptions:

French: OpenIPA
Spanish and Catalan: TexAFon

Example of the transcription:

Spanish	French	Catalan
Era un hombre corpulento y rollizo, casi sin cuello, aunque con un bigote inmenso.	C'était un homme grand et massif, qui n'avait pratiquement pas de cou, mais possédait en revanche une moustache de belle taille.	Era un home gros i fort que gairebé no tenia coll, però que lluïa un bigoti enorme.
eɾa un ombɾe koɾpulento i roʝiθo, kasi sin kweʝo, awnke kon un Biɣote inmenso	setɛt‿ œ̃ ɔmə gɾɑ̃d‿e masif, ki nave pɾatikəmə pa də ku, me pɔsedɛt‿ɑ̃ ɾəvɑ̃ʃ ynə mustaʃə də bɛlə tajə.	eɾə un ɔmə gɾɔz i fɔɾt kə kəjɾəβe no təniə kɔʎ, pəɾɔ kə ʎuiə un pigɔti ənoɾmə

Preprocessing

Before measuring Levenshtein Distance, preprocessing and formatting was applied. Specifically, before ortographic comparison, all vowel's variations were reduced to one (i.e. â --> a). Before phonemic comparison, ‿ was deleted as it was added by OpenIPA to denote liaison. However, the inserted phoneneme was kept. In both cases, any punctuation was removed.

Results

The following table depicts the results. Each column represents the Levenshtein Distance between a pair of languages, and each row shows the ortographic and phonemic comparison.

Comparison	Spanish-French	Spanish-Catalan	French-Catalan
Ortographic	95,04	68,98	92,67
Phonemic	92,67	76,11	88,39

Considering the ortographic comparison, Spanish and Catalan are closer to each other (score of almost 69) than they are to French, which is equally away by roughly 25 characters (score higher than 92). The difference between Spanish-Catalan and Spanish-French comparison is a score of 26 (that is to say, you will make 26 more changes to get from Spanish to French than when going from Spanish to Catalan). The difference between both distances is of 21, which means that French is ortografically further away from both languages.

According to the phonemic comparison, Spanish is still closer to Catalan (score of 76) than it is to French (score of 92) by 16 changes. However, Spanish and Catalan are not equally away from French as the ortographic comparison showed. The phonemic difference between the Spanish-Catalan and Spanish-French comparison is of 15 changes (and not 26 as it was with the ortographic comparison), and the difference between both distances is of 8 characters (and not 21).

These results suggest that for a Spanish speaker, Catalan is closer both ortographically as well as phonemically. Nonetheless, given that ortographic and phonemic distances are not of the same scale, the overall reduction of phonemic distances indicates that languages are evenly away from each other with regards to speech: ortographically the distance is of 21, whereas phonemically the distance is of 8, which is closer to 0. Being closer in phonemic space is not synonym of easier hearing comprehension, but just equal comprehension distance.

Conclusions

Reading Catalan for a Spanish speaker will be easier to understand than listening, and understanding Catalan overall will be easier than French. However, the distance notoriously shrinks when moving to phonemic space.
With regards to Catalan, even though it is phisically closer to French, their phonemic distance is almost as that of Spanish and French.

Last, but not least, I would recommend watching a French movie at la Filmoteca de Catalunya subtitled in Catalan.

Future work

Use a bigger parallel corpora.
Compare MFCCs on the actual movie (speech, sounds, and music included, as everything affects comprehension).
Compare ortographic and phonemic overlapping between languages, and their predominance in their lexicon.

How to use this repository

Clone the repository: git clone git@github.com:melanchthon19/chardistance.git
Change directory: cd chardistance
Install packages: pip install -r requirements.txt

levenshtein.pycontains the Levenshtein class.
You can run levenshtein.py by itself passing two strings: ./levenshtein.py string1 string2

The main file is compare.py. It uses Levenshtein class to compare strings from a given file.
It accepts the following flags:
--file: file in data folder that stores sentences {hp.csv,hp-phonemic.csv}
--index: analyse sentence at given index only [optional]
--preprocess: {ortographic,phonemic} [optional]
--verbose: print each Levenshtein Distance [optional]

Usage example:
./compare.py -f data/hp.csv -p ortographic -v -i 2
./compare.py -f data/hp-phonemic.csv -p phonemic

Bibliography

OpenIPA: Free, informative IPA transcription for Lyric Diction
TexAFon: Herramienta de transcripción fonética
Catalan Phonemic Chart
Automatic Phonetic Transcription of dialectal variance in Catalan
Phonological Distance Measures
Applying the Levenshtein Distance to Catalan dialects
Levenshtein Distance
Harry Potter y la piedra filosofal. Translated by Alicia Dellepiane. Emecé Editores España, S.A. (1999).
Harry Potter i la pedra filosofal. Translated by Laura Escorihuela Martínez. Editorial Empúries (1999).
Harry Potter à l'Ecole des Sorciers. Translated by Jean-François Ménard. Gallimard (2007)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

chardistance

Motivation

Research Question

Data

Levenshtein Distance

Ortographic transcription

Phonemic transcription

Preprocessing

Results

Conclusions

Future work

How to use this repository

Bibliography

Files

README.md

Latest commit

History

README.md

File metadata and controls

chardistance

Motivation

Research Question

Data

Levenshtein Distance

Ortographic transcription

Phonemic transcription

Preprocessing

Results

Conclusions

Future work

How to use this repository

Bibliography