Skip to content

Latest commit

 

History

History
47 lines (37 loc) · 2.54 KB

README.md

File metadata and controls

47 lines (37 loc) · 2.54 KB

Clustering dialect varieties based on historical sound correspondences

Can we (meaningfully) cluster dialects based on sound correspondences? Research such as Wieling and Nerbonne (2011) uses phonetically transcribed doculect data that has been aligned with data from a reference doculect to investigate clustering based on the presence/absence of sound segment alignments. The results are doculect clusters as well as analyses of correlations between segment alignments and clusters.

Using data from Heggarty (2018), I follow a similar approach, but use Proto-Germanic data as reference doculect for a set of (continental) West Germanic doculects to explore how historical sound shifts are associated with the resulting clusters.

Details on this are in my Bachelor's thesis, which can be found here. A summary and a presentation are also available.

Abstract

While information on historical sound shifts plays an important role for examining the relationships between related language varieties, it has rarely been used for computational dialectology. This thesis explores the performance of two algorithms for clustering language varieties based on sound correspondences between Proto-Germanic and modern continental West Germanic dialects. Our experiments suggest that the results of agglomerative clustering match common dialect groupings more closely than the results of (divisive) bipartite spectral graph co-clustering. We also observe that adding phonetic context information to the sound correspondences yields clusters that are more frequently associated with representative and distinctive sound correspondences).

Errata

The last sentence of section 4.3.2 Bipartite Spectral Graph Co-clustering (p. 13) should read "The results from this method are hereafter referred to as BSGC-context and BSGC-nocontext."

Running the software

To run the scripts on Windows, start run.bat which sets the scripts' IO encodings and a hash seed for python (to get consistent results across runs) and runs cluster.py (which calls the other python scripts as needed).

On UNIX, run:

set pythonhashseed=123
python3 cluster.py