Manual Version Number: 1.0.0
Why do some words have more meanings than others? Some researchers have argued for an explanation based on efficient communication: words that are shorter, more frequent, and easier to pronounce get more meanings, allowing for a more compact organization of the lexicon. However, these explanations mostly address synchronic effects, while linguistic ambiguity is inherently diachronic. We propose a novel approach where we rely on the longevity of words to estimate their degree of ambiguity. Using data covering more than half of a millennium, we find that words used for longer periods become more ambiguous. Our results support the intuition that the process of meaning accumulation is time-driven, indicating that time signatures are important predictors of linguistic features.
- analysis.R: R script for the statistical analysis of the results.
- data/: Contains input and output data related to word age estimation and frequencies:
  - age_estimations.csv: Main dataset.
  - age_estimation_1800.csv: Cross-validated dataset.
- figures/: Contains the figures used both in the paper and in the supplementary material.
- notebooks/: Contains the Jupyter notebooks used in the study:
  - change_point_detection.ipynb: Validates the etymology extraction using Google n-grams and change point detection algorithms.
  - semantic-snowball-model.ipynb: Main notebook containing the cultural evolutionary model of the Semantic Snowball Effect (a toy illustration of the underlying intuition appears after this list).
- scripts/: Contains Python scripts for processing data:
  - etymology_extraction.py: Extracts etymology information for words.
  - wiki_tokenizer.py: Tokenizes the Wiki40b dataset.
  - wikitionnary_preprocessing.py: Preprocesses the Wiktionary dump.
- requirements.txt: Lists the Python packages required to run the project.
- run_preprocessing.sh: Bash script that preprocesses the dataset.
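The model itself is implemented in semantic-snowball-model.ipynb. Purely as intuition for the Semantic Snowball Effect, the toy simulation below sketches time-driven, rich-get-richer meaning accumulation: older words get more opportunities to gain senses, and words with more senses gain new ones faster. This is a sketch of the general idea only, not the authors' model; the birth rate, gain rate, and time horizon are illustrative assumptions.

    # Toy sketch of time-driven meaning accumulation ("snowball" intuition).
    # NOT the model from semantic-snowball-model.ipynb; all parameters are
    # illustrative assumptions.
    import random

    random.seed(42)

    T = 500            # simulated time steps (assumption)
    BIRTH_RATE = 2     # new words entering the lexicon per step (assumption)
    GAIN_RATE = 0.002  # per-sense chance of spawning a new sense (assumption)

    words = []  # each entry: {"birth": step it appeared, "meanings": sense count}
    for step in range(T):
        for _ in range(BIRTH_RATE):
            words.append({"birth": step, "meanings": 1})
        for w in words:
            # Rich-get-richer: each existing sense can spawn a new one.
            w["meanings"] += sum(random.random() < GAIN_RATE
                                 for _ in range(w["meanings"]))

    # Words that have been around longer should carry more meanings on average.
    old = [w["meanings"] for w in words if T - w["birth"] > 400]
    young = [w["meanings"] for w in words if T - w["birth"] <= 100]
    print(f"mean meanings, old words:   {sum(old) / len(old):.2f}")
    print(f"mean meanings, young words: {sum(young) / len(young):.2f}")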
All analyses were run with R version 4.2.2 (2022-10-31) and Python 3.11.2.
Before running the project, ensure the following resources are available:
- Download the Wiki40b French dataset using the wiki-tokenizer tool and place it in the data/wiki-corpus/ folder.
- Download the French Wiktionary dump from WiktionaryX and place the unpacked files in the data/WikitionaryX folder.
- Install Dependencies: Install the required Python packages by running:
      pip install -r requirements.txt
- Preprocess Data:
  - Make the run_preprocessing.sh script executable:
        chmod +x run_preprocessing.sh
  - Run the script to preprocess the Wiki40b dataset and compute word ages and meanings:
        ./run_preprocessing.sh
    The results will be saved in data/age_estimations.csv.
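After the script finishes, a minimal sanity check (assuming it was run from the repository root) is to load the output with pandas and inspect its shape and schema:

    # Minimal sanity check of the preprocessing output; assumes the working
    # directory is the repository root.
    import pandas as pd

    df = pd.read_csv("data/age_estimations.csv")
    print(df.shape)             # rows (words) x columns
    print(df.columns.tolist())  # actual schema of the file
    print(df.head())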
- Validity Check: Use the change_point_detection.ipynb notebook to validate the etymology extraction method and obtain the cross-validated dataset.
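The notebook documents the actual validation procedure. As a stand-in illustration of change point detection on a word-frequency series, the sketch below runs PELT from the ruptures package on a synthetic series with one shift; the choice of algorithm, the penalty value, and the data are all assumptions, not the notebook's settings.

    # Stand-in illustration of change point detection on a frequency series.
    # The algorithm (PELT from the "ruptures" package), penalty, and synthetic
    # data are assumptions; the notebook uses real Google n-gram series.
    import numpy as np
    import ruptures as rpt

    rng = np.random.default_rng(0)
    years = np.arange(1500, 2000)
    # Synthetic relative frequency: near zero before 1750, then a jump,
    # mimicking a word entering common usage.
    freq = np.where(years < 1750, 1e-7, 1e-5) + rng.normal(0, 1e-6, years.size)

    algo = rpt.Pelt(model="rbf").fit(freq)
    change_points = algo.predict(pen=10)  # segment ends; the last is len(freq)
    print([int(years[i]) for i in change_points[:-1]])  # expect a year near 1750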
- Analyze the Results: Use the analysis.R script to reproduce the statistical analysis and generate the figures.
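analysis.R is the authoritative analysis. For orientation only, the sketch below shows in Python the kind of relationship the paper tests (older words carrying more meanings) as a Poisson regression; the column names "age" and "n_meanings" are assumptions about the CSV schema, not the variables used in the script.

    # Rough Python sketch of the relationship tested in analysis.R (older
    # words -> more meanings). The column names "age" and "n_meanings" are
    # assumptions; check the CSV header for the real schema.
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.read_csv("data/age_estimations.csv")

    # Poisson regression: sense counts modeled as a function of word age.
    model = smf.glm("n_meanings ~ age", data=df,
                    family=sm.families.Poisson()).fit()
    print(model.summary())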