Abstract. In this paper, we examine different measures of linguistic diversity in thirty-one countries, mainly in Asia. We compute richness, Shannon, and Greenberg entropic indices and transform them into the associated effective numbers. Moreover, we examine unweighted and weighted alpha, beta, and gamma diversities (effective numbers) using Hill numbers. We then look at MacArthur’s homogeneity and relative homogeneity. Finally, we divide these countries into five regional groups and compute Sørensen-Dice and Jaccard indices within each regional group and between pairs of regional groups.
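In 1956, Greenberg [1] defined linguistic diversity as the probability that two
randomly chosen individuals (with replacement) from a population have different
first languages. In other words, if there are $n$ languages in a population and $p_i$ denotes the proportion of individuals whose first language is language $i$, then Greenberg's linguistic diversity index (LDI) is

$$\mathrm{LDI} = 1 - \sum_{i=1}^{n} p_i^2.$$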
The notion of diversity based on information theory, at least in ecology, has not been without controversy. In 1971, citing the lack of consensus on its definition, Hurlbert suggested that this approach to measuring biodiversity be abandoned [5]. In response, in 1974, Peet defended the practice and provided guidelines for when and how entropic measures should be used to measure biodiversity [6]. More recently, Jost [3] clarified a common misconception among biologists and ecologists who use entropic indices as measures of diversity instead of transforming them into effective numbers, even though effective numbers were introduced by MacArthur in 1965 [4] and their properties were discussed by Patil and Taillie in 1982 in their comprehensive summary [7]. For a more recent treatment of the material covered by Patil and Taillie, see Ginebra and Puig [8].
The main focus of this work is to calculate effective numbers in thirty-one
Central, Southern, and Western Asian countries using different
entropic indices. We also look at alpha, beta, and gamma diversities in
these countries. Like linguistic effective numbers, alpha, beta, and gamma
diversities are computed from alpha, beta, and gamma entropic indices,
denoted by
The thirty-one Central, Southern, and Western Asian countries are the United Arab Emirates (AE), Afghanistan (AF), Armenia (AM), Azerbaijan (AZ), Bangladesh (BD), Bahrain (BH), Bhutan (BT), Cyprus (CY), Georgia (GE), India (IN), Iran (IR), Iraq (IQ), Israel (IL), Jordan (JO), Kazakhstan (KZ), Kuwait (KW), Kyrgyzstan (KG), Lebanon (LB), Sri Lanka (LK), Nepal (NP), Oman (OM), Pakistan (PK), Palestine (PS), Qatar (QA), Saudi Arabia (SA), Syria (SY), Tajikistan (TJ), Turkmenistan (TM), Turkey (TR), Uzbekistan (UZ), and Yemen (YE). We divide these countries into five groups based on geographical location and proximity:
- The Arabian Peninsula: AE, BH, KW, OM, QA, SA, YE;
- Central Asia: AF, KG, KZ, TM, TJ, UZ;
- Eastern Mediterranean: CY, IL, JO, LB, PS, SY;
- Southern Asia: BD, BT, IN, LK, NP, PK;
- Western Asia: AM, AZ, GE, IQ, IR, TR.
For our analysis, we use the following data sets:
- From Ethnologue Global Dataset (22nd Edition) [11], we use:
  - Table_of_Countries.tab
  - Table_of_LICs.tab
- From Glottolog 4.6 [12], we use:
  - languoid.csv
In this section, we discuss different entropic indices and how to
transform them into diversity measures, i.e., effective numbers. Recall
that, in the context of this study, the effective number is the number
of equiprobable languages in a population having the same entropic
index. Throughout this paper, we will use diversity and effective number
interchangeably. We also discuss alpha, beta, and gamma diversities in
more detail in this section. Throughout this section, we assume that
there are
Linguistic richness is the simplest measure of diversity and it counts
the number of languages in a country. In mathematical terms, the
entropic index is defined as
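Shannon entropy, the very first measure of entropy in the context of
information theory, was introduced by Shannon in 1948 [13]. It is
defined as

$$H = -\sum_{i=1}^{n} p_i \log p_i,$$

where $p_i$ is the proportion of first-language speakers of language $i$ among $n$ languages.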
We saw in the Introduction that Greenberg's index of linguistic diversity and the Gini-Simpson index are the same, and we discussed how to transform them into effective numbers. It should be pointed out that Lieberson [15] extends the work of Greenberg [1] to bilingual and multilingual populations and communities and defines a generalization of LDI; however, in this work we do not consider this generalization and focus on L1 speakers, i.e., people who speak a language as their first language.
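To make the transformation from entropic indices to effective numbers concrete, here is a minimal Python sketch (not the author's R code). It relies on the standard Hill-number identities: order 0 gives richness, the limit of order 1 gives the exponential of Shannon entropy, and order 2 gives the reciprocal of one minus the Gini-Simpson (Greenberg) index. The function names and the speaker counts are hypothetical, chosen only for illustration.

```python
import math


def hill_number(proportions, q):
    """Effective number of languages (Hill number) of order q."""
    p = [x for x in proportions if x > 0]  # languages with no speakers do not count
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1: exponential of Shannon entropy.
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


def effective_numbers(speaker_counts):
    """Richness, Shannon, and Greenberg effective numbers from L1 speaker counts."""
    total = sum(speaker_counts)
    p = [c / total for c in speaker_counts]
    return {
        "richness": hill_number(p, 0),   # number of languages
        "shannon": hill_number(p, 1),    # exp(Shannon entropy)
        "greenberg": hill_number(p, 2),  # 1 / (1 - LDI) = 1 / sum(p_i^2)
    }


# Hypothetical speaker counts (illustration only, not data from the paper).
uniform = effective_numbers([25, 25, 25, 25])  # four equally common languages
skewed = effective_numbers([97, 1, 1, 1])      # one dominant language

print(uniform)
print(skewed)
```

For equiprobable languages all three effective numbers coincide with the number of languages. For a skewed population the higher-order numbers fall toward 1, which is the pattern seen when a country's richness is large but its Shannon and Greenberg effective numbers are small.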
HCDT entropy was introduced by Tsallis in statistical mechanics [16]
and is defined as

$$H_q = \frac{1 - \sum_{i=1}^{n} p_i^q}{q - 1}, \qquad q \neq 1.$$
Introduced by Rényi in 1961 [18] as a generalization of Shannon's
entropy, Rényi entropy is defined as

$$H_q = \frac{1}{1 - q} \log \sum_{i=1}^{n} p_i^q, \qquad q \geq 0,\ q \neq 1.$$
Given a group of countries or communities, an alpha entropy
Let
For
Finally, for
Related to alpha, beta, gamma diversities are the notions of similarity
and relative homogeneity. Let us assume that we are studying
Unlike the measures above, Sørensen-Dice and Jaccard indices are not
computed using entropic indices; they measure similarity between two
samples or populations. We include them in the study to complement
MacArthur's homogeneity measure and relative homogeneity, which are
computed using entropic indices.
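Sørensen-Dice index, introduced independently by Sørensen in 1948 [23]
and Dice in 1945 [24], is computed as follows. Suppose $A$ and $B$ are two sets, for example the sets of languages spoken in two countries. Then

$$S(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}.$$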
Jaccard index, introduced in 1912 [25], computes similarity
between two sets $A$ and $B$ as

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$
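The two set-based similarity indices can be sketched in a few lines of Python (again, not the author's R code). The sketch uses the standard definitions, twice the size of the intersection over the sum of the set sizes for Sørensen-Dice and the size of the intersection over the size of the union for Jaccard. The language codes in the two sets are hypothetical placeholders rather than values taken from the Ethnologue or Glottolog tables.

```python
def sorensen_dice(a, b):
    """Sørensen-Dice similarity: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        raise ValueError("both sets are empty; the index is undefined")
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        raise ValueError("both sets are empty; the index is undefined")
    return len(a & b) / len(a | b)


# Hypothetical language inventories of two countries (illustration only).
country_x = {"lang1", "lang2", "lang3"}
country_y = {"lang2", "lang3", "lang4", "lang5"}

dice = sorensen_dice(country_x, country_y)  # 2 * 2 / (3 + 4)
jac = jaccard(country_x, country_y)         # 2 / 5

print(dice, jac)
```

The two indices are monotonically related by J = S / (2 - S), so they always rank pairs of countries in the same order. Applying the same functions to sets of language families instead of languages gives the family-level comparison.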
For the analysis, I used R, the open-source statistical programming
language. To run the code, I used Jupyter notebooks with the R
package IRkernel, which allowed me not only to run the R code but also to create an
interactive narrative and lecture slides. The R packages that I used for
this project are tidyverse, latex2exp, maps, gganimate, and gifski.
As we see in the following tables, based on richness, Shannon, and Greenberg effective numbers, India is the most linguistically diverse country among these thirty-one countries.
Country Code | Richness |
---|---|
IN | 423 |
NP | 123 |
PK | 79 |
IR | 64 |
BD | 43 |
KZ | 43 |
IL | 42 |
TR | 41 |
AF | 36 |
AZ | 34 |
UZ | 33 |
TM | 30 |
OM | 28 |
KG | 27 |
GE | 26 |
AE | 25 |
BT | 25 |
TJ | 25 |
SA | 22 |
SY | 22 |
IQ | 21 |
QA | 17 |
CY | 14 |
LB | 14 |
YE | 14 |
BH | 13 |
AM | 11 |
LK | 10 |
JO | 9 |
PS | 7 |
KW | 5 |
Country Code | Shannon |
---|---|
IN | 23.218441 |
AE | 11.172717 |
BT | 9.562095 |
NP | 9.468600 |
QA | 8.897871 |
IL | 7.772443 |
PK | 7.677709 |
OM | 7.122931 |
AF | 6.846279 |
IQ | 5.980274 |
IR | 5.786115 |
BH | 4.446231 |
YE | 3.635219 |
SA | 3.277666 |
KZ | 3.146846 |
UZ | 3.086781 |
GE | 2.921120 |
TM | 2.878044 |
JO | 2.865541 |
KG | 2.737652 |
CY | 2.551236 |
SY | 2.488467 |
KW | 2.466614 |
BD | 2.365351 |
TR | 2.202198 |
LK | 2.086926 |
AZ | 1.829342 |
TJ | 1.829291 |
LB | 1.795033 |
PS | 1.714478 |
AM | 1.164742 |
Country Code | Greenberg |
---|---|
IN | 9.969579 |
QA | 6.336755 |
AE | 5.835562 |
BT | 5.768101 |
AF | 4.912150 |
IQ | 4.250173 |
PK | 4.152624 |
NP | 4.078858 |
IL | 3.911446 |
OM | 3.689144 |
YE | 3.022755 |
BH | 2.755357 |
IR | 2.632791 |
KZ | 2.056855 |
JO | 1.990195 |
UZ | 1.865696 |
KG | 1.846733 |
TM | 1.842836 |
KW | 1.821256 |
CY | 1.793944 |
SA | 1.752183 |
GE | 1.742736 |
LK | 1.621985 |
SY | 1.589308 |
BD | 1.584692 |
TR | 1.435122 |
PS | 1.424603 |
TJ | 1.382107 |
LB | 1.302633 |
AZ | 1.262504 |
AM | 1.051925 |
Each of these tables is sorted in descending order of the corresponding effective number. If we take the rank (row number) of each country in each of the three tables, average these ranks, and sort the averages in ascending order, we get the following table.
Country Code | Average Overall Rank |
---|---|
IN | 1.000000 |
NP | 4.666667 |
PK | 5.666667 |
AE | 7.000000 |
IL | 7.333333 |
AF | 7.666667 |
BT | 8.000000 |
IR | 9.333333 |
QA | 9.666667 |
OM | 10.333333 |
KZ | 11.666667 |
IQ | 12.333333 |
UZ | 14.333333 |
TM | 16.000000 |
YE | 16.333333 |
BH | 16.666667 |
KG | 17.000000 |
BD | 18.000000 |
GE | 18.000000 |
SA | 18.000000 |
TR | 19.666667 |
JO | 21.000000 |
CY | 21.333333 |
SY | 22.000000 |
AZ | 22.333333 |
KW | 24.333333 |
TJ | 24.666667 |
LK | 25.666667 |
LB | 27.333333 |
PS | 29.000000 |
AM | 29.666667 |
Table: Average Overall Rank in Ascending Order Among These Thirty-One Countries
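The averaging of ranks described above can be reproduced with a short Python sketch (not the author's R code). The three orderings below are copied from the richness, Shannon, and Greenberg tables; the function and variable names are hypothetical. Ranks are plain row numbers, so tied values such as the richness of BD and KZ keep the order in which they appear in the tables.

```python
# Country codes in the row order of the three tables above.
RICHNESS_ORDER = (
    "IN NP PK IR BD KZ IL TR AF AZ UZ TM OM KG GE AE BT TJ SA SY IQ "
    "QA CY LB YE BH AM LK JO PS KW"
).split()
SHANNON_ORDER = (
    "IN AE BT NP QA IL PK OM AF IQ IR BH YE SA KZ UZ GE TM JO KG CY "
    "SY KW BD TR LK AZ TJ LB PS AM"
).split()
GREENBERG_ORDER = (
    "IN QA AE BT AF IQ PK NP IL OM YE BH IR KZ JO UZ KG TM KW CY SA "
    "GE LK SY BD TR PS TJ LB AZ AM"
).split()


def average_ranks(*orderings):
    """Average the row number of each country code across several orderings."""
    ranks = {}
    for ordering in orderings:
        for row, code in enumerate(ordering, start=1):
            ranks.setdefault(code, []).append(row)
    return {code: sum(rows) / len(rows) for code, rows in ranks.items()}


avg = average_ranks(RICHNESS_ORDER, SHANNON_ORDER, GREENBERG_ORDER)

# Sort by average rank; ties are broken alphabetically by country code.
ranking = sorted(avg, key=lambda code: (avg[code], code))

for code in ranking[:5]:
    print(code, round(avg[code], 6))
```

For example, Nepal is in row 2 for richness, row 4 for Shannon, and row 8 for Greenberg, which gives the average rank 14/3 shown in the table.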
Based on the above table, India, the United Arab Emirates, Israel, Afghanistan, and Iran are the most linguistically diverse countries in Southern Asia, the Arabian Peninsula, Eastern Mediterranean, Central Asia, and Western Asia, respectively. The following maps show richness, Shannon, and Greenberg effective numbers in these countries on a logarithmic color scale.
The maps on the left use
different overall ranges. When we use the same range for all three maps,
we get the maps on the right. Below is the animated map for seq(.001, 5, .2):
We now consider the effective numbers for each country based on region
for
We have included the effective numbers for each
country for
Based on the figure for Central Asia, we see
similar values of richness among these countries, with
Kazakhstan having the highest value of richness. However, as
In A Tidyverse Approach to Alpha, Beta and Gamma Diversities,
I checked the validity of my code for the current paper by
applying it to an ecological data set and comparing the results with
those obtained from functions in the R package vegetarian.

I should point out that vegetarian is no longer available on the
CRAN (The Comprehensive R Archive Network) repository. Since it is no
longer maintained on CRAN, I am rewriting the code to use another
ecology R package called vegan, which is an alternative to vegetarian.
In the table below, we have the weighted and unweighted gamma, alpha, and beta diversities. Moreover, the table contains the values of MacArthur's Homogeneity and Relative Homogeneity using richness, Shannon, and Greenberg effective unweighted beta diversities.
Type | Method | Richness | Shannon | Greenberg |
---|---|---|---|---|
Gamma | Unweighted | 790 | 59.5264 | 37.8207 |
Gamma | Weighted | 790 | 48.4439 | 19.3729 |
Alpha | Unweighted | 42.7742 | 3.85418 | 2.1383 |
Alpha | Weighted | 42.7742 | 11.5658 | 8.5082 |
Beta | Unweighted | 18.4691 | 15.4447 | 17.6870 |
Beta | Weighted | 18.4691 | 4.1885 | 2.2767 |
MacArthur's Homogeneity | Unweighted | 0.0542 | 0.0648 | 0.0565 |
MacArthur's Homogeneity | Weighted | -- | -- | -- |
Relative Homogeneity | Unweighted | 0.0226 | 0.0336 | 0.0251 |
Relative Homogeneity | Weighted | -- | 0.0665 | -- |
Table: Unweighted and Weighted Gamma, Alpha, and Beta Diversities, Along with MacArthur's Homogeneity and Relative Homogeneity, Using Richness, Shannon, and Greenberg
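The values in the table are consistent with the multiplicative partition in which beta equals gamma divided by alpha (for example, 790 / 42.7742 ≈ 18.469), with MacArthur's homogeneity equal to 1/beta and with relative homogeneity equal to (1/beta - 1/N) / (1 - 1/N) for N = 31 countries. The Python sketch below (not the author's R code) illustrates this partition for the unweighted case only, assuming equal community weights and following the formulas of Jost [19]. The function names and the toy proportion vectors are hypothetical.

```python
import math


def hill_number(p, q):
    """Hill number of order q for one proportion vector."""
    p = [x for x in p if x > 0]
    if q == 1:
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


def partition(communities, q):
    """Unweighted alpha, beta, gamma diversities of order q.

    `communities` is a list of proportion vectors over the same ordered
    list of languages; every community gets the same weight 1/N.
    """
    n = len(communities)
    # Gamma: diversity of the pooled (averaged) proportions.
    pooled = [sum(column) / n for column in zip(*communities)]
    gamma = hill_number(pooled, q)
    # Alpha: average within-community entropy, converted to an effective number.
    if q == 1:
        mean_entropy = sum(math.log(hill_number(c, 1)) for c in communities) / n
        alpha = math.exp(mean_entropy)
    else:
        mean_sum = sum(sum(x ** q for x in c if x > 0) for c in communities) / n
        alpha = mean_sum ** (1.0 / (1.0 - q))
    beta = gamma / alpha                 # effective number of distinct communities
    homogeneity = 1.0 / beta             # MacArthur's homogeneity
    relative = (homogeneity - 1.0 / n) / (1.0 - 1.0 / n)
    return {"alpha": alpha, "beta": beta, "gamma": gamma,
            "homogeneity": homogeneity, "relative_homogeneity": relative}


# Toy data (illustration only): two countries with no shared languages,
# and two countries with identical language proportions.
disjoint = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
identical = [[0.5, 0.5, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]

for q in (0, 1, 2):
    print(q, partition(disjoint, q)["beta"], partition(identical, q)["beta"])
```

With completely distinct communities, beta equals the number of communities and relative homogeneity is 0; with identical communities, beta is 1 and both homogeneity measures are 1. The weighted case replaces the equal weights 1/N with population-based weights and is not covered by this sketch.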
In the figures below, we have the plots for unweighted and weighted gamma, alpha, and beta diversities, respectively. As we see in the figures on the left, especially the bottom figure, it is not appropriate to use unweighted gamma, alpha, and beta diversities because of the great population disparities between these countries.
As we see in the following figures, using both Sørensen-Dice and Jaccard indices, Central Asia has the highest level of language similarity and Southern Asia has the highest level of language dissimilarity on average.
Moreover, as we see in the following figures, using both indices, Western Asia and Southern Asia have the highest level of similarity and dissimilarity on average, respectively, based on language families.
The density functions when using both indices and using language level and family levels are given in the following figures.
Examining pairs of regions, based on both Sørensen-Dice and Jaccard indices, we see the highest level of average similarity between Central Asia and Western Asia and the highest level of average dissimilarity between Southern Asia and Eastern Mediterranean, as we see in the following figures.
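One possibility is to use Rao's quadratic index along with linguistic
family trees to define similarity functions to measure similarity in
each country and in all thirty-one countries combined. Rao's quadratic
index of biodiversity, which measures similarity (and dissimilarity) in a
population [26], is defined as

$$Q = \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}\, p_i\, p_j,$$

where $d_{ij}$ is the dissimilarity between languages $i$ and $j$ and $p_i$ is the proportion of speakers of language $i$.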
Note that when
Another future direction is to study different measures of linguistic diversity as time-series, similar to what Harmon and Loh have done using LDI [27]. It would be interesting to see how this approach can be applied using Rao's similarity measures, or other tree-based [28, 29] or distance-based methods [30] for regional linguistic studies, to make predictions about the status of more vulnerable languages in the future. Regarding tree-based methods, it should be pointed out that Rao's quadratic entropy is the basis for some of the approaches for computing phylogenetic diversities in ecology [31, 32, 33], and one possible direction is to use this available literature and apply it in linguistics. Moreover, in computing Rao's quadratic diversities, one can incorporate lexical or other linguistic similarities into the similarity (distance) functions.
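As a small illustration of how a distance function enters Rao's quadratic index, the following Python sketch (not part of the author's analysis) uses the standard form of the index, the sum over all pairs of languages of d_ij p_i p_j. The proportions and the distance matrices are hypothetical; in particular, the value 0.5 for two languages in the same family is an arbitrary placeholder for a tree-based distance.

```python
def rao_quadratic(p, d):
    """Rao's quadratic index: sum over i, j of d[i][j] * p[i] * p[j]."""
    n = len(p)
    return sum(d[i][j] * p[i] * p[j] for i in range(n) for j in range(n))


# Hypothetical proportions of L1 speakers for three languages.
p = [0.5, 0.3, 0.2]

# Distance 1 between any two different languages, 0 on the diagonal.
all_distinct = [[0.0 if i == j else 1.0 for j in range(3)] for i in range(3)]

# Hypothetical tree-based distances: languages 0 and 1 share a family (0.5),
# language 2 belongs to a different family (1.0).
family_aware = [
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

gini_simpson = 1.0 - sum(x * x for x in p)
q_distinct = rao_quadratic(p, all_distinct)
q_family = rao_quadratic(p, family_aware)

print(gini_simpson, q_distinct, q_family)
```

When every pair of different languages is at distance 1, the index coincides with the Gini-Simpson (Greenberg) index; when related languages are treated as closer, the index is smaller, reflecting lower diversity.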
I would like to thank the Office of the Dean of the Faculty at Saint Michael's College for the Merit-based Course Reduction Award in Fall 2020 that allowed me to continue my work on this paper.
Footnotes
1. J. H. Greenberg. The measurement of linguistic diversity. Language, 32(1):109--115, 1956.
2. E. H. Simpson. Measurement of diversity. Nature, 163(4148):688--688, 1949.
3. L. Jost. Entropy and diversity. Oikos, 113(2):363--375, 2006.
4. R. H. MacArthur. Patterns of species diversity. Biological Reviews, 40(4):510--533, 1965.
5. S. H. Hurlbert. The nonconcept of species diversity: a critique and alternative parameters. Ecology, 52(4):577--586, 1971.
6. R. K. Peet. The measurement of species diversity. Annual Review of Ecology and Systematics, pages 285--307, 1974.
7. G. P. Patil and C. Taillie. Diversity as a concept and its measurement. Journal of the American Statistical Association, 77(379):548--561, 1982.
8. J. Ginebra and X. Puig. On the measure and the estimation of evenness and diversity. Computational Statistics & Data Analysis, 54(9):2187--2201, 2010.
9. R. H. Whittaker. Vegetation of the Siskiyou Mountains, Oregon and California. Ecological Monographs, 30(3):279--338, 1960.
10. R. H. Whittaker. Evolution and measurement of species diversity. Taxon, 21(2-3):213--251, 1972.
11. D. M. Eberhard, G. F. Simons, and C. D. Fennig. Ethnologue: Languages of the world, 2019.
12. H. Hammarström, R. Forkel, M. Haspelmath, and S. Bank. Glottolog 4.6, 2022.
13. C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379--423, 1948.
14. M. O. Hill. Diversity and evenness: a unifying notation and its consequences. Ecology, 54(2):427--432, 1973.
15. S. Lieberson. An extension of Greenberg’s linguistic diversity measures. Language, 40(4):526--531, 1964.
16. C. Tsallis. Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52(1):479--487, 1988.
17. C. J. Keylock. Simpson diversity and the Shannon–Wiener index as special cases of a generalized entropy. Oikos, 109(1):203--207, 2005.
18. A. Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 547--561. University of California Press, Berkeley, California, USA, 1961.
19. L. Jost. Partitioning diversity into independent alpha and beta components. Ecology, 88(10):2427--2439, 2007.
20. J. A. Veech and T. O. Crist. Diversity partitioning without statistical independence of alpha and beta. Ecology, 91(7):1964--1969, 2010.
21. C. Ricotta. On beta diversity decomposition: trouble shared is not trouble halved. Ecology, 91(7):1981--1983, 2010.
22. L. Jost. Independence of alpha and beta diversities. Ecology, 91(7):1969--1974, 2010.
23. T. Sørensen. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Kongelige Danske Videnskabernes Selskab, pages 1--34, 1948.
24. L. R. Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297--302, 1945.
25. P. Jaccard. The distribution of the flora in the Alpine zone. 1. New Phytologist, 11(2):37--50, 1912.
26. C. R. Rao. Diversity and dissimilarity coefficients: a unified approach. Theoretical Population Biology, 21(1):24--43, 1982.
27. D. Harmon and J. Loh. The index of linguistic diversity: A new quantitative measure of trends in the status of the world’s languages. Language Documentation & Conservation, 4:97--151, 2010.
28. M. R. Helmus, T. J. Bland, C. K. Williams, and A. R. Ives. Phylogenetic measures of biodiversity. The American Naturalist, 169(3):E68--E83, 2007.
29. A. R. Ives and M. R. Helmus. Phylogenetic metrics of community similarity. The American Naturalist, 176(5):E128--E142, 2010.
30. S. Champely and D. Chessel. Measuring biological diversity using euclidean metrics. Environmental and Ecological Statistics, 9(2):167--177, 2002.
31. M. W. Cadotte, T. J. Davies, J. Regetz, S. W. Kembel, E. Cleland, and T. Oakley. Phylogenetic diversity metrics for ecological communities: integrating species richness, abundance and evolutionary history. Ecology Letters, 13(1):96--105, 2010.
32. A. Chao, C.-H. Chiu, and L. Jost. Phylogenetic diversity measures based on Hill numbers. Philosophical Transactions of the Royal Society B: Biological Sciences, 365(1558):3599--3609, 2010.
33. C.-H. Chiu, L. Jost, and A. Chao. Phylogenetic beta diversity, similarity, and differentiation measures based on Hill numbers. Ecological Monographs, 84(1):21--44, 2014.