Estimating the Probability of Historical Connections Between Languages

Kessler, Brett. 1999. Estimating the probability of historical connections between languages. Stanford, CA: Stanford dissertation.

Available from ProQuest Dissertations and Theses database (AAT 9924446).

Abstract

Historical linguistics has no generally accepted methodology for statistically estimating whether the connections it documents between languages are coincidental or statistically significant (likely to reflect historical realities). Currently the best proposals are very susceptible to errors that lead the researcher to falsely judge languages to be historically connected. I propose several improvements in the statistics of the testing. The new techniques are illustrated with a set of five languages having varying degrees of interrelatedness (English, German, French, Latin, Albanian) and three not believed to be related to that set or to each other (Hawaiian, Navajo, and Turkish). Statistically, the technique of Ringe (1992) suffers from an invalid use of multiple tests. I develop a single test that uses Monte Carlo techniques for estimating significance. The test takes less than a minute on a personal computer and is conceptually much simpler than traditional parametric statistics. My technique is compatible with a wide range of metrics, and I develop several variants in attempts to interpret algorithmically the traditional techniques of historical linguistics, which seek to discover recurrent pairings of sounds between semantically matched words in a set of languages. I begin with an implementation of the familiar chi-squared statistic. That approach is satisfactory, but permits the researcher to consider only one sound in each word. The Monte Carlo technique permits a simpler, more traditional counting of the recurrent pairings, and with proper scaling it can be made to work for multiple sounds in a word. Although it is possible to consider all conceivable pairings of sounds, I show that a simple linear alignment is preferable because universal properties of word length interfere with the goal of finding particular, nonuniversal, connections between languages. I also explore the possibilities of comparing words at subsegmental levels.

The greatest problem with the testing is the quality of the data. The tests are easily distorted by loans, recurring etyma, and nonarbitrary vocabulary. I show how prevalent such problems are among the items in the standard Swadesh list of 200 concepts, and introduce some mathematical techniques to help the linguist identify problem areas.
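
As a rough illustration of the kind of Monte Carlo significance test the abstract describes, the following Python sketch compares one sound per word (say, the initial consonant) across two semantically matched word lists. It is not the dissertation's implementation: the function names `recurrence_score` and `monte_carlo_p`, the default number of trials, and the choice to shuffle one list against the other are all assumptions made here for illustration. The statistic is the ordinary chi-squared sum over the table of sound pairings; its significance is estimated by rearranging one list many times and counting how often the rearranged score reaches the actual one, so only a single test is performed.

```python
import random
from collections import Counter


def recurrence_score(pairs):
    """Chi-squared statistic over the contingency table of sound pairings.

    `pairs` is a list of (sound_in_language_A, sound_in_language_B) tuples,
    one tuple per concept on the word list.
    """
    n = len(pairs)
    observed = Counter(pairs)
    row_totals = Counter(a for a, _ in pairs)
    col_totals = Counter(b for _, b in pairs)
    chi2 = 0.0
    for a, row_count in row_totals.items():
        for b, col_count in col_totals.items():
            expected = row_count * col_count / n
            chi2 += (observed[(a, b)] - expected) ** 2 / expected
    return chi2


def monte_carlo_p(sounds_a, sounds_b, trials=10000, seed=0):
    """Estimate how often chance alone gives a score at least as high.

    Shuffling `sounds_b` breaks the semantic matching between the lists
    while keeping each language's own sound frequencies unchanged.
    """
    rng = random.Random(seed)
    actual = recurrence_score(list(zip(sounds_a, sounds_b)))
    shuffled = list(sounds_b)
    at_least_as_high = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if recurrence_score(list(zip(sounds_a, shuffled))) >= actual:
            at_least_as_high += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_high + 1) / (trials + 1)
```

Called as `monte_carlo_p(initials_a, initials_b)`, where the two arguments are hypothetical equal-length lists of initial sounds in concept order, the function returns an estimated probability of seeing pairings at least this recurrent by chance. Because the score is just a function passed over paired sounds, it could in principle be swapped for a simpler count of recurrent pairings, in the spirit of the variants the abstract mentions.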

Data

The word lists. These take Ringe’s word lists as their point of departure, with many expansions and annotations. For example, Albanian, French, German, Hawaiian, Navajo, and Turkish have been expanded from 100 to 200 words, and for all the languages I mark known loans, nonarbitrary vocabulary, and cognates, including all language-internal ones. Available as XML source (Unicode character set using UTF-8 encoding, but the non-ASCII characters are coded as entities, so the file should be readable in all editors). Warning: many browsers do not handle large XML files well; it may be best to save the file without viewing it in your browser. A human-readable version exists as an appendix in the book.

Tables from Ringe’s papers.

Source	Table	Languages	Environment	Number of words
1992	8	English-German	C_1.1	100
1992	10	English-German	C_1.2	100
1992	12	English-German	C_2.1	100
1992	14	English-German	C_2.2	100
1992	15	English-German	final rime	100
1992	18	English-Latin	C_1.1	100
1992	20	English-Latin	C_1.2	100
1992	24	English-Turkish	C_1.1	100
1992	29	English-Latin	C_1.1	200
1993	2	English-French	C_1.1	100
1993	4	Albanian-French	C_1.1	100

APA citation:

Kessler, B. (1999). Estimating the probability of historical connections between languages (Doctoral dissertation). Available from ProQuest Dissertations and Theses database. (AAT 9924446)