Kessler, Brett. 1999.
Estimating the probability of historical
connections between languages.
Stanford, CA: Stanford dissertation.
Available from ProQuest Dissertations and Theses database (AAT 9924446).
Abstract
Historical linguistics has no generally accepted methodology for
statistically estimating whether the connections it documents between
languages are coincidental or statistically significant (likely to
reflect historical realities). Currently the best proposals are very
susceptible to errors which lead the researcher to falsely judge
languages to be historically connected. I propose several
improvements in the statistics of the testing. The new techniques are
illustrated with a set of five languages having varying degrees of
interrelatedness (English, German, French, Latin, Albanian) and three
not believed to be related to that set or to each other (Hawaiian,
Navajo, and Turkish). Statistically, the technique of Ringe (1992)
suffers from an invalid use of multiple tests. I develop a single
test that uses Monte Carlo techniques for estimating significance.
The test takes less than a minute on a personal computer and is
conceptually much simpler than traditional parametric statistics. My
technique is compatible with a wide range of metrics, and I develop
several variants in attempts to interpret algorithmically the
traditional techniques of historical linguistics, which seek to
discover recurrent pairings of sounds between semantically matched
words in a set of languages. I begin with an implementation of the
familiar chi-squared statistic. That approach is satisfactory, but
only permits the researcher to consider one sound in each word. The
Monte Carlo technique permits a simpler, more traditional
counting of the recurrent pairings, and with proper scaling that can
be made to work for multiple sounds in a word. Although it is possible
to consider all conceivable pairings of sounds, I show that a simple
linear alignment is preferable because universal properties of word
length interfere with the goal of finding particular, nonuniversal,
connections between languages. I also explore the possibilities of
comparing words at subsegmental levels. The greatest problem with the
testing is the quality of the data. The tests are easily distorted
by loans, recurring etyma, and nonarbitrary vocabulary. I show how
prevalent such problems are among the items in the standard Swadesh
list of 200 concepts, and introduce some mathematical techniques to
help the linguist identify problem areas.
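As a rough illustration of the kind of Monte Carlo significance test
described above, the following Python sketch pairs one sound per word
across two semantically matched lists, computes a chi-squared statistic
over the pairings, and then repeatedly shuffles one list against the
other to see how often chance alone produces a statistic at least as
large. This is a minimal sketch under stated assumptions, not the
dissertation's implementation: the shuffling (permutation) scheme, the
function names chi_squared and monte_carlo_p, the parameters metric,
trials, and seed, and the toy data are all invented for illustration.

```python
import random
from collections import Counter


def chi_squared(pairs):
    """Pearson chi-squared statistic for a list of (sound_a, sound_b) pairings."""
    n = len(pairs)
    observed = Counter(pairs)
    row_totals = Counter(a for a, _ in pairs)
    col_totals = Counter(b for _, b in pairs)
    stat = 0.0
    # Sorted iteration keeps the summation order identical across shuffles,
    # so equal tables always give bit-identical statistics.
    for a in sorted(row_totals):
        for b in sorted(col_totals):
            expected = row_totals[a] * col_totals[b] / n
            stat += (observed[(a, b)] - expected) ** 2 / expected
    return stat


def monte_carlo_p(sounds_a, sounds_b, metric=chi_squared, trials=10_000, seed=0):
    """Estimate how often a random re-pairing scores at least as high as the real one.

    sounds_a[i] and sounds_b[i] are the sounds compared for meaning slot i.
    The shuffling scheme is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    observed = metric(list(zip(sounds_a, sounds_b)))
    shuffled = list(sounds_b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break the meaning-to-meaning alignment
        if metric(list(zip(sounds_a, shuffled))) >= observed:
            hits += 1
    # Add-one correction: the estimate can never be exactly zero.
    return (hits + 1) / (trials + 1)


if __name__ == "__main__":
    # Invented toy data (not taken from the word lists): one sound per word.
    lang_a = list("ppppttttkkkk")
    lang_b = list("ffffsssshhhh")
    print(monte_carlo_p(lang_a, lang_b))
```

Because the statistic is passed in as the metric argument, a different
measure of recurrent pairings can be substituted without changing the
significance-testing loop, and a single test over the whole table avoids
running a separate test for each pairing.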
Data
The word lists. These take Ringe’s word lists as their point of
departure, with many expansions and annotations. For example, the
Albanian, French, German, Hawaiian, Navajo, and Turkish lists have
been expanded from 100 to 200 words, and for all the languages I mark
known loans, nonarbitrary vocabulary, and cognates, including all
language-internal ones. Available as XML source (Unicode character set
in UTF-8 encoding, but with the non-ASCII characters coded as
entities, so the file should be readable in all editors). Warning:
many browsers do not handle large XML files well, so it may be best to
save the file without viewing it in your browser. A human-readable
version exists as an Appendix in the book.
Tables from Ringe’s papers.
Source | Table | Languages | Environment | N words |
1992 | 8 | English-German | C1.1 | 100 |
1992 | 10 | English-German | C1.2 | 100 |
1992 | 12 | English-German | C2.1 | 100 |
1992 | 14 | English-German | C2.2 | 100 |
1992 | 15 | English-German | final rime | 100 |
1992 | 18 | English-Latin | C1.1 | 100 |
1992 | 20 | English-Latin | C1.2 | 100 |
1992 | 24 | English-Turkish | C1.1 | 100 |
1992 | 29 | English-Latin | C1.1 | 200 |
1993 | 2 | English-French | C1.1 | 100 |
1993 | 4 | Albanian-French | C1.1 | 100 |
APA citation:
Kessler, B. (1999). Estimating the probability of historical
connections between languages (Doctoral dissertation). Available from
ProQuest Dissertations and Theses database. (AAT 9924446)