WELCOME TO WESLALEX

A lexical database of the West Slavic languages


WESLALEX is the first cross-linguistic database of words in children’s school books in the West Slavic languages: Czech, Slovak, and Polish. The database contains morphologically and phonetically tagged words extracted from the most widely used instructional textbooks for grades 1 to 3/4/5, and it allows, via a simple WWW interface, searches for useful statistics about words and their morphological and phonological attributes.

Inspired by the British-English Children’s Printed Word Database (CPWD) (Masterson et al., 2003), a major feature of WESLALEX is that it generates comparable information about the words that primary school children encounter in reading in Czech, Slovak, and (soon also in) Polish, as well as in British English (CPWD). A unique extension of WESLALEX is the wealth of grammatical information that it can generate, which is a component of key importance for the inflected Slavic languages. WESLALEX can serve a wide variety of single-language and cross-linguistic research and educational purposes.

Files in this directory are in character encoding UTF-8. If your browser keeps serving up apparent garbage, try setting its character encoding. E.g. in Firefox, select View / Character Encoding / Unicode (UTF-8) or try View / Character Encoding / Auto-detect / Universal.

Word frequency lists for each language are available as downloadable files. The CSV files are tab-delimited plaintext files that can be manipulated by any spreadsheet program, text editor, or simple text-processing script. The Excel files are for Excel 2007 and take advantage of its search and filtering features for structured tables. A user’s guide for the Excel 2007 version is also available as a Microsoft Word 2007 file.

Czech Wordforms CSV; Lemmas CSV; Excel (wordforms and lemmas)

Polish Wordforms CSV; Wordforms Excel

Slovak Wordforms CSV; Lemmas CSV; Excel (wordforms and lemmas)
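
As a rough illustration of the kind of simple text-processing script mentioned above, the following Python sketch reads a tab-delimited word list into dictionaries keyed by field name. It assumes that the file starts with a header row naming the fields, which this page does not confirm. The function name read_wordforms and the two sample rows are hypothetical and stand in for a downloaded file.

```python
import csv
import io

def read_wordforms(handle):
    """Return the rows of a tab-delimited word list as dictionaries.

    Assumption (not confirmed by this page): the first line is a header
    row that names the fields.
    """
    return list(csv.DictReader(handle, delimiter="\t"))

# Hypothetical two-row sample standing in for a downloaded file.
sample = "spell\tlemma\tfreq\nadresou\tadresa\t3\nmáma\tmáma\t12\n"
rows = read_wordforms(io.StringIO(sample))

# Keep only the tokens whose raw corpus frequency is at least 10.
frequent = [row["spell"] for row in rows if int(row["freq"]) >= 10]
print(frequent)
```

For a real download, the handle would typically come from open(path, encoding="utf-8", newline=""), since the files are in UTF-8.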

Fields are as follows. Additional information is given as comments in the Excel files.
1. spell - token, all in lowercase
2. lemma - heading as from the tagger dictionary
3. morpho - morphological analysis tags
4. g1F - frequency in Grade 1 texts (literal count)
5. g1D - dispersion across Grade 1 texts
6. g1U - U (freq as words per million, adjusted by dispersion)
7. g1SFI - standardized frequency index within Grade 1
8. g2F, g2D, g2U, g2SFI - same, Grade 2 (Czech has g1–g5; Slovak has g1–g4; Polish has g0–g3)
9. freq - raw frequency across entire corpus
10. D - dispersion across corpus, taking book as the unit of dispersion
11. U - freq as wpm, adjusted by dispersion across corpus
12. SFI - standardized frequency index
13. nlett - number of letters in spell (Unicode characters in NFC)
14. align - letter–sound alignment e.g.: a=a d=d r=r e=e s=s o=o u=w
15. pron - pronunciation e.g. adresow
16. syll - pronunciation with each syllable in braces
17. nsyll - number of syllables
18. nphon - number of phonemes in spelling
19. cv - CV structure, with each syll in braces
20. pos - major part of speech
21. subpos - detailed word class
22. gender
23. number
24. case
25. possgender - gender of possessor
26. possnumber - number of possessor
27. person
28. tense
29. grade - degree of comparison
30. negation
31. voice
32. var - variant
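
The align and nlett fields described above can be explored with a short script. The sketch below splits an align string into letter and sound pairs and counts letters after NFC normalization. It assumes that pairs are separated by spaces and that each pair contains a single "=" separator; the helper names parse_align and count_letters are hypothetical.

```python
import unicodedata

def parse_align(align):
    """Split an align string such as 'a=a d=d' into (letters, sound) pairs."""
    return [tuple(pair.split("=", 1)) for pair in align.split()]

def count_letters(spell):
    """Count Unicode characters after NFC normalization, as nlett is described."""
    return len(unicodedata.normalize("NFC", spell))

# The example from the field list: the left sides spell the token,
# the right sides give its pronunciation.
pairs = parse_align("a=a d=d r=r e=e s=s o=o u=w")
spelling = "".join(letters for letters, _ in pairs)
pron = "".join(sound for _, sound in pairs)
print(spelling, pron)  # adresou adresow

# 'c' followed by a combining caron is two code points, but one letter in NFC.
decomposed = "c\u030c"
print(len(decomposed), count_letters(decomposed))  # 2 1
```

Normalizing before counting matters for text that arrives in decomposed form; a plain character count would otherwise overstate the number of letters in words with diacritics.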
The Polish wordforms are not disambiguated. Instead of separate lemma and morpho columns and the morphosyntactic fields pos through var, that list has a single column called analysis, which lists all possible analyses for the token.
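
Because the per-grade columns differ by language, a script that handles all three lists may want to build the expected column names rather than hard-code them. The following sketch does this from the grade ranges given in the field list. It assumes that every grade has all four measures (F, D, U, SFI) and that every name follows the g1F pattern, including g0 for Polish; the names GRADES, MEASURES, and grade_columns are hypothetical.

```python
# Grade ranges as stated in the field list: Czech g1-g5, Slovak g1-g4, Polish g0-g3.
GRADES = {
    "Czech": range(1, 6),
    "Slovak": range(1, 5),
    "Polish": range(0, 4),
}

# Assumption: each grade carries the same four measures shown for Grade 1.
MEASURES = ("F", "D", "U", "SFI")

def grade_columns(language):
    """Return the expected per-grade column names for one language."""
    return [f"g{grade}{measure}"
            for grade in GRADES[language]
            for measure in MEASURES]

print(grade_columns("Polish")[:4])  # ['g0F', 'g0D', 'g0U', 'g0SFI']
print(len(grade_columns("Czech")))  # 20
```

Comparing this list against the header of a downloaded file is a quick way to check the naming assumption before relying on it.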
Catalog of books.
Paper presented at Slovko 2007.

Principal investigator for this project is Markéta Caravolas, Bangor University. Support for this project was provided by a grant from the British Academy.

APA citation:

Kessler, B., & Caravolas, M. (2011). Weslalex: West Slavic lexicon of child-directed printed words. Retrieved from http://spell.psychology.wustl.edu/weslalex

Webster: Brett Kessler
Last change 2011-04-20T09:36:19-0500
