next up previous
Next: The programme Up: Dictionary Previous: Dictionary text file

Dictionary index file

The dictionary is accessed via an index of pronunciations. The index file is organized as a trie, with all paths fully pursued to the end. For general information on trie structures, see [Fredkin 1960] or [Knuth 1973], section 6.3. Briefly, the root node of the trie points to another node for each of the 39 possible initial phonemes a word may have. Each of those nodes likewise contains a pointer to a node for each possible second phoneme, and so forth. These are here called continuations. Although in many applications tries are compressed to include only minimally distinctive differences, our trie includes every phone in the pronunciation. The continuation node for the last phone of a word will contain a termination, which points to the first dictionary text file entry which has the pronunciation that was pursued. Note that a node may have both continuations and terminations, since some whole words are initial segments of others. For example, if one goes from the root node to the /A/ node, the latter will have pointers to all possible phones that can follow initial /A/, but it will also have a termination for /A/ as a complete word ( a, ae, eh...).

Physically, each node has a header: one byte is a binary number telling how many different continuations there are; the next three bytes are a binary number which, if the node has a termination, constitute the byte offset from the beginning of the text file where entries with the pursued pronunciation begin. If the node has no such termination, the number has all 1 bits. If the node has continuations (that is, that first one-byte binary number was non-zero), there follow that many continuation pointers. Each pointer contains a one-byte binary number that tells which phoneme the continuation is for. This is an arbitrary number that is not the same as the ASCII encoding of the phoneme codes shown above; the interested reader can find the codes in programme file Phones.h. This byte is followed by three bytes which are a binary number telling the byte offset from the beginning of the (same) trie file where the continuation node begins. Note that nodes are of variable length, but always some multiple of 4 bytes.

It might be noted that since the pronunciation is embodied in the index, there is really no strict reason why it should also be included in the text file. It will be seen later that all that is really needed is a flag indicating whether an entry has the same pronunciation as the previous one. Production units of correct could save a great deal of space (here, 2.7 Megabytes) by omitting the pronunciation in the text. This is an all-around win, because that would also (marginally) speed up the programme.

If space is very critical, there are also two ways to save dictionary text space that would however slow down the programme, perhaps marginally. One would be to compress it by Huffman encoding, which would of course require time to decompress the entries at run time. Another way would be to encode very common, regular affix patterns in the index (such as regular inflexions and a few common derivational affixes). The trie could contain a path for the inflected form, but have a pointer to the base from in the text, along with a code telling the required inflexional pattern. The full spelling would then of course have to be generated on the fly, using code such as that in AddInfl.c.



next up previous
Next: The programme Up: Dictionary Previous: Dictionary text file



Brett Kessler
Wed Dec 27 22:16:48 PST 1995