A final task before constructing the spelling corrector is to determine the sound-spelling correspondences. Although some work has been done on deriving these automatically [Lucassen 1983], I preferred a semi-automated (human-assisted) approach, as undertaken by Lawrence and Kaye [Lawrence 1986]. A programme called align used a list of rules to align the spellings and pronunciations in oneDict. I created the rules file by successive refinement, judging the alignments produced by previous iterations and paying particular attention to the failures. It was not, of course, possible to check all 329,116 alignments, and many are no doubt questionable, but by and large they appear adequate for this project.
The rules are essentially sound-spelling correspondences, but to limit their applicability, some rules add left or right contexts to either the spelling or the pronunciation. These contexts are the immediately adjacent letters or phones required for a match, and may include word boundaries and the symbols C or V (for any consonant or vowel); the contexts may, however, be ignored if alignment would otherwise fail entirely. Applicability can also be controlled by ordering the rules: if one correspondence is an initial segment of another, and either would produce an alignment, the one listed first is used. Alignment proceeds by a recursive algorithm much like that described for Recur in module Respell, except that the search is constrained by a particular pronunciation (instead of the universe of legal words), and candidate rules for any substring of the word are pursued in the order listed, as just discussed, rather than in order of increasing length.
Table 6: Extract from the rules file.
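As an illustration only, the ordered, recursive alignment described above might be sketched in Python as follows. This is not the align programme itself. The rule format (a list of spelling/phone pairs), the one-character phone symbols, and the names `align_word` and `RULES` are all assumptions made for the example, and the left and right contexts are omitted for brevity.

```python
from typing import List, Optional, Tuple

# Hypothetical rule format: (spelling fragment, phone string).
# An empty phone string stands for a silent letter.
Rule = Tuple[str, str]

def align_word(spelling: str, phones: str, rules: List[Rule]) -> Optional[List[Rule]]:
    """Return the first alignment found, trying rules in the order listed.

    Returns None if no sequence of rules consumes both the whole
    spelling and the whole pronunciation.
    """
    if not spelling and not phones:
        return []  # both exhausted: a complete alignment
    for letters, sounds in rules:
        # A rule must consume at least one letter, which guarantees termination.
        if letters and spelling.startswith(letters) and phones.startswith(sounds):
            rest = align_word(spelling[len(letters):], phones[len(sounds):], rules)
            if rest is not None:
                return [(letters, sounds)] + rest
            # Otherwise backtrack and try the next rule in the list.
    return None

# Illustrative rules only; these are not taken from the rules file.
RULES: List[Rule] = [("ph", "f"), ("p", "p"), ("h", "h"),
                     ("o", "o"), ("n", "n"), ("e", "")]

alignment = align_word("phone", "fon", RULES)
```

Because the rules are tried in listed order, placing a longer correspondence before a shorter one that is its initial segment (or the reverse) changes which alignment is returned when both would succeed.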
After a file of alignments has been created, the programme countAlignments counts each sound-spelling correspondence to produce a file called counts. This file is very much like rules, in that it lists all possible sound-spelling correspondences. The differences are that it ignores context entirely, and so is somewhat shorter (77 of the original 1130 rules differed only in context or were not used), and that it also records how often each correspondence was used. A programme called genRuleTrie takes this information and compiles it into C source code for a static structure that embodies the information in trie form for use by Respell. This source, called RuleTrie.c, is a little byzantine, since essentially variable-length structures (each spelling can have an indefinite number of pronunciations) had to be expressed in fixed-length arrays so that they could be compiled into RuleTrie, pre-initialized, allowing the programme correct to start up instantly.
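The two steps just described, counting correspondences and packing a trie with variable-length pronunciation lists into flat arrays, can be sketched in Python as below. The actual layout of RuleTrie.c is not given here, so the row format (letter, first child, number of children, first pronunciation, number of pronunciations) and the names `count_correspondences`, `flatten` and `lookup` are assumptions chosen only to show the general idea of replacing pointers with array offsets.

```python
from collections import Counter

def count_correspondences(alignments):
    """Count each (letters, sounds) pair over all alignments, ignoring context."""
    counts = Counter()
    for alignment in alignments:
        counts.update(alignment)
    return counts

def flatten(counts):
    """Pack a spelling-keyed trie into two flat tables.

    nodes rows: [letter, first_child, n_children, first_pron, n_prons]
    prons rows: (sounds, count)
    """
    # First build an ordinary nested trie keyed on the spelling letters.
    root = {"kids": {}, "prons": []}
    for (letters, sounds), n in sorted(counts.items()):
        node = root
        for ch in letters:
            node = node["kids"].setdefault(ch, {"kids": {}, "prons": []})
        node["prons"].append((sounds, n))

    # Breadth-first layout: the children of a node occupy consecutive rows,
    # so a (first_child, n_children) pair replaces a variable-length list.
    nodes, prons, queue = [["", 0, 0, 0, 0]], [], [root]
    i = 0
    while i < len(queue):  # queue[i] is the nested node behind nodes[i]
        node, row = queue[i], nodes[i]
        row[3], row[4] = len(prons), len(node["prons"])
        prons.extend(node["prons"])
        row[1], row[2] = len(nodes), len(node["kids"])
        for ch in sorted(node["kids"]):
            nodes.append([ch, 0, 0, 0, 0])
            queue.append(node["kids"][ch])
        i += 1
    return nodes, prons

def lookup(nodes, prons, letters):
    """Return the (sounds, count) entries recorded for a spelling fragment."""
    i = 0
    for ch in letters:
        first, n = nodes[i][1], nodes[i][2]
        for j in range(first, first + n):
            if nodes[j][0] == ch:
                i = j
                break
        else:
            return []  # no such spelling fragment in the trie
    start, count = nodes[i][3], nodes[i][4]
    return prons[start:start + count]

# Illustrative alignments only; not taken from the real alignment file.
sample = [
    [("ph", "f"), ("o", "o"), ("n", "n"), ("e", "")],
    [("p", "p"), ("o", "o"), ("t", "t")],
    [("n", "n"), ("o", "o")],
]
nodes, prons = flatten(count_correspondences(sample))
```

Tables of this shape contain only fixed-size rows, so a generator could emit them as initialized static arrays in C source, which is the general property the text attributes to RuleTrie.c.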
At this point we can assemble the various component modules of correct and run the programme.