US 7,610,189 B2
Method and apparatus for efficient segmentation of compound words using probabilistic breakpoint traversal
Andrew William Mackie, Los Gatos, Calif. (US)
Assigned to Nuance Communications, Inc., Burlington, Mass. (US)
Filed on Oct. 18, 2001, as Appl. No. 10/42,528.
Prior Publication US 2003/0097252 A1, May 22, 2003
Int. Cl. G06F 17/28 (2006.01)
U.S. Cl. 704/9
10 Claims
 
10. A method for segmenting compound words in an unrestricted natural-language input, the method comprising:
receiving a natural-language input consisting of a plurality of characters;
constructing a set of breakpoints in the natural-language input;
combining weights of tetragraph contexts that precede and follow each breakpoint to assign a weight to the breakpoint in the natural-language input;
traversing substrings of the natural-language input in an order determined by the weights assigned to the breakpoints;
identifying a plurality of linkable components by the traversal of substrings wherein a linkable component is identified by locating the component in a lexicon; and
returning a segmented string consisting of a plurality of linkable components spanning the natural-language input, wherein the segmented string is interpreted as a compound word.
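
The steps recited in claim 10 can be illustrated with a minimal sketch. It is not the patented implementation: the lexicon, the tetragraph weight tables, the default weight, the choice of addition to combine the preceding and following weights, and the names breakpoint_weight and segment are all assumptions made for illustration. The sketch weights each breakpoint from its surrounding four-character contexts, tries substrings in order of descending breakpoint weight, and accepts a substring only if it is found in the lexicon.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy data; a real system would derive these from a corpus.
LEXICON: Set[str] = {"book", "shelf", "keeper", "house", "boat"}
# Assumed weights for the tetragraph ending at (left) or starting at (right) a breakpoint.
LEFT_WEIGHTS: Dict[str, float] = {"book": 0.9, "ouse": 0.8}
RIGHT_WEIGHTS: Dict[str, float] = {"shel": 0.7, "boat": 0.9, "keep": 0.8}
DEFAULT_WEIGHT = 0.01  # assumed weight for unseen contexts


def breakpoint_weight(text: str, i: int) -> float:
    """Combine (here: add, an assumption) the weights of the contexts around position i."""
    left = text[max(0, i - 4):i]   # up to four characters preceding the breakpoint
    right = text[i:i + 4]          # up to four characters following the breakpoint
    return LEFT_WEIGHTS.get(left, DEFAULT_WEIGHT) + RIGHT_WEIGHTS.get(right, DEFAULT_WEIGHT)


def segment(text: str) -> Optional[List[str]]:
    """Return lexicon components spanning text, or None if no compound reading is found."""
    # One breakpoint between every pair of adjacent characters.
    weights = {i: breakpoint_weight(text, i) for i in range(1, len(text))}

    def search(start: int) -> Optional[List[str]]:
        if start == len(text):
            return []  # the whole input has been spanned
        # Try the most heavily weighted breakpoints first, then the end of the input.
        ends = sorted((i for i in weights if i > start), key=lambda i: -weights[i])
        ends.append(len(text))
        for end in ends:
            piece = text[start:end]
            if piece in LEXICON:  # a linkable component must be in the lexicon
                rest = search(end)
                if rest is not None:
                    return [piece] + rest
        return None

    result = search(0)
    # A compound requires a plurality (two or more) of components.
    return result if result and len(result) >= 2 else None


print(segment("bookshelf"))
print(segment("houseboat"))
print(segment("book"))
```

With the toy data above, the highest-weighted breakpoint in "bookshelf" falls between "book" and "shelf", so that split is tried first and the lexicon lookups succeed without examining the lower-weighted breakpoints before it. A single lexicon word such as "book" yields no compound segmentation because it does not produce two or more components.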