US 7,502,739 B2
Intonation generation method, speech synthesis apparatus using the method and voice server
Takashi Saito, Tokyo-to (Japan); and Masaharu Sakamoto, Yokohama (Japan)
Assigned to International Business Machines Corporation, Armonk, N.Y. (US)
Filed on Jan. 24, 2005, as Appl. No. 10/784,044.
Prior Publication US 2005/0114137 A1, May 26, 2005
Int. Cl. G10L 13/00 (2006.01); G10L 13/06 (2006.01)
U.S. Cl. 704—260  [704/266; 704/268] 2 Claims
 
1. A speech synthesis apparatus for performing a text-to-speech synthesis to generate synthesized speech, comprising:
a text analysis unit for performing linguistic analysis of input text as a processing target and acquiring language information therefrom and providing speech output to a prosody control unit;
a first database for storing intonation patterns of actual speech;
a prosody control unit for receiving speech output from the text analysis unit and for generating a prosody comprising determining pitch, length and intensity of a sound for each phoneme comprising said speech and a rhythm of speech with positions of pauses for audibly outputting the text and providing the prosody to a speech generation unit; and
a speech generation unit for receiving the prosody from the prosody control unit and for generating synthesized speech based on the prosody generated by the prosody control unit,
wherein the prosody control unit includes:
an outline estimation section for estimating an outline of an intonation for each assumed accent phrase configuring the text based on language information acquired by the text analysis unit, wherein the outline estimation section defines the outline of the intonation at least by a maximum value of a frequency level in a segment of the assumed accent phrase and relative level offsets in a starting end and termination end of the segment;
a shape element selection section for selecting an intonation pattern from the database based on the outline of the intonation, the outline having been estimated by the outline estimation section, and wherein the shape element selection section selects an intonation pattern approximate in shape to the outline of the intonation, the outline having been estimated by the outline estimation section, among the intonation patterns of the actual speech, the intonation patterns having been accumulated in the database; and
a shape element connection section for connecting the intonation pattern for each assumed accent phrase to the intonation pattern for another assumed accent phrase, each intonation pattern having been selected by the shape element selection section, to generate an intonation pattern of an entire body of the text, wherein the shape element connection section connects the intonation pattern for each assumed accent phrase to the other, the intonation pattern having been selected by the shape element selection section, after adjusting a frequency level of the assumed accent phrase based on the outline of the intonation, the outline having been estimated by the outline estimation section.
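
The three sections recited for the prosody control unit (outline estimation, shape element selection, shape element connection) can be illustrated with a minimal sketch. It is not the patented implementation. The names `Outline`, `pattern_outline`, `select_pattern` and `connect` are hypothetical, and the sketch assumes that each stored intonation pattern is a plain list of frequency levels (for example log-F0 samples). The squared-difference shape distance is also an assumption, since the claim only requires a pattern "approximate in shape" to the outline. The outlines are taken as given here; estimating them from language information is outside the sketch.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Outline:
    """Outline of the intonation for one assumed accent phrase."""
    max_level: float     # maximum frequency level in the segment
    start_offset: float  # level at the starting end, relative to max_level
    end_offset: float    # level at the termination end, relative to max_level


def pattern_outline(pattern: Sequence[float]) -> Outline:
    """Describe a stored pattern by the same three outline parameters."""
    peak = max(pattern)
    return Outline(peak, pattern[0] - peak, pattern[-1] - peak)


def select_pattern(target: Outline,
                   database: Sequence[Sequence[float]]) -> Sequence[float]:
    """Pick the stored pattern whose shape is closest to the target outline.

    Assumption: shape similarity is the squared difference of the relative
    start and end offsets; the absolute level is ignored at this stage.
    """
    def distance(pattern: Sequence[float]) -> float:
        o = pattern_outline(pattern)
        return ((o.start_offset - target.start_offset) ** 2
                + (o.end_offset - target.end_offset) ** 2)

    return min(database, key=distance)


def connect(outlines: Sequence[Outline],
            database: Sequence[Sequence[float]]) -> List[float]:
    """Select a pattern per accent phrase, adjust its level, and concatenate."""
    contour: List[float] = []
    for target in outlines:
        chosen = select_pattern(target, database)
        # Shift the pattern so its peak matches the estimated maximum level.
        shift = target.max_level - max(chosen)
        contour.extend(level + shift for level in chosen)
    return contour


if __name__ == "__main__":
    # Toy database of three patterns (illustrative values only).
    db = [[4.0, 5.0, 4.5], [5.0, 5.5, 4.0], [3.0, 5.0, 5.0]]
    phrases = [Outline(5.5, -1.0, -0.5), Outline(5.0, -0.5, -1.5)]
    print(connect(phrases, db))
```

In this sketch the selection step compares shape only, and the level adjustment is deferred to the connection step, mirroring the order in which the claim recites the two sections.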