US 11,705,105 B2
	Speech synthesizer for evaluating quality of synthesized speech using artificial intelligence and method of operating the same
Jonghoon Chae, Seoul (KR)
Assigned to LG ELECTRONICS INC., Seoul (KR)
Appl. No. 16/500,021
Filed by LG ELECTRONICS INC., Seoul (KR)
PCT Filed May 15, 2019, PCT No. PCT/KR2019/005840 § 371(c)(1), (2) Date Oct. 1, 2019, PCT Pub. No. WO2020/203926, PCT Pub. Date Nov. 19, 2020.
Prior Publication US 2021/0217403 A1, Jul. 15, 2021
Int. Cl. G10L 13/02 (2013.01); G10L 13/033 (2013.01); G10L 13/08 (2013.01); G10L 25/60 (2013.01); G10L 25/30 (2013.01); G10L 25/51 (2013.01); G06N 3/08 (2023.01)

CPC G10L 13/02 (2013.01) [G06N 3/08 (2013.01); G10L 13/033 (2013.01); G10L 13/08 (2013.01); G10L 25/30 (2013.01); G10L 25/51 (2013.01); G10L 25/60 (2013.01)]

9 Claims

1. A speech synthesizer for evaluating quality of a synthesized speech using artificial intelligence, the speech synthesizer comprising: a database configured to store a synthesized speech corresponding to text, a correct speech corresponding to the text and a speech quality evaluation model for evaluating the quality of the synthesized speech; and a processor configured to:

compare a first speech feature set indicating a feature of the synthesized speech and a second speech feature set indicating a feature of the correct speech, wherein each of the first speech feature set and the second speech feature set includes a pitch of voiceless sound of a speech, a pitch of voiced sound of the speech, a frequency band of the speech, a break index of a word configuring the speech, a pitch of the speech, an utterance speed of the speech, or a pitch contour of the speech,

acquire a quality evaluation index set including indices used to evaluate the quality of the synthesized speech according to a result of the comparing, wherein the quality evaluation index set includes an F0 Frame Error (FFE), a Gross Pitch Error (GPE), a Voicing Decision Error (VDE), a Mel Cepstral Distortion (MCD), a Formant Distance (FD), a Speaker Verification Error (SVE), a Break Index Error (BIE) and a Word Error (WE), and

determine weights as model parameters of the speech quality evaluation model using the acquired quality evaluation index set and the speech quality evaluation model, wherein the processor differently determines the weights according to a synthesis purpose of the synthesized speech and updates the speech quality evaluation model based on the weights to generate an updated speech quality evaluation model,

wherein a weight of the GPE and a weight of the FD are set to be learned to have greater values than weights of other quality evaluation indexes when the synthesis purpose is a normal synthesis for maintaining a tone,

wherein a weight of the VDE and a weight of the FD are set to be learned to have greater values than weights of other quality evaluation indexes when the synthesis purpose is an emotional synthesis for outputting an emotional synthesis speech,

wherein a weight of the FFE and a weight of the MCD are set to be learned to have greater values than weights of other quality evaluation indexes when the synthesis purpose is a personalization synthesis for outputting the synthesized speech suiting a tone of a specific speaker, and

wherein the updated speech quality evaluation model is applied to recognize a wake-up word for activating speech recognition or to generate the synthesized speech from the text.
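
The following is an illustrative sketch only and forms no part of the claim. It shows how three of the recited indices (GPE, VDE, FFE) are commonly computed from frame-aligned F0 tracks, and how a purpose-dependent weighting over the eight indices could be expressed. The claim does not define the indices or the learning procedure, so the 20% pitch tolerance, the convention that an F0 of 0 marks an unvoiced frame, the fixed `high`/`low` starting values, and the names `f0_errors`, `initial_weights` and `weighted_score` are all assumptions made for illustration. In the claimed synthesizer the weights are learned model parameters rather than hand-set constants.

```python
import numpy as np

# The eight indices named in the claim.
INDICES = ("FFE", "GPE", "VDE", "MCD", "FD", "SVE", "BIE", "WE")

# Indices the claim says should carry greater weight for each synthesis purpose.
EMPHASIZED = {
    "normal": ("GPE", "FD"),
    "emotional": ("VDE", "FD"),
    "personalization": ("FFE", "MCD"),
}


def f0_errors(f0_ref, f0_syn, tol=0.2):
    """Return (GPE, VDE, FFE) for two frame-aligned F0 tracks.

    Assumption: an F0 value of 0 marks an unvoiced frame, and a pitch
    error is "gross" when it deviates by more than `tol` (20%) from the
    reference. These are common conventions, not taken from the patent.
    """
    ref = np.asarray(f0_ref, dtype=float)
    syn = np.asarray(f0_syn, dtype=float)
    if ref.shape != syn.shape:
        raise ValueError("F0 tracks must have the same number of frames")

    ref_voiced = ref > 0
    syn_voiced = syn > 0
    both_voiced = ref_voiced & syn_voiced

    # Gross pitch errors are only defined on frames voiced in both tracks.
    gross = np.zeros_like(both_voiced)
    gross[both_voiced] = (
        np.abs(syn[both_voiced] - ref[both_voiced]) > tol * ref[both_voiced]
    )
    voicing_err = ref_voiced != syn_voiced

    n_frames = ref.size
    gpe = gross.sum() / both_voiced.sum() if both_voiced.any() else 0.0
    vde = voicing_err.sum() / n_frames
    ffe = (gross.sum() + voicing_err.sum()) / n_frames
    return float(gpe), float(vde), float(ffe)


def initial_weights(purpose, high=2.0, low=1.0):
    """Hypothetical starting weights: emphasized indices get `high`,
    the rest get `low`, then all are normalized to sum to 1."""
    emphasized = EMPHASIZED[purpose]
    raw = {name: (high if name in emphasized else low) for name in INDICES}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


def weighted_score(index_values, weights):
    """Weighted sum of the quality indices (lower means closer to the
    correct speech, since every index here is an error or a distance)."""
    return sum(weights[name] * index_values[name] for name in INDICES)
```

A caller would compute each index from the synthesized and correct speech, pick the weight set for the synthesis purpose, and pass both to `weighted_score`. A trained model would replace `initial_weights` with parameters fitted to reference quality ratings.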