US 11,705,107 B2
Real-time neural text-to-speech
Sercan O. Arik, San Francisco, CA (US); Mike Chrzanowski, Palo Alto, CA (US); Adam Coates, Mountain View, CA (US); Gregory Diamos, San Jose, CA (US); Andrew Gibiansky, Mountain View, CA (US); John Miller, Palo Alto, CA (US); Andrew Ng, Mountain View, CA (US); Jonathan Raiman, Palo Alto, CA (US); Shubhahrata Sengupta, Menlo Park, CA (US); and Mohammad Shoeybi, Los Altos, CA (US)
Assigned to Baidu USA LLC, Sunnyvale, CA (US)
Filed by Baidu USA, LLC, Sunnyvale, CA (US)
Filed on Oct. 1, 2020, as Appl. No. 17/061,433.
Application 17/061,433 is a continuation of application No. 15/882,926, filed on Jan. 29, 2018, granted, now 10,872,598.
Claims priority of provisional application 62/463,482, filed on Feb. 24, 2017.
Prior Publication US 2021/0027762 A1, Jan. 28, 2021
Int. Cl. G10L 13/08 (2013.01); G10L 13/027 (2013.01); G10L 25/30 (2013.01); G06N 3/082 (2023.01); G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 3/02 (2006.01); G06F 40/242 (2020.01); G06N 3/047 (2023.01)

CPC G10L 13/08 (2013.01) [G06N 3/044 (2023.01); G06N 3/045 (2023.01); G06N 3/082 (2013.01); G10L 13/027 (2013.01); G10L 25/30 (2013.01); G06F 40/242 (2020.01); G06N 3/02 (2013.01); G06N 3/047 (2023.01)]

20 Claims

1. A computer-implemented method for using a text-to-speech (TTS) system to synthesize human speech from text, comprising:

using a trained grapheme-to-phoneme model to convert written text to phonemes corresponding to the written text;

inputting the phonemes into either: (1) a trained phoneme duration and fundamental frequency model or (2) a trained phoneme duration model and a trained fundamental frequency model, to output for a phoneme:

a phoneme duration;

a probability that the phoneme is voiced; and

a fundamental frequency profile; and

using a trained neural network audio synthesis model that receives the phonemes, the phoneme durations, the fundamental frequency profiles for the phonemes, and for each phoneme, a probability whether the phoneme is voiced as an input to generate a signal representing synthesized human speech of the written text.
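
The data flow recited in claim 1 can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in chosen for illustration: the toy lexicon, the fixed duration and fundamental frequency values, the 16 kHz sample rate, and the sine-wave generator are assumptions and are not the trained models of the claim. The sketch only shows which values pass between the three stages: phonemes, then a duration, a voicing probability, and a fundamental frequency profile per phoneme, then an audio signal.

```python
import math
from dataclasses import dataclass
from typing import List

SAMPLE_RATE = 16000  # assumed sample rate; the claim does not specify one


@dataclass
class PhonemeFeatures:
    phoneme: str
    duration_s: float        # phoneme duration, in seconds
    voiced_prob: float       # probability that the phoneme is voiced
    f0_profile: List[float]  # fundamental frequency profile, in Hz


# Hypothetical toy lexicon standing in for a trained grapheme-to-phoneme model.
TOY_LEXICON = {"hi": ["HH", "AY"], "you": ["Y", "UW"]}
TOY_VOICELESS = {"HH"}  # assumption: treated as unvoiced in this sketch


def toy_g2p(text: str) -> List[str]:
    """Stage 1 placeholder: look up each word's phonemes in the toy lexicon."""
    phonemes: List[str] = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON[word])
    return phonemes


def toy_prosody(phonemes: List[str]) -> List[PhonemeFeatures]:
    """Stage 2 placeholder: emit fixed duration, voicing probability, and F0."""
    features = []
    for p in phonemes:
        voiced_prob = 0.1 if p in TOY_VOICELESS else 0.9
        features.append(PhonemeFeatures(p, 0.1, voiced_prob, [120.0, 130.0]))
    return features


def toy_synthesize(features: List[PhonemeFeatures]) -> List[float]:
    """Stage 3 placeholder: a sine wave following F0, silent when unvoiced."""
    samples: List[float] = []
    for f in features:
        n = round(f.duration_s * SAMPLE_RATE)
        phase = 0.0
        for i in range(n):
            # Linearly interpolate the F0 profile across the phoneme.
            pos = i / n * (len(f.f0_profile) - 1)
            lo = int(pos)
            hi = min(lo + 1, len(f.f0_profile) - 1)
            f0 = f.f0_profile[lo] + (pos - lo) * (f.f0_profile[hi] - f.f0_profile[lo])
            phase += 2.0 * math.pi * f0 / SAMPLE_RATE
            voiced = f.voiced_prob >= 0.5
            samples.append(f.voiced_prob * math.sin(phase) if voiced else 0.0)
    return samples


def synthesize_speech(text: str) -> List[float]:
    """Chain the three placeholder stages in the order the claim recites."""
    return toy_synthesize(toy_prosody(toy_g2p(text)))
```

Calling `synthesize_speech("hi you")` in this sketch yields four phonemes of 0.1 s each, so the returned list holds 6400 samples at the assumed sample rate, with the first phoneme silent because its voicing probability is below 0.5.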