US 11,721,319 B2
Artificial intelligence device and method for generating speech having a different speech style
Minook Kim, Seoul (KR); Yongchul Park, Seoul (KR); Sungmin Han, Seoul (KR); Siyoung Yang, Seoul (KR); Sangki Kim, Seoul (KR); and Juyeong Jang, Seoul (KR)
Assigned to LG ELECTRONICS INC., Seoul (KR)
Filed by LG ELECTRONICS INC., Seoul (KR)
Filed on Feb. 27, 2020, as Appl. No. 16/803,941.
Claims priority of application No. 10-2019-0162622 (KR), filed on Dec. 9, 2019.
Prior Publication US 2021/0174782 A1, Jun. 10, 2021
Int. Cl. G10L 13/10 (2013.01); G06N 5/04 (2023.01); G10L 13/047 (2013.01); G06N 20/00 (2019.01)
CPC G10L 13/10 (2013.01) [G06N 5/04 (2013.01); G06N 20/00 (2019.01); G10L 13/047 (2013.01)] 8 Claims
OG exemplary drawing
 
1. A method for generating a synthesized speech having a different speech style, the method comprising:
acquiring audio data having different speech styles for different emotions;
generating a condition vector relating to a condition for determining the speech style of the audio data;
reducing a dimension of the condition vector to a predetermined reduction dimension;
acquiring a sparse code vector based on a dictionary vector acquired through sparse dictionary coding with respect to the condition vector having the predetermined reduction dimension;
changing a vector element value included in the sparse code vector;
acquiring the condition vector having the predetermined reduction dimension from the sparse code vector having the changed vector element value based on the dictionary vector;
acquiring the condition vector in which the condition for determining the speech style is changed by extending the dimension of the condition vector having the predetermined reduction dimension;
acquiring a prosody vector representing each of at least one speech style;
generating a prosody embedding vector having a changed speech style using the prosody vector and the condition vector having the changed condition for determining the speech style;
acquiring text data; and
generating a synthesized speech based on the text data and the prosody embedding vector,
wherein the reduced dimension of the condition vector is determined by discarding an eigenvector of a dimension with a variance smaller than a reference variance, based on the order of the eigenvalues,
wherein the acquired sparse code vector is determined using the dictionary vector with significant distinctive elements, and the sparse code vector includes a plurality of vector element values comprising at least one valid vector element value and a remainder,
wherein the speech style is changed based on the sparse code vector mixing a first valid vector element value determining a first emotion and a second valid vector element value determining a second emotion different from the first emotion, and the second valid vector element value is different from the first valid vector element value,
wherein the condition vector having the predetermined reduction dimension, which is a first dimension, is acquired from the sparse code vector having the changed vector element value, and
wherein the extended dimension of the condition vector, which is a second dimension twice the first dimension of the condition vector, is determined by Inverse Principal Component Analysis (IPCA).
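The claim walks through a concrete pipeline for the condition vector: reduce its dimension with PCA, sparse-code the reduced vector against a learned dictionary, edit the sparse code so that elements tied to two different emotions are mixed, reconstruct the reduced vector from the edited code, and restore the original dimension by inverse PCA. The sketch below is one minimal way those steps could be composed with NumPy and scikit-learn; the array shapes, the PCA and dictionary sizes, the sparsity level, and the choice of which code elements stand for the first and second emotions are illustrative assumptions, not values taken from the patent.

```python
# Illustrative sketch of the claimed condition-vector manipulation, assuming
# condition vectors are fixed-length embeddings stored as rows of a NumPy
# array.  Sizes and index choices below are assumptions, not patent values.
import numpy as np
from sklearn.decomposition import PCA, DictionaryLearning

rng = np.random.default_rng(0)
cond_vectors = rng.standard_normal((200, 64))   # one condition vector per utterance

# 1) Reduce the condition-vector dimension.  PCA orders eigenvectors by
#    eigenvalue, so dropping the low-variance components mirrors discarding
#    eigenvectors whose variance falls below a reference variance.
reduction_dim = 32                              # first dimension (half of 64)
pca = PCA(n_components=reduction_dim).fit(cond_vectors)
reduced = pca.transform(cond_vectors)

# 2) Sparse dictionary coding of the reduced condition vectors: each row is
#    approximated by a few dictionary atoms, so only a handful of vector
#    element values are valid (non-zero) and the remainder stay at zero.
dict_learner = DictionaryLearning(
    n_components=48,                            # number of dictionary vectors (atoms)
    transform_algorithm="omp",
    transform_n_nonzero_coefs=3,                # sparsity of each code
    random_state=0,
)
sparse_codes = dict_learner.fit_transform(reduced)
dictionary = dict_learner.components_           # dictionary vectors, shape (48, 32)

# 3) Change the speech style by editing one sparse code: mix the valid element
#    associated with a first emotion with an element associated with a second
#    emotion (the index choices here are hypothetical).
code = sparse_codes[0].copy()
first_idx = int(np.argmax(np.abs(code)))        # element driving the first emotion
second_idx = (first_idx + 1) % code.shape[0]    # assumed second-emotion atom
code[second_idx] = 0.5 * code[first_idx]        # inject the second emotion

# 4) Reconstruct the reduced-dimension condition vector from the edited code
#    via the dictionary, then 5) extend it back to the original dimension
#    (twice the first dimension here) with the inverse PCA mapping.
reduced_changed = code @ dictionary             # back to the first dimension (32,)
cond_changed = pca.inverse_transform(reduced_changed.reshape(1, -1))[0]

print(cond_changed.shape)                       # (64,), the extended second dimension
```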
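For the final steps, the claim combines a prosody vector representing a speech style with the changed condition vector to obtain a prosody embedding vector, and conditions speech synthesis on that embedding together with the text. The fragment below only illustrates the data flow; the concatenation-plus-projection fusion and the synthesize() stub are assumptions, since the claim does not fix a particular combination operator or text-to-speech architecture.

```python
# Hypothetical illustration of fusing the prosody vector with the changed
# condition vector and handing the result to a TTS model.  The projection and
# the synthesize() stand-in are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(1)
prosody_vector = rng.standard_normal(16)        # represents one speech style
cond_changed = rng.standard_normal(64)          # condition vector with changed style

# Assumed learned projection producing the prosody embedding vector.
W = rng.standard_normal((32, 16 + 64))
prosody_embedding = W @ np.concatenate([prosody_vector, cond_changed])

def synthesize(text: str, embedding: np.ndarray) -> np.ndarray:
    """Stand-in for a TTS model conditioned on the prosody embedding."""
    # A real system would run an acoustic model and vocoder here.
    return np.zeros(16000)                      # placeholder 1-second waveform

waveform = synthesize("Hello, world.", prosody_embedding)
```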