US 7,464,031 B2
Speech recognition utilizing multitude of speech features
Scott E. Axelrod, Mount Kisco, N.Y. (US); Sreeram Viswanath Balakrishnan, Los Altos, Calif. (US); Stanley F. Chen, Yorktown Heights, N.Y. (US); Yuging Gao, Mount Kisco, N.Y. (US); Ramesh A. Gopinath, Millwood, N.Y. (US); Hong-Kwang Kuo, Pleasantville, N.Y. (US); Benoit Maison, White Plains, N.Y. (US); David Nahamoo, White Plains, N.Y. (US); Michael Alan Picheny, White Plains, N.Y. (US); George A. Saon, Old Greenwich, Conn. (US); and Geoffrey G. Zweig, Ridgefield, Conn. (US)
Assigned to International Business Machines Corporation, Armonk, N.Y. (US)
Filed on Nov. 28, 2003, as Appl. No. 10/724,536.
Prior Publication US 2005/0119885 A1, Jun. 02, 2005
Int. Cl. G10L 15/00 (2006.01); G10L 15/20 (2006.01)
U.S. Cl. 704—236 [704/240; 704/251]
12 Claims
 
1. A speech recognition system, comprising:
a features extractor that extracts a multitude of speech features directly from input speech;
a log-linear function that receives the multitude of speech features obtained from the input speech and determines a posterior probability of each of a plurality of hypothesized linguistic units given the extracted multitude of speech features by applying the formula:

$$P(H_j \mid \text{features}) = P(w_1^k \mid o_1^T) = \prod_{i=1}^{k} P(w_i \mid w_1^{i-1}, o_1^T) \qquad (1)$$
where:
$H_j$ is a jth hypothesis that contains a word (or other linguistic unit) sequence $w_1^k = w_1 w_2 \ldots w_k$,
$i$ is an index pointing to the ith word (or unit),
$k$ is a number of words (units) in the hypothesis,
$T$ is a length of the speech signal,
$w_1^k$ is a sequence of words associated with the hypothesis $H_j$, and
$o_1^T$ is a sequence of acoustic observations,
with conditional probabilities represented by a maximum entropy log-linear model:

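$$P(w_i \mid w_1^{i-1}, o_1^T) = \frac{\exp\left(\sum_j \lambda_j f_j(w_i, w_1^{i-1}, o_1^T)\right)}{Z(w_1^{i-1}, o_1^T)} \qquad (2)$$
where:
$\lambda_j$ are parameters of the log-linear model,
$f_j$ are a multitude of features extracted, and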
$Z$ is a normalization factor that ensures that Equation 2 is a true probability (that is, the probabilities sum to 1); and
a search device that analyzes the posterior probabilities determined by the log-linear function to determine a recognized output of unknown utterances.
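
The structure recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the three-word vocabulary, the two feature functions, the weights, and the observation values are all hypothetical, chosen only to make the arithmetic visible. The sketch assumes the conditional model is normalized over a small closed vocabulary, works in the log domain for the product in Equation 1, and stands in for the search device with a plain argmax over an explicit list of hypotheses.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical feature signature: f(word, history, observations) -> float
Feature = Callable[[str, Sequence[str], Sequence[float]], float]


def conditional_prob(word: str, history: Sequence[str], obs: Sequence[float],
                     vocab: Sequence[str], features: Sequence[Feature],
                     lambdas: Sequence[float]) -> float:
    """Equation 2: P(w_i | w_1^{i-1}, o_1^T) under a log-linear model."""
    def score(w: str) -> float:
        # Weighted feature sum: sum_j lambda_j * f_j(w, history, obs)
        return sum(lam * f(w, history, obs) for lam, f in zip(lambdas, features))

    scores = {w: score(w) for w in vocab}
    shift = max(scores.values())  # subtract the max before exp for stability
    z = sum(math.exp(s - shift) for s in scores.values())  # normalizer Z
    return math.exp(scores[word] - shift) / z


def hypothesis_log_posterior(words: Sequence[str], obs: Sequence[float],
                             vocab: Sequence[str], features: Sequence[Feature],
                             lambdas: Sequence[float]) -> float:
    """Equation 1 in the log domain: sum over i of log P(w_i | w_1^{i-1}, o_1^T)."""
    return sum(
        math.log(conditional_prob(w, words[:i], obs, vocab, features, lambdas))
        for i, w in enumerate(words)
    )


def search(hypotheses: Sequence[List[str]], obs: Sequence[float],
           vocab: Sequence[str], features: Sequence[Feature],
           lambdas: Sequence[float]) -> List[str]:
    """Stand-in for the search device: return the highest-scoring hypothesis."""
    return max(
        hypotheses,
        key=lambda h: hypothesis_log_posterior(h, obs, vocab, features, lambdas),
    )


# --- Toy setup; every name and value below is an illustrative assumption ---
def f_acoustic(word: str, history: Sequence[str], obs: Sequence[float]) -> float:
    # Fires for "yes" when the mean observation value is high.
    return 1.0 if word == "yes" and sum(obs) / len(obs) > 0.5 else 0.0


def f_bigram(word: str, history: Sequence[str], obs: Sequence[float]) -> float:
    # Fires when "no" directly follows "oh".
    return 1.0 if history and history[-1] == "oh" and word == "no" else 0.0


VOCAB = ["yes", "no", "oh"]
FEATURES = [f_acoustic, f_bigram]
LAMBDAS = [2.0, 1.0]
OBS = [0.9, 0.8, 0.7]
HYPOTHESES = [["oh", "yes"], ["oh", "no"]]

best = search(HYPOTHESES, OBS, VOCAB, FEATURES, LAMBDAS)
print(best)
```

In this toy setup the acoustic feature carries the larger weight, so the hypothesis ending in "yes" scores higher than the one ending in "no". A practical recognizer would search a lattice or use beam search rather than enumerate hypotheses, and would learn the weights from training data.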