| US 7,464,031 B2 | ||
| Speech recognition utilizing multitude of speech features | ||
| Scott E. Axelrod, Mount Kisco, N.Y. (US); Sreeram Viswanath Balakrishnan, Los Altos, Calif. (US); Stanley F. Chen, Yorktown Heights, N.Y. (US); Yuqing Gao, Mount Kisco, N.Y. (US); Ramesh A. Gopinath, Millwood, N.Y. (US); Hong-Kwang Kuo, Pleasantville, N.Y. (US); Benoit Maison, White Plains, N.Y. (US); David Nahamoo, White Plains, N.Y. (US); Michael Alan Picheny, White Plains, N.Y. (US); George A. Saon, Old Greenwich, Conn. (US); and Geoffrey G. Zweig, Ridgefield, Conn. (US) | ||
| Assigned to International Business Machines Corporation, Armonk, N.Y. (US) | ||
| Filed on Nov. 28, 2003, as Appl. No. 10/724,536. | ||
| Prior Publication US 2005/0119885 A1, Jun. 02, 2005 | ||
| Int. Cl. G10L 15/00 (2006.01); G10L 15/20 (2006.01) | ||
| U.S. Cl. 704—236 [704/240; 704/251] | 12 Claims |

| 1. A speech recognition system, comprising:
a features extractor that extracts a multitude of speech features directly from input speech;
a log-linear function that receives the multitude of speech features obtained from the input speech and determines a posterior
probability of each of a plurality of hypothesized linguistic units given the extracted multitude of speech features by applying
the formula:
$$P(H_j \mid \text{features}) = P(w_1^k \mid o_1^T) = \prod_{i=1}^{k} P(w_i \mid w_1^{i-1}, o_1^T) \qquad (1)$$
where:
$H_j$ is a jth hypothesis that contains a sequence of words (or other linguistic units) $w_1^k = w_1 w_2 \ldots w_k$,
$i$ is an index pointing to the ith word (or unit),
$k$ is a number of words (units) in the hypothesis,
$T$ is a length of the speech signal,
$w_1^k$ is a sequence of words associated with the hypothesis $H_j$, and
$o_1^T$ is a sequence of acoustic observations,
with conditional probabilities represented by a maximum entropy log-linear model:
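$$P(w_i \mid w_1^{i-1}, o_1^T) = \frac{\exp\left(\sum_j \lambda_j f_j(w_i, w_1^{i-1}, o_1^T)\right)}{Z(w_1^{i-1}, o_1^T)} \qquad (2)$$
where:
$\lambda_j$ are parameters of the log-linear model,
$f_j$ are a multitude of features extracted, and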
Z is a normalization factor that ensures that Equation 2 is a true probability (will sum up to 1); and
a search device that analyzes the posterior probabilities determined by the log-linear function to determine a recognized
output of unknown utterances.
|
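
The claim's two formulas can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the two toy feature functions, the weights, and the two-word vocabulary are all hypothetical, chosen only to show how a maximum entropy log-linear model turns weighted features into per-word conditional probabilities, how those multiply into a hypothesis posterior, and how a simple search picks the best hypothesis. It assumes the normalization factor Z sums over a small, fully enumerable vocabulary.

```python
import math
from typing import Callable, List, Sequence

# A feature maps (candidate word, word history, acoustic observations) to a number.
Feature = Callable[[str, Sequence[str], Sequence[float]], float]


def word_posterior(word: str, history: Sequence[str], obs: Sequence[float],
                   vocab: Sequence[str], features: Sequence[Feature],
                   weights: Sequence[float]) -> float:
    """P(w_i | history, observations) under a log-linear (max-ent) model."""
    def score(w: str) -> float:
        # Weighted sum of feature values: sum_j lambda_j * f_j(w, history, obs)
        return sum(lam * f(w, history, obs) for lam, f in zip(weights, features))

    scores = {w: score(w) for w in vocab}
    # log Z via the log-sum-exp trick for numerical stability
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return math.exp(scores[word] - log_z)


def hypothesis_posterior(words: Sequence[str], obs: Sequence[float],
                         vocab: Sequence[str], features: Sequence[Feature],
                         weights: Sequence[float]) -> float:
    """Product over word positions of the per-word conditional probabilities."""
    p = 1.0
    for i, w in enumerate(words):
        p *= word_posterior(w, words[:i], obs, vocab, features, weights)
    return p


def best_hypothesis(hypotheses: Sequence[List[str]], obs: Sequence[float],
                    vocab: Sequence[str], features: Sequence[Feature],
                    weights: Sequence[float]) -> List[str]:
    """Stand-in for the search device: exhaustive argmax over given hypotheses."""
    return max(hypotheses,
               key=lambda h: hypothesis_posterior(h, obs, vocab, features, weights))


# Toy setup (all values hypothetical).
vocab = ["yes", "no"]
features = [
    # Acoustic-style feature: fires for "yes" when the observations sum positive.
    lambda w, h, o: 1.0 if w == "yes" and sum(o) > 0 else 0.0,
    # Language-style feature: fires when the word repeats the previous word.
    lambda w, h, o: 1.0 if h and h[-1] == w else 0.0,
]
weights = [1.5, -0.5]
obs = [0.2, 0.4]

best = best_hypothesis([["yes", "no"], ["no", "no"]], obs, vocab, features, weights)
```

Because each per-word distribution is normalized over the vocabulary, the posteriors of all word sequences of a fixed length sum to 1, which is the property the normalization factor Z is there to guarantee. A practical recognizer would replace the exhaustive argmax with a pruned search over a much larger hypothesis space.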