US 11,817,101 B2
Speech recognition using phoneme matching
Wilson Hsu, Waterloo (CA); Kaheer Suleman, Cambridge (CA); and Joshua Pantony, New York, NY (US)
Assigned to Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed by Microsoft Technology Licensing, LLC, Redmond, WA (US)
Filed on Nov. 4, 2020, as Appl. No. 17/089,228.
Application 17/089,228 is a continuation of application No. 14/490,321, filed on Sep. 18, 2014, granted, now Pat. No. 10,885,918.
Claims priority of provisional application 61/879,796, filed on Sep. 19, 2013.
Prior Publication US 2021/0074297 A1, Mar. 11, 2021
Int. Cl. G10L 15/26 (2006.01); G10L 15/08 (2006.01); G10L 15/22 (2006.01); G10L 15/187 (2013.01); G10L 15/32 (2013.01); G10L 15/06 (2013.01)
CPC G10L 15/26 (2013.01) [G10L 15/06 (2013.01); G10L 15/187 (2013.01); G10L 15/22 (2013.01); G10L 15/32 (2013.01); G10L 2015/088 (2013.01)]
20 Claims
 
1. A computer implemented method of converting an audio input into a text representation associated with a user, the method comprising:
receiving, by a conversion processor, the audio input;
generating, by a first speech recognition processor, a first text representation of the audio input, wherein the first speech recognition processor uses a first natural language model for speech recognition of a first language;
generating, by a second speech recognition processor, a second text representation of the audio input using a second natural language model, wherein the second natural language model recognizes a word not in the first natural language model for speech recognition, the second text representation includes the word not in the first natural language model, the second text representation includes a phoneme of the word, and the second natural language model is distinct from the first natural language model;
aligning, by the conversion processor, based on a phoneme sequence associated with the first and second text representations, the first text representation and the second text representation;
generating, by the conversion processor, based at least on the aligned first and second text representations and a likelihood of the phoneme of the word being a part of the first text representation, a third text representation; and
outputting the third text representation as a personalized recognized text representation of the audio input.
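
The aligning and generating steps of claim 1 can be illustrated with a minimal sketch. It is not the patented implementation. Everything in it is an assumption made for illustration: the toy pronunciation lexicon, the vocabulary of the first model, the function names, the use of Python's difflib.SequenceMatcher as the alignment method, and the use of the fraction of aligned phonemes as a stand-in for the claimed likelihood.

```python
from difflib import SequenceMatcher

# Hypothetical toy pronunciation lexicon (ARPAbet-like symbols), for illustration only.
LEXICON = {
    "call": ["K", "AO", "L"],
    "cal": ["K", "AE", "L"],
    "car": ["K", "AA", "R"],
    "here": ["HH", "IY", "R"],
    "now": ["N", "AW"],
    "kaheer": ["K", "AH", "HH", "IY", "R"],  # personal word, unknown to the first model
}

# Assumed vocabulary of the first (general) language model.
FIRST_VOCAB = {"call", "cal", "car", "here", "now"}


def to_phonemes(words):
    """Flatten words to phonemes; owner[i] is the index of the word that produced phoneme i."""
    phonemes, owner = [], []
    for idx, word in enumerate(words):
        for phoneme in LEXICON[word]:
            phonemes.append(phoneme)
            owner.append(idx)
    return phonemes, owner


def merge_hypotheses(first_text, second_text, first_vocab, threshold=0.6):
    """Build a third text from two recognizer outputs aligned on their phoneme sequences."""
    first, second = first_text.split(), second_text.split()
    p1, own1 = to_phonemes(first)
    p2, own2 = to_phonemes(second)

    # Align the two phoneme sequences; map each matched phoneme in p2 to its position in p1.
    matcher = SequenceMatcher(None, p1, p2, autojunk=False)
    aligned = {}
    for a, b, size in matcher.get_matching_blocks():
        for k in range(size):
            aligned[b + k] = a + k

    replacements = []
    for j, word in enumerate(second):
        if word in first_vocab:
            continue  # only words the first model cannot produce are candidates
        positions = [i for i, o in enumerate(own2) if o == j]
        hits = [aligned[i] for i in positions if i in aligned]
        # Stand-in likelihood: share of the word's phonemes found in the first hypothesis.
        likelihood = len(hits) / len(positions)
        if hits and likelihood >= threshold:
            start, end = own1[min(hits)], own1[max(hits)]
            replacements.append((start, end, word))

    # Apply replacements right to left so earlier word indices stay valid.
    third = list(first)
    for start, end, word in sorted(replacements, reverse=True):
        third[start:end + 1] = [word]
    return " ".join(third)


third_text = merge_hypotheses("call car here now", "cal kaheer now", FIRST_VOCAB)
print(third_text)
```

In this sketch the first hypothesis supplies the ordinary words and the second hypothesis contributes only the out-of-vocabulary word, which replaces the span of first-hypothesis words whose phonemes it aligns with. A production system would more likely use acoustic or model confidence scores rather than a simple match ratio, and would need to handle overlapping replacement spans.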