US 9,812,146 B1
Synchronization of inbound and outbound audio in a heterogeneous echo cancellation system
Pushkaraksha Gejji, Cupertino, CA (US); and Arvind Mandhani, San Francisco, CA (US)
Assigned to AMAZON TECHNOLOGIES, INC., Seattle, WA (US)
Filed by Amazon Technologies, Inc., Seattle, WA (US)
Filed on Feb. 16, 2016, as Appl. No. 15/044,495.
Int. Cl. H04B 3/20 (2006.01); G10L 21/02 (2013.01); G10L 21/0208 (2013.01); G10L 15/22 (2006.01); G10L 21/0216 (2013.01); H04B 3/23 (2006.01); H04M 9/08 (2006.01)
CPC G10L 21/0205 (2013.01) [G10L 2015/223 (2013.01); G10L 2021/02082 (2013.01); G10L 2021/02163 (2013.01); H04B 3/23 (2013.01); H04M 9/082 (2013.01)]
18 Claims
1. A computer-implemented method for synchronizing microphone inputs with speaker outputs and removing an echo from an audio signal to isolate received speech, the method comprising:
sending, by the device to a first speaker using a first sampling rate, a first outgoing audio frame;
operating, by the device, a first counter to store a first value, the first counter configured to store a natural number for the first outgoing audio frame and increment the natural number with each subsequent outgoing audio frame that is sent to the first speaker;
sending, by the device to the first speaker using the first sampling rate, a second outgoing audio frame;
incrementing the first counter to a second value;
sending, by the device to the first speaker using the first sampling rate, a third outgoing audio frame comprising first audio data;
incrementing the first counter to a third value;
receiving, by the device from a first microphone of the device, a first incoming audio frame at a second sampling rate lower than the first sampling rate, the first incoming audio frame comprising second audio data, the second audio data including speech input and a first representation of audible sound output by the first speaker;
determining, by the device, the third value output by the first counter;
storing, by the device, the third value in the first incoming audio frame;
associating the first incoming audio frame with the third outgoing audio frame;
subtracting, by the device, the first audio data from the second audio data to generate the speech input;
performing, by a server, speech recognition processing on the speech input to determine a command; and
executing, by the device, the command.
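The counter-based association recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the names FrameSynchronizer, IncomingFrame, send_outgoing, receive_incoming, and remove_echo are hypothetical, the ratio between the two sampling rates is assumed to be an integer handled by naive decimation, and the echo is removed by plain sample-wise subtraction, where a practical system would likely use an adaptive filter. The speech recognition and command execution steps are omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class IncomingFrame:
    """Microphone frame; frame_index holds the counter value stored on capture."""
    samples: List[float]
    frame_index: Optional[int] = None


class FrameSynchronizer:
    """Hypothetical helper that tags microphone frames with the speaker-frame counter."""

    def __init__(self, decimation: int = 1) -> None:
        self.counter = 0                         # becomes 1 for the first outgoing frame
        self.sent: Dict[int, List[float]] = {}   # counter value -> outgoing samples
        self.decimation = decimation             # assumed integer ratio of speaker rate to mic rate

    def send_outgoing(self, samples: List[float]) -> int:
        """Record an outgoing frame and increment the counter for it."""
        self.counter += 1
        self.sent[self.counter] = list(samples)
        return self.counter

    def receive_incoming(self, samples: List[float]) -> IncomingFrame:
        """Store the current counter value in the incoming frame."""
        return IncomingFrame(list(samples), frame_index=self.counter)

    def remove_echo(self, frame: IncomingFrame) -> List[float]:
        """Subtract the associated outgoing frame (decimated) from the microphone frame."""
        reference = self.sent[frame.frame_index][:: self.decimation]
        return [m - r for m, r in zip(frame.samples, reference)]


# Example: speaker at twice the microphone rate (assumed), three outgoing frames.
sync = FrameSynchronizer(decimation=2)
sync.send_outgoing([0.0, 0.0, 0.0, 0.0])
sync.send_outgoing([0.0, 0.0, 0.0, 0.0])
sync.send_outgoing([0.5, 0.0, -0.5, 0.0])    # third frame; decimated reference is [0.5, -0.5]
mic = sync.receive_incoming([0.75, -0.25])   # speech [0.25, 0.25] plus echo [0.5, -0.5]
speech = sync.remove_echo(mic)
print(mic.frame_index, speech)
```

Keying the stored outgoing frames by counter value lets the incoming frame be matched to its reference by a single lookup, which is the association step the claim describes; how long old entries are retained is left open in this sketch.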