CPC G10L 25/57 (2013.01) [G06F 18/214 (2023.01); G06N 3/088 (2013.01); G06V 20/40 (2022.01); G10L 25/30 (2013.01)] | 21 Claims
1. A computer-implemented method, comprising:
receiving, by a computing device, an audio waveform associated with a plurality of video frames;
estimating, by a neural network and from the audio waveform, one or more audio sources associated with the plurality of video frames;
generating, by the neural network, one or more audio embeddings corresponding to the one or more estimated audio sources;
generating, by the neural network and for each audio embedding corresponding to the one or more estimated audio sources and based on a video embedding of the plurality of video frames, a spatio-temporal audio-visual embedding based on an attention operation that aligns the one or more estimated audio sources with spatio-temporal positions of on-screen objects in the plurality of video frames;
determining, by the neural network and based on the spatio-temporal audio-visual embedding, whether one or more audio sources of the one or more estimated audio sources correspond to the on-screen objects in the plurality of video frames; and
predicting, by the neural network, a version of the audio waveform comprising audio sources that correspond to the on-screen objects in the plurality of video frames.
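The attention operation recited in the claim, which aligns each estimated audio source with spatio-temporal positions of on-screen objects, can be sketched as a dot-product attention between per-source audio embeddings and a flattened grid of per-frame, per-position video embeddings. The function name, tensor shapes, and scaled-dot-product form below are illustrative assumptions for exposition, not the patent's disclosed implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def onscreen_attention(audio_emb, video_emb):
    """Align estimated audio sources with spatio-temporal video positions.

    audio_emb: (S, D)        one embedding per estimated audio source
    video_emb: (T, H, W, D)  per-frame, per-position video embeddings
    Returns:   (S, D)        attention-pooled audio-visual embedding per source
    (All shapes are hypothetical; the patent does not specify them.)
    """
    T, H, W, D = video_emb.shape
    v = video_emb.reshape(T * H * W, D)      # flatten the spatio-temporal grid
    scores = audio_emb @ v.T / np.sqrt(D)    # (S, T*H*W) source-position similarity
    attn = softmax(scores, axis=-1)          # where each source attends on screen
    return attn @ v                          # weighted sum over video positions

# Toy example with random embeddings
rng = np.random.default_rng(0)
S, T, H, W, D = 2, 4, 3, 3, 8
av = onscreen_attention(rng.normal(size=(S, D)), rng.normal(size=(T, H, W, D)))
print(av.shape)  # (2, 8)
```

In this reading, the resulting per-source audio-visual embeddings would feed a downstream on-screen/off-screen decision (the claim's determining step), after which only the sources deemed on-screen are kept when predicting the output version of the audio waveform.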