CPC G10L 25/57 (2013.01) [G06F 18/214 (2023.01); G06N 3/088 (2013.01); G06V 20/40 (2022.01); G10L 25/30 (2013.01)] | 21 Claims
1. A computer-implemented method, comprising:
receiving, by a computing device, an audio waveform associated with a plurality of video frames;
estimating, by a neural network and from the audio waveform, one or more audio sources associated with the plurality of video frames;
generating, by the neural network, one or more audio embeddings corresponding to the one or more estimated audio sources;
generating, by the neural network and for each audio embedding corresponding to the one or more estimated audio sources and based on a video embedding of the plurality of video frames, a spatio-temporal audio-visual embedding based on an attention operation that aligns the one or more estimated audio sources with spatio-temporal positions of on-screen objects in the plurality of video frames;
determining, by the neural network and based on the spatio-temporal audio-visual embedding, whether one or more audio sources of the one or more estimated audio sources correspond to the on-screen objects in the plurality of video frames; and
predicting, by the neural network, a version of the audio waveform comprising audio sources that correspond to the on-screen objects in the plurality of video frames.
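The attention operation recited in the claim, which aligns each estimated audio source with spatio-temporal positions of on-screen objects, can be sketched as a dot-product attention between per-source audio embeddings and a flattened grid of per-frame, per-position video embeddings. The function name, tensor shapes, and scaled-dot-product form below are illustrative assumptions for exposition, not the patent's disclosed implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def onscreen_attention(audio_emb, video_emb):
    """Align estimated audio sources with spatio-temporal video positions.

    audio_emb: (S, D)        one embedding per estimated audio source
    video_emb: (T, H, W, D)  per-frame, per-position video embeddings
    Returns:   (S, D)        attention-pooled audio-visual embedding per source
    (All shapes are hypothetical; the patent does not specify them.)
    """
    T, H, W, D = video_emb.shape
    v = video_emb.reshape(T * H * W, D)      # flatten the spatio-temporal grid
    scores = audio_emb @ v.T / np.sqrt(D)    # (S, T*H*W) source-position similarity
    attn = softmax(scores, axis=-1)          # where each source attends on screen
    return attn @ v                          # weighted sum over video positions

# Toy example with random embeddings
rng = np.random.default_rng(0)
S, T, H, W, D = 2, 4, 3, 3, 8
av = onscreen_attention(rng.normal(size=(S, D)), rng.normal(size=(T, H, W, D)))
print(av.shape)  # (2, 8)
```

In this reading, the resulting per-source audio-visual embeddings would feed a downstream on-screen/off-screen decision (the claim's determining step), after which only the sources deemed on-screen are kept when predicting the output version of the audio waveform.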