One of the difficulties in performing signal processing operations on sound clips stems from noise introduced by the measurement device. Such noise can severely degrade the accuracy of audio analysis. One way to address this problem is to use multimodal observations, chosen so that one modality is unaffected by the noise corrupting the other. For example, a video source is independent of noise added to the audio and is therefore unaffected by it. This makes it possible to use the video to recover information that the noise has obscured in the audio.
The purpose of this project is to perform spatial and temporal detection and recognition of speech in both audio and video.
To achieve this goal, we present an implementation of a deep learning model based on an extension of canonical correlation analysis (CCA) with L0 regularization. The regularization sparsifies the input features, allowing us to identify the shared audio signal. We demonstrate the applicability of the approach on real videos featuring individual speakers.
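To make the idea concrete, the following is a minimal sketch (not the exact model described above) of a CCA-style objective with an L0-regularized view: linear projections of audio and video features are trained to maximize their correlation, while a hard-concrete gate (the standard differentiable relaxation of L0 regularization from Louizos et al., 2018) sparsifies the audio input features. All dimensions, names, and the use of linear projections instead of deep networks are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    """Differentiable relaxation of L0 regularization: each input feature
    gets a stochastic gate in [0, 1] that can snap to exactly zero."""
    def __init__(self, n_features, beta=2/3, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(n_features))
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self, x):
        if self.training:
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((torch.log(u) - torch.log(1 - u)
                               + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)
        # Stretch to [gamma, zeta], then clip to [0, 1] so gates hit exact 0.
        z = (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)
        return x * z

    def expected_l0(self):
        # Expected number of nonzero gates (the differentiable L0 penalty).
        return torch.sigmoid(
            self.log_alpha
            - self.beta * torch.log(torch.tensor(-self.gamma / self.zeta))
        ).sum()

def correlation_loss(a, v, eps=1e-8):
    """Negative Pearson correlation between 1-D projections of the two
    views; minimizing it maximizes audio-video correlation."""
    a = a - a.mean()
    v = v - v.mean()
    return -(a * v).sum() / (a.norm() * v.norm() + eps)

# Illustrative dimensions for per-frame audio and video feature vectors.
audio_dim, video_dim, batch = 128, 256, 64
gate = HardConcreteGate(audio_dim)
proj_a = nn.Linear(audio_dim, 1, bias=False)
proj_v = nn.Linear(video_dim, 1, bias=False)
opt = torch.optim.Adam([*gate.parameters(), *proj_a.parameters(),
                        *proj_v.parameters()], lr=1e-3)

audio = torch.randn(batch, audio_dim)   # stand-in features
video = torch.randn(batch, video_dim)
lam = 1e-3                              # weight of the L0 penalty
for step in range(100):
    opt.zero_grad()
    a = proj_a(gate(audio)).squeeze(-1)
    v = proj_v(video).squeeze(-1)
    loss = correlation_loss(a, v) + lam * gate.expected_l0()
    loss.backward()
    opt.step()
```

Under this kind of objective, gates on audio features that carry no information correlated with the video are driven to zero, so the surviving features are precisely those shared between the two modalities.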