
Seeing Sound: Estimating Image From Sound

Project ID: 6680-2-21
Year: 2022
Student/s: Sagy Gersh, Yahav Vinokur
Supervisor/s: Yair Moshe

The goal of this work is to train a deep neural network that receives an audio signal as input and outputs a reconstructed image of the source that produced it. Assuming that an audio signal carries spatial properties of the object that produced it, we tried to use an audio classifier to extract these properties into a feature vector, from which a deep network with the GAN architecture reconstructs an image of the source. This project is a follow-up to Project A, which had the same goal. Project A concluded that the transition network fails to learn the transformation between the audio classifier and the GAN when the variance of the dataset examples is large. In this project, we tried to apply more advanced methods to the problem. We tried to connect the audio classifier to the GAN through an architecture that can learn non-injective transformations, which allows for ambiguity in the transformation. In addition, we created our own dataset, whose examples vary less than those in ImageNet but more than those in URMP. Next, we examined different ways of training the network, all of which ended in mode collapse or divergence. Finally, we examined the representation space of the GAN we used and concluded that the network cannot learn the required transformation because the image space spanned by the GAN is too limited.
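To illustrate what "allowing ambiguity" in the transition step can mean, the following is a minimal sketch under stated assumptions and not the project's actual architecture, which the abstract does not specify. It shows one common idea: the transition network receives a random noise vector alongside the audio feature vector, so the same audio features can yield several candidate GAN latent vectors. All names, dimensions, and the random untrained weights (`TransitionNetwork`, `AUDIO_FEATURE_DIM`, `NOISE_DIM`, `LATENT_DIM`) are hypothetical, and the audio classifier and GAN generator are left out entirely.

```python
import math
import random

# Hypothetical sizes, chosen only for illustration.
AUDIO_FEATURE_DIM = 8   # length of the audio classifier's feature vector
NOISE_DIM = 4           # length of the extra noise input
LATENT_DIM = 6          # length of the GAN latent vector
HIDDEN_DIM = 16         # width of the single hidden layer


def make_linear(in_dim, out_dim, rng):
    """Return a random weight matrix with out_dim rows and in_dim columns."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


class TransitionNetwork:
    """Toy, untrained stand-in for a network mapping audio features to a latent."""

    def __init__(self, rng):
        self.hidden = make_linear(AUDIO_FEATURE_DIM + NOISE_DIM, HIDDEN_DIM, rng)
        self.out = make_linear(HIDDEN_DIM, LATENT_DIM, rng)

    def __call__(self, audio_features, noise):
        # Concatenate the audio features with the noise, then apply two layers.
        joined = list(audio_features) + list(noise)
        hidden = [math.tanh(h) for h in apply_linear(self.hidden, joined)]
        return apply_linear(self.out, hidden)


def sample_latents(network, audio_features, n_samples, rng):
    """Draw several latent vectors for one audio feature vector."""
    latents = []
    for _ in range(n_samples):
        noise = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]
        latents.append(network(audio_features, noise))
    return latents
```

In a full pipeline, each sampled latent vector would be passed to the GAN generator to produce one candidate image for the same sound. Whether such a mapping can be trained without mode collapse or divergence is exactly the difficulty the abstract reports.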
