Over the last decade, deep learning has expanded into many areas of signal processing, including image, audio, and text.
With a diverse set of architectures such as neural networks, convolutional networks, and, more recently, transformers, deep learning achieves better results than classical methods on many signal processing tasks in general, and on audio processing in particular.
In the last few years, convolutional architectures have dominated the audio domain, especially in classification, emotion detection, and feature extraction. As in computer vision, the learned audio features can be optimized over a broad spectrum of datasets and labels. For this reason, architectures that are successful in computer vision could also be successful in audio.
In computer vision and natural language processing (NLP), a new kind of architecture is taking over deep learning: the transformer. In this work we reproduce an article that presents an audio classification model based purely on transformers, without any convolution, and reports results that nearly match state-of-the-art models.
We do so on FSD50K, a large dataset containing more than 50,000 labeled audio clips.