This work aims to generate a bass accompaniment track for a solo vocal track. This is achieved with a machine learning model trained on a dataset of bass and vocal tracks that sound good together. The system first converts the vocal track into a spectrogram, a representation of the signal's frequency content over time. A generative diffusion model then produces a corresponding bass track, using the vocal spectrogram as its conditioning input. An extensive literature review was conducted to select an appropriate model, with candidates including MelGAN and HiFi-GAN. Based on output quality, among other considerations, we concluded that a diffusion model was the better choice.
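The preprocessing step can be illustrated with a minimal sketch of the vocal-to-spectrogram conversion. The use of librosa and the specific sample rate, FFT size, and mel-band count shown here are assumptions for illustration, not the values used in the project.

```python
import librosa
import numpy as np

def vocal_to_mel_spectrogram(path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Load a vocal track and convert it to a log-scaled mel spectrogram."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # Log-compress so the dynamic range is suitable as a neural model's input.
    return librosa.power_to_db(mel, ref=np.max)

# The resulting spectrogram (shape: n_mels x frames) would then be passed to
# the diffusion model as its conditioning input.
cond = vocal_to_mel_spectrogram("vocals.wav")
```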
After selecting the model, several training sessions were performed, adjusting various parameters such as the segment length. Tools were developed to assess the quality of the generated bass tracks, both by listening to the audio and by analyzing the frequency composition of the produced signals, as sketched below.
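The following is a rough sketch of the kind of frequency-composition check described above, assuming numpy and librosa; the cutoff frequency and FFT size are illustrative only and not taken from the project.

```python
import librosa
import numpy as np

def bass_energy_ratio(path, sr=22050, cutoff_hz=250.0, n_fft=2048):
    """Fraction of spectral energy below cutoff_hz, summed over the track."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    spec = np.abs(librosa.stft(audio, n_fft=n_fft)) ** 2   # power spectrogram
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)    # bin centre frequencies
    low = spec[freqs < cutoff_hz].sum()
    return low / spec.sum()

# Example: check how much of a generated track's energy sits in the bass range.
ratio = bass_energy_ratio("generated_bass.wav")
print(f"{ratio:.1%} of the spectral energy lies below 250 Hz")
```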