
Automatic Speech Recognition (ASR) systems have seen major improvements through deep learning, yet they remain sensitive to variations in speaking rate, background noise, and speaker identity—factors that commonly degrade real-world performance. To address this, data augmentation is widely used to increase model robustness. This project evaluates the effectiveness of ScalerGAN, a generative augmentation technique, compared to classical fixed transformations such as time warping.
ScalerGAN leverages a Generative Adversarial Network (GAN) to create realistic mel-spectrograms of faster speech, enriching the training data with more natural variability. To enable comparison, we developed and trained a Listen-Attend-Spell (LAS) model on the LibriSpeech dataset under three setups: baseline (no augmentation), time warping, and ScalerGAN-augmented data. We evaluated performance using Word Error Rate (WER), Binary Cross-Entropy (BCE) loss, and a qualitative analysis supported by ChatGPT-4, a large language model (LLM) used to classify linguistic errors such as hallucinations, phonemic confusion, and structural drift.
The results demonstrate that ScalerGAN outperforms traditional augmentation, yielding lower WER and BCE loss and producing outputs that are more fluent and semantically consistent—highlighting the potential of generative augmentation for building more robust ASR systems.
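To make the classical baseline concrete, the fixed time-warping transform mentioned above can be sketched as a resampling of the mel-spectrogram along its time axis. This is a minimal illustrative sketch, not the project's actual implementation; the function name, array shapes, and the choice of linear interpolation are assumptions.

```python
import numpy as np

def time_warp(mel: np.ndarray, rate: float) -> np.ndarray:
    """Resample a mel-spectrogram (n_mels x T) along the time axis.

    rate > 1 simulates faster speech (fewer frames);
    rate < 1 simulates slower speech (more frames).
    Hypothetical helper for illustration only.
    """
    n_mels, T = mel.shape
    new_T = max(1, int(round(T / rate)))
    # Map each output frame back to a fractional source frame index.
    src = np.linspace(0, T - 1, new_T)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    frac = src - lo
    # Linearly interpolate between the two neighbouring frames.
    return mel[:, lo] * (1 - frac) + mel[:, hi] * frac
```

Unlike ScalerGAN, which learns to generate plausible faster-speech spectrograms, this transform applies the same deterministic stretch to every utterance, which is what limits the variability it can add to the training data.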