
In this project we developed an automated data-preparation system to improve recognition of Torah reading with cantillation marks (te'amim). The main challenge was adapting existing datasets, which typically consist of long recordings, to the requirements of training models like Whisper, which require short audio segments (under 30 seconds).
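The segmentation constraint above can be sketched as follows: once a VAD has marked the speech spans in a long recording, consecutive spans are greedily packed into segments that stay under the 30-second limit, cutting only at silences so no word is split. This is an illustrative sketch, not the project's actual code; the function name, the span representation, and the 30 s threshold are assumptions drawn from the description above.

```python
def pack_segments(speech_spans, max_len=30.0):
    """Greedily merge consecutive VAD speech spans (start, end) in seconds
    into segments no longer than max_len, cutting only at silences.
    Note: a single span longer than max_len is kept whole here; a real
    pipeline would need a fallback split for that case."""
    segments = []
    cur_start, cur_end = None, None
    for start, end in speech_spans:
        if cur_start is None:
            cur_start, cur_end = start, end
        elif end - cur_start <= max_len:
            cur_end = end  # extend the current segment through this span
        else:
            segments.append((cur_start, cur_end))  # close segment at silence
            cur_start, cur_end = start, end
    if cur_start is not None:
        segments.append((cur_start, cur_end))
    return segments
```

For example, spans `[(0, 10), (12, 20), (22, 35), (40, 50)]` would be packed into `[(0, 20), (22, 50)]`, both under 30 seconds.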
The system includes VAD-based audio segmentation, initial transcription using Whisper, and a novel iterative text-alignment mechanism that bridges the gap between the modern transcription and the canonical biblical text. Additionally, tools were created for automated data collection from YouTube, using an LLM for Parasha/Aliyah identification.
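The core idea of the alignment step can be illustrated with a minimal sketch: since Whisper's transcription drifts from the fixed verse text, the transcription is not trusted directly; instead, the best-matching window in the canonical reference text is located and its words substituted in. This sketch uses `difflib` from the standard library as a stand-in similarity measure; the function name and windowing scheme are illustrative assumptions, not the project's actual implementation.

```python
import difflib

def align_to_reference(transcribed_words, reference_words):
    """Slide a window of the transcription's length over the reference
    text, score each window by string similarity, and return the
    canonical words of the best-matching window with its score."""
    n = len(transcribed_words)
    transcript = " ".join(transcribed_words)
    best_score, best_span = -1.0, (0, n)
    for start in range(max(1, len(reference_words) - n + 1)):
        window = reference_words[start:start + n]
        score = difflib.SequenceMatcher(None, transcript, " ".join(window)).ratio()
        if score > best_score:
            best_score, best_span = score, (start, start + n)
    s, e = best_span
    return reference_words[s:e], best_score
```

In the full pipeline this matching would be applied iteratively, refining segment boundaries as each portion of the reading is anchored to its verses.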
Using these tools, the dataset was expanded from 123 to 337 hours, covering a variety of cantillation styles and readers. Evaluation showed a dramatic improvement: with the best model, the F1 score for Ashkenazi cantillation rose from 0.153 to 0.842, and the word error rate (WER) dropped from 91% to 7.3%. These results confirm the system's effectiveness in generating high-quality data and substantially improving recognition of Torah reading with cantillation.
The gradual-alignment approach to handling biblical text proved particularly effective, and the medium-sized model offered an excellent balance between performance and computational cost. The project opens promising directions for future work, including training dedicated models for different cantillation styles, developing more efficient mechanisms for predicting cantillation marks, and building educational tools.