Seminar: Graduate Seminar

ECE Women Community

Neural Forced Alignment

Date: February 24, 2026 Time: 14:30 - 15:30
Location: 506, New Zisapel Building
Lecturer: Rotem Rousso

Forced Alignment (FA), the task of temporally aligning a speech signal with its corresponding textual transcription, is a fundamental component of many speech processing applications, including Automatic Speech Recognition (ASR), speech synthesis, prosody analysis, and automatic corpus annotation. Recent advances in sequence modelling have significantly improved ASR systems, bringing them close to human-level recognition accuracy and enhancing robustness across diverse acoustic conditions and languages. In contrast, forced alignment has not experienced comparable progress, and traditional HMM-GMM frameworks remain widely adopted and highly competitive. This work addresses this gap by systematically evaluating existing paradigms and introducing new neural alignment methods designed to rival current state-of-the-art approaches.
To establish the empirical necessity for a new architecture, we first conducted a comparative benchmarking analysis evaluating the temporal precision of state-of-the-art FA models. This analysis showed that HMM-GMM-based models remain the most accurate. More modern ASR-based approaches, by contrast, are typically optimized for recognition accuracy rather than for the precise temporal localization that alignment requires, and thus fail to match the precision of HMM-GMM systems, which are trained at the phone-frame level.
Motivated by these findings, we first propose a method for accurate multilingual word-level forced alignment that consists of an alignment encoder and a learned alignment decoder. The encoder integrates two self-supervised multilingual representations and estimates word-boundary probabilities over long temporal contexts. The alignment decoder is a learned dynamic-programming module that combines the encoder outputs with segmental features over the representations to infer the final word boundaries. Trained iteratively on hand-annotated American English datasets, the proposed approach outperforms all previous approaches on American English and on unseen languages (Dutch, German, and Hebrew).
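To make the decoder idea concrete, the following is a minimal sketch (not the authors' implementation) of a dynamic-programming step that places word boundaries at the frames maximizing total boundary log-probability, under a hypothetical minimum-duration constraint; the function name and parameters are illustrative:

```python
import numpy as np

def dp_word_boundaries(boundary_prob, n_words, min_dur=3):
    """Illustrative sketch: choose n_words - 1 ordered boundary frames,
    at least min_dur frames apart, maximizing the summed boundary
    log-probability. boundary_prob: per-frame word-boundary probabilities."""
    T = len(boundary_prob)
    K = n_words - 1                       # number of internal boundaries
    score = np.log(np.asarray(boundary_prob, dtype=float) + 1e-9)
    NEG = -1e18
    # best[k, t]: best score with boundary k placed at frame t
    best = np.full((K, T), NEG)
    back = np.full((K, T), -1, dtype=int)
    best[0, :] = score
    for k in range(1, K):
        run_best, run_arg = NEG, -1       # running max over earlier boundaries
        for t in range(T):
            prev = t - min_dur
            if prev >= 0 and best[k - 1, prev] > run_best:
                run_best, run_arg = best[k - 1, prev], prev
            if run_arg >= 0:
                best[k, t] = run_best + score[t]
                back[k, t] = run_arg
    # backtrack from the best final boundary position
    t = int(np.argmax(best[K - 1]))
    bounds = [t]
    for k in range(K - 1, 0, -1):
        t = int(back[k, t])
        bounds.append(t)
    return bounds[::-1]
```

The actual decoder described in the talk is learned and also consumes segmental features; this sketch only shows the monotonic-ordering DP skeleton such a decoder builds on.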
The primary contribution of this work is the development of an end-to-end, fully differentiable neural framework specifically designed for phoneme alignment. The proposed architecture consists of an encoder that processes the input signals and a decoder that produces alignment decisions. The encoder includes two branches: one dedicated to validating phoneme identity and the other to learning phoneme boundaries. The decoder is a trainable module based on a differentiable soft dynamic programming formulation. The entire system is optimized end-to-end using a novel contrastive loss function that encourages discrimination between steady-state phoneme regions and transition points. The proposed approach surpasses current state-of-the-art phoneme alignment methods on English benchmarks, performs competitively on word-level tasks, and demonstrates strong generalization in multilingual settings.
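A differentiable DP of this kind typically replaces the hard max in the recursion with a smooth soft-max so gradients can flow through the alignment. The sketch below illustrates that idea on a toy monotonic frame-to-phoneme alignment; the emission scores, function names, and gamma parameter are assumptions for illustration, not the paper's formulation:

```python
import numpy as np

def soft_max(args, gamma):
    """Smooth maximum: gamma * logsumexp(args / gamma).
    As gamma -> 0 this approaches the hard max, but stays differentiable."""
    a = np.asarray(args, dtype=float) / gamma
    m = a.max()
    return gamma * (m + np.log(np.exp(a - m).sum()))

def soft_align_score(emit, gamma=0.1):
    """Toy soft-DP score for a monotonic alignment of T frames to N phonemes.
    emit[t, n]: log-score of frame t under phoneme n (hypothetical input,
    e.g. produced by a neural encoder). Using soft_max instead of max makes
    the whole recursion differentiable with respect to emit."""
    T, N = emit.shape
    NEG = -1e9
    D = np.full((T, N), NEG)
    D[0, 0] = emit[0, 0]
    for t in range(1, T):
        for n in range(N):
            stay = D[t - 1, n]                          # remain in phoneme n
            move = D[t - 1, n - 1] if n > 0 else NEG    # advance to phoneme n
            D[t, n] = emit[t, n] + soft_max([stay, move], gamma)
    return D[T - 1, N - 1]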
This work is based on the following papers:
Rotem Rousso, Eyal Cohen, Joseph Keshet, Eleanor Chodroff, Tradition or Innovation: A Comparison of Modern ASR Methods for Forced Alignment, Interspeech 2024
Roy Weber*, Meidan Zehavi*, Rotem Rousso*, Joseph Keshet, Word-Level Forced Alignment for Multilingual Data using Self-Supervised Embeddings, Learned Encoder, and Dynamic Programming Decoder, Interspeech 2026 (under submission)
Rotem Rousso, Joseph Keshet, Neural Fully-Differential Phoneme Alignment, IEEE Trans. on Audio, Speech, and Language, 2026 (under submission)

M.Sc. student under the supervision of Prof. Joseph Keshet.

 
