Seminar: Graduate Seminar


Adapting a Non-Causal Pre-Trained Speech Recognition Model for Streaming Tasks

Date: December 3, 2025    Start Time: 10:30 - 11:30
Location: 608, Zisapel Building
Lecturer: Tomer Krichli
Automatic speech recognition (ASR) has advanced rapidly with the rise of encoder-decoder Transformer-based models such as Whisper and NVIDIA Canary, which offer strong multilingual and noise-robust performance. However, despite their effectiveness in offline transcription, these models are fundamentally non-causal: the encoder processes the entire input sequence using full bidirectional self-attention, and the decoder conditions on a complete, fixed-length encoder representation. This design makes Transformer encoder-decoder models highly effective for offline inference but unsuitable for streaming scenarios, where the model must operate incrementally and cannot rely on future acoustic context. Simply enforcing causality is non-trivial because the encoder's representations were learned under the assumption of full global context; removing access to future frames disrupts the internal feature geometry, leading to severe degradation unless the model is carefully adapted. Moreover, unlike decoder-only architectures or Recurrent Neural Network (RNN) based transducers, encoder-decoder models require tight coupling between audio and text attention patterns, making causal adaptation even more challenging.
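To make the causality constraint concrete, the minimal sketch below (PyTorch, with hypothetical shapes and function names) shows the only mechanical difference between the bidirectional self-attention the encoder was trained with and a causal variant: a lower-triangular mask that hides future frames. As noted above, flipping this switch on a pre-trained encoder without adaptation disrupts the learned representations.

    import torch
    import torch.nn.functional as F

    def encoder_self_attention(q, k, v, causal: bool):
        # q, k, v: (batch, heads, frames, head_dim) projections of the audio features.
        # Offline encoders attend bidirectionally (causal=False); a streaming variant
        # must hide future frames with a lower-triangular mask (causal=True).
        return F.scaled_dot_product_attention(q, k, v, is_causal=causal)

    # Hypothetical shapes: 1 utterance, 8 heads, 1500 frames (~30 s of features), 64-dim heads.
    q = k = v = torch.randn(1, 8, 1500, 64)
    offline = encoder_self_attention(q, k, v, causal=False)    # every frame sees the whole utterance
    streaming = encoder_self_attention(q, k, v, causal=True)   # every frame sees only past frames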

Recent attempts to adapt Transformer encoder-decoder models to streaming inference illustrate these difficulties. Some works avoid fine-tuning entirely and introduce heuristic mechanisms such as buffer-based chunking, alignment-based token emission criteria, or local agreement between sub-sequences, but these approaches do not yield true streaming behavior and incur inefficiencies such as repeatedly padding inputs to the model's 30-second window. Other methods introduce architectural changes, such as adding a Connectionist Temporal Classification (CTC) head or training the encoder to detect special markers, requiring additional parameters and multi-stage decoding pipelines. These approaches improve latency but still place substantial overhead on top of the original model.
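As an illustration of one of these heuristics (not the method proposed in this talk), a local-agreement policy commits only the prefix on which two consecutive re-decodings of a growing audio buffer agree; the function and values below are a hypothetical sketch.

    def committed_prefix(prev_hyp: list[str], new_hyp: list[str]) -> list[str]:
        # Local agreement: commit only the longest common prefix of the hypotheses
        # decoded from two consecutive (growing) audio buffers.
        prefix = []
        for a, b in zip(prev_hyp, new_hyp):
            if a != b:
                break
            prefix.append(a)
        return prefix

    # Hypothetical usage: each re-decode pads the buffer to the model's fixed 30-second
    # window, which is the kind of inefficiency noted above.
    print(committed_prefix(["the", "cat", "sat"], ["the", "cat", "sang", "loudly"]))  # ['the', 'cat']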

We propose a method that directly transforms a pre-trained Transformer encoder-decoder model into a fully streaming, low-latency ASR system without auxiliary heads, architectural modifications, or multi-pass decoding. Our approach converts the model's encoder into a causal encoder and fine-tunes both encoder and decoder jointly using lightweight Low-Rank Adaptation (LoRA). LoRA layers are injected specifically into the attention modules of both components, enabling the model to learn causal feature representations that preserve its linguistic and acoustic strengths while removing dependence on future frames. Training is performed with cross-entropy on weakly aligned speech-text pairs, where alignments are obtained through forced alignment rather than sequence-level objectives such as CTC. We introduce an efficient streaming training strategy that does not require iterating over every possible audio sub-segment, substantially reducing computation while ensuring the model receives realistic streaming-style context. For inference, we design a decoding procedure that leverages the fine-tuned causal encoder-decoder to perform stable greedy or beam-search decoding in real time. Our analysis shows that the resulting system produces locally optimal predictions with competitive accuracy while achieving significantly lower latency and computational cost than prior Transformer-based streaming methods.
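The abstract does not name a toolkit; as a rough sketch of what injecting LoRA adapters into the attention modules of a pre-trained encoder-decoder ASR model can look like, the example below uses Hugging Face Transformers and PEFT with Whisper's attention projection names (q_proj, k_proj, v_proj, out_proj). The checkpoint, ranks, targets, and toy training step are illustrative assumptions, not the configuration used in this work; the causal encoder masking and the streaming training strategy described above are not shown.

    import torch
    from transformers import WhisperForConditionalGeneration
    from peft import LoraConfig, get_peft_model

    # Load a pre-trained non-causal encoder-decoder ASR model (illustrative checkpoint).
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

    # Inject low-rank adapters into the attention projections of both encoder and decoder;
    # the original weights stay frozen, so only the small LoRA matrices are trained.
    lora_cfg = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
    )
    model = get_peft_model(model, lora_cfg)
    model.print_trainable_parameters()

    # Toy cross-entropy training step on a dummy (audio, token) pair. The causal masking of
    # the encoder and the streaming-style batching described in the abstract are not shown.
    out = model(input_features=torch.randn(1, 80, 3000),
                labels=torch.tensor([[50258, 50259, 50359]]))
    out.loss.backward()

Keeping the base weights frozen and updating only the low-rank matrices is what makes the adaptation lightweight: a small fraction of parameters changes, so the model's original linguistic and acoustic knowledge is largely preserved while the attention layers learn to operate without future frames.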

This work demonstrates that a Transformer encoder-decoder ASR model can be transformed into a practical streaming recognizer through targeted causal adaptation and lightweight fine-tuning, without architectural expansion, auxiliary losses, or multi-step inference.

M.Sc. student under the supervision of Prof. Joseph Keshet.

 
