Seminar: Graduate Seminar
Self-Supervision and Adaptation for Robust Speech Processing
Speech-processing systems are increasingly deployed in conditions that differ substantially from the data on which they were developed. Domain mismatch, adverse acoustic environments, and unreliable communication channels can all cause significant performance degradation. Addressing these challenges is difficult because collecting labeled data for every new condition is often impractical, and adapting large pretrained models can be computationally expensive while risking the loss of previously learned capabilities. We propose self-supervision and lightweight adaptation to improve robustness across several speech-processing tasks without relying on extensive annotation or full-system retraining.
First, we study speaker diarization without speaker identity labels or diarization annotations. By exploiting temporal structure in unlabeled audio, we learn speaker-discriminative embeddings and use them to build a label-free diarization pipeline. Second, we address packet loss concealment using test-time self-supervision. Instead of treating a pretrained concealment model as fixed, we adapt it on the received portions of the same corrupted signal, using synthetic packet masks to create a training objective without clean references or external data. Third, we improve automatic speech recognition under packet loss, noise, and reverberation while keeping a large pretrained ASR model frozen. A small front-end adaptation network is trained on LibriSpeech-style data to transform corrupted spectra into inputs that are more useful for recognition, and is evaluated across substantially different acoustic domains.
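As an illustration of the second idea, the sketch below shows one way a training objective can be built from a single corrupted signal: some packets that were actually received are hidden by a synthetic mask, and the reconstruction error is measured only on those packets, whose true content is known. This is a minimal sketch under stated assumptions, not the method presented in the talk. The packet length, the drop probability, and the names `sample_synthetic_mask`, `self_supervised_loss`, and `zero_fill` are all hypothetical, and `zero_fill` is a trivial stand-in for a pretrained concealment model.

```python
import numpy as np

PACKET_LEN = 160  # assumption: 10 ms packets at 16 kHz


def sample_synthetic_mask(received, drop_prob, rng):
    """Mark a random subset of the *received* packets as artificially lost."""
    return (rng.random(received.shape) < drop_prob) & received


def self_supervised_loss(signal, received, hidden, conceal_fn):
    """Mean squared error on the synthetically hidden packets only.

    Truly lost packets never enter the loss, so no clean reference is needed.
    """
    packets = signal.reshape(-1, PACKET_LEN)
    visible = received & ~hidden                  # packets the model may see
    model_input = packets * visible[:, None]      # zero out everything else
    prediction = conceal_fn(model_input, visible)
    if not hidden.any():
        return 0.0
    return float(np.mean((prediction[hidden] - packets[hidden]) ** 2))


def zero_fill(model_input, visible):
    """Hypothetical placeholder for a concealment model: leaves gaps as zeros."""
    return model_input


rng = np.random.default_rng(0)
signal = rng.standard_normal(20 * PACKET_LEN)     # stand-in for received audio
received = rng.random(20) > 0.2                   # True where a packet arrived
hidden = sample_synthetic_mask(received, 0.3, rng)
loss = self_supervised_loss(signal, received, hidden, zero_fill)
```

In a test-time adaptation setting, a loss of this kind would be minimized with respect to the concealment model's parameters for a few steps before the adapted model is applied to the truly lost packets.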
Across these problems, we show that self-supervision and lightweight adaptation enable models to learn from unlabeled data and adjust to challenging conditions with few additional parameters. The results point to practical and efficient paths to robustness that reduce the need for labeled data, large-scale retraining, and task-specific model redesign.
Ph.D. student, under the supervision of Prof. Joseph Keshet.

