Seminar: Graduate Seminar
Estimates of Speech Separation in Far-Field Domain
Speech source separation is a crucial pre-processing step for various speech processing tasks, such as multi-talker automatic speech recognition (ASR) and audio assistive devices. Separating individual speech signals from overlapping audio is central to addressing the "cocktail party problem" and significantly enhances the robustness and performance of speech processing systems. Traditionally, evaluation metrics for speech separation rely on matched reference audio and corresponding transcriptions to assess audio quality and intelligibility. However, they cannot be used to evaluate real-world mixtures for which no reference exists. In this thesis, we introduce a text-free, reference-free evaluation framework based on self-supervised learning (SSL) representations. The proposed framework uses the mixture and the separated tracks to jointly predict audio quality and speech intelligibility, through the Scale-Invariant Signal-to-Noise Ratio (SI-SNR) and Word Error Rate (WER) metrics, respectively. In addition, we conducted experiments on classifying individual words as correct or erroneous, for which an additional text embedder and text features were included.
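For readers unfamiliar with the two reference-based metrics that the framework predicts, the following is a minimal sketch of their standard textbook definitions, assuming single-channel signals stored as NumPy arrays and transcripts given as lists of words. The function names `si_snr` and `wer` are illustrative and are not taken from the thesis.

```python
import numpy as np


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (standard definition)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the mean so the metric ignores DC offsets.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)


def wer(reference_words, hypothesis_words):
    """Word error rate: word-level edit distance divided by reference length."""
    n, m = len(reference_words), len(hypothesis_words)
    if n == 0:
        raise ValueError("reference transcript must contain at least one word")
    # dist[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference_words[i - 1] == hypothesis_words[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
    return dist[n][m] / n
```

Both functions require a clean reference signal or transcript, which is exactly what is unavailable for real-world mixtures and what motivates a reference-free estimator.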
In an attempt to evaluate separation performance without an existing model, we define two unsupervised metrics for speech separation that are independent of reference audio and transcriptions and are computed directly at the signal level. We show that these metrics, together with our framework, can be used to evaluate the performance of speech separation models in real-world scenarios. Moreover, we show that the estimators and metrics can guide the training process of a teacher-student model.
We conducted experiments on the WHAMR! dataset. WER estimation achieves a mean absolute error (MAE) of 17% and a Pearson correlation coefficient (PCC) of 0.77 with the ground-truth WER, and SI-SNR estimation achieves an MAE of 1.38 and a PCC of 0.95 with the ground-truth SI-SNR values. Word classification as Correct/Error reaches an accuracy of 74.6%. We further demonstrate the robustness of our estimator by using various SSL representations.
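The two agreement measures reported above can be computed as in the following minimal sketch, assuming the estimated and ground-truth metric values are available as equal-length arrays. The function names are illustrative, and the numbers in the usage note are made-up toy values, not results from the thesis.

```python
import numpy as np


def mean_absolute_error(predicted, target):
    """Average absolute difference between predicted and true values."""
    predicted = np.asarray(predicted, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.mean(np.abs(predicted - target)))


def pearson_correlation(predicted, target):
    """Pearson correlation coefficient between predicted and true values."""
    predicted = np.asarray(predicted, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Center both series, then normalize their inner product.
    p = predicted - predicted.mean()
    t = target - target.mean()
    return float(np.dot(p, t) / np.sqrt(np.dot(p, p) * np.dot(t, t)))
```

For example, with toy predictions `[1, 2, 3, 4]` and toy targets `[2, 4, 6, 8]`, the MAE is 2.5 while the PCC is 1.0. An estimator can therefore be perfectly correlated with the ground truth and still be biased, which is why both measures are reported.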
M.Sc. student under the supervision of Prof. Oren Kurland.

