This thesis introduces a novel approach to Speaker Diarization (SD) that departs from the traditional reliance on audio signals and relies exclusively on text, together with evaluation methods tailored to textual data. The study employs the T5-3B model within both the Single Prediction Model (SPM) and Multiple Prediction Model (MPM) frameworks, and adds data processing pipelines designed to improve the model's performance on transcripts generated by Automatic Speech Recognition (ASR) models. With this setup, it assesses the feasibility and effectiveness of text-based SD in determining "who speaks what" across a range of two-speaker dialogues, using sentence-level Speaker Change Detection followed by an aggregation mechanism. The research also proposes and validates two new evaluation metrics: the Text-based Diarization Error Rate (TDER) and Diarization F1 (DF1). These metrics address the particular challenges of text-based SD and of jointly assessing ASR and SD errors. Alongside these metrics, we propose a sequence alignment algorithm that aligns different transcripts effectively and efficiently, particularly when speech overlaps.
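As a rough illustration of how sentence-level speaker change decisions could be aggregated into speaker labels for a two-speaker dialogue, the following minimal sketch toggles between two speakers whenever a change is flagged. The function name `aggregate_speaker_changes`, the boolean flag representation, and the labels `S1`/`S2` are assumptions made for this example; the sketch does not reproduce the SPM or MPM aggregation used in the thesis, and the flags here are supplied by hand rather than predicted by T5-3B.

```python
from typing import List, Tuple


def aggregate_speaker_changes(
    sentences: List[str], change_flags: List[bool]
) -> List[Tuple[str, str]]:
    """Assign one of two speaker labels to each sentence.

    change_flags[i] is True when the speaker is assumed to change
    between sentences[i] and sentences[i + 1] (hypothetical format).
    """
    if len(change_flags) != max(len(sentences) - 1, 0):
        raise ValueError("expected one change flag per sentence boundary")

    labeled: List[Tuple[str, str]] = []
    speaker = 0  # the first sentence is arbitrarily assigned to speaker S1
    for i, sentence in enumerate(sentences):
        # With exactly two speakers, a detected change flips the label.
        if i > 0 and change_flags[i - 1]:
            speaker = 1 - speaker
        labeled.append((f"S{speaker + 1}", sentence))
    return labeled


example = aggregate_speaker_changes(
    ["Hi.", "How are you?", "Fine, thanks."],
    [False, True],
)
```

Because only two speakers are assumed, a sequence of change decisions fully determines the labeling up to swapping the two speaker names.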
Experiments on a curated dataset covering seven open-domain conversational contexts demonstrate that text-based methods can perform comparably to traditional audio-based diarization systems and, for short conversations under 15 minutes, even outperform them by 2.5% to 10%. The newly proposed text-based metrics, tested on the CallHome dataset through both manual inspection and error type analysis, assess the performance of text-based SD and of joint ASR and SD systems more accurately and yield more informative transcription results. Moreover, the proposed multiple sequence alignment algorithm achieves better alignment results (0.99 accuracy) than previous dynamic programming-based methods (0.92 accuracy). These findings challenge existing paradigms within the field of SD and pave the way for further advances in conversational analysis and AI, highlighting the untapped potential of textual information in SD tasks.
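For readers unfamiliar with dynamic programming-based transcript alignment, the sketch below shows a classic pairwise global alignment (Needleman-Wunsch style) over two token sequences. It illustrates the general family of baseline methods only; it is not the multiple sequence alignment algorithm proposed in the thesis, and the function name `align_tokens` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def align_tokens(
    ref: List[str],
    hyp: List[str],
    match: int = 1,
    mismatch: int = -1,
    gap: int = -1,
) -> List[Pair]:
    """Globally align two token lists; None marks a gap on that side."""
    n, m = len(ref), len(hyp)
    # score[i][j] = best score for aligning ref[:i] with hyp[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i: int, j: int) -> int:
        # Substitution score for pairing ref[i - 1] with hyp[j - 1].
        return match if ref[i - 1] == hyp[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two tokens
                score[i - 1][j] + gap,            # ref token unmatched
                score[i][j - 1] + gap,            # hyp token unmatched
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((ref[i - 1], None))
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


aligned = align_tokens(["who", "speaks", "what"], ["who", "what"])
```

A pairwise scheme like this assumes a single linear ordering of tokens, which is one reason overlapping speech is difficult for it to handle.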
Computer Science / Emory University
BS / Spring 2024
Jinho D. Choi, Computer Science and QTM, Emory University (Chair)
Davide Fossati, Computer Science, Emory University
Alissa Bans, Physics, Emory University
Alissa Bans, Peilin Wu, Jinho Choi, Davide Fossati