We propose a convolutional neural network model for text-based speaker identification on multiparty dialogues extracted from the TV show Friends. While most previous work on this task relies heavily on acoustic features, our approach attempts to identify speakers in dialogues using their speech patterns as captured in transcripts of the show. Individual speakers have been shown to exhibit distinct idiolectal styles, and we develop several convolutional neural network models to discriminate between these differing speech patterns. Our results confirm the promise of text-based approaches, with the best-performing model improving accuracy by more than 6% over the baseline CNN model.