Developing high-performing dialogue systems benefits from the automatic identification of undesirable behaviors in system responses. However, detecting such behaviors remains challenging, as it draws on a breadth of general knowledge and understanding of conversational practices. Although recent research has focused on building specialized classifiers for detecting specific dialogue behaviors, their behavior coverage remains incomplete and they are rarely tested on real-world human-bot interactions. This paper investigates the ability of large language models (LLMs), such as GPT, to perform dialogue behavior detection for nine categories in real human-bot dialogues. We aim to assess whether LLMs can match specialized models and approximate human performance, thereby reducing the overall cost of behavior detection. Our findings reveal that neither specialized models nor LLMs have yet achieved satisfactory results for this task, falling short of human performance. Nevertheless, LLMs show promising potential and often outperform specialized detection models. We conclude with an in-depth examination of the prevalent shortcomings of LLMs, offering guidance for future research to enhance their capabilities.
Proceedings of the Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 2023