
Date: 2022-09-09 / 4:00 ~ 5:00 PM
Location: MSC E306 (https://emory.zoom.us/j/99364825782)
Chat-oriented dialogue evaluation is notoriously challenging due to the inherent subjectivity in defining the quality of a human-computer chat. Evaluations that employ human judges to rate conversation quality have become a de facto standard in this area; however, current human evaluation methodologies are largely unstandardized and often produce unreliable evaluation metrics due to high variance in human judgements. In addition, existing evaluation methodologies fail to reveal how prevalent undesirable behaviors are in current dialogue systems, making it difficult to clearly understand the outstanding challenges. In this work, we propose a novel human evaluation method, ABC-Eval, that seeks to remedy these limitations by having annotators judge chatbot responses on a set of explicitly defined behaviors, such as self-contradiction, topic switching, and more.