Date: 2022-03-18 / 4:00 ~ 5:00 PM
Location: MSC E306 (https://emory.zoom.us/j/99364825782)
Despite the recent popularity of conversational AI research, the evaluation of chat models remains a significant challenge in the field. Likert-scale human judgements are the most prominent evaluation methodology, but the high variance and lack of standardization of these labels make high-confidence comparisons between models difficult. To address this, we investigate alternative human judgement methodologies for chatbot evaluation, including comparative judgements and a novel behavior coding scheme. We apply these evaluation methodologies to four chatbots, each of which has achieved a state-of-the-art result in some aspect of conversational ability. In this talk, we present our current progress, including the design of our behavior coding scheme, our bot selection and replication methods, and results from our pilot studies.