High-quality multi-turn dialogue datasets are a scarce commodity in Natural Language Processing, and with the recent rise of chatbots powered by seq2seq models trained on such datasets, they have become more important than ever. This thesis describes a model built to deconstruct Reddit posts and sequence the fragments into high-quality, multi-turn, topic-specific conversations. The model uses a post's content as the framework for one speaker's statements in a conversation and fills in the second speaker's utterances with comments left on the same post. Using this method, we generated a dataset of 951 dialogues spanning two topics: movies and books. This dataset, HuHu, was then manually evaluated against DailyDialog, Topical-Chat, and MultiWOZ, three high-quality datasets of roughly 10,000 dialogues each, constructed in varying ways. Our generated dialogues were judged more natural in 46% of comparisons and at least as natural in 73%. This is an encouraging result given that our model can generate millions of dialogues across any number of topics, limited only by the number of related Reddit posts. Future work on dialogue assembly models appears promising and could yield dialogues at a near-human level in the near future.
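The assembly idea described above can be illustrated with a minimal sketch. The code below is a hypothetical simplification, not the thesis's actual model: it assumes the post body has already been split into sentences and the comments have already been selected for relevance, and simply interleaves the two sources into alternating speaker turns.

```python
# Minimal sketch (hypothetical, not the thesis's model): build an alternating
# two-speaker dialogue from a Reddit post and its comments. Speaker A turns
# come from the post's sentences; speaker B turns come from comments.
from typing import Dict, List


def assemble_dialogue(post_sentences: List[str], comments: List[str]) -> List[Dict[str, str]]:
    """Pair each post sentence with a comment to form alternating turns."""
    dialogue = []
    for sentence, comment in zip(post_sentences, comments):
        dialogue.append({"speaker": "A", "utterance": sentence})
        dialogue.append({"speaker": "B", "utterance": comment})
    return dialogue


if __name__ == "__main__":
    # Toy example with made-up post sentences and comments.
    post = [
        "I just finished reading Dune and loved the world-building.",
        "The politics between the great houses felt surprisingly modern.",
    ]
    comments = [
        "Same here, the ecology of Arrakis is what hooked me.",
        "Agreed, the Landsraad intrigue reads like a political thriller.",
    ]
    for turn in assemble_dialogue(post, comments):
        print(f"{turn['speaker']}: {turn['utterance']}")
```

In practice the thesis's model also has to decide how to segment the post and which comments fit each point in the conversation; the sketch only shows the interleaving step.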
Computer Science / Emory University
BS / Spring 2022
Jinho D. Choi, Computer Science and QTM, Emory University (Chair)
Lauren Klein, English and QTM, Emory University
Ting Li, Computer Science, Oxford College of Emory University
Anthology | Paper | Presentation
Mack Hutsell (top-left), Jinho Choi (top-right), Ting Li (bottom-left), Lauren Klein (bottom-right)