Honors Thesis 2025 - Andrew Chung

Retrieval-Augmented Generation: From Data Processing to Information Systems in Low Resource Domains

Andrew Chung

Highest Honor in Computer Science


Abstract

Large Language Models (LLMs) have emerged as a transformative force in technological development over the past few years. These models have been widely integrated across educational, research, and business applications, where they serve as tools that enhance learning, as subjects of research exploration, and as a means to streamline business operations in both internal and customer-facing systems. While LLMs offer diverse capabilities, one of their most sought-after applications across sectors is providing precise, contextual information and insights from domain-specific knowledge bases. In this context, Retrieval-Augmented Generation (RAG) has emerged as the leading framework for leveraging LLMs’ capabilities while maintaining accuracy and reliability.

To advance the understanding and development of successful retrieval-augmented generation systems, we examine various components to identify essential elements and potential performance improvements across different methodologies. Through collaboration with Hyundai, we develop a low-resource domain retrieval-augmented generation system designed to answer questions about automotive safety collision tests using information from multimodal slides. Our approach introduces a novel, language model-centric data processing pipeline that effectively transforms slide information into textual content suitable for retrieval and answer generation. We evaluate the performance of different state-of-the-art retrieval-augmented generation frameworks on our processed data, as well as different variations of embedding models. To assess our system’s effectiveness, we generate synthetic question-answer pairs from our refined data to test the accuracy of different retrieval models. Furthermore, we create additional synthetic question-answer pairs specifically targeting the multimodal table and chart information extracted from the slides. Our findings indicate that utilizing fine-tuned embedding models and language models with the original retrieval-augmented generation framework achieves the highest accuracy. We conclude by outlining next steps to encourage research toward developing open-source retrieval-augmented generation frameworks for low-resource domains.
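
As a rough illustration of the evaluation idea described above (testing retrieval accuracy with synthetic question-answer pairs), the sketch below scores a retriever by top-k hit rate: a question counts as a hit if the chunk it was generated from appears among the k highest-ranked chunks. Everything here is an assumption made for illustration and is not taken from the thesis: the bag-of-words `embed` function stands in for a real (possibly fine-tuned) embedding model, the names `top_k` and `hit_rate_at_k` are hypothetical, and the sample chunks and questions are invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question, chunks, k):
    """Return the indices of the k chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: cosine(query, embed(chunks[i])),
        reverse=True,
    )
    return ranked[:k]


def hit_rate_at_k(qa_pairs, chunks, k):
    """Fraction of questions whose source chunk is among the top-k results.

    qa_pairs is a list of (question, source_chunk_index) tuples, where the
    index records which chunk the synthetic question was generated from.
    """
    hits = sum(
        1 for question, source_idx in qa_pairs
        if source_idx in top_k(question, chunks, k)
    )
    return hits / len(qa_pairs)


# Invented example data (not from the thesis).
chunks = [
    "the frontal crash test measured chest deflection",
    "the side impact test used a moving barrier",
    "the table lists head injury values per seat",
]
qa_pairs = [
    ("what did the frontal crash test measure", 0),
    ("which barrier was used in the side impact test", 1),
]

score = hit_rate_at_k(qa_pairs, chunks, k=1)
```

Swapping `embed` for a different embedding model while holding `qa_pairs` and `chunks` fixed is one simple way to compare retrieval models on the same synthetic test set.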

Department / School

Computer Science / Emory University

Degree / Year

BS / Spring 2025

Committee

Jinho D. Choi, Computer Science, Emory University (Chair)
Joyce C. Ho, Computer Science, Emory University
Sharon Sonenblum, School of Nursing, Emory University

Links

Anthology | Paper | Presentation
