[2024F] Grace Byun (PhD)

LLM Evaluation

Grace Byun

Date: 2024-11-22 / 3:00 - 4:00 PM
Location: White Hall 100


Abstract

In this presentation, we explore recent advancements in large language model (LLM) evaluation, focusing on LLM-as-a-judge methods and open evaluator models. Traditional evaluation metrics often fail to address task-specific requirements, prompting the development of LLM-based evaluators. First, we introduce LLM-as-a-Judge (Zheng et al., 2023), which highlights the scalability and explainability of LLMs as evaluators while addressing biases such as verbosity and position bias. Next, we discuss G-Eval (Liu et al., 2023), a framework that leverages automatic chain-of-thought (Auto-CoT) for evaluation efficiency and proposes a probability-weighted summation scoring method to overcome challenges such as low variance in scores. Finally, we examine Prometheus 2 (Kim et al., 2024), an advanced open-source evaluator model that demonstrates high alignment with human and GPT-4 evaluations through a weight-merging training strategy.
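
As a minimal sketch of G-Eval's probability-weighted scoring, the final score is the expectation over the score scale rather than the single most likely score token; the 1-5 scale and the output distribution below are illustrative assumptions, not values from the paper:

```python
# A sketch of G-Eval-style scoring (Liu et al., 2023): instead of taking the
# single highest-probability score token, the evaluator's final score is the
# probability-weighted sum over the score scale, yielding continuous,
# fine-grained scores. The distribution below is hypothetical.

def g_eval_score(score_probs: dict[int, float]) -> float:
    """Expected score: sum over s of p(s) * s, normalized over the score set."""
    total = sum(score_probs.values())
    return sum(s * p / total for s, p in score_probs.items())

# Hypothetical token probabilities for scores 1-5 from an LLM evaluator.
probs = {1: 0.02, 2: 0.08, 3: 0.35, 4: 0.40, 5: 0.15}
print(g_eval_score(probs))  # 3.58 -- finer-grained than the argmax score of 4
```

The weight-merging strategy attributed to Prometheus 2 can likewise be sketched as linear interpolation between two separately fine-tuned evaluators; the mixing coefficient and the pairing of checkpoints below are illustrative assumptions rather than the paper's exact recipe:

```python
# A sketch of linear weight merging in the spirit of Prometheus 2 (Kim et al.,
# 2024): evaluators fine-tuned separately (e.g., for direct assessment and for
# pairwise ranking) are combined by interpolating their weights. The
# coefficient alpha is an illustrative assumption.

import torch

def merge_state_dicts(direct_sd: dict, pairwise_sd: dict, alpha: float = 0.5) -> dict:
    """Return alpha * direct + (1 - alpha) * pairwise for matching parameter keys."""
    return {k: alpha * direct_sd[k] + (1 - alpha) * pairwise_sd[k]
            for k in direct_sd}

# Usage (assuming two architecture-compatible checkpoints):
#   merged = merge_state_dicts(model_a.state_dict(), model_b.state_dict(), 0.5)
#   model_a.load_state_dict(merged)
```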

Link

Presentation