
KAIST BREAKTHROUGHS

Research Webzine of the KAIST College of Engineering since 2014

Fall 2025 Vol. 25
Electronics

Towards a more reliable evaluation system than humans - BiGGen-Bench

Evaluating LLMs is much more difficult than it seems—especially for creative tasks, where there is no single "correct" answer. BiGGen-Bench, developed by researchers at KAIST’s LK Lab and LG AI Research, is a benchmark meticulously designed to ensure maximum consistency and reliability from LLM-based evaluators.

How can we evaluate the responses generated by AI? And, more importantly, can AI itself take on the role of an evaluator? At first glance, these may seem like simple questions, but they pose fundamental challenges for the development of AI. When evaluating tasks with clear-cut answers—such as math problems—it is easy to determine correctness. However, for creative tasks such as poetry, debating, or counseling, where there is no definite answer, evaluation becomes more subjective. To judge what makes a “good” response, you need precise instructions and carefully crafted evaluation criteria.

Traditionally, humans have assessed AI output, but this process is time-consuming and costly. That is why LLM-as-a-judge—using AI to evaluate other AI—has recently gained traction. Yet current LLM judges often fall short of human evaluators, frequently displaying inconsistent judgments or applying irrelevant criteria.

To address this, the research team introduced BiGGen-Bench, a benchmark that evaluates nine core abilities: instruction following, reasoning, planning, grounding, refinement, safety, theory of mind, tool usage, and multilingualism. The benchmark includes 765 instances, each accompanied by detailed evaluation criteria and reference answers to ensure that LLM judges make consistent and objective assessments.

For example, consider the following prompt: “Please create a four-day itinerary for a trip to Tokyo.” Rather than simply asking whether the plan is “good,” the LLM judge is provided with highly specific evaluation criteria, such as: Does the response incorporate must-visit spots and local cuisine while honoring the four-day constraint? Is the schedule realistic and efficient? The rubric then ties each score to a concrete description:

● 1 point: Impractical plan; lacks significant tourist spots or restaurant recommendations
● …
● 5 points: Fully satisfies all conditions; optimally structured to enhance the traveler’s experience through thoughtful timing and logistics

In this way, BiGGen-Bench converts subjective evaluations into quantifiable assessments by offering clear and rigorous rubrics—even for tasks without definitive answers. (A minimal code sketch of this rubric-based judging setup appears after the findings below.)

The researchers used BiGGen-Bench to evaluate 103 different LLMs. The resulting scores showed a correlation with human evaluations exceeding 0.6, demonstrating the reliability of the LLM-as-a-judge system. The benchmark also revealed several key insights into LLM performance, as described below.

1. Scaling Trends: As expected, larger models generally performed better across most tasks. However, some capabilities did not improve proportionally with model size, indicating that certain abilities may require more than scaling alone.

2. Pre-training vs. Post-training: Post-trained models showed clear improvements in instruction following, but gains were limited in higher-order skills such as reasoning and tool use.

3. Open-source vs. Commercial Models: Open-source LLMs lagged behind proprietary models such as ChatGPT in complex skills such as multilingual understanding, reasoning, and theory of mind. This highlights the need for significant advancements in open-source model development to match commercial capabilities.
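To make the rubric-based judging described above concrete, here is a minimal Python sketch of how a single benchmark-style instance could be turned into a judge prompt and how the judge’s verdict might be parsed. The field names, prompt template, and the `[RESULT] N` output convention are illustrative assumptions for this article, not the benchmark’s official schema or code.

```python
import re

# One benchmark-style instance: a capability, a task prompt, a 1-5 score
# rubric, and a reference answer for the judge to compare against.
# (Field names and formats are illustrative, not the official BiGGen-Bench schema.)
instance = {
    "capability": "planning",
    "instruction": "Please create a four-day itinerary for a trip to Tokyo.",
    "reference_answer": "Day 1: Asakusa and Senso-ji ... Day 4: Shibuya, then departure.",
    "criteria": (
        "Does the response incorporate must-visit spots and local cuisine while "
        "honoring the four-day constraint, with a realistic and efficient schedule?"
    ),
    "rubric": {
        1: "Impractical plan; lacks significant tourist spots or restaurant recommendations.",
        5: "Fully satisfies all conditions; optimally structured through thoughtful timing and logistics.",
    },
}


def build_judge_prompt(inst: dict, response: str) -> str:
    """Assemble the text sent to the LLM judge: task, response to grade,
    reference answer, evaluation criteria, and per-score descriptions."""
    rubric_lines = "\n".join(f"  {s}: {d}" for s, d in sorted(inst["rubric"].items()))
    return (
        "You are an impartial evaluator.\n"
        f"Task: {inst['instruction']}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        f"Reference answer:\n{inst['reference_answer']}\n\n"
        f"Evaluation criteria: {inst['criteria']}\n"
        f"Score descriptions:\n{rubric_lines}\n\n"
        "Write brief feedback, then finish with '[RESULT] <score from 1 to 5>'."
    )


def parse_score(judge_output: str) -> int | None:
    """Pull the integer score out of the judge's '[RESULT] N' line."""
    match = re.search(r"\[RESULT\]\s*([1-5])", judge_output)
    return int(match.group(1)) if match else None


# The call to the judge model itself is omitted; this only shows the prompt
# that would be sent and how a returned verdict would be parsed.
print(build_judge_prompt(instance, "Day 1: Tsukiji outer market, Day 2: ..."))
print(parse_score("The plan covers major sights but Day 3 is rushed. [RESULT] 4"))  # -> 4
```

In the benchmark itself, each of the 765 instances carries its own hand-written criteria and reference answer, which is what keeps a judge’s scoring consistent across very different tasks.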
Overall, BiGGen-Bench is not merely an evaluation tool—it is a comprehensive framework for diagnosing the multifaceted capabilities of LLMs and enhancing the reliability of evaluations. It enables researchers to draw meaningful insights across dimensions such as model size, training methodology, and platform type (open-source vs. proprietary).

Ultimately, BiGGen-Bench is poised to become an essential tool for building better language models. This work was presented at NAACL 2025, one of the most prestigious conferences in natural language processing, and was honored with the Best Paper Award—given to only one paper each year.
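For readers curious how the agreement figure reported above can be measured, the short sketch below correlates a judge’s scores with human scores using Pearson’s r. The score lists are invented for illustration and are not the paper’s data.

```python
from statistics import correlation  # Pearson's r, available since Python 3.10

# Hypothetical 1-5 ratings of the same ten responses by humans and by an
# LLM judge; these numbers are made up for illustration, not the paper's data.
human_scores = [5, 4, 2, 3, 5, 1, 4, 3, 2, 4]
judge_scores = [4, 4, 2, 3, 5, 2, 5, 3, 1, 4]

# The closer this value is to 1.0, the more the judge agrees with humans;
# BiGGen-Bench reports correlations above 0.6 with human evaluators.
print(f"Pearson correlation: {correlation(human_scores, judge_scores):.2f}")
```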
