Arman Cohan - Evaluating and Understanding LLMs: From Scientific Reasoning to Alignment as Judges

Author: uclanlp-plus

Uploaded: 2025-12-18

Views: 52

Description:

Talk Title: Evaluating and Understanding LLMs: From Scientific Reasoning to Alignment as Judges

Abstract: We present our recent work on evaluating and understanding large language models in scientific contexts, and on how their evaluation and generation capabilities relate. First, we'll introduce SciArena, an open evaluation platform for literature-grounded scientific tasks that uses expert preferences to rank models on long-form responses. The platform currently supports a broad set of open and proprietary models and has already accumulated a large pool of high-quality preferences. Using these data, we release SciArena-Eval, a meta-evaluation benchmark for training and stress-testing automated judges on science tasks. We will then turn to scientific problem solving. We discuss a holistic suite of scientific reasoning tasks and a new framework for studying the role of knowledge in scientific problem solving and its interaction with reasoning. Our analysis shows three things: retrieving task-relevant knowledge from model parameters is the primary bottleneck for science reasoning; in-context external knowledge systematically helps even strong reasoning models; and improved verbalized reasoning increases a model’s ability to surface the right knowledge. Finally, if there is time, we will present work on generation–evaluation consistency and show that models that judge well also tend to generate outputs that align with human preferences. This enables alignment benchmarking that evaluates models in their role as judges without scoring their generations directly.
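To make the two evaluation ideas in the abstract concrete, here is a minimal sketch of (1) turning pairwise preference votes into a model ranking and (2) scoring an automated judge by how often it agrees with human preferences. The abstract does not say how SciArena aggregates votes or how SciArena-Eval scores judges, so the Elo-style update, the function names `elo_ratings` and `judge_agreement`, and the toy votes are all assumptions made for illustration.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Rank models from pairwise votes with a simple Elo update.

    votes: iterable of (model_a, model_b, winner), winner is "a" or "b".
    Assumption: Elo is used only as a familiar stand-in; the platform's
    actual aggregation method is not described in the abstract.
    """
    ratings = defaultdict(lambda: base)
    for a, b, winner in votes:
        # Probability that model a wins under the current ratings.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        score_a = 1.0 if winner == "a" else 0.0
        # Zero-sum update: whatever a gains, b loses.
        delta = k * (score_a - expected_a)
        ratings[a] += delta
        ratings[b] -= delta
    return dict(ratings)


def judge_agreement(human_prefs, judge_prefs):
    """Fraction of comparisons where the judge picks the human-preferred side."""
    if len(human_prefs) != len(judge_prefs) or not human_prefs:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(h == j for h, j in zip(human_prefs, judge_prefs))
    return matches / len(human_prefs)


if __name__ == "__main__":
    # Hypothetical votes; the model names are placeholders.
    toy_votes = [("m1", "m2", "a"), ("m1", "m2", "a"), ("m2", "m3", "a")]
    ranking = sorted(elo_ratings(toy_votes).items(), key=lambda kv: -kv[1])
    print(ranking)
    print(judge_agreement(["a", "b", "a", "b"], ["a", "b", "b", "b"]))
```

Agreement with human pairwise labels is one simple way to score a judge; under the consistency result the abstract describes, a model's score in this judging role would also be informative about how well its own generations align with human preferences.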

To check out other talks in our full NLP Seminar Series, please visit: • UCLA NLP Seminar Series

