Alignment faking in large language models

Author: Anthropic

Uploaded: 2024-12-18

Views: 54037

Description:

Most of us have encountered situations where someone appears to share our views or values, but is in fact only pretending to do so—a behavior that we might call “alignment faking”.

Could AI models also display alignment faking?

Ryan Greenblatt, Monte MacDiarmid, Benjamin Wright and Evan Hubinger discuss a new paper from Anthropic, in collaboration with Redwood Research, that provides the first empirical example of a large language model engaging in alignment faking without having been explicitly—or even, we argue, implicitly—trained or instructed to do so.

Learn more: https://www.anthropic.com/research/al...

0:00 Introduction
0:47 Core setup and key findings of the paper
6:14 Understanding alignment faking through real-world analogies
9:37 Why alignment faking is concerning
14:57 Examples of model outputs
21:39 Situational awareness and synthetic documents
28:00 Detecting and measuring alignment faking
38:09 Model training results
47:28 Potential reasons for model behavior
53:38 Frameworks for contextualizing model behavior
1:04:30 Research in the context of current model capabilities
1:09:26 Evaluations for bad behavior
1:14:22 Limitations of the research
1:20:54 Surprises and takeaways from results
1:24:46 Future directions
