AppWorld: Reliable Evaluation of Interactive Agents in a Controllable World of Apps and People

Author: Ai2

Uploaded: 2024-11-04

Views: 422

Description:

Speaker:
Harsh Trivedi

Abstract:
We envision a world where AI agents (assistants) are widely used for complex tasks in our digital and physical worlds and are broadly integrated into our society. To move towards such a future, we need an environment for a robust evaluation of agents' capability, reliability, and trustworthiness.

In this talk, I'll introduce AppWorld, which is a step towards this goal in the context of day-to-day digital tasks. AppWorld is a high-fidelity simulated world of people and their digital activities on nine apps like Amazon, Gmail, and Venmo. On top of this fully controllable world, we build a benchmark of complex day-to-day tasks such as splitting Venmo bills with roommates, which agents have to solve via interactive coding and API calls. One of the fundamental challenges with complex tasks lies in accounting for different ways in which the tasks can be completed. I will describe how we address this challenge using a reliable and programmatic evaluation framework. Our benchmarking evaluations show that even the best LLMs, like GPT-4o, can only solve ~30% of such tasks, highlighting the challenging nature of the AppWorld benchmark.
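One way to picture an evaluation that accepts different ways of completing a task is to check the final state of the world rather than the exact sequence of calls the agent made. The sketch below is only an illustration of that idea under stated assumptions. It is not AppWorld's API or its actual evaluation code, and the state layout and every function name in it are hypothetical.

```python
from copy import deepcopy

# Hypothetical toy world state; this is NOT AppWorld's schema.
def initial_state():
    return {"requests": [], "balance": {"me": 100.0}}

def share_of(total, roommates):
    # Split evenly between the user and all roommates.
    return round(total / (len(roommates) + 1), 2)

# Two hypothetical agent solutions that reach the same outcome differently.
def solution_in_order(state, roommates, total):
    for name in roommates:
        state["requests"].append({"to": name, "amount": share_of(total, roommates)})

def solution_reversed(state, roommates, total):
    for name in reversed(roommates):
        state["requests"].append({"amount": share_of(total, roommates), "to": name})

# A faulty solution: right requests, but it also changes an unrelated balance.
def solution_with_side_effect(state, roommates, total):
    solution_in_order(state, roommates, total)
    state["balance"]["me"] -= 10.0

def evaluate(before, after, roommates, total):
    # Check 1: the expected requests exist, regardless of creation order.
    expected = sorted((name, share_of(total, roommates)) for name in roommates)
    actual = sorted((r["to"], r["amount"]) for r in after["requests"])
    # Check 2: nothing outside the task's scope was modified.
    unchanged = after["balance"] == before["balance"]
    return actual == expected and unchanged

def run(solution, roommates=("ana", "ben"), total=90.0):
    before = initial_state()
    after = deepcopy(before)
    solution(after, list(roommates), total)
    return evaluate(before, after, list(roommates), total)
```

In this toy setup, `run(solution_in_order)` and `run(solution_reversed)` both pass because only the resulting state is compared, while `run(solution_with_side_effect)` fails the second check even though the requests themselves are correct.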

I will conclude by laying out future research that can be conducted on the foundation of AppWorld, such as evaluation and development of multimodal, collaborative, safe, socially intelligent, resourceful, and fail-tolerant agents that can plan, adapt, and learn from environment feedback.

Project Website: https://appworld.dev/

Bio: https://harshtrivedi.me/
Harsh Trivedi is a final-year PhD student at Stony Brook University, advised by Niranjan Balasubramanian. He is broadly interested in the development of reliable, explainable AI systems and their rigorous evaluation. Specifically, his research spans the domains of AI agents, multi-step reasoning, AI safety, and efficient NLP. He has interned at AI2 and was a visiting researcher at NYU. His recent work, AppWorld, received a Best Resource Paper award at ACL’24, and his work on AI safety via debate received a Best Paper award at the ML Safety workshop at NeurIPS’22.

