APEX-Agents
Author: AI Papers Podcast Daily
Uploaded: 2026-01-25
Views: 12
Researchers introduced the APEX-Agents benchmark to evaluate whether AI agents can perform complex professional tasks of the kind found in investment banking, management consulting, and law. The benchmark was built by industry experts, who designed realistic scenarios in which the AI must use various tools and files to complete work that would typically take a human one to two hours. The study tested eight AI models; Gemini 3 Flash performed best with a success rate of 24%, followed closely by GPT-5.2. These low success rates indicate that, although AI agents are becoming more capable, they are not yet consistent enough to reliably handle the demanding daily work of human professionals.
https://arxiv.org/pdf/2601.14242
https://huggingface.co/datasets/merco...
https://github.com/Mercor-Intelligenc...