Building the Next Generation of Conversational AI
Author: a16z
Uploaded: Mar 15, 2025
Views: 10,681
Inside the Code: Ankit Kumar (Sesame) & Anjney Midha (a16z) on the Future of Voice AI
What goes into building a truly natural-sounding AI voice? In this episode, Sesame’s cofounder and CTO, Ankit Kumar, joins a16z’s Anjney Midha for a deep dive into the research and engineering behind their voice technology.
They discuss the technical challenges of real-time speech generation, the trade-offs in balancing personality with efficiency, and why the team is open-sourcing key components of their model. Ankit breaks down the complexities of multimodal AI, full-duplex conversation modeling, and the computational optimizations that enable low-latency interactions. They also explore the evolution of natural language as a user interface and its potential to redefine human-computer interaction.
Plus, we take audience questions on everything from scaling laws in speech synthesis to the role of in-context learning in making AI voices more expressive.
Key Takeaways:
How Sesame achieves natural voice interactions through real-time speech generation.
The impact of open-sourcing their speech model and what it means for AI research.
The role of full-duplex modeling in improving AI responsiveness.
How computational efficiency and system latency shape AI conversation quality.
The growing role of natural language as a user interface in AI-driven experiences.
For anyone interested in AI and voice technology, this episode offers an in-depth look at the latest advancements pushing the boundaries of human-computer interaction.
Follow everyone on X:
Ankit Kumar - https://x.com/_apkumar
Anjney Midha - https://x.com/anjneymidha
Check out everything a16z is doing with artificial intelligence, including articles, projects, and more podcasts here – https://a16z.com/ai/
Chapters:
0:00 - 00:51 | Intro
00:52 - 04:58 | Challenges Of Building
04:59 - 07:45 | Q + A: What Was Done To Bridge Transcription And Text Processing?
07:46 - 09:57 | How Is Sesame So Much Better Than Others?
09:58 - 12:42 | Challenges In Making AI Accessible To All
12:43 - 14:10 | Great Researchers Prioritize User Experience
14:11 - 15:47 | What Is Good Taste In ML?
15:48 - 17:45 | Problems That Can Be Solved That Add Value To The World
17:46 - 26:25 | Open Source Audio For Speech Generation
26:26 - 34:00 | Contextual Speech vs Text to Speech, Differences
34:01 - 35:50 | Value Proposition Of Glasses With No Friction
35:51 - 38:00 | General Purpose API vs Open Source Model
38:01 - 40:47 | Creating High Quality APIs
40:48 - 45:54 | Companions And How Sesame Will Handle Context Retention In Long Conversations
45:55 - 46:59 | Talent: What It Takes To Become A Part Of The Sesame Team
47:00 - 54:37 | How Scaling Laws For Speech Differ From Text
54:38 - 58:33 | How An Organic Conversation Can Be Preserved Using A Voice Companion
58:34 - 1:03:52 | App Building Technology: Roadmap
1:03:53 - 1:09:09 | Architectures and Transformers
1:09:10 - 1:15:56 | The Focus On Personality, And The Differences In Products
1:15:57 - 1:25:25 | New AI Interface: Interacting With AI Companion
1:25:26 - 1:26:56 | Companion Challenges
1:26:57 - 1:29:22 | Computing Interface Of The Future
1:29:23 - 1:31:45 | Focused Product Experience Built By Small Teams
1:31:46 - 1:36:13 | Join Sesame If You Want To Make A Consumer Product People Love
