How ChatGPT Turns the Internet Into Tokens (LLM Training Explained)
Author: Lecture Distilled
Uploaded: 2026-01-13
Views: 20
---
Ever wondered how ChatGPT was trained on "the internet" but its dataset fits on a $200 hard drive?
In this video, we break down the first stage of LLM training:
How 2.7 billion web pages get filtered down to 44 terabytes
Why tokenization matters (and why capitalization breaks things)
The design decisions that determine what your AI can and can't do
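One of the filtering steps mentioned above is deduplication. As a rough illustration (not the actual pipeline used for ChatGPT's data, whose details aren't given here), exact-duplicate removal can be sketched by hashing normalized page text:

```python
import hashlib

def dedupe(pages):
    """Drop exact-duplicate documents by hashing normalized text.

    Real pipelines also use fuzzy matching (e.g. MinHash) to catch
    near-duplicates; this sketch is the exact-match baseline only.
    """
    seen, unique = set(), []
    for page in pages:
        # Normalize whitespace/case so trivial variants hash identically.
        digest = hashlib.sha256(page.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique

pages = ["Hello world", "hello world  ", "A different page"]
print(dedupe(pages))  # the second page is a trivial variant and is dropped
```

Steps like this (plus URL filtering and text extraction) are how billions of raw pages shrink to tens of terabytes of training text.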
Key concepts covered:
00:00 - The paradox: Internet-scale training on a hard drive
00:30 - Data filtering pipeline (URL filtering, text extraction, deduplication)
02:30 - Why neural networks need tokens, not text
04:00 - Byte Pair Encoding explained
05:30 - Tokenization gotchas ("hello" vs "Hello")
06:30 - Practical takeaways
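The Byte Pair Encoding and capitalization points above can be sketched in a few lines: start from raw UTF-8 bytes, repeatedly merge the most frequent adjacent pair into a new token id, and observe that "hello" and "Hello" end up with different token sequences. (A toy sketch for intuition, not the actual GPT tokenizer.)

```python
from collections import Counter

def most_common_pair(tokens):
    """Return the most frequent adjacent token pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge(tokens, pair, new_token):
    """Replace every occurrence of `pair` with `new_token`."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Start from raw bytes, then run a few BPE merge rounds.
text = "hello hello Hello"
tokens = list(text.encode("utf-8"))
for new_id in range(256, 259):  # three merge rounds
    tokens = merge(tokens, most_common_pair(tokens), new_id)

# "hello" and "Hello" differ in their first byte, so no merge bridges
# them: capitalization alone yields a different token sequence.
print(tokens)
```

Because the frequent lowercase "ello"-style merges never include the capital "H" byte, the model sees "Hello" and "hello" as different token sequences, which is exactly the gotcha covered at 05:30.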
---
📚 ORIGINAL SOURCE
This video distills concepts from Andrej Karpathy's excellent deep dive:
"Deep Dive into LLMs like ChatGPT"
All credit for the original content goes to Andrej Karpathy. This is an educational summary designed to make key concepts more accessible.
---
🎓 About Lecture Distilled
We transform long-form educational content into focused, digestible videos. Subscribe for more distilled knowledge!
#LLM #ChatGPT #MachineLearning #AI #Tokenization #DeepLearning #ArtificialIntelligence #AndrejKarpathy