L-3 | Building LLM Tokenizers From Scratch (With Code!)
Author: Code With Aarohi Hindi
Uploaded: 2025-12-09
Views: 121
In the last lecture, we built our own TinyGPT LLM from scratch using manual tokenization.
Today, we upgrade that system using real, production-level tokenizers.
GitHub (both links have the same code):
https://github.com/codewithaarohi/Bui...
https://github.com/AarohiSingla/Build...
📧 You can also reach me at: [email protected]
📸 Follow me on Instagram (English) : @codewithaarohi
📸 Follow me on Instagram (Hindi) : @codewithaarohihindi
If you haven't watched the previous lecture, I highly recommend watching it first; we built the entire TinyGPT model step by step.
In this video, you will learn:
What tokenizers really do
How LLMs convert text → tokens → numbers
How to use SentencePiece (sketch after this list)
How to use BPE (Byte Pair Encoding) (sketch below)
How to use pretrained tokenizers like GPT-2, BERT, LLaMA, T5 (sketch below)
How to train your own tokenizer from your own dataset (covered in the SentencePiece sketch below)
How vocabulary size, domain-specific text, and language mix affect tokens
How embedding layers convert token IDs into vectors (sketch below)
How to integrate everything into our TinyGPT model
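
For the SentencePiece and train-your-own-tokenizer items, here is a minimal sketch, not the video's exact code: it assumes a plain-text corpus at corpus.txt and uses a hypothetical model prefix tiny_sp.

import sentencepiece as spm

# Train a small BPE model on the corpus (assumed file: corpus.txt).
# vocab_size is the main knob discussed in the video; 2000 is illustrative.
spm.SentencePieceTrainer.train(
    input="corpus.txt",      # one sentence per line
    model_prefix="tiny_sp",  # writes tiny_sp.model and tiny_sp.vocab
    vocab_size=2000,
    model_type="bpe",        # "unigram" is the SentencePiece default
)

sp = spm.SentencePieceProcessor(model_file="tiny_sp.model")
text = "Tokenizers turn text into numbers."
print(sp.encode(text, out_type=str))  # subword pieces
print(sp.encode(text))                # integer token IDs
print(sp.decode(sp.encode(text)))     # round trip back to text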
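
For the BPE item, a minimal sketch with the Hugging Face tokenizers library; again, corpus.txt and the vocabulary size are assumptions, not taken from the video.

from tokenizers import ByteLevelBPETokenizer

# Byte-level BPE: merges start from raw bytes, so no input text is ever "unknown".
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["corpus.txt"], vocab_size=2000, min_frequency=2)

enc = tokenizer.encode("Tokenizers turn text into numbers.")
print(enc.tokens)  # learned subword strings
print(enc.ids)     # the matching integer token IDs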
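
For the pretrained-tokenizer item, a minimal sketch with transformers that also shows the text → tokens → numbers pipeline end to end. The checkpoints gpt2, bert-base-uncased, and t5-small are public ones chosen for illustration; LLaMA tokenizers load the same way, but the official checkpoints are gated.

from transformers import AutoTokenizer

text = "Tokenizers turn text into numbers."
for name in ["gpt2", "bert-base-uncased", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(text)
    # Same sentence, different vocabulary => different token counts and pieces.
    print(name, len(ids), tok.convert_ids_to_tokens(ids))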
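
For the embedding-layer item, a minimal PyTorch sketch (assuming TinyGPT is built in PyTorch; the sizes are illustrative). Note that the embedding table's row count must match the tokenizer's vocabulary size.

import torch
import torch.nn as nn

vocab_size, embed_dim = 2000, 64             # illustrative; match your tokenizer
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([[5, 42, 7, 981]])  # shape (batch=1, seq_len=4)
vectors = embedding(token_ids)               # shape (1, 4, 64): one vector per token
print(vectors.shape)

In a GPT-style model this lookup is the first layer: every downstream computation sees these vectors, never the raw token IDs.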
Libraries Covered
sentencepiece (train your own tokenizer)
tokenizers (BPE, ByteLevelBPETokenizer)
gensim (Word2Vec, FastText embeddings; sketch after this list)
transformers (HuggingFace tokenizers)
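
As a taste of the gensim coverage, a minimal Word2Vec sketch on a toy corpus; real embeddings need far more text, and every name and number here is illustrative.

from gensim.models import Word2Vec

sentences = [
    ["tokenizers", "turn", "text", "into", "numbers"],
    ["embeddings", "turn", "numbers", "into", "vectors"],
]
model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, epochs=50)
print(model.wv["tokenizers"][:5])        # first 5 dimensions of one word vector
print(model.wv.most_similar("numbers"))  # nearest neighbours in the vector space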
👍 Support the Channel
Your support pushes me to create even better videos.
Please Like, Comment, Share, and Subscribe ❤️