How DeepSeek Rewrote the Transformer [MLA]

Author: Welch Labs

Uploaded: 2025-03-05

Views: 826411

Description:

Thanks to KiwiCo for sponsoring today’s video! Go to https://www.kiwico.com/welchlabs and use code WELCHLABS for 50% off your first monthly club crate or for 20% off your first Panda Crate!

MLA/DeepSeek Poster at 17:12 (Free shipping for a limited time with code DEEPSEEK):
https://www.welchlabs.com/resources/m...

Limited edition MLA Poster and Signed Book:
https://www.welchlabs.com/resources/d...

Imaginary Numbers book is back in stock!
https://www.welchlabs.com/resources/i...

Special Thanks to Patrons / welchlabs

Juan Benet, Ross Hanson, Yan Babitski, AJ Englehardt, Alvin Khaled, Eduardo Barraza, Hitoshi Yamauchi, Jaewon Jung, Mrgoodlight, Shinichi Hayashi, Sid Sarasvati, Dominic Beaumont, Shannon Prater, Ubiquity Ventures, Matias Forti, Brian Henry, Tim Palade, Petar Vecutin, Nicolas baumann, Jason Singh, Robert Riley, vornska, Barry Silverman, Jake Ehrlich

References
DeepSeek-V2 paper: https://arxiv.org/pdf/2405.04434
DeepSeek-R1 paper: https://arxiv.org/abs/2501.12948
Great Article by Ege Erdil: https://epoch.ai/gradient-updates/how...
GPT-2 Visualization: https://github.com/TransformerLensOrg...
Manim Animations: https://github.com/stephencwelch/mani...

Technical Notes

1. Note that the DeepSeek-V2 paper claims a KV cache size reduction of 93.3%. They don’t exactly publish their methodology, but as far as I can tell it’s something like this: start with the DeepSeek-V2 hyperparameters here: https://huggingface.co/deepseek-ai/De.... num_hidden_layers=30, num_attention_heads=32, v_head_dim=128. If DeepSeek-V2 were implemented with traditional MHA, the KV cache size would be 2*32*128*30*2 = 491,520 B/token. With MLA and a per-layer KV cache size of 576, we get a total cache size of 576*30*2 = 34,560 B/token. The percent reduction in KV cache size is then (491,520-34,560)/491,520 = 93.0%. The numbers I present in this video follow the same approach but are for the DeepSeek-V3/R1 architecture: https://huggingface.co/deepseek-ai/De.... num_hidden_layers=61, num_attention_heads=128, v_head_dim=128. So the traditional MHA cache would be 2*128*128*61*2 = 3,997,696 B/token. MLA reduces this to 576*61*2 = 70,272 B/token. For the DeepSeek-V3/R1 architecture, MLA reduces the KV cache size by a factor of 3,997,696/70,272 = 56.9x.
2. I claim a couple of times that MLA allows DeepSeek to generate tokens more than 6x faster than a vanilla transformer. The DeepSeek-V2 paper claims a slightly less than 6x throughput improvement with MLA, but since the V3/R1 architecture is heavier, we expect a larger lift, which is why I claim “more than 6x faster than a vanilla transformer” - in reality it’s probably significantly more than 6x for the V3/R1 architecture.
3. In all attention patterns and walkthroughs, we’re ignoring the |beginning of sentence| token. “The American flag is red, white, and” actually maps to 10 tokens if we include this starting token, and many attention patterns do assign high values to this token.
4. We’re ignoring bias terms in the matrix equations.
5. We’re ignoring positional embeddings. These are fascinating. See the DeepSeek papers and RoPE.
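
The KV cache arithmetic in note 1 can be reproduced with a short Python sketch. The function names are hypothetical, and the 2 bytes per cached value is an assumption inferred from the trailing *2 factor in the note; the layer counts, head counts, head dimension, and the 576-wide MLA cache are the figures quoted in note 1, not values checked against the model configs.

```python
def mha_kv_bytes_per_token(num_layers, num_heads, head_dim, bytes_per_value=2):
    """Per-token KV cache for standard multi-head attention.

    The leading 2 counts one key vector plus one value vector per head.
    bytes_per_value=2 is an assumption (16-bit storage).
    """
    return 2 * num_heads * head_dim * num_layers * bytes_per_value


def mla_kv_bytes_per_token(num_layers, cache_width=576, bytes_per_value=2):
    """Per-token KV cache for MLA, using the 576-wide per-layer cache from note 1."""
    return cache_width * num_layers * bytes_per_value


# Hyperparameters as quoted in note 1 (DeepSeek-V2 figures, then V3/R1 figures).
v2_mha = mha_kv_bytes_per_token(num_layers=30, num_heads=32, head_dim=128)
v2_mla = mla_kv_bytes_per_token(num_layers=30)
v3_mha = mha_kv_bytes_per_token(num_layers=61, num_heads=128, head_dim=128)
v3_mla = mla_kv_bytes_per_token(num_layers=61)

v2_reduction_pct = 100 * (v2_mha - v2_mla) / v2_mha
v3_factor = v3_mha / v3_mla

print(f"V2 figures:    MHA {v2_mha:,} B/token, MLA {v2_mla:,} B/token, "
      f"reduction {v2_reduction_pct:.1f}%")
print(f"V3/R1 figures: MHA {v3_mha:,} B/token, MLA {v3_mla:,} B/token, "
      f"factor {v3_factor:.1f}x")
```

The layer count and the bytes-per-value factor appear in both the MHA and MLA totals, so they cancel in the ratio: the reduction depends only on the number of heads, the head dimension, and the MLA cache width.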
