How To Reduce LLM Decoding Time With KV-Caching!

Author: The ML Tech Lead!

Uploaded: 2024-11-04

Views: 2576

Description:

The attention mechanism is known to be pretty slow! The time complexity of vanilla attention is quadratic in the number of tokens in the input sequence, and a careless decoding loop pays a growing cost again for every new token it generates. So, we need to be smart about the computations we do when we decode text sequences. When we decode text, there are many tensors that we recompute over and over, in particular the keys and values of the tokens we have already processed. Instead of recomputing them at every step, we are going to cache them to save on computation. Let me show you how!
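To make the idea concrete, here is a minimal sketch of KV-caching for a single attention head, written in plain Python. It is an illustration under stated assumptions, not the code from the video: the helper names (`matvec`, `attend`, `decode_no_cache`, `decode_with_cache`) are hypothetical, the token embeddings `X` are fixed in advance instead of being sampled step by step, and the weights `Wq`, `Wk`, `Wv` are small toy matrices. Both loops return the same attention outputs, along with a count of how many key/value projections they performed.

```python
import math

def matvec(W, x):
    """Project vector x with weight matrix W (a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys/values so far."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def decode_no_cache(X, Wq, Wk, Wv):
    """Naive loop: re-project keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(1, len(X) + 1):
        keys = [matvec(Wk, x) for x in X[:t]]    # recomputed from scratch
        values = [matvec(Wv, x) for x in X[:t]]  # recomputed from scratch
        projections += 2 * t
        outputs.append(attend(matvec(Wq, X[t - 1]), keys, values))
    return outputs, projections

def decode_with_cache(X, Wq, Wk, Wv):
    """Cached loop: project only the newest token and append it to the cache."""
    k_cache, v_cache = [], []
    outputs, projections = [], 0
    for x in X:
        k_cache.append(matvec(Wk, x))
        v_cache.append(matvec(Wv, x))
        projections += 2
        outputs.append(attend(matvec(Wq, x), k_cache, v_cache))
    return outputs, projections

# Toy inputs (assumed values, chosen only for illustration).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[0.5, 0.5], [-0.5, 0.5]]
Wv = [[2.0, 0.0], [0.0, 3.0]]

slow_out, slow_count = decode_no_cache(X, Wq, Wk, Wv)
fast_out, fast_count = decode_with_cache(X, Wq, Wk, Wv)
print(slow_count, fast_count)
```

For a sequence of n tokens, the naive loop in this sketch performs n(n+1) key/value projections, while the cached loop performs 2n. The attention step itself still looks at every cached key and value, so the cache removes the repeated projections, not the attention over the prefix.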
