Introduction to the Swin Transformer
Author: Vizuara
Uploaded: 2025-12-12
Views: 1969
Join Transformers for Vision Pro: https://vizuara.ai/courses/transforme...
In this lecture, I take you through the intuition behind the Swin Transformer architecture, which is honestly one of the most fascinating yet slightly complicated ideas to emerge from computer vision research. Because of this complexity, I decided to split the topic into two parts, so that you have enough time to absorb the intuition today and come fully prepared next week when we begin coding the Swin Transformer from scratch. The implementation is not straightforward at all; it is more complex than anything we have done so far in this course, so it will really help if you listen carefully today and revisit the key ideas once before we start coding.
The Swin Transformer paper from Microsoft Research Asia is extremely dense and has close to thirty-eight thousand citations. Even though the idea is brilliant and the results were groundbreaking, I personally found the writing very terse, which makes it difficult for someone reading it for the first time, so in this lecture I break down every component slowly and clearly, so that you genuinely understand what is happening rather than mechanically following the paper. The core idea behind Swin is shifted windows, and today we spend time understanding what windows are, how they differ from patches, why attention is restricted to local windows, and what problems arise when we move from global attention to window-based attention.
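To give you a concrete picture of what a "window" is before the lecture, here is a minimal PyTorch sketch of partitioning a grid of patch tokens into non-overlapping windows; the 56x56 token map, 96-dim embeddings, and 7x7 window size are only illustrative values, not part of the lecture material itself.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) map of patch tokens into non-overlapping
    window_size x window_size windows; attention is then computed inside
    each window independently instead of across the whole image."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)
    return windows  # (B * num_windows, window_size, window_size, C)

# e.g. a 56x56 map of 96-dim patch tokens split into 7x7 windows -> 64 windows
tokens = torch.randn(1, 56, 56, 96)
print(window_partition(tokens, 7).shape)  # torch.Size([64, 7, 7, 96])
```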
We begin by revising Vision Transformer ideas such as patching, linear embedding, attention complexity, and the role of the CLS token. From there we move to the main computational problem of ViT: its attention cost scales quadratically with the number of tokens, and therefore with image resolution, which makes ViT extremely expensive for high-resolution tasks like detection and segmentation. From this point we slowly build the motivation for Swin, because Swin overcomes this issue by restricting attention to small non-overlapping windows, which immediately brings the cost down from quadratic to linear in the number of tokens, a change that matters a great deal for practical computer vision systems.
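To make the quadratic-versus-linear argument concrete, here is a small back-of-the-envelope calculation using the cost formulas from the Swin paper (global MSA costs 4hwC^2 + 2(hw)^2 C, window MSA costs 4hwC^2 + 2M^2 hwC); the specific values of h, w, C and the window size M below are just illustrative, roughly matching the first stage of Swin-T on a 224x224 image.

```python
# h, w: tokens per side after 4x4 patch embedding of a 224x224 image
# C: embedding dimension, M: window size (illustrative values)
h = w = 56
C = 96
M = 7

global_msa = 4 * h * w * C**2 + 2 * (h * w) ** 2 * C   # quadratic in h*w tokens
window_msa = 4 * h * w * C**2 + 2 * M**2 * h * w * C   # linear in h*w tokens

print(f"global MSA : {global_msa:,} FLOPs")
print(f"window MSA : {window_msa:,} FLOPs")
print(f"ratio      : {global_msa / window_msa:.1f}x cheaper with windows")
```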
Once the motivation is clear, we explore how window-based attention is constructed, how the architecture becomes hierarchical across stages, how patch merging reduces the number of tokens while increasing the channel dimension, and how this hierarchy gives Swin the multiscale, CNN-like behaviour that lets it work across classification, detection, and segmentation. You will see clearly why the stage-wise reduction of spatial resolution and increase of channel width is essential, and how these ideas flow together into a consistent architecture.
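As a preview of the patch-merging idea we cover in this part, here is a minimal sketch of a merging layer that groups every 2x2 neighbourhood of tokens, concatenates their features (4C) and projects them down to 2C, so spatial resolution halves while channel width doubles; it mirrors the structure of the usual Swin implementation, but the dimensions used in the shape check are only illustrative.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Merge each 2x2 group of neighbouring tokens: spatial resolution halves,
    channel dimension goes 4C -> 2C via a linear projection."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):            # x: (B, H, W, C), H and W assumed even
        x0 = x[:, 0::2, 0::2, :]     # top-left token of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]     # bottom-left
        x2 = x[:, 0::2, 1::2, :]     # top-right
        x3 = x[:, 1::2, 1::2, :]     # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)

# quick shape check with illustrative sizes
tokens = torch.randn(1, 56, 56, 96)
print(PatchMerging(96)(tokens).shape)  # torch.Size([1, 28, 28, 192])
```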
In the second half of the lecture, we go deep into the most difficult component of the entire architecture: shifted-window multi-head self-attention. This is where most people struggle, because the paper does not explain it gently. I show you exactly how shifting works, why cyclic shifting is needed, how window IDs are assigned before and after the shift, how the attention mask is created to stop tokens from unrelated regions attending to each other inside a shifted window, and how alternating regular window attention with shifted window attention lets patches that previously could not attend to each other interact indirectly, restoring long-range dependency without increasing the computational cost.
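If you want to see the mechanics before the lecture, here is a small sketch of the cyclic shift and of building the attention mask from region IDs; the 8x8 feature map, window size 4 and shift of 2 are toy values chosen only to keep the tensors small, and next week we build the real version step by step.

```python
import torch

H = W = 8          # feature-map size (toy value)
M = 4              # window size
shift = M // 2     # shift used in the shifted-window blocks

# the shifted-window block first rolls the feature map by (-shift, -shift);
# torch.roll wraps tokens around the border, which is what lets windows
# straddle the old window boundaries
feat = torch.randn(1, H, W, 96)
shifted = torch.roll(feat, shifts=(-shift, -shift), dims=(1, 2))

# label every spatial position with the id of the region it came from, so that
# after the cyclic shift we know which tokens originally belonged together
img_mask = torch.zeros(1, H, W, 1)
cnt = 0
for h_slice in (slice(0, -M), slice(-M, -shift), slice(-shift, None)):
    for w_slice in (slice(0, -M), slice(-M, -shift), slice(-shift, None)):
        img_mask[:, h_slice, w_slice, :] = cnt
        cnt += 1

# partition the id map into M x M windows and build the mask: token pairs with
# different ids came from different regions, so their attention logits get a
# large negative bias (effectively masked out)
ids = img_mask.view(1, H // M, M, W // M, M, 1)
ids = ids.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M)
attn_mask = ids.unsqueeze(1) - ids.unsqueeze(2)
attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0))
print(attn_mask.shape)   # (num_windows, M*M, M*M)
```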
By the end of this lecture, you will have a strong, clear intuition for how the Swin Transformer is built, why it scales linearly, how hierarchical features emerge naturally in the model, and why this architecture became a powerful general-purpose backbone for modern computer vision. Next week, we extend this understanding into a full from-scratch PyTorch implementation where every step is explained line by line.
If you want a deeper dive into the Swin paper itself, I will also release a separate paper review in which I walk through the original paper in detail, discuss its strengths and weaknesses, and highlight the parts that are not immediately obvious on a first read.