Swin Transformer
This repo is the official implementation of «Swin Transformer: Hierarchical Vision Transformer using Shifted Windows» as well as the follow-ups. It currently includes code and models for the following tasks:
Image Classification: Included in this repo. See get_started.md for a quick start.
Object Detection and Instance Segmentation: See Swin Transformer for Object Detection.
Semantic Segmentation: See Swin Transformer for Semantic Segmentation.
Video Action Recognition: See Video Swin Transformer.
Semi-Supervised Object Detection: See Soft Teacher.
SSL: Contrastive Learning: See Transformer-SSL.
SSL: Masked Image Modeling: See get_started.md#simmim-support.
Mixture-of-Experts: See get_started for more instructions.
Feature-Distillation: See Feature-Distillation.
Updates
12/29/2022
- Nvidia's FasterTransformer now supports Swin Transformer V2 inference, which delivers significant speed improvements on T4 and A100 GPUs.
11/30/2022
- Models and code of Feature Distillation are released. Please refer to Feature-Distillation for details, and the checkpoints (FD-EsViT-Swin-B, FD-DeiT-ViT-B, FD-DINO-ViT-B, FD-CLIP-ViT-B, FD-CLIP-ViT-L).
09/24/2022
- Merged SimMIM, which is a masked-image-modeling-based pre-training approach applicable to Swin and SwinV2 (and also to ViT and ResNet). Please refer to get started with SimMIM to play with SimMIM pre-training.
- Released a series of Swin and SwinV2 models pre-trained using the SimMIM approach (see MODELHUB for SimMIM), with model sizes ranging from SwinV2-Small-50M to SwinV2-giant-1B, data sizes ranging from ImageNet-1K-10% to ImageNet-22K, and iterations from 125k to 500k. You may leverage these models to study the properties of MIM methods. Please look into the data scaling paper for more details.
07/09/2022
News:
- SwinV2-G achieves 61.4 mIoU on ADE20K semantic segmentation (+1.5 mIoU over the previous SwinV2-G model) using an additional feature distillation (FD) approach, setting a new record on this benchmark. FD is an approach that can generally improve the fine-tuning performance of various pre-trained models, including DeiT, DINO, and CLIP. In particular, it improves the CLIP pre-trained ViT-L by +1.6% to reach 89.0% on ImageNet-1K image classification, making it the most accurate ViT-L model.
- Merged a PR from Nvidia that links to faster Swin Transformer inference with significant speed improvements on T4 and A100 GPUs.
- Merged a PR from Nvidia that enables an option to use pure FP16 (Apex O2) in training, while almost maintaining the accuracy.
06/03/2022
- Added Swin-MoE, the Mixture-of-Experts variant of Swin Transformer implemented using Tutel (an optimized Mixture-of-Experts implementation). Swin-MoE is introduced in the Tutel paper.
05/12/2022
- Pretrained models of Swin Transformer V2 on ImageNet-1K and ImageNet-22K are released.
- ImageNet-22K pretrained models for Swin-V1-Tiny and Swin-V2-Small are released.
03/02/2022
- Swin Transformer V2 and SimMIM were accepted by CVPR 2022. SimMIM is a self-supervised pre-training approach based on masked image modeling, a key technique that enables training the 3-billion-parameter Swin V2 model using 40x less labelled data than previous billion-scale models based on JFT-3B.
02/09/2022
- Integrated into Huggingface Spaces 🤗 using Gradio. Try out the Web Demo.
10/12/2021
- Swin Transformer received ICCV 2021 best paper award (Marr Prize).
08/09/2021
- Soft Teacher will appear at ICCV2021. The code will be released at GitHub Repo. Soft Teacher is an end-to-end semi-supervised object detection method, achieving a new record on COCO test-dev: 61.3 box AP and 53.0 mask AP.
07/03/2021
- Added Swin MLP, an adaptation of Swin Transformer that replaces all multi-head self-attention (MHSA) blocks with MLP layers (more precisely, grouped linear layers). The shifted window configuration can also significantly improve the performance of vanilla MLP architectures.
06/25/2021
- Video Swin Transformer is released at Video-Swin-Transformer. Video Swin Transformer achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2).
05/12/2021
- Used as a backbone for Self-Supervised Learning: Transformer-SSL. Using Swin Transformer as the backbone for self-supervised learning enables us to evaluate the transfer performance of the learnt representations on downstream tasks, which was missing in previous works due to the use of ViT/DeiT, which have not been well tamed for downstream tasks.
04/12/2021
Initial commits:
- Pretrained models on ImageNet-1K (Swin-T-IN1K, Swin-S-IN1K, Swin-B-IN1K) and ImageNet-22K (Swin-B-IN22K, Swin-L-IN22K) are provided.
- The supported code and models for ImageNet-1K image classification, COCO object detection and ADE20K semantic segmentation are provided.
- The cuda kernel implementation for the local relation layer is provided in branch LR-Net.
Introduction
Swin Transformer (the name Swin stands for Shifted window) was initially described in arXiv and capably serves as a general-purpose backbone for computer vision. It is basically a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.
Swin Transformer achieves strong performance on COCO object detection (58.7 box AP and 51.1 mask AP on test-dev) and ADE20K semantic segmentation (53.5 mIoU on val), surpassing previous models by a large margin.
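For reference, the paper quantifies this efficiency gain: for a feature map of h×w tokens with channel dimension C and window size M (fixed, e.g. M=7), the complexities of global multi-head self-attention (MSA) and window-based self-attention (W-MSA) are

$$\Omega(\mathrm{MSA}) = 4hwC^2 + 2(hw)^2C, \qquad \Omega(\mathrm{W\text{-}MSA}) = 4hwC^2 + 2M^2hwC,$$

so the window-based variant is linear in the number of tokens hw, whereas global attention is quadratic.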
Main Results on ImageNet with Pretrained Models
ImageNet-1K and ImageNet-22K Pretrained Swin-V1 Models
name | pretrain | resolution | acc@1 | acc@5 | #params | FLOPs | FPS | 22K model | 1K model |
---|---|---|---|---|---|---|---|---|---|
Swin-T | ImageNet-1K | 224×224 | 81.2 | 95.5 | 28M | 4.5G | 755 | — | github/baidu/config/log |
Swin-S | ImageNet-1K | 224×224 | 83.2 | 96.2 | 50M | 8.7G | 437 | — | github/baidu/config/log |
Swin-B | ImageNet-1K | 224×224 | 83.5 | 96.5 | 88M | 15.4G | 278 | — | github/baidu/config/log |
Swin-B | ImageNet-1K | 384×384 | 84.5 | 97.0 | 88M | 47.1G | 85 | — | github/baidu/config |
Swin-T | ImageNet-22K | 224×224 | 80.9 | 96.0 | 28M | 4.5G | 755 | github/baidu/config | github/baidu/config |
Swin-S | ImageNet-22K | 224×224 | 83.2 | 97.0 | 50M | 8.7G | 437 | github/baidu/config | github/baidu/config |
Swin-B | ImageNet-22K | 224×224 | 85.2 | 97.5 | 88M | 15.4G | 278 | github/baidu/config | github/baidu/config |
Swin-B | ImageNet-22K | 384×384 | 86.4 | 98.0 | 88M | 47.1G | 85 | github/baidu | github/baidu/config |
Swin-L | ImageNet-22K | 224×224 | 86.3 | 97.9 | 197M | 34.5G | 141 | github/baidu/config | github/baidu/config |
Swin-L | ImageNet-22K | 384×384 | 87.3 | 98.2 | 197M | 103.9G | 42 | github/baidu | github/baidu/config |
ImageNet-1K and ImageNet-22K Pretrained Swin-V2 Models
name | pretrain | resolution | window | acc@1 | acc@5 | #params | FLOPs | FPS | 22K model | 1K model |
---|---|---|---|---|---|---|---|---|---|---|
SwinV2-T | ImageNet-1K | 256×256 | 8×8 | 81.8 | 95.9 | 28M | 5.9G | 572 | — | github/baidu/config |
SwinV2-S | ImageNet-1K | 256×256 | 8×8 | 83.7 | 96.6 | 50M | 11.5G | 327 | — | github/baidu/config |
SwinV2-B | ImageNet-1K | 256×256 | 8×8 | 84.2 | 96.9 | 88M | 20.3G | 217 | — | github/baidu/config |
SwinV2-T | ImageNet-1K | 256×256 | 16×16 | 82.8 | 96.2 | 28M | 6.6G | 437 | — | github/baidu/config |
SwinV2-S | ImageNet-1K | 256×256 | 16×16 | 84.1 | 96.8 | 50M | 12.6G | 257 | — | github/baidu/config |
SwinV2-B | ImageNet-1K | 256×256 | 16×16 | 84.6 | 97.0 | 88M | 21.8G | 174 | — | github/baidu/config |
SwinV2-B* | ImageNet-22K | 256×256 | 16×16 | 86.2 | 97.9 | 88M | 21.8G | 174 | github/baidu/config | github/baidu/config |
SwinV2-B* | ImageNet-22K | 384×384 | 24×24 | 87.1 | 98.2 | 88M | 54.7G | 57 | github/baidu/config | github/baidu/config |
SwinV2-L* | ImageNet-22K | 256×256 | 16×16 | 86.9 | 98.0 | 197M | 47.5G | 95 | github/baidu/config | github/baidu/config |
SwinV2-L* | ImageNet-22K | 384×384 | 24×24 | 87.6 | 98.3 | 197M | 115.4G | 33 | github/baidu/config | github/baidu/config |
Note:
- SwinV2-B* (SwinV2-L*) with input resolutions of 256×256 and 384×384 are both fine-tuned from the same pre-trained model, which uses a smaller input resolution of 192×192.
- SwinV2-B* (384×384) achieves 78.08 acc@1 on ImageNet-1K-V2 while SwinV2-L* (384×384) achieves 78.31.
ImageNet-1K Pretrained Swin MLP Models
name | pretrain | resolution | acc@1 | acc@5 | #params | FLOPs | FPS | 1K model |
---|---|---|---|---|---|---|---|---|
Mixer-B/16 | ImageNet-1K | 224×224 | 76.4 | — | 59M | 12.7G | — | official repo |
ResMLP-S24 | ImageNet-1K | 224×224 | 79.4 | — | 30M | 6.0G | 715 | timm |
ResMLP-B24 | ImageNet-1K | 224×224 | 81.0 | — | 116M | 23.0G | 231 | timm |
Swin-T/C24 | ImageNet-1K | 256×256 | 81.6 | 95.7 | 28M | 5.9G | 563 | github/baidu/config |
SwinMLP-T/C24 | ImageNet-1K | 256×256 | 79.4 | 94.6 | 20M | 4.0G | 807 | github/baidu/config |
SwinMLP-T/C12 | ImageNet-1K | 256×256 | 79.6 | 94.7 | 21M | 4.0G | 792 | github/baidu/config |
SwinMLP-T/C6 | ImageNet-1K | 256×256 | 79.7 | 94.9 | 23M | 4.0G | 766 | github/baidu/config |
SwinMLP-B | ImageNet-1K | 224×224 | 81.3 | 95.3 | 61M | 10.4G | 409 | github/baidu/config |
Note: the access code for baidu is swin. C24 means each head has 24 channels.
ImageNet-22K Pretrained Swin-MoE Models
- Please refer to get_started for instructions on running Swin-MoE.
- Pretrained models for Swin-MoE can be found in MODEL HUB
Main Results on Downstream Tasks
COCO Object Detection (2017 val)
Backbone | Method | pretrain | Lr Schd | box mAP | mask mAP | #params | FLOPs |
---|---|---|---|---|---|---|---|
Swin-T | Mask R-CNN | ImageNet-1K | 3x | 46.0 | 41.6 | 48M | 267G |
Swin-S | Mask R-CNN | ImageNet-1K | 3x | 48.5 | 43.3 | 69M | 359G |
Swin-T | Cascade Mask R-CNN | ImageNet-1K | 3x | 50.4 | 43.7 | 86M | 745G |
Swin-S | Cascade Mask R-CNN | ImageNet-1K | 3x | 51.9 | 45.0 | 107M | 838G |
Swin-B | Cascade Mask R-CNN | ImageNet-1K | 3x | 51.9 | 45.0 | 145M | 982G |
Swin-T | RepPoints V2 | ImageNet-1K | 3x | 50.0 | — | 45M | 283G |
Swin-T | Mask RepPoints V2 | ImageNet-1K | 3x | 50.3 | 43.6 | 47M | 292G |
Swin-B | HTC++ | ImageNet-22K | 6x | 56.4 | 49.1 | 160M | 1043G |
Swin-L | HTC++ | ImageNet-22K | 3x | 57.1 | 49.5 | 284M | 1470G |
Swin-L | HTC++* | ImageNet-22K | 3x | 58.0 | 50.4 | 284M | — |
Note: * indicates multi-scale testing.
ADE20K Semantic Segmentation (val)
Backbone | Method | pretrain | Crop Size | Lr Schd | mIoU | mIoU (ms+flip) | #params | FLOPs |
---|---|---|---|---|---|---|---|---|
Swin-T | UPerNet | ImageNet-1K | 512×512 | 160K | 44.51 | 45.81 | 60M | 945G |
Swin-S | UPerNet | ImageNet-1K | 512×512 | 160K | 47.64 | 49.47 | 81M | 1038G |
Swin-B | UPerNet | ImageNet-1K | 512×512 | 160K | 48.13 | 49.72 | 121M | 1188G |
Swin-B | UPerNet | ImageNet-22K | 640×640 | 160K | 50.04 | 51.66 | 121M | 1841G |
Swin-L | UPerNet | ImageNet-22K | 640×640 | 160K | 52.05 | 53.53 | 234M | 3230G |
Citing Swin Transformer
@inproceedings{liu2021Swin,
title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year={2021}
}
Citing Local Relation Networks (the first full-attention visual backbone)
@inproceedings{hu2019local,
title={Local Relation Networks for Image Recognition},
author={Hu, Han and Zhang, Zheng and Xie, Zhenda and Lin, Stephen},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
pages={3464--3473},
year={2019}
}
Citing Swin Transformer V2
@inproceedings{liu2021swinv2,
title={Swin Transformer V2: Scaling Up Capacity and Resolution},
author={Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
booktitle={International Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2022}
}
Citing SimMIM (a self-supervised approach that enables SwinV2-G)
@inproceedings{xie2021simmim,
title={SimMIM: A Simple Framework for Masked Image Modeling},
author={Xie, Zhenda and Zhang, Zheng and Cao, Yue and Lin, Yutong and Bao, Jianmin and Yao, Zhuliang and Dai, Qi and Hu, Han},
booktitle={International Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2022}
}
Citing SimMIM-data-scaling
@article{xie2022data,
title={On Data Scaling in Masked Image Modeling},
author={Xie, Zhenda and Zhang, Zheng and Cao, Yue and Lin, Yutong and Wei, Yixuan and Dai, Qi and Hu, Han},
journal={arXiv preprint arXiv:2206.04664},
year={2022}
}
Citing Swin-MoE
@misc{hwang2022tutel,
title={Tutel: Adaptive Mixture-of-Experts at Scale},
author={Changho Hwang and Wei Cui and Yifan Xiong and Ziyue Yang and Ze Liu and Han Hu and Zilong Wang and Rafael Salas and Jithin Jose and Prabhat Ram and Joe Chau and Peng Cheng and Fan Yang and Mao Yang and Yongqiang Xiong},
year={2022},
eprint={2206.03382},
archivePrefix={arXiv}
}
Getting Started
- For Image Classification, please see get_started.md for detailed instructions.
- For Object Detection and Instance Segmentation, please see Swin Transformer for Object Detection.
- For Semantic Segmentation, please see Swin Transformer for Semantic Segmentation.
- For Self-Supervised Learning, please see Transformer-SSL.
- For Video Recognition, please see Video Swin Transformer.
Third-party Usage and Experiments
In this section, we cross-link third-party repositories that use Swin and report results. You can let us know by raising an issue.
(Note: please report accuracy numbers and provide trained models in your new repository so that others can get a sense of correctness and model behavior.)
[12/29/2022] Swin Transformers (V2) inference implemented in FasterTransformer: FasterTransformer
[06/30/2022] Swin Transformers (V1) inference implemented in FasterTransformer: FasterTransformer
[05/12/2022] Swin Transformers (V1) implemented in TensorFlow with the pre-trained parameters ported over. Find the implementation, TensorFlow weights, and a code example in this repository.
[04/06/2022] Swin Transformer for Audio Classification: Hierarchical Token Semantic Audio Transformer.
[12/21/2021] Swin Transformer for StyleGAN: StyleSwin
[12/13/2021] Swin Transformer for Face Recognition: FaceX-Zoo
[08/29/2021] Swin Transformer for Image Restoration: SwinIR
[08/12/2021] Swin Transformer for person reID: https://github.com/layumi/Person_reID_baseline_pytorch
[06/29/2021] Swin-Transformer in PaddleClas and inference based on whl package: https://github.com/PaddlePaddle/PaddleClas
[04/14/2021] Swin for RetinaNet in Detectron: https://github.com/xiaohu2015/SwinT_detectron2.
[04/16/2021] Included in a famous model zoo: https://github.com/rwightman/pytorch-image-models.
[04/20/2021] Swin-Transformer classifier inference using TorchServe: https://github.com/kamalkraj/Swin-Transformer-Serve
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct.
For more information see the Code of Conduct FAQ or
contact opencode@microsoft.com with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
Microsoft’s Trademark & Brand Guidelines.
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos is subject to those third parties' policies.
Abstract
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
Results from the Paper
Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
---|---|---|---|---|---|
Semantic Segmentation | ADE20K | Swin-L (UperNet, ImageNet-22k pretrain) | Validation mIoU | 53.50 | # 64 |
Semantic Segmentation | ADE20K | Swin-L (UperNet, ImageNet-22k pretrain) | Test Score | 62.8 | # 1 |
Semantic Segmentation | ADE20K | Swin-B (UperNet, ImageNet-1k pretrain) | Validation mIoU | 49.7 | # 103 |
Semantic Segmentation | ADE20K val | Swin-B (UperNet, ImageNet-1k pretrain) | mIoU | 49.7 | # 50 |
Semantic Segmentation | ADE20K val | Swin-L (UperNet, ImageNet-22k pretrain) | mIoU | 53.5 | # 36 |
Object Detection | COCO minival | Swin-L (HTC++, single scale) | box AP | 57.1 | # 36 |
Object Detection | COCO minival | Swin-L (HTC++, multi scale) | box AP | 58.0 | # 32 |
Instance Segmentation | COCO minival | Swin-L (HTC++, single scale) | mask AP | 49.5 | # 23 |
Instance Segmentation | COCO minival | Swin-L (HTC++, multi scale) | mask AP | 50.4 | # 20 |
Object Detection | COCO test-dev | Swin-L (HTC++, single scale) | box mAP | 57.7 | # 31 |
Object Detection | COCO test-dev | Swin-L (HTC++, multi scale) | box mAP | 58.7 | # 28 |
Instance Segmentation | COCO test-dev | Swin-L (HTC++, single scale) | mask AP | 50.2 | # 19 |
Instance Segmentation | COCO test-dev | Swin-L (HTC++, multi scale) | mask AP | 51.1 | # 17 |
Semantic Segmentation | FoodSeg103 | Swin-Transformer (Swin-Small) | mIoU | 41.6 | # 4 |
Image Classification | ImageNet | Swin-L (384 res, ImageNet-22k pretrain) | Top 1 Accuracy | 87.3% | # 92 |
Image Classification | ImageNet | Swin-L (384 res, ImageNet-22k pretrain) | Number of params | 197M | # 851 |
Image Classification | ImageNet | Swin-L (384 res, ImageNet-22k pretrain) | GFLOPs | 103.9 | # 439 |
Image Classification | ImageNet | Swin-B (384 res, ImageNet-22k pretrain) | Top 1 Accuracy | 86.4% | # 135 |
Image Classification | ImageNet | Swin-B (384 res, ImageNet-22k pretrain) | Number of params | 88M | # 788 |
Image Classification | ImageNet | Swin-B (384 res, ImageNet-22k pretrain) | GFLOPs | 47 | # 408 |
Image Classification | ImageNet | Swin-T | Top 1 Accuracy | 81.3% | # 553 |
Image Classification | ImageNet | Swin-T | Number of params | 29M | # 602 |
Image Classification | ImageNet | Swin-T | GFLOPs | 4.5 | # 206 |
Thermal Image Segmentation | MFN Dataset | SwinT | mIOU | 49.0 | # 25 |
Instance Segmentation | Occluded COCO | Swin-B + Cascade Mask R-CNN | Mean Recall | 62.90 | # 2 |
Instance Segmentation | Occluded COCO | Swin-T + Mask R-CNN | Mean Recall | 58.81 | # 6 |
Instance Segmentation | Occluded COCO | Swin-S + Mask R-CNN | Mean Recall | 61.14 | # 5 |
Image Classification | OmniBenchmark | SwinTransformer | Average Top-1 Accuracy | 46.4 | # 2 |
Instance Segmentation | Separated COCO | Swin-S + Mask R-CNN | Mean Recall | 33.67 | # 5 |
Instance Segmentation | Separated COCO | Swin-B + Cascade Mask R-CNN | Mean Recall | 36.31 | # 2 |
Instance Segmentation | Separated COCO | Swin-T + Mask R-CNN | Mean Recall | 31.94 | # 6 |
A Summary of the Swin Transformer: A Hierarchical Vision Transformer using Shifted Windows
In this post, we will review and summarize the Swin Transformer paper, titled «Swin Transformer: Hierarchical Vision Transformer using Shifted Windows». Some of the code used here is taken from this Github Repo, so you may want to clone it if you would like to test this work yourself. The aim of this post, however, is to simplify and summarize the Swin Transformer paper; a later post will explain how to implement the Swin Transformer in detail.
Overview
The «Swin Transformer: Hierarchical Vision Transformer using Shifted Windows» is a research paper that proposes a new architecture for visual recognition tasks using a hierarchical transformer model. The architecture, called the Swin Transformer, uses a combination of local and global attention mechanisms to process images and improve the accuracy of image classification and object detection tasks. The Swin Transformer uses a series of shifted window attention mechanisms to enable the model to focus on different parts of the image at different scales, and a hierarchical structure to allow the model to learn and reason about the relationships between different image regions. The authors of the paper claim that the Swin Transformer outperforms existing transformer-based models on a number of benchmark datasets and tasks.
To be clear, and at the same time not to oversimplify this work, there are two key concepts that the Swin Transformer proposes on top of ViT, and they are essential for a complete grasp of the new model's architecture:
- Shifted Window Attention
- Patch Merging
The rest of the Swin Transformer's architecture is largely the same as ViT (with some small modifications). So what are these two concepts? We will explain them later in this blog post.
First, let’s get a deeper overview of the architecture.
What makes it different from ViT?
The Swin Transformer is an extension of the Vision Transformer (ViT) model, which was introduced in the paper «An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale» (https://arxiv.org/abs/2010.11929). Like ViT, the Swin Transformer is a Transformer-based architecture that processes images as a sequence of patches, rather than using convolutional layers as in traditional image recognition models. However, the Swin Transformer introduces several key changes to the ViT architecture to improve performance on visual recognition tasks.
One of the main differences between the Swin Transformer and ViT is the use of shifted window attention mechanisms. In the Swin Transformer, the attention mechanisms operate over a series of shifted windows of different sizes, rather than over the full image as in ViT. This allows the model to attend to different parts of the image at different scales and better capture local relationships between image patches.
The Swin Transformer also introduces a hierarchical structure, where the output of the shifted window attention mechanisms at each scale is passed through a separate transformer layer before being combined and passed to the next scale. This hierarchical structure allows the model to learn and reason about the relationships between different image regions at different scales.
The Swin Transformer architecture is shown in the figure above, with the tiny version (SwinT) depicted. Like the Vision Transformer (ViT), it begins by dividing an input RGB image into non-overlapping patches using a patch-splitting module. Each patch is treated as a «token» and its features are the concatenated RGB values of the raw pixels. In this implementation, the patches are 4×4 in size and therefore have a feature dimension of 4x4x3=48. A linear embedding layer is then applied to this raw-valued feature to project it to a different dimension (denoted as C).
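To make the hierarchy concrete, the following is an illustrative summary (not code from the repository) of the tiny variant's stage layout, assuming the standard 224×224 input, 4×4 patches, C=96, and layer numbers {2, 2, 6, 2} as described in the paper:

# Illustrative summary of the Swin-T hierarchy (224x224 input, 4x4 patches, C=96).
swin_t_stages = [
    # (stage,     token grid (resolution), channels, Swin blocks)
    ("stage 1", "56x56 (H/4  x W/4)",   96, 2),
    ("stage 2", "28x28 (H/8  x W/8)",  192, 2),
    ("stage 3", "14x14 (H/16 x W/16)", 384, 6),
    ("stage 4", "7x7   (H/32 x W/32)", 768, 2),
]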
Patch Partition (From section 3.1 of the paper):
The first step in the process is to input an image and convert it to patch embeddings, which is the same as in ViT. The difference is that the patch size in the Swin Transformer is 4×4 instead of 16×16 as in ViT. Patch embeddings have previously been explained here.
import torch
from timm.models.layers import PatchEmbed

x = torch.randn(1, 3, 224, 224)
patch_embed = PatchEmbed(img_size=224, patch_size=4, embed_dim=96)
patch_embed(x).shape
torch.Size([1, 3136, 96])
As can be seen, the output of the Patch Embedding layer has shape (1, 3136, 96), that is (1, (H/4)×(W/4), 96) = (1, 56×56, 96), where 96 is the embedding dimension C.
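For intuition, the same tokenization can be reproduced with a plain strided convolution followed by a flatten, which mirrors what timm's PatchEmbed does internally. The projection below is randomly initialized, so only the shapes (not the values) match the PatchEmbed output above:

import torch
import torch.nn as nn

proj = nn.Conv2d(3, 96, kernel_size=4, stride=4)   # one 4x4 RGB patch -> one 96-dim token

x = torch.randn(1, 3, 224, 224)
tokens = proj(x)                                   # (1, 96, 56, 56): a 56x56 grid of tokens
tokens = tokens.flatten(2).transpose(1, 2)         # (1, 3136, 96): flattened to a token sequence
print(tokens.shape)                                # torch.Size([1, 3136, 96])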
from timm.models.swin_transformer import BasicLayer

# Note: depending on the timm version, BasicLayer may require additional arguments
# (e.g. num_heads and window_size) beyond the ones shown in this snippet.
stage_1 = BasicLayer(dim=96, out_dim=192,
                     input_resolution=(56, 56),
                     depth=2)
inp = torch.randn(1, 56*56, 96)
stage_1(inp).shape
torch.Size([1, 3136, 96])
As shown in the code snippet, the dimensions of the input do not change as it passes through «Stage 1». In fact, the dimensions remain constant as the input passes through every stage. It is only between stages that a patch merging layer is applied to reduce the number of tokens as the network becomes deeper.
Patch Merging Layer
The first patch merging layer combines the features of groups of 2×2 neighboring patches and applies a linear layer on the concatenated features, which have a dimension of 4C. This reduces the number of tokens by a factor of 4 (corresponding to a 2x downsampling of resolution), and the output dimension is set to 2C. In this case, C is the number of channels (embedding dimension) and is equal to 96 for the tiny version of the Swin Transformer described in this blog post.
The patch-merging layer merges four patches at a time, so with each merge the height and width of the image are reduced by a factor of 2. For example, in stage 1 the input resolution is (H/4, W/4), but after patch merging the resolution becomes (H/8, W/8), which is the input for stage 2. Similarly, the input resolution for stage 3 is (H/16, W/16) and for stage 4 it is (H/32, W/32). The patch-merging process can be understood by examining the inputs and outputs in the code.
import torch
import torch.nn as nn
from timm.models.swin_transformer import PatchMerging

x = torch.randn(1, 56*56, 96)
l = PatchMerging(input_resolution=(56, 56), dim=96, out_dim=192, norm_layer=nn.LayerNorm)
l(x).shape
torch.Size([1, 784, 192]) # (1, 28x28, 192)
As shown, the output width and height are both reduced by a factor of 2, and the number of output channels is 2C where C is the number of input channels. In the case of the Swin-T model, C=96. The source code for patch merging can be examined to further understand its function.
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
def __init__(self, input_resolution, dim, out_dim=None, norm_layer=nn.LayerNorm):
super().__init__()
self.input_resolution = input_resolution
self.dim = dim
self.out_dim = out_dim or 2 * dim
self.norm = norm_layer(4 * dim)
self.reduction = nn.Linear(4 * dim, self.out_dim, bias=False)
def forward(self, x):
"""
x: B, H*W, C
B: Batch size
"""
H, W = self.input_resolution
B, L, C = x.shape
x = x.view(B, H, W, C)
x0 = x[:, 0::2, 0::2, :] # B H/2 W/2 C
x1 = x[:, 1::2, 0::2, :] # B H/2 W/2 C
x2 = x[:, 0::2, 1::2, :] # B H/2 W/2 C
x3 = x[:, 1::2, 1::2, :] # B H/2 W/2 C
x = torch.cat([x0, x1, x2, x3], -1) # B H/2 W/2 4*C
x = x.view(B, -1, 4 * C) # B H/2*W/2 4*C
x = self.norm(x)
x = self.reduction(x)
return x
Deeper Dive into Shifted Windows Mechanism (From section 3.2 of the paper):
In the Swin Transformer, the attention mechanisms operate over a series of shifted windows of different sizes, rather than over the full image as in the original Vision Transformer (ViT) model. Each window consists of a set of image patches, and the model uses attention to weight the importance of each patch within the window. The size of the window and the number of patches it contains can vary, and the model can use different window sizes for different scales of image information.
The use of shifted windows allows the model to attend to different parts of the image at different scales, rather than just processing the entire image at a single scale as in ViT. This enables the model to better capture local relationships between image patches, as it can focus on a small region of the image and attend to the patches within that region.
The attention mechanisms in the Swin Transformer are similar to those used in other transformer models. They use a dot product between the query and key vectors to compute the attention weights for each patch within the window. The model then uses these attention weights to compute a weighted sum of the value vectors for each patch, which is used as input to the next layer of the model.
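The following is a minimal sketch of this per-window attention, omitting the relative position bias and the multi-head split used in the actual WindowAttention module; the sizes (a 4×4 window of 16 tokens with 96 channels) and the random projections are only for illustration:

import torch
import torch.nn as nn
import torch.nn.functional as F

num_tokens, dim = 16, 96                            # one 4x4 window, 96-dim tokens
tokens = torch.randn(1, num_tokens, dim)            # (num_windows, tokens_per_window, dim)

# Project tokens to queries, keys and values (random projections, just to show the shapes)
q_proj, k_proj, v_proj = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
q, k, v = q_proj(tokens), k_proj(tokens), v_proj(tokens)

# Scaled dot-product attention restricted to the tokens of this window
attn = (q @ k.transpose(-2, -1)) / dim ** 0.5       # (1, 16, 16) attention logits
attn = F.softmax(attn, dim=-1)                      # attention weights over the window's tokens
out = attn @ v                                      # (1, 16, 96) weighted sum of the value vectors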
As shown in the figure, the first module uses a regular window partitioning strategy that begins at the top-left pixel and evenly divides the 8×8 feature map into 2×2 windows of size 4×4 (M=4). The next module uses a windowing configuration that is shifted from the previous layer by [M/2] pixels in both the x and y dimensions ([M/2], [M/2]). This is illustrated in the figure below:
As seen, the left image shows an 8×8 feature map that is evenly divided into 4 windows of size 4×4. The window size is M=4. In the first part of the two successive blocks, attention is calculated within these windows. However, the network also needs cross-window attention to learn better, because it is no longer using a global context. To achieve this, the second part of the Swin Transformer block shifts the windows by ([M/2], [M/2]) pixels from their regular positions and performs attention between the new windows, leading to cross-window connections. In this case, since M=4, the windows are displaced by (2,2). Self-attention is then performed inside the shifted local windows.
To implement the shifted window mechanism in PyTorch, the feature map can be cyclically shifted with torch.roll and then partitioned into non-overlapping windows, which mirrors the cyclic-shift approach used in the official implementation. Here is a sketch using the toy 8×8 feature map from the figure:

import torch

# Window / shift parameters (as in the figure: window size M=4, shift M//2=2)
window_size = 4
shift_size = window_size // 2

# Input feature map with size (batch_size, height, width, channels)
x = torch.randn(1, 8, 8, 96)
B, H, W, C = x.shape

# Cyclically shift the feature map so that the new windows straddle the old window boundaries
shifted_x = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))

# Partition the shifted feature map into non-overlapping windows of size window_size x window_size
windows = shifted_x.view(B, H // window_size, window_size, W // window_size, window_size, C)
windows = windows.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

# windows has shape (num_windows * B, window_size * window_size, C);
# self-attention is then computed independently within each window.
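One detail the cyclic shift introduces, handled in the official implementation and sketched below with the same toy sizes, is an attention mask: windows that straddle the original image boundary now contain tokens that were not neighbors, so their pairwise attention must be suppressed (and after attention the shift is undone with torch.roll using positive shifts):

import torch

H, W, window_size, shift_size = 8, 8, 4, 2

# Label each position by which region it came from after the cyclic shift
img_mask = torch.zeros((1, H, W, 1))
slices = (slice(0, -window_size), slice(-window_size, -shift_size), slice(-shift_size, None))
cnt = 0
for h in slices:
    for w in slices:
        img_mask[:, h, w, :] = cnt
        cnt += 1

# Partition the label map into windows and mark token pairs that came from different regions
mask_windows = img_mask.view(1, H // window_size, window_size, W // window_size, window_size, 1)
mask_windows = mask_windows.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size)
attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0)).masked_fill(attn_mask == 0, 0.0)

# attn_mask has shape (num_windows, M*M, M*M) and is added to the attention logits so that
# tokens wrapped together by the cyclic shift do not attend to each other.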
Swin Transformer Experiments
It seems that the Swin Transformer models outperform other vision transformer models (such as DeiT and ViT) and are comparable with EfficientNet and RegNet models when trained from scratch on the ImageNet-1K dataset. Additionally, the Swin Transformer models may have a slightly better speed-accuracy trade-off compared to EfficientNet and RegNet models. This suggests that the Swin Transformer architecture is effective for visual recognition tasks and may be a promising alternative to other state-of-the-art models.
It appears that the Swin Transformer model is a modification of the standard transformer architecture and has some potential for further improvement. This is in contrast to EfficientNet and RegNet models, which are the result of extensive architecture searches. This suggests that the Swin Transformer architecture may be able to achieve even better performance with further optimization or modifications.
Finally, the shifted window attention mechanism in the Swin Transformer allows the model to attend to different parts of the image at different scales and better capture local relationships between image patches, which can improve the accuracy of visual recognition tasks.