Swin Transformer
This repo is the official implementation of «Swin Transformer: Hierarchical Vision Transformer using Shifted Windows» as well as the follow-ups. It currently includes code and models for the following tasks:
Image Classification: Included in this repo. See get_started.md for a quick start.
Object Detection and Instance Segmentation: See Swin Transformer for Object Detection.
Semantic Segmentation: See Swin Transformer for Semantic Segmentation.
Video Action Recognition: See Video Swin Transformer.
Semi-Supervised Object Detection: See Soft Teacher.
SSL: Contrastive Learning: See Transformer-SSL.
SSL: Masked Image Modeling: See get_started.md#simmim-support.
Mixture-of-Experts: See get_started for more instructions.
Feature-Distillation: See Feature-Distillation.
Updates
12/29/2022
- Nvidia's FasterTransformer now supports Swin Transformer V2 inference, which delivers significant speed improvements on T4 and A100 GPUs.
11/30/2022
- Models and code of Feature Distillation are released. Please refer to Feature-Distillation for details, and the checkpoints (FD-EsViT-Swin-B, FD-DeiT-ViT-B, FD-DINO-ViT-B, FD-CLIP-ViT-B, FD-CLIP-ViT-L).
09/24/2022
- Merged SimMIM, which is a masked-image-modeling-based pre-training approach applicable to Swin and SwinV2 (and also to ViT and ResNet). Please refer to get started with SimMIM to play with SimMIM pre-training.
- Released a series of Swin and SwinV2 models pre-trained using the SimMIM approach (see MODELHUB for SimMIM), with model sizes ranging from SwinV2-Small-50M to SwinV2-giant-1B, data sizes ranging from ImageNet-1K-10% to ImageNet-22K, and iterations from 125k to 500k. You may leverage these models to study the properties of MIM methods. Please look into the data scaling paper for more details.
07/09/2022
News:
- SwinV2-G achieves 61.4 mIoU on ADE20K semantic segmentation (+1.5 mIoU over the previous SwinV2-G model) using an additional feature distillation (FD) approach, setting a new record on this benchmark. FD is an approach that can generally improve the fine-tuning performance of various pre-trained models, including DeiT, DINO, and CLIP. In particular, it improves the CLIP pre-trained ViT-L by +1.6% to reach 89.0% on ImageNet-1K image classification, making it the most accurate ViT-L model.
- Merged a PR from Nvidia that links to faster Swin Transformer inference with significant speed improvements on T4 and A100 GPUs.
- Merged a PR from Nvidia that enables an option to use pure FP16 (Apex O2) in training, while almost maintaining the accuracy.
06/03/2022
- Added Swin-MoE, the Mixture-of-Experts variant of Swin Transformer implemented using Tutel (an optimized Mixture-of-Experts implementation). Swin-MoE is introduced in the Tutel paper.
05/12/2022
- Pretrained models of Swin Transformer V2 on ImageNet-1K and ImageNet-22K are released.
- ImageNet-22K pretrained models for Swin-V1-Tiny and Swin-V2-Small are released.
03/02/2022
- Swin Transformer V2 and SimMIM were accepted by CVPR 2022. SimMIM is a self-supervised pre-training approach based on masked image modeling, a key technique that enables training the 3-billion-parameter Swin V2 model using 40x less labelled data than previous billion-scale models based on JFT-3B.
02/09/2022
- Integrated into Huggingface Spaces 🤗 using Gradio. Try out the Web Demo.
10/12/2021
- Swin Transformer received ICCV 2021 best paper award (Marr Prize).
08/09/2021
- Soft Teacher will appear at ICCV2021. The code will be released at GitHub Repo. Soft Teacher is an end-to-end semi-supervised object detection method, achieving a new record on COCO test-dev: 61.3 box AP and 53.0 mask AP.
07/03/2021
- Added Swin MLP, an adaptation of Swin Transformer that replaces all multi-head self-attention (MHSA) blocks with MLP layers (more precisely, grouped linear layers). The shifted window configuration can also significantly improve the performance of vanilla MLP architectures.
06/25/2021
- Video Swin Transformer is released at Video-Swin-Transformer. Video Swin Transformer achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2).
05/12/2021
- Used as a backbone for Self-Supervised Learning: Transformer-SSL. Using Swin Transformer as the backbone for self-supervised learning enables us to evaluate the transfer performance of the learnt representations on downstream tasks, which was missing in previous works due to the use of ViT/DeiT, which have not been well tamed for downstream tasks.
04/12/2021
Initial commits:
- Pretrained models on ImageNet-1K (Swin-T-IN1K, Swin-S-IN1K, Swin-B-IN1K) and ImageNet-22K (Swin-B-IN22K, Swin-L-IN22K) are provided.
- The supported code and models for ImageNet-1K image classification, COCO object detection and ADE20K semantic segmentation are provided.
- The cuda kernel implementation for the local relation layer is provided in branch LR-Net.
Introduction
Swin Transformer (the name Swin stands for Shifted window) was initially described in arXiv and capably serves as a general-purpose backbone for computer vision. It is basically a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.
Swin Transformer achieves strong performance on COCO object detection (58.7 box AP and 51.1 mask AP on test-dev) and ADE20K semantic segmentation (53.5 mIoU on val), surpassing previous models by a large margin.
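For reference, the paper quantifies this efficiency gain: for a feature map of h×w tokens with channel dimension C and window size M (fixed, e.g. M=7), the complexities of global multi-head self-attention (MSA) and window-based self-attention (W-MSA) are

$$\Omega(\mathrm{MSA}) = 4hwC^2 + 2(hw)^2C, \qquad \Omega(\mathrm{W\text{-}MSA}) = 4hwC^2 + 2M^2hwC,$$

so the window-based variant is linear in the number of tokens hw, whereas global attention is quadratic.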
Main Results on ImageNet with Pretrained Models
ImageNet-1K and ImageNet-22K Pretrained Swin-V1 Models
name | pretrain | resolution | acc@1 | acc@5 | #params | FLOPs | FPS | 22K model | 1K model |
---|---|---|---|---|---|---|---|---|---|
Swin-T | ImageNet-1K | 224×224 | 81.2 | 95.5 | 28M | 4.5G | 755 | — | github/baidu/config/log |
Swin-S | ImageNet-1K | 224×224 | 83.2 | 96.2 | 50M | 8.7G | 437 | — | github/baidu/config/log |
Swin-B | ImageNet-1K | 224×224 | 83.5 | 96.5 | 88M | 15.4G | 278 | — | github/baidu/config/log |
Swin-B | ImageNet-1K | 384×384 | 84.5 | 97.0 | 88M | 47.1G | 85 | — | github/baidu/config |
Swin-T | ImageNet-22K | 224×224 | 80.9 | 96.0 | 28M | 4.5G | 755 | github/baidu/config | github/baidu/config |
Swin-S | ImageNet-22K | 224×224 | 83.2 | 97.0 | 50M | 8.7G | 437 | github/baidu/config | github/baidu/config |
Swin-B | ImageNet-22K | 224×224 | 85.2 | 97.5 | 88M | 15.4G | 278 | github/baidu/config | github/baidu/config |
Swin-B | ImageNet-22K | 384×384 | 86.4 | 98.0 | 88M | 47.1G | 85 | github/baidu | github/baidu/config |
Swin-L | ImageNet-22K | 224×224 | 86.3 | 97.9 | 197M | 34.5G | 141 | github/baidu/config | github/baidu/config |
Swin-L | ImageNet-22K | 384×384 | 87.3 | 98.2 | 197M | 103.9G | 42 | github/baidu | github/baidu/config |
ImageNet-1K and ImageNet-22K Pretrained Swin-V2 Models
name | pretrain | resolution | window | acc@1 | acc@5 | #params | FLOPs | FPS | 22K model | 1K model |
---|---|---|---|---|---|---|---|---|---|---|
SwinV2-T | ImageNet-1K | 256×256 | 8×8 | 81.8 | 95.9 | 28M | 5.9G | 572 | — | github/baidu/config |
SwinV2-S | ImageNet-1K | 256×256 | 8×8 | 83.7 | 96.6 | 50M | 11.5G | 327 | — | github/baidu/config |
SwinV2-B | ImageNet-1K | 256×256 | 8×8 | 84.2 | 96.9 | 88M | 20.3G | 217 | — | github/baidu/config |
SwinV2-T | ImageNet-1K | 256×256 | 16×16 | 82.8 | 96.2 | 28M | 6.6G | 437 | — | github/baidu/config |
SwinV2-S | ImageNet-1K | 256×256 | 16×16 | 84.1 | 96.8 | 50M | 12.6G | 257 | — | github/baidu/config |
SwinV2-B | ImageNet-1K | 256×256 | 16×16 | 84.6 | 97.0 | 88M | 21.8G | 174 | — | github/baidu/config |
SwinV2-B* | ImageNet-22K | 256×256 | 16×16 | 86.2 | 97.9 | 88M | 21.8G | 174 | github/baidu/config | github/baidu/config |
SwinV2-B* | ImageNet-22K | 384×384 | 24×24 | 87.1 | 98.2 | 88M | 54.7G | 57 | github/baidu/config | github/baidu/config |
SwinV2-L* | ImageNet-22K | 256×256 | 16×16 | 86.9 | 98.0 | 197M | 47.5G | 95 | github/baidu/config | github/baidu/config |
SwinV2-L* | ImageNet-22K | 384×384 | 24×24 | 87.6 | 98.3 | 197M | 115.4G | 33 | github/baidu/config | github/baidu/config |
Note:
- SwinV2-B* (SwinV2-L*) with input resolutions of 256×256 and 384×384 are both fine-tuned from the same pre-trained model, which uses a smaller input resolution of 192×192.
- SwinV2-B* (384×384) achieves 78.08 acc@1 on ImageNet-1K-V2 while SwinV2-L* (384×384) achieves 78.31.
ImageNet-1K Pretrained Swin MLP Models
name | pretrain | resolution | acc@1 | acc@5 | #params | FLOPs | FPS | 1K model |
---|---|---|---|---|---|---|---|---|
Mixer-B/16 | ImageNet-1K | 224×224 | 76.4 | — | 59M | 12.7G | — | official repo |
ResMLP-S24 | ImageNet-1K | 224×224 | 79.4 | — | 30M | 6.0G | 715 | timm |
ResMLP-B24 | ImageNet-1K | 224×224 | 81.0 | — | 116M | 23.0G | 231 | timm |
Swin-T/C24 | ImageNet-1K | 256×256 | 81.6 | 95.7 | 28M | 5.9G | 563 | github/baidu/config |
SwinMLP-T/C24 | ImageNet-1K | 256×256 | 79.4 | 94.6 | 20M | 4.0G | 807 | github/baidu/config |
SwinMLP-T/C12 | ImageNet-1K | 256×256 | 79.6 | 94.7 | 21M | 4.0G | 792 | github/baidu/config |
SwinMLP-T/C6 | ImageNet-1K | 256×256 | 79.7 | 94.9 | 23M | 4.0G | 766 | github/baidu/config |
SwinMLP-B | ImageNet-1K | 224×224 | 81.3 | 95.3 | 61M | 10.4G | 409 | github/baidu/config |
Note: the access code for baidu is swin. C24 means each head has 24 channels.
ImageNet-22K Pretrained Swin-MoE Models
- Please refer to get_started for instructions on running Swin-MoE.
- Pretrained models for Swin-MoE can be found in MODEL HUB
Main Results on Downstream Tasks
COCO Object Detection (2017 val)
Backbone | Method | pretrain | Lr Schd | box mAP | mask mAP | #params | FLOPs |
---|---|---|---|---|---|---|---|
Swin-T | Mask R-CNN | ImageNet-1K | 3x | 46.0 | 41.6 | 48M | 267G |
Swin-S | Mask R-CNN | ImageNet-1K | 3x | 48.5 | 43.3 | 69M | 359G |
Swin-T | Cascade Mask R-CNN | ImageNet-1K | 3x | 50.4 | 43.7 | 86M | 745G |
Swin-S | Cascade Mask R-CNN | ImageNet-1K | 3x | 51.9 | 45.0 | 107M | 838G |
Swin-B | Cascade Mask R-CNN | ImageNet-1K | 3x | 51.9 | 45.0 | 145M | 982G |
Swin-T | RepPoints V2 | ImageNet-1K | 3x | 50.0 | — | 45M | 283G |
Swin-T | Mask RepPoints V2 | ImageNet-1K | 3x | 50.3 | 43.6 | 47M | 292G |
Swin-B | HTC++ | ImageNet-22K | 6x | 56.4 | 49.1 | 160M | 1043G |
Swin-L | HTC++ | ImageNet-22K | 3x | 57.1 | 49.5 | 284M | 1470G |
Swin-L | HTC++* | ImageNet-22K | 3x | 58.0 | 50.4 | 284M | — |
Note: * indicates multi-scale testing.
ADE20K Semantic Segmentation (val)
Backbone | Method | pretrain | Crop Size | Lr Schd | mIoU | mIoU (ms+flip) | #params | FLOPs |
---|---|---|---|---|---|---|---|---|
Swin-T | UPerNet | ImageNet-1K | 512×512 | 160K | 44.51 | 45.81 | 60M | 945G |
Swin-S | UPerNet | ImageNet-1K | 512×512 | 160K | 47.64 | 49.47 | 81M | 1038G |
Swin-B | UPerNet | ImageNet-1K | 512×512 | 160K | 48.13 | 49.72 | 121M | 1188G |
Swin-B | UPerNet | ImageNet-22K | 640×640 | 160K | 50.04 | 51.66 | 121M | 1841G |
Swin-L | UPerNet | ImageNet-22K | 640×640 | 160K | 52.05 | 53.53 | 234M | 3230G |
Citing Swin Transformer
@inproceedings{liu2021Swin,
title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
year={2021}
}
Citing Local Relation Networks (the first full-attention visual backbone)
@inproceedings{hu2019local,
title={Local Relation Networks for Image Recognition},
author={Hu, Han and Zhang, Zheng and Xie, Zhenda and Lin, Stephen},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
pages={3464--3473},
year={2019}
}
Citing Swin Transformer V2
@inproceedings{liu2021swinv2,
title={Swin Transformer V2: Scaling Up Capacity and Resolution},
author={Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
booktitle={International Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2022}
}
Citing SimMIM (a self-supervised approach that enables SwinV2-G)
@inproceedings{xie2021simmim,
title={SimMIM: A Simple Framework for Masked Image Modeling},
author={Xie, Zhenda and Zhang, Zheng and Cao, Yue and Lin, Yutong and Bao, Jianmin and Yao, Zhuliang and Dai, Qi and Hu, Han},
booktitle={International Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2022}
}
Citing SimMIM-data-scaling
@article{xie2022data,
title={On Data Scaling in Masked Image Modeling},
author={Xie, Zhenda and Zhang, Zheng and Cao, Yue and Lin, Yutong and Wei, Yixuan and Dai, Qi and Hu, Han},
journal={arXiv preprint arXiv:2206.04664},
year={2022}
}
Citing Swin-MoE
@misc{hwang2022tutel,
title={Tutel: Adaptive Mixture-of-Experts at Scale},
author={Changho Hwang and Wei Cui and Yifan Xiong and Ziyue Yang and Ze Liu and Han Hu and Zilong Wang and Rafael Salas and Jithin Jose and Prabhat Ram and Joe Chau and Peng Cheng and Fan Yang and Mao Yang and Yongqiang Xiong},
year={2022},
eprint={2206.03382},
archivePrefix={arXiv}
}
Getting Started
- For Image Classification, please see get_started.md for detailed instructions.
- For Object Detection and Instance Segmentation, please see Swin Transformer for Object Detection.
- For Semantic Segmentation, please see Swin Transformer for Semantic Segmentation.
- For Self-Supervised Learning, please see Transformer-SSL.
- For Video Recognition, please see Video Swin Transformer.
Third-party Usage and Experiments
In this section, we cross-link third-party repositories that use Swin and report results. You can let us know by raising an issue.
(Note: please report accuracy numbers and provide trained models in your new repository so that others can get a sense of correctness and model behavior.)
[12/29/2022] Swin Transformers (V2) inference implemented in FasterTransformer: FasterTransformer
[06/30/2022] Swin Transformers (V1) inference implemented in FasterTransformer: FasterTransformer
[05/12/2022] Swin Transformers (V1) implemented in TensorFlow with the pre-trained parameters ported over. Find the implementation, TensorFlow weights, and a code example in this repository.
[04/06/2022] Swin Transformer for Audio Classification: Hierarchical Token Semantic Audio Transformer.
[12/21/2021] Swin Transformer for StyleGAN: StyleSwin
[12/13/2021] Swin Transformer for Face Recognition: FaceX-Zoo
[08/29/2021] Swin Transformer for Image Restoration: SwinIR
[08/12/2021] Swin Transformer for person reID: https://github.com/layumi/Person_reID_baseline_pytorch
[06/29/2021] Swin-Transformer in PaddleClas and inference based on whl package: https://github.com/PaddlePaddle/PaddleClas
[04/14/2021] Swin for RetinaNet in Detectron: https://github.com/xiaohu2015/SwinT_detectron2.
[04/16/2021] Included in a famous model zoo: https://github.com/rwightman/pytorch-image-models.
[04/20/2021] Swin-Transformer classifier inference using TorchServe: https://github.com/kamalkraj/Swin-Transformer-Serve
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct.
For more information see the Code of Conduct FAQ or
contact opencode@microsoft.com with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
Microsoft’s Trademark & Brand Guidelines.
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos is subject to those third parties' policies.
Abstract
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.
Results from the Paper
Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
---|---|---|---|---|---|
Semantic Segmentation | ADE20K | Swin-L (UperNet, ImageNet-22k pretrain) | Validation mIoU | 53.50 | # 64 |
Semantic Segmentation | ADE20K | Swin-L (UperNet, ImageNet-22k pretrain) | Test Score | 62.8 | # 1 |
Semantic Segmentation | ADE20K | Swin-B (UperNet, ImageNet-1k pretrain) | Validation mIoU | 49.7 | # 103 |
Semantic Segmentation | ADE20K val | Swin-B (UperNet, ImageNet-1k pretrain) | mIoU | 49.7 | # 50 |
Semantic Segmentation | ADE20K val | Swin-L (UperNet, ImageNet-22k pretrain) | mIoU | 53.5 | # 36 |
Object Detection | COCO minival | Swin-L (HTC++, single scale) | box AP | 57.1 | # 36 |
Object Detection | COCO minival | Swin-L (HTC++, multi scale) | box AP | 58.0 | # 32 |
Instance Segmentation | COCO minival | Swin-L (HTC++, single scale) | mask AP | 49.5 | # 23 |
Instance Segmentation | COCO minival | Swin-L (HTC++, multi scale) | mask AP | 50.4 | # 20 |
Object Detection | COCO test-dev | Swin-L (HTC++, single scale) | box mAP | 57.7 | # 31 |
Object Detection | COCO test-dev | Swin-L (HTC++, multi scale) | box mAP | 58.7 | # 28 |
Instance Segmentation | COCO test-dev | Swin-L (HTC++, single scale) | mask AP | 50.2 | # 19 |
Instance Segmentation | COCO test-dev | Swin-L (HTC++, multi scale) | mask AP | 51.1 | # 17 |
Semantic Segmentation | FoodSeg103 | Swin-Transformer (Swin-Small) | mIoU | 41.6 | # 4 |
Image Classification | ImageNet | Swin-L (384 res, ImageNet-22k pretrain) | Top 1 Accuracy | 87.3% | # 92 |
Image Classification | ImageNet | Swin-L (384 res, ImageNet-22k pretrain) | Number of params | 197M | # 851 |
Image Classification | ImageNet | Swin-L (384 res, ImageNet-22k pretrain) | GFLOPs | 103.9 | # 439 |
Image Classification | ImageNet | Swin-B (384 res, ImageNet-22k pretrain) | Top 1 Accuracy | 86.4% | # 135 |
Image Classification | ImageNet | Swin-B (384 res, ImageNet-22k pretrain) | Number of params | 88M | # 788 |
Image Classification | ImageNet | Swin-B (384 res, ImageNet-22k pretrain) | GFLOPs | 47 | # 408 |
Image Classification | ImageNet | Swin-T | Top 1 Accuracy | 81.3% | # 553 |
Image Classification | ImageNet | Swin-T | Number of params | 29M | # 602 |
Image Classification | ImageNet | Swin-T | GFLOPs | 4.5 | # 206 |
Thermal Image Segmentation | MFN Dataset | SwinT | mIOU | 49.0 | # 25 |
Instance Segmentation | Occluded COCO | Swin-B + Cascade Mask R-CNN | Mean Recall | 62.90 | # 2 |
Instance Segmentation | Occluded COCO | Swin-T + Mask R-CNN | Mean Recall | 58.81 | # 6 |
Instance Segmentation | Occluded COCO | Swin-S + Mask R-CNN | Mean Recall | 61.14 | # 5 |
Image Classification | OmniBenchmark | SwinTransformer | Average Top-1 Accuracy | 46.4 | # 2 |
Instance Segmentation | Separated COCO | Swin-S + Mask R-CNN | Mean Recall | 33.67 | # 5 |
Instance Segmentation | Separated COCO | Swin-B + Cascade Mask R-CNN | Mean Recall | 36.31 | # 2 |
Instance Segmentation | Separated COCO | Swin-T + Mask R-CNN | Mean Recall | 31.94 | # 6 |
A Summary of the Swin Transformer: A Hierarchical Vision Transformer using Shifted Windows
In this post, we will review and summarize the Swin Transformer paper, titled «Swin Transformer: Hierarchical Vision Transformer using Shifted Windows». Some of the code used here is taken from this Github Repo, so you may want to clone it if you would like to test this work yourself. The aim of this post, however, is to simplify and summarize the Swin Transformer paper; a later post will explain how to implement the Swin Transformer in detail.
Overview
The «Swin Transformer: Hierarchical Vision Transformer using Shifted Windows» is a research paper that proposes a new architecture for visual recognition tasks using a hierarchical transformer model. The architecture, called the Swin Transformer, uses a combination of local and global attention mechanisms to process images and improve the accuracy of image classification and object detection tasks. The Swin Transformer uses a series of shifted window attention mechanisms to enable the model to focus on different parts of the image at different scales, and a hierarchical structure to allow the model to learn and reason about the relationships between different image regions. The authors of the paper claim that the Swin Transformer outperforms existing transformer-based models on a number of benchmark datasets and tasks.
To be clear, and at the same time not to oversimplify this work, there are two key concepts that the Swin Transformer proposes on top of ViT, and they are essential for a complete grasp of the new model's architecture:
- Shifted Window Attention
- Patch Merging
The rest of the Swin Transformer's architecture is largely the same as ViT (with some small modifications). So what are these two concepts? We will explain them later in this blog post.
First, let’s get a deeper overview of the architecture.
What makes it different from ViT?
The Swin Transformer is an extension of the Vision Transformer (ViT) model, which was introduced in the paper «An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale» (https://arxiv.org/abs/2010.11929). Like ViT, the Swin Transformer is a Transformer-based architecture that processes images as a sequence of patches, rather than using convolutional layers as in traditional image recognition models. However, the Swin Transformer introduces several key changes to the ViT architecture to improve performance on visual recognition tasks.
One of the main differences between the Swin Transformer and ViT is the use of shifted window attention mechanisms. In the Swin Transformer, the attention mechanisms operate over a series of shifted windows of different sizes, rather than over the full image as in ViT. This allows the model to attend to different parts of the image at different scales and better capture local relationships between image patches.
The Swin Transformer also introduces a hierarchical structure, where the output of the shifted window attention mechanisms at each scale is passed through a separate transformer layer before being combined and passed to the next scale. This hierarchical structure allows the model to learn and reason about the relationships between different image regions at different scales.
The Swin Transformer architecture is shown in the figure above, with the tiny version (SwinT) depicted. Like the Vision Transformer (ViT), it begins by dividing an input RGB image into non-overlapping patches using a patch-splitting module. Each patch is treated as a «token» and its features are the concatenated RGB values of the raw pixels. In this implementation, the patches are 4×4 in size and therefore have a feature dimension of 4x4x3=48. A linear embedding layer is then applied to this raw-valued feature to project it to a different dimension (denoted as C).
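To make the hierarchy concrete, the following is an illustrative summary (not code from the repository) of the tiny variant's stage layout, assuming the standard 224×224 input, 4×4 patches, C=96, and layer numbers {2, 2, 6, 2} as described in the paper:

# Illustrative summary of the Swin-T hierarchy (224x224 input, 4x4 patches, C=96).
swin_t_stages = [
    # (stage,     token grid (resolution), channels, Swin blocks)
    ("stage 1", "56x56 (H/4  x W/4)",   96, 2),
    ("stage 2", "28x28 (H/8  x W/8)",  192, 2),
    ("stage 3", "14x14 (H/16 x W/16)", 384, 6),
    ("stage 4", "7x7   (H/32 x W/32)", 768, 2),
]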
Patch Partition (From section 3.1 of the paper):
The first step in the process is to input an image and convert it to patch embeddings, which is the same as in ViT. The difference is that the patch size in the Swin Transformer is 4×4 instead of 16×16 as in ViT. Patch embeddings have previously been explained here.
import torch
from timm.models.layers import PatchEmbed

x = torch.randn(1, 3, 224, 224)
patch_embed = PatchEmbed(img_size=224, patch_size=4, embed_dim=96)
patch_embed(x).shape
torch.Size([1, 3136, 96])
As can be seen, the output of the Patch Embedding layer has shape (1, 3136, 96), that is (1, (H/4)×(W/4), 96) = (1, 56×56, 96), where 96 is the embedding dimension C.
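For intuition, the same tokenization can be reproduced with a plain strided convolution followed by a flatten, which mirrors what timm's PatchEmbed does internally. The projection below is randomly initialized, so only the shapes (not the values) match the PatchEmbed output above:

import torch
import torch.nn as nn

proj = nn.Conv2d(3, 96, kernel_size=4, stride=4)   # one 4x4 RGB patch -> one 96-dim token

x = torch.randn(1, 3, 224, 224)
tokens = proj(x)                                   # (1, 96, 56, 56): a 56x56 grid of tokens
tokens = tokens.flatten(2).transpose(1, 2)         # (1, 3136, 96): flattened to a token sequence
print(tokens.shape)                                # torch.Size([1, 3136, 96])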
from timm.models.swin_transformer import BasicLayer

# Note: depending on the timm version, BasicLayer may require additional arguments
# (e.g. num_heads and window_size) beyond the ones shown in this snippet.
stage_1 = BasicLayer(dim=96, out_dim=192,
                     input_resolution=(56, 56),
                     depth=2)
inp = torch.randn(1, 56*56, 96)
stage_1(inp).shape
torch.Size([1, 3136, 96])
As shown in the code snippet, the dimensions of the input do not change as it passes through «Stage 1». In fact, the dimensions remain constant as the input passes through every stage. It is only between stages that a patch merging layer is applied to reduce the number of tokens as the network becomes deeper.
Patch Merging Layer
The first patch merging layer combines the features of groups of 2×2 neighboring patches and applies a linear layer on the concatenated features, which have a dimension of 4C. This reduces the number of tokens by a factor of 4 (corresponding to a 2x downsampling of resolution), and the output dimension is set to 2C. In this case, C is the number of channels (embedding dimension) and is equal to 96 for the tiny version of the Swin Transformer described in this blog post.
The patch-merging layer merges four patches at a time, so with each merge the height and width of the image are reduced by a factor of 2. For example, in stage 1 the input resolution is (H/4, W/4), but after patch merging the resolution becomes (H/8, W/8), which is the input for stage 2. Similarly, the input resolution for stage 3 is (H/16, W/16) and for stage 4 it is (H/32, W/32). The patch-merging process can be understood by examining the inputs and outputs in the code.
import torch
import torch.nn as nn
from timm.models.swin_transformer import PatchMerging

x = torch.randn(1, 56*56, 96)
l = PatchMerging(input_resolution=(56, 56), dim=96, out_dim=192, norm_layer=nn.LayerNorm)
l(x).shape
torch.Size([1, 784, 192]) # (1, 28x28, 192)
As shown, the output width and height are both reduced by a factor of 2, and the number of output channels is 2C where C is the number of input channels. In the case of the Swin-T model, C=96. The source code for patch merging can be examined to further understand its function.
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
def __init__(self, input_resolution, dim, out_dim=None, norm_layer=nn.LayerNorm):
super().__init__()
self.input_resolution = input_resolution
self.dim = dim
self.out_dim = out_dim or 2 * dim
self.norm = norm_layer(4 * dim)
self.reduction = nn.Linear(4 * dim, self.out_dim, bias=False)
def forward(self, x):
"""
x: B, H*W, C
B: Batch size
"""
H, W = self.input_resolution
B, L, C = x.shape
x = x.view(B, H, W, C)
x0 = x[:, 0::2, 0::2, :] # B H/2 W/2 C
x1 = x[:, 1::2, 0::2, :] # B H/2 W/2 C
x2 = x[:, 0::2, 1::2, :] # B H/2 W/2 C
x3 = x[:, 1::2, 1::2, :] # B H/2 W/2 C
x = torch.cat([x0, x1, x2, x3], -1) # B H/2 W/2 4*C
x = x.view(B, -1, 4 * C) # B H/2*W/2 4*C
x = self.norm(x)
x = self.reduction(x)
return x
Deeper Dive into Shifted Windows Mechanism (From section 3.2 of the paper):
In the Swin Transformer, the attention mechanisms operate over a series of shifted windows of different sizes, rather than over the full image as in the original Vision Transformer (ViT) model. Each window consists of a set of image patches, and the model uses attention to weight the importance of each patch within the window. The size of the window and the number of patches it contains can vary, and the model can use different window sizes for different scales of image information.
The use of shifted windows allows the model to attend to different parts of the image at different scales, rather than just processing the entire image at a single scale as in ViT. This enables the model to better capture local relationships between image patches, as it can focus on a small region of the image and attend to the patches within that region.
The attention mechanisms in the Swin Transformer are similar to those used in other transformer models. They use a dot product between the query and key vectors to compute the attention weights for each patch within the window. The model then uses these attention weights to compute a weighted sum of the value vectors for each patch, which is used as input to the next layer of the model.
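The following is a minimal sketch of this per-window attention, omitting the relative position bias and the multi-head split used in the actual WindowAttention module; the sizes (a 4×4 window of 16 tokens with 96 channels) and the random projections are only for illustration:

import torch
import torch.nn as nn
import torch.nn.functional as F

num_tokens, dim = 16, 96                            # one 4x4 window, 96-dim tokens
tokens = torch.randn(1, num_tokens, dim)            # (num_windows, tokens_per_window, dim)

# Project tokens to queries, keys and values (random projections, just to show the shapes)
q_proj, k_proj, v_proj = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
q, k, v = q_proj(tokens), k_proj(tokens), v_proj(tokens)

# Scaled dot-product attention restricted to the tokens of this window
attn = (q @ k.transpose(-2, -1)) / dim ** 0.5       # (1, 16, 16) attention logits
attn = F.softmax(attn, dim=-1)                      # attention weights over the window's tokens
out = attn @ v                                      # (1, 16, 96) weighted sum of the value vectors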
As shown in the figure, the first module uses a regular window partitioning strategy that begins at the top-left pixel and evenly divides the 8×8 feature map into 2×2 windows of size 4×4 (M=4). The next module uses a windowing configuration that is shifted from the previous layer by [M/2] pixels in both the x and y dimensions ([M/2], [M/2]). This is illustrated in the figure below:
As seen, the left image shows an 8×8 feature map that is evenly divided into 4 windows of size 4×4. The window size is M=4. In the first part of the two successive blocks, attention is calculated within these windows. However, the network also needs cross-window attention to learn better, because it is no longer using a global context. To achieve this, the second part of the Swin Transformer block shifts the windows by ([M/2], [M/2]) pixels from their regular positions and performs attention between the new windows, leading to cross-window connections. In this case, since M=4, the windows are displaced by (2,2). Self-attention is then performed inside the shifted local windows.
To implement the shifted window mechanism in PyTorch, the feature map can be cyclically shifted with torch.roll and then partitioned into non-overlapping windows, which mirrors the cyclic-shift approach used in the official implementation. Here is a sketch using the toy 8×8 feature map from the figure:

import torch

# Window / shift parameters (as in the figure: window size M=4, shift M//2=2)
window_size = 4
shift_size = window_size // 2

# Input feature map with size (batch_size, height, width, channels)
x = torch.randn(1, 8, 8, 96)
B, H, W, C = x.shape

# Cyclically shift the feature map so that the new windows straddle the old window boundaries
shifted_x = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))

# Partition the shifted feature map into non-overlapping windows of size window_size x window_size
windows = shifted_x.view(B, H // window_size, window_size, W // window_size, window_size, C)
windows = windows.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

# windows has shape (num_windows * B, window_size * window_size, C);
# self-attention is then computed independently within each window.
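One detail the cyclic shift introduces, handled in the official implementation and sketched below with the same toy sizes, is an attention mask: windows that straddle the original image boundary now contain tokens that were not neighbors, so their pairwise attention must be suppressed (and after attention the shift is undone with torch.roll using positive shifts):

import torch

H, W, window_size, shift_size = 8, 8, 4, 2

# Label each position by which region it came from after the cyclic shift
img_mask = torch.zeros((1, H, W, 1))
slices = (slice(0, -window_size), slice(-window_size, -shift_size), slice(-shift_size, None))
cnt = 0
for h in slices:
    for w in slices:
        img_mask[:, h, w, :] = cnt
        cnt += 1

# Partition the label map into windows and mark token pairs that came from different regions
mask_windows = img_mask.view(1, H // window_size, window_size, W // window_size, window_size, 1)
mask_windows = mask_windows.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size)
attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2)
attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0)).masked_fill(attn_mask == 0, 0.0)

# attn_mask has shape (num_windows, M*M, M*M) and is added to the attention logits so that
# tokens wrapped together by the cyclic shift do not attend to each other.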
Swin Transformer Experiments
It seems that the Swin Transformer models outperform other vision transformer models (such as DeiT and ViT) and are comparable with EfficientNet and RegNet models when trained from scratch on the ImageNet-1K dataset. Additionally, the Swin Transformer models may have a slightly better speed-accuracy trade-off compared to EfficientNet and RegNet models. This suggests that the Swin Transformer architecture is effective for visual recognition tasks and may be a promising alternative to other state-of-the-art models.
It appears that the Swin Transformer model is a modification of the standard transformer architecture and has some potential for further improvement. This is in contrast to EfficientNet and RegNet models, which are the result of extensive architecture searches. This suggests that the Swin Transformer architecture may be able to achieve even better performance with further optimization or modifications.
Finally, the shifted window attention mechanism in the Swin Transformer allows the model to attend to different parts of the image at different scales and better capture local relationships between image patches, which can improve the accuracy of visual recognition tasks.