Swin Transformer

This repo is the official implementation of «Swin Transformer: Hierarchical Vision Transformer using Shifted Windows» as well as the follow-ups. It currently includes code and models for the following tasks:

Image Classification: Included in this repo. See get_started.md for a quick start.

Object Detection and Instance Segmentation: See Swin Transformer for Object Detection.

Semantic Segmentation: See Swin Transformer for Semantic Segmentation.

Video Action Recognition: See Video Swin Transformer.

Semi-Supervised Object Detection: See Soft Teacher.

SSL: Contrastive Learning: See Transformer-SSL.

SSL: Masked Image Modeling: See get_started.md#simmim-support.

Mixture-of-Experts: See get_started for more instructions.

Feature-Distillation: See Feature-Distillation.

Updates

12/29/2022

  1. Nvidia's FasterTransformer now supports Swin Transformer V2 inference, which brings significant speed improvements on T4 and A100 GPUs.

11/30/2022

  1. Models and code of Feature Distillation are released. Please refer to Feature-Distillation for details, and see the checkpoints (FD-EsViT-Swin-B, FD-DeiT-ViT-B, FD-DINO-ViT-B, FD-CLIP-ViT-B, FD-CLIP-ViT-L).

09/24/2022

  1. Merged SimMIM, a Masked Image Modeling (MIM) based pre-training approach applicable to Swin and SwinV2 (and also to ViT and ResNet). Please refer to get started with SimMIM to play with SimMIM pre-training.

  2. Released a series of Swin and SwinV2 models pre-trained with the SimMIM approach (see MODELHUB for SimMIM), with model sizes ranging from SwinV2-Small-50M to SwinV2-Giant-1B, data sizes ranging from ImageNet-1K-10% to ImageNet-22K, and training iterations from 125k to 500k. You may leverage these models to study the properties of MIM methods. Please refer to the data scaling paper for more details.

07/09/2022

News:

  1. SwinV2-G achieves 61.4 mIoU on ADE20K semantic segmentation (+1.5 mIoU over the previous SwinV2-G result), using an additional feature distillation (FD) approach, setting a new record on this benchmark. FD is an approach that can generally improve the fine-tuning performance of various pre-trained models, including DeiT, DINO, and CLIP. In particular, it improves the CLIP pre-trained ViT-L by +1.6% to reach 89.0% on ImageNet-1K image classification, making it the most accurate ViT-L model.
  2. Merged a PR from Nvidia that links to a faster Swin Transformer inference implementation with significant speed improvements on T4 and A100 GPUs.
  3. Merged a PR from Nvidia that enables an option to use pure FP16 (Apex O2) in training, while almost maintaining accuracy.

06/03/2022

  1. Added Swin-MoE, the Mixture-of-Experts variant of Swin Transformer implemented using Tutel (an optimized Mixture-of-Experts implementation). Swin-MoE is introduced in the Tutel paper.

05/12/2022

  1. Pretrained models of Swin Transformer V2 on ImageNet-1K and ImageNet-22K are released.
  2. ImageNet-22K pretrained models for Swin-V1-Tiny and Swin-V2-Small are released.

03/02/2022

  1. Swin Transformer V2 and SimMIM were accepted by CVPR 2022. SimMIM is a self-supervised pre-training approach based on masked image modeling; it is a key technique that enables training the 3-billion-parameter Swin V2 model with 40x less labelled data than previous billion-scale models based on JFT-3B.

02/09/2022

  1. Integrated into Hugging Face Spaces 🤗 using Gradio. Try out the Web Demo on Hugging Face Spaces.

10/12/2021

  1. Swin Transformer received the ICCV 2021 Best Paper Award (Marr Prize).

08/09/2021

  1. Soft Teacher will appear at ICCV 2021. The code will be released at the GitHub Repo. Soft Teacher is an end-to-end semi-supervised object detection method that achieves a new record on COCO test-dev: 61.3 box AP and 53.0 mask AP.

07/03/2021

  1. Added Swin MLP, an adaptation of Swin Transformer that replaces all multi-head self-attention (MHSA) blocks with MLP layers (more precisely, grouped linear layers). The shifted window configuration can also significantly improve the performance of vanilla MLP architectures.

06/25/2021

  1. Video Swin Transformer is released at Video-Swin-Transformer.
    Video Swin Transformer achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2).

05/12/2021

  1. Used as a backbone for Self-Supervised Learning: Transformer-SSL

Using Swin Transformer as the backbone for self-supervised learning enables us to evaluate the transfer performance of the learnt representations on downstream tasks, which is missing in previous works due to the use of ViT/DeiT, which have not yet been well tamed for downstream tasks.

04/12/2021

Initial commits:

  1. Pretrained models on ImageNet-1K (Swin-T-IN1K, Swin-S-IN1K, Swin-B-IN1K) and ImageNet-22K (Swin-B-IN22K, Swin-L-IN22K) are provided.
  2. The supported code and models for ImageNet-1K image classification, COCO object detection and ADE20K semantic segmentation are provided.
  3. The CUDA kernel implementation for the local relation layer is provided in the branch LR-Net.

Introduction

Swin Transformer (the name Swin stands for Shifted window) was initially described in the arXiv paper and capably serves as a
general-purpose backbone for computer vision. It is basically a hierarchical Transformer whose representation is
computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention
computation to non-overlapping local windows while also allowing for cross-window connection.

Swin Transformer achieves strong performance on COCO object detection (58.7 box AP and 51.1 mask AP on test-dev) and
ADE20K semantic segmentation (53.5 mIoU on val), surpassing previous models by a large margin.

teaser

Main Results on ImageNet with Pretrained Models

ImageNet-1K and ImageNet-22K Pretrained Swin-V1 Models

| name | pretrain | resolution | acc@1 | acc@5 | #params | FLOPs | FPS | 22K model | 1K model |
|------|----------|------------|-------|-------|---------|-------|-----|-----------|----------|
| Swin-T | ImageNet-1K | 224×224 | 81.2 | 95.5 | 28M | 4.5G | 755 | - | github/baidu/config/log |
| Swin-S | ImageNet-1K | 224×224 | 83.2 | 96.2 | 50M | 8.7G | 437 | - | github/baidu/config/log |
| Swin-B | ImageNet-1K | 224×224 | 83.5 | 96.5 | 88M | 15.4G | 278 | - | github/baidu/config/log |
| Swin-B | ImageNet-1K | 384×384 | 84.5 | 97.0 | 88M | 47.1G | 85 | - | github/baidu/config |
| Swin-T | ImageNet-22K | 224×224 | 80.9 | 96.0 | 28M | 4.5G | 755 | github/baidu/config | github/baidu/config |
| Swin-S | ImageNet-22K | 224×224 | 83.2 | 97.0 | 50M | 8.7G | 437 | github/baidu/config | github/baidu/config |
| Swin-B | ImageNet-22K | 224×224 | 85.2 | 97.5 | 88M | 15.4G | 278 | github/baidu/config | github/baidu/config |
| Swin-B | ImageNet-22K | 384×384 | 86.4 | 98.0 | 88M | 47.1G | 85 | github/baidu | github/baidu/config |
| Swin-L | ImageNet-22K | 224×224 | 86.3 | 97.9 | 197M | 34.5G | 141 | github/baidu/config | github/baidu/config |
| Swin-L | ImageNet-22K | 384×384 | 87.3 | 98.2 | 197M | 103.9G | 42 | github/baidu | github/baidu/config |

ImageNet-1K and ImageNet-22K Pretrained Swin-V2 Models

| name | pretrain | resolution | window | acc@1 | acc@5 | #params | FLOPs | FPS | 22K model | 1K model |
|------|----------|------------|--------|-------|-------|---------|-------|-----|-----------|----------|
| SwinV2-T | ImageNet-1K | 256×256 | 8×8 | 81.8 | 95.9 | 28M | 5.9G | 572 | - | github/baidu/config |
| SwinV2-S | ImageNet-1K | 256×256 | 8×8 | 83.7 | 96.6 | 50M | 11.5G | 327 | - | github/baidu/config |
| SwinV2-B | ImageNet-1K | 256×256 | 8×8 | 84.2 | 96.9 | 88M | 20.3G | 217 | - | github/baidu/config |
| SwinV2-T | ImageNet-1K | 256×256 | 16×16 | 82.8 | 96.2 | 28M | 6.6G | 437 | - | github/baidu/config |
| SwinV2-S | ImageNet-1K | 256×256 | 16×16 | 84.1 | 96.8 | 50M | 12.6G | 257 | - | github/baidu/config |
| SwinV2-B | ImageNet-1K | 256×256 | 16×16 | 84.6 | 97.0 | 88M | 21.8G | 174 | - | github/baidu/config |
| SwinV2-B* | ImageNet-22K | 256×256 | 16×16 | 86.2 | 97.9 | 88M | 21.8G | 174 | github/baidu/config | github/baidu/config |
| SwinV2-B* | ImageNet-22K | 384×384 | 24×24 | 87.1 | 98.2 | 88M | 54.7G | 57 | github/baidu/config | github/baidu/config |
| SwinV2-L* | ImageNet-22K | 256×256 | 16×16 | 86.9 | 98.0 | 197M | 47.5G | 95 | github/baidu/config | github/baidu/config |
| SwinV2-L* | ImageNet-22K | 384×384 | 24×24 | 87.6 | 98.3 | 197M | 115.4G | 33 | github/baidu/config | github/baidu/config |

Note:

  • SwinV2-B* (SwinV2-L*) with input resolutions of 256×256 and 384×384 are both fine-tuned from the same pre-trained model, which used a smaller input resolution of 192×192.
  • SwinV2-B* (384×384) achieves 78.08 acc@1 on ImageNet-1K-V2 while SwinV2-L* (384×384) achieves 78.31.

ImageNet-1K Pretrained Swin MLP Models

| name | pretrain | resolution | acc@1 | acc@5 | #params | FLOPs | FPS | 1K model |
|------|----------|------------|-------|-------|---------|-------|-----|----------|
| Mixer-B/16 | ImageNet-1K | 224×224 | 76.4 | - | 59M | 12.7G | - | official repo |
| ResMLP-S24 | ImageNet-1K | 224×224 | 79.4 | - | 30M | 6.0G | 715 | timm |
| ResMLP-B24 | ImageNet-1K | 224×224 | 81.0 | - | 116M | 23.0G | 231 | timm |
| Swin-T/C24 | ImageNet-1K | 256×256 | 81.6 | 95.7 | 28M | 5.9G | 563 | github/baidu/config |
| SwinMLP-T/C24 | ImageNet-1K | 256×256 | 79.4 | 94.6 | 20M | 4.0G | 807 | github/baidu/config |
| SwinMLP-T/C12 | ImageNet-1K | 256×256 | 79.6 | 94.7 | 21M | 4.0G | 792 | github/baidu/config |
| SwinMLP-T/C6 | ImageNet-1K | 256×256 | 79.7 | 94.9 | 23M | 4.0G | 766 | github/baidu/config |
| SwinMLP-B | ImageNet-1K | 224×224 | 81.3 | 95.3 | 61M | 10.4G | 409 | github/baidu/config |

Note: access code for baidu is swin. C24 means each head has 24 channels.

ImageNet-22K Pretrained Swin-MoE Models

  • Please refer to get_started for instructions on running Swin-MoE.
  • Pretrained models for Swin-MoE can be found in MODEL HUB

Main Results on Downstream Tasks

COCO Object Detection (2017 val)

| Backbone | Method | pretrain | Lr Schd | box mAP | mask mAP | #params | FLOPs |
|----------|--------|----------|---------|---------|----------|---------|-------|
| Swin-T | Mask R-CNN | ImageNet-1K | 3x | 46.0 | 41.6 | 48M | 267G |
| Swin-S | Mask R-CNN | ImageNet-1K | 3x | 48.5 | 43.3 | 69M | 359G |
| Swin-T | Cascade Mask R-CNN | ImageNet-1K | 3x | 50.4 | 43.7 | 86M | 745G |
| Swin-S | Cascade Mask R-CNN | ImageNet-1K | 3x | 51.9 | 45.0 | 107M | 838G |
| Swin-B | Cascade Mask R-CNN | ImageNet-1K | 3x | 51.9 | 45.0 | 145M | 982G |
| Swin-T | RepPoints V2 | ImageNet-1K | 3x | 50.0 | - | 45M | 283G |
| Swin-T | Mask RepPoints V2 | ImageNet-1K | 3x | 50.3 | 43.6 | 47M | 292G |
| Swin-B | HTC++ | ImageNet-22K | 6x | 56.4 | 49.1 | 160M | 1043G |
| Swin-L | HTC++ | ImageNet-22K | 3x | 57.1 | 49.5 | 284M | 1470G |
| Swin-L | HTC++* | ImageNet-22K | 3x | 58.0 | 50.4 | 284M | - |

Note: * indicates multi-scale testing.

ADE20K Semantic Segmentation (val)

| Backbone | Method | pretrain | Crop Size | Lr Schd | mIoU | mIoU (ms+flip) | #params | FLOPs |
|----------|--------|----------|-----------|---------|------|----------------|---------|-------|
| Swin-T | UPerNet | ImageNet-1K | 512×512 | 160K | 44.51 | 45.81 | 60M | 945G |
| Swin-S | UPerNet | ImageNet-1K | 512×512 | 160K | 47.64 | 49.47 | 81M | 1038G |
| Swin-B | UPerNet | ImageNet-1K | 512×512 | 160K | 48.13 | 49.72 | 121M | 1188G |
| Swin-B | UPerNet | ImageNet-22K | 640×640 | 160K | 50.04 | 51.66 | 121M | 1841G |
| Swin-L | UPerNet | ImageNet-22K | 640×640 | 160K | 52.05 | 53.53 | 234M | 3230G |

Citing Swin Transformer

@inproceedings{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2021}
}

Citing Local Relation Networks (the first full-attention visual backbone)

@inproceedings{hu2019local,
  title={Local Relation Networks for Image Recognition},
  author={Hu, Han and Zhang, Zheng and Xie, Zhenda and Lin, Stephen},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  pages={3464--3473},
  year={2019}
}

Citing Swin Transformer V2

@inproceedings{liu2021swinv2,
  title={Swin Transformer V2: Scaling Up Capacity and Resolution}, 
  author={Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}

Citing SimMIM (a self-supervised approach that enables SwinV2-G)

@inproceedings{xie2021simmim,
  title={SimMIM: A Simple Framework for Masked Image Modeling},
  author={Xie, Zhenda and Zhang, Zheng and Cao, Yue and Lin, Yutong and Bao, Jianmin and Yao, Zhuliang and Dai, Qi and Hu, Han},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022}
}

Citing SimMIM-data-scaling

@article{xie2022data,
  title={On Data Scaling in Masked Image Modeling},
  author={Xie, Zhenda and Zhang, Zheng and Cao, Yue and Lin, Yutong and Wei, Yixuan and Dai, Qi and Hu, Han},
  journal={arXiv preprint arXiv:2206.04664},
  year={2022}
}

Citing Swin-MoE

@misc{hwang2022tutel,
      title={Tutel: Adaptive Mixture-of-Experts at Scale}, 
      author={Changho Hwang and Wei Cui and Yifan Xiong and Ziyue Yang and Ze Liu and Han Hu and Zilong Wang and Rafael Salas and Jithin Jose and Prabhat Ram and Joe Chau and Peng Cheng and Fan Yang and Mao Yang and Yongqiang Xiong},
      year={2022},
      eprint={2206.03382},
      archivePrefix={arXiv}
}

Getting Started

  • For Image Classification, please see get_started.md for detailed instructions; a minimal inference sketch is also shown after this list.
  • For Object Detection and Instance Segmentation, please see Swin Transformer for Object Detection.
  • For Semantic Segmentation, please see Swin Transformer for Semantic Segmentation.
  • For Self-Supervised Learning, please see Transformer-SSL.
  • For Video Recognition, please see Video Swin Transformer.
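
As a quick sanity check of the classification models, the snippet below is a minimal, hypothetical example (it assumes the timm package is installed and exposes the swin_tiny_patch4_window7_224 weights); the official training and evaluation scripts in get_started.md remain the reference:

import timm
import torch

# Load a pretrained Swin-T classifier via timm (a quick check only,
# not the official evaluation pipeline described in get_started.md).
model = timm.create_model('swin_tiny_patch4_window7_224', pretrained=True)
model.eval()

x = torch.randn(1, 3, 224, 224)        # dummy ImageNet-sized input
with torch.no_grad():
    logits = model(x)
print(logits.shape)                    # torch.Size([1, 1000])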

Third-party Usage and Experiments

In this section, we cross-link third-party repositories that use Swin and report results. You can let us know by raising an issue.

(Note: please report accuracy numbers and provide trained models in your new repository to help others get a sense of correctness and model behavior.)

[12/29/2022] Swin Transformers (V2) inference implemented in FasterTransformer: FasterTransformer

[06/30/2022] Swin Transformers (V1) inference implemented in FasterTransformer: FasterTransformer

[05/12/2022] Swin Transformers (V1) implemented in TensorFlow, with the pre-trained parameters ported. Find the implementation, the TensorFlow weights, and a code example in this repository.

[04/06/2022] Swin Transformer for Audio Classification: Hierarchical Token Semantic Audio Transformer.

[12/21/2021] Swin Transformer for StyleGAN: StyleSwin

[12/13/2021] Swin Transformer for Face Recognition: FaceX-Zoo

[08/29/2021] Swin Transformer for Image Restoration: SwinIR

[08/12/2021] Swin Transformer for person reID: https://github.com/layumi/Person_reID_baseline_pytorch

[06/29/2021] Swin-Transformer in PaddleClas and inference based on whl package: https://github.com/PaddlePaddle/PaddleClas

[04/14/2021] Swin for RetinaNet in Detectron2: https://github.com/xiaohu2015/SwinT_detectron2.

[04/16/2021] Included in a famous model zoo: https://github.com/rwightman/pytorch-image-models.

[04/20/2021] Swin-Transformer classifier inference using TorchServe: https://github.com/kamalkraj/Swin-Transformer-Serve

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct.
For more information see the Code of Conduct FAQ or
contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
Microsoft’s Trademark & Brand Guidelines.
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos is subject to those third parties' policies.

Abstract

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github.com/microsoft/Swin-Transformer.

A Summary of the Swin Transformer: A Hierarchical Vision Transformer using Shifted Windows

In this post, we review and summarize the Swin Transformer paper, Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. Some of the code used here comes from the official GitHub repo, so you may want to clone it if you would like to test this work yourself. The aim of this post, however, is to simplify and summarize the Swin Transformer paper; a follow-up post will explain how to implement the Swin Transformer in detail.

Overview

The «Swin Transformer: Hierarchical Vision Transformer using Shifted Windows» is a research paper that proposes a new architecture for visual recognition tasks using a hierarchical transformer model. The architecture, called the Swin Transformer, uses a combination of local and global attention mechanisms to process images and improve the accuracy of image classification and object detection tasks. The Swin Transformer uses a series of shifted window attention mechanisms to enable the model to focus on different parts of the image at different scales, and a hierarchical structure to allow the model to learn and reason about the relationships between different image regions. The authors of the paper claim that the Swin Transformer outperforms existing transformer-based models on a number of benchmark datasets and tasks.

To be clear, and at the same time not oversimplify this work, there are a few key concepts that the Swin Transformer proposes on top of ViT, and they are needed to fully grasp the new model's architecture. The two concepts are:

  • Shifted Window Attention
  • Patch Merging

The rest of the Swin Transformer's architecture is pretty much the same as ViT (with some small modifications). So what are these two concepts? We explain them later in this blog post.

First, let’s get a deeper overview of the architecture. 

What makes it different from ViT?

The Swin Transformer is an extension of the Vision Transformer (ViT) model, which was introduced in the paper «An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale» (https://arxiv.org/abs/2010.11929). Like ViT, the Swin Transformer is a Transformer-based architecture that processes images as a sequence of patches, rather than using convolutional layers as in traditional image recognition models. However, the Swin Transformer introduces several key changes to the ViT architecture to improve performance on visual recognition tasks.

One of the main differences between the Swin Transformer and ViT is the use of shifted window attention mechanisms. In the Swin Transformer, the attention mechanisms operate over a series of shifted windows of different sizes, rather than over the full image as in ViT. This allows the model to attend to different parts of the image at different scales and better capture local relationships between image patches.

The Swin Transformer also introduces a hierarchical structure, where the output of the shifted window attention mechanisms at each scale is passed through a separate transformer layer before being combined and passed to the next scale. This hierarchical structure allows the model to learn and reason about the relationships between different image regions at different scales.

The Swin Transformer architecture is shown in the figure above, with the tiny version (Swin-T) depicted. Like the Vision Transformer (ViT), it begins by dividing an input RGB image into non-overlapping patches using a patch-splitting module. Each patch is treated as a «token», and its features are the concatenated RGB values of the raw pixels. In this implementation, the patches are 4×4 in size and therefore have a feature dimension of 4×4×3 = 48. A linear embedding layer is then applied to this raw-valued feature to project it to a different dimension (denoted as C).
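
As a quick sanity check on the hierarchy, the short sketch below (illustrative arithmetic only, not code from the repo) prints the token-grid size and channel dimension at each of the four stages of Swin-T, assuming a 224×224 input, 4×4 patches, C = 96, and 2× downsampling between stages:

# Token grid and channel progression for Swin-T (illustrative arithmetic only)
img_size, patch_size, C = 224, 4, 96
res = img_size // patch_size                  # 224 / 4 = 56
for stage in range(1, 5):
    print(f"stage {stage}: {res}x{res} tokens, dim {C}")
    res, C = res // 2, C * 2                  # patch merging halves resolution, doubles channels
# stage 1: 56x56 tokens, dim 96
# stage 2: 28x28 tokens, dim 192
# stage 3: 14x14 tokens, dim 384
# stage 4: 7x7 tokens, dim 768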

Patch Partition (From section 3.1 of the paper):

The first step in the process is to input an image and convert it to patch embeddings, which is the same as in ViT. The difference is that the patch size in the Swin Transformer is 4×4 instead of 16×16 as in ViT. Patch embeddings have previously been explained here.

import torch
from timm.models.layers import PatchEmbed

x = torch.randn(1, 3, 224, 224)                    # a dummy 224x224 RGB image
patch_embed = PatchEmbed(img_size=224, patch_size=4, embed_dim=96)
print(patch_embed(x).shape)                        # torch.Size([1, 3136, 96])

As can be seen, the output of the patch embedding layer has shape (1, 3136, 96), that is, (1, H/4 × W/4, C): a sequence of 56×56 = 3136 patch tokens with embedding dimension C = 96.

# BasicLayer implements one Swin stage; the exact import path and the required
# constructor arguments (e.g. num_heads, window_size) may differ across timm versions.
from timm.models.swin_transformer import BasicLayer

stage_1 = BasicLayer(dim=96, out_dim=192,
                     input_resolution=(56, 56),
                     depth=2)
inp = torch.randn(1, 56*56, 96)
print(stage_1(inp).shape)                          # torch.Size([1, 3136, 96])

As shown in the code snippet, the dimensions of the input do not change as it passes through «Stage 1». In fact, the dimensions remain constant as the input passes through every stage. It is only between stages that a patch merging layer is applied to reduce the number of tokens as the network becomes deeper.

Patch Merging Layer

The first patch merging layer combines the features of groups of 2×2 neighboring patches and applies a linear layer on the concatenated features, which have a dimension of 4C. This reduces the number of tokens by a factor of 4 (corresponding to a 2x downsampling of resolution), and the output dimension is set to 2C. In this case, C is the number of channels (embedding dimension) and is equal to 96 for the tiny version of the Swin Transformer described in this blog post.

The patch-merging layer merges four patches at a time, so with each merge the height and width of the image are reduced by a factor of 2. For example, in stage 1 the input resolution is (H/4, W/4), but after patch merging the resolution becomes (H/8, W/8), which is the input for stage 2. Similarly, the input resolution for stage 3 is (H/16, W/16) and for stage 4 it is (H/32, W/32). The patch-merging process can be understood by examining the inputs and outputs in the code.

import torch
from torch import nn
# As above, the exact PatchMerging signature may differ across timm versions.
from timm.models.swin_transformer import PatchMerging

x = torch.randn(1, 56*56, 96)
patch_merging = PatchMerging(input_resolution=(56, 56), dim=96, out_dim=192, norm_layer=nn.LayerNorm)
print(patch_merging(x).shape)                      # torch.Size([1, 784, 192]), i.e. (1, 28x28, 192)

As shown, the output width and height are both reduced by a factor of 2, and the number of output channels is 2C where C is the number of input channels. In the case of the Swin-T model, C=96. The source code for patch merging can be examined to further understand its function.

import torch
from torch import nn

class PatchMerging(nn.Module):
    def __init__(self, input_resolution, dim, out_dim=None, norm_layer=nn.LayerNorm):
        super().__init__()
        self.input_resolution = input_resolution
        self.dim = dim
        self.out_dim = out_dim or 2 * dim
        self.norm = norm_layer(4 * dim)
        self.reduction = nn.Linear(4 * dim, self.out_dim, bias=False)

    def forward(self, x):
        """
        x: (B, H*W, C), where B is the batch size
        """
        H, W = self.input_resolution
        B, L, C = x.shape
        x = x.view(B, H, W, C)

        # Gather the four patches of every 2x2 neighbourhood
        x0 = x[:, 0::2, 0::2, :]  # B H/2 W/2 C
        x1 = x[:, 1::2, 0::2, :]  # B H/2 W/2 C
        x2 = x[:, 0::2, 1::2, :]  # B H/2 W/2 C
        x3 = x[:, 1::2, 1::2, :]  # B H/2 W/2 C

        x = torch.cat([x0, x1, x2, x3], -1)  # B H/2 W/2 4*C
        x = x.view(B, -1, 4 * C)             # B H/2*W/2 4*C

        # Normalize, then project the concatenated 4C features down to out_dim (2C by default)
        x = self.norm(x)
        x = self.reduction(x)
        return x

Deeper Dive into Shifted Windows Mechanism (From section 3.2 of the paper):

In the Swin Transformer, the attention mechanisms operate over a series of shifted windows of different sizes, rather than over the full image as in the original Vision Transformer (ViT) model. Each window consists of a set of image patches, and the model uses attention to weight the importance of each patch within the window. The size of the window and the number of patches it contains can vary, and the model can use different window sizes for different scales of image information.

The use of shifted windows allows the model to attend to different parts of the image at different scales, rather than just processing the entire image at a single scale as in ViT. This enables the model to better capture local relationships between image patches, as it can focus on a small region of the image and attend to the patches within that region.

The attention mechanisms in the Swin Transformer are similar to those used in other transformer models. They use a dot product between the query and key vectors to compute the attention weights for each patch within the window. The model then uses these attention weights to compute a weighted sum of the value vectors for each patch, which is used as input to the next layer of the model.
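
As a rough illustration of that computation, here is a minimal single-head sketch of self-attention applied independently inside each window (the names window_self_attention and w_qkv are ours for illustration; the actual WindowAttention module in the repository additionally uses multiple heads and a learned relative position bias):

import torch
import torch.nn.functional as F

def window_self_attention(windows, w_qkv):
    # windows: (num_windows * B, tokens_per_window, C)
    q, k, v = w_qkv(windows).chunk(3, dim=-1)        # project tokens to queries, keys, values
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale         # dot-product scores between patches in a window
    attn = F.softmax(attn, dim=-1)                   # attention weights
    return attn @ v                                  # weighted sum of the value vectors

C = 96
w_qkv = torch.nn.Linear(C, 3 * C)
windows = torch.randn(4, 16, C)                      # 4 windows of 4x4 = 16 patch tokens each
print(window_self_attention(windows, w_qkv).shape)   # torch.Size([4, 16, 96])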

As shown in the figure, the first module uses a regular window partitioning strategy that starts from the top-left pixel and evenly divides the 8×8 feature map into 2×2 windows of size 4×4 (M = 4). The next module uses a windowing configuration that is shifted from that of the preceding layer by (⌊M/2⌋, ⌊M/2⌋) pixels in the x and y dimensions. This is illustrated in the figure below:

As seen, the left image shows an 8×8 feature map that is evenly divided into 4 windows of size 4×4. The window size is M=4. In the first part of the two successive blocks, attention is calculated within these windows. However, the network also needs cross-window attention to learn better, because it is no longer using a global context. To achieve this, the second part of the Swin Transformer block shifts the windows by ([M/2], [M/2]) pixels from their regular positions and performs attention between the new windows, leading to cross-window connections. In this case, since M=4, the windows are displaced by (2,2). Self-attention is then performed inside the shifted local windows.

In the official implementation, the shifted window mechanism is realized by cyclically shifting the feature map with torch.roll and then partitioning it into non-overlapping windows with reshape/permute operations. The sketch below is a simplified version of that logic (the window_partition helper mirrors the one in the official repo); the attention mask that handles the wrapped-around boundaries is omitted for brevity:

import torch

def window_partition(x, window_size):
    # Split a (B, H, W, C) feature map into non-overlapping windows of
    # shape (num_windows * B, window_size, window_size, C).
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)

# An 8x8 feature map with C = 96 channels, as in the example above
x = torch.randn(1, 8, 8, 96)
window_size = 4
shift_size = window_size // 2    # M/2 = 2

# Regular window partition (W-MSA): 2x2 = 4 windows of size 4x4
windows = window_partition(x, window_size)                    # (4, 4, 4, 96)

# Shifted window partition (SW-MSA): cyclic shift by (-2, -2), then partition
shifted_x = torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))
shifted_windows = window_partition(shifted_x, window_size)    # (4, 4, 4, 96)

Swin Transformer Experiments

It seems that the Swin Transformer models outperform other vision transformer models (such as DeiT and ViT) and are comparable with EfficientNet and RegNet models when trained from scratch on the ImageNet-1K dataset. Additionally, the Swin Transformer models may have a slightly better speed-accuracy trade-off compared to EfficientNet and RegNet models. This suggests that the Swin Transformer architecture is effective for visual recognition tasks and may be a promising alternative to other state-of-the-art models.

It appears that the Swin Transformer model is a modification of the standard transformer architecture and has some potential for further improvement. This is in contrast to EfficientNet and RegNet models, which are the result of extensive architecture searches. This suggests that the Swin Transformer architecture may be able to achieve even better performance with further optimization or modifications.

Finally, the shifted-window attention mechanism in the Swin Transformer allows the model to attend to different parts of the image at different scales and better capture local relationships between image patches, which can improve the accuracy of visual recognition tasks.
