Pukei-Pukei/MyViT

Introduction

MyViT is a simplified version of rwightman/pytorch-image-models/timm/models/vision_transformer.

This project aims to make it easy to review the code side by side with the paper *An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale*.

Equations

Transformer Encoder

$$\begin{aligned} (H, W) &= \text{the resolution of the original image}\\ C &= \text{the number of channels}\\ (P, P) &= \text{the resolution of each image patch}\\ D &= \text{latent vector size}\\ N' &= H \cdot W / P^2 = \text{the number of patches}\\ N &= N' + 1 = \text{the Transformer's sequence length}\\ \\ \mathrm{LN} &= \text{LayerNorm}\\ \\ &\textbf{Input}\\ \mathbf{x}_{p} &\in \mathbb{R}^{N' \times (P^2 \cdot C)}\\ \\ &\textbf{Learnable}\\ \mathbf{E} &\in \mathbb{R}^{(P^2 \cdot C) \times D}\\ \mathbf{E}_{pos} &\in \mathbb{R}^{N \times D}\\ \mathbf{x}_{class} &\in \mathbb{R}^{1 \times D}\\ \\ \mathbf{z}_{0} &= [\mathbf{x}_{class}\ ;\ \mathbf{x}_{p}\mathbf{E}] + \mathbf{E}_{pos} &\mathbf{z}_{0} &\in \mathbb{R}^{N \times D}\\ \\ \mathbf{z'}_{l} &= \mathrm{MSA}(\mathrm{LN}(\mathbf{z}_{l-1})) + \mathbf{z}_{l-1} &\mathbf{z'}_{l} &\in \mathbb{R}^{N \times D}\\ \mathbf{z}_{l} &= \mathrm{MLP}(\mathrm{LN}(\mathbf{z'}_{l})) + \mathbf{z'}_{l} &\mathbf{z}_{l} &\in \mathbb{R}^{N \times D}\\ &\text{where } l = 1 \ldots L\\ \\ &\textbf{Output}\\ \mathbf{y} &= \mathrm{LN}(\mathbf{z}^{0}_{L}) &\mathbf{y} &\in \mathbb{R}^{D}\\ \end{aligned}$$
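A minimal PyTorch sketch of these equations (illustrative names, not the MyViT source); `nn.MultiheadAttention` and a small `nn.Sequential` stand in for the MSA and MLP sub-modules derived from scratch in the next two sections:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One encoder block: z'_l = MSA(LN(z_{l-1})) + z_{l-1}; z_l = MLP(LN(z'_l)) + z'_l."""
    def __init__(self, dim, num_heads, hidden_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # MSA
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(),
                                 nn.Linear(hidden_dim, dim))                 # MLP

    def forward(self, z):
        h = self.norm1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # z'_l
        z = z + self.mlp(self.norm2(z))                    # z_l
        return z

class ViTEncoder(nn.Module):
    """Patch embedding + class token + position embedding + L encoder blocks."""
    def __init__(self, num_patches, patch_dim, dim, depth, num_heads, hidden_dim):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, dim)                         # E
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                # x_class
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # E_pos
        self.blocks = nn.ModuleList(
            Block(dim, num_heads, hidden_dim) for _ in range(depth))
        self.norm = nn.LayerNorm(dim)                                        # final LN

    def forward(self, x_p):                                # x_p: (B, N', P^2 * C)
        cls = self.cls_token.expand(x_p.shape[0], -1, -1)  # prepend class token
        z = torch.cat([cls, self.patch_embed(x_p)], dim=1) + self.pos_embed  # z_0
        for blk in self.blocks:
            z = blk(z)
        return self.norm(z[:, 0])                          # y = LN(z_L^0), shape (B, D)
```

For ViT-Base on 224x224 RGB images with 16x16 patches, the paper's configuration would be `ViTEncoder(num_patches=196, patch_dim=768, dim=768, depth=12, num_heads=12, hidden_dim=3072)`.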

MSA (Multihead Self Attention)

$$\begin{aligned} h &= \text{number of heads}\\ d &= D / h\\ \\ &\textbf{Input}\\ \mathbf{z} &\in \mathbb{R}^{N \times D}\\ \\ &\textbf{Learnable}\\ \mathbf{U}_{qkv} &\in \mathbb{R}^{D \times (3 \cdot d)}\\ \mathbf{U}_{msa} &\in \mathbb{R}^{D \times D}\\ \\ [\mathbf{q, k, v}] &= \mathbf{zU}_{qkv} &\mathbf{q, k, v} &\in \mathbb{R}^{N \times d}\\ \\ A &= \mathrm{softmax}(\ \mathbf{qk}^{\top}\ /\ \sqrt{d}\ ) &A &\in \mathbb{R}^{N \times N}\\ \\ \mathrm{SA}(\mathbf{z}) &= A\mathbf{v} &\mathrm{SA}(\mathbf{z}) &\in \mathbb{R}^{N \times d}\\ \\ &\textbf{Output}\\ \mathrm{MSA}(\mathbf{z}) &= [\mathrm{SA}_{1}(\mathbf{z}) ; \mathrm{SA}_{2}(\mathbf{z}) ; \cdots ; \mathrm{SA}_{h}(\mathbf{z})] \mathbf{U}_{msa} &\mathrm{MSA}(\mathbf{z}) &\in \mathbb{R}^{N \times D} \end{aligned}$$
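A from-scratch sketch of MSA under these definitions (illustrative, not the MyViT source). A single `nn.Linear(dim, 3 * dim)` stacks the per-head $\mathbf{U}_{qkv}$ matrices for all $h$ heads, which is how timm also fuses the projection:

```python
import torch
import torch.nn as nn

class MSA(nn.Module):
    """Multihead self-attention: MSA(z) = [SA_1(z); ...; SA_h(z)] U_msa."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        assert dim % num_heads == 0, "D must be divisible by h"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads     # d = D / h
        self.scale = self.head_dim ** -0.5   # 1 / sqrt(d)
        self.qkv = nn.Linear(dim, dim * 3)   # U_qkv for all h heads at once
        self.proj = nn.Linear(dim, dim)      # U_msa

    def forward(self, z):
        B, N, D = z.shape
        # (B, N, 3D) -> (3, B, h, N, d)
        qkv = (self.qkv(z)
               .reshape(B, N, 3, self.num_heads, self.head_dim)
               .permute(2, 0, 3, 1, 4))
        q, k, v = qkv.unbind(0)                        # each (B, h, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # q k^T / sqrt(d)
        attn = attn.softmax(dim=-1)                    # A, shape (B, h, N, N)
        out = attn @ v                                 # SA(z) = A v, per head
        out = out.transpose(1, 2).reshape(B, N, D)     # concat heads along D
        return self.proj(out)                          # project with U_msa
```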

MLP (Multilayer Perceptron)

$$\begin{aligned} D_{hidden} &= \text{hidden layer size}\\ \\ &\textbf{Input}\\ \mathbf{z} &\in \mathbb{R}^{N \times D}\\ \\ &\textbf{Learnable}\\ \mathbf{L}_{hidden} &\in \mathbb{R}^{D \times D_{hidden}}\\ \mathbf{L}_{out} &\in \mathbb{R}^{D_{hidden} \times D}\\ \\ &\textbf{Output}\\ \mathrm{MLP}(\mathbf{z}) &= \mathrm{GELU}(\mathbf{zL}_{hidden})\mathbf{L}_{out} &\mathrm{MLP}(\mathbf{z}) &\in \mathbb{R}^{N \times D} \end{aligned}$$
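The MLP is a direct two-layer translation of this equation; a minimal sketch (illustrative names, biases included as `nn.Linear` does by default):

```python
import torch.nn as nn

class MLP(nn.Module):
    """MLP(z) = GELU(z L_hidden) L_out, applied position-wise over the sequence."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc_hidden = nn.Linear(dim, hidden_dim)  # L_hidden: D -> D_hidden
        self.act = nn.GELU()
        self.fc_out = nn.Linear(hidden_dim, dim)     # L_out: D_hidden -> D

    def forward(self, z):
        return self.fc_out(self.act(self.fc_hidden(z)))
```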
