Vision Transformer (ViT) models implement the architecture proposed in the paper *An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale*. These models are designed for image classification: they split an image into fixed-size patches and treat each patch as a token in a standard Transformer encoder.
model_vit_b_16(pretrained = FALSE, progress = TRUE, ...)
model_vit_b_32(pretrained = FALSE, progress = TRUE, ...)
model_vit_l_16(pretrained = FALSE, progress = TRUE, ...)
model_vit_l_32(pretrained = FALSE, progress = TRUE, ...)
model_vit_h_14(pretrained = FALSE, progress = TRUE, ...)
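For example, a minimal sketch of constructing one of these models and running a forward pass (assuming the torch and torchvision R packages are installed):

```r
library(torch)
library(torchvision)

# Construct ViT-B/16 with random weights; set pretrained = TRUE to download
# the IMAGENET1K_V1 checkpoint instead.
model <- model_vit_b_16(pretrained = FALSE)
model$eval()

# These models expect 224x224 RGB input (the SWAG vit_h_14 weights assume a
# larger input resolution).
x <- torch_randn(1, 3, 224, 224)
out <- model(x)
out$shape  # (1, 1000): one logit per ImageNet-1k class
```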
| Model | Top-1 Acc | Top-5 Acc | Params | GFLOPS | File Size | Weights Used | Notes |
|-----------|-----------|-----------|---------|--------|-----------|---------------------------|------------------------|
| vit_b_16 | 81.1% | 95.3% | 86.6M | 17.56 | 346 MB | IMAGENET1K_V1 | Base, 16x16 patches |
| vit_b_32 | 75.9% | 92.5% | 88.2M | 4.41 | 353 MB | IMAGENET1K_V1 | Base, 32x32 patches |
| vit_l_16 | 79.7% | 94.6% | 304.3M | 61.55 | 1.22 GB | IMAGENET1K_V1 | Large, 16x16 patches |
| vit_l_32 | 77.0% | 93.1% | 306.5M | 15.38 | 1.23 GB | IMAGENET1K_V1 | Large, 32x32 patches |
| vit_h_14 | 88.6% | 98.7% | 633.5M | 1016.7 | 2.53 GB | IMAGENET1K_SWAG_E2E_V1 | Huge, 14x14 patches |
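The patch size largely determines the compute cost: a 224×224 input is split into (224 / patch)² tokens, so halving the patch size roughly quadruples the sequence length, which is consistent with the GFLOPS gap between the /16 and /32 variants above. A quick check in R:

```r
# Patch-token count for a 224x224 input: (image_size / patch_size)^2
(224 / 16)^2  # 196 tokens for ViT-B/16 and ViT-L/16
(224 / 32)^2  #  49 tokens for ViT-B/32 and ViT-L/32
# ~4x fewer tokens, matching the ~4x GFLOPS gap in the table
```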
TorchVision Recipe: https://github.com/pytorch/vision/tree/main/references/classification
SWAG Recipe: https://github.com/facebookresearch/SWAG
Weights Selection:
All models except vit_h_14 use the default IMAGENET1K_V1 weights for consistency, stability, and official support from TorchVision. These are supervised weights trained on ImageNet-1k. For vit_h_14, the default weight is IMAGENET1K_SWAG_E2E_V1, which is pretrained on the weakly supervised SWAG dataset and fine-tuned end-to-end on ImageNet-1k.
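As an illustration, a hedged inference sketch with the pretrained weights; the normalization constants below are the standard ImageNet statistics these checkpoints assume:

```r
library(torch)
library(torchvision)

model <- model_vit_b_16(pretrained = TRUE)  # downloads IMAGENET1K_V1 weights
model$eval()

# Stand-in for a real, resized RGB image scaled to [0, 1]
img <- torch_rand(1, 3, 224, 224)

# Standard ImageNet normalization expected by the IMAGENET1K_V1 weights
mean <- torch_tensor(c(0.485, 0.456, 0.406))$view(c(1, 3, 1, 1))
std  <- torch_tensor(c(0.229, 0.224, 0.225))$view(c(1, 3, 1, 1))
img <- (img - mean) / std

with_no_grad({
  logits <- model(img)
})
pred <- torch_argmax(logits, dim = 2)  # index of the top ImageNet-1k class
```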
model_vit_b_16(): ViT-B/16 model (Base, 16×16 patch size)
model_vit_b_32(): ViT-B/32 model (Base, 32×32 patch size)
model_vit_l_16(): ViT-L/16 model (Large, 16×16 patch size)
model_vit_l_32(): ViT-L/32 model (Large, 32×32 patch size)
model_vit_h_14(): ViT-H/14 model (Huge, 14×14 patch size)
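For transfer learning, a common pattern is to freeze the pretrained backbone and train only a new head. A sketch under that assumption: it stacks a trainable linear layer on the frozen 1000-class output rather than replacing the model's internal classification head, whose attribute name may vary across versions.

```r
library(torch)
library(torchvision)

backbone <- model_vit_b_16(pretrained = TRUE)

# Freeze every pretrained parameter
for (p in backbone$parameters) p$requires_grad_(FALSE)

# Trainable head mapping the frozen 1000-class logits to a new task
# (num_classes = 10 is a hypothetical target task size)
net <- nn_module(
  initialize = function(backbone, num_classes) {
    self$backbone <- backbone
    self$fc <- nn_linear(1000, num_classes)
  },
  forward = function(x) {
    self$fc(self$backbone(x))
  }
)(backbone, num_classes = 10)
```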
Other models: model_alexnet(), model_inception_v3(), model_mobilenet_v2(), model_resnet, model_vgg