Vision Transformer (ViT) models implement the architecture proposed in the paper *An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale*. These models are designed for image classification: they split an image into fixed-size patches and treat each patch as a token in a standard Transformer encoder.
model_vit_b_16(pretrained = FALSE, progress = TRUE, ...)
model_vit_b_32(pretrained = FALSE, progress = TRUE, ...)
model_vit_l_16(pretrained = FALSE, progress = TRUE, ...)
model_vit_l_32(pretrained = FALSE, progress = TRUE, ...)
model_vit_h_14(pretrained = FALSE, progress = TRUE, ...)
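For example, a minimal sketch of constructing one of these models and running a forward pass (assuming the torch and torchvision R packages are installed):

```r
library(torch)
library(torchvision)

# Construct ViT-B/16 with random weights; set pretrained = TRUE to download
# the IMAGENET1K_V1 checkpoint instead.
model <- model_vit_b_16(pretrained = FALSE)
model$eval()

# These models expect 224x224 RGB input (the SWAG vit_h_14 weights assume a
# larger input resolution).
x <- torch_randn(1, 3, 224, 224)
out <- model(x)
out$shape  # (1, 1000): one logit per ImageNet-1k class
```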
| Model | Top-1 Acc | Top-5 Acc | Params | GFLOPS | File Size | Weights Used | Notes |
|-----------|-----------|-----------|---------|--------|-----------|---------------------------|------------------------|
| vit_b_16 | 81.1% | 95.3% | 86.6M | 17.56 | 346 MB | IMAGENET1K_V1 | Base, 16x16 patches |
| vit_b_32 | 75.9% | 92.5% | 88.2M | 4.41 | 353 MB | IMAGENET1K_V1 | Base, 32x32 patches |
| vit_l_16 | 79.7% | 94.6% | 304.3M | 61.55 | 1.22 GB | IMAGENET1K_V1 | Large, 16x16 patches |
| vit_l_32 | 77.0% | 93.1% | 306.5M | 15.38 | 1.23 GB | IMAGENET1K_V1 | Large, 32x32 patches |
| vit_h_14 | 88.6% | 98.7% | 633.5M | 1016.7 | 2.53 GB | IMAGENET1K_SWAG_E2E_V1 | Huge, 14x14 patches |
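The patch size largely determines the compute cost: a 224×224 input is split into (224 / patch)² tokens, so halving the patch size roughly quadruples the sequence length, which is consistent with the GFLOPS gap between the /16 and /32 variants above. A quick check in R:

```r
# Patch-token count for a 224x224 input: (image_size / patch_size)^2
(224 / 16)^2  # 196 tokens for ViT-B/16 and ViT-L/16
(224 / 32)^2  #  49 tokens for ViT-B/32 and ViT-L/32
# ~4x fewer tokens, matching the ~4x GFLOPS gap in the table
```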
TorchVision Recipe: https://github.com/pytorch/vision/tree/main/references/classification
SWAG Recipe: https://github.com/facebookresearch/SWAG
Weights Selection:
All models except vit_h_14 use the default IMAGENET1K_V1 weights for consistency, stability, and official support from TorchVision. These are supervised weights trained on ImageNet-1k. For vit_h_14, the default weight is IMAGENET1K_SWAG_E2E_V1, which is pretrained on the weakly supervised SWAG dataset and fine-tuned end-to-end on ImageNet-1k.
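As an illustration, a hedged inference sketch with the pretrained weights; the normalization constants below are the standard ImageNet statistics these checkpoints assume:

```r
library(torch)
library(torchvision)

model <- model_vit_b_16(pretrained = TRUE)  # downloads IMAGENET1K_V1 weights
model$eval()

# Stand-in for a real, resized RGB image scaled to [0, 1]
img <- torch_rand(1, 3, 224, 224)

# Standard ImageNet normalization expected by the IMAGENET1K_V1 weights
mean <- torch_tensor(c(0.485, 0.456, 0.406))$view(c(1, 3, 1, 1))
std  <- torch_tensor(c(0.229, 0.224, 0.225))$view(c(1, 3, 1, 1))
img <- (img - mean) / std

with_no_grad({
  logits <- model(img)
})
pred <- torch_argmax(logits, dim = 2)  # index of the top ImageNet-1k class
```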
model_vit_b_16(): ViT-B/16 model (Base, 16×16 patch size)
model_vit_b_32(): ViT-B/32 model (Base, 32×32 patch size)
model_vit_l_16(): ViT-L/16 model (Large, 16×16 patch size)
model_vit_l_32(): ViT-L/32 model (Large, 32×32 patch size)
model_vit_h_14(): ViT-H/14 model (Huge, 14×14 patch size)
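For transfer learning, a common pattern is to freeze the pretrained backbone and train only a new head. A sketch under that assumption: it stacks a trainable linear layer on the frozen 1000-class output rather than replacing the model's internal classification head, whose attribute name may vary across versions.

```r
library(torch)
library(torchvision)

backbone <- model_vit_b_16(pretrained = TRUE)

# Freeze every pretrained parameter
for (p in backbone$parameters) p$requires_grad_(FALSE)

# Trainable head mapping the frozen 1000-class logits to a new task
# (num_classes = 10 is a hypothetical target task size)
net <- nn_module(
  initialize = function(backbone, num_classes) {
    self$backbone <- backbone
    self$fc <- nn_linear(1000, num_classes)
  },
  forward = function(x) {
    self$fc(self$backbone(x))
  }
)(backbone, num_classes = 10)
```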
Other models: model_alexnet(), model_inception_v3(), model_mobilenet_v2(), model_resnet, model_vgg