Object detection models that combine a ConvNeXt backbone with a Feature Pyramid Network (FPN) and a Faster R-CNN detection head. The architecture mirrors model_fasterrcnn_resnet50_fpn(), with the ResNet backbone replaced by a ConvNeXt variant. The design follows the paper "A ConvNet for the 2020s".

Available Models

  • model_convnext_tiny_detection()

  • model_convnext_small_detection()

  • model_convnext_base_detection()

Backbone Performance (ImageNet-1k)

Accuracy metrics reflect backbone classification performance only. Detection head weights are randomly initialized and must be fine-tuned on task-specific labelled data before meaningful predictions are produced.

| Model                             | Top-1 Acc | Top-5 Acc | Params  | GFLOPS | File Size | Backbone Weights              | Notes                    |
|-----------------------------------|-----------|-----------|---------|--------|-----------|-------------------------------|--------------------------|
| model_convnext_tiny_detection     | 82.5%     | 96.1%     | 28.6M   | 4.46   | 109 MB    | IMAGENET1K_V1                 | Tiny backbone, FPN head  |
| model_convnext_small_detection    | 83.6%     | 96.7%     | 50.2M   | 8.68   | 192 MB    | IMAGENET1K_V1 (22k pretrain)  | Small backbone, FPN head |
| model_convnext_base_detection     | 84.1%     | 96.9%     | 88.6M   | 15.36  | 338 MB    | IMAGENET1K_V1                 | Base backbone, FPN head  |

FPN Channel Configuration

Each ConvNeXt variant produces four feature maps (C2–C5) fed into the FPN. Channel widths differ between Tiny/Small and Base:

| Variant | FPN in_channels          | FPN out_channels |
|---------|--------------------------|------------------|
| Tiny    | c(96, 192, 384, 768)     | 256              |
| Small   | c(96, 192, 384, 768)     | 256              |
| Base    | c(128, 256, 512, 1024)   | 256              |
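The mapping in the table above can be sketched as a small helper. Note that `fpn_in_channels()` is a hypothetical name used for illustration only, not part of the package API:

```r
# Hypothetical helper (not part of the package API) returning the FPN
# input channel widths for each ConvNeXt variant, per the table above.
fpn_in_channels <- function(variant) {
  switch(variant,
    tiny  = c(96, 192, 384, 768),
    small = c(96, 192, 384, 768),
    base  = c(128, 256, 512, 1024),
    stop("unknown variant: ", variant)
  )
}

fpn_in_channels("base")
# all variants share the same FPN output width (256), so detection-head
# weights have the same shape regardless of backbone choice
```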

Weights Selection

  • All variants use IMAGENET1K_V1 backbone weights by default (supervised ImageNet-1k).

  • The Small variant backbone (model_convnext_small_22k) was additionally pretrained on ImageNet-22k prior to fine-tuning on ImageNet-1k.

  • Detection head weights are randomly initialized — bounding-box predictions are meaningless without fine-tuning on labelled detection data.

  • Set pretrained_backbone = TRUE to load ImageNet backbone weights.

model_convnext_tiny_detection(
  num_classes = 91,
  pretrained_backbone = FALSE,
  ...
)

model_convnext_small_detection(
  num_classes = 91,
  pretrained_backbone = FALSE,
  ...
)

model_convnext_base_detection(
  num_classes = 91,
  pretrained_backbone = FALSE,
  ...
)

Arguments

num_classes

Number of output classes including the background class (default: 91 for COCO, i.e. 90 object classes plus background).

pretrained_backbone

Logical. If TRUE, loads ImageNet-pretrained ConvNeXt backbone weights. Default: FALSE.

...

Additional arguments (currently unused).
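For fine-tuning on a custom dataset, `num_classes` follows the COCO convention of counting a background slot (hence the default of 91 = 90 classes + background). A minimal sketch, assuming a dataset with 3 foreground classes:

```r
# Sketch: configure a detector for a dataset with 3 foreground classes.
# Under the COCO convention, num_classes includes the background slot.
num_classes <- 3 + 1  # 3 foreground classes + background

if (FALSE) { # \dontrun{ -- requires the package providing these constructors
  model <- model_convnext_tiny_detection(
    num_classes = num_classes,
    pretrained_backbone = TRUE  # backbone only; the detection head stays random
  )
} # }
```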

Functions

  • model_convnext_tiny_detection(): ConvNeXt Tiny with FPN detection head

  • model_convnext_small_detection(): ConvNeXt Small with FPN detection head

  • model_convnext_base_detection(): ConvNeXt Base with FPN detection head

Note

Detection head weights are randomly initialized. Predicted bounding boxes will be arbitrary until the detection head is trained on labelled data. Only the backbone benefits from pretrained_backbone = TRUE.

See also

Other object_detection_model: model_facenet, model_fasterrcnn, model_maskrcnn

Examples

if (FALSE) { # \dontrun{
library(magrittr)
norm_mean <- c(0.485, 0.456, 0.406) # ImageNet normalization constants
norm_std  <- c(0.229, 0.224, 0.225)

url <- paste0("https://upload.wikimedia.org/wikipedia/commons/thumb/",
              "e/ea/Morsan_Normande_vache.jpg/120px-Morsan_Normande_vache.jpg")
img <- base_loader(url) %>%
  transform_to_tensor() %>%
  transform_resize(c(520, 520))

input <- img %>% transform_normalize(norm_mean, norm_std)
batch <- input$unsqueeze(1)    # Add batch dimension: (1, 3, H, W)

# ConvNeXt Tiny detection
model <- model_convnext_tiny_detection(pretrained_backbone = TRUE)
model$eval()
# Inference can take two minutes or more on CPU
pred      <- model(batch)$detections[[1]]
num_boxes <- as.integer(pred$boxes$size()[1])

# The detection head is untrained, so boxes and labels are arbitrary;
# draw_bounding_boxes() may fail on degenerate box coordinates.
if (num_boxes > 0) {
  k      <- min(5L, num_boxes)
  topk   <- pred$scores$topk(k = k)[[2]]  # indices of the k highest scores
  boxes  <- pred$boxes[topk, ]
  labels <- as.character(as.integer(pred$labels[topk]))  # raw label ids
  boxed  <- draw_bounding_boxes(img, boxes, labels = labels)
  tensor_image_browse(boxed)
}
} # }