@Aure20 There is no interpolation or normalization of image inputs in the model; that is done by the preprocessing pipeline, and there are metadata attributes for each pretrained model that indicate which normalization mean/std each set of weights needs.

Fully convolutional networks work with any image size, although you run into issues if you go below the network's total reduction (stride), which is usually 32. These models do not need a specific image size set at creation and have no argument for it. However, if you deviate too far from the original training size, performance can drop significantly.

ViT and ViT-hybrid models can have constraints on the image size; the position embeddings in particular are initialized for a specific size. To work at a different size, the model needs to be passed the image size (`img_size`) at creation.
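For concreteness, a minimal sketch (assuming a recent timm version; `resolve_data_config`/`create_transform` and the `img_size` argument are the usual timm API for this) of how the pretrained metadata drives preprocessing, and how a ViT can be created for a non-default size:

```python
import timm
import torch
from timm.data import resolve_data_config, create_transform

# Convnet: no size argument needed at creation; the expected preprocessing
# (input size, interpolation, normalization mean/std) lives in the weight metadata.
model = timm.create_model('tf_efficientnetv2_b0.in1k', pretrained=True)
config = resolve_data_config({}, model=model)
transform = create_transform(**config)  # PIL image -> normalized float tensor
print(config['input_size'], config['mean'], config['std'])

# ViT: position embeddings are initialized for a specific size, so pass the
# desired image size at creation and the embeddings are resized to match.
vit = timm.create_model('vit_base_patch16_224.augreg_in21k_ft_in1k',
                        pretrained=True, img_size=384)
out = vit(torch.randn(1, 3, 384, 384))
```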
-
I am currently trying different types of backbones for image classification (ECA-NFNet-L0, tf_efficientnetv2_b0.in1k, efficientvit_b1.r288_in1k). I have been passing images whose shapes differ from the ones the models expect, and the models accept them without any issue, so I assume there is some interpolation going on under the hood. Is this correct? (And if the image already has the correct shape, can I assume the interpolation layer is skipped?) Moreover, how do I know, for a specific backbone, whether I need to manually normalize the inputs according to ImageNet statistics, or pass values in the [0, 1] range, or in the [0, 255] range? Bonus question: if there is an interpolation/normalization layer and I convert the model to ONNX for faster inference, are those layers exported as well?
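For the normalization part, a small sketch (assuming timm >= 0.6, where each model carries its weight metadata in `model.pretrained_cfg`) of checking what a given backbone expects, and of an ONNX export; since resizing/normalization live in the preprocessing pipeline rather than in the model, they are not part of the exported graph:

```python
import timm
import torch

model = timm.create_model('efficientvit_b1.r288_in1k', pretrained=True).eval()

# Inputs are expected as float tensors scaled to [0, 1] and then normalized
# with these mean/std values; the model itself contains no normalization layer.
cfg = model.pretrained_cfg
print(cfg['input_size'], cfg['mean'], cfg['std'])

# ONNX export traces only the model, so the preprocessing (resize + normalize)
# must be reproduced on the inference side.
dummy = torch.randn(1, *cfg['input_size'])  # e.g. (1, 3, 288, 288)
torch.onnx.export(model, dummy, 'backbone.onnx', opset_version=13)
```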