VLM Visual Token Count from Image Resolution and Patch Size

Detection, Video & Advanced Vision DS practice problem on Onlearn.

Difficulty: medium.

Topics: Understanding Visual Tokenization in Vision-Language Models (VLMs), Patch Size (P), Token Sequence Length, CLS Token, Spatial Grid Decomposition, Feature Map Flattening, Computer Vision, Deep Learning Architectures, Linear Algebra, Representation Learning, Multimodal Machine Learning, Vision Transformers (ViT), Patch Embedding, Self-Attention Mechanisms, Image Preprocessing, Sequence Modeling.

Implement a function that computes the total number of visual tokens produced by a Vision Transformer (ViT) based VLM. Given the input image resolution (height H, width W), the patch size P, and a boolean indicating whether a global classification/feature token (CLS token) is prepended, return the total token count: (H / P) x (W / P) patch tokens, plus 1 if the CLS token is present. Assume H and W are both exact multiples of P.
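One possible solution is sketched below in Python. The function name count_visual_tokens and its parameter names are illustrative assumptions, not part of the problem statement. The divisibility check is also an addition: the problem guarantees exact divisibility, so the check only guards against misuse.

```python
def count_visual_tokens(height: int, width: int, patch_size: int,
                        use_cls_token: bool) -> int:
    """Return the number of visual tokens a ViT-style encoder produces.

    Hypothetical helper: the name and signature are assumptions made
    for illustration only.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be a positive integer")
    # The problem guarantees divisibility; this check is a defensive extra.
    if height % patch_size != 0 or width % patch_size != 0:
        raise ValueError("height and width must be multiples of patch_size")

    # Split the image into a grid of non-overlapping P x P patches.
    grid_rows = height // patch_size
    grid_cols = width // patch_size

    # Flattening the grid gives one token per patch.
    patch_tokens = grid_rows * grid_cols

    # A prepended CLS token adds exactly one more position to the sequence.
    return patch_tokens + (1 if use_cls_token else 0)


if __name__ == "__main__":
    # 224 x 224 image with 16 x 16 patches: a 14 x 14 grid plus the CLS token.
    print(count_visual_tokens(224, 224, 16, True))
```

For a 224 x 224 image with patch size 16, the grid is 14 x 14, giving 196 patch tokens, or 197 with the CLS token. Non-square inputs work the same way because the row and column counts are computed independently.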