Multi-term Memory Patchification for Video

Detection, Video & Advanced Vision DS practice problem on Onlearn.

Difficulty: medium.

Topics: Multi-term Memory Patchification for Video, Patch Embedding, Temporal Shift Module, Long Short-Term Memory, Positional Encoding, Cross-Frame Attention, Computer Vision, Deep Learning, Signal Processing, Information Theory, Computational Complexity, Video Representation Learning, Attention Mechanisms, Temporal Modeling, Feature Extraction, Memory Architectures.

Autoregressive video generation models compress historical frames into a fixed-size token budget before passing them as context to each new denoising step. A naive approach compresses all history uniformly, but this discards the useful intuition that recent frames are more relevant to the next chunk than older ones. Multi-term memory patchification addresses this by dividing the history into several temporal terms, each covering a different span of past frames. More recent terms use finer spatial and temporal strides (preserving detail), while older terms use coarser strides (aggressive compression). Each term is independently downsampled and then patchified into tokens, and the tokens of all terms are concatenated to form the full history context.

Write a function multi_term_memory_patchification that, given a list of memory term specifications and the spatial dimensions of the VAE latent, computes the token count for each term and the total.

The function should accept the following parameters:

terms (list of dict): one dictionary per memory term, each with four keys:
    num_latent_frames: the number of VAE latent frames covered by this term
    temporal_stride: the factor by which the latent frames are downsampled along the time axis
    spatial_stride: the factor by which the latent height and width are downsampled
    patch_size: the spatial patch size used to tokenize this term after spatial downsampling
latent_h (int): the height of the VAE latent (already VAE-compressed, before any per-term downsampling)
latent_w (int): the width of the VAE latent (same convention as latent_h)

For each term, compute the compressed dimensions using ceiling division at every step. First apply the temporal stride to the frame count and the spatial stride to the latent height and width to get the compressed latent dimensions. Then apply the patch size to the compressed height and width to get the token grid. Finally, multiply the resulting dimensions to get the token count for that term.

The function should return a dictionary with the following keys:

tokens_per_term (list of int): token count for each term, in input order
total_tokens (int): sum of all per-term token counts
token_fractions (list of float): each term's share of total tokens, rounded to 4 decimal places
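One possible solution is sketched below in Python. It rests on a few assumptions that the problem statement does not spell out: the dictionary keys and return keys are written with underscores (num_latent_frames, tokens_per_term, and so on), the patch size applies only to height and width (so the token count is compressed frames times grid height times grid width), and an empty or zero-token input yields fractions of 0.0 rather than raising an error. The helper ceil_div and the example term values are hypothetical and used only for illustration.

```python
from typing import Dict, List, Union


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division without going through floats."""
    return -(-a // b)


def multi_term_memory_patchification(
    terms: List[Dict[str, int]], latent_h: int, latent_w: int
) -> Dict[str, Union[int, List[int], List[float]]]:
    tokens_per_term = []
    for term in terms:
        # Step 1: apply the temporal and spatial strides (ceiling division).
        frames = ceil_div(term["num_latent_frames"], term["temporal_stride"])
        height = ceil_div(latent_h, term["spatial_stride"])
        width = ceil_div(latent_w, term["spatial_stride"])

        # Step 2: patchify the downsampled latent spatially.
        # Assumption: the patch size does not apply along the time axis.
        grid_h = ceil_div(height, term["patch_size"])
        grid_w = ceil_div(width, term["patch_size"])

        # Step 3: one token per (frame, patch row, patch column).
        tokens_per_term.append(frames * grid_h * grid_w)

    total_tokens = sum(tokens_per_term)

    # Assumption: with no tokens at all, report 0.0 shares instead of dividing by zero.
    token_fractions = [
        round(n / total_tokens, 4) if total_tokens else 0.0
        for n in tokens_per_term
    ]

    return {
        "tokens_per_term": tokens_per_term,
        "total_tokens": total_tokens,
        "token_fractions": token_fractions,
    }


# Hypothetical example: three terms over a 30 x 52 latent, newest term first.
example_terms = [
    {"num_latent_frames": 2, "temporal_stride": 1, "spatial_stride": 1, "patch_size": 2},
    {"num_latent_frames": 8, "temporal_stride": 2, "spatial_stride": 2, "patch_size": 2},
    {"num_latent_frames": 21, "temporal_stride": 4, "spatial_stride": 4, "patch_size": 2},
]
example = multi_term_memory_patchification(example_terms, latent_h=30, latent_w=52)
print(example)
# {'tokens_per_term': [780, 416, 168], 'total_tokens': 1364,
#  'token_fractions': [0.5718, 0.305, 0.1232]}
```

In this example the most recent term covers only 2 of the 31 latent frames yet takes about 57% of the token budget, while the oldest term covers 21 frames with about 12%. For the oldest term the arithmetic is ceil(21/4) = 6 frames, ceil(30/4) = 8 and ceil(52/4) = 13 for the compressed latent, then ceil(8/2) = 4 and ceil(13/2) = 7 for the token grid, giving 6 x 4 x 7 = 168 tokens. Because each fraction is rounded independently, the rounded fractions are not guaranteed to sum to exactly 1.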
