Latent Space Patchification for Diffusion Transformers

Detection, Video & Advanced Vision DS practice problem on Onlearn.

Difficulty: hard.

Topics: Understanding Latent Space Patchification in DiT (Diffusion Transformers), Patch Embedding, 2D Positional Encoding, Einstein Summation (einsum), Latent Space Discretization, Sequence Length Reduction, Deep Learning, Computer Vision, Linear Algebra, Transformer Architectures, Generative Modeling, Diffusion Models, Vision Transformers (ViT), Latent Representation Learning, Tensor Reshaping, Embedding Layers.

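Implement a 'LatentPatchifier' class that takes a latent tensor of shape (B, C, H, W) and converts it into a sequence of non-overlapping patches suitable for a Diffusion Transformer. The patch size is (p, p), with H and W assumed divisible by p, so each latent yields N = (H/p) * (W/p) patches of C * p * p values each. After patchification, apply a linear projection that maps each flattened patch to a hidden dimension 'd', producing a tensor of shape (B, N, d). Include a method that adds learnable positional embeddings to this sequence.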
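
One possible shape for a solution is sketched below. It is a minimal sketch, not a reference answer. It assumes PyTorch, a latent size fixed at construction time, and H and W divisible by p. The constructor arguments (in_channels, patch_size, hidden_dim, latent_size), the method names patchify and add_positional_embedding, and the truncated-normal initialisation with std 0.02 are all illustrative choices that the problem statement does not prescribe.

```python
import torch
import torch.nn as nn


class LatentPatchifier(nn.Module):
    """Sketch: split a (B, C, H, W) latent into p x p patches, project to d."""

    def __init__(self, in_channels, patch_size, hidden_dim, latent_size):
        super().__init__()
        h, w = latent_size  # assumed fixed (H, W) of the incoming latents
        if h % patch_size or w % patch_size:
            raise ValueError("H and W must be divisible by the patch size")
        self.p = patch_size
        self.num_patches = (h // patch_size) * (w // patch_size)
        # Linear projection from a flattened patch (C * p * p) to d.
        self.proj = nn.Linear(in_channels * patch_size * patch_size, hidden_dim)
        # One learnable embedding per patch position, shared across the batch.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, hidden_dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)  # assumed init scheme

    def patchify(self, x):
        """(B, C, H, W) -> (B, N, C * p * p), patches in row-major grid order."""
        b, c, h, w = x.shape
        p = self.p
        if h % p or w % p:
            raise ValueError("H and W must be divisible by the patch size")
        gh, gw = h // p, w // p
        # Split each spatial axis into (grid index, offset inside the patch).
        x = x.reshape(b, c, gh, p, gw, p)
        # Move the grid axes forward; einsum is used here purely as a permute.
        x = torch.einsum("bchpwq->bhwcpq", x)
        return x.reshape(b, gh * gw, c * p * p)

    def add_positional_embedding(self, tokens):
        """Add the learnable (1, N, d) embeddings to a (B, N, d) sequence."""
        if tokens.shape[1] != self.num_patches:
            raise ValueError("sequence length does not match the configured grid")
        return tokens + self.pos_embed

    def forward(self, x):
        tokens = self.proj(self.patchify(x))  # (B, N, d)
        return self.add_positional_embedding(tokens)
```

Under these assumptions, a latent of shape (2, 4, 32, 32) with p = 2 gives N = 16 * 16 = 256 tokens per sample, each built from 4 * 2 * 2 = 16 values before projection. Fixing the latent size up front keeps the positional table a plain parameter; supporting variable resolutions would need a different scheme (for example, interpolating the embeddings), which this sketch does not attempt.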