Attention Memory Traffic and FLOPs

LLM Inference & Memory Systems DS practice problem on Onlearn.

Difficulty: hard.

Topics: Understanding Attention Mechanism Computational Complexity and Memory Bandwidth, KV Cache Memory Footprint, Weight Matrix Read Traffic, Fused Kernel Efficiency, Attention Matrix FLOPs, FP16 vs FP32 Precision Impacts, Linear Algebra, Computer Architecture, Computational Complexity, Deep Learning Theory, Hardware Acceleration, Matrix Multiplication FLOPs, Memory Bandwidth Constraints, KV Cache Management, Transformer Decoder Architecture, Arithmetic Intensity.

Implement a function that calculates the total memory traffic (in bytes) and the FLOPs for a single transformer decoder layer while it generates one new token. Assume a model with hidden size 'd_model', 'n_heads' attention heads, and sequence length 'seq_len'. Assume FP16 precision (2 bytes per parameter). The memory traffic consists of reading the weights and reading the KV cache.
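One possible way to approach the problem is sketched below. It is not a reference solution, and it rests on several assumptions that the statement does not fix: only the attention sublayer is counted (four d_model x d_model projection matrices for Q, K, V, and the output, with no biases and no MLP), 'seq_len' is the number of cached positions the new token attends to, a multiply-add counts as 2 FLOPs, and softmax, activation traffic, and the write of the new K/V entry are ignored. The names decode_layer_cost and DecodeCost are hypothetical.

```python
from typing import NamedTuple

BYTES_PER_VALUE_FP16 = 2  # FP16 stores each value in 2 bytes


class DecodeCost(NamedTuple):
    memory_bytes: int  # bytes read from memory for one decode step
    flops: int         # floating-point operations for one decode step


def decode_layer_cost(
    d_model: int,
    n_heads: int,
    seq_len: int,
    bytes_per_value: int = BYTES_PER_VALUE_FP16,
) -> DecodeCost:
    """Estimate read traffic and FLOPs for one attention layer, one new token.

    Assumptions (not given in the problem statement): attention sublayer only,
    no biases, no MLP, softmax and activation traffic ignored.
    """
    if d_model <= 0 or n_heads <= 0 or seq_len < 0:
        raise ValueError("d_model and n_heads must be positive, seq_len non-negative")
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")

    # Weights: W_q, W_k, W_v, W_o, each d_model x d_model.
    weight_values = 4 * d_model * d_model

    # KV cache: keys and values for seq_len positions. Summed over heads,
    # n_heads * d_head == d_model, so the head count cancels out of the total.
    kv_values = 2 * seq_len * d_model

    memory_bytes = (weight_values + kv_values) * bytes_per_value

    # Projections: one matrix-vector product per weight matrix,
    # 2 FLOPs (multiply + add) per weight.
    projection_flops = 2 * weight_values

    # Attention: q . K^T and the weighted sum over V each cost
    # 2 * seq_len * d_model FLOPs, i.e. 2 FLOPs per cached value.
    attention_flops = 2 * kv_values

    return DecodeCost(memory_bytes, projection_flops + attention_flops)
```

Under these assumptions, decode_layer_cost(4096, 32, 2048) gives 167,772,160 bytes and 167,772,160 FLOPs, an arithmetic intensity of about 1 FLOP per byte, which suggests why single-token decoding tends to be limited by memory bandwidth rather than compute.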