Estimate KV Cache Size from Model Config
LLM Inference & Memory Systems DS practice problem on Onlearn.
Difficulty: medium.
Topics: Understanding LLM Memory Requirements for Inference, Attention Key-Value Tensors, Memory Footprint Calculation, Multi-Head Attention Scaling, FP16 vs BF16 vs FP32 Memory Usage, Active Context Window, Machine Learning Infrastructure, Deep Learning Systems, Transformer Architecture, Computer Memory Management, Hardware Resource Optimization, LLM Inference Optimization, KV Cache Mechanics, Tensor Memory Layout, Model Parallelism Memory Overhead, Floating Point Precision.
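Given a Transformer model configuration (number of layers, hidden size, number of attention heads, and precision in bytes per parameter), write a function that calculates the total memory (in GB) required for the KV cache at a specific sequence length. Assume the KV cache stores both keys and values for all layers and all heads. The formula is: 2 × num_layers × sequence_length × hidden_size × bytes_per_param. The factor of 2 accounts for storing both keys and values. Under standard multi-head attention, hidden_size = num_heads × head_dim, so the number of heads is already included in hidden_size and does not appear as a separate factor.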
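
One possible solution is sketched below in Python. The function name `kv_cache_size_gb`, its parameter names, and the `bytes_per_gb` argument are illustrative choices, not part of the problem statement. The sketch assumes a single sequence (batch size 1), standard multi-head attention where the hidden size divides evenly across heads, and 1 GB = 1024^3 bytes; the problem does not say which GB convention is intended, so pass `bytes_per_gb=10**9` if decimal gigabytes are expected.

```python
def kv_cache_size_gb(num_layers, hidden_size, num_heads, bytes_per_param,
                     seq_len, bytes_per_gb=1024**3):
    """Estimate KV cache memory in GB for one sequence of length seq_len.

    Assumption: 1 GB = 1024**3 bytes by default (override with bytes_per_gb).
    Assumption: batch size is 1 and every head caches full keys and values.
    """
    if min(num_layers, hidden_size, num_heads, bytes_per_param) <= 0 or seq_len < 0:
        raise ValueError("configuration values must be positive")
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")

    # Per-head width; num_heads * head_dim recovers hidden_size, which is
    # why the head count cancels out of the final formula.
    head_dim = hidden_size // num_heads

    # 2 tensors (keys and values) per layer, one vector per token per head.
    total_bytes = 2 * num_layers * seq_len * num_heads * head_dim * bytes_per_param
    return total_bytes / bytes_per_gb


# Example: 32 layers, hidden size 4096, 32 heads, 2-byte (FP16/BF16) values,
# 2048 tokens of context.
print(kv_cache_size_gb(32, 4096, 32, 2, 2048))  # 1.0
```

With the same configuration at 4 bytes per parameter (FP32), the estimate doubles to 2.0 GB, since the cache size scales linearly with precision, layer count, and sequence length.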