Analyzing Memory Fragmentation in LLM Serving
Data Pipelines, Monitoring & Reliability DS practice problem on Onlearn.
Difficulty: easy.
Topics: Analyzing Memory Fragmentation in LLM Serving, PagedAttention, External Fragmentation, Memory Compaction, Throughput Bottleneck Analysis, Tensor Parallelism Overhead, Distributed Systems, Memory Management, Performance Engineering, LLM Inference Optimization, System Observability, Dynamic Memory Allocation, KV Cache Management, Resource Scheduling, Latency Profiling, Concurrency Control.
When serving large language models, the key value (KV) cache for each request is allocated in fixed size memory blocks. As requests arrive and complete at different times, blocks are allocated and freed throughout the memory pool, creating a pattern of occupied and free regions. Over time, free blocks can become scattered across the pool rather than forming a single contiguous region this is memory fragmentation. Fragmentation is problematic because even if there are enough total free blocks to serve a new request, those blocks may be too scattered to allocate efficiently. Understanding fragmentation metrics helps system designers decide when to trigger compaction or adjust scheduling strategies. Given a memory pool represented as a list of block statuses (1 = allocated, 0 = free), write a function that computes four key metrics: utilization : the fraction of blocks that are currently allocated num free fragments : the number of contiguous groups of free blocks largest free fragment : the size (in blocks) of the largest contiguous free region fragmentation ratio : a value between 0 and 1 measuring how scattered the free memory is. When all free blocks are in one contiguous region, fragmentation is 0. When free blocks are maximally scattered, fragmentation approaches 1. If there are no free blocks, the fragmentation ratio is 0.0. Return all values in a dictionary. Round float values to 4 decimal places.