Preemption Strategies: Swapping vs Recomputation
LLM Inference & Memory Systems DS practice problem on Onlearn.
Difficulty: medium.
Topics: Preemption Strategies: Swapping vs Recomputation, KV Cache Eviction, Checkpointing Overhead, Page Fault Latency, Memory Bandwidth Bottleneck, Activation Recomputation, Distributed Systems, Memory Management, Computational Complexity, Operating Systems, Deep Learning Infrastructure, Cache Coherency Protocols, Virtual Memory Paging, Dynamic Programming Algorithms, Resource Scheduling Policies, Tensor Parallelism Strategies.
In large model serving systems, when GPU memory is exhausted, running requests may need to be preempted to free resources for higher priority work. Two common strategies exist for handling the intermediate state (KV cache) of preempted requests: 1. Swapping : Transfer the KV cache to CPU memory and reload it later. The cost depends on the cache size and memory bandwidth between GPU and CPU. 2. Recomputation : Discard the KV cache entirely and recompute it from scratch when the request resumes. The cost depends on the number of tokens and model compute requirements. A key insight is that swapping cost scales linearly with sequence length (proportional to cache size), while recomputation cost includes a quadratic component from the attention mechanism (each token attends to all previous tokens). This means short sequences may favor recomputation, while long sequences may favor swapping. Implement a function that, given a list of requests (each identified by its current token count) and system configuration parameters, evaluates both preemption strategies for each request and selects the cheaper one. The function should return a dictionary containing: 'strategies': list of chosen strategy strings ('swap' or 'recompute') for each request 'swap costs ms': list of swap costs in milliseconds (rounded to 4 decimal places) 'recompute costs ms': list of recompute costs in milliseconds (rounded to 4 decimal places) 'total cost ms': sum of chosen costs across all requests (rounded to 4 decimal places) When costs are exactly equal, prefer swapping.