Copy-on-Write Memory Sharing for LLM Sampling

Infrastructure, Parallelism & Hardware Efficiency DS practice problem on Onlearn.

Difficulty: medium.

Topics: Copy-on-Write Memory Sharing for LLM Sampling, Copy-on-Write (CoW) Semantics, Page Fault Handling, Zero-Copy Data Transfer, Memory Overcommit, Reference Counting, Operating Systems, Distributed Systems, Memory Management, High-Performance Computing, LLM Inference Engines, Virtual Memory Addressing, Process Forking Mechanisms, Page Table Management, Kernel-Level Resource Allocation, KV Cache Optimization.

When serving large language models, multiple sampling sequences (e.g., best of n, parallel completions) often share the same prompt. A memory efficient technique allows these sequences to share the underlying key value cache blocks for the common prompt prefix. Physical memory blocks use reference counting: when a sequence needs to modify (append tokens to) a shared block, the block is copied first so other sequences are unaffected this is the copy on write (CoW) principle. Implement a function cow memory simulator that simulates block level memory management for parallel LLM sampling and computes memory usage statistics. The memory model works as follows: The KV cache is divided into fixed size blocks, each holding block size tokens. The prompt occupies ceil(prompt length / block size) logical blocks. Fully filled prompt blocks are shared across all sequences (1 physical block each, regardless of how many sequences reference them). If the last prompt block is partially filled, it is initially shared. When a sequence generates its first token, it must write into this block. If the block's reference count 1, a CoW copy is triggered (the sequence gets its own physical copy, and the original's reference count decreases by 1). If the reference count equals 1, the sequence writes in place with no copy. After filling any remaining space in the (possibly copied) last prompt block, each sequence's additional generated tokens go into new unshared blocks. The function should return a dictionary with: total physical blocks: total physical memory blocks allocated total logical blocks: sum of logical blocks across all sequences (each sequence sees ceil((prompt length + gen length i) / block size) blocks) shared prompt blocks: number of fully filled prompt blocks shared across all sequences cow events: number of CoW copy events that occurred new generation blocks: total new blocks allocated for generated tokens beyond the prompt region memory saved pct: percentage of memory saved compared to naive (no sharing) allocation, rounded to 2 decimal places