Chunked Prefill Scheduling Alongside Decode
LLM Inference & Memory Systems DS practice problem on Onlearn.
Difficulty: hard.
Topics: Understanding Chunked Prefill Scheduling in LLM Inference, KV Cache Management, Head-of-Line Blocking, Token Interleaving, Prefill-Decode Ratio, Context Window Partitioning, Computer Architecture, Operating Systems, Distributed Systems, Parallel Computing, Performance Engineering, Queue Management, Resource Scheduling, Memory Management, Latency Optimization, Throughput Analysis.
Implement a simple scheduler that manages a queue of requests, each with a prompt length and an output length. Given a fixed chunk size, simulate scheduling in which each request's prompt (prefill) tokens are processed in chunks of at most that size and interleaved with decode steps. Return the sequence of operations performed (e.g., 'Prefill' or 'Decode') until all requests are completed.
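
The statement leaves the exact interleaving policy open, so the following is only one possible sketch, not a reference solution. It assumes that requests are (prompt_length, output_length) pairs served in FIFO order, that each scheduler iteration runs at most one prefill chunk (for the oldest request that still has prompt tokens) followed by at most one decode step, and that a decode step produces one token for every request whose prefill has finished, including one that finished in the same iteration. The function name simulate_chunked_prefill and these policy choices are illustrative assumptions.

```python
from collections import deque
from typing import List, Tuple


def simulate_chunked_prefill(requests: List[Tuple[int, int]], chunk_size: int) -> List[str]:
    """Return the 'Prefill'/'Decode' operation sequence under the assumed policy."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    # Requests that still need prefill, in arrival order: [prompt_left, output_len].
    prefill_queue = deque([prompt, out] for prompt, out in requests)
    # Remaining output tokens for each request that has finished prefill.
    decoding: List[int] = []
    ops: List[str] = []

    while prefill_queue or decoding:
        if prefill_queue:
            head = prefill_queue[0]
            if head[0] > 0:
                # Process at most one chunk of the oldest pending prompt.
                head[0] -= min(chunk_size, head[0])
                ops.append("Prefill")
            if head[0] == 0:
                # Prefill finished: move the request to the decode set.
                prefill_queue.popleft()
                if head[1] > 0:
                    decoding.append(head[1])

        if decoding:
            # One decode step emits one token for every decoding request.
            ops.append("Decode")
            decoding = [left - 1 for left in decoding if left > 1]

    return ops


if __name__ == "__main__":
    # A short request starts decoding while a longer prompt is still being prefilled.
    print(simulate_chunked_prefill([(2, 3), (6, 1)], chunk_size=4))
```

Under this policy a long prompt never stalls ongoing decodes for more than one chunk, which is the head-of-line blocking the topic list refers to. A different policy (for example, batching a prefill chunk and decode tokens into a single combined step) would produce a different operation sequence.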