Disaggregated Prefill-Decode Serving Simulator

LLM Inference & Memory Systems DS practice problem on Onlearn.

Difficulty: hard.

Topics: Understanding Disaggregated LLM Serving Architectures, KV Cache Management, Time-To-First-Token (TTFT), Inter-Token Latency (ITL), Batch Processing Efficiency, Hardware Resource Partitioning, Computer Architecture, Distributed Systems, Performance Modeling, Queueing Theory, Parallel Computing, LLM Inference Pipeline, Resource Disaggregation, Prefill vs Decode Latency, Compute-Memory Boundness, Scheduling Algorithms.

Implement a simulator for a disaggregated LLM serving system consisting of a Prefill Engine and a Decode Engine. Requests arrive at the Prefill Engine at a certain rate, and each takes 'prefill_time' to process. Once its prefill finishes, a request moves to the Decode Engine, which generates one token at a time at a cost of 'decode_time' per token. Implement a function 'simulate_serving(num_requests, prefill_time, decode_time, tokens_per_request)' that returns the total time taken to finish all requests, assuming infinite parallel capacity for the Prefill Engine but a sequential bottleneck at the Decode Engine (one token batch at a time).
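One possible reading of the problem can be sketched as follows. This is a minimal sketch, not a reference solution, and it rests on two assumptions that the statement leaves open: the function signature carries no arrival rate, so every request is assumed to arrive at time 0, and "one token batch at a time" is taken to mean that each decode step produces a single token for a single request, so decode work from different requests never overlaps.

```python
def simulate_serving(num_requests, prefill_time, decode_time, tokens_per_request):
    """Return the total time to finish all requests under the assumptions above."""
    if num_requests <= 0:
        return 0.0

    # Assumption: all requests arrive at t = 0. With unlimited prefill
    # parallelism, every prefill runs concurrently and finishes together.
    prefill_done = [float(prefill_time)] * num_requests

    # Assumption: the decode engine serves one token for one request per step,
    # so each request occupies the engine for tokens_per_request * decode_time.
    decode_free_at = 0.0
    for ready_at in prefill_done:
        start = max(ready_at, decode_free_at)  # wait for prefill and for the engine
        decode_free_at = start + tokens_per_request * decode_time

    return decode_free_at


if __name__ == "__main__":
    print(simulate_serving(3, 2.0, 0.5, 4))
```

Under these assumptions the result reduces to prefill_time + num_requests * tokens_per_request * decode_time. If a decode step is instead taken to emit one token for every active request at once, the total would be prefill_time + tokens_per_request * decode_time; which reading the grader expects is not stated in the problem.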
