Speculative Decoding End-to-End Simulation
LLM Inference & Memory Systems DS practice problem on Onlearn.
Difficulty: hard.
Topics: Speculative Decoding End-to-End Simulation, Acceptance Rate Thresholding, Tree-based Attention Masking, Draft-Target Model Alignment, Speculative Token Verification, KV Cache Paging Strategies, LLM Inference Optimization, Distributed Systems Architecture, Probabilistic Modeling, Computational Complexity Theory, Hardware-Aware Programming, Speculative Decoding Algorithms, KV Cache Management, Parallel Sequence Generation, Draft Model Distillation, Memory Bandwidth Bottlenecks.
Implement a full end to end simulation of Speculative Decoding , an inference acceleration technique for large language models. Speculative decoding uses a small, fast draft model to propose a sequence of K tokens, then uses a large, accurate target model to verify them all in a single forward pass. The key insight is that verifying K tokens in parallel is much cheaper than generating them one by one with the large model. Your function should simulate the complete speculative decoding loop: 1. You are given K draft token proposals along with both the draft model's and target model's probability distributions at each position. 2. For each proposed token (from left to right), determine whether it should be accepted or rejected based on comparing the two models' probabilities. 3. If a token is rejected at position i, no further draft tokens are considered. Instead, a corrected token is sampled from a modified distribution derived from the difference between the target and draft distributions at that position. 4. If all K draft tokens are accepted, a bonus token is sampled from the target model's distribution at position K+1. The function should use provided random values (for reproducibility) to make all stochastic decisions. Specifically: accept rand[i] is used for the accept/reject decision at position i sample rand[i] (for i < K) is used to sample from the adjusted distribution if rejection occurs at position i sample rand[K] is used to sample the bonus token if all K tokens are accepted To sample a token from a probability distribution given a uniform random value u, use the inverse CDF method: find the smallest index whose cumulative probability exceeds u. Return the list of final token indices produced by the decoding process.