Draft-Target Speculative Decoding Simulation

LLM Inference & Memory Systems DS practice problem on Onlearn.

Difficulty: medium.

Topics: Draft-Target Speculative Decoding Simulation, Draft-Target Model Synchronization, Acceptance Rate Thresholding, Tree-based Attention Masking, Speculative Token Verification, KV Cache Recomputation Strategies, LLM Inference Optimization, Distributed Systems Architecture, Probabilistic Modeling, Computational Complexity, Hardware-Aware Programming, Speculative Decoding Algorithms, KV Cache Management, Parallel Sequence Generation, Model Quantization Techniques, Latency-Throughput Trade-offs.

Draft Target Speculative Decoding Simulation Speculative decoding is an inference acceleration technique for large language models. A small, fast draft model proposes a sequence of $K$ candidate tokens, then the large target model evaluates all candidates in a single forward pass. Tokens are accepted or rejected by comparing the two models' probability distributions. Implement speculative decode(draft probs, target probs, draft tokens, random coins) that simulates one round of the speculative decoding acceptance rejection loop. Inputs: draft probs: A list of $K$ numpy arrays, where draft probs[i] is the draft model's probability distribution over the vocabulary at position $i$ target probs: A list of $K+1$ numpy arrays, where target probs[i] is the target model's probability distribution at position $i$ (the extra $(K+1)$ th distribution is used for the bonus token) draft tokens: A list of $K$ integer token indices that the draft model sampled random coins: A list of $K$ float values drawn uniformly from $[0, 1)$, used for acceptance decisions Process: For each draft position $i$ from $0$ to $K 1$: Compute the acceptance probability for the drafted token Use random coins[i] to decide acceptance or rejection If rejected: compute an adjusted residual distribution from the difference between target and draft distributions, take the argmax of that adjusted distribution as the replacement token, and stop processing further positions If all $K$ tokens are accepted, select one additional bonus token from the target model's $(K+1)$ th distribution using argmax Output: Return a dictionary with: "accepted tokens": list of token indices produced this round (accepted draft tokens plus either the resampled token or the bonus token) "num generated": total number of tokens produced "acceptance rate": fraction of the $K$ draft tokens that were accepted, rounded to 4 decimal places