Speculative Decoding Acceptance Rate vs Temperature

LLM Inference & Memory Systems DS practice problem on Onlearn.

Difficulty: medium.

Topics: Speculative Decoding Acceptance Rate vs Temperature, Acceptance Rate, Temperature Scaling, Draft Model Latency, Target Model Logits, Rejection Sampling, Large Language Model Inference, Probabilistic Modeling, Computational Complexity, Distributed Systems, Information Theory, Speculative Decoding Algorithms, Sampling Strategies, KV Cache Management, Latency Optimization, Stochastic Processes.

In speculative decoding, a fast draft model proposes tokens, and a slower target model verifies them using rejection sampling. A proposed token x is accepted with probability min(1, q(x)/p(x)), where p(x) is the draft model's probability and q(x) is the target model's probability. Because x is sampled from p, the expected acceptance rate over the vocabulary is the sum over x of p(x) * min(1, q(x)/p(x)), which simplifies to the sum of the element-wise minimum of the two distributions: the sum over x of min(p(x), q(x)).

Temperature scaling divides the logits by T before the softmax is applied. This makes each distribution more peaked (small T) or flatter (large T), and therefore changes the acceptance rate.

Implement a function that takes the draft model logits, the target model logits, and an array of temperature values, and computes the expected acceptance rate for each temperature. For each temperature T, compute both distributions as softmax(logits / T), then sum the element-wise minimums of the draft and target probabilities. Return a list of acceptance rates, each rounded to 4 decimal places.
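One possible way to approach the task described above is sketched here in plain Python. The function names `softmax_with_temperature` and `acceptance_rates` are illustrative assumptions, not names required by the problem. The sketch also assumes that both logit lists cover the same vocabulary and that every temperature is strictly positive.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        # Assumption: non-positive temperatures are treated as invalid input.
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating to avoid overflow.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def acceptance_rates(draft_logits, target_logits, temperatures):
    """Expected acceptance rate sum_x min(p(x), q(x)) for each temperature."""
    if len(draft_logits) != len(target_logits):
        raise ValueError("draft and target logits must have the same length")
    rates = []
    for t in temperatures:
        p = softmax_with_temperature(draft_logits, t)   # draft distribution
        q = softmax_with_temperature(target_logits, t)  # target distribution
        overlap = sum(min(pi, qi) for pi, qi in zip(p, q))
        rates.append(round(overlap, 4))
    return rates
```

As a sanity check, identical draft and target logits give an acceptance rate of 1.0 at any temperature, since the two distributions coincide. As T grows very large, both distributions approach uniform, so the rate tends toward 1 regardless of the logits.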