N-gram Speculation Dictionary Construction and Lookup
LLM Inference & Memory Systems DS practice problem on Onlearn.
Difficulty: medium.
Topics: N-gram Speculation Dictionary Construction and Lookup, Hash-based Lookup Tables, Memory Fragmentation, Multi-Head Self-Attention, Weight Pruning, KV Cache Eviction Policies, Speculative Decoding, KV Cache Management, Transformer Architecture, Hardware Acceleration, Distributed Inference, N-gram Draft Models, Paged Attention, Attention Mechanism, Tensor Parallelism, Quantization Techniques.
In speculative decoding for large language models, one lightweight approach to generating draft tokens is to use n gram statistics from the already generated token sequence. Instead of employing a separate draft model, we can build an n gram frequency table from the token history and use it to predict the most likely next tokens. Implement a function ngram speculation that: 1. Takes a sequence of token IDs and builds an n gram frequency table. Each n gram consists of an (n 1) token prefix and a single next token. 2. For a given context (a tuple of n 1 token IDs), computes the empirical probability distribution over possible next tokens. 3. Returns the top k most probable next token candidates as a list of (token id, probability) tuples. The results should be sorted by probability in descending order. If two candidates have the same probability, sort them by token ID in ascending order (to ensure deterministic tie breaking, as would be needed in a real system). If the context is not found in the n gram table, return an empty list. If fewer than k candidates exist, return all available candidates. Use only pure Python (no NumPy required).