EAGLE-Style Draft Model from Hidden States

LLM Inference & Memory Systems DS practice problem on Onlearn.

Difficulty: medium.

Topics: EAGLE-Style Draft Model from Hidden States, EAGLE-style Feature Projection, Top-K Logit Sampling, KV Cache Quantization, Hidden State Residual Connection, Draft-to-Target Verification, Transformer Architecture, LLM Inference Optimization, Sequence Modeling, Memory Management Systems, Probabilistic Decoding, Speculative Decoding, Hidden State Representation, KV Cache Compression, Draft Model Distillation, Autoregressive Generation.

EAGLE-Style Draft Token Generation from Hidden States

In speculative decoding, a lightweight draft model proposes candidate tokens that are later verified by a larger target model. The EAGLE (Extrapolation Algorithm for Greater Language model Efficiency) approach is notable because its draft model operates on the target model's hidden states rather than raw token logits, capturing richer contextual information.

Your Task

Implement eagle_draft_forward, which autoregressively generates draft tokens using a simple two-layer draft head architecture.

The draft model takes as input:

- The last hidden state from the target model
- The last accepted token ID
- Learned weight matrices for the draft head

At each step, the draft model:

1. Looks up the token embedding for the current token
2. Concatenates the current hidden state with the token embedding
3. Passes the result through a fusion layer (linear + ReLU activation)
4. Passes the result through a draft head layer (linear + ReLU activation)
5. Projects the resulting hidden state to vocabulary logits using a language model head
6. Selects the next token via greedy decoding (argmax)
7. Uses the new hidden state and selected token for the next iteration

Return the list of generated draft token IDs.

Shapes

- hidden_state: $(d,)$, hidden dimension
- embed_matrix: $(V, d)$, vocabulary embeddings
- fc_fuse_weight: $(d, 2d)$, fusion layer weight
- fc_fuse_bias: $(d,)$, fusion layer bias
- draft_head_weight: $(d, d)$, draft head weight
- draft_head_bias: $(d,)$, draft head bias
- lm_head_weight: $(V, d)$, output projection

where $d$ is the hidden dimension and $V$ is the vocabulary size.
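
The seven steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference solution. The statement does not say how many draft tokens to produce, so the `num_steps` parameter is hypothetical. The sketch also assumes that the concatenation places the hidden state before the token embedding, and that the language model head has no bias. The toy weights in the usage example are made up for illustration.

```python
import numpy as np


def eagle_draft_forward(hidden_state, token_id, embed_matrix,
                        fc_fuse_weight, fc_fuse_bias,
                        draft_head_weight, draft_head_bias,
                        lm_head_weight, num_steps):
    """Greedy draft loop over the two-layer draft head.

    num_steps is an assumed parameter (number of draft tokens to generate);
    the problem statement does not name it.
    """
    h = np.asarray(hidden_state, dtype=np.float64)   # (d,)
    token = int(token_id)
    draft_tokens = []
    for _ in range(num_steps):
        emb = embed_matrix[token]                    # step 1: embedding lookup, (d,)
        fused_in = np.concatenate([h, emb])          # step 2: [hidden; embedding], (2d,)
        fused = np.maximum(fc_fuse_weight @ fused_in + fc_fuse_bias, 0.0)  # step 3
        h = np.maximum(draft_head_weight @ fused + draft_head_bias, 0.0)   # step 4
        logits = lm_head_weight @ h                  # step 5: (V,)
        token = int(np.argmax(logits))               # step 6: greedy selection
        draft_tokens.append(token)                   # step 7: h and token feed the next step
    return draft_tokens


# Toy example with d = 2 and V = 3 (weights chosen by hand, not learned).
embed_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
fc_fuse_weight = np.hstack([np.eye(2), np.eye(2)])   # computes hidden + embedding
fc_fuse_bias = np.zeros(2)
draft_head_weight = np.array([[0.0, 1.0], [1.0, 0.0]])  # swaps the two coordinates
draft_head_bias = np.zeros(2)
lm_head_weight = np.array([[1.0, 0.0], [0.0, 1.0], [0.4, 0.4]])

tokens = eagle_draft_forward(np.array([1.0, 0.0]), 0, embed_matrix,
                             fc_fuse_weight, fc_fuse_bias,
                             draft_head_weight, draft_head_bias,
                             lm_head_weight, num_steps=3)
print(tokens)  # [1, 0, 1]
```

Note that the hidden state carried into the next iteration is the output of the draft head layer, not the fusion layer output, and that `np.argmax` breaks ties by returning the lowest index.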