Beam Search with Memory-Efficient Block Sharing

LLM Inference & Memory Systems DS practice problem on Onlearn.

Difficulty: medium.

Topics: Beam Search with Memory-Efficient Block Sharing, KV Cache Paging, Logit Normalization, Beam Width, Prefix Sharing, Attention Head Pruning, Natural Language Processing, Distributed Systems, Computer Architecture, Information Theory, Algorithm Design, Sequence Generation, Memory Management, Transformer Decoding, Cache Coherency, Heuristic Search.

Task: Implement Beam Search with Memory Efficient Block Sharing In standard beam search, each beam candidate stores its own complete token sequence, leading to redundant memory usage when beams share common prefixes. A more memory efficient approach stores tokens in fixed size blocks that can be shared across beams via reference counting, with copy on write semantics when a shared block needs modification. Function Signature Write a function beam search block sharing(log probs, beam width, block size, eos token= 1) that takes: log probs: A numpy array of shape (max steps, vocab size) containing log probabilities for each token at each decoding step. beam width: Number of beams to maintain at each step. block size: Maximum number of tokens stored per memory block. eos token: End of sequence token id. If 1, no early stopping is applied. Block Memory System Tokens are stored in blocks of up to block size tokens each. Each block has a reference count tracking how many beams reference it. When a beam is extended from a parent beam, the child shares the parent's block references rather than copying all tokens. When appending a token to a shared block (reference count 1), a copy on write operation creates a new block with the existing content plus the new token. When a block is full, a new block is allocated for the next token. Output Return a dictionary with: 'sequences': List of token id lists for the top beam width sequences (ordered by descending score). 'scores': List of cumulative log probability scores rounded to 4 decimal places. 'total blocks allocated': Total number of unique blocks created during the entire search. 'blocks in use final': Number of distinct blocks referenced by the final beams. 'naive blocks needed': Number of blocks that would be needed if each beam stored its own full sequence independently (i.e., beam width ceil(max steps / block size)). Notes At step 0, select the top beam width tokens to initialize beams. At each subsequent step, expand all beams with all tokens, then keep only the top beam width candidates by cumulative score. When multiple children descend from the same parent, the last child to be processed can reuse the parent's block table directly (the others must create shared copies with incremented reference counts). Beams that are not selected as parents should have their block references released.

dsLLM Inference & Memory Systems

Tutor

Waking the tutor…

LLM Inference & Memory Systems

0 of 23 solved