Byte Pair Encoding (BPE) Tokenizer

Text Representation & Classical NLP DS practice problem on Onlearn.

Difficulty: medium.

Topics: Byte Pair Encoding (BPE) Tokenizer, Byte Pair Encoding, Merge Rules, Frequency Counting, Character-level Granularity, Out-of-Vocabulary Handling, Natural Language Processing, Information Theory, Computational Linguistics, Data Structures, Machine Learning Infrastructure, Subword Tokenization, Statistical Modeling, Sequence Processing, String Algorithms, Vocabulary Management.

Implement the training phase of the Byte Pair Encoding (BPE) algorithm, a subword tokenization method widely used in modern NLP models like GPT and RoBERTa. Given a corpus represented as a dictionary mapping space separated token sequences to their frequencies, and a number of merge operations to perform, implement the BPE training algorithm. The algorithm works iteratively: 1. Count all adjacent token pairs across the corpus, weighted by word frequency 2. Find the most frequent pair 3. Merge that pair everywhere it appears in the corpus (replacing the two tokens with their concatenation) 4. Record the merge operation 5. Repeat for the specified number of merges Note: When counting pairs in a word, if a pair appears multiple times in the same word (e.g., 'a b a b' contains 'a b' twice), count each occurrence separately, multiplied by the word frequency. If there are fewer possible merges than num merges (e.g., all words are single tokens), stop early. The special token '</w ' represents end of word and should be treated as a regular token that can participate in merges. Write a function that returns the list of merge operations performed, where each merge is a tuple of the two tokens that were merged.