Character-Level Tokenizer (stoi/itos/BOS)

Text Representation & Classical NLP DS practice problem on Onlearn.

Difficulty: easy.

Topics: Understanding Character-Level Tokenization and Vocabulary Mapping, stoi (String-to-Integer mapping), itos (Integer-to-String mapping), BOS (Beginning of Sentence token), EOS (End of Sentence token), Vocabulary Cardinality, One-to-one Mapping Invariance, Natural Language Processing, Data Preprocessing, Information Theory, Computational Linguistics, Discrete Mathematics, Tokenization Strategies, Vocabulary Construction, Integer Encoding, Sequence Processing, Special Token Injection.

Implement a character-level tokenizer class that builds its vocabulary from a given string. The tokenizer should reserve index 0 for a special 'BOS' (Beginning of Sentence) token and index 1 for an 'EOS' (End of Sentence) token. The remaining unique characters should be mapped to indices starting from 2. Implement 'encode' (str to list of ints) and 'decode' (list of ints to str) methods.
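One possible shape for such a class is sketched below. It is not a reference solution, and several details are assumptions the problem statement leaves open: the class name `CharTokenizer`, the `<BOS>`/`<EOS>` marker strings, sorting the unique characters so that indices are deterministic, the optional `add_special` flag on `encode`, and the choice to drop special tokens in `decode`.

```python
from __future__ import annotations


class CharTokenizer:
    """Hypothetical character-level tokenizer; all names are illustrative."""

    BOS = "<BOS>"  # assumed marker string for index 0
    EOS = "<EOS>"  # assumed marker string for index 1

    def __init__(self, text: str) -> None:
        # Assumption: unique characters are sorted so the mapping is
        # deterministic; the problem statement does not fix an order.
        chars = sorted(set(text))
        # itos: index -> token, with the special tokens at 0 and 1.
        self.itos = [self.BOS, self.EOS] + chars
        # stoi: token -> index, the exact inverse of itos.
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}

    @property
    def vocab_size(self) -> int:
        # Number of unique characters plus the two special tokens.
        return len(self.itos)

    def encode(self, text: str, add_special: bool = False) -> list[int]:
        # Raises KeyError for characters absent from the vocabulary.
        ids = [self.stoi[ch] for ch in text]
        if add_special:
            # Assumption: wrapping with BOS/EOS is opt-in.
            ids = [self.stoi[self.BOS]] + ids + [self.stoi[self.EOS]]
        return ids

    def decode(self, ids: list[int]) -> str:
        # Assumption: special tokens (indices 0 and 1) are skipped.
        return "".join(self.itos[i] for i in ids if i > 1)


tok = CharTokenizer("hello")
print(tok.encode("hello"))                    # [3, 2, 4, 4, 5]
print(tok.encode("hello", add_special=True))  # [0, 3, 2, 4, 4, 5, 1]
print(tok.decode([0, 3, 2, 4, 4, 5, 1]))      # hello
```

Because `stoi` is built directly from `itos`, the two mappings are inverses of each other, and the vocabulary size is the number of unique characters plus two.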
