Detecting Benchmark Contamination in Training Data

Text Generation & NLP Evaluation DS practice problem on Onlearn.

Difficulty: hard.

Topics: Understanding and Detecting Data Leakage in LLM Training Corpora, N-gram Tokenization, Jaccard Similarity Coefficient, Set Intersection, Token Normalization, Threshold-based Filtering, Natural Language Processing, Machine Learning Evaluation, Data Engineering, Information Retrieval, Statistical Analysis, Data Leakage Detection, Corpus Preprocessing, Evaluation Integrity, Similarity Measures, Scalable String Matching.

Implement a function detect_contamination that takes a list of training sequences and a list of benchmark test sequences. The function should return the indices of the benchmark sequences whose Jaccard similarity with at least one training sequence exceeds a specified threshold (0.3), where similarity is computed over each sequence's set of 5-grams.
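
One possible approach is sketched below; it is not a reference solution. The Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|. The problem statement does not say how sequences are tokenized or normalized, so this sketch assumes word-level tokens, lowercasing, and punctuation stripping. The helper names ngram_set and jaccard, and the choice to treat a sequence with fewer than five tokens as having similarity 0, are likewise assumptions made for illustration.

```python
import re
from typing import List, Set, Tuple

NGram = Tuple[str, ...]


def ngram_set(text: str, n: int = 5) -> Set[NGram]:
    """Return the set of word-level n-grams in text.

    Assumed normalization: lowercase, keep only runs of word characters.
    """
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[NGram], b: Set[NGram]) -> float:
    """Jaccard similarity |A & B| / |A | B|; 0.0 if either set is empty (assumed)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def detect_contamination(
    train: List[str],
    benchmark: List[str],
    threshold: float = 0.3,
    n: int = 5,
) -> List[int]:
    """Return indices of benchmark sequences whose similarity to any
    training sequence is strictly greater than threshold."""
    # Build the training n-gram sets once, not once per benchmark item.
    train_sets = [ngram_set(t, n) for t in train]
    flagged = []
    for idx, sequence in enumerate(benchmark):
        bench_set = ngram_set(sequence, n)
        if any(jaccard(bench_set, ts) > threshold for ts in train_sets):
            flagged.append(idx)
    return flagged


train = ["The quick brown fox jumps over the lazy dog near the river bank"]
benchmark = [
    "the quick brown fox jumps over the lazy dog",
    "completely unrelated sentence about machine learning evaluation metrics today",
]
# The first benchmark item shares 5 of 9 distinct 5-grams with the training
# sequence (similarity 5/9), so only index 0 is flagged.
print(detect_contamination(train, benchmark))  # [0]
```

This pairwise comparison costs time proportional to the number of training sequences times the number of benchmark sequences. For a large corpus, an inverted index from n-gram to document, or MinHash signatures, would be a common way to reduce that cost, at the price of extra memory or an approximate result.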