TF-IDF (Term Frequency-Inverse Document Frequency)
Data Preparation & Feature Engineering DS practice problem on Onlearn.
Difficulty: medium.
Topics: Understanding Implement TF-IDF (Term Frequency-Inverse Document Frequency), Term Frequency, Inverse Document Frequency, Laplace Smoothing, Document Frequency, Logarithmic Scaling, Natural Language Processing, Information Retrieval, Feature Engineering, Statistical Learning, Data Preprocessing, Vector Space Models, Text Normalization, Frequency Analysis, Corpus Statistics, Weighting Schemes.
Task: Implement TF IDF (Term Frequency Inverse Document Frequency) Your task is to implement a function that computes the TF IDF scores for a query against a given corpus of documents. Function Signature Write a function compute tf idf(corpus, query) that takes the following inputs: corpus: A list of documents, where each document is a list of words. query: A list of words for which you want to compute the TF IDF scores. Output The function should return a list of lists containing the TF IDF scores for the query words in each document, rounded to five decimal places. Important Considerations 1. Handling Division by Zero: When implementing the Inverse Document Frequency (IDF) calculation, you must account for cases where a term does not appear in any document (df = 0). This can lead to division by zero in the standard IDF formula. Add smoothing (e.g., adding 1 to both numerator and denominator) to avoid such errors. 2. Empty Corpus: Ensure your implementation gracefully handles the case of an empty corpus. If no documents are provided, your function should either raise an appropriate error or return an empty result. This will ensure the program remains robust and predictable. 3. Edge Cases: Query terms not present in the corpus. Documents with no words. Extremely large or small values for term frequencies or document frequencies. By addressing these considerations, your implementation will be robust and handle real world scenarios effectively.