Feature Drift Detection using Population Stability Index

Data Pipelines, Monitoring & Reliability DS practice problem on Onlearn.

Difficulty: medium.

Topics: Feature Drift Detection using Population Stability Index, Population Stability Index, Kullback-Leibler Divergence, Binning Strategies, Feature Distribution Skew, Threshold-based Alerting, Statistical Process Control, Data Engineering, Machine Learning Operations, Information Theory, Software Reliability Engineering, Distribution Shift Analysis, Data Quality Monitoring, Feature Engineering Pipelines, Statistical Hypothesis Testing, Model Performance Observability.

In production ML systems, detecting when the distribution of an input feature changes (drifts) between training and production is crucial for maintaining model performance. The Population Stability Index (PSI) is a widely used MLOps metric for quantifying such distribution shifts.

Write a function detect_feature_drift(reference_data, production_data, num_bins) that:

1. Takes a reference distribution (e.g., training data feature values) and a production distribution (current incoming data).
2. Computes the PSI to measure how much the production distribution has shifted from the reference.
3. Returns a dictionary with the PSI value and a drift assessment.

The function should return a dictionary containing:

psi: The calculated Population Stability Index (rounded to 4 decimal places).
drift_detected: Boolean indicating whether drift is detected (PSI >= 0.1).
drift_level: One of 'none' (PSI < 0.1), 'moderate' (0.1 <= PSI < 0.25), or 'significant' (PSI >= 0.25).

If either input list is empty, return an empty dictionary.

Note: When computing bin proportions, replace zero proportions with a small epsilon value (0.0001) to avoid numerical issues with logarithms.
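
One possible approach to the task above is sketched here in Python. It uses the common form of the index, PSI = sum over bins of (p_i - r_i) * ln(p_i / r_i), where r_i and p_i are the reference and production proportions in bin i. The problem statement does not fix a binning strategy, so this sketch assumes equal-width bins spanning the reference data's minimum to maximum, with out-of-range production values clamped into the first or last bin. The helper name bin_proportions, the EPSILON constant name, and the default of 10 bins are illustrative choices, not part of the original problem, and the thresholds are applied to the rounded PSI so the reported value and level always agree.

```python
import math

EPSILON = 0.0001  # replacement for empty-bin proportions, per the problem note


def bin_proportions(values, lo, hi, num_bins):
    """Hypothetical helper: share of values in each equal-width bin on [lo, hi]."""
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for v in values:
        # Assumption: if all reference values are identical (width == 0),
        # every value is placed in bin 0.
        idx = int((v - lo) / width) if width > 0 else 0
        # Clamp so that v == hi and out-of-range production values
        # fall into the first or last bin.
        idx = min(max(idx, 0), num_bins - 1)
        counts[idx] += 1
    total = len(values)
    # Zero proportions are replaced by EPSILON to keep the logarithm finite.
    return [c / total if c > 0 else EPSILON for c in counts]


def detect_feature_drift(reference_data, production_data, num_bins=10):
    # Empty input on either side yields an empty dictionary.
    if not reference_data or not production_data:
        return {}

    # Assumption: bin edges come from the reference data only.
    lo, hi = min(reference_data), max(reference_data)
    ref = bin_proportions(reference_data, lo, hi, num_bins)
    prod = bin_proportions(production_data, lo, hi, num_bins)

    psi = sum((p - r) * math.log(p / r) for p, r in zip(prod, ref))
    psi = round(psi, 4)

    if psi < 0.1:
        level = "none"
    elif psi < 0.25:
        level = "moderate"
    else:
        level = "significant"

    return {"psi": psi, "drift_detected": psi >= 0.1, "drift_level": level}
```

Under these assumptions, calling detect_feature_drift([1, 2, 3, 4, 5], [1, 2, 3, 4, 5], 5) returns {'psi': 0.0, 'drift_detected': False, 'drift_level': 'none'}, since identical inputs produce identical bin proportions. A grader that bins differently (for example over the combined range, or by quantiles) may produce different PSI values for the same inputs.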