Isolation Forest for Anomaly Detection
Dimensionality Reduction & Anomaly Detection DS practice problem on Onlearn.
Difficulty: medium.
Topics: Isolation Forest for Anomaly Detection, Path Length, Isolation Score, Recursive Binary Splitting, Contamination Parameter, Sub-sampling Ratio, Unsupervised Learning, Ensemble Methods, Information Theory, Statistical Analysis, Computational Complexity, Anomaly Detection, Decision Tree Induction, Partitioning Algorithms, Feature Space Analysis, Randomized Sampling.
Write a Python function isolation forest that implements the Isolation Forest algorithm for anomaly detection. The function should take a 2D numpy array X of shape (n samples, n features), an integer n trees specifying the number of isolation trees to build, an integer sample size specifying how many samples to use for building each tree, and an integer random state for reproducibility. The function should return a 1D numpy array of anomaly scores for each sample, where higher scores (closer to 1) indicate more anomalous points and lower scores (closer to 0.5) indicate normal points. The Isolation Forest algorithm works by building random binary trees that recursively partition data. Anomalies are isolated quickly (short path lengths) because they are rare and different from normal points. For each tree, randomly select a feature and a split value between the minimum and maximum of that feature to partition the data. The anomaly score is computed from the average path length across all trees, normalized by the expected average path length for a Binary Search Tree. The maximum depth for each tree should be set to ceil(log2(sample size)).