A/B Test Statistical Analysis for Model Comparison

Statistical Inference DS practice problem on Onlearn.

Difficulty: hard.

Topics: A/B Test Statistical Analysis for Model Comparison, P-value Calculation, Confidence Intervals, Type I and Type II Errors, Null Hypothesis Significance Testing, Bayes Factor, Probability Theory, Statistical Inference, Experimental Design, Hypothesis Testing, Decision Theory, Frequentist Frameworks, Bayesian Inference, Sampling Distributions, Effect Size Estimation, Power Analysis.

In production ML systems, A/B testing is the gold standard for comparing model versions. When rolling out a new model (treatment) against an existing model (control), you need rigorous statistical analysis to make data-driven decisions.

Given binary outcome data from both control and treatment groups, implement a comprehensive A/B test analyzer that computes:

1. Success rates: for both groups
2. Absolute lift: the raw difference in success rates (treatment - control)
3. Relative lift: the percentage improvement over control
4. Z-statistic: the test statistic from a two-proportion z-test using pooled variance
5. P-value: the two-tailed p-value for the hypothesis test
6. Confidence interval: for the difference in proportions, using the unpooled standard error
7. Statistical significance: whether the p-value is below alpha (1 - confidence level)
8. Practical significance: whether the absolute lift meets the minimum detectable effect threshold
9. Required sample size: the minimum number of samples needed per group for 80% power
10. Recommendation: one of 'launch treatment', 'keep control', or 'continue testing'

The recommendation logic should be:

'launch treatment': statistically significant AND practically significant AND treatment is better
'keep control': statistically significant AND (treatment is worse OR effect is not practically significant)
'continue testing': not statistically significant OR current sample size is insufficient

If either input list is empty, return an empty dictionary.

Write a function analyze_ab_test(control_outcomes, treatment_outcomes, confidence_level, min_detectable_effect) that performs this analysis.
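One possible way to approach the task is sketched below using only the Python standard library. It is not a reference solution. Several details are assumptions the problem statement leaves open: the default argument values, the dictionary key names, treating practical significance as |absolute lift| >= minimum detectable effect, reporting a relative lift of 0 when the control rate is 0, and sizing the sample to detect a shift of min_detectable_effect from the observed control rate. The recommendation rules overlap when a result is significant but the sample is smaller than required, so the sketch assumes they are evaluated in the order listed and the first match wins.

```python
import math
from statistics import NormalDist


def analyze_ab_test(control_outcomes, treatment_outcomes,
                    confidence_level=0.95, min_detectable_effect=0.01):
    """Sketch of an A/B analyzer; defaults and result keys are assumptions."""
    if not control_outcomes or not treatment_outcomes:
        return {}

    n_c, n_t = len(control_outcomes), len(treatment_outcomes)
    x_c, x_t = sum(control_outcomes), sum(treatment_outcomes)
    p_c, p_t = x_c / n_c, x_t / n_t

    absolute_lift = p_t - p_c
    # Assumption: relative lift is reported as 0 when the control rate is 0.
    relative_lift = absolute_lift / p_c * 100 if p_c > 0 else 0.0

    std_normal = NormalDist()
    alpha = 1 - confidence_level
    z_crit = std_normal.inv_cdf(1 - alpha / 2)

    # Two-proportion z-test with the pooled standard error.
    p_pool = (x_c + x_t) / (n_c + n_t)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z_stat = absolute_lift / se_pooled if se_pooled > 0 else 0.0
    p_value = 2 * (1 - std_normal.cdf(abs(z_stat)))

    # Confidence interval for the difference, unpooled standard error.
    se_unpooled = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    ci = (absolute_lift - z_crit * se_unpooled,
          absolute_lift + z_crit * se_unpooled)

    # Per-group sample size for 80% power (assumed formula: detect a shift
    # of min_detectable_effect away from the observed control rate).
    z_beta = std_normal.inv_cdf(0.80)
    p_alt = min(max(p_c + min_detectable_effect, 0.0), 1.0)
    var_sum = p_c * (1 - p_c) + p_alt * (1 - p_alt)
    if min_detectable_effect > 0:
        required_n = math.ceil(
            (z_crit + z_beta) ** 2 * var_sum / min_detectable_effect ** 2)
    else:
        required_n = 0  # assumption: no threshold means no size requirement

    significant = p_value < alpha
    # Assumption: practical significance compares the magnitude of the lift.
    practical = abs(absolute_lift) >= min_detectable_effect

    # Assumption: rules are checked in the order listed; first match wins.
    if significant and practical and absolute_lift > 0:
        recommendation = "launch treatment"
    elif significant and (absolute_lift < 0 or not practical):
        recommendation = "keep control"
    else:
        recommendation = "continue testing"

    return {
        "control_rate": p_c,
        "treatment_rate": p_t,
        "absolute_lift": absolute_lift,
        "relative_lift": relative_lift,
        "z_statistic": z_stat,
        "p_value": p_value,
        "confidence_interval": ci,
        "statistically_significant": significant,
        "practically_significant": practical,
        "required_sample_size": required_n,
        "sufficient_sample": min(n_c, n_t) >= required_n,
        "recommendation": recommendation,
    }


# Example: 10% vs 15% success on 1000 users each (made-up data).
control = [1] * 100 + [0] * 900
treatment = [1] * 150 + [0] * 850
result = analyze_ab_test(control, treatment, 0.95, 0.02)
print(result["recommendation"], round(result["z_statistic"], 2))
```

The extra "sufficient_sample" flag is included so a caller who prefers to treat an undersized sample as 'continue testing' regardless of significance can apply that stricter reading without recomputing anything.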