Build a Multi-Armed Bandit Testbed

Foundations & Tabular RL DS practice problem on Onlearn.

Difficulty: medium.

Topics: Understanding the Exploration-Exploitation Trade-off in Multi-Armed Bandits, Stationary Distributions, Epsilon-Greedy Policy, Sample-Average Method, Reward Signal, Action Space, Reinforcement Learning, Probability Theory, Statistics, Optimization, Simulation Modeling, Exploration-Exploitation Trade-off, Action-Value Methods, Stochastic Processes, Regret Analysis, Incremental Learning.

Implement a 'BanditTestbed' class that simulates a k-armed bandit problem. The class should initialize k arms with true mean rewards sampled from a normal distribution. Provide a method 'pull(action)' that returns a reward sampled from a normal distribution centered at the true mean of that action. Additionally, track the estimated value of each action using the incremental update rule $Q_{n+1} = Q_n + \frac{1}{n}(R_n - Q_n)$, where $R_n$ is the n-th reward received from that action and $Q_n$ is its estimate before that reward.
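
One possible shape for the class is sketched below; it is a minimal illustration under stated assumptions, not a reference solution. The problem statement does not fix the distribution parameters, so the sketch assumes true means drawn from N(0, 1), unit-variance reward noise, and initial estimates of 0. The attribute names 'true_means', 'q_estimates', and 'counts', the 'seed' parameter, and the choice to update the estimate inside 'pull' are also hypothetical choices made for illustration.

```python
import random


class BanditTestbed:
    """k-armed bandit with Gaussian rewards and sample-average estimates."""

    def __init__(self, k=10, seed=None):
        self.k = k
        # A private generator keeps runs reproducible when a seed is given.
        self.rng = random.Random(seed)
        # True mean reward of each arm; N(0, 1) is an assumption.
        self.true_means = [self.rng.gauss(0.0, 1.0) for _ in range(k)]
        # Estimated value per arm, initialised to 0 (assumption).
        self.q_estimates = [0.0] * k
        # Number of times each arm has been pulled.
        self.counts = [0] * k

    def pull(self, action):
        """Sample a reward for `action` and update that arm's estimate."""
        if not 0 <= action < self.k:
            raise ValueError(f"action must be in [0, {self.k - 1}]")
        # Reward noise has unit standard deviation (assumption).
        reward = self.rng.gauss(self.true_means[action], 1.0)
        self.counts[action] += 1
        n = self.counts[action]
        # Incremental rule: Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
        self.q_estimates[action] += (reward - self.q_estimates[action]) / n
        return reward
```

With a step size of 1/n, the estimate for an arm after n pulls equals the plain average of the n rewards observed for it, so no reward history needs to be stored. For example, after 'bed = BanditTestbed(k=5, seed=0)' and repeated calls to 'bed.pull(2)', the value 'bed.q_estimates[2]' matches the mean of the returned rewards up to floating-point error.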