Incremental Mean for Online Reward Estimation

Foundations & Tabular RL DS practice problem on Onlearn.

Difficulty: easy.

Topics: Understanding Incremental Mean for Online Reward Estimation, Incremental Update Rule, Running Average, Sample Mean, Constant-Space Complexity, Action-Value Estimation, Reinforcement Learning, Statistics, Computational Complexity, Numerical Analysis, Probability Theory, Multi-Armed Bandits, Online Learning Algorithms, Iterative Estimation, Memory-Efficient Computing, Stochastic Processes.

Implement an efficient method to update the mean reward for a k armed bandit action after receiving each new reward, without storing the full history of rewards . Given the previous mean estimate (Q prev), the number of times the action has been selected (k), and a new reward (R), compute the updated mean using the incremental formula. Note: Using a regular mean that stores all past rewards will eventually run out of memory. Your solution should use only the previous mean, the count, and the new reward.