Upper Confidence Bound (UCB) Action Selection
Foundations & Tabular RL DS practice problem on Onlearn.
Difficulty: medium.
Topics: Understanding Upper Confidence Bound (UCB) Action Selection in Multi-Armed Bandits, Hoeffding's Inequality, Optimism in the Face of Uncertainty, Action-Value Estimation, Exploration Parameter Tuning, Non-stationary vs Stationary Bandits, Reinforcement Learning, Probability Theory, Decision Theory, Optimization, Statistics, Multi-Armed Bandits, Exploration-Exploitation Trade-off, Confidence Intervals, Regret Minimization, Online Learning.
Implement the UCB1 action selection strategy. Given a list of current action-value estimates (Q), the number of times each action has been selected (N), and the total number of steps taken so far (t), return the index of the action to select. Use a default exploration parameter c = 2.0. If any action has never been selected (its count in N is 0), prioritize it for exploration and return its index immediately.
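One way to approach the task is sketched below. The problem statement does not spell out the scoring formula, so this sketch assumes the commonly used form Q[a] + c * sqrt(ln(t) / N[a]); the function name ucb_select and the tie-breaking rule (lowest index wins) are illustrative choices, not part of the original problem.

```python
import math
from typing import Sequence


def ucb_select(Q: Sequence[float], N: Sequence[int], t: int, c: float = 2.0) -> int:
    """Return the index of the action chosen by a UCB-style rule.

    Assumed score (not stated in the problem): Q[a] + c * sqrt(ln(t) / N[a]).
    """
    # Any action that has never been tried is returned immediately,
    # as the problem statement requires. The first such index is used.
    for action, count in enumerate(N):
        if count == 0:
            return action

    # Guard against t < 1 so that log() is always defined (assumption:
    # t is normally >= 1 once every action has been tried).
    log_t = math.log(max(t, 1))

    # Value estimate plus an exploration bonus that shrinks as N[a] grows.
    scores = [q + c * math.sqrt(log_t / n) for q, n in zip(Q, N)]

    # max() returns the first maximal index, so ties go to the lowest index.
    return max(range(len(scores)), key=scores.__getitem__)
```

For example, ucb_select([1.0, 2.0], [0, 5], 5) returns the untried action, while with Q = [0.5, 0.6], N = [1, 100], t = 101 the rarely tried action receives a much larger bonus and is preferred despite its lower estimate.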