Optimistic Initialization for Exploration

Representation Learning, Advanced Theory & Miscellaneous DS practice problem on Onlearn.

Difficulty: medium.

Topics: Optimistic Initialization for Exploration, Initial Q-Value Bias, Epsilon-Greedy Strategy, Posterior Sampling, Upper Confidence Bound, Regret Minimization, Reinforcement Learning, Probability Theory, Stochastic Optimization, Statistical Learning Theory, Decision Theory, Exploration-Exploitation Trade-off, Value Function Approximation, Bayesian Inference, Multi-Armed Bandits, Markov Decision Processes.

Implement a greedy multi armed bandit agent that uses optimistic initial value estimates to drive exploration. In a multi armed bandit setting, the agent must balance exploring different arms and exploiting the best known arm. One elegant approach is to initialize the estimated Q values to a value much higher than the true rewards. Because the agent always selects the arm with the highest Q value (greedy), it will naturally try different arms as each one's estimate gets pulled down after being selected. Given: true rewards: a list of the true (deterministic) reward for each arm. Pulling arm i always returns true rewards[i]. initial q: the initial Q value estimate for all arms (set optimistically high to encourage exploration). n steps: the number of total arm pulls to simulate. step size: a constant step size parameter (alpha) for updating Q values. At each step: 1. Select the arm with the highest current Q value. Break ties by choosing the arm with the lowest index. 2. Observe the reward for the selected arm. 3. Update the Q value for that arm using a constant step size update rule. Return a tuple of two lists: 1. The final Q values for each arm, each rounded to 4 decimal places. 2. The total number of times each arm was selected.