Q-Learning Algorithm for MDPs
Foundations & Tabular RL DS practice problem on Onlearn.
Difficulty: medium.
Topics: Q-Learning Algorithm for MDPs, Temporal Difference Error, Epsilon-Greedy Policy, Q-Table Initialization, Discount Factor Gamma, Learning Rate Alpha, Reinforcement Learning, Dynamic Programming, Stochastic Processes, Decision Theory, Computational Intelligence, Temporal Difference Learning, Markov Decision Processes, Value Function Approximation, Exploration-Exploitation Strategies, Bellman Optimality Equations.
Write a function that implements the Q Learning algorithm to learn the optimal Q table for a given Markov Decision Process (MDP). The function should take the number of states, number of actions, transition probabilities matrix, rewards matrix, list of terminal states, learning rate, discount factor, epsilon for exploration, and the number of episodes as inputs. Use these parameters to iteratively update the Q table based on the Q Learning update rule, employing an epsilon greedy strategy for action selection. Ensure the function handles starting from non terminal states and stops episodes upon reaching a terminal state. Constraints: num states: Integer greater than or equal to 1. num actions: Integer greater than or equal to 1. P: A 3D NumPy array of shape (num states, num actions, num states) where each element is a probability between 0 and 1, and each sub array sums to 1. R: A 2D NumPy array of shape (num states, num actions) with float or integer values. terminal states: A list or NumPy array of integers, each between 0 and num states 1, with no duplicates. alpha: A float between 0 and 1. gamma: A float between 0 and 1. epsilon: A float between 0 and 1. num episodes: An integer greater than or equal to 1. The function should return a 2D NumPy array of shape (num states, num actions) representing the learned Q table.