Expected SARSA Algorithm for Policy Evaluation and Control

Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.

Difficulty: medium.

Topics: Expected SARSA Algorithm for Policy Evaluation and Control, Action-Value Expectation, Off-Policy Learning, Bias-Variance Tradeoff, Softmax Policy, Bootstrapping, Reinforcement Learning, Stochastic Processes, Dynamic Programming, Statistical Estimation, Decision Theory, Temporal Difference Learning, Policy Iteration, Monte Carlo Methods, Function Approximation, Exploration Strategies.

Implement the Expected SARSA algorithm, a temporal difference reinforcement learning method that updates action value estimates using the expected value of the next state action pair under the current policy, rather than the value of the actually taken next action. Given a Q table (num states x num actions), process an episode consisting of experience tuples (state, action, reward, next state, done) and return the updated Q table. The agent follows an epsilon greedy policy, where each non greedy action is selected with probability epsilon/num actions, and the greedy action receives the remaining probability mass. When computing the expected value at a next state, use the epsilon greedy action probabilities derived from the current Q values at that state. If an episode step is terminal (done=True), the expected future value should be zero. Note: The Q table may be updated incrementally within an episode each step's update should be visible to subsequent steps in the same episode. Args: Q: numpy array of shape (num states, num actions) with initial Q values episode: list of tuples (state, action, reward, next state, done) alpha: learning rate gamma: discount factor epsilon: exploration parameter for epsilon greedy policy Returns: Updated Q table as a numpy array of the same shape (do not modify the original).