Differential Sarsa Algorithm

Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.

Difficulty: medium.

Topics: Differential Sarsa Algorithm, Differential Bellman Equation, Relative Value Function, Average Reward Rate, Differential TD Error, On-policy Control, Reinforcement Learning, Stochastic Processes, Dynamic Programming, Control Theory, Statistical Estimation, Temporal Difference Learning, Average Reward Framework, Policy Evaluation, Markov Decision Processes, Function Approximation.

Implement the Differential Sarsa algorithm for estimating action values in the average reward setting for continuing (non episodic) tasks. Unlike the standard discounted Sarsa, Differential Sarsa is designed for tasks that run indefinitely without terminal states. Instead of discounting future rewards, the algorithm maintains a running estimate of the average reward per time step and uses the differential TD error for updates. The algorithm maintains: A Q value table Q(s, a) initialized to zero for all state action pairs An average reward estimate R bar initialized to zero Action selection: Use greedy selection from Q values. When there are ties, break them by choosing the lexicographically smallest action. Inputs: transitions: A dictionary mapping (state, action) to (reward, next state) for deterministic transitions initial state: The starting state alpha: Step size for Q value updates beta: Step size for the average reward estimate num steps: Number of interaction steps to execute Output: Return a tuple (Q, R bar) where Q is a dictionary of Q values {(state, action): value} for all state action pairs present in the transitions, and R bar is the final average reward estimate. Note: The Q table should contain entries for every (state, action) pair that appears as a key in the transitions dictionary.