The SARSA Algorithm (On-Policy)

Foundations & Tabular RL DS practice problem on Onlearn.

Difficulty: medium.

Topics: The SARSA Algorithm (On-Policy), State-Action-Reward-State-Action Tuple, On-Policy Learning, Epsilon-Greedy Action Selection, Q-Table Update Rule, Bootstrapping, Reinforcement Learning, Stochastic Processes, Dynamic Programming, Statistical Decision Theory, Control Theory, Temporal Difference Learning, Markov Decision Processes, Policy Evaluation, Exploration-Exploitation Trade-off, Value Function Approximation.

Implement the SARSA algorithm to estimate Q-values for a given set of deterministic transitions, using greedy action selection. All Q-values are initialized to zero. Each episode starts from a given initial state and ends when it reaches the $terminal$ state or when the number of steps exceeds $maxsteps$. Updates to the Q-values persist across episodes.
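
One possible reading of this task is sketched below. It applies the standard SARSA update $Q(s, a) \leftarrow Q(s, a) + \alpha \, [r + \gamma \, Q(s', a') - Q(s, a)]$, where $a'$ is the action the greedy policy actually picks in $s'$. The problem statement does not fix an input format or hyperparameters, so several things here are assumptions: the function name `sarsa`, the `transitions` dictionary mapping `(state, action)` to `(next_state, reward)`, the `actions` list, the `episodes` count, the defaults for `alpha` and `gamma`, breaking ties in favor of the first listed action, and treating $maxsteps$ as a cap on updates per episode.

```python
from collections import defaultdict


def sarsa(transitions, actions, initial_state, terminal,
          episodes, max_steps, alpha=0.5, gamma=0.9):
    """Tabular SARSA with greedy action selection (illustrative sketch).

    transitions: dict mapping (state, action) -> (next_state, reward).
    This layout is an assumption; the problem does not specify one.
    """
    # All Q-values start at zero; the table is shared across episodes.
    q = defaultdict(float)

    def greedy(state):
        # max() returns the first maximal item, so ties go to the
        # earliest entry in `actions` (an assumed tie-breaking rule).
        return max(actions, key=lambda a: q[(state, a)])

    for _ in range(episodes):
        state = initial_state
        action = greedy(state)
        for _ in range(max_steps):
            if state == terminal:
                break
            next_state, reward = transitions[(state, action)]
            # On-policy: bootstrap from the action we will really take next.
            next_action = greedy(next_state)
            target = reward + gamma * q[(next_state, next_action)]
            q[(state, action)] += alpha * (target - q[(state, action)])
            state, action = next_state, next_action
    return dict(q)


# Hypothetical two-step chain: 0 -> 1 -> 2, where 2 is terminal.
chain = {(0, "right"): (1, 0.0), (1, "right"): (2, 1.0)}
q_values = sarsa(chain, ["right"], initial_state=0, terminal=2,
                 episodes=2, max_steps=10)
print(q_values[(0, "right")], q_values[(1, "right")])
```

Because the terminal state's Q-values are never updated, they stay at zero and the bootstrap term vanishes on the final step of an episode. If the intended cap is "more than $maxsteps$" in a stricter sense, the inner loop bound would need adjusting by one.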