The SARSA Algorithm (On-Policy)
Foundations & Tabular RL DS practice problem on Onlearn.
Difficulty: medium.
Topics: The SARSA Algorithm on policy, State-Action-Reward-State-Action Tuple, On-Policy Learning, Epsilon-Greedy Action Selection, Q-Table Update Rule, Bootstrapping, Reinforcement Learning, Stochastic Processes, Dynamic Programming, Statistical Decision Theory, Control Theory, Temporal Difference Learning, Markov Decision Processes, Policy Evaluation, Exploration-Exploitation Trade-off, Value Function Approximation.
Implement the SARSA algorithm to estimate Q values for a given set of deterministic transitions, using greedy action selection. All Q values are initialized to zero. Each episode starts from a given initial state and ends when it reaches the $terminal$ state or when the number of steps exceeds $maxsteps$. Updates to the Q values persist across episodes.
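One way to approach the task is sketched below. It applies the standard SARSA update $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma\,Q(s',a') - Q(s,a)]$, where $a'$ is the action the same greedy policy picks in $s'$. The problem statement does not fix an input format, so several things here are assumptions: the transitions are passed as a dictionary mapping `(state, action)` to `(next_state, reward)`, the learning rate `alpha` and discount `gamma` are parameters with illustrative defaults, ties in the greedy choice go to the action listed first, each episode takes at most `maxsteps` steps, and every reachable non-terminal state has at least one action. The names `sarsa`, `greedy_action`, and `example_transitions` are hypothetical.

```python
from collections import defaultdict


def greedy_action(Q, state, actions):
    # Highest Q value wins; on ties max() returns the first action in `actions`.
    return max(actions, key=lambda a: Q[(state, a)])


def sarsa(transitions, initial_state, terminal, episodes, maxsteps,
          alpha=0.5, gamma=0.9):
    """Tabular SARSA with greedy selection on deterministic transitions.

    transitions: dict {(state, action): (next_state, reward)}  (assumed format)
    Returns a dict {(state, action): Q value}.
    """
    Q = defaultdict(float)  # every Q value starts at zero

    # Collect the actions available in each state, in insertion order.
    actions = defaultdict(list)
    for (s, a) in transitions:
        actions[s].append(a)

    for _ in range(episodes):  # Q is NOT reset between episodes
        state = initial_state
        if state == terminal:
            continue
        action = greedy_action(Q, state, actions[state])

        for _ in range(maxsteps):
            next_state, reward = transitions[(state, action)]

            if next_state == terminal:
                # Terminal state has value zero, so the target is just the reward.
                Q[(state, action)] += alpha * (reward - Q[(state, action)])
                break

            # On-policy: the bootstrap uses the action actually taken next.
            next_action = greedy_action(Q, next_state, actions[next_state])
            target = reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (target - Q[(state, action)])

            state, action = next_state, next_action

    return dict(Q)


# Hypothetical three-state chain: A -> B -> T, with reward 1 on reaching T.
example_transitions = {
    ("A", "right"): ("B", 0.0),
    ("A", "left"): ("A", 0.0),
    ("B", "right"): ("T", 1.0),
    ("B", "left"): ("A", 0.0),
}

if __name__ == "__main__":
    q = sarsa(example_transitions, "A", "T", episodes=2, maxsteps=10)
    for key in sorted(q):
        print(key, round(q[key], 3))
```

Because the next action is chosen before the update and then actually executed, the bootstrap term reflects the policy being followed, which is what makes the method on-policy. With purely greedy selection and zero initialization, the tie-breaking rule decides which actions are ever tried, so a different tie-breaking convention can yield different Q values.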