Cliff Walking: Sarsa vs Q-Learning
Foundations & Tabular RL DS practice problem on Onlearn.
Difficulty: medium.
Topics: Differences between On-Policy (Sarsa) and Off-Policy (Q-Learning) temporal difference control methods in a grid-world environment, Epsilon-Greedy Policy, On-Policy Learning, Off-Policy Learning, Bellman Optimality Equation, Bootstrapping, Reinforcement Learning, Markov Decision Processes, Dynamic Programming, Stochastic Processes, Probability Theory, Temporal Difference Learning, Exploration vs Exploitation, Value Function Approximation, Policy Evaluation, Control Theory.
Implement a Cliff Walking environment solver using both Sarsa and Q-Learning. The grid is 4x12. The agent starts at (3,0) and the goal is (3,11). Falling off the cliff (row 3, columns 1-10) incurs a reward of -100 and resets the agent to the start. Every other step incurs a reward of -1. Compare the cumulative rewards after 500 episodes.
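
One possible starting point is sketched below in plain Python. It is a minimal sketch, not a reference solution. The problem statement does not give hyperparameters, so alpha=0.5, gamma=1.0, epsilon=0.1 and the fixed seed are assumptions, as are the helper names step, epsilon_greedy and train. The sketch also assumes that a fall resets the agent to the start without ending the episode, and that moves off the grid leave the agent in place.

```python
import random

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Apply one move and return (next_state, reward, done)."""
    dr, dc = ACTIONS[action]
    r = min(max(state[0] + dr, 0), ROWS - 1)  # moves off the grid are clamped
    c = min(max(state[1] + dc, 0), COLS - 1)
    if r == 3 and 1 <= c <= 10:
        # Cliff: reward -100, agent returns to start, episode continues (assumption).
        return START, -100, False
    return (r, c), -1, (r, c) == GOAL


def epsilon_greedy(Q, state, epsilon, rng):
    """Random action with probability epsilon, else a greedy one (ties broken randomly)."""
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))
    best = max(Q[state])
    return rng.choice([a for a, v in enumerate(Q[state]) if v == best])


def train(method, episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    """Run tabular Sarsa or Q-Learning; return (Q, per-episode returns)."""
    rng = random.Random(seed)
    Q = {(r, c): [0.0] * len(ACTIONS) for r in range(ROWS) for c in range(COLS)}
    returns = []
    for _ in range(episodes):
        state, total, done = START, 0, False
        action = epsilon_greedy(Q, state, epsilon, rng)
        while not done:
            nxt, reward, done = step(state, action)
            if method == "sarsa":
                # On-policy: bootstrap from the action that will actually be taken.
                nxt_action = epsilon_greedy(Q, nxt, epsilon, rng)
                bootstrap = Q[nxt][nxt_action]
            else:
                # Off-policy: bootstrap from the greedy value of the next state.
                bootstrap = max(Q[nxt])
            target = reward + (0.0 if done else gamma * bootstrap)
            Q[state][action] += alpha * (target - Q[state][action])
            if method != "sarsa":
                # Q-Learning picks its behaviour action after the update.
                nxt_action = epsilon_greedy(Q, nxt, epsilon, rng)
            state, action, total = nxt, nxt_action, total + reward
        returns.append(total)
    return Q, returns


if __name__ == "__main__":
    for method in ("sarsa", "q_learning"):
        _, returns = train(method)
        print(f"{method}: total reward over {len(returns)} episodes = {sum(returns)}, "
              f"mean of last 100 = {sum(returns[-100:]) / 100:.1f}")
```

The two methods differ only in the bootstrap term: Sarsa uses the value of the action it will actually take next, while Q-Learning uses the maximum value in the next state. With a fixed epsilon, Sarsa typically settles on a safer route away from the cliff and collects a higher online return, whereas Q-Learning learns the shortest route along the cliff edge (13 steps, return -13) but occasionally falls while exploring. Exact totals depend on the seed and the assumed hyperparameters.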