RL Backup Operations for Value Functions
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: RL Backup Operations for Value Functions, Bootstrapping, Contraction Mapping, Eligibility Traces, Expected SARSA, N-step Returns, Reinforcement Learning, Dynamic Programming, Stochastic Processes, Functional Analysis, Computational Complexity, Temporal Difference Learning, Bellman Operator Theory, Markov Decision Processes, Value Function Approximation, Monte Carlo Methods.
In reinforcement learning, backup operations are the fundamental building blocks of value-based algorithms. Each algorithm (policy evaluation, value iteration, Q-learning, etc.) differs primarily in the type of backup it applies. Understanding these backups is essential for grasping how different RL methods update their value estimates.

Implement a function rl_backup that computes the result of one of three types of backup operation:

1. "action value" backup: Computes the value of taking a specific action from a state, using the current state-value estimates for the successor states.
   transitions: a list of tuples (probability, next state, reward) for the given action.
   values: a dict mapping state names to their current value estimates.

2. "state value" backup: Computes the value of a state under a given policy by averaging the action-value backups over all actions, weighted by the policy probabilities.
   transitions: a dict mapping action names to lists of (probability, next state, reward) tuples.
   policy: a dict mapping action names to their selection probabilities under the current policy.
   values: a dict mapping state names to their current value estimates.

3. "optimal value" backup: Computes the value of a state under an optimal policy by taking the maximum action-value backup over all available actions.
   transitions: a dict mapping action names to lists of (probability, next state, reward) tuples.
   values: a dict mapping state names to their current value estimates.
   policy: not used.

For each backup, successor states not found in values should be treated as having value 0. Return the backed-up value as a float rounded to 4 decimal places.
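The three backups described above can be sketched as follows. This is a minimal illustration, not a reference solution: the function is written here as rl_backup, and the argument order, the discount parameter gamma (which the statement does not mention, so it defaults to 1.0, i.e. no discounting), and the choice to return 0.0 for an empty action set are all assumptions made for the example. The backup-type strings are taken as they appear in the statement.

```python
from typing import Dict, List, Optional, Tuple, Union

# One transition for a fixed action: (probability, next_state, reward).
Transition = Tuple[float, str, float]


def _action_backup(transitions: List[Transition],
                   values: Dict[str, float],
                   gamma: float) -> float:
    """Expected one-step return: sum over outcomes of p * (r + gamma * V(s'))."""
    # Successor states missing from `values` count as 0, as the statement requires.
    return sum(p * (r + gamma * values.get(s_next, 0.0))
               for p, s_next, r in transitions)


def rl_backup(backup_type: str,
              transitions: Union[List[Transition], Dict[str, List[Transition]]],
              values: Dict[str, float],
              policy: Optional[Dict[str, float]] = None,
              gamma: float = 1.0) -> float:
    """Hypothetical signature; `gamma` is an assumption, not part of the statement."""
    if backup_type == "action value":
        # `transitions` is a list of outcomes for a single action.
        result = _action_backup(transitions, values, gamma)
    elif backup_type == "state value":
        # Average the per-action backups, weighted by the policy probabilities.
        # Actions absent from `policy` get weight 0 (assumption).
        policy = policy or {}
        result = sum(policy.get(action, 0.0) * _action_backup(outcomes, values, gamma)
                     for action, outcomes in transitions.items())
    elif backup_type == "optimal value":
        # Maximum per-action backup; 0.0 if there are no actions (assumption).
        result = max((_action_backup(outcomes, values, gamma)
                      for outcomes in transitions.values()),
                     default=0.0)
    else:
        raise ValueError(f"unknown backup type: {backup_type!r}")
    return round(float(result), 4)


# Usage on a small made-up example; state "s3" is missing from `values`.
values = {"s1": 1.0, "s2": 2.0}
transitions = {
    "left": [(1.0, "s1", 0.0)],
    "right": [(0.5, "s2", 1.0), (0.5, "s3", 0.0)],
}
policy = {"left": 0.5, "right": 0.5}

print(rl_backup("action value", transitions["right"], values, gamma=0.9))  # 1.4
print(rl_backup("state value", transitions, values, policy, gamma=0.9))    # 1.15
print(rl_backup("optimal value", transitions, values, gamma=0.9))          # 1.4
```

Factoring the single-action expectation into one helper keeps the three cases consistent: the "state value" and "optimal value" backups differ only in how they combine the per-action results (policy-weighted average versus maximum).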