RL Backup Diagram Generator
RL Environments, Games & Applications DS practice problem on Onlearn.
Difficulty: medium.
Topics: RL Backup Diagram Generator, One-step Lookahead, Expected Value Calculation, Discount Factor Application, Transition Probability Matrix, Recursive Value Update, Dynamic Programming, Temporal Difference Learning, Markov Decision Processes, Monte Carlo Methods, Function Approximation, Bellman Equations, Bootstrapping Techniques, State-Action Value Estimation, Policy Evaluation, Experience Replay.
Implement a function that generates a structured backup diagram for reinforcement learning value estimation. Given a Markov Decision Process (MDP), a current state, a backup type, and current value estimates, your function should produce a dictionary that captures the full tree structure of the backup computation. The MDP is specified by: transitions: A dictionary mapping (state, action) tuples to lists of (probability, next state, reward) tuples representing the environment dynamics. policy: A list of lists where policy[s][a] gives the probability of taking action a in state s. Your function should support two backup types: "v expectation": The Bellman expectation backup, which computes the state value as a policy weighted average over action values. "v optimal": The Bellman optimality backup, which computes the state value as the maximum over action values. For each action available at the given state (an action is available if (state, action) exists in transitions), compute the action value by summing over all possible transitions: prob (reward + gamma V[next state]). Return a dictionary with the following keys: "action values": A dictionary mapping each available action (int) to its computed action value (rounded to 4 decimal places). "aggregation": The string "expectation" or "maximization" depending on the backup type. "backed up value": The final backed up state value (rounded to 4 decimal places). "diagram": A list of dictionaries (one per available action, ordered by action index), each containing: "action": The action index. "policy prob": The policy probability for this action (rounded to 4 decimal places). "transitions": A list of dictionaries for each transition, containing "next state", "prob", "reward", "target" (the reward plus discounted next state value), and "contribution" (prob times target). All floats rounded to 4 decimal places. "action value": Same as in action values for this action.