Afterstate Value Functions
Planning, Dynamics & Decision Systems DS practice problem on Onlearn.
Difficulty: medium.
Topics: Afterstate Value Functions, Afterstate Representation, State-Action-State Transitions, Expectation Operators, Bootstrapping, Deterministic Dynamics, Reinforcement Learning, Optimal Control, Stochastic Processes, Dynamic Programming, Computational Complexity, Temporal Difference Learning, Markov Decision Processes, Value Function Approximation, Model-Based Planning, Bellman Equations.
Implement value iteration using afterstate value functions for environments where the agent's action deterministically produces an intermediate state (the "afterstate"), and then the environment stochastically transitions to the next state. In many problems, different state action pairs can lead to the same afterstate. By learning values for afterstates instead of state action pairs, the agent can generalize across these equivalent situations, leading to more efficient learning. Your task is to implement afterstate value iteration that: 1. Maintains a value function over afterstates (not over states or state action pairs) 2. Iteratively updates afterstate values using synchronous (batch) updates until convergence 3. Extracts a greedy policy by selecting the action whose resulting afterstate has the highest value Args: states: List of environment states actions: Dict mapping each state to a list of available actions. States with an empty action list are terminal. afterstate map: Dict mapping (state, action) to the resulting afterstate env dynamics: Dict mapping each afterstate to a list of (probability, reward, next state) tuples describing the stochastic environment response gamma: Discount factor theta: Convergence threshold (stop when max absolute change < theta) max iterations: Maximum number of value iteration sweeps Returns: Tuple of (afterstate values, policy) where: afterstate values: dict mapping each afterstate to its value (rounded to 4 decimal places) policy: dict mapping each non terminal state to the best action. Break ties by choosing the lexicographically smallest action.