Fitted Value Iteration
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: Fitted Value Iteration, Least Squares Policy Iteration, Basis Function Expansion, Contraction Mapping, Sample Complexity, Bellman Residual Minimization, Reinforcement Learning, Numerical Optimization, Function Approximation, Dynamic Programming, Statistical Learning Theory, Approximate Dynamic Programming, Supervised Regression, Bellman Operator Analysis, Stochastic Approximation, Generalization Error Bounds.
Implement the Fitted Value Iteration algorithm, an approximate dynamic programming method that uses least squares function approximation to estimate the value function in a Markov Decision Process. Unlike tabular value iteration which stores a value for every state, fitted value iteration represents the value function using a parameterized model. At each iteration, it computes Bellman backup targets for a set of sample states and then fits a linear model to approximate the updated value function. Inputs features: A NumPy array of shape (n states, n features) where each row is the feature vector for a state transitions: A dictionary mapping (state, action) to (next state, reward) tuples representing deterministic environment dynamics n actions: Total number of actions (int) terminal states: A list of terminal state indices (terminal states have value 0) gamma: Discount factor (float, between 0 and 1) n iterations: Number of fitting iterations (int) Algorithm Overview 1. Start with an initial value function of all zeros 2. At each iteration, use the current value estimates to compute Bellman backup targets for every state 3. Fit a new linear model (via least squares) mapping features to targets 4. Repeat for the specified number of iterations For non terminal states, the target is computed by taking the best action according to current value estimates. For terminal states, the target is always 0. If a state action pair is not in the transitions dictionary, that action is unavailable from that state. Output Return a tuple (weights, values) where: weights is a list of the linear model parameters, rounded to 4 decimal places values is a list of the fitted value for each state, rounded to 4 decimal places