Certainty-Equivalence in TD Learning

Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.

Difficulty: medium.

Topics: Certainty-Equivalence in TD Learning, Certainty-Equivalence Principle, Bootstrapping, Maximum Likelihood Estimation, Value Function Approximation, Sample Complexity, Reinforcement Learning, Stochastic Processes, Statistical Estimation, Control Theory, Dynamic Programming, Temporal Difference Learning, Model-Based RL, Bayesian Inference, System Identification, Bellman Equations.

Implement the certainty equivalence method for computing the value function from observed transition data in a Markov Decision Process. Instead of performing iterative TD(0) updates, the certainty equivalence approach takes a model based shortcut: it first estimates the MDP dynamics (transition probabilities and expected rewards) from observed episodes, then directly solves for the value function that satisfies the Bellman equation under the estimated model. Given: episodes: a list of episodes, where each episode is a list of (state, reward, next state) tuples representing observed transitions n states: the total number of states (labeled 0 through n states 1) gamma: the discount factor Your function should: 1. Build an empirical MDP model from the observed data 2. Compute the value function that is consistent with this estimated model by solving the appropriate linear system States that are never visited as a source state should have a value of 0. Return the value function as a 1D numpy array of shape (n states,). Use only NumPy.