Full TD(0) Prediction for Value Estimation
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: Full TD(0) Prediction for Value Estimation, Bootstrapping, Bellman Operator, TD Error, Learning Rate Decay, Bias-Variance Tradeoff, Reinforcement Learning, Stochastic Processes, Dynamic Programming, Statistical Estimation, Computational Complexity, Temporal Difference Learning, Markov Reward Processes, Iterative Policy Evaluation, Monte Carlo Methods, Function Approximation.
Implement the TD(0) prediction algorithm for estimating the state value function under a given policy. TD(0) is a model-free reinforcement learning method that learns value estimates from raw experience without requiring a model of the environment's dynamics.

Your task is to implement the function td0_prediction(episodes, n_states, gamma, alpha), which processes a collection of episodes and returns the learned state value function.

The function takes:
episodes: A list of episodes, where each episode is a list of tuples (state, reward, next_state, done). Each tuple represents one transition. state and next_state are integer indices. reward is a float received after leaving state. done is a boolean indicating whether the episode has terminated after this transition.
n_states: The total number of states (integer). States are indexed from 0 to n_states - 1.
gamma: The discount factor (float, default 0.99).
alpha: The step size / learning rate (float, default 0.01).

The value function should be initialized to zeros. Episodes should be processed in order, and within each episode, transitions should be processed sequentially. When a transition is terminal (done=True), the value of the successor state should be treated as zero for the update.

Return the final value function as a list of floats.
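One possible reading of the specification above is sketched below. It is an illustration, not a reference solution. It assumes the function and parameter names are td0_prediction, n_states, gamma and alpha (written with underscores), and it applies the standard tabular TD(0) update V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)], with V(s') taken as zero on terminal transitions. The Transition type alias and the local variable names are introduced here for readability only.

```python
from typing import List, Tuple

# Hypothetical alias for one transition: (state, reward, next_state, done).
Transition = Tuple[int, float, int, bool]


def td0_prediction(
    episodes: List[List[Transition]],
    n_states: int,
    gamma: float = 0.99,
    alpha: float = 0.01,
) -> List[float]:
    """Tabular TD(0) policy evaluation over pre-recorded episodes."""
    # Value function starts at zero for every state.
    values = [0.0] * n_states

    # Episodes in order, transitions in order within each episode.
    for episode in episodes:
        for state, reward, next_state, done in episode:
            # A terminal transition bootstraps from zero, not V(next_state).
            next_value = 0.0 if done else values[next_state]
            # TD error: one-step target minus the current estimate.
            td_error = reward + gamma * next_value - values[state]
            values[state] += alpha * td_error

    return values


# Small worked example with easy-to-check numbers (gamma = alpha = 0.5).
episode = [(0, 1.0, 1, False), (1, 2.0, 2, True)]
print(td0_prediction([episode], n_states=3, gamma=0.5, alpha=0.5))
# -> [0.5, 1.0, 0.0]
```

In the example, the first update uses V(1) = 0 because state 1 has not been updated yet, so V(0) becomes 0.5 * (1.0 + 0.5 * 0 - 0) = 0.5. The second transition is terminal, so V(1) becomes 0.5 * (2.0 - 0) = 1.0 regardless of the value stored for state 2. Updates are applied in place, so later transitions in the same episode see values changed by earlier ones.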