Off-line vs On-line TD(0) Prediction
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: Off-line vs On-line TD(0) Prediction, TD(0) Update Rule, Batch vs Incremental Updates, Mean Squared Value Error, Robbins-Monro Conditions, Contraction Mapping, Reinforcement Learning, Stochastic Processes, Dynamic Programming, Statistical Estimation, Computational Complexity, Temporal Difference Learning, Markov Decision Processes, Iterative Policy Evaluation, Bootstrapping Methods, Stochastic Approximation.
Implement a function that performs TD(0) value prediction in either online or offline mode, given a batch of complete episodes. In both modes, you maintain a value function V initialized to zeros for all states. For each transition (state, reward, next state, done) in an episode, compute the TD target: If the transition is terminal (done=True), the target is simply the reward (no bootstrapping). Otherwise, the target is the reward plus the discounted estimated value of the next state. The two modes differ in when updates to V take effect: Online mode ("online"): After each transition within an episode, immediately update V. Subsequent transitions in the same episode use the already updated value function. Offline mode ("offline"): Accumulate all value changes for an episode without modifying V during that episode. All transitions within the same episode use the value function as it was at the start of the episode. Apply the accumulated changes only after processing the entire episode. Process episodes in order. For each transition, the update magnitude is alpha times the TD error. Args: episodes: list of episodes, each a list of (state, reward, next state, done) tuples n states: int, number of states (indexed 0 to n states 1) gamma: float, discount factor alpha: float, learning rate mode: str, either "online" or "offline" Return the final value function as a list of floats rounded to 4 decimal places.