TD Error Backpropagation in Networks

Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.

Difficulty: medium.

Topics: TD Error Backpropagation in Networks, TD Error, Eligibility Traces, Chain Rule, Bellman Equation, Weight Update Rule, Reinforcement Learning, Deep Learning, Optimization Theory, Stochastic Processes, Computational Statistics, Temporal Difference Learning, Backpropagation Algorithms, Function Approximation, Dynamic Programming, Gradient-based Learning.

Implement a single step of semi-gradient TD(0) learning using a neural network as a value function approximator.

In reinforcement learning with function approximation, a neural network is used to estimate state values V(s). When an agent transitions from state s to next state s' and receives reward r, a temporal difference (TD) error is computed and used to update the network weights.

Your function should:

1. Perform a forward pass through a two-layer neural network (one hidden layer with ReLU activation, one linear output) for both the current state and the next state to obtain value estimates.
2. Compute the TD error from the reward, discount factor, and value estimates.
3. Compute gradients of the current state value V(s) with respect to all network parameters.
4. Update the network parameters using the semi-gradient TD rule (only differentiate through V(s); treat V(s') as a fixed target).
5. If the transition is to a terminal state, treat the next state value as zero.

The network architecture:

Input: state vector of dimension d
Hidden layer: h neurons, ReLU activation
Output layer: single linear output (the value estimate)
Parameters: W1 (d, h), b1 (h,), W2 (h, 1), b2 (1,)

Return a tuple of (td error, updated W1, updated b1, updated W2, updated b2).
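
One possible way to carry out the steps above is sketched here in NumPy. It uses the standard TD(0) quantities: the TD error delta = r + gamma * V(s') - V(s), and the semi-gradient update theta <- theta + alpha * delta * grad V(s) for each parameter theta. The problem statement does not fix a function signature, so the name td0_step, the argument order, the done flag, and the defaults for gamma and lr are assumptions made for illustration. The ReLU derivative at exactly zero is taken to be 0, which is also an assumption.

```python
import numpy as np


def td0_step(state, next_state, reward, done, W1, b1, W2, b2,
             gamma=0.99, lr=0.01):
    """One semi-gradient TD(0) update (hypothetical signature).

    Shapes: state, next_state (d,); W1 (d, h); b1 (h,); W2 (h, 1); b2 (1,).
    """
    state = np.asarray(state, dtype=float)
    next_state = np.asarray(next_state, dtype=float)

    # Forward pass for the current state s.
    z1 = state @ W1 + b1              # hidden pre-activation, shape (h,)
    h1 = np.maximum(z1, 0.0)          # ReLU
    v = (h1 @ W2 + b2).item()         # scalar V(s)

    # Forward pass for s'; a terminal next state has value zero.
    if done:
        v_next = 0.0
    else:
        h_next = np.maximum(next_state @ W1 + b1, 0.0)
        v_next = (h_next @ W2 + b2).item()

    # TD error. V(s') is treated as a constant target (semi-gradient).
    td_error = reward + gamma * v_next - v

    # Gradients of V(s) with respect to each parameter, via the chain rule.
    grad_b2 = np.ones_like(b2, dtype=float)   # dV/db2 = 1
    grad_W2 = h1.reshape(-1, 1)               # dV/dW2 = hidden activations
    grad_z1 = W2[:, 0] * (z1 > 0)             # back through ReLU (0 at z1 <= 0)
    grad_b1 = grad_z1                         # dV/db1
    grad_W1 = np.outer(state, grad_z1)        # dV/dW1, shape (d, h)

    # Semi-gradient update: theta <- theta + lr * td_error * grad V(s).
    new_W1 = W1 + lr * td_error * grad_W1
    new_b1 = b1 + lr * td_error * grad_b1
    new_W2 = W2 + lr * td_error * grad_W2
    new_b2 = b2 + lr * td_error * grad_b2

    return td_error, new_W1, new_b1, new_W2, new_b2
```

The update returns new arrays rather than modifying the inputs in place, so the caller's parameters are left untouched. Note that the sign of the update follows the TD error: a positive error raises V(s), a negative one lowers it.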