REINFORCE with Value Baseline
Advanced & Deep RL DS practice problem on Onlearn.
Difficulty: medium.
Topics: REINFORCE with Value Baseline, Advantage Function, Baseline Subtraction, Log-Likelihood Gradient, Discounted Cumulative Reward, Policy Entropy Regularization, Reinforcement Learning, Stochastic Optimization, Probability Theory, Deep Learning, Control Theory, Policy Gradient Methods, Variance Reduction Techniques, Function Approximation, Monte Carlo Estimation, Temporal Difference Learning.
Implement the REINFORCE policy gradient algorithm with a learned value function baseline for episodic tasks. Given a single episode of experience and the current parameters, your function should perform one full episode update of both the policy parameters and the value function parameters.

The policy is a tabular softmax policy over discrete states and actions, parameterized by a matrix theta of shape (n_states, n_actions). The value function is tabular, parameterized by a vector w of shape (n_states,).

Your function should:
1. Compute the discounted return $G_t$ for each timestep in the episode.
2. Process each timestep sequentially (from t = 0 to T - 1). At each step:
   - Compute the current softmax action probabilities from theta.
   - Compute the advantage as the difference between the return $G_t$ and the current value estimate of the visited state.
   - Update the value parameters using the advantage.
   - Update the policy parameters using the policy gradient scaled by the discount factor $\gamma^t$ and the advantage.

Note that updates are applied sequentially within the episode, so earlier updates affect later computations when the same state is revisited.

Parameters:
- episode: list of (state, action, reward) tuples
- theta: numpy array of shape (n_states, n_actions), initial policy parameters
- w: numpy array of shape (n_states,), initial value function parameters
- gamma: float, discount factor
- alpha_theta: float, learning rate for policy parameters
- alpha_w: float, learning rate for value function parameters

Returns:
A tuple (theta_new, w_new) containing the updated parameters after processing the full episode.
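The steps above can be sketched in NumPy as follows. This is a minimal sketch and not a reference solution. The names `reinforce_with_baseline` and `softmax` are hypothetical. The code assumes that states and actions are integer indices, that the reward stored in tuple t is the one received after taking the action at step t (so $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k$), and that the inputs should be copied instead of modified in place. For a tabular softmax policy, the gradient of $\log \pi(a \mid s)$ with respect to the row theta[s] is the one-hot vector for a minus the action probabilities, which is what the policy update uses.

```python
import numpy as np


def softmax(logits):
    # Subtracting the max improves numerical stability without changing the result.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()


def reinforce_with_baseline(episode, theta, w, gamma, alpha_theta, alpha_w):
    # Assumption: work on float copies so the caller's arrays are left untouched.
    theta = np.array(theta, dtype=float)
    w = np.array(w, dtype=float)
    T = len(episode)

    # Step 1: discounted returns, computed backward via G_t = r_t + gamma * G_{t+1}.
    returns = np.zeros(T)
    G = 0.0
    for t in reversed(range(T)):
        G = episode[t][2] + gamma * G
        returns[t] = G

    # Step 2: sequential updates, so revisited states see earlier changes.
    for t, (s, a, _) in enumerate(episode):
        probs = softmax(theta[s])

        # Advantage: return minus the current value estimate of state s.
        delta = returns[t] - w[s]

        # Tabular value update: the gradient of v(s) w.r.t. w is one-hot at s.
        w[s] += alpha_w * delta

        # Gradient of log pi(a|s) w.r.t. theta[s]: one_hot(a) - probs.
        grad_log = -probs
        grad_log[a] += 1.0
        theta[s] += alpha_theta * (gamma ** t) * delta * grad_log

    return theta, w


# Small usage example with two states and two actions.
theta0 = np.zeros((2, 2))
w0 = np.zeros(2)
episode = [(0, 0, 1.0), (1, 1, 2.0)]
theta_new, w_new = reinforce_with_baseline(
    episode, theta0, w0, gamma=0.5, alpha_theta=0.1, alpha_w=0.2
)
```

In this example both returns equal 2.0, so each advantage is 2.0, giving `w_new = [0.4, 0.4]` and `theta_new = [[0.1, -0.1], [-0.05, 0.05]]` up to floating-point rounding. The advantage is computed before the value update at each step, so the policy update uses the same advantage that drove the value update.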