Policy Gradient with REINFORCE
Foundations & Tabular RL DS practice problem on Onlearn.
Difficulty: hard.
Topics: Understanding Policy Gradient with REINFORCE, Softmax Activation, Log-Likelihood Gradient, Monte Carlo Returns, Episode Trajectory, Policy Parameterization, Reinforcement Learning, Probability Theory, Numerical Optimization, Calculus, Linear Algebra, Policy Gradient Methods, Stochastic Processes, Gradient-based Learning, Differential Calculus, Matrix Operations.
Implement the policy gradient estimator using the REINFORCE algorithm. The policy is parameterized by a 2D NumPy array theta of shape (num_states, num_actions), and the policy for each state s is the softmax over theta[s, :]. Given a list of episodes, each a list of (state, action, reward) tuples, compute the average gradient, with respect to theta, of the log policy multiplied by the return at each time step.
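
One possible reading of the task can be sketched as follows. The statement does not fix several details, so these are assumptions: the function name compute_policy_gradient and the helper softmax are hypothetical, returns are undiscounted Monte Carlo returns (the gamma parameter is an addition that defaults to 1.0), and "average" is taken to mean dividing the summed gradient by the number of episodes. The sketch relies on the standard softmax identity that the derivative of log pi(a|s) with respect to theta[s, j] equals 1[j == a] - pi(j|s).

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1D array of logits."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()


def compute_policy_gradient(theta, episodes, gamma=1.0):
    """Sketch of a REINFORCE gradient estimate for a tabular softmax policy.

    Assumptions (not stated in the problem): gamma defaults to 1.0
    (undiscounted returns) and the result is averaged over episodes.
    """
    grad = np.zeros_like(theta, dtype=float)
    if not episodes:
        return grad

    for episode in episodes:
        # Monte Carlo returns G_t, accumulated backwards through the episode.
        running = 0.0
        returns = []
        for _, _, reward in reversed(episode):
            running = reward + gamma * running
            returns.append(running)
        returns.reverse()

        for (state, action, _), g_t in zip(episode, returns):
            probs = softmax(theta[state])
            # d log pi(a|s) / d theta[s, :] = one_hot(a) - pi(.|s)
            dlog = -probs
            dlog[action] += 1.0
            grad[state] += dlog * g_t

    # Assumed averaging convention: divide by the number of episodes.
    return grad / len(episodes)


theta = np.zeros((2, 2))
episodes = [[(0, 1, 0), (1, 0, 1)], [(0, 0, 0)]]
grad = compute_policy_gradient(theta, episodes)
```

With theta all zeros the policy is uniform, so each visited state gets a gradient row of the form one_hot(action) - [0.5, 0.5] scaled by the return. Only rows of states that appear in the episodes are updated, because theta[s, :] affects the policy in state s alone. If the intended average is over time steps rather than episodes, only the final division changes.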