Policy Gradient with REINFORCE

Foundations & Tabular RL DS practice problem on Onlearn.

Difficulty: hard.

Topics: Understanding Policy Gradient with REINFORCE, Softmax Activation, Log-Likelihood Gradient, Monte Carlo Returns, Episode Trajectory, Policy Parameterization, Reinforcement Learning, Probability Theory, Numerical Optimization, Calculus, Linear Algebra, Policy Gradient Methods, Stochastic Processes, Gradient-based Learning, Differential Calculus, Matrix Operations.

Implement the policy gradient estimator using the REINFORCE algorithm. The policy is parameterized by a 2D NumPy array theta of shape (num_states, num_actions). The policy for each state s is the softmax over theta[s, :]. Given a list of episodes (each a list of (state, action, reward) tuples), compute the average gradient of the log-policy, with the gradient at each time step multiplied by the return from that time step.
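
One possible reading of the task can be sketched as follows. This is a minimal sketch, not a reference solution: the function names softmax and compute_policy_gradient are hypothetical, and it assumes that returns are undiscounted reward-to-go sums and that "average" means dividing the accumulated gradient by the number of episodes. Neither assumption is stated in the problem. It relies on the standard identity that, for a softmax policy, the gradient of log pi(a | s) with respect to theta[s, :] is one_hot(a) - pi(. | s).

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1D array of logits."""
    shifted = logits - np.max(logits)  # subtract the max to avoid overflow
    exp = np.exp(shifted)
    return exp / exp.sum()


def compute_policy_gradient(theta, episodes):
    """Sketch of a REINFORCE gradient estimate for a tabular softmax policy.

    Assumptions (not stated in the problem): returns are undiscounted,
    and the accumulated gradient is divided by the number of episodes.
    """
    grad = np.zeros_like(theta, dtype=float)
    if not episodes:
        return grad  # nothing to average over

    for episode in episodes:
        rewards = [r for _, _, r in episode]
        # Reward-to-go: G_t = r_t + r_{t+1} + ... + r_T
        returns = np.cumsum(rewards[::-1])[::-1]

        for (s, a, _), g in zip(episode, returns):
            probs = softmax(theta[s, :])
            # Gradient of log pi(a | s) w.r.t. theta[s, :]: one_hot(a) - probs
            grad_log_pi = -probs
            grad_log_pi[a] += 1.0
            grad[s, :] += g * grad_log_pi

    return grad / len(episodes)
```

For example, with theta = np.zeros((2, 2)) and episodes = [[(0, 1, 0.0), (1, 0, 1.0)], [(0, 0, 0.0)]], this sketch returns [[-0.25, 0.25], [0.25, -0.25]]. Each row sums to zero because the softmax probabilities in a state sum to one.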
