Gaussian Policy for Continuous Control
Advanced & Deep RL DS practice problem on Onlearn.
Difficulty: medium.
Topics: Gaussian Policy for Continuous Control, Reparameterization Trick, Log-Likelihood Gradient, Covariance Matrix Parameterization, Kullback-Leibler Divergence, Exploration-Exploitation Tradeoff, Probability Theory, Stochastic Optimization, Control Theory, Deep Learning, Reinforcement Learning, Multivariate Distributions, Policy Gradient Methods, Function Approximation, Information Geometry, Entropy Regularization.
In reinforcement learning with continuous action spaces, a common approach is to parameterize the policy as a Gaussian (normal) distribution. The policy maps state features to a distribution over actions, from which actions are sampled during exploration.

Implement a function gaussian_policy that computes the key quantities of a Gaussian policy for a single continuous action dimension:

1. The policy mean μ, computed as a linear function of the state features using learned weight parameters (μ = w · s).
2. The probability density (PDF) of a given action under the policy.
3. The log probability of the action, which is essential for policy gradient methods.
4. The score function ∇_w log π(a | s), which is the gradient of the log probability with respect to the mean weight parameters. This gradient is the central quantity used in REINFORCE and other policy gradient algorithms to update the policy.

The function takes:

- state: a numpy array of shape (d,) representing state features
- mean_weights: a numpy array of shape (d,) representing the policy parameters
- std: a positive float representing the fixed standard deviation of the policy
- action: a float representing the action to evaluate

Return a dictionary with keys:

- 'mean': the policy mean (float, rounded to 4 decimal places)
- 'pdf': the probability density at the action (float, rounded to 4 decimal places)
- 'log_prob': the log probability of the action (float, rounded to 4 decimal places)
- 'score': the gradient of the log probability with respect to the mean weights (list of floats, rounded to 4 decimal places)
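One possible solution is sketched below; it is not an official reference answer. It relies on the standard univariate Gaussian identities: with μ = w · s and fixed σ, log π(a | s) = −½ log(2πσ²) − (a − μ)² / (2σ²), and the gradient with respect to the weights is ((a − μ) / σ²) · s. The underscored names gaussian_policy, mean_weights, and 'log_prob' are assumptions about how the identifiers are spelled, and the ValueError check on std is an added safeguard that the problem statement does not ask for.

```python
import math

import numpy as np


def gaussian_policy(state, mean_weights, std, action):
    """Evaluate a 1-D Gaussian policy with a linear mean and fixed std."""
    state = np.asarray(state, dtype=float)
    mean_weights = np.asarray(mean_weights, dtype=float)
    if std <= 0:
        # Assumption: reject invalid input instead of returning NaN/inf.
        raise ValueError("std must be positive")

    # 1. Policy mean: linear in the state features (no bias term).
    mean = float(np.dot(mean_weights, state))

    diff = action - mean
    var = std ** 2

    # 3. Log probability of the action under N(mean, std^2).
    log_prob = -0.5 * math.log(2.0 * math.pi * var) - diff ** 2 / (2.0 * var)

    # 2. Density, obtained by exponentiating the log probability.
    pdf = math.exp(log_prob)

    # 4. Score function: d log_prob / d mean_weights = (diff / var) * state.
    score = (diff / var) * state

    return {
        "mean": round(mean, 4),
        "pdf": round(pdf, 4),
        "log_prob": round(log_prob, 4),
        "score": [round(float(g), 4) for g in score],
    }


if __name__ == "__main__":
    result = gaussian_policy(
        state=np.array([1.0, 2.0]),
        mean_weights=np.array([0.5, -0.25]),
        std=1.0,
        action=0.5,
    )
    print(result)
    # {'mean': 0.0, 'pdf': 0.3521, 'log_prob': -1.0439, 'score': [0.5, 1.0]}
```

In this example the mean is 0.5·1 + (−0.25)·2 = 0, so the action 0.5 lies half a standard deviation above the mean. The score therefore points along the state vector, which is the direction that moves the mean toward the sampled action.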