Natural Policy Gradient
Advanced & Deep RL DS practice problem on Onlearn.
Difficulty: medium.
Topics: Natural Policy Gradient, Fisher Information Matrix, Kullback-Leibler Divergence, Conjugate Gradient Descent, Natural Gradient Update, Parametric Policy Distribution, Reinforcement Learning, Numerical Optimization, Information Geometry, Stochastic Calculus, Functional Analysis, Policy Gradient Methods, Second-Order Optimization, Riemannian Manifold Learning, Trust Region Methods, Fisher Information Theory.
Implement a single step of the Natural Policy Gradient (NPG) algorithm for a softmax policy with linear function approximation. In standard policy gradient methods, the gradient direction in parameter space does not account for the geometry of the policy distribution. The natural policy gradient addresses this by preconditioning the gradient with the inverse of the Fisher Information Matrix (FIM), which captures the curvature of the policy's probability distribution. You are given: theta: a numpy array of shape (d,) representing the current policy parameters trajectories: a list of episodes, where each episode is a list of (state, action, reward) tuples feature vectors: a dictionary mapping (state, action) tuples to feature vectors (lists of floats) of dimension d actions: a list of all possible actions (same for every state) gamma: discount factor alpha: learning rate for the NPG update epsilon: small constant added to the FIM diagonal for numerical stability (default 1e 4) The policy is a softmax over linear scores: for a given state, the probability of each action is proportional to exp(theta^T phi(s, a)), where phi(s, a) is the feature vector. Your function should compute and return a dictionary with: 'vanilla gradient': the standard REINFORCE policy gradient (averaged over trajectories), as a numpy array of shape (d,) 'fisher matrix': the empirical Fisher Information Matrix estimated from the trajectory data, as a numpy array of shape (d, d). This should be the average of the outer products of score vectors over all state action pairs across all trajectories. 'natural gradient': the natural gradient obtained by solving (F + epsilon I) ng = vanilla gradient, as a numpy array of shape (d,) 'updated theta': the updated parameter vector theta + alpha natural gradient, as a numpy array of shape (d,)