Trust Region Policy Optimization (TRPO)

Advanced & Deep RL DS practice problem on Onlearn.

Difficulty: medium.

Topics: Trust Region Policy Optimization (TRPO), Kullback-Leibler Divergence, Conjugate Gradient Algorithm, Fisher Information Matrix, Surrogate Objective Function, Monotonic Improvement Guarantee, Reinforcement Learning, Numerical Optimization, Probability Theory, Functional Analysis, Stochastic Calculus, Policy Gradient Methods, Constrained Optimization, Information Geometry, Approximate Dynamic Programming, Second-Order Methods.

Implement a single TRPO update step for a tabular softmax policy in a discrete state action environment. TRPO is a policy optimization algorithm that maximizes a surrogate objective subject to a constraint on how much the policy is allowed to change (measured by KL divergence). This prevents destructive large updates and ensures stable learning. Your function should: 1. Compute the softmax policy from the current logits 2. Compute the gradient of the surrogate advantage objective evaluated at the current parameters 3. Build the Fisher information matrix from the current policy 4. Use the conjugate gradient method to approximately solve for the natural gradient direction 5. Compute the maximum step size that satisfies the KL constraint 6. Perform a backtracking line search that checks both the KL constraint and surrogate improvement 7. Return the updated (flat) parameter vector The surrogate objective is the expected importance weighted advantage. The KL divergence is computed as the state visitation frequency weighted average of per state KL divergences from the old policy to the new policy. The Fisher information matrix should also be weighted by state visitation frequency and should use a small damping coefficient of 1e 8 for numerical stability. If the line search finds no acceptable step, return the original parameters unchanged. Args: theta: 1D numpy array of policy logits, length num states num actions states: 1D integer array of visited state indices, shape (N,) actions: 1D integer array of taken action indices, shape (N,) advantages: 1D float array of estimated advantages, shape (N,) num states: int, number of discrete states num actions: int, number of discrete actions delta: float, maximum allowed KL divergence cg iters: int, number of conjugate gradient iterations (default 10) line search steps: int, maximum line search iterations (default 10) line search decay: float, multiplicative decay for line search (default 0.5) Returns: Updated theta as a 1D numpy array