Actor-Critic Algorithm

Advanced & Deep RL DS practice problem on Onlearn.

Difficulty: medium.

Topics: Actor-Critic Algorithm, Advantage Function, Bootstrapping, Policy Entropy Regularization, Generalized Advantage Estimation, Target Network Synchronization, Reinforcement Learning, Stochastic Optimization, Probability Theory, Dynamic Programming, Function Approximation, Policy Gradient Methods, Temporal Difference Learning, Value Function Estimation, Variance Reduction Techniques, Neural Network Architectures.

Implement a single step of the one-step Actor-Critic algorithm with linear function approximation for both the actor and the critic.

The algorithm combines two components:

A Critic that estimates the state-value function V(s) using a linear model: V(s) = w^T phi(s), where phi(s) is the state feature vector and w is the critic weight vector.

An Actor that maintains a parameterized softmax policy over actions using linear features: the logits for all actions are computed as phi(s)^T theta, where theta is a matrix of shape (n_features, n_actions). The policy probabilities are the softmax of these logits.

Given a single transition (s, a, r, s', done), your function should:

1. Compute the TD error using the critic's value estimates.
2. Update the critic weights using semi-gradient TD(0).
3. Compute the softmax action probabilities under the current actor parameters (before the update).
4. Update the actor parameters using the policy gradient theorem: use the score function (the gradient of the log policy) for a softmax policy with linear features, with the TD error as the advantage estimate.
5. Return the updated actor parameters, the updated critic weights, the TD error, and the action probabilities (before the update).

Use a numerically stable softmax (subtract the max logit before exponentiation). Use only NumPy.
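The five steps above can be sketched as follows. This is one possible reading of the task, not a reference solution. The function name `actor_critic_step`, the argument order, the learning rates `alpha_actor` and `alpha_critic`, and the discount factor `gamma` are all assumptions, since the statement does not specify a signature. The sketch also assumes that V(s') is treated as 0 when `done` is true, that the actor uses the TD error computed before the critic update, and that no extra discount-accumulation factor is applied to the actor update.

```python
import numpy as np


def actor_critic_step(phi_s, phi_s_next, action, reward, done,
                      theta, w,
                      alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
    """One-step Actor-Critic update with linear features (hypothetical signature).

    phi_s, phi_s_next : feature vectors of s and s', shape (n_features,)
    theta             : actor parameters, shape (n_features, n_actions)
    w                 : critic weights, shape (n_features,)
    """
    phi_s = np.asarray(phi_s, dtype=float)
    phi_s_next = np.asarray(phi_s_next, dtype=float)
    theta = np.asarray(theta, dtype=float)
    w = np.asarray(w, dtype=float)

    # 1. TD error: delta = r + gamma * V(s') - V(s).
    #    Assumption: the bootstrap term is dropped on terminal transitions.
    v_s = w @ phi_s
    v_next = 0.0 if done else w @ phi_s_next
    td_error = reward + gamma * v_next - v_s

    # 2. Semi-gradient TD(0): the gradient of V(s) = w^T phi(s) w.r.t. w is phi(s).
    w_new = w + alpha_critic * td_error * phi_s

    # 3. Numerically stable softmax over the logits phi(s)^T theta,
    #    computed with the actor parameters from before the update.
    logits = phi_s @ theta
    exp_logits = np.exp(logits - np.max(logits))
    probs = exp_logits / exp_logits.sum()

    # 4. Score function of a linear softmax policy:
    #    grad_theta log pi(a|s) = outer(phi(s), one_hot(a) - pi(.|s)).
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    grad_log_pi = np.outer(phi_s, one_hot - probs)
    theta_new = theta + alpha_actor * td_error * grad_log_pi

    # 5. Updated parameters, TD error, and pre-update action probabilities.
    return theta_new, w_new, float(td_error), probs
```

With zero-initialized `theta` every action starts with equal probability, so after one step only the rows of `theta` that correspond to nonzero features of s change: the column of the chosen action moves in the direction of the TD error's sign and the other columns move the opposite way.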