Deterministic Policy Gradient

Advanced & Deep RL DS practice problem on Onlearn.

Difficulty: medium.

Topics: Deterministic Policy Gradient, Reparameterization Trick, Deterministic Policy, Action-Value Gradient, Target Network Updates, Experience Replay Buffer, Reinforcement Learning, Stochastic Calculus, Numerical Optimization, Control Theory, Function Approximation, Actor-Critic Architectures, Policy Gradient Methods, Continuous Action Spaces, Value Function Estimation, Off-Policy Learning.

Implement a single update step of the Deterministic Policy Gradient (DPG) algorithm for continuous-action reinforcement learning. In many RL problems the action space is continuous, making it impractical to enumerate all actions. The DPG approach maintains a deterministic policy that maps states directly to actions, and updates it by following the gradient of the Q-function through the policy.

Your implementation uses simple linear function approximators:

Deterministic policy: The action is computed as a linear function of the state features using the parameter vector theta.

Q-function (critic): The state-action value is approximated as a linear function of the state features and the action, parameterized by the weight vector w (the first state_dim entries correspond to the state features, and the last entry corresponds to the action).

Given a batch of transitions (s, a, r, s', done), implement the following:

1. Critic update: Compute TD targets using the current policy to select next actions. Compute TD errors and update the critic weights using the mean semi-gradient.
2. Actor update: Using the current (pre-update) critic weights, compute the deterministic policy gradient by chaining the gradient of the Q-function with respect to the action through the gradient of the policy with respect to its parameters. Update the policy parameters by gradient ascent.

Both updates should use the original (pre-update) parameters for their respective gradient computations.
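
One way to write the two updates out as equations is sketched below. The symbols are notation introduced here, not given in the problem statement: d stands for state_dim, N for the batch size, and subscripts on w index its entries. The sketch assumes that terminal transitions drop the bootstrap term through the factor (1 - done), and that the actor gradient is averaged over the batch in the same way as the critic gradient; the statement only says "mean" explicitly for the critic.

```latex
\begin{align*}
\mu_\theta(s) &= \theta^\top s,
&
Q_w(s, a) &= w_{1:d}^\top s + w_{d+1}\, a \\[4pt]
y_i &= r_i + \gamma\,(1 - \mathrm{done}_i)\, Q_w\!\big(s'_i,\ \mu_\theta(s'_i)\big),
&
\delta_i &= y_i - Q_w(s_i, a_i) \\[4pt]
w_{\text{new}} &= w + \alpha_w\, \frac{1}{N} \sum_{i=1}^{N} \delta_i
\begin{bmatrix} s_i \\ a_i \end{bmatrix} \\[4pt]
\nabla_a Q_w(s, a) &= w_{d+1},
&
\nabla_\theta \mu_\theta(s) &= s \\[4pt]
\theta_{\text{new}} &= \theta + \alpha_\theta\, \frac{1}{N} \sum_{i=1}^{N} w_{d+1}\, s_i
\end{align*}
```

Because both approximators are linear, the action gradient of the critic is the constant w_{d+1}, so under these assumptions the actor step reduces to the mean state vector scaled by the critic's action weight.
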
Inputs

states: np.ndarray of shape (N, state_dim), batch of state observations
actions: np.ndarray of shape (N,), actions that were actually taken
rewards: np.ndarray of shape (N,), observed rewards
next_states: np.ndarray of shape (N, state_dim), next state observations
dones: np.ndarray of shape (N,), terminal flags (1.0 if the episode ended, 0.0 otherwise)
theta: np.ndarray of shape (state_dim,), policy parameters
w: np.ndarray of shape (state_dim + 1,), critic parameters
gamma: float, discount factor
alpha_theta: float, policy learning rate
alpha_w: float, critic learning rate

Output

Return a tuple (theta_new, w_new) where each is a list of floats rounded to 4 decimal places.
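
The specification above could be implemented along the following lines. This is a minimal sketch rather than a reference solution. The function name dpg_update is a hypothetical choice, and two readings of the statement are assumed: the bootstrap term is masked by (1 - dones) on terminal transitions, and the actor gradient is averaged over the batch (the statement only says "mean" for the critic).

```python
import numpy as np


def dpg_update(states, actions, rewards, next_states, dones,
               theta, w, gamma, alpha_theta, alpha_w):
    """One DPG step with a linear policy and a linear critic (sketch)."""
    states = np.asarray(states, dtype=float)
    actions = np.asarray(actions, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    next_states = np.asarray(next_states, dtype=float)
    dones = np.asarray(dones, dtype=float)
    theta = np.asarray(theta, dtype=float)
    w = np.asarray(w, dtype=float)

    state_dim = states.shape[1]
    w_s, w_a = w[:state_dim], w[state_dim]  # state weights, action weight

    # Critic: Q(s, a) = w_s . s + w_a * a
    q = states @ w_s + w_a * actions

    # Next actions come from the current (pre-update) policy mu(s') = theta . s'
    next_actions = next_states @ theta
    q_next = next_states @ w_s + w_a * next_actions

    # Assumption: terminal transitions do not bootstrap.
    targets = rewards + gamma * (1.0 - dones) * q_next
    td_errors = targets - q

    # Mean semi-gradient step: the target is treated as a constant, so the
    # gradient of Q with respect to w is just the feature vector [s, a].
    features = np.column_stack([states, actions])
    w_new = w + alpha_w * (td_errors[:, None] * features).mean(axis=0)

    # Actor: dQ/da = w_a and dmu/dtheta = s, chained and averaged over the
    # batch (assumption), using the pre-update critic weight w_a.
    actor_grad = (w_a * states).mean(axis=0)
    theta_new = theta + alpha_theta * actor_grad

    return np.round(theta_new, 4).tolist(), np.round(w_new, 4).tolist()
```

Both updates read only the original theta and w, so the order in which the two new parameter vectors are computed does not affect the result.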