Distributed On-Policy Reinforcement Learning

Advanced & Deep RL DS practice problem on Onlearn.

Difficulty: hard.

Topics: Understanding Distributed On-Policy Actor-Critic Architectures, PPO Clipped Objective, Advantage Estimation (GAE), Parameter Server Pattern, Experience Trajectory Buffers, Log-Probability Ratio Calculation, Reinforcement Learning, Distributed Systems, Calculus and Optimization, Parallel Computing, Probability Theory, On-Policy Learning, Actor-Critic Architectures, Asynchronous Gradient Updates, Policy Gradient Theorem, Importance Sampling.

Implement a simplified 'Distributed Actor' class that computes the clipped surrogate objective of PPO (Proximal Policy Optimization). Given a batch of experiences (states, actions, the log-probabilities of those actions under the policy that collected them, and advantage estimates), return the clipped surrogate loss that the learner node uses for a gradient update step.
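One possible shape for such a class is sketched below in pure Python. It is not a reference solution. It assumes the standard PPO form: the probability ratio is exp(new_log_prob - old_log_prob), each sample contributes min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A), and the loss is the negated batch mean so that a learner can minimize it. The names DistributedActor, policy_log_prob, clip_epsilon, and compute_surrogate_loss are hypothetical, and the default of 0.2 for the clip range is a commonly used value rather than something the problem specifies.

```python
import math
from typing import Callable, Sequence


class DistributedActor:
    """Hypothetical actor that evaluates the PPO clipped surrogate loss on a batch."""

    def __init__(
        self,
        policy_log_prob: Callable[[Sequence[float], int], float],
        clip_epsilon: float = 0.2,  # assumed default, not given by the problem
    ) -> None:
        # policy_log_prob(state, action) returns log pi(action | state)
        # under the CURRENT policy parameters held by this actor.
        self.policy_log_prob = policy_log_prob
        self.clip_epsilon = clip_epsilon

    def compute_surrogate_loss(
        self,
        states: Sequence[Sequence[float]],
        actions: Sequence[int],
        old_log_probs: Sequence[float],
        advantages: Sequence[float],
    ) -> float:
        n = len(states)
        if n == 0:
            raise ValueError("batch must not be empty")
        if not (len(actions) == len(old_log_probs) == len(advantages) == n):
            raise ValueError("all batch fields must have the same length")

        low = 1.0 - self.clip_epsilon
        high = 1.0 + self.clip_epsilon
        total = 0.0
        for state, action, old_lp, adv in zip(states, actions, old_log_probs, advantages):
            new_lp = self.policy_log_prob(state, action)
            # Importance ratio pi_new / pi_old, computed in log space.
            ratio = math.exp(new_lp - old_lp)
            clipped_ratio = min(max(ratio, low), high)
            # Pessimistic bound: take the smaller of the two surrogate terms.
            total += min(ratio * adv, clipped_ratio * adv)

        # Negate the mean objective so the result is a loss to minimize.
        return -total / n


def uniform_policy(state: Sequence[float], action: int) -> float:
    """Toy two-action policy for illustration: every action has probability 0.5."""
    return math.log(0.5)


actor = DistributedActor(uniform_policy, clip_epsilon=0.2)
loss = actor.compute_surrogate_loss(
    states=[[0.0], [1.0]],
    actions=[0, 1],
    old_log_probs=[math.log(0.25), math.log(1.0)],  # ratios of 2.0 and 0.5
    advantages=[1.0, -1.0],
)
```

In this toy batch both samples are clipped: the first ratio of 2.0 is capped at 1.2 with a positive advantage, and the second ratio of 0.5 is raised to 0.8 with a negative advantage, so the per-sample terms are about 1.2 and -0.8 and the loss is about -0.2. In a real distributed setup the scalar would normally be a differentiable tensor from an autodiff framework so the learner can backpropagate through it; the plain float here only illustrates the arithmetic.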