Advantage Actor-Critic (A2C) Batch Update from Parallel Environments

Advanced & Deep RL DS practice problem on Onlearn.

Difficulty: hard.

Topics: Understanding Vectorized Advantage Actor-Critic (A2C) Training Loops, Generalized Advantage Estimation (GAE), Entropy Regularization, Shared Backbone Networks, Synchronous Batch Updates, Log-Probability Gradient Calculation, Reinforcement Learning, Deep Learning, Calculus, Probability Theory, Parallel Computing, Policy Gradients, Actor-Critic Architectures, Temporal Difference Learning, Advantage Estimation, Vectorized Environments.

Implement a function calculate_a2c_update that processes a batch of experience collected from N parallel environments. Given rewards, done flags, and state values, compute the advantages using Generalized Advantage Estimation (GAE), then compute the corresponding policy and value losses. Assume the policy network outputs logits and the value network outputs scalar values.
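
One possible shape for a solution is sketched below in NumPy. The problem statement only names rewards, done flags, and state values, so everything else here is an assumption made for illustration: the (T, N) time-major layout, the extra inputs last_values (bootstrap values for the state after the final step), logits and actions, the helper names compute_gae and log_softmax, and the default hyperparameters gamma, lam, value_coef and entropy_coef. A done flag at step t is assumed to mean the episode ended after the action taken at step t. The GAE recursion used is delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t) and A_t = delta_t + gamma * lam * (1 - done_t) * A_{t+1}.

```python
import numpy as np


def compute_gae(rewards, values, dones, last_values, gamma=0.99, lam=0.95):
    """Backward GAE recursion over arrays shaped (T, N); last_values is (N,)."""
    T, N = rewards.shape
    advantages = np.zeros((T, N), dtype=np.float64)
    gae = np.zeros(N, dtype=np.float64)
    next_values = last_values
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]  # 0 where the episode ended at step t
        # One-step TD error; the bootstrap term is masked at episode ends.
        delta = rewards[t] + gamma * next_values * not_done - values[t]
        # The running advantage is also reset across episode boundaries.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_values = values[t]
    returns = advantages + values  # regression targets for the critic
    return advantages, returns


def log_softmax(logits):
    """Numerically stable log-softmax over the last (action) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def calculate_a2c_update(rewards, dones, values, last_values, logits, actions,
                         gamma=0.99, lam=0.95, value_coef=0.5, entropy_coef=0.01):
    """Return GAE advantages and A2C losses for a (T, N) batch.

    logits has shape (T, N, A); actions holds integer indices, shape (T, N).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    last_values = np.asarray(last_values, dtype=np.float64)
    logits = np.asarray(logits, dtype=np.float64)
    actions = np.asarray(actions, dtype=np.int64)

    advantages, returns = compute_gae(rewards, values, dones, last_values, gamma, lam)

    log_probs = log_softmax(logits)
    # Log-probability of the action actually taken in each (t, n) slot.
    chosen = np.take_along_axis(log_probs, actions[..., None], axis=-1).squeeze(-1)

    # Advantages act as constants here; in an autograd framework they would
    # be detached so the policy loss does not backpropagate into the critic.
    policy_loss = -(chosen * advantages).mean()
    value_loss = 0.5 * ((returns - values) ** 2).mean()
    entropy = -(np.exp(log_probs) * log_probs).sum(axis=-1).mean()
    total_loss = policy_loss + value_coef * value_loss - entropy_coef * entropy

    return {
        "advantages": advantages,
        "returns": returns,
        "policy_loss": float(policy_loss),
        "value_loss": float(value_loss),
        "entropy": float(entropy),
        "total_loss": float(total_loss),
    }
```

The sketch averages each loss over all T * N samples, which is one common way to do a synchronous batch update; advantage normalization is often applied before the policy loss but is left out here to keep the example small.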