Distributional RL: Categorical Projection (C51)

Advanced & Deep RL DS practice problem on Onlearn.

Difficulty: hard.

Topics: Distributional RL: Categorical Projection (C51), Categorical Projection, Kullback-Leibler Divergence, Wasserstein Metric, Support Atoms, Softmax Cross-Entropy, Reinforcement Learning, Probability Theory, Deep Learning, Numerical Analysis, Stochastic Processes, Distributional Reinforcement Learning, Bellman Operators, Neural Network Architectures, Optimization Algorithms, Information Theory.

Implement the categorical distributional Bellman projection and the corresponding cross-entropy loss used in distributional reinforcement learning. In distributional RL, instead of estimating the expected return, we estimate the full probability distribution of returns. The return distribution is represented using a discrete set of evenly spaced atoms (support points) with associated probabilities.

Given a transition (reward, done flag) and the current and next-state return distributions, your function should:

1. Construct the support atoms: N evenly spaced values from v_min to v_max.
2. For each atom in the next-state distribution, compute where the Bellman-updated atom lands on the support, clip it to [v_min, v_max], and distribute its probability mass to the two nearest support atoms proportionally to how close it is to each.
3. When the transition is terminal (done=1), ignore the bootstrapped next-state values (the projected atom is just the reward).
4. Compute the cross-entropy loss between the projected target distribution and the current predicted distribution. Use a clipping value of 1e-8 on the predicted probabilities to avoid log(0).

Inputs:

current_probs: numpy array of shape (n_atoms,), predicted return distribution for (s, a)
next_probs: numpy array of shape (n_atoms,), return distribution for the greedy action at s'
reward: float, the observed reward
gamma: float, discount factor
done: int, 1 if terminal, 0 otherwise
v_min: float, minimum value of the support
v_max: float, maximum value of the support
n_atoms: int, number of atoms in the support

Return a tuple (projected, loss), where projected is a numpy array of shape (n_atoms,) representing the target distribution after projection, and loss is a float representing the cross-entropy loss.
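
One possible way to approach the steps above is sketched here in NumPy. This is an illustrative sketch, not a reference solution: the function name categorical_projection_loss is hypothetical, the code assumes n_atoms is at least 2 and that next_probs already sums to 1, and it interprets "clipping value of 1e-8" as a lower bound applied to current_probs before taking the log.

```python
import numpy as np


def categorical_projection_loss(current_probs, next_probs, reward, gamma,
                                done, v_min, v_max, n_atoms):
    """Hypothetical helper: C51-style projection plus cross-entropy loss."""
    current_probs = np.asarray(current_probs, dtype=np.float64)
    next_probs = np.asarray(next_probs, dtype=np.float64)

    # Step 1: evenly spaced support atoms and their spacing.
    support = np.linspace(v_min, v_max, n_atoms)
    delta_z = (v_max - v_min) / (n_atoms - 1)  # assumes n_atoms >= 2

    # Steps 2-3: Bellman-updated atoms; when done == 1 the bootstrap
    # term vanishes and every atom collapses onto the reward.
    tz = reward + (1.0 - done) * gamma * support
    tz = np.clip(tz, v_min, v_max)

    # Fractional index of each updated atom on the support grid.
    b = (tz - v_min) / delta_z
    lower = np.clip(np.floor(b).astype(int), 0, n_atoms - 1)
    upper = np.clip(np.ceil(b).astype(int), 0, n_atoms - 1)

    # If an updated atom lands exactly on a support atom (lower == upper),
    # both interpolation weights would be 0, so give it all the mass.
    exact = lower == upper
    w_lower = np.where(exact, 1.0, upper - b)
    w_upper = np.where(exact, 0.0, b - lower)

    # np.add.at accumulates correctly when indices repeat.
    projected = np.zeros(n_atoms, dtype=np.float64)
    np.add.at(projected, lower, next_probs * w_lower)
    np.add.at(projected, upper, next_probs * w_upper)

    # Step 4: cross-entropy against the clipped prediction.
    safe_probs = np.clip(current_probs, 1e-8, 1.0)
    loss = -np.sum(projected * np.log(safe_probs))
    return projected, float(loss)


# Small usage example with a 3-atom support [0, 1, 2].
projected, loss = categorical_projection_loss(
    current_probs=np.array([1 / 3, 1 / 3, 1 / 3]),
    next_probs=np.array([0.2, 0.3, 0.5]),
    reward=0.5, gamma=0.5, done=0,
    v_min=0.0, v_max=2.0, n_atoms=3,
)
print(projected, loss)
```

The explicit handling of the lower == upper case matters: without it, an updated atom that lands exactly on a support point would receive zero weight on both sides and its probability mass would be silently dropped.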