Budget-Constrained RL Loss

Advanced & Deep RL DS practice problem on Onlearn.

Difficulty: medium.

Topics: Budget-Constrained RL Loss, Kullback-Leibler Divergence, Primal-Dual Updates, Softmax Activation, Bellman Equation, Lagrangian Penalty Coefficient, Reinforcement Learning, Constrained Optimization, Deep Learning, Stochastic Processes, Control Theory, Policy Gradient Methods, Lagrangian Multipliers, Neural Network Architectures, Markov Decision Processes, Dynamic Programming.

Implement the budget-constrained reinforcement learning loss function from the Kimi K2 paper. In K2's RL training, a 'Budget Control' mechanism penalizes responses that exceed a token budget, which improves inference efficiency. The base RL loss uses a squared-advantage form with KL regularization. Your task is to implement the complete loss computation, including the budget penalty.
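
The statement above does not give the exact formula, so the following is only a minimal pure-Python sketch of one plausible reading, not the paper's reference implementation. It assumes that the budget penalty is a fixed amount subtracted from the reward of any response longer than the budget, that the advantage is the penalized reward minus the group mean, and that KL regularization enters as a tau-weighted sequence-level log-probability ratio inside the square. The function name, the argument names, and the defaults for `tau` and `penalty` are all hypothetical.

```python
def budget_constrained_loss(rewards, logp_new, logp_old, lengths,
                            budget, tau=0.1, penalty=1.0):
    """Squared-advantage loss with a KL-style term and a budget penalty.

    All inputs are per-response lists for one prompt:
      rewards  - scalar reward of each sampled response
      logp_new - sequence log-probability under the current policy
      logp_old - sequence log-probability under the sampling policy
      lengths  - token count of each response
    The penalty form (flat subtraction when over budget) is an assumption.
    """
    k = len(rewards)
    if k == 0 or not (k == len(logp_new) == len(logp_old) == len(lengths)):
        raise ValueError("inputs must be non-empty and of equal length")

    # Assumed budget control: over-budget responses lose `penalty` reward.
    shaped = [(r - penalty) if n > budget else r
              for r, n in zip(rewards, lengths)]

    # Group-mean baseline turns shaped rewards into advantages.
    baseline = sum(shaped) / k

    # Residual = advantage minus the tau-weighted log-ratio (KL-style term).
    residuals = [s - baseline - tau * (ln - lo)
                 for s, ln, lo in zip(shaped, logp_new, logp_old)]

    # Mean of squared residuals over the sampled responses.
    return sum(e * e for e in residuals) / k


# Two responses, identical old and new log-probs, both within budget.
loss = budget_constrained_loss([1.0, 0.0], [-2.0, -3.0], [-2.0, -3.0],
                               [5, 5], budget=10)
print(loss)  # 0.25
```

In this sketch, pushing the second response over the budget lowers its shaped reward, which widens the gap to the baseline and raises the loss. A real solution should follow whatever penalty form and normalization the problem's full specification gives.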