Control Variates in Off-Policy Methods

Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.

Difficulty: medium.

Topics: Control Variates in Off-Policy Methods, Control Variate Coefficient, Radon-Nikodym Derivative, Baseline Subtraction, Bias-Variance Tradeoff, Weighted Importance Sampling, Reinforcement Learning, Statistical Inference, Stochastic Optimization, Information Theory, Functional Analysis, Off-Policy Evaluation, Monte Carlo Methods, Variance Reduction Techniques, Temporal Difference Learning, Importance Sampling.

In off-policy reinforcement learning, we want to evaluate or improve a target policy using data collected under a different behavior policy. Importance sampling is the standard technique for correcting the distribution mismatch, but it often suffers from high variance, especially over long trajectories.

Control variates are a classical variance reduction technique that can be applied to off-policy estimation. The idea is to use a baseline function (typically the current value estimate) to reduce variance while keeping the estimator unbiased (or nearly so). When the importance sampling ratio is 1, the baseline term vanishes; as the ratio deviates from 1, the control variate smoothly blends between the importance-corrected return and the baseline value estimate.

Implement a function that computes off-policy return estimates using per-decision importance sampling with a control variate baseline.

Your function should accept:

- rewards: a list of T rewards collected under the behavior policy
- values: a list of T+1 state value estimates [V(s_0), V(s_1), ..., V(s_T)], where the last element is the bootstrap value
- target_probs: a list of T probabilities of the taken actions under the target policy
- behavior_probs: a list of T probabilities of the taken actions under the behavior policy
- dones: a list of T done flags (1 = terminal, 0 = non-terminal)
- gamma: the discount factor

The function should process the trajectory backwards and, at each timestep, compute the control-variate-corrected return using the per-step importance sampling ratio. When a step is terminal, future returns should not propagate across the episode boundary.

Return a tuple of two lists, each rounded to 4 decimal places:

1. cv_returns: the control variate return estimates for each timestep
2. advantages: the difference between cv_returns and the corresponding state values
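The problem statement does not spell out the recursion, so the following is only a minimal sketch under an assumed form: the per-decision control variate return from the standard textbook treatment, G_t = rho_t * (r_t + gamma * (1 - done_t) * G_{t+1}) + (1 - rho_t) * V(s_t), with rho_t = target_probs[t] / behavior_probs[t] and G_T initialized to the bootstrap value V(s_T). The function name `cv_off_policy_returns` is hypothetical, and the grader may expect a different variant (for example, clipped ratios).

```python
from typing import List, Sequence, Tuple


def cv_off_policy_returns(
    rewards: Sequence[float],
    values: Sequence[float],
    target_probs: Sequence[float],
    behavior_probs: Sequence[float],
    dones: Sequence[int],
    gamma: float,
) -> Tuple[List[float], List[float]]:
    """Per-decision importance sampling returns with a value baseline.

    Assumed recursion (not given in the problem statement):
        G_t = rho_t * (r_t + gamma * (1 - done_t) * G_{t+1})
              + (1 - rho_t) * V(s_t)
    """
    T = len(rewards)
    cv_returns = [0.0] * T
    next_return = values[T]  # bootstrap from the final value estimate

    # Walk the trajectory backwards so G_{t+1} is available at step t.
    for t in reversed(range(T)):
        rho = target_probs[t] / behavior_probs[t]  # per-step IS ratio
        not_done = 1.0 - dones[t]  # zero cuts the return at episode ends
        corrected = rewards[t] + gamma * not_done * next_return
        # Blend the corrected return with the baseline V(s_t).
        g = rho * corrected + (1.0 - rho) * values[t]
        cv_returns[t] = g
        next_return = g

    # Advantages use the unrounded returns; rounding happens last.
    advantages = [g - values[t] for t, g in enumerate(cv_returns)]
    return (
        [round(g, 4) for g in cv_returns],
        [round(a, 4) for a in advantages],
    )
```

Two limiting cases make a quick sanity check under this assumed recursion: with rho_t = 1 at every step the result reduces to the ordinary discounted bootstrapped return, and with rho_t = 0 it collapses to the baseline V(s_t), giving zero advantage.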