Q(σ) Unified Algorithm

Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.

Difficulty: medium.

Topics: Q(σ) Unified Algorithm, Multi-step Returns, Importance Sampling, Eligibility Traces, Stochastic Sampling Parameter, Bellman Operator, Reinforcement Learning, Stochastic Processes, Dynamic Programming, Statistical Estimation, Control Theory, Temporal Difference Learning, Monte Carlo Methods, Policy Evaluation, Bootstrapping Techniques, Variance Reduction.

Implement a function that computes the n step Q(sigma) return, a unified multi step temporal difference target that smoothly interpolates between sampling based and expectation based backup strategies. The Q(sigma) framework uses a per step parameter sigma in [0, 1] to control the degree of sampling versus expectation at each transition. When sigma=1 at a given step, the algorithm uses the actual sampled action (like Sarsa with importance sampling). When sigma=0, it uses the expected value under the target policy (like a tree backup). Values between 0 and 1 produce a weighted blend of both. Given a trajectory segment of n+1 states, n+1 actions, n rewards, and n+1 sigma values, along with the current Q value estimates, a target policy, a behavior policy, and a discount factor, compute the n step Q(sigma) return starting from the first time step. The computation bootstraps from the Q value at the final state action pair, then works backwards through the trajectory, combining reward, importance sampling ratios, policy probabilities, and expected values according to each step's sigma. Return the computed return as a single float.