Emphatic Temporal Difference Learning for Off-Policy Evaluation

Representation Learning, Advanced Theory & Miscellaneous DS practice problem on Onlearn.

Difficulty: medium.

Topics: Emphatic Temporal Difference Learning for Off-Policy Evaluation, Follow-on Trace, Emphatic Weighting, Bias-Variance Tradeoff, Fixed-Point Iteration, Behavior Policy, Reinforcement Learning, Stochastic Approximation, Statistical Learning Theory, Control Theory, Numerical Analysis, Temporal Difference Methods, Off-Policy Evaluation, Importance Sampling, Function Approximation, Convergence Analysis.

Implement the Emphatic TD(lambda) algorithm for off policy policy evaluation in a tabular setting. Standard semi gradient TD methods can diverge when used off policy with function approximation. Emphatic TD addresses this by introducing an emphasis weighting that corrects for the off policy distribution mismatch. The algorithm maintains a follow on trace that accumulates importance sampling ratios and discount factors, producing an emphasis value that appropriately weights each update. Your function should process a trajectory of experience generated by a behavior policy and evaluate a target policy. The algorithm maintains: A value table V(s) initialized from initial values Eligibility traces e(s) initialized to 0 for all states A scalar follow on trace F initialized to 0 A scalar tracking the previous importance sampling ratio, initialized to 0 Inputs: trajectory: A list of (state, action, reward, next state) tuples. next state can be None to indicate a terminal transition (value of terminal state is 0). gamma: Discount factor lam: Lambda parameter for eligibility traces alpha: Step size target policy: Dict mapping (state, action) to the probability under the target policy behavior policy: Dict mapping (state, action) to the probability under the behavior policy interest: Dict mapping state to its interest value. States not in this dict have interest 0. initial values: Dict mapping state to initial value estimate Output: Return a dictionary mapping each state (from initial values) to its final value estimate, rounded to 4 decimal places, with keys sorted alphabetically.