Per-Decision Importance Sampling for Off-Policy Evaluation

Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.

Difficulty: medium.

Topics: Per-Decision Importance Sampling for Off-Policy Evaluation, Likelihood Ratio, Trajectory Weighting, Cumulative Discounted Reward, Behavior Policy, Target Policy, Reinforcement Learning, Probability Theory, Statistical Inference, Stochastic Processes, Numerical Analysis, Off-Policy Evaluation, Monte Carlo Methods, Sequential Decision Making, Variance Reduction Techniques, Importance Sampling Theory.

Implement per decision importance sampling (PDIS) and ordinary importance sampling (OIS) for off policy value estimation in episodic reinforcement learning. Given a set of episodes collected under a behavior policy $b$, estimate the expected return of a target policy $\pi$ from the starting state of each episode. Ordinary Importance Sampling (OIS) multiplies the full trajectory importance sampling ratio by the full discounted return for each episode, then averages across episodes. Per Decision Importance Sampling (PDIS) is a variance reduction technique that applies a different importance sampling ratio to each reward in the trajectory. Specifically, each reward at time step $k$ is weighted only by the product of importance ratios from the start of the episode up to and including step $k$, rather than the full trajectory ratio. Inputs episodes: A list of episodes. Each episode is a list of (state, action, reward) tuples representing a complete trajectory under the behavior policy. The reward at step $k$ is the reward received after taking the action at step $k$. target policy: A dictionary mapping (state, action) pairs to the probability of taking that action in that state under the target policy. behavior policy: A dictionary mapping (state, action) pairs to the probability under the behavior policy. All probabilities will be positive for actions that appear in the episodes. gamma: The discount factor (float in [0, 1]). Output Return a tuple (pdis estimate, ois estimate) where both values are floats rounded to 4 decimal places. Each estimate is the simple average over all episodes.