Compare Update Strategies in Policy Evaluation

Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.

Difficulty: medium.

Topics: Compare Update Strategies in Policy Evaluation, Bootstrapping, Bellman Expectation Equation, Contraction Mapping, Bias-Variance Tradeoff, Eligibility Traces, Reinforcement Learning, Stochastic Processes, Numerical Analysis, Dynamic Programming, Statistical Inference, Temporal Difference Learning, Markov Decision Processes, Iterative Methods, Monte Carlo Methods, Function Approximation.

Implement a function compare policy evaluation that performs iterative policy evaluation using one of two update strategies: synchronous or in place . In policy evaluation, we repeatedly apply the Bellman expectation equation to update state values. Given a fixed policy, the transition dynamics can be summarized as a transition probability matrix P and an expected reward vector R, so the update rule for each state s is: V(s) = R(s) + gamma sum over all s' of P(s, s') V(s') The two strategies differ in how they apply this update across states during a single sweep: Synchronous : All states are updated using values from the previous sweep. The old values are read, and a completely new value array is produced. In place : States are updated sequentially (in order s = 0, 1, ..., n 1), and each newly computed value is immediately available for use by subsequent state updates within the same sweep. Your function should: Accept an initial value function V, transition matrix P, reward vector R, discount factor gamma, number of sweeps n sweeps, and a method string ("synchronous" or "in place") Return the value function after performing the specified number of sweeps Not modify the original input array V