Full vs Sample Backup Comparison

Representation Learning, Advanced Theory & Miscellaneous DS practice problem on Onlearn.

Difficulty: medium.

Topics: Full vs Sample Backup Comparison, Incremental Checkpointing, Reservoir Sampling, Entropy Estimation, Write-Ahead Logging, Bit-level Redundancy, Data Engineering, Distributed Systems, Information Theory, Statistical Sampling, Database Management, Storage Architecture, Data Compression, Probabilistic Modeling, Consistency Protocols, Resource Allocation.

Implement a function compare full vs sample backup that performs policy evaluation using two different backup strategies on a given MDP and returns the resulting value estimates from each. Full (Expected) Backup: Uses the complete environment model to compute the exact expected value for each state. In each sweep, every state's value is updated synchronously using all possible transitions weighted by their probabilities. The new value for state s is computed entirely from the old values before any state in that sweep is updated. Sample Backup: Uses a single sampled transition per state per sweep. For each state, an action is sampled from the policy, a next state is sampled from the transition model, and a TD(0) style update is applied using a learning rate alpha. Updates are applied in place (state 0 first, then state 1, etc.). Parameters: transition probs: Nested list of shape (S, A, S) representing P(s'|s,a) rewards: Nested list of shape (S, A, S) representing R(s,a,s') policy: Nested list of shape (S, A) representing pi(a|s) gamma: Discount factor num sweeps: Number of complete sweeps through all states alpha: Learning rate for sample backup seed: Random seed set once at the start Returns: A tuple (full values, sample values) where each is a list of floats rounded to 4 decimal places, representing the learned value of each state after all sweeps. Use np.random.seed(seed) once at the start. Use np.random.choice for sampling actions and next states.