Expected vs Sample Updates Comparison

Representation Learning, Advanced Theory & Miscellaneous DS practice problem on Onlearn.

Difficulty: medium.

Topics: Expected vs Sample Updates Comparison, Law of Large Numbers, Gradient Noise, Importance Sampling, Lipschitz Continuity, Batch Normalization, Statistical Learning Theory, Stochastic Optimization, Information Theory, Probabilistic Graphical Models, Numerical Analysis, Empirical Risk Minimization, Stochastic Gradient Descent, Monte Carlo Methods, Convergence Analysis, Variance Reduction Techniques.

Implement a function that compares two fundamental approaches to updating action value estimates in reinforcement learning: expected updates and sample updates. Given a known environment model with transition probabilities and rewards, your function should perform multiple sweeps over all state action pairs using both update strategies simultaneously and return the resulting Q value tables. Expected updates use the full probability distribution over next states to compute the exact backup value for each state action pair. All Q values in a sweep should be computed from the Q values at the start of that sweep (not from partially updated values within the sweep). Sample updates draw a single next state from the transition distribution and use it to incrementally adjust the Q value with a learning rate. These updates modify Q values in place, meaning later updates within the same sweep can see earlier changes. Both methods start from Q values initialized to zero. Each sweep iterates through states 0 to num states 1, and for each state through actions 0 to num actions 1. Your function should accept: P: transition probability array of shape (num states, num actions, num states) R: reward array of shape (num states, num actions) gamma: discount factor alpha: learning rate (used only for sample updates) num sweeps: number of complete sweeps through all state action pairs seed: random seed for reproducibility of sample updates Return a tuple (Q expected, Q sample) where both are numpy arrays of shape (num states, num actions) rounded to 4 decimal places.