Asynchronous Dynamic Programming for Value Iteration
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: Asynchronous Dynamic Programming for Value Iteration, In-place Updates, Gauss-Seidel Iteration, Contraction Mapping, Asynchronous Parallelism, State-Space Partitioning, Reinforcement Learning, Dynamic Programming, Numerical Optimization, Stochastic Processes, Computational Complexity, Value Iteration, Bellman Operators, Markov Decision Processes, Convergence Analysis, Parallel Computing.
Implement asynchronous value iteration for a Markov Decision Process (MDP). Unlike synchronous dynamic programming, where all states are updated simultaneously in each sweep using values from the previous iteration, asynchronous DP updates states one at a time in a given order, and each update immediately uses the most recent value estimates (in place updates). Given an MDP specified by: num states: the number of states transitions: a nested list where transitions[s][a] is a list of (probability, next state, reward) tuples representing the dynamics for taking action a in state s gamma: the discount factor update order: a list of state indices specifying which state to update at each step (states may repeat) Your function should: 1. Initialize the value function to zeros for all states 2. Process each state in update order sequentially, performing a Bellman optimality backup using the current value estimates 3. Return the final value function as a list of floats If a state has no available actions (empty action list), its value remains unchanged when selected for update. The Bellman optimality backup for a state computes the maximum expected return over all available actions.