Two-Ply Search with Value Function
RL Environments, Games & Applications DS practice problem on Onlearn.
Difficulty: medium.
Topics: Two-Ply Search with Value Function, Minimax Algorithm, Alpha-Beta Pruning, Terminal State Evaluation, Lookahead Depth, Zero-Sum Game Modeling, Game Theory, Search Algorithms, Reinforcement Learning, Decision Theory, Computational Complexity, Adversarial Search, Heuristic Evaluation, State Space Representation, Dynamic Programming, Policy Approximation.
Implement a two ply (two step) lookahead search for action value estimation in a stochastic environment. Given a current state in a Markov Decision Process (MDP), a transition model, and an approximate value function for states, compute the action values by looking two steps ahead. At the leaf nodes (states reached after two steps), use the provided value function to estimate future returns. The agent considers all available actions from the current state. For each action, it considers the possible next states (weighted by transition probabilities). From each next state, it again considers all available actions and their outcomes, choosing the best second step action (greedy). The value function is used to evaluate states reached after the second step. If a next state has no available actions in the transition model (i.e., it is terminal or a leaf), fall back to a one ply evaluation using the value function directly for that state. Inputs: state: The current state (integer). transitions: A dictionary mapping (state, action) pairs to a list of (probability, next state, reward) tuples. value function: A dictionary mapping states to their estimated values (floats). gamma: The discount factor (float in [0, 1]). Output: Return a dictionary mapping each available action from the given state to its two ply backed up Q value, rounded to 4 decimal places.