Learn Tic-Tac-Toe with TD
RL Environments, Games & Applications DS practice problem on Onlearn.
Difficulty: medium.
Topics: Learn Tic-Tac-Toe with TD, TD(0) Update Rule, Epsilon-Greedy Policy, Q-Table Initialization, Discount Factor Gamma, Terminal State Evaluation, Reinforcement Learning, Game Theory, Probability Theory, Dynamic Programming, Software Engineering, Temporal Difference Learning, Markov Decision Processes, Value Function Approximation, Exploration-Exploitation Strategies, State Space Representation.
Implement TD(0) learning to estimate state values for a Tic Tac Toe player from a collection of recorded games. The board is represented as a tuple of 9 integers (row major order of the 3x3 grid): 0 for empty, 1 for X, 2 for O. Each game is a dictionary with: states: an ordered list of board state tuples visited by the learning player during the game. The last state in the list is the terminal (game ending) position. result: a float indicating the game outcome from the learning player's perspective (1.0 for win, 0.0 for loss, 0.5 for draw). Your function td tic tac toe(games, alpha, initial value) should learn values for all non terminal states encountered across all games. A non terminal state is any state that is not the last state in a game's sequence. The value of the terminal state (last in each sequence) is always the game result and should not be stored or updated. Non terminal states should be initialized to initial value upon first encounter. Games should be processed in order. Within each game, updates proceed forward through the state sequence. Each non terminal state is updated using a standard TD(0) backup toward the value of the next state in the sequence (or the result if the next state is terminal). Return a dictionary mapping each non terminal board state tuple to its learned value, rounded to 4 decimal places.