Backgammon with TD Learning

RL Environments, Games & Applications DS practice problem on Onlearn.

Difficulty: medium.

Topics: Backgammon with TD Learning, TD(0) Update Rule, Eligibility Traces, Backpropagation, Epsilon-Greedy Policy, State-Value Function, Reinforcement Learning, Game Theory, Probability Theory, Function Approximation, Stochastic Optimization, Temporal Difference Learning, Markov Decision Processes, Neural Network Architectures, Monte Carlo Methods, Exploration Strategies.

Task: TD(lambda) Learning for Game Position Evaluation Implement a function td gammon learning that applies temporal difference learning with eligibility traces to train a neural network value function across a single game of self play. This technique is foundational for training game playing agents that learn to evaluate board positions through self play. Architecture The value function is a two layer neural network: Hidden layer : h = sigmoid(W1 @ x + b1) where W1 is (n hidden, n features) and b1 is (n hidden,) Output layer : V = sigmoid(w2 . h + b2) where w2 is (n hidden,) and b2 is a scalar The output V represents the estimated probability of winning from the given board state. TD(lambda) Update Procedure For each time step t in the game sequence: 1. Compute V(s t) and its gradient with respect to all network parameters using backpropagation 2. Update all eligibility traces by decaying old traces and adding the current gradient 3. Compute the TD error: for non terminal steps, delta = V(s {t+1}) V(s t); for the final step, delta = outcome V(s T) 4. Update all weights using w += alpha delta e where e is the eligibility trace Note: V(s {t+1}) should be computed with the same weights used for V(s t) (i.e., before the weight update at step t). Inputs game states: List of numpy arrays, each of shape (n features,), representing the board feature vectors at each time step of a game outcome: Float, the final game result (1.0 for win, 0.0 for loss) W1: numpy array of shape (n hidden, n features), initial hidden layer weights b1: numpy array of shape (n hidden,), initial hidden layer biases w2: numpy array of shape (n hidden,), initial output layer weights b2: float, initial output layer bias alpha: float, learning rate lambd: float, trace decay parameter (lambda) Output Return a tuple (W1 final, b1 final, w2 final, b2 final) with the updated weights as numpy arrays (and float for b2). Use sigmoid(x) = 1 / (1 + exp( x)) with clipping x to [ 500, 500] for numerical stability.