TD-Gammon Position Evaluation Network

RL Environments, Games & Applications DS practice problem on Onlearn.

Difficulty: medium.

Topics: TD-Gammon Position Evaluation Network, Multi-Layer Perceptron, TD(lambda) Algorithm, Minimax Search, Self-Play Training, Feature Engineering, Reinforcement Learning, Neural Networks, Game Theory, Probability Theory, Temporal Difference Learning, Function Approximation, Backpropagation, Zero-Sum Games, Markov Decision Processes, Eligibility Traces.

Implement an online TD(lambda) learning algorithm for a neural network that evaluates board positions. The network takes a position feature vector as input and outputs a scalar value estimate through a single hidden layer with sigmoid activations. Network Architecture: Input: feature vector x of dimension n features Hidden layer: h = sigmoid(W1 @ x + b1), with n hidden units Output: v = sigmoid(W2 @ h + b2), a single scalar value in (0, 1) TD(lambda) Online Learning: Given a sequence of T states from a single game episode and a terminal outcome, process states one at a time (t = 0 to T 1). At each step: 1. Compute the current value v t and the gradient of v t with respect to all network parameters using the current weights. 2. Update eligibility traces by decaying old traces and adding the current gradient. 3. Compute the TD error: for non terminal steps, use the value of the next state (computed with current weights) as the target; for the last state, use the actual game outcome. 4. Apply weight updates using the TD error and eligibility traces. Note that weights are updated after each timestep, so subsequent forward passes use the modified weights. Function Signature: states: list of feature vectors (each a list of floats) outcome: float, the terminal game result (e.g., 1.0 for win, 0.0 for loss) W1: initial hidden layer weights, shape (n hidden, n features) b1: initial hidden layer biases, shape (n hidden,) W2: initial output layer weights, shape (1, n hidden) b2: initial output layer bias, shape (1,) alpha: learning rate gamma: discount factor lam: eligibility trace decay (lambda) Return: a dictionary with keys 'W1', 'b1', 'W2', 'b2' containing the updated parameters as nested lists, rounded to 4 decimal places.