Self-Play Training Loop for Two-Player Games

RL Environments, Games & Applications DS practice problem on Onlearn.

Difficulty: medium.

Topics: Self-Play Training Loop for Two-Player Games, Nash Equilibrium, Experience Replay Buffer, Actor-Critic Architecture, Elo Rating System, Exploration-Exploitation Tradeoff, Reinforcement Learning, Game Theory, Distributed Systems, Probability and Statistics, Software Engineering, Multi-Agent Systems, Policy Optimization, Asynchronous Parallelism, Monte Carlo Methods, Version Control and Checkpointing.

Implement a self play training loop where an agent improves its strategy in a two player zero sum game by playing against a periodically updated copy of itself. In self play, rather than training against a fixed opponent, the agent's opponent is a snapshot of its own past policy. This creates a curriculum of increasingly strong opponents, driving continuous improvement. Setup: A two player zero sum game is defined by a payoff matrix A of shape (num actions, num actions). When player 1 chooses action i and player 2 chooses action j, player 1 receives A[i][j]. Both players use softmax policies parameterized by logit vectors. Player 1 (the learner) has logits theta, and the opponent has logits theta opp, both initialized to initial logits. At each training iteration t (0 indexed): 1. Sync check : If t 0 and t % sync interval == 0, copy the current player's logits to the opponent. 2. Compute softmax probabilities for both players from their respective logits. 3. Compute the expected payoff for each action of player 1 against the opponent's mixed strategy. 4. Compute the expected value under player 1's current policy. 5. Compute the policy gradient and update player 1's logits using gradient ascent. The policy gradient for a softmax policy uses the advantage weighted score function. Args: payoff matrix: 2D list of shape (n, n) representing the payoff for player 1 initial logits: 1D list of shape (n,) for initial policy logits num iterations: number of training iterations learning rate: step size for gradient ascent sync interval: how often (in iterations) to copy current policy to opponent Returns: A dictionary with: 'policy logits': final player 1 logits (list, rounded to 4 decimal places) 'opponent logits': final opponent logits (list, rounded to 4 decimal places) 'policy probs': final player 1 action probabilities (list, rounded to 4 decimal places) 'expected payoff': expected payoff of final player 1 policy vs final opponent policy (float, rounded to 4 decimal places)