Deep Q-Network Implementation

Advanced & Deep RL DS practice problem on Onlearn.

Difficulty: medium.

Topics: Deep Q-Network Implementation, Bellman Equation, Target Network Synchronization, Epsilon-Greedy Exploration, Huber Loss Function, Replay Buffer Sampling, Reinforcement Learning, Deep Learning, Optimization Theory, Stochastic Processes, Computational Statistics, Temporal Difference Learning, Function Approximation, Experience Replay, Policy Gradient Methods, Value Function Estimation.

Implement a complete Deep Q-Network (DQN) training step using a simple two-layer neural network as the Q-function approximator.

Your function should perform one full training iteration given:
- Q-network weights (W1, b1, W2, b2) for the architecture: input -> Linear -> ReLU -> Linear -> Q-values
- Target network weights (same structure, used for stable target computation)
- A batch of experience tuples (states, actions, rewards, next states, dones)
- Discount factor gamma and learning rate

The function must:
1. Compute Q-values for the current states using the Q-network.
2. Extract the Q-values corresponding to the actions that were actually taken.
3. Compute TD targets using the target network: for non-terminal transitions, use the reward plus the discounted maximum future Q-value; for terminal transitions, use just the reward.
4. Compute the mean squared error loss between predicted and target Q-values.
5. Perform backpropagation through the Q-network to compute gradients.
6. Update the Q-network weights using gradient descent.
7. Return the updated weights dictionary and the scalar loss value.

The network forward pass is: z1 = states @ W1 + b1, h1 = ReLU(z1), q = h1 @ W2 + b2.

Gradients should be computed with respect to the MSE loss and propagated back through the Q-network only (not the target network).
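
The steps above can be sketched in NumPy as follows. This is one possible reading of the task, not a reference solution. Several details are assumptions the problem statement does not fix: the function name dqn_train_step, the dictionary keys "W1", "b1", "W2", "b2", the argument order, the default values of gamma and lr, averaging the squared error over the batch only (one prediction per transition), treating dones as 0/1 flags, and using a ReLU derivative of 0 at z1 = 0.

```python
import numpy as np


def dqn_train_step(q_weights, target_weights, states, actions, rewards,
                   next_states, dones, gamma=0.99, lr=0.01):
    """One DQN update; returns (updated_weights, loss). Names are illustrative."""
    W1, b1, W2, b2 = (q_weights[k] for k in ("W1", "b1", "W2", "b2"))
    actions = np.asarray(actions, dtype=int)
    dones = np.asarray(dones, dtype=float)  # assumed 0/1 terminal flags
    batch = states.shape[0]
    rows = np.arange(batch)

    # Steps 1-2: forward pass through the Q-network, then pick Q(s, a).
    z1 = states @ W1 + b1
    h1 = np.maximum(z1, 0.0)
    q = h1 @ W2 + b2
    q_taken = q[rows, actions]

    # Step 3: TD targets from the target network (treated as constants).
    t_h1 = np.maximum(next_states @ target_weights["W1"] + target_weights["b1"], 0.0)
    t_q = t_h1 @ target_weights["W2"] + target_weights["b2"]
    targets = rewards + gamma * t_q.max(axis=1) * (1.0 - dones)

    # Step 4: MSE over the batch (assumption: mean over transitions).
    td_error = q_taken - targets
    loss = float(np.mean(td_error ** 2))

    # Step 5: backpropagation. Only the taken action's output gets gradient.
    dq = np.zeros_like(q)
    dq[rows, actions] = 2.0 * td_error / batch
    dW2 = h1.T @ dq
    db2 = dq.sum(axis=0)
    dh1 = dq @ W2.T
    dz1 = dh1 * (z1 > 0)  # ReLU derivative, taken as 0 at z1 == 0
    dW1 = states.T @ dz1
    db1 = dz1.sum(axis=0)

    # Steps 6-7: plain gradient descent; inputs are not modified in place.
    updated = {
        "W1": W1 - lr * dW1,
        "b1": b1 - lr * db1,
        "W2": W2 - lr * dW2,
        "b2": b2 - lr * db2,
    }
    return updated, loss
```

The target network appears only in the target computation and receives no gradient, which matches the requirement that backpropagation go through the Q-network alone. If a grader expects a different normalisation of the loss (for example, a mean over all batch-by-action entries), the factor in dq would need to change accordingly.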