Action-Conditional Trace Clearing

Representation Learning, Advanced Theory & Miscellaneous DS practice problem on Onlearn.

Difficulty: medium.

Topics: Action-Conditional Trace Clearing, Eligibility Traces, Action-Conditioned Dynamics, Belief State Updating, Policy Gradient Variance Reduction, Markov Decision Process Transitions, Reinforcement Learning, Control Theory, Information Theory, Dynamical Systems, Probabilistic Graphical Models, Temporal Difference Learning, State Space Modeling, Entropy Regularization, Trajectory Optimization, Latent Variable Inference.

In reinforcement learning with eligibility traces, a key challenge arises in off policy control: when the agent takes an action that deviates from the greedy policy, the eligibility traces accumulated so far may no longer be valid for updating toward the optimal value function. Action conditional trace clearing addresses this by resetting all eligibility traces to zero whenever a non greedy action is observed. When the agent follows the greedy policy, traces decay normally by a factor of $\gamma \lambda$. When the agent deviates (e.g., explores), all traces are wiped clean so that past state action pairs do not receive credit through an exploratory step. Implement a function that processes a single episode and updates Q values using accumulating eligibility traces with action conditional clearing. Inputs: Q: numpy array of shape (n states, n actions) current Q values episode: list of tuples (state, action, reward, next state, done, next action greedy) where: state, action: current state and action taken reward: reward received next state: resulting state done: boolean, True if the episode terminates after this step next action greedy: boolean, True if the action that will be taken at next state is greedy (ignored when done=True) gamma: discount factor alpha: learning rate lam: trace decay parameter (lambda) Processing each step: Compute the TD error using the maximum Q value at the next state (use 0 for terminal transitions) Update the eligibility trace for the current state action pair (accumulating) Apply the TD update to all Q values weighted by traces Conditionally decay or clear the traces based on whether the next action is greedy Returns: Updated Q values as a numpy array, rounded to 4 decimal places