Sarsa(lambda) Algorithm with Eligibility Traces
Foundations & Tabular RL DS practice problem on Onlearn.
Difficulty: medium.
Topics: Sarsa(lambda) Algorithm with Eligibility Traces, Eligibility Traces, Lambda Parameter, Accumulating Traces, Replacing Traces, Backward View, Reinforcement Learning, Stochastic Processes, Dynamic Programming, Statistical Estimation, Control Theory, Temporal Difference Learning, Policy Evaluation, Credit Assignment, Markov Decision Processes, On-policy Control.
Implement the Sarsa(lambda) algorithm, which combines temporal difference learning with eligibility traces for on policy control. This algorithm extends one step Sarsa by maintaining a trace for each state action pair, allowing updates to propagate backward through recently visited pairs. Your function should learn an action value function Q(s, a) by interacting with a deterministic environment over multiple episodes. Inputs n states: Total number of states (int) n actions: Total number of actions (int) transitions: A dictionary mapping (state, action) to (next state, reward) tuples representing the deterministic environment dynamics terminal states: A list of terminal state indices gamma: Discount factor (float) alpha: Learning rate (float) lam: Lambda parameter controlling the trace decay (float between 0 and 1) epsilon: Exploration rate for epsilon greedy action selection (float) num episodes: Number of episodes to train (int) seed: Random seed for reproducibility (int, default 42) Algorithm Requirements Initialize Q values to zero At the start of each episode, reset eligibility traces to zero Select a random non terminal starting state uniformly Use epsilon greedy action selection based on current Q values (break ties by choosing the smallest action index) Use accumulating eligibility traces When a terminal state is reached, the episode ends (terminal states have Q value 0) For epsilon greedy: with probability epsilon choose a uniformly random action, otherwise choose greedy Output Return the learned Q table as a 2D list of shape (n states, n actions), with values rounded to 4 decimal places.