Linear Sarsa Algorithm
Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.
Difficulty: medium.
Topics: Linear Sarsa Algorithm, Feature Vector Representation, Weight Vector Update Rule, Eligibility Traces, On-policy Control, Basis Function Expansion, Reinforcement Learning, Linear Algebra, Stochastic Processes, Optimization Theory, Statistical Learning, Temporal Difference Learning, Function Approximation, Policy Evaluation, Linear Regression, Markov Decision Processes.
Implement the episodic semi gradient Sarsa algorithm with linear function approximation for estimating action value functions from pre collected episodes. In this setting, instead of maintaining a table of Q values, the action value function is approximated as a linear combination of features: Q(s, a) = w^T x(s, a) where w is a weight vector and x(s, a) is a feature vector for the state action pair (s, a). Given a list of episodes (each a list of (state, action, reward) tuples), a feature mapping from (state, action) pairs to feature vectors, a learning rate alpha, and a discount factor gamma, process each episode sequentially and update the weight vector using the Sarsa temporal difference update rule with linear function approximation. For each transition within an episode, the weight vector should be updated using the TD error computed from the current and next state action pairs. At the terminal step of each episode (the last tuple), there is no next state action pair, so the bootstrapped value is zero. Args: episodes: List of episodes. Each episode is a list of (state, action, reward) tuples. features: Dictionary mapping (state, action) tuples to feature vectors (lists of floats). n features: Dimensionality of the feature vectors. alpha: Learning rate (step size). gamma: Discount factor. Returns: The final weight vector as a list of floats, each rounded to 4 decimal places.