Acrobot Swing-Up with Sarsa(λ)
Planning, Dynamics & Decision Systems DS practice problem on Onlearn.
Difficulty: medium.
Topics: Acrobot Swing-Up with Sarsa(λ), Sarsa(λ) Update Rule, Lagrangian Formulation, Tile Coding, Accumulating Traces, Underactuated System Control, Reinforcement Learning, Classical Mechanics, Numerical Optimization, Stochastic Processes, Control Theory, Temporal Difference Learning, Rigid Body Dynamics, Function Approximation, Eligibility Traces, Policy Evaluation.
Implement the Sarsa(λ) algorithm with linear function approximation for the Acrobot swing-up task. The Acrobot is a two-link underactuated robot arm that must swing its tip above a height threshold. The state space is continuous (joint angles and velocities), so function approximation with binary feature vectors (e.g., from tile coding) is used instead of a tabular representation.

Your function sarsa_lambda_acrobot should process one complete episode of experience and return the updated weight vector.

You are given:
features: A list of binary feature vectors, one per timestep. Each features[t] represents the active features for the state-action pair (S_t, A_t) taken at step t.
rewards: A list of scalar rewards R_{t+1} received after each step.
num_weights: The dimensionality of the weight vector.
alpha: The learning rate (step size).
gamma: The discount factor.
lam: The trace decay parameter (lambda).
initial_weights: A numpy array of initial weight values.
trace_type: Either 'accumulating' or 'replacing', specifying the eligibility trace variant.

The episode has T steps indexed 0 through T-1. The state reached after the final step (t = T-1) is always terminal, meaning the value of the next state-action pair is zero for that last transition. For all other transitions, the next state-action Q-value is computed from features[t+1]. The Q-value for any state-action pair is the dot product of the weight vector and its feature vector.

At each step, first compute the TD error, then update the eligibility trace vector, and finally update the weight vector. Return the final weight vector as a list of floats.
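The per-step procedure described above (TD error, then trace update, then weight update) can be sketched as follows. This is a minimal sketch, not a reference solution, and it rests on several assumptions the problem statement does not spell out: the function and parameter names are written with underscores (sarsa_lambda_acrobot, num_weights, initial_weights, trace_type); the trace vector starts at zero; the trace is decayed by gamma * lam before the current features are applied; and the 'replacing' variant sets the trace of every active feature to 1 while leaving the decayed trace elsewhere.

```python
import numpy as np


def sarsa_lambda_acrobot(features, rewards, num_weights, alpha, gamma, lam,
                         initial_weights, trace_type="accumulating"):
    """Process one episode with Sarsa(lambda) and linear function approximation."""
    w = np.array(initial_weights, dtype=float)  # copy, so the caller's array is untouched
    e = np.zeros(num_weights)                   # assumption: traces start at zero
    T = len(rewards)

    for t in range(T):
        x = np.asarray(features[t], dtype=float)

        # Q-values are dot products of the weights and the feature vectors.
        q = w @ x
        if t == T - 1:
            q_next = 0.0                        # terminal transition: next value is zero
        else:
            q_next = w @ np.asarray(features[t + 1], dtype=float)

        # 1) TD error, computed with the weights as they are before this step's update.
        delta = rewards[t] + gamma * q_next - q

        # 2) Eligibility trace: decay first, then apply the current features.
        e = gamma * lam * e
        if trace_type == "accumulating":
            e = e + x
        elif trace_type == "replacing":
            # Assumed definition: active (nonzero) features are reset to 1.
            e = np.where(x > 0, 1.0, e)
        else:
            raise ValueError(f"unknown trace_type: {trace_type!r}")

        # 3) Weight update along the trace.
        w = w + alpha * delta * e

    return w.tolist()
```

As a small hand-checkable case under these assumptions, calling the function with features [[1, 0], [0, 1]], rewards [-1, -1], alpha = 0.5, gamma = 1.0, lam = 0.5, zero initial weights, and accumulating traces returns [-0.75, -0.5]: the second TD error also reaches the first weight through its decayed trace of 0.5.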