Linear Value Function Approximation with Semi-Gradient TD(0)

Advanced RL Theory, Planning & TD Learning DS practice problem on Onlearn.

Difficulty: medium.

Topics: Linear Value Function Approximation with Semi-Gradient TD(0), Semi-Gradient TD(0) Update, Feature Vector Representation, Weight Vector Convergence, Mean Squared Value Error, Bootstrapping Bias, Reinforcement Learning, Linear Algebra, Optimization Theory, Probability and Statistics, Functional Analysis, Temporal Difference Learning, Function Approximation, Stochastic Gradient Descent, Markov Decision Processes, Vector Space Projections.

Implement semi-gradient TD(0) with linear function approximation to learn a weight vector for estimating state values.

In many real-world reinforcement learning problems, the state space is too large to store a value for every state in a table. Instead, the value function is approximated using features extracted from states. In linear function approximation, the estimated value of a state is the dot product of a weight vector and the feature vector representing that state.

Your task is to implement the semi-gradient TD(0) algorithm with linear function approximation.

Parameters:

  episodes: A list of episodes. Each episode is a list of transitions, where each transition is a tuple (state_features, reward, next_state_features, done).
    state_features: list of floats, the feature vector of the current state
    reward: float, the immediate reward received
    next_state_features: list of floats, the feature vector of the next state
    done: bool, True if the next state is terminal (the episode ends)
  n_features: int, the dimensionality of the feature vectors
  gamma: float, the discount factor (between 0 and 1)
  alpha: float, the learning rate

Returns:

  A NumPy array of shape (n_features,) containing the learned weight vector after processing all episodes sequentially.

Initialize the weight vector to zeros. Process episodes in order, and within each episode process transitions in order, updating the weight vector after each transition.
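One way the task could be approached is sketched below. It is a minimal sketch rather than a reference solution. The function name semi_gradient_td0 and the short variable names are hypothetical, since the problem statement does not name them. It assumes the standard semi-gradient TD(0) update w <- w + alpha * (reward + gamma * v(next state) - v(state)) * x(state), and it assumes that the value of the next state is taken to be zero when done is True, which is the usual convention for terminal states.

```python
import numpy as np


def semi_gradient_td0(episodes, n_features, gamma, alpha):
    """Sketch of semi-gradient TD(0) with a linear value estimate v(s) = w . x(s).

    The function name is hypothetical; the parameters follow the problem statement.
    """
    w = np.zeros(n_features, dtype=float)  # weights start at zero

    for episode in episodes:
        for state_features, reward, next_state_features, done in episode:
            x = np.asarray(state_features, dtype=float)
            x_next = np.asarray(next_state_features, dtype=float)

            v = w @ x
            # Assumption: a terminal next state contributes no bootstrapped value.
            v_next = 0.0 if done else w @ x_next

            # TD error: bootstrapped target minus the current estimate.
            td_error = reward + gamma * v_next - v

            # Semi-gradient step: only v(s) is differentiated, and for a
            # linear estimate its gradient with respect to w is x.
            w += alpha * td_error * x

    return w


# Usage: one episode with two one-hot states and a terminal reward of 1.
episode = [
    ([1.0, 0.0], 0.0, [0.0, 1.0], False),
    ([0.0, 1.0], 1.0, [0.0, 0.0], True),
]
w = semi_gradient_td0([episode], n_features=2, gamma=0.9, alpha=0.5)
# After this single episode, w is [0.0, 0.5].
```

The target is treated as a constant when the step is taken, which is what makes the method "semi-gradient": the dependence of the bootstrapped term on w is ignored.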