Parallel Value Functions (Horde)
Representation Learning, Advanced Theory & Miscellaneous DS practice problem on Onlearn.
Difficulty: medium.
Topics: Parallel Value Functions (Horde), Cumulant Functions, Off-Policy Learning, Policy Evaluation, Discount Factors, State-Action Value Estimation, Reinforcement Learning, Distributed Systems, Representation Learning, Control Theory, Statistical Learning Theory, Temporal Difference Learning, Multi-Agent Architectures, General Value Functions, Function Approximation, Asynchronous Parallelism.
Implement a Horde architecture that learns multiple General Value Functions (GVFs) simultaneously from a single stream of experience. Each GVF is called a "demon" and makes a different prediction about the environment using its own: Cumulant : A pseudo reward signal specific to this demon's prediction target Discount factor (gamma) : Controls the prediction horizon Target policy : The policy under which the demon evaluates its prediction Trace decay (lambda) : Controls eligibility trace decay for temporal credit assignment All demons share the same stream of experience generated by a single behavior policy , but each learns independently using off policy TD(lambda) with importance sampling corrections. For each transition (s, a, s') in the experience stream, every demon should: 1. Look up its cumulant signal from cumulant[s, a, s'] 2. Compute the importance sampling ratio between its target policy and the behavior policy for the taken action 3. Update its eligibility trace vector 4. Compute the TD error using its own value estimates, cumulant, and discount factor 5. Update all state values using the TD error and eligibility traces The eligibility trace update for each demon at each step uses the formula: first decay the previous trace by gamma lambda, add the indicator for the current state, then scale the entire result by the importance sampling ratio. The value update uses: V += alpha td error eligibility trace If the behavior policy probability for the taken action is zero, treat the importance sampling ratio as zero. Args: experience: List of (state, action, next state) tuples demons: List of dicts, each with keys 'cumulant' (ndarray shape (n states, n actions, n states)), 'gamma' (float), 'target policy' (ndarray shape (n states, n actions)), 'lambd' (float) behavior policy: ndarray of shape (n states, n actions) with behavior policy probabilities n states: int n actions: int alpha: float learning rate Returns: A list of lists, one per demon, each of length n states, with values rounded to 4 decimal places.