Handling Variable Environment Reset Times

Planning, Dynamics & Decision Systems DS practice problem on Onlearn.

Difficulty: medium.

Topics: Handling Variable Environment Reset Times, Episode Termination Latency, State Transition Dynamics, Jitter Compensation, Non-Stationary Reset Distributions, Parallel Environment Vectorization, Reinforcement Learning, Control Theory, Stochastic Processes, System Identification, Computational Complexity, Markov Decision Processes, Optimal Control, Time Series Analysis, Simulation Modeling, Asynchronous Scheduling.

In parallel reinforcement learning, multiple environment instances run simultaneously to collect experience faster. Each environment reaches terminal states at different timesteps, so resets happen asynchronously. The agent must still segment trajectories into episodes correctly and compute per-episode statistics despite these variable reset times.

Implement a function process_variable_resets that takes trajectory data from multiple parallel environments and correctly handles asynchronous episode boundaries.

Inputs:

rewards: a numpy array of shape (T, N), where T is the number of timesteps and N is the number of parallel environments. Entry [t, n] is the reward received by environment n at timestep t.

dones: a numpy array of shape (T, N) with boolean values. dones[t, n] is True if environment n terminated at timestep t (the reward at that timestep is the final reward of that episode).

gamma: a float representing the discount factor.

When an environment terminates, its next timestep (if any) begins a fresh episode. Discounted returns should be computed as the sum of gamma-discounted rewards within each episode segment, with the discount exponent restarting at 0 on the first step of each segment.

Returns:

A dictionary with the following keys:

'episode returns': a list of floats (rounded to 4 decimals), one per completed episode, in the order the episodes finished (iterating over timesteps first, then over environments within each timestep).

'episode lengths': a list of integers giving the length of each completed episode, in the same order.

'partial returns': a list of N floats (rounded to 4 decimals), the discounted return accumulated so far for each environment's ongoing (incomplete) episode.

'partial lengths': a list of N integers, the number of steps in each environment's current incomplete episode.

'mean return': a float (rounded to 4 decimals), the mean of the completed episode returns, or 0.0 if no episodes completed.

'total episodes': an integer count of completed episodes.
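One possible approach, sketched below in Python, keeps three running quantities per environment (accumulated return, current discount weight, and step count) and resets them whenever that environment's done flag is set. This is a minimal sketch rather than a reference solution. It assumes the function name is spelled process_variable_resets, that the dictionary keys are spelled exactly as in the statement above, that discounting restarts at gamma^0 on the first step of every episode, and that the mean is taken over the unrounded returns before rounding.

```python
import numpy as np


def process_variable_resets(rewards, dones, gamma):
    """Segment (T, N) trajectories into episodes with asynchronous resets."""
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=bool)
    T, N = rewards.shape

    run_return = np.zeros(N)             # discounted return of each ongoing episode
    run_discount = np.ones(N)            # gamma**k weight for each env's next reward
    run_length = np.zeros(N, dtype=int)  # steps taken in each ongoing episode

    finished_returns = []  # unrounded, in completion order
    finished_lengths = []

    for t in range(T):
        # Accumulate this timestep's reward for every environment at once.
        run_return += run_discount * rewards[t]
        run_discount *= gamma
        run_length += 1

        # Environments that terminated at t, visited in increasing index order.
        for n in np.flatnonzero(dones[t]):
            finished_returns.append(float(run_return[n]))
            finished_lengths.append(int(run_length[n]))
            # The next timestep (if any) starts a fresh episode for env n.
            run_return[n] = 0.0
            run_discount[n] = 1.0
            run_length[n] = 0

    # Assumption: average the unrounded returns, then round the mean.
    mean_return = float(np.mean(finished_returns)) if finished_returns else 0.0

    return {
        'episode returns': [round(r, 4) for r in finished_returns],
        'episode lengths': finished_lengths,
        'partial returns': [round(float(r), 4) for r in run_return],
        'partial lengths': [int(length) for length in run_length],
        'mean return': round(mean_return, 4),
        'total episodes': len(finished_returns),
    }


# Small usage example: 3 timesteps, 2 environments, gamma = 0.5.
rewards = np.ones((3, 2))
dones = np.array([[False, True],
                  [True, False],
                  [False, False]])
result = process_variable_resets(rewards, dones, gamma=0.5)
print(result)
```

In this example environment 1 finishes a one-step episode at t = 0 and environment 0 finishes a two-step episode at t = 1, so the completed returns come out in the order [1.0, 1.5]. Each environment then carries an incomplete episode to the end of the data. Tracking the discount weight per environment avoids recomputing gamma ** k from the episode start index and keeps the per-timestep update vectorized across environments.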