EnvPool-style Asynchronous Environment Pooling

Planning, Dynamics & Decision Systems DS practice problem on Onlearn.

Difficulty: medium.

Topics: EnvPool-style Asynchronous Environment Pooling, Zero-copy Data Transfer, CPU Affinity Binding, Global Interpreter Lock Bypass, Ring Buffer Synchronization, Context Switching Overhead, Reinforcement Learning, Parallel Computing, Control Theory, Systems Engineering, Software Architecture, Vectorized Environments, Asynchronous Multiprocessing, Shared Memory Management, Task Scheduling, Inter-Process Communication.

Implement an asynchronous environment pool simulator for reinforcement learning that manages multiple environments with heterogeneous step durations. In traditional vectorized environments, all environments are stepped synchronously, so the system must wait for the slowest environment each round. An asynchronous pool instead dispatches actions to ready environments immediately, collects results as they complete, and keeps all environments as busy as possible to maximize throughput.

Your function simulates an async environment pool over a fixed number of discrete time ticks. Each environment has a fixed step duration (in ticks) and returns a fixed reward upon completing a step.

Inputs:

num_envs: int, number of environments in the pool
step_durations: list of int, how many ticks each environment takes to complete one step (length num_envs)
rewards_per_step: list of float, the reward returned by each environment upon completing a step (length num_envs)
total_ticks: int, total number of simulation ticks to run
batch_size: int, maximum number of environments that can be dispatched (sent new actions) per tick

At the start, all environments are in a ready state. Each tick proceeds in three phases:

1. Advance all busy environments by one tick. Any environment whose remaining time reaches zero completes its step, yields its reward, and becomes ready.
2. Among all currently ready environments, dispatch new actions to up to batch_size of them (selecting by ascending environment index). Dispatched environments become busy with their full step duration as remaining time.
3. Record the number of currently busy environments.

Return a tuple of four values:

total_steps: int, total number of environment steps completed
total_reward: float, sum of all rewards collected, rounded to 4 decimal places
avg_envs_active: float, mean number of busy environments across all ticks, rounded to 4 decimal places
throughput: float, total_steps divided by total_ticks, rounded to 4 decimal places

If total_ticks is 0, return (0, 0.0, 0.0, 0.0).
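
The three phases and the return values can be sketched as follows. This is one possible reading of the statement, not an official reference solution. The function name simulate_async_env_pool is hypothetical (the problem does not fix one), and the sketch assumes every step duration is at least 1 tick.

```python
from typing import List, Tuple


def simulate_async_env_pool(
    num_envs: int,
    step_durations: List[int],
    rewards_per_step: List[float],
    total_ticks: int,
    batch_size: int,
) -> Tuple[int, float, float, float]:
    """Simulate the pool tick by tick (hypothetical name, see lead-in)."""
    # The statement only defines total_ticks == 0; treating negative
    # values the same way is an assumption of this sketch.
    if total_ticks <= 0:
        return (0, 0.0, 0.0, 0.0)

    # remaining[i] == 0 means environment i is ready; > 0 means busy.
    # Assumes step_durations[i] >= 1 for every environment.
    remaining = [0] * num_envs
    total_steps = 0
    total_reward = 0.0
    busy_sum = 0

    for _ in range(total_ticks):
        # Phase 1: advance busy environments and collect finished steps.
        for i in range(num_envs):
            if remaining[i] > 0:
                remaining[i] -= 1
                if remaining[i] == 0:
                    total_steps += 1
                    total_reward += rewards_per_step[i]

        # Phase 2: dispatch up to batch_size ready environments,
        # in ascending index order.
        dispatched = 0
        for i in range(num_envs):
            if dispatched >= batch_size:
                break
            if remaining[i] == 0:
                remaining[i] = step_durations[i]
                dispatched += 1

        # Phase 3: record how many environments are busy after dispatch.
        busy_sum += sum(1 for r in remaining if r > 0)

    return (
        total_steps,
        round(total_reward, 4),
        round(busy_sum / total_ticks, 4),
        round(total_steps / total_ticks, 4),
    )


# Two environments with durations 1 and 2, no effective batch limit.
print(simulate_async_env_pool(2, [1, 2], [1.0, 0.5], 4, 2))
# → (4, 3.5, 2.0, 1.0)
```

Under this reading, an environment that finishes in phase 1 can be dispatched again in phase 2 of the same tick, so an environment with duration d that is never held back by batch_size completes a step every d ticks, starting d ticks after its first dispatch.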