Shared Memory IPC for RL Environments
Infrastructure, Parallelism & Hardware Efficiency DS practice problem on Onlearn.
Difficulty: medium.
Topics: Shared Memory IPC for RL Environments, POSIX Shared Memory, Zero-Copy Data Transfer, Atomic Synchronization Primitives, Memory-Mapped Files, Ring Buffer Implementation, Distributed Systems, Operating Systems, Reinforcement Learning, High-Performance Computing, Software Engineering, Inter-Process Communication, Memory Management, Parallel Environment Vectorization, Concurrency Control, System Resource Orchestration.
Implement a function that simulates shared-memory inter-process communication (IPC) for parallel reinforcement learning environments. In high-throughput RL systems, transferring observations, rewards, and termination signals between worker processes and a learner process through serialization (e.g., pickle via queues) creates significant overhead. Shared memory removes this overhead by letting the writer and the reader access the same physical memory region, enabling zero-copy data exchange.

Your task is to implement shared_memory_rl_pipeline(num_envs, obs_dim, step_data) that:

1. Allocates three separate shared memory blocks to hold environment data for all workers simultaneously:
   - Observations: a 2D buffer of shape (num_envs, obs_dim) using float64
   - Rewards: a 1D buffer of shape (num_envs,) using float64
   - Dones: a 1D buffer of shape (num_envs,) using int8
2. Creates numpy array views over these shared memory blocks and initializes them to zero.
3. Writes the provided step data into the appropriate slots (simulating workers writing results after environment steps).
4. Opens the same shared memory blocks by their system-assigned names (simulating a separate process attaching to existing shared memory), creates fresh numpy array views, and copies the data out.
5. Computes aggregate statistics from the copied data:
   - mean_obs: mean observation across all environments (per dimension)
   - mean_reward: mean reward across all environments
   - num_done: count of terminated environments
   - active_mean_reward: mean reward over non-terminated environments only (0.0 if all are done)
   - total_shared_bytes: the number of bytes logically allocated (obs bytes + reward bytes + done bytes), which may be smaller than what the OS actually reserves
6. Closes and unlinks all shared memory resources, even if an error occurs.

Parameters:
- num_envs: number of parallel environment slots
- obs_dim: dimension of observation vectors
- step_data: list of tuples (env_id, obs_list, reward_float, done_bool). An env_id may appear multiple times; later writes overwrite earlier ones.

Return a dictionary with keys: observations, rewards, dones, mean_obs, mean_reward, num_done, active_mean_reward, total_shared_bytes. All float values should be rounded to 4 decimal places. Observations are returned as a nested list (one inner list per environment), rewards as a list of floats, and dones as a list of integers (0 or 1).
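
The steps above can be sketched as follows. This is one possible solution, not a reference implementation. It assumes Python's multiprocessing.shared_memory module and numpy, the underscore spellings of the function name and dictionary keys, and num_envs >= 1 and obs_dim >= 1. The helper dictionaries (layout, owners, readers, views) are illustrative names and not part of the task.

```python
from multiprocessing import shared_memory

import numpy as np


def shared_memory_rl_pipeline(num_envs, obs_dim, step_data):
    # Assumption: num_envs >= 1 and obs_dim >= 1 (a zero-byte block cannot be created).
    layout = {
        "obs": ((num_envs, obs_dim), np.float64),
        "rew": ((num_envs,), np.float64),
        "done": ((num_envs,), np.int8),
    }
    # Logical size of each block; the OS may reserve more (e.g. whole pages).
    nbytes = {
        key: int(np.prod(shape)) * np.dtype(dtype).itemsize
        for key, (shape, dtype) in layout.items()
    }

    owners = {}   # blocks created by the "worker" side
    readers = {}  # second handles attached by name, as a learner process would
    views = {}    # every numpy view is stored here so cleanup can drop them all

    try:
        # Steps 1-2: allocate, wrap in array views, initialize to zero.
        for key, (shape, dtype) in layout.items():
            owners[key] = shared_memory.SharedMemory(create=True, size=nbytes[key])
            views["w_" + key] = np.ndarray(shape, dtype=dtype, buffer=owners[key].buf)
            views["w_" + key].fill(0)

        # Step 3: later writes to the same env_id overwrite earlier ones.
        for env_id, obs, reward, done in step_data:
            views["w_obs"][env_id] = obs
            views["w_rew"][env_id] = reward
            views["w_done"][env_id] = int(bool(done))

        # Step 4: attach by name, build fresh views, copy the data out.
        copies = {}
        for key, (shape, dtype) in layout.items():
            readers[key] = shared_memory.SharedMemory(name=owners[key].name)
            views["r_" + key] = np.ndarray(shape, dtype=dtype, buffer=readers[key].buf)
            copies[key] = views["r_" + key].copy()
    finally:
        # Step 6: drop the views before closing, so nothing still points
        # into the mapped memory; then close every handle and unlink once.
        views.clear()
        for shm in list(readers.values()) + list(owners.values()):
            shm.close()
        for shm in owners.values():
            shm.unlink()

    # Step 5: statistics are computed on the private copies only.
    obs, rew, done = copies["obs"], copies["rew"], copies["done"]
    active = done == 0
    return {
        "observations": np.round(obs, 4).tolist(),
        "rewards": np.round(rew, 4).tolist(),
        "dones": [int(d) for d in done],
        "mean_obs": np.round(obs.mean(axis=0), 4).tolist(),
        "mean_reward": round(float(rew.mean()), 4),
        "num_done": int(done.sum()),
        "active_mean_reward": round(float(rew[active].mean()), 4) if active.any() else 0.0,
        "total_shared_bytes": sum(nbytes.values()),
    }
```

For example, with num_envs=3 and obs_dim=2 the logical allocation is 3*2*8 + 3*8 + 3*1 = 75 bytes. Keeping every view in one dictionary is a deliberate choice in this sketch: clearing it in the finally block ensures no array still references a block's buffer when the block is closed, including on the error path.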