PettingZoo to Gym Environment Adapter

Planning, Dynamics & Decision Systems DS practice problem on Onlearn.

Difficulty: medium.

Topics: PettingZoo to Gym Environment Adapter, Gymnasium API Compliance, PettingZoo AEC Pattern, Observation Space Mapping, Action Masking, Reward Normalization, Reinforcement Learning, Software Engineering, Control Theory, Distributed Systems, Object-Oriented Programming, API Standardization, Multi-Agent Systems, State Space Representation, Interface Design Patterns, Asynchronous Execution.

In multi-agent reinforcement learning, environments often expose a turn-based API in which agents act sequentially (one at a time) and state is exposed through properties such as the current agent selection, terminations, and truncations. Standard single-agent RL algorithms, however, expect a simpler Gym-style API where a single step(actions) call handles an entire round and returns aggregated results.

Implement a PettingZooToGymAdapter class that wraps a turn-based multi-agent environment and exposes it through a Gym-style interface in which one macro step processes all agents.

The wrapped environment (passed to __init__) has the following API:

- reset(): Initializes the environment and sets internal state (no return value).
- agents: A list of active agent name strings.
- agent_selection: The name of the agent whose turn it is to act.
- observe(agent): Returns the observation for a given agent.
- step(action): Applies the action for the current agent_selection and advances to the next agent.
- rewards: A dict mapping each agent name to its reward from the last step.
- terminations: A dict mapping each agent name to a boolean.
- truncations: A dict mapping each agent name to a boolean.

Your adapter class must implement:

1. __init__(self, env): Store the environment and initialize any needed state.
2. reset(self): Call the environment's reset, record the agent list, and return a dictionary mapping each agent name to its observation.
3. step(self, actions): Accept a dictionary mapping agent names to actions. Iterate through the agents in order, skipping any that are already terminated or truncated, and call the environment's step with each remaining agent's action. After all agents have acted, collect observations and done flags. Return a tuple of:
   - observations: dict mapping agent names to their new observations.
   - total_reward: float sum of all individual agent rewards from this macro step, rounded to 4 decimal places.
   - all_done: boolean that is True only when every agent is terminated or truncated.
   - info: dict with key 'agent_rewards' mapping to the dict of individual rewards, and key 'agent_dones' mapping to the dict of individual done booleans.
4. get_agents(self): Return the list of agent names recorded during reset().
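The requirements above can be sketched as follows. This is one possible reading of the specification, not a reference solution. It assumes the underscore-joined names agent_selection, get_agents, 'agent_rewards' and 'agent_dones'. It assumes each agent's reward is read from env.rewards immediately after that agent's own step, and that agents skipped because they are already done contribute 0.0. It also assumes the wrapped environment's turn order matches the order of its agents list. ToyTurnEnv is a hypothetical stand-in written only to exercise the adapter; it is not part of PettingZoo.

```python
class PettingZooToGymAdapter:
    """Sketch: exposes a turn-based multi-agent env through one macro step."""

    def __init__(self, env):
        self.env = env
        self._agents = []

    def reset(self):
        self.env.reset()
        self._agents = list(self.env.agents)  # snapshot the agent order
        return {agent: self.env.observe(agent) for agent in self._agents}

    def _is_done(self, agent):
        terminated = self.env.terminations.get(agent, False)
        truncated = self.env.truncations.get(agent, False)
        return bool(terminated or truncated)

    def step(self, actions):
        # Assumption: agents that are skipped this round contribute 0.0.
        agent_rewards = {agent: 0.0 for agent in self._agents}
        for agent in self._agents:
            if self._is_done(agent):
                continue
            self.env.step(actions[agent])
            # Assumption: read the acting agent's reward right after its step.
            agent_rewards[agent] = float(self.env.rewards.get(agent, 0.0))

        observations = {a: self.env.observe(a) for a in self._agents}
        agent_dones = {a: self._is_done(a) for a in self._agents}
        total_reward = round(sum(agent_rewards.values()), 4)
        all_done = all(agent_dones.values())
        info = {"agent_rewards": agent_rewards, "agent_dones": agent_dones}
        return observations, total_reward, all_done, info

    def get_agents(self):
        return self._agents


class ToyTurnEnv:
    """Hypothetical turn-based env: 'a' ends after 1 turn, 'b' after 2."""

    def reset(self):
        self.agents = ["a", "b"]
        self.agent_selection = "a"
        self.turns = {name: 0 for name in self.agents}
        self.rewards = {name: 0.0 for name in self.agents}
        self.terminations = {name: False for name in self.agents}
        self.truncations = {name: False for name in self.agents}

    def observe(self, agent):
        return self.turns[agent]  # observation = turns taken so far

    def step(self, action):
        agent = self.agent_selection
        self.turns[agent] += 1
        self.rewards[agent] = float(action)  # reward simply echoes the action
        limit = 1 if agent == "a" else 2
        self.terminations[agent] = self.turns[agent] >= limit
        # Hand the turn to the next agent that is still active, if any.
        start = self.agents.index(agent)
        for offset in range(1, len(self.agents) + 1):
            candidate = self.agents[(start + offset) % len(self.agents)]
            if not self.terminations[candidate]:
                self.agent_selection = candidate
                break


adapter = PettingZooToGymAdapter(ToyTurnEnv())
first_obs = adapter.reset()
obs, total_reward, all_done, info = adapter.step({"a": 0.5, "b": 0.25})
```

In this toy run, agent "a" terminates during the first macro step, so a second call only needs an action for "b". If the wrapped environment reports rewards differently (for example, only as a full dict after the whole round), the reward-collection line is the part to adjust.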