Variable Population Multi-Agent Padding

Vision-Language & Cross-Modal Systems DS practice problem on Onlearn.

Difficulty: medium.

Topics: Variable Population Multi-Agent Padding, Zero-Padding, Masked Self-Attention, Permutation Invariance, Batch Normalization, Embedding Alignment, Multi-Agent Systems, Deep Learning Architectures, Sequence Modeling, Computational Geometry, Distributed Optimization, Dynamic Input Handling, Attention Mechanisms, Tensor Manipulation, Graph Neural Networks, Representation Learning.

In multi-agent reinforcement learning, the number of active agents can change across timesteps. Agents may spawn or terminate during an episode, so each step can carry a different amount of data. To process such data efficiently in batches, every timestep must be padded to a uniform shape and paired with a validity mask indicating which agent slots contain real data.

Implement a function that takes a list of timestep records, where each record contains observations, actions, and rewards for a variable number of agents, and produces padded arrays suitable for batch computation.

Your function should:

- Accept a list of timestep dictionaries, each containing 'observations' (list of agent observation vectors), 'actions' (list of integer actions), and 'rewards' (list of float rewards). The number of agents can differ across timesteps.
- Determine the maximum number of agents across all timesteps and the observation dimensionality from the data.
- Produce padded NumPy arrays with consistent shapes across all timesteps. Observation padding should use zeros, action padding should use -1 (so that padding cannot be confused with a valid action index), and reward padding should use 0.0.
- Produce a boolean mask array indicating which agent slots are valid (contain real data) at each timestep.
- Return a dictionary with keys 'observations', 'actions', 'rewards', 'mask', and 'agent_counts'.

The output shapes should be:

- 'observations': (num_timesteps, max_agents, obs_dim)
- 'actions': (num_timesteps, max_agents) with integer dtype
- 'rewards': (num_timesteps, max_agents)
- 'mask': (num_timesteps, max_agents) with boolean dtype
- 'agent_counts': (num_timesteps,) with integer dtype

Handle edge cases where some timesteps have zero active agents.
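
One possible way to approach the task is sketched below. It is not a reference solution. The function name pad_multi_agent_timesteps and the action_pad parameter are hypothetical, and the sketch makes several assumptions that the statement does not settle: the action padding value is -1 (so that it cannot collide with a non-negative action index), the count key is spelled 'agent_counts', observations and rewards are stored as float64, and obs_dim falls back to 0 when no timestep contains any agent.

```python
import numpy as np


def pad_multi_agent_timesteps(timesteps, action_pad=-1):
    """Pad variable-population timestep records to uniform arrays (sketch)."""
    num_timesteps = len(timesteps)

    # Number of active agents at each timestep.
    agent_counts = np.array(
        [len(t["observations"]) for t in timesteps], dtype=np.int64
    )
    max_agents = int(agent_counts.max()) if num_timesteps > 0 else 0

    # Infer obs_dim from the first timestep that has at least one agent.
    # Assumption: obs_dim is 0 if every timestep is empty.
    obs_dim = 0
    for t in timesteps:
        if len(t["observations"]) > 0:
            obs_dim = len(t["observations"][0])
            break

    # Pre-fill every array with its padding value, then overwrite valid slots.
    observations = np.zeros((num_timesteps, max_agents, obs_dim), dtype=np.float64)
    actions = np.full((num_timesteps, max_agents), action_pad, dtype=np.int64)
    rewards = np.zeros((num_timesteps, max_agents), dtype=np.float64)
    mask = np.zeros((num_timesteps, max_agents), dtype=bool)

    for i, t in enumerate(timesteps):
        n = int(agent_counts[i])
        if n == 0:
            continue  # timestep with no active agents stays fully padded
        observations[i, :n] = np.asarray(t["observations"], dtype=np.float64)
        actions[i, :n] = np.asarray(t["actions"], dtype=np.int64)
        rewards[i, :n] = np.asarray(t["rewards"], dtype=np.float64)
        mask[i, :n] = True

    return {
        "observations": observations,
        "actions": actions,
        "rewards": rewards,
        "mask": mask,
        "agent_counts": agent_counts,
    }


# Example: two agents, then none, then one.
timesteps = [
    {"observations": [[1.0, 2.0], [3.0, 4.0]], "actions": [0, 2], "rewards": [0.5, -1.0]},
    {"observations": [], "actions": [], "rewards": []},
    {"observations": [[5.0, 6.0]], "actions": [1], "rewards": [2.0]},
]
batch = pad_multi_agent_timesteps(timesteps)
```

Pre-filling with the padding values and then writing only the first n slots keeps the zero-agent case trivial: such a timestep is simply never overwritten. Downstream code can then use the mask to ignore padded slots, for example by summing rewards with (batch["rewards"] * batch["mask"]).sum(axis=1).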