Multi-GPU Communication Overhead (NVLink vs InfiniBand)
Infrastructure, Parallelism & Hardware Efficiency DS practice problem on Onlearn.
Difficulty: medium.
Topics: Multi-GPU Communication Overhead (NVLink vs InfiniBand), NVLink Bandwidth Scaling, Remote Direct Memory Access (RDMA), PCIe Bus Contention, All-Reduce Latency, GPUDirect Peer-to-Peer, Distributed Systems, Computer Architecture, High-Performance Computing, Network Engineering, Parallel Computing, Interconnect Topologies, Memory Hierarchy Management, Collective Communication Primitives, Hardware Acceleration, System Bottleneck Analysis.
Implement a function multi_gpu_comm_overhead that models and compares the communication overhead of collective operations across multiple GPUs for two interconnect technologies: NVLink and InfiniBand. In distributed deep learning, gradient synchronization and tensor communication across GPUs are often the bottleneck, and the choice of interconnect significantly affects training throughput. The function should model the time cost of common collective operations using standard algorithms.

Function Inputs
- message_size_gb (float): Size of the data payload in gigabytes (e.g., gradient tensor size)
- num_gpus (int): Number of GPUs participating in the collective (at least 2)
- nvlink_config (dict): NVLink interconnect specs with keys 'bandwidth_gbps' (float, in GB/s) and 'latency_us' (float, in microseconds)
- ib_config (dict): InfiniBand interconnect specs with the same keys as nvlink_config
- operation (str): The collective operation, one of 'all_reduce', 'all_gather', or 'broadcast'

Communication Models
The function should model three collective operations:
- all_reduce: Uses a ring-based algorithm. The number of communication steps is 2(N - 1), where N is the number of GPUs. In each step, each GPU transfers M / N of the payload over its link, where M is the message size.
- all_gather: Uses a ring-based algorithm. The number of steps is N - 1, with M / N data per step.
- broadcast: Uses a binary tree algorithm. The number of steps is ceil(log2(N)), with the full message M transferred per step.
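The step counts and per-step payloads above can be sketched as a small helper. This is only an illustration: the helper name steps_and_chunk is hypothetical, and the operation strings are assumed to use underscores ('all_reduce', 'all_gather', 'broadcast').

```python
def steps_and_chunk(operation, num_gpus, message_size_gb):
    """Return (num_steps, data_per_step_gb) for one collective.

    Hypothetical helper; operation names are assumed to use underscores.
    """
    if num_gpus < 2:
        raise ValueError("num_gpus must be at least 2")

    if operation == "all_reduce":
        # Ring all-reduce: 2(N - 1) steps, each moving M / N.
        return 2 * (num_gpus - 1), message_size_gb / num_gpus
    if operation == "all_gather":
        # Ring all-gather: N - 1 steps, each moving M / N.
        return num_gpus - 1, message_size_gb / num_gpus
    if operation == "broadcast":
        # Binary tree: ceil(log2(N)) steps, each moving the full message.
        # (N - 1).bit_length() equals ceil(log2(N)) for N >= 1 and avoids
        # floating-point rounding.
        return (num_gpus - 1).bit_length(), message_size_gb
    raise ValueError(f"unsupported operation: {operation!r}")
```

For example, a 1 GB all_reduce across 8 GPUs takes 14 steps of 0.125 GB each, while a broadcast to 5 GPUs takes 3 steps of the full message.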
For each interconnect, compute:
- Latency contribution (ms) = num_steps * per_step_latency (converted from microseconds to ms)
- Bandwidth contribution (ms) = num_steps * data_per_step / bandwidth (converted to ms)
- Total time = latency contribution + bandwidth contribution

Function Output
Return a dictionary with:
- 'nvlink_time_ms': Total NVLink communication time in ms (rounded to 4 decimal places)
- 'ib_time_ms': Total InfiniBand communication time in ms (rounded to 4 decimal places)
- 'speedup': Ratio of InfiniBand time to NVLink time (rounded to 4 decimal places)
- 'nvlink_bottleneck': String 'bandwidth' or 'latency' indicating which component dominates (use >= for bandwidth, so a tie is reported as 'bandwidth')
- 'ib_bottleneck': Same as above for InfiniBand
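One possible reading of the full specification is sketched below; it is not a reference solution. It assumes underscore-separated key and operation names, computes the speedup from the unrounded times (the statement does not say which to use), treats a tie between the two contributions as 'bandwidth', and assumes positive bandwidths and a non-zero NVLink time.

```python
def multi_gpu_comm_overhead(message_size_gb, num_gpus, nvlink_config,
                            ib_config, operation):
    """Model collective communication time over NVLink and InfiniBand."""
    if num_gpus < 2:
        raise ValueError("num_gpus must be at least 2")

    # Step count and per-step payload (GB) for each collective.
    if operation == "all_reduce":
        steps, chunk_gb = 2 * (num_gpus - 1), message_size_gb / num_gpus
    elif operation == "all_gather":
        steps, chunk_gb = num_gpus - 1, message_size_gb / num_gpus
    elif operation == "broadcast":
        # (N - 1).bit_length() equals ceil(log2(N)) for N >= 1.
        steps, chunk_gb = (num_gpus - 1).bit_length(), message_size_gb
    else:
        raise ValueError(f"unsupported operation: {operation!r}")

    def link_cost(config):
        # Microseconds -> milliseconds.
        latency_ms = steps * config["latency_us"] / 1000.0
        # GB / (GB/s) gives seconds; multiply by 1000 for milliseconds.
        bandwidth_ms = steps * chunk_gb / config["bandwidth_gbps"] * 1000.0
        # Assumption: ties count as bandwidth-bound.
        bottleneck = "bandwidth" if bandwidth_ms >= latency_ms else "latency"
        return latency_ms + bandwidth_ms, bottleneck

    nvlink_ms, nvlink_bottleneck = link_cost(nvlink_config)
    ib_ms, ib_bottleneck = link_cost(ib_config)

    return {
        "nvlink_time_ms": round(nvlink_ms, 4),
        "ib_time_ms": round(ib_ms, 4),
        # Assumption: ratio taken before rounding the individual times.
        "speedup": round(ib_ms / nvlink_ms, 4),
        "nvlink_bottleneck": nvlink_bottleneck,
        "ib_bottleneck": ib_bottleneck,
    }
```

With illustrative (not vendor-quoted) inputs of a 1 GB all_reduce across 8 GPUs, NVLink at 250 GB/s and 2 us, and InfiniBand at 25 GB/s and 5 us, this sketch gives roughly 7.03 ms for NVLink and 70.07 ms for InfiniBand, a speedup of about 9.97, with both links bandwidth-bound. A payload of a few kilobytes flips both links to latency-bound.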