Designing 4D Parallelism for Large Model Training
Infrastructure, Parallelism & Hardware Efficiency DS practice problem on Onlearn.
Difficulty: hard.
Topics: Designing 4D Parallelism for Large Model Training, ZeRO Redundancy Optimizer, Pipeline Micro-batching, All-Reduce Collective, Activation Checkpointing, Communication Overlap, Distributed Systems, High-Performance Computing, Deep Learning Frameworks, Hardware Architecture, Network Topology, Model Parallelism, Data Parallelism, Pipeline Parallelism, Tensor Parallelism, Memory Management.
Implement a function that analyzes a 4D parallelism configuration for training large transformer models across many GPUs. 4D parallelism combines four orthogonal strategies: Data Parallelism (DP), Tensor Parallelism (TP), Pipeline Parallelism (PP), and Sequence Parallelism (SP).

Your function should accept three dictionaries:

model_config with keys:
- num_layers: number of transformer layers L
- hidden_size: hidden dimension H
- vocab_size: vocabulary size V
- seq_len: sequence length S
- micro_batch_size: micro-batch size B
- num_micro_batches: number of micro-batches in a training step

parallel_config with keys:
- dp: data parallelism degree
- tp: tensor parallelism degree
- pp: pipeline parallelism degree
- sp: sequence parallelism degree

hardware_config with keys:
- num_gpus: total number of GPUs
- memory_per_gpu_gb: memory per GPU in GB

Return a dictionary with the following keys:
- valid (bool): whether dp * tp * pp * sp equals num_gpus
- total_params (int): total model parameters, computed as 2 * V * H + L * 12 * H * H
- param_memory_per_gpu_gb (float, rounded to 4 decimals): model state memory per GPU in GB, using 16 bytes per parameter (mixed precision with Adam optimizer states), split across the TP and PP dimensions
- activation_memory_per_gpu_gb (float, rounded to 4 decimals): activation memory per GPU in GB, using an activation footprint of 10 * S * B * H * 2 bytes per layer, divided across PP stages and split by TP and SP
- total_memory_per_gpu_gb (float, rounded to 4 decimals): sum of parameter and activation memory
- fits_in_memory (bool): whether the total memory fits in GPU memory
- dp_comm_gb (float, rounded to 4 decimals): DP gradient all-reduce volume, using the ring all-reduce formula on fp16 gradients
- tp_comm_gb (float, rounded to 4 decimals): TP all-reduce volume (4 all-reduces per layer for forward + backward), using ring all-reduce on fp16 activation tensors of size S * B * H
- pp_comm_gb (float, rounded to 4 decimals): PP point-to-point volume for forward and backward activation transfers
- sp_comm_gb (float, rounded to 4 decimals): SP all-gather/reduce-scatter volume (2 operations per layer)
- bubble_ratio (float, rounded to 4 decimals): pipeline bubble fraction
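
The validity and memory fields described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference solution: the function name analyze_memory and the two helpers are hypothetical, the underscore-separated key names are assumed, "GB" is taken to mean 2**30 bytes, and the total is rounded from the unrounded sum. The bubble formula (pp - 1) / (num_micro_batches + pp - 1) is a common GPipe/1F1B-style estimate that the problem statement does not spell out. The communication fields are left out because the statement does not fully specify the tensor sizes involved; only the generic ring all-reduce volume, 2 * (n - 1) / n times the tensor size per rank, is shown.

```python
GIB = 1024 ** 3  # assumption: "GB" means 2**30 bytes; the problem does not say


def ring_all_reduce_bytes(tensor_bytes, n):
    """Bytes each rank sends in a ring all-reduce over n ranks."""
    return 2 * (n - 1) / n * tensor_bytes


def bubble_fraction(pp, num_micro_batches):
    """Assumed GPipe/1F1B-style estimate of the pipeline bubble."""
    return (pp - 1) / (num_micro_batches + pp - 1)


def analyze_memory(model_config, parallel_config, hardware_config):
    """Hypothetical helper covering only the validity and memory fields."""
    L = model_config["num_layers"]
    H = model_config["hidden_size"]
    V = model_config["vocab_size"]
    S = model_config["seq_len"]
    B = model_config["micro_batch_size"]
    dp, tp, pp, sp = (parallel_config[k] for k in ("dp", "tp", "pp", "sp"))

    # The four degrees must multiply out to the GPU count.
    valid = dp * tp * pp * sp == hardware_config["num_gpus"]

    # Embedding and output weights (2 * V * H) plus 12 * H^2 per layer.
    total_params = 2 * V * H + L * 12 * H * H

    # 16 bytes per parameter, sharded across TP and PP.
    param_gb = total_params * 16 / (tp * pp) / GIB

    # 10 * S * B * H * 2 bytes per layer, L / pp layers per stage,
    # further split across TP and SP (one reading of the statement).
    act_gb = 10 * S * B * H * 2 * (L / pp) / (tp * sp) / GIB

    total_gb = param_gb + act_gb
    return {
        "valid": valid,
        "total_params": total_params,
        "param_memory_per_gpu_gb": round(param_gb, 4),
        "activation_memory_per_gpu_gb": round(act_gb, 4),
        "total_memory_per_gpu_gb": round(total_gb, 4),
        "fits_in_memory": total_gb <= hardware_config["memory_per_gpu_gb"],
    }
```

For example, with L = 32, H = 4096, V = 32000, S = 2048, B = 1 and dp = 2, tp = 4, pp = 4, sp = 1 on 32 GPUs, this sketch gives 6,704,594,944 parameters, about 6.2441 GB of model state and 0.3125 GB of activations per GPU.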