Roofline Model Analysis for GPU Operations

Infrastructure, Parallelism & Hardware Efficiency DS practice problem on Onlearn.

Difficulty: medium.

Topics: Roofline Model Analysis for GPU Operations, Floating Point Operations Per Second, Operational Intensity, Global Memory Latency, Register File Pressure, Peak Theoretical Performance, Computer Architecture, Parallel Computing, Performance Engineering, Memory Hierarchy, Computational Complexity, Instruction Level Parallelism, Memory Bandwidth Analysis, Arithmetic Intensity, Cache Coherency Protocols, Throughput Optimization.

Implement a function that performs Roofline Model analysis for a set of GPU operations given hardware specifications. The Roofline Model is a visual performance model used to determine whether a computational kernel is compute bound or memory bound on a given hardware platform. It provides an upper bound on attainable performance based on the interplay between peak computational throughput and peak memory bandwidth.

Your function should accept:

peak gflops: Peak computational throughput of the GPU in GFLOPS (giga floating point operations per second).
peak bandwidth gbs: Peak memory bandwidth in GB/s (gigabytes per second).
operations: A list of dictionaries, each with keys 'name' (string), 'flops' (total floating point operations as float), and 'bytes' (total bytes transferred as float).

Your function should return a dictionary containing:

'ridge point': The operational intensity (FLOP/byte) at which the hardware transitions from memory bound to compute bound behavior.
'operations': A list of dictionaries, one per input operation, each containing:
    'name': The operation name.
    'operational intensity': The ratio of FLOPs to bytes for this operation (FLOP/byte).
    'attainable gflops': The maximum achievable performance in GFLOPS for this operation on the given hardware.
    'bottleneck': Either 'compute bound' or 'memory bound', indicating which resource limits the operation. An operation whose operational intensity is at or above the ridge point is considered compute bound.
    'efficiency': The percentage of peak compute that this operation can achieve (0 to 100).

Use only Python standard library and/or NumPy.
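One possible way to approach the task is sketched below. It relies on the standard roofline relations: the ridge point is peak compute divided by peak bandwidth, and attainable performance is the smaller of peak compute and operational intensity times peak bandwidth. Several details are assumptions rather than part of the statement: the function name roofline_analysis, the parameter names peak_gflops and peak_bandwidth_gbs (written with underscores so they are valid Python identifiers), the choice to treat an operation with zero bytes as having infinite operational intensity, and the ValueError raised for non-positive hardware figures. Dictionary keys are spelled exactly as they appear in the statement above; adjust them if the grader expects a different spelling.

```python
import math


def roofline_analysis(peak_gflops, peak_bandwidth_gbs, operations):
    """Sketch of a roofline analysis (names and edge-case handling are assumptions)."""
    if peak_gflops <= 0 or peak_bandwidth_gbs <= 0:
        # Assumption: the statement does not say how to treat invalid hardware specs.
        raise ValueError("peak_gflops and peak_bandwidth_gbs must be positive")

    # Intensity (FLOP/byte) where the bandwidth roof meets the compute roof.
    ridge_point = peak_gflops / peak_bandwidth_gbs

    analyzed = []
    for op in operations:
        flops = float(op["flops"])
        nbytes = float(op["bytes"])

        # Assumption: zero bytes transferred means the operation is never limited by memory.
        intensity = flops / nbytes if nbytes > 0 else math.inf

        # Roofline bound: GB/s * FLOP/byte gives GFLOPS, capped by peak compute.
        attainable = min(peak_gflops, intensity * peak_bandwidth_gbs)

        # At or above the ridge point counts as compute bound, per the statement.
        bottleneck = "compute bound" if intensity >= ridge_point else "memory bound"

        analyzed.append({
            "name": op["name"],
            "operational intensity": intensity,
            "attainable gflops": attainable,
            "bottleneck": bottleneck,
            "efficiency": 100.0 * attainable / peak_gflops,
        })

    return {"ridge point": ridge_point, "operations": analyzed}


# Hypothetical hardware and workloads, chosen only to exercise both branches.
example_ops = [
    {"name": "vector_add", "flops": 1e9, "bytes": 1e9},   # 1 FLOP/byte
    {"name": "matmul", "flops": 2e10, "bytes": 1e9},      # 20 FLOP/byte
    {"name": "at_ridge", "flops": 1e10, "bytes": 1e9},    # 10 FLOP/byte
]
result = roofline_analysis(1000.0, 100.0, example_ops)
```

With these made-up numbers the ridge point is 10 FLOP/byte, so vector_add falls under the bandwidth roof (100 GFLOPS attainable, 10% efficiency) while matmul and at_ridge both reach the 1000 GFLOPS compute roof. The sketch returns unrounded floats because the statement does not specify a rounding rule.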