Cold Start Latency Budget Breakdown

Data Pipelines, Monitoring & Reliability DS practice problem on Onlearn.

Difficulty: medium.

Topics: Cold Start Latency Budget Breakdown, Provisioning Overhead, Container Image Pull Latency, Runtime Initialization Time, P99 Tail Latency Analysis, Lazy Loading Dependencies, Distributed Systems Architecture, Cloud Infrastructure & Orchestration, Observability & Telemetry, Performance Engineering, Machine Learning Operations (MLOps), Serverless Function Lifecycle, Container Orchestration Scheduling, Distributed Tracing & Profiling, Resource Provisioning & Auto-scaling, Model Serving & Inference Optimization.

When deploying ML models in serverless or autoscaling environments, cold start latency is the time from receiving the first request to serving the first prediction after a new instance spins up. Understanding and budgeting for each phase of cold start is critical for meeting service level agreements (SLAs).

Write a function cold_start_breakdown(config) that takes a dictionary describing a model serving deployment and returns a dictionary containing the latency contribution of each cold start phase, the total cold start latency, whether the SLA is met, the percentage of the SLA budget consumed, and the name of the bottleneck phase (the phase with the highest latency).

The config dictionary has these keys:

model_size_gb (float): Size of the model weights in gigabytes
storage_bandwidth_gbps (float): Read bandwidth from storage in GB/s
runtime_init_ms (float): Time for container/runtime initialization in milliseconds
compilation_ms_per_gb (float): Model deserialization and compilation time per GB of weights, in milliseconds
warmup_batches (int): Number of warmup inference batches to run before serving
warmup_batch_ms (float): Time per warmup inference batch in milliseconds
health_check_ms (float): Time for the readiness health check in milliseconds
sla_budget_ms (float): Maximum allowed cold start latency in milliseconds

The five cold start phases are:

1. Runtime Initialization
2. Weight Loading (transferring model from storage to memory)
3. Compilation (deserializing/compiling the model)
4. Warmup (running warmup inference batches)
5. Health Check

Return a dictionary with:

runtime_init_ms: latency of phase 1 (float, rounded to 2 decimal places)
weight_load_ms: latency of phase 2 (float, rounded to 2 decimal places)
compilation_ms: latency of phase 3 (float, rounded to 2 decimal places)
warmup_ms: latency of phase 4 (float, rounded to 2 decimal places)
health_check_ms: latency of phase 5 (float, rounded to 2 decimal places)
total_ms: sum of all phases (float, rounded to 2 decimal places)
meets_sla: boolean, True if total_ms is less than or equal to sla_budget_ms
budget_used_pct: percentage of SLA budget consumed (float, rounded to 2 decimal places)
bottleneck: string key name of the phase with the highest latency
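
The specification above could be sketched in Python as follows. This is one possible reading, not a reference solution, and it rests on several assumptions the problem statement does not spell out: identifier names use underscores; weight loading time is model size divided by bandwidth, converted from seconds to milliseconds; compilation time scales linearly with model size; totals, the SLA check, and the percentage are computed from unrounded phase values; and ties for the bottleneck go to the earliest phase in lifecycle order.

```python
def cold_start_breakdown(config):
    """Sketch of a cold start latency breakdown (all times in milliseconds)."""
    size_gb = config["model_size_gb"]

    # Phase latencies in lifecycle order. The formulas are assumptions
    # inferred from the key descriptions and units.
    phases = {
        "runtime_init_ms": config["runtime_init_ms"],
        # GB / (GB/s) gives seconds; multiply by 1000 for milliseconds.
        "weight_load_ms": size_gb / config["storage_bandwidth_gbps"] * 1000.0,
        "compilation_ms": config["compilation_ms_per_gb"] * size_gb,
        "warmup_ms": config["warmup_batches"] * config["warmup_batch_ms"],
        "health_check_ms": config["health_check_ms"],
    }

    total_ms = sum(phases.values())
    sla_ms = config["sla_budget_ms"]

    # max() returns the first maximal key, so ties resolve to the
    # earliest phase (an assumed tie-breaking rule).
    bottleneck = max(phases, key=phases.get)

    result = {name: round(float(value), 2) for name, value in phases.items()}
    result["total_ms"] = round(total_ms, 2)
    result["meets_sla"] = total_ms <= sla_ms
    result["budget_used_pct"] = round(total_ms / sla_ms * 100.0, 2)
    result["bottleneck"] = bottleneck
    return result


# Hypothetical example values, chosen only for illustration.
example = cold_start_breakdown({
    "model_size_gb": 2.0,
    "storage_bandwidth_gbps": 0.5,
    "runtime_init_ms": 500.0,
    "compilation_ms_per_gb": 300.0,
    "warmup_batches": 3,
    "warmup_batch_ms": 50.0,
    "health_check_ms": 20.0,
    "sla_budget_ms": 6000.0,
})
print(example)
```

With these example values, weight loading takes 2.0 / 0.5 = 4 s = 4000 ms and dominates the 5270 ms total, which consumes 87.83% of the 6000 ms budget. The sketch does not guard against a zero storage bandwidth or a zero SLA budget; a fuller version would need to decide how to report those cases.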