SLA Compliance Metrics for Model Service

Data Pipelines, Monitoring & Reliability DS practice problem on Onlearn.

Difficulty: easy.

Topics: SLA Compliance Metrics for Model Service, P99 Latency Thresholds, Burn Rate Calculation, Request-per-second (RPS) Saturation, Mean Time Between Failures (MTBF), Z-score Thresholding, Service Level Management, System Observability, Distributed Systems Architecture, Statistical Process Control, Incident Response Engineering, Latency Percentile Analysis, Error Budgeting, Throughput Capacity Planning, Availability Modeling, Anomaly Detection Pipelines.

In production ML systems, Service Level Agreement (SLA) monitoring is crucial for ensuring that your model serving endpoints meet their performance guarantees. Given a list of request results from a model serving endpoint, compute key SLA compliance metrics.

Each request result is a dictionary with:
'latency_ms': Response latency in milliseconds (float)
'status': Either 'success', 'error', or 'timeout'

Write a function calculate_sla_metrics(requests, latency_sla_ms) that computes:
1. Latency SLA Compliance: Percentage of successful requests that completed within the latency threshold
2. Error Rate: Percentage of all requests that resulted in an error or timeout
3. Overall SLA Compliance: Percentage of all requests that both succeeded AND met the latency threshold

The function should return a dictionary with these three metrics. If the input list is empty, return an empty dictionary. If there are no successful requests, latency_sla_compliance should be 0.0. All returned values should be percentages (0 to 100) rounded to 2 decimal places.
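One possible solution is sketched below; it is a minimal illustration, not a reference answer. It rests on several assumptions the problem statement does not settle: the identifiers are spelled with underscores (latency_ms, calculate_sla_metrics, latency_sla_ms), the returned keys are latency_sla_compliance, error_rate, and overall_sla_compliance (only the first is named in the text), and "within the latency threshold" is read as less than or equal to the threshold.

```python
from typing import Dict, List, Union

# Assumed shape of one request result: {'latency_ms': float, 'status': str}
Request = Dict[str, Union[float, str]]


def calculate_sla_metrics(
    requests: List[Request], latency_sla_ms: float
) -> Dict[str, float]:
    """Compute SLA compliance percentages for a batch of request results."""
    # Empty input: the problem asks for an empty dictionary.
    if not requests:
        return {}

    total = len(requests)

    # Latencies of successful requests only.
    success_latencies = [
        r["latency_ms"] for r in requests if r["status"] == "success"
    ]

    # Successful requests that met the threshold (assumption: inclusive, <=).
    within_sla = sum(1 for lat in success_latencies if lat <= latency_sla_ms)

    # Requests that ended in an error or a timeout.
    failed = sum(1 for r in requests if r["status"] in ("error", "timeout"))

    # Denominator is successful requests; defined as 0.0 when there are none.
    if success_latencies:
        latency_compliance = 100.0 * within_sla / len(success_latencies)
    else:
        latency_compliance = 0.0

    # Key names other than latency_sla_compliance are assumptions.
    return {
        "latency_sla_compliance": round(latency_compliance, 2),
        "error_rate": round(100.0 * failed / total, 2),
        "overall_sla_compliance": round(100.0 * within_sla / total, 2),
    }


if __name__ == "__main__":
    sample = [
        {"latency_ms": 100.0, "status": "success"},
        {"latency_ms": 250.0, "status": "success"},
        {"latency_ms": 50.0, "status": "error"},
        {"latency_ms": 300.0, "status": "timeout"},
    ]
    print(calculate_sla_metrics(sample, 200.0))
```

In the sample, two of four requests succeed and one of those two finishes within 200 ms, so latency compliance is 50.0, the error rate is 50.0, and overall compliance is 25.0. Note that the first metric uses successful requests as its denominator while the other two use all requests, which is why the three numbers can differ.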