Request Batching for Inference
Data Pipelines, Monitoring & Reliability DS practice problem on Onlearn.
Difficulty: medium.
Topics: Request Batching for Inference, Wait-Time Thresholding, Maximum Batch Size, Buffer Overflow Management, Asynchronous Request Handling, Throughput Saturation, Distributed Systems, Performance Engineering, Inference Optimization, Queueing Theory, System Reliability, Latency-Throughput Tradeoffs, Dynamic Request Aggregation, Concurrency Control, Resource Scheduling, Load Balancing Strategies.
Implement a request batching function for ML model inference. In production ML systems, batching multiple inference requests together improves throughput by leveraging parallel processing capabilities of hardware accelerators. Your function should group incoming requests into batches based on two constraints: 1. Maximum batch size : A batch should not exceed this number of requests 2. Maximum wait time : The time elapsed since the first request in the current batch should not exceed this threshold When either constraint is violated by an incoming request, the current batch should be finalized and processed, and a new batch should start with the incoming request. Function Inputs: requests: A list of dictionaries, each containing: 'id': Integer identifier for the request 'timestamp': Float representing arrival time in seconds 'features': List of float values representing input features max batch size: Integer, maximum number of requests per batch max wait time: Float, maximum seconds to wait before processing a batch Function Output: A list of tuples, where each tuple represents a processed batch containing: List of request IDs in the batch List of feature lists (batched features) Processing timestamp (the timestamp of the last request in the batch, rounded to 4 decimals) Note: Requests should be processed in order of their timestamps, regardless of their order in the input list.