Request Batching for Inference

LLM Inference & Memory Systems DS practice problem on Onlearn.

Difficulty: medium.

Topics: Request Batching for Inference, Continuous Batching, KV Cache Paging, Throughput Optimization, Latency Minimization, Request Queuing, Distributed Systems, Computer Architecture, Parallel Computing, Operating Systems, Software Engineering, Memory Management, Task Scheduling, Network Communication, Concurrency Control, Resource Allocation.

Implement a request batching function for ML model inference. In production ML systems, batching multiple inference requests together improves throughput by leveraging parallel processing capabilities of hardware accelerators. Your function should group incoming requests into batches based on two constraints: 1. Maximum batch size : A batch should not exceed this number of requests 2. Maximum wait time : The time elapsed since the first request in the current batch should not exceed this threshold When either constraint is violated by an incoming request, the current batch should be finalized and processed, and a new batch should start with the incoming request. Function Inputs: requests: A list of dictionaries, each containing: 'id': Integer identifier for the request 'timestamp': Float representing arrival time in seconds 'features': List of float values representing input features max batch size: Integer, maximum number of requests per batch max wait time: Float, maximum seconds to wait before processing a batch Function Output: A list of tuples, where each tuple represents a processed batch containing: List of request IDs in the batch List of feature lists (batched features) Processing timestamp (the timestamp of the last request in the batch, rounded to 4 decimals) Note: Requests should be processed in order of their timestamps, regardless of their order in the input list.