Cache-Aware Request Routing Across Replicas

Data Pipelines, Monitoring & Reliability DS practice problem on Onlearn.

Difficulty: medium.

Topics: Cache-Aware Request Routing Across Replicas, Rendezvous Hashing, Least Recently Used (LRU), P99 Tail Latency, Circuit Breaker Pattern, Cache Hit Ratio, Distributed Systems Architecture, Network Traffic Engineering, Cache Management Strategies, System Observability and Telemetry, Load Balancing Algorithms, Consistent Hashing, Locality-Sensitive Routing, Cache Eviction Policies, Request Latency Profiling, Replica Health Monitoring.

In production LLM serving systems, multiple model replicas serve inference requests. Each replica maintains a KV cache of previously computed prefixes. When a new request arrives, routing it to a replica that already has its prefix cached can avoid expensive recomputation, dramatically reducing time to first token (TTFT). Implement a cache aware request router that assigns incoming requests to model replicas. For each request, the router should: 1. Compute the longest matching prefix between the request's token sequence and each cached prefix on every replica 2. Calculate a composite routing score per replica that balances cache affinity against current load 3. Assign the request to the highest scoring replica and update that replica's load 4. Skip replicas that have reached their maximum capacity (unless all replicas are full, in which case pick the one with the lowest load ratio and still compute cache affinity) The routing score for a replica is: score = alpha (longest match length / request length) beta (active requests / max capacity) Process requests sequentially, updating replica load after each assignment. Return a dictionary containing: 'assignments': list of replica indices (one per request) 'cache hit rates': list of floats (match length / request length for each assignment, rounded to 2 decimals) 'avg cache hit rate': float (mean of cache hit rates, rounded to 2 decimals) 'load distribution': list of ints (final active requests per replica after all routing)