Access-Control Queuing with R-Learning
Planning, Dynamics & Decision Systems DS practice problem on Onlearn.
Difficulty: medium.
Topics: Access-Control Queuing with R-Learning, Relative Value Function, Gain Bias Decomposition, Bellman Optimality Equation, State-Space Discretization, Policy Iteration, Reinforcement Learning, Queueing Theory, Stochastic Control, Operations Research, Robotic Systems Engineering, Average-Reward Markov Decision Processes, Dynamic Programming, Admission Control Policies, Temporal Difference Learning, System Throughput Analysis.
Implement a solution to the access control queuing problem, a classic average reward reinforcement learning task. The Environment: There are n servers servers that can each serve one customer at a time. At each time step, a new customer arrives with a priority drawn uniformly at random from a list of possible priorities. The agent must decide to accept or reject the customer. If accepted and at least one server is free, the agent receives a reward equal to the customer's priority, and one free server becomes busy. If accepted but no servers are free, or if rejected, the reward is 0. At the end of each step, each busy server independently becomes free with probability p free. The Learning Algorithm: Since this is a continuing (non episodic) task, use R learning , a tabular off policy control method for the average reward setting. The state is the tuple (customer priority, n free servers). Actions are 'accept' and 'reject'. Maintain Q values Q[(priority, n free, action)] initialized to 0. Maintain an average reward estimate rho initialized to 0. Use epsilon greedy action selection. When Q values are tied, prefer 'accept'. The function should return a tuple (policy, avg reward) where: policy is a dict mapping (priority, n free servers) to the greedy action ('accept' or 'reject') avg reward is the final average reward estimate rho, rounded to 4 decimal places