Job-Shop Scheduling with TD Learning

Planning, Dynamics & Decision Systems DS practice problem on Onlearn.

Difficulty: hard.

Topics: Job-Shop Scheduling with TD Learning, Q-Learning, Disjunctive Constraints, Bellman Equation, State-Action Value Function, Makespan Minimization, Reinforcement Learning, Operations Research, Stochastic Processes, Control Theory, Computational Complexity, Temporal Difference Learning, Combinatorial Optimization, Markov Decision Processes, Dynamic Programming, Heuristic Search.

Implement a SARSA (on policy TD(0)) learning agent for the job shop scheduling problem. In job shop scheduling, there are multiple jobs, each consisting of an ordered sequence of operations. Each operation must be processed on a specific machine for a specific duration. The goal is to find an ordering of operations that minimizes the makespan (the time at which all jobs are completed). Your function should formulate this as a reinforcement learning problem: State : A tuple representing the number of completed operations for each job (the progress of each job through its operation sequence). Actions : At each state, the agent can choose any job that still has remaining operations. The chosen job's next operation is scheduled as early as possible on its required machine. Scheduling Rule : When scheduling an operation (machine, duration), the start time is the maximum of (a) when the required machine becomes available and (b) when the job's previous operation finished. The operation occupies the machine from start time to start time + duration. Reward : 0 for non terminal transitions; negative makespan ( makespan) at the terminal state (when all operations are complete). The agent uses SARSA updates with epsilon greedy action selection over a Q table indexed by (state tuple, action int) pairs. Q values default to 0.0 for unseen state action pairs. When breaking ties among actions with equal Q values during greedy selection, choose the action with the lowest index. Parameters: jobs: list of lists, where jobs[i] is a list of (machine id, processing time) tuples for job i n machines: total number of machines n episodes: number of training episodes gamma: discount factor alpha: learning rate epsilon: exploration rate for epsilon greedy seed: random seed for reproducibility (use np.random.RandomState(seed)) Return a dictionary with: 'best makespan': the minimum makespan found across all episodes (rounded to 4 decimal places) 'best schedule': list of job indices representing the scheduling order that achieved the best makespan 'q values': dictionary mapping (state tuple, action int) to the learned Q value (rounded to 4 decimal places)