Acrobot Swing-Up Problem
Planning, Dynamics & Decision Systems DS practice problem on Onlearn.
Difficulty: hard.
Topics: Acrobot Swing-Up Problem, Underactuated Systems, Bellman Equation, Generalized Coordinates, Radial Basis Functions, Reward Shaping, Reinforcement Learning, Classical Control Theory, Classical Mechanics, Numerical Optimization, Dynamic Programming, Policy Gradient Methods, Lagrangian Dynamics, State-Space Representation, Function Approximation, Trajectory Optimization.
Implement a simulator for the Acrobot, a classic underactuated control problem in reinforcement learning. The Acrobot is a two-link planar robot resembling a gymnast on a high bar. The first link is attached to a fixed pivot, and the actuator applies torque only at the joint between the two links.

The state is represented as [theta1, theta2, dtheta1, dtheta2], where theta1 is the angle of the first link from the downward vertical, theta2 is the angle of the second link relative to the first link, and dtheta1 and dtheta2 are the corresponding angular velocities.

Your function should:

1. Compute the equations of motion for the two-link pendulum system using the provided physical parameters.
2. Integrate the dynamics forward using the 4th-order Runge-Kutta (RK4) method.
3. After each step, wrap the angles to [-pi, pi) and clip the angular velocities to their maximum bounds.
4. Check the terminal condition: the tip of the second link rises more than one link length above the pivot, i.e. -cos(theta1) - cos(theta1 + theta2) > 1.0.
5. Apply a reward of -1.0 for each step.

Physical parameters (fixed):

    Link lengths: l1 = l2 = 1.0
    Link masses: m1 = m2 = 1.0
    Center of mass positions: lc1 = lc2 = 0.5
    Moments of inertia: I1 = I2 = 1.0
    Gravity: g = 9.8
    Max angular velocities: max_vel1 = 4 * pi, max_vel2 = 9 * pi
    Available torques for actions: {-1: -1.0, 0: 0.0, 1: 1.0}

Args:

    initial_state: list of 4 floats [theta1, theta2, dtheta1, dtheta2]
    actions: list of integers (each -1, 0, or 1)
    dt: float, time step for integration (default 0.2)

Returns:

    Tuple of (states, total_reward, done_step) where:
    states: list of state lists rounded to 4 decimal places (includes the initial state)
    total_reward: float, sum of rewards
    done_step: int, 1-indexed step at which the terminal condition was first met, or -1 if not achieved
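One possible approach is sketched below. It is not a reference solution, and it rests on several assumptions the problem statement does not settle:

- The equations of motion follow the "book" form commonly used for this benchmark.
- Each action is integrated with a single RK4 step of size dt, with the torque held constant.
- The simulation stops at the first terminal step, and that step still incurs the -1.0 reward.
- Rounding is applied only to the recorded states, not to the internal state.

The names simulate_acrobot, derivatives, rk4_step, wrap and clip are hypothetical.

```python
import math

# Fixed physical parameters from the problem statement.
L1 = 1.0                      # l2 = 1.0 as well, but only lc2 enters the dynamics
M1 = M2 = 1.0
LC1 = LC2 = 0.5
I1 = I2 = 1.0
G = 9.8
MAX_VEL1, MAX_VEL2 = 4 * math.pi, 9 * math.pi
TORQUES = {-1: -1.0, 0: 0.0, 1: 1.0}


def derivatives(s, torque):
    """Time derivative of [theta1, theta2, dtheta1, dtheta2] (assumed 'book' form)."""
    th1, th2, dth1, dth2 = s
    d1 = M1 * LC1**2 + M2 * (L1**2 + LC2**2 + 2 * L1 * LC2 * math.cos(th2)) + I1 + I2
    d2 = M2 * (LC2**2 + L1 * LC2 * math.cos(th2)) + I2
    phi2 = M2 * LC2 * G * math.cos(th1 + th2 - math.pi / 2)
    phi1 = (-M2 * L1 * LC2 * dth2**2 * math.sin(th2)
            - 2 * M2 * L1 * LC2 * dth2 * dth1 * math.sin(th2)
            + (M1 * LC1 + M2 * L1) * G * math.cos(th1 - math.pi / 2)
            + phi2)
    ddth2 = ((torque + d2 / d1 * phi1
              - M2 * L1 * LC2 * dth1**2 * math.sin(th2) - phi2)
             / (M2 * LC2**2 + I2 - d2**2 / d1))
    ddth1 = -(d2 * ddth2 + phi1) / d1
    return [dth1, dth2, ddth1, ddth2]


def rk4_step(s, torque, dt):
    """One classical RK4 step with the torque held constant over dt."""
    def shifted(k, h):
        return [x + h * kx for x, kx in zip(s, k)]

    k1 = derivatives(s, torque)
    k2 = derivatives(shifted(k1, dt / 2), torque)
    k3 = derivatives(shifted(k2, dt / 2), torque)
    k4 = derivatives(shifted(k3, dt), torque)
    return [x + dt / 6 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(s, k1, k2, k3, k4)]


def wrap(angle):
    """Map an angle into [-pi, pi), up to floating-point rounding."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def clip(x, bound):
    """Clamp x to the symmetric interval [-bound, bound]."""
    return max(-bound, min(bound, x))


def simulate_acrobot(initial_state, actions, dt=0.2):
    state = [float(x) for x in initial_state]
    states = [[round(x, 4) for x in state]]      # rounded copy for output only
    total_reward, done_step = 0.0, -1
    for step, action in enumerate(actions, start=1):
        th1, th2, dth1, dth2 = rk4_step(state, TORQUES[action], dt)
        state = [wrap(th1), wrap(th2), clip(dth1, MAX_VEL1), clip(dth2, MAX_VEL2)]
        states.append([round(x, 4) for x in state])
        total_reward += -1.0
        if -math.cos(state[0]) - math.cos(state[0] + state[1]) > 1.0:
            done_step = step
            break   # assumption: stop simulating once the goal is reached
    return states, total_reward, done_step
```

As a quick sanity check, starting at rest in the hanging position with zero torque, for example simulate_acrobot([0.0, 0.0, 0.0, 0.0], [0, 0, 0]), should leave the state essentially unchanged, giving a total reward of -3.0 and a done_step of -1. If the intended behaviour is to keep simulating after the terminal condition is met, removing the break and recording only the first terminal step would be the corresponding change.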