Module 12: Reinforcement Learning
Submodule 01: Foundations & Tabular RL
- 1. Exponential Weighted Average of Rewards (Med.)
- 2. Incremental Mean for Online Reward Estimation (Easy)
- 3. Q-Learning Algorithm for MDPs (concept, Med.)
- 4. Gridworld Policy Evaluation (Med.)
- 5. The Bellman Equation for Value Iteration (Med.)
- 6. Policy Gradient with REINFORCE (Hard)
- 7. Build a Multi-Armed Bandit Testbed (Med.)
- 8. Epsilon-Greedy Action Selection for n-Armed Bandit (Easy)
- 9. Upper Confidence Bound (UCB) Action Selection (Med.)
- 10. Gibbs Softmax Action Selection (Med.)
- 11. Discounted Return (Easy)
- 12. The Discounted Return for a Given Trajectory (Easy)
- 13. Temporal Difference Error (Med.)
- 14. The SARSA Algorithm on policy (concept, Med.)
- 15. Double Q-Learning Algorithm (concept, Med.)
- 16. Every-Visit Monte Carlo Prediction (Med.)
- 17. First-Visit Monte Carlo Prediction (concept, Med.)
- 18. First-Visit Monte Carlo Control with Exploring Starts (Hard)
- 19. Cliff Walking: Sarsa vs Q-Learning (Med.)
- 20. Windy Gridworld with Sarsa (Med.)
- 21. Random Walk: TD vs Monte Carlo (Med.)
- 22. N-Step TD Prediction (Med.)
- 23. N-Step Sarsa Algorithm (concept, Med.)
- 24. Sarsa(lambda) Algorithm with Eligibility Traces (concept, Med.)
Submodule 02: Advanced & Deep RL
- 1. Actor-Critic Algorithm (concept, Med.)
- 2. Advantage Actor-Critic (A2C) Batch Update from Parallel Environments (Hard)
- 3. Asynchronous Advantage Actor-Critic (A3C) (Hard)
- 4. PPO Clipped Surrogate Loss with Clip Diagnostics (Hard)
- 5. Experience Replay Implementation (Med.)
- 6. Prioritized Experience Replay (concept, Med.)
- 7. Rainbow DQN Implementation (Hard)
- 8. Dueling Network Architecture (Hard)
- 9. Distributed On-Policy Reinforcement Learning (Hard)
- 10. Asynchronous PPO Training Pipeline (Hard)
- 11. Environment Vectorization Auto-Configuration (Hard)
- 12. Multi-Environment Worker Process Management (Hard)
- 13. Parallel Environment Simulation with Multiprocessing (Hard)
- 14. GAE Advantages with Episode Boundaries (concept, Med.)
- 15. Distributional RL: Categorical Projection (C51) (concept, Hard)
- 16. RL Training GPU Utilization Tracker (concept, Easy)
- 17. Deep Q-Network Implementation (concept, Med.)
- 18. Deterministic Policy Gradient (concept, Med.)
- 19. Gaussian Policy for Continuous Control (concept, Med.)
- 20. Natural Policy Gradient (concept, Med.)
- 21. REINFORCE with Value Baseline (concept, Med.)
- 22. Trust Region Policy Optimization (TRPO) (concept, Med.)
- 23. Budget-Constrained RL Loss (concept, Med.)
- 24. GSPO: Group Sequence Policy Optimization (concept, Hard)
Submodule 03: Advanced RL Theory, Planning & TD Learning
- 1. Asynchronous Dynamic Programming for Value Iteration (concept, Med.)
- 2. Compare Update Strategies in Policy Evaluation (concept, Med.)
- 3. RL Backup Operations for Value Functions (concept, Med.)
- 4. Continuous-Time Discounting (concept, Med.)
- 5. Control Variates in Off-Policy Methods (concept, Med.)
- 6. Certainty-Equivalence in TD Learning (concept, Med.)
- 7. Differential Sarsa Algorithm (concept, Med.)
- 8. Dyna-Q: Model-Based RL with Planning (concept, Med.)
- 9. Dyna-Q+ with Exploration Bonus (concept, Med.)
- 10. Learning a Tabular Environment Model from Experience (concept, Med.)
- 11. Planning with Simulated Experience (concept, Med.)
- 12. Prioritized Sweeping Algorithm (concept, Med.)
- 13. Priority Queue in Planning (concept, Med.)
- 14. Fitted Value Iteration (concept, Med.)
- 15. Generalized Policy Iteration (GPI) Simulation (concept, Med.)
- 16. Greedy Policy Improvement (concept, Med.)
- 17. Expected Value in a Markov Decision Process (concept, Med.)
- 18. Bellman Expectation Equation for State-Value Function (concept, Med.)
- 19. Bellman Expectation Equation for Action-Value Function (concept, Med.)
- 20. Expected SARSA Algorithm for Policy Evaluation and Control (concept, Med.)
- 21. Forward vs Backward TD Updates (concept, Med.)
- 22. Full TD(0) Prediction for Value Estimation (concept, Med.)
- 23. Gradient Monte Carlo Algorithm (concept, Med.)
- 24. Lambda-Return Computation (concept, Med.)
- 25. Least-Squares Temporal Difference (LSTD) for Policy Evaluation (concept, Med.)
- 26. Linear Sarsa Algorithm (concept, Med.)
- 27. Linear Value Function Approximation with Semi-Gradient TD(0) (concept, Med.)
- 28. Off-line vs On-line TD(0) Prediction (concept, Med.)
- 29. Off-Policy Monte Carlo Control with Weighted Importance Sampling (concept, Med.)
- 30. Off-Policy Monte Carlo Prediction with Importance Sampling (concept, Med.)
- 31. Off-Policy TD(lambda) with Importance-Weighted Eligibility Traces (concept, Med.)
- 32. Off-Policy n-Step Sarsa with Importance Sampling (concept, Med.)
- 33. Off-Policy n-Step TD Prediction with Importance Sampling (concept, Med.)
- 34. Per-Decision Importance Sampling for Off-Policy Evaluation (concept, Med.)
- 35. Q(lambda) with Eligibility Traces (concept, Med.)
- 36. Q(σ) Unified Algorithm (concept, Med.)
- 37. Residual Gradient Algorithm for Value Function Approximation (concept, Med.)
- 38. Retrace(λ) Implementation (concept, Med.)
- 39. Semi-Gradient TD(lambda) with Eligibility Traces and Linear Function Approximation (concept, Med.)
- 40. TD Error Backpropagation in Networks (concept, Med.)
- 41. TD(λ) Forward View for Value Prediction (concept, Med.)
- 42. True Online TD(λ) with Linear Function Approximation (concept, Med.)
- 43. Variance Reduction in TD Learning via Truncated Importance Sampling (concept, Med.)
Submodule 04: RL Environments, Games & Applications
- 1. Markov Decision Process Simulator (concept, Med.)
- 2. Blackjack with Monte Carlo Prediction (concept, Med.)
- 3. Gambler's Problem: Value Iteration (concept, Hard)
- 4. Mountain Car with Function Approximation (concept, Med.)
- 5. Learn Tic-Tac-Toe with TD (concept, Med.)
- 6. Minimax Algorithm for Tic-Tac-Toe (concept, Med.)
- 7. Self-Play Training Loop for Two-Player Games (concept, Med.)
- 8. Two-Ply Search with Value Function (concept, Med.)
- 9. Minimax Backed-Up Scores (concept, Med.)
- 10. Backgammon Feature Engineering (concept, Med.)
- 11. Backgammon with TD Learning (concept, Med.)
- 12. TD-Gammon Position Evaluation Network (concept, Med.)
- 13. Blocking Maze Environment for Testing Dyna-Q+ (concept, Med.)
- 14. Blocking Maze with Model Updates (concept, Med.)
- 15. Shortcut Maze and Exploration (concept, Med.)
- 16. RL Algorithm Sanity Check Environments (concept, Med.)
- 17. RL Backup Diagram Generator (concept, Med.)
- 18. RL Environment Wrapper Profiling (concept, Med.)
- 19. RL Training Experiment Tracking (concept, Med.)
- 20. Unified Task Handler for Multi-Task RL Agent (concept, Med.)