RL Speedrun 🏃‍♂️💨

"Why spend months learning RL when you can mass produce it in days?" - Me, probably sleep deprived

Welcome to my chaotic journey through Reinforcement Learning. This repo is basically me speedrunning RL concepts, writing everything from scratch, and pretending I know what I'm doing.

What's This?

A personal RL learning repo where I implement algorithms from first principles. No fancy libraries doing the heavy lifting - just raw NumPy energy and questionable life choices.

Repository Structure

πŸ“ rl_fundamentals/
β”œβ”€β”€ 01_mdp/                  <- MDPs: fancy way to say "states go brrr"
β”œβ”€β”€ 02_value_functions/      <- V(s) and Q(s,a) - the OG value bros
β”œβ”€β”€ 03_bellman_equations/    <- Bellman said: "it's recursive, deal with it"
β”œβ”€β”€ 04_dynamic_programming/  <- When you know everything about the world
β”œβ”€β”€ 05_env_applications/     <- DP in action
β”‚   β”œβ”€β”€ gridworld/           <- Baby's first MDP
β”‚   β”œβ”€β”€ frozenlake/          <- Slippery boi simulator
β”‚   └── taxi_v3/             <- Uber but worse
β”œβ”€β”€ 06_temporal_difference/  <- Learning from experience, one step at a time
β”‚   β”œβ”€β”€ q_learning.py        <- Off-policy TD control
β”‚   └── sarsa.py             <- On-policy TD control
β”œβ”€β”€ 07_td_applications/      <- TD algorithms in the wild
β”‚   β”œβ”€β”€ cliffwalking/        <- Q-Learning vs SARSA showdown
β”‚   └── cartpole/            <- Discretized Q-Learning
β”œβ”€β”€ 08_monte_carlo/          <- Wait for the episode to end, then learn
β”‚   └── monte_carlo.py       <- First-Visit & Every-Visit MC
β”œβ”€β”€ 09_policy_gradients/     <- Directly optimize the policy
β”‚   └── reinforce.py         <- The OG policy gradient
β”œβ”€β”€ 10_mc_pg_applications/   <- MC & PG in action
β”‚   β”œβ”€β”€ blackjack/           <- Classic MC territory
β”‚   └── cartpole_reinforce/  <- Neural network policy
β”œβ”€β”€ 11_unified_agent/        <- Modular RL agent framework
β”‚   β”œβ”€β”€ exploration_strategies.py  <- Ξ΅-greedy, Boltzmann, UCB
β”‚   └── unified_agent.py     <- Configurable Q-Learning/SARSA
β”œβ”€β”€ 12_benchmarking/         <- Systematic algorithm comparison
β”‚   └── benchmark.py         <- Multi-algorithm benchmarking
β”œβ”€β”€ 13_dqn_fundamentals/     <- Deep Q-Networks from scratch
β”‚   β”œβ”€β”€ replay_buffer.py     <- Experience replay
β”‚   β”œβ”€β”€ target_network.py    <- Stable learning targets
β”‚   └── dqn.py               <- Full DQN implementation
β”œβ”€β”€ 14_dqn_improvements/     <- DQN enhancements
β”‚   └── double_dqn.py        <- Fixing overestimation bias
β”œβ”€β”€ 15_dqn_applications/     <- DQN in the wild
β”‚   β”œβ”€β”€ cartpole_dqn/        <- CartPole with neural nets
β”‚   └── lunarlander_dqn/     <- Landing rockets with DQN
β”œβ”€β”€ 16_actor_critic/         <- Best of both worlds
β”‚   β”œβ”€β”€ advantage.py         <- GAE and advantage estimation
β”‚   β”œβ”€β”€ entropy.py           <- Exploration via entropy bonus
β”‚   └── a2c.py               <- Advantage Actor-Critic
β”œβ”€β”€ 17_actor_critic_applications/  <- A2C in action
β”‚   β”œβ”€β”€ cartpole_a2c/        <- A2C vs DQN vs REINFORCE
β”‚   └── lunarlander_a2c/     <- Landing rockets, actor-critic style
β”œβ”€β”€ 18_ppo/                  <- The algorithm that made RL practical
β”‚   └── ppo.py               <- PPO with clipping (discrete + continuous)
β”œβ”€β”€ 19_ppo_applications/     <- PPO in the wild
β”‚   β”œβ”€β”€ lunarlander_ppo/     <- Stable lunar landing
β”‚   └── bipedal_walker_ppo/  <- Teaching a robot to walk
β”œβ”€β”€ 20_trpo/                 <- PPO's predecessor (second-order optimization)
β”‚   └── trpo.py              <- TRPO: conjugate gradient + line search
β”œβ”€β”€ 21_ddpg/                 <- Off-policy continuous control
β”‚   └── ddpg.py              <- DDPG: deterministic policy + replay buffer
β”œβ”€β”€ 22_td3/                  <- Fixing DDPG's failure modes
β”‚   └── td3.py               <- TD3: twin critics, delayed updates, smoothing
β”œβ”€β”€ 23_offpolicy_applications/  <- Off-policy algorithms in the wild
β”‚   β”œβ”€β”€ pendulum_ddpg/       <- Swing up with deterministic policy
β”‚   β”œβ”€β”€ pendulum_td3/        <- Swing up, twin critic edition
β”‚   └── bipedal_walker_td3/  <- Teaching a robot to walk (off-policy)
β”œβ”€β”€ 24_mcts/                 <- Monte Carlo Tree Search from scratch
β”‚   └── mcts.py              <- UCB1 selection + random rollouts
β”œβ”€β”€ 25_alphazero/            <- Self-play with neural network MCTS
β”‚   β”œβ”€β”€ games.py             <- TicTacToe + Connect Four
β”‚   └── alphazero.py         <- PolicyValueNet + PUCT + self-play
β”œβ”€β”€ 26_muzero/               <- Planning without knowing the rules
β”‚   └── muzero.py            <- Learned dynamics + latent MCTS
└── 27_game_applications/    <- Game AI in the wild
    β”œβ”€β”€ tictactoe_mcts/      <- Pure MCTS achieves perfect play
    β”œβ”€β”€ tictactoe_alphazero/  <- AlphaZero learns TicTacToe
    β”œβ”€β”€ connect4_alphazero/   <- AlphaZero on Connect Four
    └── tictactoe_muzero/    <- MuZero: no rules needed

Week 1: Dynamic Programming

"When you have God mode enabled (full model knowledge)"

GridWorld - The Classic

Optimal policy for a 4x4 grid. Terminal states at corners. Agent just wants to go home.
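
For reference, a minimal value-iteration sketch in the repo's pure-NumPy spirit. The transition tensor `P` and reward array `R` here are stand-ins for whatever model the gridworld script builds, not its actual API:

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """P[a, s, s2] = transition probability, R[s, a] = expected reward (model-based)."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        Q = R + gamma * np.einsum("asn,n->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # optimal values and a greedy policy
        V = V_new
```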

FrozenLake - Slippery When Wet

That feeling when you try to go right but physics says "nah". 1/3 chance of actually going where you want.

Taxi-v3 - 500 States of Pain

Value function heatmap. Higher = closer to dropping off passengers and escaping this nightmare.


Week 2: Temporal Difference Learning

"Model-free vibes - learning from experience without knowing the rules"

CliffWalking - The Q-Learning vs SARSA Showdown

Q-Learning: "I'll walk the edge, YOLO" SARSA: "I'd rather live, thanks"

The classic demonstration of off-policy vs on-policy learning (both update rules are sketched below):

  • Q-Learning finds the risky optimal path (right along the cliff edge)
  • SARSA finds the safer path (stays away from the cliff because it knows it might slip)

CartPole - Discretization Station

Continuous state space? Just chop it into bins and pretend it's discrete.
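
Roughly, the binning trick looks like this. The bounds and bin count below are illustrative guesses, not the values used in solve_cartpole.py:

```python
import numpy as np

# Hypothetical bounds for CartPole's 4D observation: [position, velocity, angle, angular velocity]
BOUNDS = np.array([[-2.4, 2.4], [-3.0, 3.0], [-0.21, 0.21], [-3.5, 3.5]])
N_BINS = 10

def discretize(obs):
    """Map a continuous observation to a tuple of bin indices usable as a Q-table key."""
    clipped = np.clip(obs, BOUNDS[:, 0], BOUNDS[:, 1])
    ratios = (clipped - BOUNDS[:, 0]) / (BOUNDS[:, 1] - BOUNDS[:, 0])
    return tuple(np.minimum((ratios * N_BINS).astype(int), N_BINS - 1))
```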


Week 3: Monte Carlo & Policy Gradients

"Episode-based learning meets direct policy optimization"

Blackjack - Monte Carlo Territory

Learning to play 21 by sampling complete games. The house still wins, but less often.

Monte Carlo methods wait for the episode to end, then learn from actual returns (a minimal sketch follows the list):

  • First-Visit MC: Only count the first visit to each state
  • Every-Visit MC: Count all visits (lower variance)

CartPole REINFORCE - Neural Network Policy

Direct policy optimization: no value function needed, just gradients and vibes.

REINFORCE directly optimizes the policy using the policy gradient theorem: $$\nabla J(\theta) = \mathbb{E}[\nabla \log \pi_\theta(a|s) \cdot G_t]$$
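
As a toy illustration of that update, here is REINFORCE for a linear-softmax policy. reinforce.py uses its own network and interface, so treat this purely as a sketch:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, lr=0.01, gamma=0.99):
    """theta: (n_features, n_actions). episode: list of (features, action, reward)."""
    G = 0.0
    for x, a, r in reversed(episode):       # walk backwards so G_t is available at each step
        G = r + gamma * G
        probs = softmax(theta.T @ x)
        # gradient of log pi(a|x) for a linear-softmax policy: outer(x, onehot(a) - probs)
        grad_logp = np.outer(x, -probs)
        grad_logp[:, a] += x
        theta += lr * G * grad_logp         # ascend the policy gradient
    return theta
```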


Week 4: Unified Agents & Benchmarking

"Time to get organized and systematic"

Exploration Strategies

Implemented modular exploration strategies (each sketched below):

  • ε-greedy: Classic random exploration
  • Boltzmann/Softmax: Temperature-based action selection
  • UCB (Upper Confidence Bound): Optimism in the face of uncertainty

Benchmarking Framework

Systematic comparison of algorithms across environments with statistical rigor.


Week 5: Deep Q-Networks (DQN)

"When tabular methods hit their limits, neural networks enter the chat"

The DQN Revolution

Pure NumPy implementation of DQN with:

  • Experience Replay: Break correlation, reuse data
  • Target Networks: Stable learning targets
  • Double DQN: Fix overestimation bias
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   DQN Architecture                          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   State β†’ [Hidden 64] β†’ [Hidden 64] β†’ Q-values             β”‚
β”‚                                                             β”‚
β”‚   Key Innovations:                                          β”‚
β”‚   1. Experience Replay Buffer                               β”‚
β”‚   2. Target Network (updated every C steps)                 β”‚
β”‚   3. Double DQN (decouple selection from evaluation)        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
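
The heart of the learning step is the TD target. Here is a NumPy sketch covering both the vanilla and Double DQN variants; array names and shapes are assumptions, not the exact dqn.py / double_dqn.py code:

```python
import numpy as np

def td_targets(batch, q_target, q_online=None, gamma=0.99):
    """batch: dict of arrays sampled from the replay buffer.
    q_target / q_online: callables mapping states -> Q-values of shape (batch, n_actions)."""
    rewards, next_states, dones = batch["rewards"], batch["next_states"], batch["dones"]
    q_next_target = q_target(next_states)
    if q_online is None:
        # Vanilla DQN: the target network both selects and evaluates the next action
        next_q = q_next_target.max(axis=1)
    else:
        # Double DQN: online net selects the action, target net evaluates it
        best_actions = q_online(next_states).argmax(axis=1)
        next_q = q_next_target[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - dones) * next_q
```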

CartPole & LunarLander with DQN

Solving classic control problems with neural network function approximation.


Week 6: Actor-Critic Methods (A2C)

"Why choose between policy gradients and value functions when you can have both?"

Advantage Actor-Critic

Combines the best of both worlds:

  • Actor: Policy network Ο€(a|s) - what to do
  • Critic: Value network V(s) - how good is this state
                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                        β”‚   Environment   β”‚
                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                        state s, reward r
                                 β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚                       β”‚                       β”‚
         β–Ό                       β”‚                       β–Ό
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                  β”‚                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚  ACTOR   β”‚                  β”‚                β”‚  CRITIC  β”‚
   β”‚  Ο€(a|s)  │◄─── Advantage ────                β”‚   V(s)   β”‚
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    A = Q - V     β”‚                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                       β”‚                       β”‚
         β–Ό                       β”‚                       β–Ό
      action a               gradient                baseline

Key Components

  • GAE (Generalized Advantage Estimation): Tunable bias-variance tradeoff
  • Entropy Regularization: Prevent premature convergence
  • Shared Feature Layers: Parameter efficient actor-critic

Week 7: Proximal Policy Optimization (PPO)

"The algorithm that made deep RL actually practical"

The Clipped Surrogate Objective

PPO takes A2C and adds one powerful constraint: don't let the policy change too much in a single update.

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta) \hat{A}_t,\; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio.

┌──────────────────────────────────────────────────────────────┐
│                   PPO: A2C with Guardrails                   │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│   Collect rollout → Compute GAE advantages                   │
│                          ↓                                   │
│         ┌─── For K epochs (reuse data!) ───┐                 │
│         │  Shuffle into mini-batches       │                 │
│         │  ratio = π_new / π_old           │                 │
│         │  clip(ratio, 1-ε, 1+ε)           │                 │
│         │  Take pessimistic (min) update   │                 │
│         └──────────────────────────────────┘                 │
│                                                              │
│   Key insight: clipping prevents catastrophic policy updates │
└──────────────────────────────────────────────────────────────┘
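
The clipped surrogate as a NumPy sketch of the objective value only; ppo.py also has to differentiate through it, so treat this as illustrative:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Negative clipped surrogate, averaged over the batch (to be minimised)."""
    ratio = np.exp(logp_new - logp_old)                    # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))        # pessimistic (min) objective
```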

LunarLander & BipedalWalker with PPO

  • LunarLander: Discrete actions (4), PPO's stability shines on this harder control task
  • BipedalWalker: Continuous actions (4D Gaussian policy), teaching a robot to walk

Week 9: TRPO, DDPG & TD3

"Trust regions, deterministic policies, and twin critics"

TRPO — PPO's Predecessor

Uses a hard KL divergence constraint enforced via conjugate gradient + line search:

$$\max_\theta L(\theta) \quad \text{s.t.} \quad D_{KL}(\pi_{\text{old}} \,\|\, \pi_{\text{new}}) \leq \delta$$

The natural gradient update: $\theta_{new} = \theta + \sqrt{2\delta / g^T F^{-1} g} \cdot F^{-1} g$
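
Numerically, given the policy gradient g and an estimate of the Fisher matrix F, that update amounts to the sketch below. It is illustrative only: trpo.py solves for F⁻¹g with conjugate gradient (plus a line search) rather than the dense solve used here:

```python
import numpy as np

def natural_gradient_step(theta, g, F, delta=0.01):
    """theta_new = theta + sqrt(2*delta / (g^T F^{-1} g)) * F^{-1} g."""
    step_dir = np.linalg.solve(F, g)                  # F^{-1} g (TRPO uses conjugate gradient here)
    step_size = np.sqrt(2.0 * delta / (g @ step_dir))
    return theta + step_size * step_dir
```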

DDPG & TD3 — Off-Policy Continuous Control

┌──────────────────────────────────────────────────────────────┐
│             DDPG → TD3: Continuous Control Pipeline          │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│   DDPG: DQN ideas + Actor-Critic for continuous actions      │
│     state → [Actor μ(s)] → action → [Critic Q(s,a)] → value  │
│     + Replay buffer + Target networks + Exploration noise    │
│                                                              │
│   TD3 fixes DDPG's three problems:                           │
│     1. Twin critics: min(Q1, Q2) → no overestimation         │
│     2. Delayed updates: actor every 2 critic steps           │
│     3. Target smoothing: noise on target actions             │
│                                                              │
└──────────────────────────────────────────────────────────────┘
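
The TD3 target computation, sketched; the network call signatures are assumptions, not the td3.py interface:

```python
import numpy as np

def td3_target(rewards, next_states, dones, actor_target, critic1_target, critic2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    # 3. Target policy smoothing: perturb the target action with clipped noise
    next_actions = actor_target(next_states)
    noise = np.clip(np.random.normal(0.0, noise_std, next_actions.shape), -noise_clip, noise_clip)
    next_actions = np.clip(next_actions + noise, -max_action, max_action)
    # 1. Twin critics: take the minimum of the two target Q estimates
    q_next = np.minimum(critic1_target(next_states, next_actions),
                        critic2_target(next_states, next_actions))
    # (2. Delayed updates live in the training loop: the actor updates every other critic step)
    return rewards + gamma * (1.0 - dones) * q_next
```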

Environments

  • CartPole: TRPO with natural gradient (discrete)
  • Pendulum: DDPG and TD3 comparison (continuous, 1D action)
  • BipedalWalker: TD3 on hard 4D continuous control

Week 10: Game AI — MCTS, AlphaZero & MuZero

"From random rollouts to learning without rules"

The Evolution: MCTS β†’ AlphaZero β†’ MuZero

┌──────────────────────────────────────────────────────────────┐
│                 Game AI: Three Generations                   │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  MCTS (2006):  UCB1 selection + random rollouts              │
│       ↓        No learning, just search                      │
│  AlphaZero (2017): MCTS + neural network (policy + value)    │
│       ↓        Self-play learns from scratch                 │
│  MuZero (2020): MCTS + learned model (no game rules!)        │
│                Plans in a learned latent space               │
│                                                              │
│  Key insight: each generation replaces handcrafted           │
│  components with learned ones                                │
│                                                              │
└──────────────────────────────────────────────────────────────┘
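
The selection rule that drives plain MCTS, sketched with illustrative node fields (not the exact mcts.py classes):

```python
import math

def ucb1(parent_visits, child_visits, child_value_sum, c=1.4):
    """UCB1 score used to pick which child to descend into during selection."""
    if child_visits == 0:
        return float("inf")                 # always try unvisited children first
    exploit = child_value_sum / child_visits                       # average return so far
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)  # uncertainty bonus
    return exploit + explore

def select_child(node):
    # node.children: dict mapping action -> child node with .visits and .value_sum
    return max(node.children.items(),
               key=lambda kv: ucb1(node.visits, kv[1].visits, kv[1].value_sum))
```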

Results

| Algorithm | Game | Win Rate vs Random |
|-----------|------|--------------------|
| MCTS (1000 sims) | TicTacToe | ~99% |
| AlphaZero (80 iters) | TicTacToe | ~97% |
| AlphaZero (100 iters) | Connect Four | ~98% |
| MuZero (80 iters) | TicTacToe | ~72% |

MuZero achieves lower win rates than AlphaZero because it must also learn the game dynamics from scratch.


Quick Start

# Install the goods
pip install -r rl_fundamentals/requirements.txt

# === Week 1: Dynamic Programming ===
python rl_fundamentals/05_env_applications/gridworld/gridworld_dp.py
python rl_fundamentals/05_env_applications/frozenlake/solve_frozenlake.py
python rl_fundamentals/05_env_applications/taxi_v3/solve_taxi.py

# === Week 2: Temporal Difference ===
python rl_fundamentals/07_td_applications/cliffwalking/solve_cliffwalking.py
python rl_fundamentals/07_td_applications/cartpole/solve_cartpole.py

# === Week 3: Monte Carlo & Policy Gradients ===
python rl_fundamentals/10_mc_pg_applications/blackjack/solve_blackjack.py
python rl_fundamentals/10_mc_pg_applications/cartpole_reinforce/solve_cartpole_reinforce.py

# === Week 4: Unified Agent & Benchmarking ===
python rl_fundamentals/11_unified_agent/exploration_strategies.py
python rl_fundamentals/12_benchmarking/benchmark.py

# === Week 5: Deep Q-Networks ===
python rl_fundamentals/13_dqn_fundamentals/dqn.py
python rl_fundamentals/14_dqn_improvements/double_dqn.py
python rl_fundamentals/15_dqn_applications/cartpole_dqn/solve_cartpole_dqn.py

# === Week 6: Actor-Critic ===
python rl_fundamentals/16_actor_critic/a2c.py
python rl_fundamentals/17_actor_critic_applications/cartpole_a2c/solve_cartpole_a2c.py

# === Week 7: PPO ===
python rl_fundamentals/18_ppo/ppo.py
python rl_fundamentals/19_ppo_applications/lunarlander_ppo/solve_lunarlander_ppo.py
python rl_fundamentals/19_ppo_applications/bipedal_walker_ppo/solve_bipedal_walker_ppo.py

# === Week 9: TRPO, DDPG & TD3 ===
python rl_fundamentals/20_trpo/trpo.py
python rl_fundamentals/21_ddpg/ddpg.py
python rl_fundamentals/22_td3/td3.py
python rl_fundamentals/23_offpolicy_applications/pendulum_ddpg/solve_pendulum_ddpg.py
python rl_fundamentals/23_offpolicy_applications/pendulum_td3/solve_pendulum_td3.py
python rl_fundamentals/23_offpolicy_applications/bipedal_walker_td3/solve_bipedal_walker_td3.py

# === Week 10: Game AI ===
python rl_fundamentals/24_mcts/mcts.py
python rl_fundamentals/25_alphazero/alphazero.py
python rl_fundamentals/26_muzero/muzero.py
python rl_fundamentals/27_game_applications/tictactoe_mcts/solve_tictactoe_mcts.py
python rl_fundamentals/27_game_applications/tictactoe_alphazero/solve_tictactoe_alphazero.py
python rl_fundamentals/27_game_applications/connect4_alphazero/solve_connect4_alphazero.py
python rl_fundamentals/27_game_applications/tictactoe_muzero/solve_tictactoe_muzero.py

Speedrun Progress

  • Week 1: Dynamic Programming - When you have the cheat codes (full model)
  • Week 2: Temporal Difference - Q-Learning & SARSA (model-free vibes)
  • Week 3: Monte Carlo & Policy Gradients - Episode-based learning
  • Week 4: Unified Agents - Modular exploration & benchmarking
  • Week 5: Deep Q-Networks - Neural nets + experience replay + target networks
  • Week 6: Actor-Critic - Best of policy gradients + value functions
  • Week 7: PPO - Clipped surrogate, stable updates, discrete + continuous
  • Week 9: TRPO, DDPG & TD3 - Trust regions, off-policy continuous control
  • Week 10: Game AI - MCTS, AlphaZero self-play, MuZero learned model
  • Week 11+: Research Immersion - Paper reading, SAC, and beyond...

The Algorithms

Week 1: Dynamic Programming (Model-Based)

| Algorithm | Update Rule | Requires Model? |
|-----------|-------------|-----------------|
| Value Iteration | V(s) ← max_a Σ P(s'\|s,a)[R + γV(s')] | Yes |
| Policy Iteration | Evaluate → Improve → Repeat | Yes |

Week 2: Temporal Difference (Model-Free, Bootstrapping)

| Algorithm | Update Rule | Policy Type |
|-----------|-------------|-------------|
| Q-Learning | Q(S,A) ← Q(S,A) + α[R + γ·max_a Q(S',a) - Q(S,A)] | Off-policy |
| SARSA | Q(S,A) ← Q(S,A) + α[R + γQ(S',A') - Q(S,A)] | On-policy |

Week 3: Monte Carlo & Policy Gradients

| Algorithm | Update Rule | Key Property |
|-----------|-------------|--------------|
| MC Prediction | V(s) ← V(s) + α[G_t - V(s)] | Unbiased, high variance |
| REINFORCE | θ ← θ + α·G_t·∇log π(a\|s) | Direct policy optimization |

Week 5: Deep Q-Networks

| Algorithm | Key Innovation | Benefit |
|-----------|----------------|---------|
| DQN | Experience Replay + Target Network | Stable deep RL |
| Double DQN | Decouple selection from evaluation | Reduce overestimation |

Week 6: Actor-Critic

| Algorithm | Components | Benefit |
|-----------|------------|---------|
| A2C | Actor π(a\|s) + Critic V(s) | Lower variance than REINFORCE |
| GAE | λ-weighted TD errors | Tunable bias-variance |

Week 7: Proximal Policy Optimization

| Algorithm | Key Innovation | Benefit |
|-----------|----------------|---------|
| PPO | Clipped surrogate ratio | Stable policy updates, multi-epoch reuse |

Week 9: TRPO, DDPG & TD3

| Algorithm | Key Innovation | Benefit |
|-----------|----------------|---------|
| TRPO | KL-constrained natural gradient | Monotonic improvement guarantee |
| DDPG | Deterministic policy + replay | Off-policy continuous control |
| TD3 | Twin critics + delayed + smoothing | Robust continuous control |

Week 10: Game AI

| Algorithm | Key Innovation | Benefit |
|-----------|----------------|---------|
| MCTS | UCB1 tree search + random rollouts | Strong play without learning |
| AlphaZero | MCTS + policy-value network + self-play | Learns from scratch, superhuman |
| MuZero | Learned dynamics model + latent MCTS | No game rules needed |

Method Comparison

| Method | Bootstraps? | Model-Free? | Episode End? | Bias | Variance |
|--------|-------------|-------------|--------------|------|----------|
| DP | Yes | No | N/A | Low | Low |
| TD | Yes | Yes | No | Some | Medium |
| MC | No | Yes | Yes | None | High |
| PG | No | Yes | Yes | None | Very High |
| DQN | Yes | Yes | No | Some | Low |
| A2C | Yes (GAE) | Yes | No | Tunable | Medium |
| PPO | Yes (GAE) | Yes | No | Tunable | Low |
| TRPO | Yes (GAE) | Yes | No | Tunable | Low |
| DDPG | Yes | Yes | No | Some | Low |
| TD3 | Yes | Yes | No | Low | Low |
| MCTS | No | Yes | No | None | High |
| AlphaZero | No | Yes | No | Some | Low |
| MuZero | Yes | Yes | No | Some | Medium |

Philosophy

This repo follows the ancient wisdom:

  1. Understand the math - Actually derive things, no hand-waving
  2. Implement from scratch - Suffering builds character
  3. Visualize everything - Pretty pictures > walls of numbers
  4. Keep it real - Comments are for future confused me

Resources I'm Stealing From

  • Sutton & Barto's RL Book (the bible)
  • David Silver's lectures (goated)
  • OpenAI Spinning Up (documentation supremacy)
  • Stack Overflow (no shame)

Currently speedrunning: MCTS, AlphaZero & MuZero ✓

Next up: Research immersion — paper reading & implementation!

Stars appreciated, issues tolerated, PRs celebrated ⭐
