"Why spend months learning RL when you can mass produce it in days?" - Me, probably sleep deprived
Welcome to my chaotic journey through Reinforcement Learning. This repo is basically me speedrunning RL concepts, writing everything from scratch, and pretending I know what I'm doing.
A personal RL learning repo where I implement algorithms from first principles. No fancy libraries doing the heavy lifting - just raw NumPy energy and questionable life choices.
📁 rl_fundamentals/
├── 01_mdp/ <- MDPs: fancy way to say "states go brrr"
├── 02_value_functions/ <- V(s) and Q(s,a) - the OG value bros
├── 03_bellman_equations/ <- Bellman said: "it's recursive, deal with it"
├── 04_dynamic_programming/ <- When you know everything about the world
├── 05_env_applications/ <- DP in action
│   ├── gridworld/ <- Baby's first MDP
│   ├── frozenlake/ <- Slippery boi simulator
│   └── taxi_v3/ <- Uber but worse
├── 06_temporal_difference/ <- Learning from experience, one step at a time
│   ├── q_learning.py <- Off-policy TD control
│   └── sarsa.py <- On-policy TD control
├── 07_td_applications/ <- TD algorithms in the wild
│   ├── cliffwalking/ <- Q-Learning vs SARSA showdown
│   └── cartpole/ <- Discretized Q-Learning
├── 08_monte_carlo/ <- Wait for the episode to end, then learn
│   └── monte_carlo.py <- First-Visit & Every-Visit MC
├── 09_policy_gradients/ <- Directly optimize the policy
│   └── reinforce.py <- The OG policy gradient
├── 10_mc_pg_applications/ <- MC & PG in action
│   ├── blackjack/ <- Classic MC territory
│   └── cartpole_reinforce/ <- Neural network policy
├── 11_unified_agent/ <- Modular RL agent framework
│   ├── exploration_strategies.py <- ε-greedy, Boltzmann, UCB
│   └── unified_agent.py <- Configurable Q-Learning/SARSA
├── 12_benchmarking/ <- Systematic algorithm comparison
│   └── benchmark.py <- Multi-algorithm benchmarking
├── 13_dqn_fundamentals/ <- Deep Q-Networks from scratch
│   ├── replay_buffer.py <- Experience replay
│   ├── target_network.py <- Stable learning targets
│   └── dqn.py <- Full DQN implementation
├── 14_dqn_improvements/ <- DQN enhancements
│   └── double_dqn.py <- Fixing overestimation bias
├── 15_dqn_applications/ <- DQN in the wild
│   ├── cartpole_dqn/ <- CartPole with neural nets
│   └── lunarlander_dqn/ <- Landing rockets with DQN
├── 16_actor_critic/ <- Best of both worlds
│   ├── advantage.py <- GAE and advantage estimation
│   ├── entropy.py <- Exploration via entropy bonus
│   └── a2c.py <- Advantage Actor-Critic
├── 17_actor_critic_applications/ <- A2C in action
│   ├── cartpole_a2c/ <- A2C vs DQN vs REINFORCE
│   └── lunarlander_a2c/ <- Landing rockets, actor-critic style
├── 18_ppo/ <- The algorithm that made RL practical
│   └── ppo.py <- PPO with clipping (discrete + continuous)
├── 19_ppo_applications/ <- PPO in the wild
│   ├── lunarlander_ppo/ <- Stable lunar landing
│   └── bipedal_walker_ppo/ <- Teaching a robot to walk
├── 20_trpo/ <- PPO's predecessor (second-order optimization)
│   └── trpo.py <- TRPO: conjugate gradient + line search
├── 21_ddpg/ <- Off-policy continuous control
│   └── ddpg.py <- DDPG: deterministic policy + replay buffer
├── 22_td3/ <- Fixing DDPG's failure modes
│   └── td3.py <- TD3: twin critics, delayed updates, smoothing
├── 23_offpolicy_applications/ <- Off-policy algorithms in the wild
│   ├── pendulum_ddpg/ <- Swing up with deterministic policy
│   ├── pendulum_td3/ <- Swing up, twin critic edition
│   └── bipedal_walker_td3/ <- Teaching a robot to walk (off-policy)
├── 24_mcts/ <- Monte Carlo Tree Search from scratch
│   └── mcts.py <- UCB1 selection + random rollouts
├── 25_alphazero/ <- Self-play with neural network MCTS
│   ├── games.py <- TicTacToe + Connect Four
│   └── alphazero.py <- PolicyValueNet + PUCT + self-play
├── 26_muzero/ <- Planning without knowing the rules
│   └── muzero.py <- Learned dynamics + latent MCTS
└── 27_game_applications/ <- Game AI in the wild
    ├── tictactoe_mcts/ <- Pure MCTS achieves perfect play
    ├── tictactoe_alphazero/ <- AlphaZero learns TicTacToe
    ├── connect4_alphazero/ <- AlphaZero on Connect Four
    └── tictactoe_muzero/ <- MuZero: no rules needed
"When you have God mode enabled (full model knowledge)"
Optimal policy for a 4x4 grid. Terminal states at corners. Agent just wants to go home.
That feeling when you try to go right but physics says "nah". 1/3 chance of actually going where you want.
Value function heatmap. Higher = closer to dropping off passengers and escaping this nightmare.
"Model-free vibes - learning from experience without knowing the rules"
Q-Learning: "I'll walk the edge, YOLO"
SARSA: "I'd rather live, thanks"
The classic demonstration of off-policy vs on-policy learning:
- Q-Learning finds the risky optimal path (right along the cliff edge)
- SARSA finds the safer path (stays away from the cliff because it knows it might slip)
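The whole behavioural difference is one term in the update. A minimal tabular sketch (illustrative α/γ and state indices, not the exact code in `q_learning.py`/`sarsa.py`):

```python
import numpy as np

def td_control_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, on_policy=False):
    """One tabular TD-control update.

    Q-Learning (off-policy): bootstrap from max_a' Q(s', a'), the greedy action,
    no matter what the behaviour policy actually does next.
    SARSA (on-policy): bootstrap from Q(s', a') for the action actually taken,
    so the cost of exploratory slips near the cliff is baked into the values.
    """
    target = r + gamma * (Q[s_next, a_next] if on_policy else Q[s_next].max())
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy usage on a CliffWalking-sized table: 4x12 = 48 states, 4 actions.
Q = np.zeros((48, 4))
Q = td_control_update(Q, s=36, a=1, r=-1, s_next=37, a_next=2)                  # Q-Learning
Q = td_control_update(Q, s=36, a=1, r=-1, s_next=37, a_next=2, on_policy=True)  # SARSA
```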
Continuous state space? Just chop it into bins and pretend it's discrete.
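A sketch of the binning trick; the bounds and bin counts below are made-up stand-ins, not necessarily what the CartPole script uses:

```python
import numpy as np

# Hypothetical CartPole-ish bounds: cart position, cart velocity, pole angle, tip velocity.
BOUNDS = [(-2.4, 2.4), (-3.0, 3.0), (-0.21, 0.21), (-3.5, 3.5)]
N_BINS = 10  # bins per dimension -> 10^4 discrete states

BIN_EDGES = [np.linspace(lo, hi, N_BINS - 1) for lo, hi in BOUNDS]

def discretize(obs):
    """Map a continuous observation to a tuple of bin indices usable as a Q-table key."""
    return tuple(int(np.digitize(x, edges)) for x, edges in zip(obs, BIN_EDGES))

print(discretize([0.1, -0.5, 0.02, 1.0]))  # (5, 4, 5, 6)
```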
"Episode-based learning meets direct policy optimization"
Learning to play 21 by sampling complete games. The house still wins, but less often.
Monte Carlo methods wait for the episode to end, then learn from actual returns:
- First-Visit MC: Only count the first visit to each state
- Every-Visit MC: Count all visits (lower variance)
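A minimal first-visit MC prediction sketch (the episode format and incremental averaging are illustrative, not `monte_carlo.py`'s exact API):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.99):
    """Estimate V(s) by averaging returns from the first visit to each state.

    `episodes` is a list of [(state, reward), ...] trajectories.
    """
    V = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        G = 0.0
        returns = {}
        # Walk backwards so G accumulates the discounted return from each step.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns[state] = G  # repeated overwrites end with the FIRST visit's return
        for state, G in returns.items():
            counts[state] += 1
            V[state] += (G - V[state]) / counts[state]  # incremental mean
    return dict(V)

# Toy usage: two short episodes over states "A" and "B".
print(first_visit_mc([[("A", 0), ("B", 1)], [("A", 1), ("A", 0), ("B", 2)]]))
```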
Direct policy optimization: no value function needed, just gradients and vibes.
REINFORCE directly optimizes the policy using the policy gradient theorem:

∇_θ J(θ) = E[ G_t · ∇_θ log π_θ(a_t|s_t) ], applied as the update θ ← θ + α·G_t·∇log π(a|s)
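As a toy NumPy sketch for a linear-softmax policy (features, shapes and step size are illustrative, not the repo's `reinforce.py`):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """theta: (n_features, n_actions); episode: list of (features, action, reward)."""
    G = 0.0
    # Walk backwards so G is the discounted return from each time step.
    for x, a, r in reversed(episode):
        G = r + gamma * G
        probs = softmax(theta.T @ x)        # action probabilities pi(.|s)
        grad_log = -np.outer(x, probs)      # column b of grad log pi(a|s): -x * pi(b|s)
        grad_log[:, a] += x                 # column a gains +x -> x * (1 - pi(a|s))
        theta += alpha * G * grad_log       # gradient ascent on G_t * log pi(a|s)
    return theta

# Toy usage: 3 features, 2 actions, one 2-step episode.
rng = np.random.default_rng(0)
theta = np.zeros((3, 2))
episode = [(rng.normal(size=3), 0, 1.0), (rng.normal(size=3), 1, 0.5)]
theta = reinforce_update(theta, episode)
```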
"Time to get organized and systematic"
Implemented modular exploration strategies:
- ε-greedy: Classic random exploration
- Boltzmann/Softmax: Temperature-based action selection
- UCB (Upper Confidence Bound): Optimism in the face of uncertainty
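In sketch form, with made-up signatures rather than the repo's `exploration_strategies.py` interface:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample actions with probability proportional to exp(Q / T)."""
    prefs = np.asarray(q_values) / temperature
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()
    return int(rng.choice(len(q_values), p=probs))

def ucb(q_values, counts, t, c=2.0):
    """Pick argmax of Q + c * sqrt(ln t / n): optimism toward rarely tried actions."""
    counts = np.asarray(counts, dtype=float)
    bonus = c * np.sqrt(np.log(t + 1) / (counts + 1e-8))
    return int(np.argmax(np.asarray(q_values) + bonus))

q = [0.1, 0.5, 0.2]
print(epsilon_greedy(q), boltzmann(q), ucb(q, counts=[3, 10, 1], t=14))
```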
Systematic comparison of algorithms across environments with statistical rigor.
"When tabular methods hit their limits, neural networks enter the chat"
Pure NumPy implementation of DQN with:
- Experience Replay: Break correlation, reuse data
- Target Networks: Stable learning targets
- Double DQN: Fix overestimation bias
┌──────────────────────────────────────────────────────────────┐
│                       DQN Architecture                       │
├──────────────────────────────────────────────────────────────┤
│  State → [Hidden 64] → [Hidden 64] → Q-values                │
│                                                              │
│  Key Innovations:                                            │
│  1. Experience Replay Buffer                                 │
│  2. Target Network (updated every C steps)                   │
│  3. Double DQN (decouple selection from evaluation)          │
└──────────────────────────────────────────────────────────────┘
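A stripped-down sketch of the two stabilizers (a deque-based buffer and a target-network bootstrap; shapes, capacity and the stand-in "network" are illustrative, not the classes in `replay_buffer.py`/`target_network.py`):

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Store transitions; sample decorrelated mini-batches for the Q update."""

    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return s, a, r, s_next, done

def td_targets(target_net, r, s_next, done, gamma=0.99):
    """Bootstrap from the frozen target network, not the online one."""
    return r + gamma * target_net(s_next).max(axis=1) * (1.0 - done)

# Toy usage: a fake target "network" mapping 4-dim states to 2 Q-values.
fake_target_net = lambda s: s @ np.ones((4, 2))
buf = ReplayBuffer()
for _ in range(64):
    buf.push(np.random.randn(4), 0, 1.0, np.random.randn(4), False)
s, a, r, s_next, done = buf.sample()
print(td_targets(fake_target_net, r, s_next, done).shape)  # (32,)
# Every C steps: target weights <- online weights (or a slow Polyak average).
```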
Solving classic control problems with neural network function approximation.
"Why choose between policy gradients and value functions when you can have both?"
Combines the best of both worlds:
- Actor: Policy network π(a|s) - what to do
- Critic: Value network V(s) - how good is this state
                     ┌─────────────────┐
                     │   Environment   │
                     └────────┬────────┘
                              │
                      state s, reward r
                              │
       ┌──────────────────────┼─────────────────────┐
       │                      │                     │
       ▼                      │                     ▼
  ┌──────────┐                │               ┌──────────┐
  │  ACTOR   │                │               │  CRITIC  │
  │  π(a|s)  │◄── Advantage ──┤               │   V(s)   │
  └──────────┘    A = Q - V   │               └──────────┘
       │                      │                     │
       ▼                      │                     ▼
   action a               gradient              baseline
- GAE (Generalized Advantage Estimation): Tunable bias-variance tradeoff
- Entropy Regularization: Prevent premature convergence
- Shared Feature Layers: Parameter efficient actor-critic
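A compact GAE sketch under the usual recursion (γ, λ and the array layout are illustrative, not necessarily `advantage.py`'s exact interface):

```python
import numpy as np

def gae_advantages(rewards, values, dones, last_value=0.0, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
    A_t     = delta_t + gamma * lam * (1 - done_t) * A_{t+1}

    lam=0 -> one-step TD advantage (more bias, low variance);
    lam=1 -> Monte Carlo return minus baseline (no bias, high variance).
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else last_value
        delta = rewards[t] + gamma * next_value * (1.0 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1.0 - dones[t]) * gae
        advantages[t] = gae
    return advantages, advantages + values  # advantages and value targets

adv, targets = gae_advantages(
    rewards=np.ones(5),
    values=np.array([0.5, 0.6, 0.7, 0.8, 0.9]),
    dones=np.zeros(5),
    last_value=1.0)  # bootstrap value of the state after the rollout
print(adv.round(2))
```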
"The algorithm that made deep RL actually practical"
PPO takes A2C and adds one powerful constraint: don't let the policy change too much in a single update.
L_CLIP(θ) = E_t[ min(r_t·Â_t, clip(r_t, 1-ε, 1+ε)·Â_t) ], where r_t = π_θ(a_t|s_t) / π_old(a_t|s_t) is the probability ratio and Â_t is the (GAE) advantage estimate.
┌──────────────────────────────────────────────────────────────┐
│                   PPO: A2C with Guardrails                   │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│   Collect rollout → Compute GAE advantages                   │
│        ↓                                                     │
│   ┌─── For K epochs (reuse data!) ────┐                      │
│   │  Shuffle into mini-batches        │                      │
│   │  ratio = π_new / π_old            │                      │
│   │  clip(ratio, 1-ε, 1+ε)            │                      │
│   │  Take pessimistic (min) update    │                      │
│   └───────────────────────────────────┘                      │
│                                                              │
│   Key insight: clipping prevents catastrophic policy updates │
└──────────────────────────────────────────────────────────────┘
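The clipped surrogate itself fits in a few lines. A NumPy sketch of just the policy loss term (the real `ppo.py` also handles the value loss, entropy bonus and the backward pass):

```python
import numpy as np

def ppo_clip_objective(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate: take the pessimistic (minimum) of the two candidate updates."""
    ratio = np.exp(log_probs_new - log_probs_old)             # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return np.minimum(unclipped, clipped).mean()              # maximize this

# Toy check: a big ratio on a positive advantage gets clipped, capping the incentive.
print(ppo_clip_objective(
    log_probs_new=np.array([0.0, 1.0]),
    log_probs_old=np.array([0.0, 0.0]),
    advantages=np.array([1.0, 1.0])))
```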
- LunarLander: Discrete actions (4), PPO's stability shines on this harder control task
- BipedalWalker: Continuous actions (4D Gaussian policy), teaching a robot to walk
"Trust regions, deterministic policies, and twin critics"
Uses a hard KL divergence constraint enforced via conjugate gradient + line search:

maximize E[ (π_θ(a|s) / π_old(a|s)) · Â ]   subject to   E[ KL(π_old || π_θ) ] ≤ δ

The natural gradient update:

θ ← θ + sqrt(2δ / (g^T F^-1 g)) · F^-1 g, where g is the policy gradient and F is the Fisher information matrix; F^-1 g is computed with conjugate gradient, and a backtracking line search enforces the KL constraint.
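The workhorse is a conjugate gradient solve that recovers F^-1 g from Fisher-vector products alone, never forming F. A sketch with a toy explicit matrix standing in for the Fisher-vector-product callback used in practice:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Solve F x = g given only the Fisher-vector product fvp(v) = F @ v."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x (x starts at 0)
    p = r.copy()          # search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy usage: a 3x3 SPD matrix stands in for the real Fisher-vector product.
F = np.array([[2.0, 0.3, 0.0], [0.3, 1.5, 0.2], [0.0, 0.2, 1.0]])
g = np.array([1.0, 0.0, -1.0])
step_dir = conjugate_gradient(lambda v: F @ v, g)
print(np.allclose(F @ step_dir, g))  # True: step_dir = F^-1 g
```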
┌──────────────────────────────────────────────────────────────┐
│           DDPG → TD3: Continuous Control Pipeline            │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  DDPG: DQN ideas + Actor-Critic for continuous actions       │
│  state → [Actor μ(s)] → action → [Critic Q(s,a)] → value     │
│  + Replay buffer + Target networks + Exploration noise       │
│                                                              │
│  TD3 fixes DDPG's three problems:                            │
│    1. Twin critics: min(Q1, Q2) → no overestimation          │
│    2. Delayed updates: actor every 2 critic steps            │
│    3. Target smoothing: noise on target actions              │
│                                                              │
└──────────────────────────────────────────────────────────────┘
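A sketch of how fixes 1 and 3 show up in the critic target (stand-in linear networks and illustrative noise parameters, not the real `td3.py`; fix 2 lives in the training loop):

```python
import numpy as np

rng = np.random.default_rng(0)

def td3_targets(actor_target, critic1_target, critic2_target,
                r, s_next, done, gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """Compute the TD3 critic target y.

    Fix 3 (target smoothing): add clipped noise to the target action.
    Fix 1 (twin critics):     bootstrap from min(Q1, Q2) to curb overestimation.
    (Fix 2, delayed updates: update the actor and targets only every other critic step.)
    """
    a_next = actor_target(s_next)
    noise = np.clip(noise_std * rng.standard_normal(a_next.shape), -noise_clip, noise_clip)
    a_next = np.clip(a_next + noise, -max_action, max_action)
    q_next = np.minimum(critic1_target(s_next, a_next), critic2_target(s_next, a_next))
    return r + gamma * (1.0 - done) * q_next

# Toy usage with linear stand-ins (batch of 4, 3-dim state, 1-dim action).
actor = lambda s: np.tanh(s @ np.ones((3, 1)))
critic = lambda s, a: s.sum(axis=1, keepdims=True) + a
y = td3_targets(actor, critic, critic, r=np.ones((4, 1)),
                s_next=rng.standard_normal((4, 3)), done=np.zeros((4, 1)))
print(y.shape)  # (4, 1)
```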
- CartPole: TRPO with natural gradient (discrete)
- Pendulum: DDPG and TD3 comparison (continuous, 1D action)
- BipedalWalker: TD3 on hard 4D continuous control
"From random rollouts to learning without rules"
┌──────────────────────────────────────────────────────────────┐
│                  Game AI: Three Generations                  │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  MCTS (2006): UCB1 selection + random rollouts               │
│      ↳ No learning, just search                              │
│  AlphaZero (2017): MCTS + neural network (policy + value)    │
│      ↳ Self-play learns from scratch                         │
│  MuZero (2020): MCTS + learned model (no game rules!)        │
│      ↳ Plans in a learned latent space                       │
│                                                              │
│  Key insight: each generation replaces handcrafted           │
│  components with learned ones                                │
│                                                              │
└──────────────────────────────────────────────────────────────┘
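The shared core of all three generations is the tree's selection rule. A UCB1 sketch with illustrative names (AlphaZero swaps this for PUCT with a network prior; see `mcts.py` for the repo's version):

```python
import math

def ucb1(child_value_sum, child_visits, parent_visits, c=1.414):
    """UCB1: exploit the average value, explore rarely visited children."""
    if child_visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = child_value_sum / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

def select_child(children, parent_visits):
    """children: list of (value_sum, visits, move); pick the UCB1-maximizing move."""
    return max(children, key=lambda ch: ucb1(ch[0], ch[1], parent_visits))[2]

# Toy usage: move "b" has a worse average but far fewer visits, so it still wins selection.
children = [(9.0, 20, "a"), (0.5, 3, "b")]
print(select_child(children, parent_visits=23))  # "b"
```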
| Algorithm | Game | Win Rate vs Random |
|---|---|---|
| MCTS (1000 sims) | TicTacToe | ~99% |
| AlphaZero (80 iters) | TicTacToe | ~97% |
| AlphaZero (100 iters) | Connect Four | ~98% |
| MuZero (80 iters) | TicTacToe | ~72% |
MuZero achieves lower win rates than AlphaZero because it must also learn the game dynamics from scratch.
# Install the goods
pip install -r rl_fundamentals/requirements.txt
# === Week 1: Dynamic Programming ===
python rl_fundamentals/05_env_applications/gridworld/gridworld_dp.py
python rl_fundamentals/05_env_applications/frozenlake/solve_frozenlake.py
python rl_fundamentals/05_env_applications/taxi_v3/solve_taxi.py
# === Week 2: Temporal Difference ===
python rl_fundamentals/07_td_applications/cliffwalking/solve_cliffwalking.py
python rl_fundamentals/07_td_applications/cartpole/solve_cartpole.py
# === Week 3: Monte Carlo & Policy Gradients ===
python rl_fundamentals/10_mc_pg_applications/blackjack/solve_blackjack.py
python rl_fundamentals/10_mc_pg_applications/cartpole_reinforce/solve_cartpole_reinforce.py
# === Week 4: Unified Agent & Benchmarking ===
python rl_fundamentals/11_unified_agent/exploration_strategies.py
python rl_fundamentals/12_benchmarking/benchmark.py
# === Week 5: Deep Q-Networks ===
python rl_fundamentals/13_dqn_fundamentals/dqn.py
python rl_fundamentals/14_dqn_improvements/double_dqn.py
python rl_fundamentals/15_dqn_applications/cartpole_dqn/solve_cartpole_dqn.py
# === Week 6: Actor-Critic ===
python rl_fundamentals/16_actor_critic/a2c.py
python rl_fundamentals/17_actor_critic_applications/cartpole_a2c/solve_cartpole_a2c.py
# === Week 7: PPO ===
python rl_fundamentals/18_ppo/ppo.py
python rl_fundamentals/19_ppo_applications/lunarlander_ppo/solve_lunarlander_ppo.py
python rl_fundamentals/19_ppo_applications/bipedal_walker_ppo/solve_bipedal_walker_ppo.py
# === Week 9: TRPO, DDPG & TD3 ===
python rl_fundamentals/20_trpo/trpo.py
python rl_fundamentals/21_ddpg/ddpg.py
python rl_fundamentals/22_td3/td3.py
python rl_fundamentals/23_offpolicy_applications/pendulum_ddpg/solve_pendulum_ddpg.py
python rl_fundamentals/23_offpolicy_applications/pendulum_td3/solve_pendulum_td3.py
python rl_fundamentals/23_offpolicy_applications/bipedal_walker_td3/solve_bipedal_walker_td3.py
# === Week 10: Game AI ===
python rl_fundamentals/24_mcts/mcts.py
python rl_fundamentals/25_alphazero/alphazero.py
python rl_fundamentals/26_muzero/muzero.py
python rl_fundamentals/27_game_applications/tictactoe_mcts/solve_tictactoe_mcts.py
python rl_fundamentals/27_game_applications/tictactoe_alphazero/solve_tictactoe_alphazero.py
python rl_fundamentals/27_game_applications/connect4_alphazero/solve_connect4_alphazero.py
python rl_fundamentals/27_game_applications/tictactoe_muzero/solve_tictactoe_muzero.py
- Week 1: Dynamic Programming - When you have the cheat codes (full model)
- Week 2: Temporal Difference - Q-Learning & SARSA (model-free vibes)
- Week 3: Monte Carlo & Policy Gradients - Episode-based learning
- Week 4: Unified Agents - Modular exploration & benchmarking
- Week 5: Deep Q-Networks - Neural nets + experience replay + target networks
- Week 6: Actor-Critic - Best of policy gradients + value functions
- Week 7: PPO - Clipped surrogate, stable updates, discrete + continuous
- Week 9: TRPO, DDPG & TD3 - Trust regions, off-policy continuous control
- Week 10: Game AI - MCTS, AlphaZero self-play, MuZero learned model
- Week 11+: Research Immersion - Paper reading, SAC, and beyond...
| Algorithm | Update Rule | Requires Model? |
|---|---|---|
| Value Iteration | V(s) ← max_a Σ P(s'|s,a)[R + γV(s')] | Yes |
| Policy Iteration | Evaluate → Improve → Repeat | Yes |
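A compact value iteration sketch over a transition table in the Gym `env.P` style, where `P[s][a]` is a list of `(prob, next_state, reward, done)` tuples; the repo's DP code may structure things differently:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-8):
    """Iterate V(s) <- max_a sum_{s'} P(s'|s,a)[R + gamma V(s')] to a fixed point."""
    V = np.zeros(n_states)
    while True:
        V_new = np.array([
            max(sum(p * (r + gamma * V[s2] * (not done)) for p, s2, r, done in P[s][a])
                for a in range(n_actions))
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Toy 2-state chain: action 0 stays (reward 0), action 1 reaches the terminal state (reward 1).
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, True)]},
    1: {0: [(1.0, 1, 0.0, True)], 1: [(1.0, 1, 0.0, True)]},
}
print(value_iteration(P, n_states=2, n_actions=2))  # ~[1.0, 0.0]
```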
| Algorithm | Update Rule | Policy Type |
|---|---|---|
| Q-Learning | Q(S,A) ← Q(S,A) + α[R + γ·max_a Q(S',a) - Q(S,A)] | Off-policy |
| SARSA | Q(S,A) ← Q(S,A) + α[R + γQ(S',A') - Q(S,A)] | On-policy |
| Algorithm | Update Rule | Key Property |
|---|---|---|
| MC Prediction | V(s) ← V(s) + α[G_t - V(s)] | Unbiased, high variance |
| REINFORCE | θ ← θ + α·G_t·∇log π(a|s) | Direct policy optimization |
| Algorithm | Key Innovation | Benefit |
|---|---|---|
| DQN | Experience Replay + Target Network | Stable deep RL |
| Double DQN | Decouple selection from evaluation | Reduce overestimation |
| Algorithm | Components | Benefit |
|---|---|---|
| A2C | Actor π(a|s) + Critic V(s) | Lower variance than REINFORCE |
| GAE | λ-weighted TD errors | Tunable bias-variance |
| Algorithm | Key Innovation | Benefit |
|---|---|---|
| PPO | Clipped surrogate ratio | Stable policy updates, multi-epoch reuse |
| Algorithm | Key Innovation | Benefit |
|---|---|---|
| TRPO | KL-constrained natural gradient | Monotonic improvement guarantee |
| DDPG | Deterministic policy + replay | Off-policy continuous control |
| TD3 | Twin critics + delayed + smoothing | Robust continuous control |
| Algorithm | Key Innovation | Benefit |
|---|---|---|
| MCTS | UCB1 tree search + random rollouts | Strong play without learning |
| AlphaZero | MCTS + policy-value network + self-play | Learns from scratch, superhuman |
| MuZero | Learned dynamics model + latent MCTS | No game rules needed |
| Method | Bootstraps? | Model-Free? | Episode End? | Bias | Variance |
|---|---|---|---|---|---|
| DP | Yes | No | N/A | Low | Low |
| TD | Yes | Yes | No | Some | Medium |
| MC | No | Yes | Yes | None | High |
| PG | No | Yes | Yes | None | Very High |
| DQN | Yes | Yes | No | Some | Low |
| A2C | Yes (GAE) | Yes | No | Tunable | Medium |
| PPO | Yes (GAE) | Yes | No | Tunable | Low |
| TRPO | Yes (GAE) | Yes | No | Tunable | Low |
| DDPG | Yes | Yes | No | Some | Low |
| TD3 | Yes | Yes | No | Low | Low |
| MCTS | No | No (needs a simulator) | No | None | High |
| AlphaZero | No | No (needs a simulator) | No | Some | Low |
| MuZero | Yes | Learns its own model | No | Some | Medium |
This repo follows the ancient wisdom:
- Understand the math - Actually derive things, no hand-waving
- Implement from scratch - Suffering builds character
- Visualize everything - Pretty pictures > walls of numbers
- Keep it real - Comments are for future confused me
- Sutton & Barto's RL Book (the bible)
- David Silver's lectures (goated)
- OpenAI Spinning Up (documentation supremacy)
- Stack Overflow (no shame)
Currently speedrunning: MCTS, AlphaZero & MuZero
Next up: Research immersion → paper reading & implementation!
Stars appreciated, issues tolerated, PRs celebrated






