This roadmap takes you from core fundamentals to advanced distributed systems. It blends theory, hands-on projects, and real-world architecture patterns so you can design and operate systems as they grow in scale.

| Topic | What It Is | Key Skills | Resources |
|---|---|---|---|
| CDN | Content Delivery Network that caches static assets geographically close to users | Cloudflare, Akamai, AWS CloudFront | Cloudflare Learning Center |
| Caching | Store frequently accessed data in memory to reduce latency & load | Redis, Memcached, HTTP caching headers | Caching Strategies – AWS Docs |
| Sharding | Splitting DB/data across multiple nodes based on a key | Range-based vs hash-based sharding | System Design Primer – Sharding |
| Queueing | Asynchronous task processing | RabbitMQ, Kafka, SQS, Celery | Message Queues Explained |
| Replication | Copying data across servers for redundancy & scaling | Leader–Follower, Multi-Leader | Database Replication Patterns |
| Partitioning | Splitting data logically or physically | Horizontal vs Vertical partitioning | Partitioning – Microsoft Docs |
| API Gateway | Single entry point for multiple services | Kong, Nginx, AWS API Gateway | API Gateway Pattern – Microservices.io |
| Rate Limiting | Limit requests per user/IP | Token bucket, leaky bucket algorithms | Rate Limiting Algorithms |
| CAP Theorem | During a network partition, a system must choose between Consistency and Availability | CP vs AP systems | CAP Theorem Illustrated |
| Microservices | Independent deployable services | API communication, scaling, monitoring | Microservices.io |
| Load Balancing | Distribute traffic across servers | Round robin, least connections | NGINX Load Balancing |
| Fault Tolerance | System keeps running despite failures | Redundancy, retries | Fault Tolerance Overview |
| Database Scaling | Vertical vs horizontal scaling | Read replicas, partitioning | Scaling Databases |
| Service Discovery | Find services dynamically | Consul, Eureka | Service Discovery Pattern |
| Consistency Models | Strong, eventual, causal | Jepsen analysis | |
| Eventual Consistency | Data converges over time | DynamoDB, Cassandra, ScyllaDB | Amazon DynamoDB Eventual Consistency |
| Distributed Transactions | Transactions across multiple systems | Two-phase commit, Saga pattern | Distributed Transactions Patterns |
| Monolith vs Microservices | Trade-offs of single app vs many services | Maintainability, complexity | Martin Fowler – MonolithFirst |
| Leader Election | Choosing a node as leader in a cluster | Raft, Paxos, ZooKeeper | Raft Visualization |

| Topic | What It Is | Key Skills | Resources |
|---|---|---|---|
| Leader-Follower Replication | One leader writes, followers replicate | PostgreSQL streaming replication | PostgreSQL Replication |
| WAL (Write-Ahead Log) | Changes are recorded in a log before they are applied to data files, so committed work survives a crash | PostgreSQL WAL internals | WAL – PostgreSQL Docs |
| Asynchronous Processing | Do work in background | Celery, Sidekiq | Celery Docs |
| Transaction Isolation | Levels: Read Uncommitted → Serializable | | PostgreSQL Isolation |
| Read/Write Patterns | Optimize for read-heavy vs write-heavy workloads | | System Design Primer – Patterns |
| Consistent Hashing | Map keys to nodes so that adding or removing a node moves only a small share of keys | | Consistent Hashing Explained |
| Redis/Memcached | In-memory key-value stores | | Redis Docs |
| Backup & Restore | Point-in-time recovery | | PostgreSQL Backup |
| Hot/Cold Storage | Hot = fast, Cold = cheap | AWS S3, Glacier | AWS Storage Classes |
| Data Partitioning | Horizontal/vertical partitioning | | Best Practices |
| Object Storage | Blob storage like S3 | | S3 Docs |
| SQL vs NoSQL | Relational vs document/columnar | | Comparison |
| Data Retention | Compliance, GDPR | | Data Retention Guide |
| Data Modeling | ER diagrams, normalization | | Database Design Basics |
| OLAP vs OLTP | Analytical vs transactional DBs | | OLAP vs OLTP |
| ACID & BASE | Transaction properties | | ACID vs BASE |
| Bloom Filters | Probabilistic membership check (false positives possible, no false negatives) | | Bloom Filters Explained |
| File Systems | Ext4, NTFS, ZFS | | File System Concepts |
| S3 Basics | AWS object storage | | AWS S3 Getting Started |
| B+ Trees | DB indexing structure | | B+ Tree Tutorial |
| Indexing | Speed up queries | | Database Indexing Guide |

| Topic | What It Is | Key Skills | Resources |
|---|---|---|---|
| JWT | Token-based authentication | | JWT.io |
| CORS | Browser rules that control cross-origin requests | | MDN CORS |
| OAuth | Delegated authorization | | OAuth2 Simplified |
| Throttling | Limit traffic | Token bucket, leaky bucket | Rate Limiting Guide |
| Serialization | JSON, ProtoBuf | | Protocol Buffers Docs |
| API Security | OWASP API Security Top 10 | | OWASP API Security |
| Long Polling | Server holds an HTTP request open until data is available | | MDN Long Polling |
| WebSockets | Full-duplex communication | | WebSockets Guide |
| Idempotency | Repeating the same request has the same effect as sending it once | | Idempotency Explained |
| Service Mesh | Istio, Linkerd | | Service Mesh Intro |
| Retry Patterns | Exponential backoff | | Retry Best Practices |
| REST vs gRPC | Typically JSON over HTTP vs Protobuf-based RPC over HTTP/2 | | gRPC Docs |
| API Versioning | URI, header-based | | API Versioning Strategies |
| Circuit Breaker | Stop cascading failures | Netflix Hystrix | |
| Fan-out/Fan-in | Split & aggregate requests | | Parallel Patterns |
| Message Queues | Kafka, RabbitMQ | | Kafka Guide |
| Dead Letter Queue | Store failed messages | | DLQ in AWS SQS |

| Topic | What It Is | Key Skills | Resources |
|---|---|---|---|
| Metrics | Quantitative system data | Prometheus, Grafana | Prometheus Docs |
| Alerting | Notify on issues | Alertmanager, PagerDuty | Prometheus Alerting |
| Failover | Switch to backup system | DNS failover, DB failover | Failover Concepts |
| Logging | Centralized logs | ELK stack, Loki | Logging Best Practices |
| Rollbacks | Revert deployment | Git, Kubernetes | Deployment Strategies |
| Monitoring | Metrics + health checks | | Monitoring 101 |
| Heartbeats | Service health signals | | Heartbeat Patterns |
| Retry Logic | Retries with backoff | | AWS Retry Best Practices |
| Autoscaling | Scale based on load | Kubernetes HPA | K8s Autoscaling |
| SLO/SLI/SLA | Availability & performance goals | | Google SRE Book – Chapter 4 |
| Load Testing | Simulate high traffic | k6, Locust | Load Testing Guide |
| Error Budgets | Amount of unreliability an SLO permits, such as allowable downtime | | SRE Book – Error Budgets |
| Health Checks | Liveness/readiness probes | | K8s Health Checks |
| Incident Response | Handling outages | PagerDuty, Statuspage | Incident Management Guide |
| Chaos Engineering | Break things intentionally | Netflix Chaos Monkey | Principles of Chaos |
| Distributed Tracing | Trace requests across services | OpenTelemetry, Jaeger | OpenTelemetry Docs |
| Canary Deployments | Gradual rollout | | Deployment Strategies |
| Graceful Degradation | Reduce features on failure | | Degradation Strategies |
| Blue-Green Deployment | Swap between two environments | | Blue-Green Guide |

Goal: Understand the building blocks of large-scale systems.
Topics
- Scaling basics → Vertical vs horizontal scaling
- CDN & Caching strategies → Browser cache, Redis/Memcached, HTTP caching headers
- Load Balancing → Round Robin, Least Connections, Consistent Hashing
- API Gateway basics → Routing, authentication, throttling
- Rate Limiting → Token Bucket, Leaky Bucket
- Monolith vs Microservices → When to split
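
The token bucket named in the rate limiting topic can be sketched in a few lines. This is a minimal single-process illustration, not a production limiter: the class name `TokenBucket`, its parameters, and the injectable `clock` are assumptions made for the example.

```python
import time


class TokenBucket:
    """Holds up to `capacity` tokens and refills at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock  # injectable so the example can be tested without sleeping
        self.tokens = capacity  # start full, which allows an initial burst
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if enough are available."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time that has passed, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A bucket with `capacity=2` and `rate=1` admits a burst of two requests and then one request per second. A leaky bucket differs in that it drains queued requests at a fixed rate instead of permitting bursts.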
Hands-On
- Build a simple FastAPI app with Redis caching and Nginx load balancing
- Add an API Gateway (Kong or Nginx) in front of it
- Implement rate limiting using Redis
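
Before wiring in Redis, the caching step can be prototyped with the cache-aside pattern. The sketch below uses an in-memory dictionary as a stand-in for Redis; `TTLCache`, `get_user`, and `load_from_db` are hypothetical names introduced for illustration.

```python
import time


class TTLCache:
    """In-memory stand-in for a cache such as Redis, with per-entry expiry."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._data[key]  # expired entries are evicted lazily on read
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, self.clock() + self.ttl)


def get_user(user_id, cache, load_from_db):
    """Cache-aside: read the cache first, fall back to the database on a miss."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(user_id)  # slow path, hypothetical database loader
    cache.set(key, value)
    return value
```

The TTL bounds how stale a cached value can get. Choosing it is a trade-off between database load and freshness.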
Goal: Learn how to scale and structure data.
Topics
- SQL vs NoSQL – Trade-offs
- Indexing & B+ Trees
- Leader-Follower Replication & WAL
- Sharding & Partitioning
- Hot/Cold Storage
- Backup & Restore strategies
- ACID vs BASE & Consistency Models
Hands-On
- PostgreSQL with read replicas
- Implement sharding manually (range/hash partitioning)
- Add backup & restore scripts
- Compare strong vs eventual consistency with MongoDB or Cassandra
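
For the manual sharding exercise, the routing decision is the core piece. The following sketch contrasts hash-based and range-based routing. The function names and the boundary list are assumptions for illustration, and a real setup would also need a plan for moving data when the shard count changes.

```python
import bisect
import hashlib


def hash_shard(key: str, num_shards: int) -> int:
    """Hash-based routing: spreads keys evenly but scatters adjacent keys."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def range_shard(key: str, boundaries: list[str]) -> int:
    """Range-based routing over sorted boundaries.

    With boundaries ["h", "q"]: shard 0 holds keys below "h", shard 1 holds
    keys from "h" up to (not including) "q", and shard 2 holds the rest.
    """
    return bisect.bisect_right(boundaries, key)
```

Range sharding keeps neighbouring keys together, which helps range scans but can create hot shards. Hash sharding balances load but turns a range scan into a query against every shard.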
Resources
- Designing Data-Intensive Applications (DDIA) – Chapters 1-5
- PostgreSQL Replication Docs
- MongoDB Sharding Guide
Goal: Master async patterns for scalability.
Topics
- Queueing – RabbitMQ, Kafka, AWS SQS
- Dead Letter Queues
- Fan-out/Fan-in patterns
- Retry Patterns & Circuit Breaker
- Long Polling vs WebSockets
- Service Mesh basics
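
As a rough illustration of the circuit breaker topic, the sketch below opens the circuit after a run of consecutive failures and lets one trial call through after a cooldown. The class, its thresholds, and the `CircuitOpenError` exception are assumptions for the example and do not reflect any particular library's API.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Cooldown elapsed: half-open, so one trial call is let through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open keeps callers from piling up on a dependency that is already struggling, which is how the pattern limits cascading failures.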
Hands-On
- Mini Project:
  - Event-driven order processing system
  - Producer → Queue → Consumer
  - DLQ for failed tasks
  - Retry with exponential backoff
  - WebSocket for real-time status updates
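
The retry and DLQ steps of the mini project can be sketched without a broker. In the example below a plain list stands in for the dead letter queue, and `backoff_delays`, `process_with_retry`, and their parameters are hypothetical names chosen for illustration.

```python
import random
import time


def backoff_delays(base=0.5, factor=2.0, max_delay=30.0, retries=5):
    """Yield the wait before each retry: base, base*factor, ... capped at max_delay."""
    delay = base
    for _ in range(retries):
        yield min(delay, max_delay)
        delay *= factor


def process_with_retry(message, handler, dead_letters, sleep=time.sleep,
                       jitter=True, **backoff):
    """Run handler(message); retry on failure, then park the message in dead_letters."""
    delays = list(backoff_delays(**backoff))
    for attempt in range(len(delays) + 1):
        try:
            return handler(message)
        except Exception as exc:
            if attempt == len(delays):
                # Retries exhausted: keep the message and the error for inspection.
                dead_letters.append((message, repr(exc)))
                return None
            delay = delays[attempt]
            # Full jitter spreads retries out so consumers do not retry in lockstep.
            sleep(random.uniform(0, delay) if jitter else delay)
```

With a real broker such as RabbitMQ, Kafka, or SQS, the list would be replaced by a dedicated queue or topic, and the delay would usually be handled by the broker or a scheduler instead of a blocking sleep.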
Goal: Make systems resilient and debuggable.
Topics
- Fault Tolerance & Failover
- Distributed Tracing (OpenTelemetry, Jaeger)
- Metrics & Monitoring (Prometheus + Grafana)
- Alerting & Incident Response
- Health Checks & Graceful Degradation
- Chaos Engineering (Netflix Chaos Monkey)
Hands-On
- Mini Project:
  - Deploy the event-driven order processing project (Month 3) on Kubernetes
  - Add health checks, liveness/readiness probes
  - Implement autoscaling based on CPU/memory
  - Add Prometheus + Grafana dashboards
  - Simulate failures & verify fault tolerance
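
Heartbeats and failover from the topics above can be explored locally before touching Kubernetes. The sketch below is a toy failure detector: `HeartbeatMonitor`, its timeout, and the preferred-order failover rule are assumptions for the example, and real systems use more careful detection than a single timeout.

```python
import time


class HeartbeatMonitor:
    """Treats a node as healthy if it sent a heartbeat within `timeout` seconds."""

    def __init__(self, timeout=5.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = {}  # node id -> time of the most recent heartbeat

    def beat(self, node_id):
        self.last_seen[node_id] = self.clock()

    def healthy_nodes(self):
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen <= self.timeout
        )

    def pick_primary(self, preferred_order):
        """Failover rule: the first healthy node in preference order, or None."""
        healthy = set(self.healthy_nodes())
        for node in preferred_order:
            if node in healthy:
                return node
        return None
```

When the preferred node stops sending heartbeats, `pick_primary` falls through to the next healthy node, which is the essence of a failover decision.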
Goal: Handle complexity of multi-node systems.
Topics
- CAP Theorem deep dive
- Leader Election (Raft, Paxos)
- Distributed Transactions – 2PC, Saga Pattern
- Eventual Consistency
- Consistent Hashing
- Service Discovery (Consul, Eureka)
Hands-On
- Mini Project:
  - Build a distributed key-value store (in Go or Python)
  - Implement leader election with Raft
  - Support consistent hashing for partitioning
  - Add service discovery with Consul
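
For the partitioning step of the key-value store, a consistent hash ring with virtual nodes is one common approach. The sketch below is a minimal version: the `HashRing` class, the `vnodes` default, and the use of SHA-256 are assumptions for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring (a 64-bit integer)."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual nodes per physical node, smooths the distribution
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key):
        """Return the node owning the first ring position clockwise from the key."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

Removing a node only reassigns the keys that node owned, and adding a node only takes keys from existing owners. With plain `hash(key) % n`, changing `n` would remap most keys.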
Resources
- Raft Visualization
- Distributed Systems: Principles and Paradigms – Tanenbaum & van Steen
- Microservices.io Patterns
Goal: Integrate all concepts into one large-scale system.
Topics
- API Security – JWT, OAuth, CORS, Idempotency
- Blue-Green & Canary Deployments
- Data Retention & Compliance
- OLAP vs OLTP optimization
- Graceful Rollbacks
Hands-On – Capstone Project
- High-scale e-commerce backend with:
  - API Gateway & microservices
  - PostgreSQL sharding + read replicas
  - Redis caching layer
  - Kafka-based async processing
  - Prometheus + Grafana monitoring
  - Kubernetes autoscaling
  - Canary deployment strategy
  - Disaster recovery & backup scripts
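
Idempotency from the topics above matters most in the capstone's payment or order endpoints, where clients retry on timeouts. The sketch below shows the idea with an idempotency key. `PaymentService` and its fields are hypothetical, and the in-memory dictionary stands in for a shared store that would need an atomic check-and-set in a real deployment.

```python
class PaymentService:
    """Toy service that applies each idempotency key at most once."""

    def __init__(self):
        self._results = {}  # idempotency key -> response returned the first time
        self.charges = []   # side effects that actually happened

    def charge(self, idempotency_key, customer_id, amount_cents):
        # A repeated key returns the stored response without charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((customer_id, amount_cents))
        response = {
            "charge_id": len(self.charges),
            "customer_id": customer_id,
            "amount_cents": amount_cents,
        }
        self._results[idempotency_key] = response
        return response
```

A client that times out and resends the request with the same key gets the original response back, and the customer is charged once.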