Skip to content

Sagor0078/System-Design-and-Scaling


Roadmap to Backend Scaling & System Design

This roadmap takes you from core fundamentals to advanced distributed systems, blending theory, hands-on projects, and real-world architecture patterns so you can confidently handle systems at any scale.

Scaling & Architecture

| Topic | What It Is | Key Skills | Resources |
|---|---|---|---|
| CDN | Content Delivery Network caches static assets geographically close to users | Cloudflare, Akamai, AWS CloudFront | Cloudflare Learning Center |
| Caching | Store frequently accessed data in memory to reduce latency & load | Redis, Memcached, HTTP caching headers | Caching Strategies – AWS Docs |
| Sharding | Splitting DB/data across multiple nodes based on a key | Range-based vs hash-based sharding | System Design Primer – Sharding |
| Queueing | Asynchronous task processing | RabbitMQ, Kafka, SQS, Celery | Message Queues Explained |
| Replication | Copying data across servers for redundancy & scaling | Leader–Follower, Multi-Leader | Database Replication Patterns |
| Partitioning | Splitting data logically or physically | Horizontal vs Vertical partitioning | Partitioning – Microsoft Docs |
| API Gateway | Single entry point for multiple services | Kong, Nginx, AWS API Gateway | API Gateway Pattern – Microservices.io |
| Rate Limiting | Limit requests per user/IP | Token bucket, leaky bucket algorithms | Rate Limiting Algorithms |
| CAP Theorem | Trade-off between Consistency, Availability, Partition Tolerance | CP vs AP systems | CAP Theorem Illustrated |
| Microservices | Independent deployable services | API communication, scaling, monitoring | Microservices.io |
| Load Balancing | Distribute traffic across servers | Round robin, least connections | NGINX Load Balancing |
| Fault Tolerance | System keeps running despite failures | Redundancy, retries | Fault Tolerance Overview |
| Database Scaling | Vertical vs horizontal scaling | Read replicas, partitioning | Scaling Databases |
| Service Discovery | Find services dynamically | Consul, Eureka | Service Discovery Pattern |
| Consistency Models | Strong, eventual, causal | | Jepsen analysis |
| Eventual Consistency | Data converges over time | DynamoDB, Cassandra, ScyllaDB | Amazon DynamoDB Eventual Consistency |
| Distributed Transactions | Transactions across multiple systems | Two-phase commit, Saga pattern | Distributed Transactions Patterns |
| Monolith vs Microservices | Trade-offs of single app vs many services | Maintainability, complexity | Martin Fowler – MonolithFirst |
| Leader Election | Choosing a node as leader in a cluster | Raft, Paxos, Zookeeper | Raft Visualization |
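
The two load-balancing strategies named in the table (round robin and least connections) can be sketched in a few lines. This is a minimal in-process illustration, not how NGINX implements them; the class and method names are hypothetical.

```python
import itertools


class RoundRobin:
    """Hand out servers in a fixed rotation, ignoring their current load."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self):
        return next(self._cycle)


class LeastConnections:
    """Send each new request to the server with the fewest open connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Ties go to the server listed first, because dicts keep insertion order.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call this when the request finishes so the count stays accurate.
        self.active[server] -= 1
```

Round robin suits requests of similar cost; least connections tends to behave better when some requests are much slower than others.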

Databases & Storage

| Topic | What It Is | Key Skills | Resources |
|---|---|---|---|
| Leader-Follower Replication | One leader writes, followers replicate | PostgreSQL streaming replication | PostgreSQL Replication |
| WAL (Write Ahead Log) | Log before committing to disk for durability | PostgreSQL WAL internals | WAL – PostgreSQL Docs |
| Asynchronous Processing | Do work in background | Celery, Sidekiq | Celery Docs |
| Transaction Isolation | Levels: Read Uncommitted → Serializable | | PostgreSQL Isolation |
| Read/Write Patterns | Optimize for read-heavy vs write-heavy workloads | | System Design Primer – Patterns |
| Consistent Hashing | Distribute load/data evenly | | Consistent Hashing Explained |
| Redis/Memcached | In-memory key-value store | | Redis Docs |
| Backup & Restore | Point-in-time recovery | | PostgreSQL Backup |
| Hot/Cold Storage | Hot = fast, Cold = cheap | AWS S3, Glacier | AWS Storage Classes |
| Data Partitioning | Horizontal/vertical partitioning | | Best Practices |
| Object Storage | Blob storage like S3 | | S3 Docs |
| SQL vs NoSQL | Relational vs document/columnar | | Comparison |
| Data Retention | Compliance, GDPR | | Data Retention Guide |
| Data Modeling | ER diagrams, normalization | | Database Design Basics |
| OLAP vs OLTP | Analytical vs transactional DBs | | OLAP vs OLTP |
| ACID & BASE | Transaction properties | | ACID vs BASE |
| Bloom Filters | Probabilistic membership check | | Bloom Filters Explained |
| File Systems | Ext4, NTFS, ZFS | | File System Concepts |
| S3 Basics | AWS object storage | | AWS S3 Getting Started |
| B+ Trees | DB indexing structure | | B+ Tree Tutorial |
| Indexing | Speed up queries | | Database Indexing Guide |
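
As an illustration of the "probabilistic membership check" in the Bloom Filters row, here is a minimal sketch. The bit-array size, the number of hashes, and the use of salted SHA-256 digests are assumptions chosen for readability, not tuned values.

```python
import hashlib


class BloomFilter:
    """Set-like structure: no false negatives, but occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

A typical use is to consult the filter before an expensive disk or network lookup and skip the lookup whenever `might_contain` returns `False`.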

Communication & APIs

| Topic | What It Is | Key Skills | Resources |
|---|---|---|---|
| JWT | Token-based authentication | | JWT.io |
| CORS | Cross-origin requests security | | MDN CORS |
| OAuth | Auth delegation | | OAuth2 Simplified |
| Throttling | Limit traffic | Token bucket, leaky bucket | Rate Limiting Guide |
| Serialization | JSON, ProtoBuf | | Protocol Buffers Docs |
| API Security | OWASP API Top 10 | | OWASP API Security |
| Long Polling | HTTP hold-open | | MDN Long Polling |
| WebSockets | Full-duplex communication | | WebSockets Guide |
| Idempotency | Same request multiple times → same result | | Idempotency Explained |
| Service Mesh | Istio, Linkerd | | Service Mesh Intro |
| Retry Patterns | Exponential backoff | | Retry Best Practices |
| REST vs gRPC | HTTP vs binary RPC | | gRPC Docs |
| API Versioning | URI, header-based | | API Versioning Strategies |
| Circuit Breaker | Stop cascading failures | | Netflix Hystrix |
| Fan-out/Fan-in | Split & aggregate requests | | Parallel Patterns |
| Message Queues | Kafka, RabbitMQ | | Kafka Guide |
| Dead Letter Queue | Store failed messages | | DLQ in AWS SQS |

Reliability & Observability

| Topic | What It Is | Key Skills | Resources |
|---|---|---|---|
| Metrics | Quantitative system data | Prometheus, Grafana | Prometheus Docs |
| Alerting | Notify on issues | Alertmanager, PagerDuty | Prometheus Alerting |
| Failover | Switch to backup system | DNS failover, DB failover | Failover Concepts |
| Logging | Centralized logs | ELK stack, Loki | Logging Best Practices |
| Rollbacks | Revert deployment | Git, Kubernetes | Deployment Strategies |
| Monitoring | Metrics + health checks | | Monitoring 101 |
| Heartbeats | Service health signals | | Heartbeat Patterns |
| Retry Logic | Retries with backoff | | AWS Retry Best Practices |
| Autoscaling | Scale based on load | Kubernetes HPA | K8s Autoscaling |
| SLO/SLI/SLA | Availability & performance goals | | Google SRE Book – Chapter 4 |
| Load Testing | Simulate high traffic | k6, Locust | Load Testing Guide |
| Error Budgets | Allowable downtime | | SRE Book – Error Budgets |
| Health Checks | Liveness/readiness probes | | K8s Health Checks |
| Incident Response | Handling outages | PagerDuty, Statuspage | Incident Management Guide |
| Chaos Engineering | Break things intentionally | Netflix Chaos Monkey | Principles of Chaos |
| Distributed Tracing | Trace requests across services | OpenTelemetry, Jaeger | OpenTelemetry Docs |
| Canary Deployments | Gradual rollout | | Deployment Strategies |
| Graceful Degradation | Reduce features on failure | | Degradation Strategies |
| Blue-Green Deployment | Swap between two environments | | Blue-Green Guide |
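
The "allowable downtime" behind an error budget is simple arithmetic on the SLO. The sketch below assumes an availability SLO expressed as a fraction and a 30-day window; the function names are hypothetical.

```python
def error_budget_minutes(slo, window_days=30):
    """Downtime allowed in the window for an availability SLO such as 0.999."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Minutes of budget left after the downtime already spent (negative if overspent)."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
budget = error_budget_minutes(0.999)
```

When the remaining budget approaches zero, teams commonly slow down risky releases until the window rolls over.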

6-Module Backend System Design & Scaling Learning Plan

Module 1 – Foundations & Core Concepts

Goal: Understand the building blocks of large-scale systems.

Topics

  • Scaling basics → Vertical vs horizontal scaling
  • CDN & Caching strategies → Browser cache, Redis/Memcached, HTTP caching headers
  • Load Balancing → Round Robin, Least Connections, Consistent Hashing
  • API Gateway basics → Routing, authentication, throttling
  • Rate Limiting → Token Bucket, Leaky Bucket
  • Monolith vs Microservices → When to split
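
To make the Token Bucket topic concrete, here is a minimal single-process sketch. The `clock` parameter is an assumption added so the bucket can be tested without sleeping; a real deployment would keep the state somewhere shared, such as Redis.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` requests, refilling at `rate` tokens per second."""

    def __init__(self, capacity, rate, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = float(capacity)   # start full so an initial burst is allowed
        self.updated = clock()

    def allow(self, cost=1):
        now = self.clock()
        # Refill for the time elapsed since the last call, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A leaky bucket differs mainly in that it drains requests at a constant rate instead of permitting bursts.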

Hands-On

  1. Build a simple FastAPI app with Redis caching and Nginx load balancing
  2. Add an API Gateway (Kong or Nginx) in front of it
  3. Implement rate limiting using Redis
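
Before wiring up Redis, the caching step can be prototyped with the cache-aside pattern. In this sketch `TTLCache` is a hypothetical in-memory stand-in for a Redis client, and `load` stands for whatever slow lookup (database query, upstream API call) the app performs.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a key-value cache with expiry (illustrative only)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]   # lazily evict expired entries
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)


def get_with_cache(cache, key, load, ttl=60):
    """Cache-aside: try the cache, fall back to `load` on a miss, then populate the cache."""
    value = cache.get(key)
    if value is None:
        value = load(key)
        cache.set(key, value, ttl)
    return value
```

One limitation of this sketch is that a loaded value of `None` is never cached, so lookups for missing records always reach the backing store.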

Resources

Module 2 – Databases & Storage

Goal: Learn how to scale and structure data.

Topics

  • SQL vs NoSQL – Trade-offs
  • Indexing & B+ Trees
  • Leader-Follower Replication & WAL
  • Sharding & Partitioning
  • Hot/Cold Storage
  • Backup & Restore strategies
  • ACID vs BASE & Consistency Models

Hands-On

  1. PostgreSQL with read replicas
  2. Implement sharding manually (range/hash partitioning)
  3. Add backup & restore scripts
  4. Compare strong vs eventual consistency with MongoDB or Cassandra
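
For the manual sharding exercise, the routing decision is the core piece. The sketch below shows one possible hash-based router and one range-based router; the function names and the boundary convention are assumptions for illustration.

```python
import bisect
import hashlib


def hash_shard(key, num_shards):
    """Hash-based: spreads keys evenly, but a range scan has to visit every shard."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def range_shard(key, upper_bounds):
    """Range-based: shard i holds keys below upper_bounds[i]; the last shard takes the rest.

    `upper_bounds` must be sorted, e.g. [1000, 2000] defines three shards.
    """
    return bisect.bisect_right(upper_bounds, key)
```

A stable hash such as SHA-256 is used instead of Python's built-in `hash()`, which is randomized per process for strings and would route the same key differently after a restart.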

Resources

Module 3 – Asynchronous Processing & Messaging

Goal: Master async patterns for scalability.

Topics

  • Queueing – RabbitMQ, Kafka, AWS SQS
  • Dead Letter Queues
  • Fan-out/Fan-in patterns
  • Retry Patterns & Circuit Breaker
  • Long Polling vs WebSockets
  • Service Mesh basics
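
The Circuit Breaker topic can be illustrated with a small state machine: closed, open, and half-open. This is a simplified sketch, not the Hystrix implementation; the thresholds, the `clock` parameter, and the class names are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Otherwise the circuit is half-open: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()   # open, or re-open after a failed trial
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast while the circuit is open gives the struggling dependency time to recover instead of piling more requests onto it.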

Hands-On

  • Mini Project: Event-driven order processing system
  1. Producer → Queue → Consumer
  2. DLQ for failed tasks
  3. Retry with exponential backoff
  4. WebSocket for real-time status updates
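
The retry and DLQ steps of the mini project can be prototyped without a broker. In this sketch a plain list stands in for the dead letter queue, and `sleep` and `jitter` are injectable assumptions so the logic can be exercised without waiting.

```python
import random
import time


def backoff_delays(base=0.5, factor=2.0, cap=30.0, retries=5):
    """Delays before each retry: base, base*factor, base*factor^2, ... capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(retries)]


def process_with_retry(message, handler, dead_letters, delays,
                       sleep=time.sleep, jitter=random.random):
    """Try once, retry after each delay, and park the message in the DLQ if all attempts fail."""
    for delay in [None] + list(delays):
        if delay is not None:
            # Jitter: wait a random fraction of the delay so consumers do not retry in lockstep.
            sleep(delay * jitter())
        try:
            return handler(message)
        except Exception as exc:
            last_error = exc
    dead_letters.append({"message": message, "error": repr(last_error)})
    return None
```

With a real broker such as RabbitMQ or SQS, the list append would be replaced by publishing to, or letting the broker redirect to, the configured dead letter queue.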

Resources

Module 4 – Reliability, Observability & Fault Tolerance

Goal: Make systems resilient and debuggable.

Topics

  • Fault Tolerance & Failover
  • Distributed Tracing (OpenTelemetry, Jaeger)
  • Metrics & Monitoring (Prometheus + Grafana)
  • Alerting & Incident Response
  • Health Checks & Graceful Degradation
  • Chaos Engineering (Netflix Chaos Monkey)

Hands-On

  • Mini Project:
  1. Deploy your Module 3 project on Kubernetes
  2. Add health checks, liveness/readiness probes
  3. Implement autoscaling based on CPU/memory
  4. Add Prometheus + Grafana dashboards
  5. Simulate failures & verify fault tolerance

Resources

Module 5 – Advanced Distributed Systems

Goal: Handle complexity of multi-node systems.

Topics

  • CAP Theorem deep dive
  • Leader Election (Raft, Paxos)
  • Distributed Transactions – 2PC, Saga Pattern
  • Eventual Consistency
  • Consistent Hashing
  • Service Discovery (Consul, Eureka)

Hands-On

  • Mini Project: Build a distributed key-value store (in Go or Python)
  1. Implement leader election with Raft
  2. Support consistent hashing for partitioning
  3. Add service discovery with Consul
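
For the partitioning step of the key-value store, a consistent hash ring with virtual nodes is one common approach. The sketch below is illustrative; the `HashRing` name, the vnode count, and the SHA-256 based hash are assumptions.

```python
import bisect
import hashlib


def _hash(value):
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent hashing: adding or removing a node only remaps the keys next to it."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []   # sorted list of (position, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node gets many virtual positions to even out the key distribution.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def node_for(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node at or after the key's position, wrapping around the ring.
        idx = bisect.bisect_left(self._ring, (_hash(key),))
        return self._ring[idx % len(self._ring)][1]
```

Compared with `hash(key) % num_nodes`, removing a node here leaves every key that belonged to the surviving nodes where it was.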

Resources

Module 6 – Real-World Scale Architecture

Goal: Integrate all concepts into one large-scale system.

Topics

  • API Security – JWT, OAuth, CORS, Idempotency
  • Blue-Green & Canary Deployments
  • Data Retention & Compliance
  • OLAP vs OLTP optimization
  • Graceful Rollbacks
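
Idempotency in the API layer is often handled with a client-supplied idempotency key. The sketch below keeps results in a dict, which is an assumption for illustration; a production service would need a shared store and protection against concurrent requests carrying the same key.

```python
class IdempotentProcessor:
    """Runs an operation at most once per idempotency key and replays the stored result."""

    def __init__(self):
        self._results = {}   # idempotency key -> stored response (in-memory stand-in)

    def handle(self, idempotency_key, operation, *args):
        if idempotency_key in self._results:
            # Duplicate request (for example a client retry): return the original response.
            return self._results[idempotency_key]
        result = operation(*args)
        self._results[idempotency_key] = result
        return result
```

This lets a client safely retry a payment or order request after a timeout without the side effect happening twice.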

Hands-On – Capstone Project

  • High-scale e-commerce backend with:
  1. API Gateway & microservices
  2. PostgreSQL sharding + read replicas
  3. Redis caching layer
  4. Kafka-based async processing
  5. Prometheus + Grafana monitoring
  6. Kubernetes autoscaling
  7. Canary deployment strategy
  8. Disaster recovery & backup scripts

Resources

Recommended resources
