- Ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues
- Design Principles
- Test recovery procedures
- Use automation to simulate different failures or to recreate scenarios that previously led to a failure
- Automatically recover from failure
- Anticipate and remediate failures before they occur
- Scale horizontally to increase aggregate system availability
- Distribute requests across multiple, smaller resources so that they don't share a common point of failure
- Stop guessing capacity
- Maintain the optimal level of resources to satisfy demand without over- or under-provisioning
- Use auto scaling
- Manage change in automation
- Use automation to make changes to infrastructure
- Foundations
- IAM
- Amazon VPC
- AWS Service Quotas (formerly Service Limits)
- AWS Trusted Advisor
- Change Management
- AWS Auto Scaling
- Amazon CloudWatch
- AWS CloudTrail
- AWS Config
- Failure Management
- AWS Backup
- AWS CloudFormation
- Amazon S3
- Amazon Route 53
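
The "automatically recover from failure" principle, applied to the transient network issues mentioned at the top of these notes, can be sketched on the client side as a retry loop with capped exponential backoff and jitter. This is a minimal illustration, not an AWS API: `TransientError`, `call_with_retries`, and all parameter names and defaults are assumptions made up for the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a dropped connection)."""


def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on TransientError.

    Waits between attempts using capped exponential backoff with full jitter.
    The sleep and rng parameters are injectable so the function can be tested
    without real delays.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # The backoff ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random fraction of the ceiling so that many
            # clients retrying at once do not all hit the service together.
            sleep(ceiling * rng())
```

A caller would wrap the flaky call, for example `call_with_retries(lambda: fetch_report())`, where `fetch_report` is a placeholder for any operation that raises `TransientError` on a temporary failure. Errors of any other type propagate immediately, which keeps permanent failures such as misconfigurations from being retried blindly.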