General Design Principles:
- Stop guessing your capacity needs:
- Test systems at production scale:
- Automate to make architectural experimentation easier:
- Allow for evolutionary architectures:
- Drive architectures using data:
- Improve through game days:
The Well-Architected Framework is built on five pillars: operational excellence, security, reliability, performance efficiency, and cost optimization.
The Operational Excellence pillar includes the ability to run and monitor systems to deliver business value and to continually improve supporting processes and procedures.
- Perform operations as code
- Annotate documentation:
- Make frequent, small, reversible changes
- Refine operations procedures frequently
- Anticipate failure
- Learn from all operational failures
- Prepare (AWS Config)
- 1: How do you determine what your priorities are?
- 2: How do you design your workload so that you can understand its state?
- 3: How do you reduce defects, ease remediation, and improve flow into production?
- 4: How do you mitigate deployment risks?
- 5: How do you know that you are ready to support a workload?
- Operate (Amazon CloudWatch)
- 6: How do you understand the health of your workload?
- 7: How do you understand the health of your operations?
- 8: How do you manage workload and operations events?
- Evolve (Amazon Elasticsearch Service)
- 9: How do you evolve operations?
The Security pillar includes the ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies.
- Implement a strong identity foundation
- Enable traceability:
- Apply security at all layers [edge network, VPC, subnet, load balancer, every instance, operating system, and application]
- Automate security best practices
- Protect data in transit and at rest
- Keep people away from data:
- Prepare for security events:
- Identity and Access Management (IAM)
- 1: How do you manage credentials and authentication?
- 2: How do you control human access?
- 3: How do you control programmatic access?
- Detective Controls (AWS CloudTrail, AWS Config)
- 4: How do you detect and investigate security events?
- 5: How do you defend against emerging security threats?
- Infrastructure Protection (Amazon VPC, AWS WAF, Amazon CloudFront, ELB)
- 6: How do you protect your networks?
- 7: How do you protect your compute resources?
- Data Protection (AWS KMS, server-side encryption (SSE))
- 8: How do you classify your data?
- 9: How do you protect your data at rest?
- 10: How do you protect your data in transit?
- Incident Response (CloudWatch Events, AWS Lambda, AWS CloudFormation to create environments)
- 11: How do you respond to an incident?
The Reliability pillar includes the ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues.
- Test recovery procedures:
- Automatically recover from failure:
- Scale horizontally to increase aggregate system availability:
- Stop guessing capacity:
- Manage change in automation:
- Foundations [IAM, VPC]
- 1: How do you manage service limits?
- 2: How do you manage your network topology?
- Change Management [AWS Config, AWS CloudTrail, Amazon CloudWatch, Auto Scaling]
- 3: How does your system adapt to changes in demand?
- 4: How do you monitor your resources?
- 5: How do you implement change?
- Failure Management [AWS CloudFormation, durable services: Amazon S3, Amazon Glacier, AWS KMS]
- 6: How do you back up data?
- 7: How does your system withstand component failures?
- 8: How do you test resilience?
- 9: How do you plan for disaster recovery?
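One common way to mitigate the transient network issues mentioned in the pillar definition is to retry failed calls with exponential backoff and jitter. The following is a minimal sketch of that idea, not something taken from the framework itself; `TransientError`, `call_with_retries`, and all parameter names are hypothetical and chosen only for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Run `operation`, retrying on TransientError with capped, jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            # Give up and re-raise once the last allowed attempt has failed.
            if attempt == max_attempts - 1:
                raise
            # The backoff ceiling doubles on each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # "Full jitter": wait a random fraction of the ceiling.
            sleep(rng() * ceiling)
    raise AssertionError("unreachable")
```

The `sleep` and `rng` parameters exist only so the behaviour can be exercised without real waiting or randomness.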
The Performance Efficiency pillar includes the ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve.
- Democratize advanced technologies:
- Go global in minutes:
- Use serverless architectures:
- Experiment more often:
- Mechanical sympathy:
- 1: How do you select the best performing architecture?
- 2: How do you select your compute solution?
- 3: How do you select your storage solution?
- 4: How do you select your database solution?
- 5: How do you configure your networking solution?
- 6: How do you evolve your workload to take advantage of new releases?
- 7: How do you monitor your resources to ensure they are performing as expected?
- 8: How do you use tradeoffs to improve performance?
• Compute: Auto Scaling is key to ensuring that you have enough instances to meet demand and maintain responsiveness.
• Storage: Amazon EBS, Amazon S3
• Database: Amazon RDS provides a wide range of database features (such as PIOPS and read replicas) that allow you to optimize for your use case. Amazon DynamoDB provides single-digit millisecond latency at any scale.
• Network: Amazon Route 53 provides latency-based routing. Amazon VPC endpoints and AWS Direct Connect can reduce network distance or jitter.
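To make the Compute point above concrete, here is a rough sketch of how a scaling policy might size a fleet to meet demand. It uses a simple proportional rule and is an assumption made for illustration, not the exact algorithm Auto Scaling applies; `desired_capacity` and its parameters are hypothetical names.

```python
import math


def desired_capacity(
    current_instances: int,
    observed_metric: float,
    target_metric: float,
    min_size: int,
    max_size: int,
) -> int:
    """Estimate how many instances keep the average metric near the target."""
    if current_instances <= 0 or target_metric <= 0:
        raise ValueError("current_instances and target_metric must be positive")
    # Proportional rule: a fleet running at twice the target should double.
    raw = math.ceil(current_instances * observed_metric / target_metric)
    # Never go below the group minimum or above the group maximum.
    return max(min_size, min(max_size, raw))


# Example: 4 instances averaging 80% CPU against a 50% target.
print(desired_capacity(4, 80.0, 50.0, min_size=2, max_size=10))
```

Clamping to a minimum and maximum keeps a noisy metric from shrinking the group to zero or growing it without bound.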
The AWS Blog and the What’s New section on the AWS website are resources for learning about newly launched features and services.
Amazon CloudWatch provides metrics, alarms, and notifications that you can integrate with your existing monitoring solution, and that you can use with AWS Lambda to trigger actions.
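The metric, alarm, and action flow described above can be sketched in plain Python. This is a simplified model rather than the CloudWatch API: `ThresholdAlarm` is a hypothetical class, and the `action` callback stands in for whatever the alarm would trigger, such as a Lambda function.

```python
from collections import deque
from typing import Callable, Deque


class ThresholdAlarm:
    """Fires `action` once when the last N datapoints all exceed a threshold."""

    def __init__(
        self,
        threshold: float,
        evaluation_periods: int,
        action: Callable[[float], None],
    ) -> None:
        if evaluation_periods < 1:
            raise ValueError("evaluation_periods must be at least 1")
        self.threshold = threshold
        self.evaluation_periods = evaluation_periods
        self.action = action
        # Only the most recent `evaluation_periods` datapoints are kept.
        self.recent: Deque[float] = deque(maxlen=evaluation_periods)
        self.in_alarm = False

    def record(self, value: float) -> None:
        """Add one datapoint and run the action on a transition into alarm."""
        self.recent.append(value)
        breaching = len(self.recent) == self.evaluation_periods and all(
            v > self.threshold for v in self.recent
        )
        if breaching and not self.in_alarm:
            self.action(value)
        self.in_alarm = breaching
```

Requiring several consecutive breaching datapoints is a design choice that avoids reacting to a single spike.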
Amazon ElastiCache, Amazon CloudFront, and AWS Snowball are services that allow you to improve performance.
The Cost Optimization pillar includes the ability to run systems to deliver business value at the lowest price point.
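Caching is a typical performance tradeoff: reads get faster in exchange for possibly serving slightly stale data. The sketch below shows the cache-aside pattern with a time-to-live, using an in-memory dictionary in place of a real cache service; `CacheAside`, `loader`, and `clock` are hypothetical names used only for illustration.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class CacheAside:
    """Serve reads from a local cache and fall back to a slower loader on a miss."""

    def __init__(
        self,
        loader: Callable[[Hashable], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.loader = loader
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # Maps key -> (value, expiry time).
        self.store: Dict[Hashable, Tuple[Any, float]] = {}

    def get(self, key: Hashable) -> Any:
        now = self.clock()
        entry = self.store.get(key)
        # Cache hit: the entry exists and has not expired yet.
        if entry is not None and entry[1] > now:
            return entry[0]
        # Cache miss or expired entry: load from the source and remember it.
        value = self.loader(key)
        self.store[key] = (value, now + self.ttl_seconds)
        return value
```

A shorter TTL reduces staleness but sends more reads to the backing store, which is the tradeoff being tuned.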
- Adopt a consumption model:
- Measure overall efficiency:
- Stop spending money on data center operations
- Analyze and attribute expenditure:
- Use managed and application level services to reduce cost of ownership:
- Expenditure Awareness
- 1: How do you govern usage?
- 2: How do you monitor usage and cost?
- 3: How do you decommission resources?
- Cost-Effective Resources
- 4: How do you evaluate cost when you select services?
- 5: How do you meet cost targets when you select resource type and size?
- 6: How do you use pricing models to reduce cost?
- 7: How do you plan for data transfer charges?
- Matching supply and demand
- 8: How do you match supply of resources with demand?
- Optimizing Over Time
- 9: How do you evaluate new services?
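Attributing expenditure usually starts by grouping costs by an owner or project tag. The following is a minimal sketch of that grouping over invented line items; the record layout, the `team` tag key, and `cost_by_tag` are all assumptions for illustration and do not reflect any real billing export format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def cost_by_tag(
    line_items: Iterable[Mapping],
    tag_key: str,
    untagged_label: str = "untagged",
) -> Dict[str, float]:
    """Sum the `cost` field of each line item, grouped by the value of one tag."""
    totals: Dict[str, float] = defaultdict(float)
    for item in line_items:
        # Items without the tag are collected under a visible catch-all bucket.
        owner = item.get("tags", {}).get(tag_key, untagged_label)
        totals[owner] += item["cost"]
    return dict(totals)


# Hypothetical line items, not real billing data.
items = [
    {"resource": "vm-1", "cost": 10.0, "tags": {"team": "search"}},
    {"resource": "vm-2", "cost": 2.5, "tags": {"team": "search"}},
    {"resource": "db-1", "cost": 4.0, "tags": {"team": "billing"}},
    {"resource": "bucket-1", "cost": 1.5},
]
print(cost_by_tag(items, "team"))
```

Keeping untagged spend as its own bucket makes gaps in tagging visible instead of silently dropping them.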
More about this topic: https://wa.aws.amazon.com/