Cloud Architecture
A complete set of senior-level cloud architecture interview questions covering IAM and identity, networking and VPC design, compute and serverless, storage and databases, messaging and event-driven systems, observability, security, cost optimization, high availability, disaster recovery, and the Well-Architected Framework across AWS, GCP, and Azure.
IAM & Identity
AWS IAM controls who can do what across all AWS services. Access is denied by default: you must explicitly allow every action.
IAM Principals:
- IAM User: Long-term identity for a person. Has username/password and/or access keys. Avoid for applications — use roles instead.
- IAM Role: Identity assumed temporarily by EC2, Lambda, another account, or a federated user. STS issues short-lived credentials. The right choice for all workloads.
- IAM Group: Collection of users that share policies. Not a principal — can't be assumed or referenced in resource policies.
# Identity-based policy (attached to user/role):
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Action": ["s3:GetObject", "s3:PutObject"],
"Resource": "arn:aws:s3:::my-bucket/*",
"Condition": {"StringEquals": {"aws:RequestedRegion": "us-east-1"}}
}]
}
# Resource-based policy (e.g., S3 bucket policy — enables cross-account):
{
"Principal": {"AWS": "arn:aws:iam::111122223333:role/DataTeam"},
"Effect": "Allow",
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::analytics-bucket/*"
}
# Permission evaluation order (an explicit Deny wins at every step):
# 1. Explicit Deny in ANY policy → always denied
# 2. SCP (org ceiling) → must allow
# 3. Resource-based policy OR identity-based policy → either can grant within one account
#    (cross-account access needs an Allow on both sides)
# 4. IAM permissions boundary → caps what the identity can do
# 5. Session policy → further restricts an assumed role
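The evaluation order above can be sketched as a small pure-Python function. This is a simplified model under stated assumptions, not AWS's actual engine: it only matches on action names (no resources, conditions, or cross-account rules), and the `evaluate` function, its parameters, and the decision strings are hypothetical names chosen for illustration.

```python
from fnmatch import fnmatchcase


def _matches(statement, action):
    # IAM action names are case-insensitive and may use * wildcards.
    return any(fnmatchcase(action.lower(), pattern.lower())
               for pattern in statement["Action"])


def _has(statements, effect, action):
    # True if any statement with the given effect matches the action.
    return any(s["Effect"] == effect and _matches(s, action) for s in statements)


def evaluate(action, identity=(), resource=(), scps=None, boundary=None, session=None):
    """Return a decision string. A layer left as None is treated as not in use."""
    layers = [identity, resource, scps or (), boundary or (), session or ()]
    if any(_has(layer, "Deny", action) for layer in layers):
        return "Deny (explicit)"
    if scps is not None and not _has(scps, "Allow", action):
        return "Deny (SCP)"
    if not (_has(identity, "Allow", action) or _has(resource, "Allow", action)):
        return "Deny (implicit)"
    if boundary is not None and not _has(boundary, "Allow", action):
        return "Deny (permissions boundary)"
    if session is not None and not _has(session, "Allow", action):
        return "Deny (session policy)"
    return "Allow"


identity = [{"Effect": "Allow", "Action": ["s3:*"]}]
boundary = [{"Effect": "Allow", "Action": ["s3:GetObject"]}]
print(evaluate("s3:GetObject", identity=identity, boundary=boundary))
print(evaluate("s3:PutObject", identity=identity, boundary=boundary))
```

The second call is refused even though the identity policy allows `s3:*`, because the boundary only allows `s3:GetObject`.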
Best practices: Delete root access keys. Enable MFA on root and privileged users. Use roles for EC2/Lambda/ECS. Apply least privilege. Use IAM Access Analyzer to identify over-permissive policies and unused access.
SCPs (Service Control Policies) set maximum permissions for the member accounts in an AWS Organization. They don't grant permissions; they restrict what IAM can grant. Even a member account's root user is bound by SCPs, although the organization's management account is not affected by them.
# SCP: prevent disabling security services and leaving the org
{
"Statement": [
{"Effect": "Deny", "Action": "organizations:LeaveOrganization", "Resource": "*"},
{"Effect": "Deny",
"Action": ["cloudtrail:StopLogging","cloudtrail:DeleteTrail","guardduty:DeleteDetector"],
"Resource": "*"},
{"Effect": "Deny",
"Action": ["iam:CreateUser", "iam:CreateAccessKey"],
"Resource": "*",
"Condition": {"StringNotEquals": {"aws:PrincipalTag/Team": "platform"}}}
]
}
# Restrict to approved regions only:
{
"Effect": "Deny",
"NotAction": ["iam:*","sts:*","support:*","trustedadvisor:*"],
"Resource": "*",
"Condition": {
"StringNotEquals": {"aws:RequestedRegion": ["us-east-1","eu-west-1"]}
}
}
# SCP hierarchy: Root OU → Production OU → Account
# SCPs are inherited down the tree: a child OU or account cannot re-allow what a parent denies
# AWS Control Tower: managed Landing Zone with pre-built SCPs (guardrails)
# AWS IRSA (IAM Roles for Service Accounts):
# EKS cluster acts as an OIDC provider
# IAM role trust policy allows a specific Kubernetes service account:
{
"Principal": {"Federated": "arn:aws:iam::ACCOUNT:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/CLUSTER"},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"oidc.eks.us-east-1.amazonaws.com/id/CLUSTER:sub":
"system:serviceaccount:my-namespace:my-service-account"
}
}
}
# Annotate the Kubernetes service account:
kubectl annotate serviceaccount my-service-account \
eks.amazonaws.com/role-arn=arn:aws:iam::ACCOUNT:role/MyRole
# Pod SDK auto-gets credentials — no secret needed anywhere!
import boto3
s3 = boto3.client('s3') # uses IRSA credentials automatically
# GKE Workload Identity: same concept
gcloud iam service-accounts add-iam-policy-binding gsa@project.iam.gserviceaccount.com \
--role roles/iam.workloadIdentityUser \
--member "serviceAccount:project.svc.id.goog[namespace/ksa-name]"
# Benefits:
# No static AWS_ACCESS_KEY_ID in secrets
# Credentials rotate automatically (temporary STS tokens)
# Per-pod IAM — each service has minimal permissions
# Audit trail: CloudTrail shows which pod assumed which role
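As a rough illustration of the trust policy shown above, the helper below builds the same document for a given namespace and service account. The function name `irsa_trust_policy` and the sample account ID are hypothetical. The extra `aud` condition is not in the snippet above; it is a commonly recommended addition that pins the token audience to `sts.amazonaws.com`.

```python
import json


def irsa_trust_policy(account_id, oidc_provider, namespace, service_account):
    """Build an IAM trust policy that lets one Kubernetes service account assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                # Only this namespace/service account pair may assume the role.
                f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                # Assumption: audience check added on top of the original snippet.
                f"{oidc_provider}:aud": "sts.amazonaws.com",
            }},
        }],
    }


policy = irsa_trust_policy(
    "123456789012",  # placeholder account ID
    "oidc.eks.us-east-1.amazonaws.com/id/CLUSTER",
    "my-namespace",
    "my-service-account",
)
print(json.dumps(policy, indent=2))
```

Generating the policy from the namespace and service account name keeps the `sub` string consistent with the annotation on the Kubernetes side, which is where typos usually creep in.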
# Architecture: Corporate IdP (Azure AD / Okta) → IAM Identity Center → AWS Accounts
# One login portal → employees access all assigned accounts and roles
# Generates temporary credentials — no long-lived access keys
# Setup:
# 1. Connect external IdP via SCIM (auto-provision users/groups) + SAML (authentication)
# 2. Create Permission Sets (IAM role templates):
# - AdministratorAccess (Platform team only)
# - DeveloperAccess (read most things, write to own services)
# - ReadOnly (security auditors, management)
# 3. Assign: Group "engineering-team" → Permission Set "DeveloperAccess" → Account "prod-app"
# Assignment example (--target-id is the AWS account ID):
aws sso-admin create-account-assignment \
--instance-arn arn:aws:sso:::instance/ssoins-xxx \
--target-id 123456789012 \
--target-type AWS_ACCOUNT \
--permission-set-arn arn:aws:sso:::permissionSet/ssoins-xxx/ps-xxx \
--principal-type GROUP \
--principal-id group-id-from-idp
# User experience:
# Employee opens AWS access portal → selects account → selects role → opens console
# CLI: aws sso login → credentials cached for session
# Temporary credentials: 1-12hr session duration
# Key benefit: deprovisioning
# Employee leaves company → disable in AD → SCIM removes from IAM Identity Center
# → All AWS access revoked automatically within minutes
A permission boundary caps the maximum permissions a user or role can have. Effective permissions = intersection of the identity policy AND the boundary. Key use case: allow developers to create IAM roles without being able to escalate privileges.
# Platform team creates a boundary: developers can use app services, not IAM admin
{
"Statement": [
{"Effect": "Allow", "Action": ["s3:*","dynamodb:*","lambda:*","logs:*","xray:*"], "Resource": "*"},
{"Effect": "Deny", "Action": ["iam:CreatePolicy","iam:AttachRolePolicy","organizations:*"], "Resource": "*"}
]
}
# The developers' CI/CD role can create roles, BUT only with the boundary attached:
{
"Effect": "Allow",
"Action": ["iam:CreateRole","iam:PutRolePolicy"],
"Resource": "*",
"Condition": {
"StringEquals": {
"iam:PermissionsBoundary": "arn:aws:iam::ACCOUNT:policy/DeveloperBoundary"
}
}
}
# Result: developer CI/CD can create roles for their Lambda functions
# but those roles can NEVER have IAM admin permissions
# even if developer's policy accidentally grants it, boundary blocks it
# CDK bootstrap uses permission boundaries this way
# AWS CDK --cloudformation-execution-policies flag for CI/CD safety
# GCP IAM: "member has role on resource"
# Binding: serviceAccount:sa@project.iam → roles/storage.objectViewer → bucket/my-bucket
# Resource hierarchy (policies inherit down):
# Organization → Folder → Project → Resource
# A binding at folder level applies to all projects in that folder
gcloud projects add-iam-policy-binding my-project \
--member="serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
--role="roles/storage.objectViewer" \
--condition="expression=request.time < timestamp('2025-01-01T00:00:00Z'),title=temp-access"
# Role types:
# Basic: Owner > Editor > Viewer (too broad — avoid)
# Predefined: roles/storage.objectViewer (narrowly scoped)
# Custom: exact permission set you define
# Organization Policy constraints (≈ AWS SCPs):
# Enforce security baselines across all projects in the org
gcloud resource-manager org-policies enable-enforce compute.requireOsLogin --organization=ORG_ID
gcloud resource-manager org-policies enable-enforce iam.disableServiceAccountKeyCreation --organization=ORG_ID
# List constraints take allowed values, e.g. restricting resource locations (the value is an example):
gcloud resource-manager org-policies allow gcp.resourceLocations in:us-locations --organization=ORG_ID
# Recommended policies:
# disableServiceAccountKeyCreation: force use of workload identity
# requireOsLogin: SSH key management via IAM
# iam.allowedPolicyMemberDomains: only allow your org's domain as IAM members
# storage.uniformBucketLevelAccess: no per-object ACLs (simpler security model)
Azure has two separate RBAC systems that are often confused — understanding the difference is critical.
- Azure AD roles (Entra ID): Manage identities in Azure AD itself. Global Administrator, User Administrator, Application Administrator. Directory-level, not resource-level.
- Azure resource roles: Control access to Azure resources. Owner, Contributor, Reader, Storage Blob Data Contributor. Applied at Management Group → Subscription → Resource Group → Resource scope.
# Assign resource role:
az role assignment create \
--assignee my-managed-identity \
--role "Storage Blob Data Reader" \
--scope "/subscriptions/SUB-ID/resourceGroups/my-rg"
# Custom role:
az role definition create --role-definition '{
"Name": "App Restarter",
"Actions": ["Microsoft.Web/sites/restart/action"],
"AssignableScopes": ["/subscriptions/SUB-ID"]
}'
# Azure PIM (Privileged Identity Management):
# Just-in-time role activation — no standing admin privileges
# Eligible vs Active assignments
# Approval workflows, MFA required, time-limited (8 hours by default, configurable up to 24)
# All activations audited to Azure AD audit log
# Managed Identity (≈ AWS IAM role for workloads):
# System-assigned: tied to the resource lifecycle (VM, App Service)
# User-assigned: standalone identity, can be assigned to multiple resources
# App Service/Function with managed identity → calls Key Vault without credentials
# Cross-account role assumption:
# Account B creates a role with trust policy allowing Account A:
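{
"Effect": "Allow",
"Principal": {"AWS": "arn:aws:iam::ACCOUNT-A:root"},
"Action": "sts:AssumeRole",
"Condition": {"Bool": {"aws:MultiFactorAuthPresent": "true"}}
}
# The MFA condition is optional.
# Account A assumes the role:
import boto3
assumed = boto3.client('sts').assume_role(
    RoleArn="arn:aws:iam::ACCOUNT-B:role/CrossAccountRole",
    RoleSessionName="my-session",
    DurationSeconds=3600,
)
# Use temp credentials to call services in Account B
creds = assumed["Credentials"]
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)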
# Multi-account strategy:
# 1. Account-per-environment: dev / staging / prod in separate accounts
# → No accidental prod deployments from dev
# → Blast radius containment
# 2. Account-per-team: each team owns their AWS account
# → Clear billing attribution (cost per team)
# → Independent IAM policies
# 3. Shared services account:
# → Central DNS, logging, security tooling, transit gateway
# → Spoke accounts trust hub for shared services
# AWS Resource Access Manager (RAM):
# Share resources across accounts without complex resource policies
# Share: VPC subnets, Transit Gateways, Route 53 Resolver rules, License Manager configurations
aws ram create-resource-share \
--name my-subnet-share \
--resource-arns arn:aws:ec2:us-east-1:ACCOUNT:subnet/subnet-xxx \
--principals arn:aws:organizations::ACCOUNT:organization/o-xxx
Networking & VPC
# Three-tier VPC across 3 Availability Zones:
VPC: 10.0.0.0/16 (65,536 IPs)
# Public subnets (ALB, NAT Gateway, Bastion):
# 10.0.1.0/24 10.0.2.0/24 10.0.3.0/24
# Route: 0.0.0.0/0 → Internet Gateway
# Private app subnets (ECS tasks, EKS nodes, Lambda in VPC):
# 10.0.11.0/24 10.0.12.0/24 10.0.13.0/24
# Route: 0.0.0.0/0 → NAT Gateway (one per AZ for resilience)
# Isolated data subnets (RDS, ElastiCache — no internet):
# 10.0.21.0/24 10.0.22.0/24 10.0.23.0/24
# Route: local only
# Security Groups (stateful, per-resource):
# ALB SG: inbound 443 from 0.0.0.0/0
# App SG: inbound 8080 from ALB SG only
# DB SG: inbound 5432 from App SG only
# Network ACLs (stateless, per-subnet — extra layer):
# Block known bad IPs at subnet level
# Must allow ephemeral ports 1024-65535 for return traffic
# VPC Flow Logs → S3/CloudWatch for network visibility
# Gateway endpoints for S3/DynamoDB (free, no NAT needed)
# Interface endpoints for ECR, CloudWatch, Secrets Manager (keep traffic private)
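A quick way to sanity-check a subnet plan like the one above is Python's standard `ipaddress` module. The sketch below assumes the CIDRs listed above; the `validate_plan` helper and the tier names are illustrative. It also subtracts the 5 addresses AWS reserves in every subnet.

```python
import ipaddress

VPC = ipaddress.ip_network("10.0.0.0/16")
TIERS = {
    "public": ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"],
    "private-app": ["10.0.11.0/24", "10.0.12.0/24", "10.0.13.0/24"],
    "isolated-data": ["10.0.21.0/24", "10.0.22.0/24", "10.0.23.0/24"],
}
AWS_RESERVED_PER_SUBNET = 5  # network, VPC router, DNS, future use, broadcast


def validate_plan(vpc, tiers):
    """Check every subnet sits inside the VPC and none overlap; return usable IPs per subnet."""
    subnets = [ipaddress.ip_network(cidr) for cidrs in tiers.values() for cidr in cidrs]
    for subnet in subnets:
        if not subnet.subnet_of(vpc):
            raise ValueError(f"{subnet} is outside {vpc}")
    for i, first in enumerate(subnets):
        for second in subnets[i + 1:]:
            if first.overlaps(second):
                raise ValueError(f"{first} overlaps {second}")
    return {str(s): s.num_addresses - AWS_RESERVED_PER_SUBNET for s in subnets}


usable = validate_plan(VPC, TIERS)
print(VPC.num_addresses)      # 65536
print(usable["10.0.1.0/24"])  # 251
```

Each /24 leaves 251 usable addresses, which matters for EKS nodes and Lambda ENIs in the private app tier.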
VPC Peering: Direct one-to-one link between two VPCs. No transitive routing — if A↔B and B↔C, A cannot reach C via B. For n VPCs, requires n(n-1)/2 connections. Simple and cheap for a small number of VPCs.
Transit Gateway: Hub-and-spoke — all VPCs attach to one TGW, enabling transitive routing. Route tables on the TGW control which attachments can reach which. Scales to thousands of VPCs.
# VPC Peering: 10 VPCs = 45 connections; 100 VPCs = 4,950 connections
# Transit Gateway: 100 VPCs = 100 attachments — manageable
# TGW segmentation with multiple route tables:
# prod-rt: production VPCs ↔ shared-services VPC (isolated from dev)
# dev-rt: dev VPCs ↔ shared-services VPC (can't reach prod)
# → Strong isolation even though all use the same TGW
# Centralized egress via TGW (cost saving):
# All private VPCs → TGW → single egress VPC with NAT Gateway
# Avoids NAT Gateways in every VPC (each costs roughly $32/month in hourly charges at $0.045/hr, before data processing fees)
# On-premises connectivity via TGW:
# Direct Connect / VPN → TGW → all VPCs (single connection serves all)
# AWS RAM: share TGW across accounts in org
# Network account owns TGW, spoke accounts attach their VPCs
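The connection counts quoted above follow directly from the n(n-1)/2 formula for a full mesh. A minimal sketch (function names are illustrative):

```python
def peering_connections(vpcs):
    """Full-mesh VPC peering: every pair of VPCs needs its own connection."""
    return vpcs * (vpcs - 1) // 2


def tgw_attachments(vpcs):
    """Hub-and-spoke: one Transit Gateway attachment per VPC."""
    return vpcs


for n in (10, 100):
    print(n, peering_connections(n), tgw_attachments(n))
```

Peering grows quadratically while Transit Gateway attachments grow linearly, which is why the hub-and-spoke model wins once the VPC count gets past a handful.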
# PrivateLink: private connectivity to AWS services or your own services
# Traffic stays within the AWS network backbone — never touches the internet
# Gateway endpoints (S3 and DynamoDB — free):
# Route table entry points S3/DynamoDB traffic to endpoint directly
# Lambda/ECS in private subnet → S3 without NAT Gateway
# Interface endpoints (most other services; paid, about $0.01/hr per AZ + $0.01/GB):
# Creates an ENI in your subnet with a private IP
# DNS resolution: secretsmanager.us-east-1.amazonaws.com → 10.0.1.x (private IP)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-xxx \
--vpc-endpoint-type Interface \
--service-name com.amazonaws.us-east-1.secretsmanager \
--subnet-ids subnet-xxx \
--security-group-ids sg-xxx \
--private-dns-enabled
# Why this matters:
# Lambda/ECS calling Secrets Manager without NAT Gateway → saves cost + latency
# Compliance: helps meet PCI-DSS/HIPAA expectations that sensitive calls stay off the public internet
# Security: internal traffic can't be intercepted at internet layer
# PrivateLink for your own services (cross-account/SaaS):
# Create NLB → VPC Endpoint Service
# Consumers create Interface Endpoints to reach your service privately
# Snowflake, Datadog, MongoDB Atlas all support PrivateLink connectivity
# Site-to-Site VPN:
# IPSec tunnel over the internet. Up to 1.25 Gbps per tunnel.
# Setup in minutes. Cost: ~$0.05/hr + data transfer.
# Use for: quick setup, moderate bandwidth, backup connectivity
# AWS Direct Connect:
# Dedicated fiber circuit from on-prem to AWS Direct Connect location
# 1/10/100 Gbps dedicated bandwidth. Consistent latency (~1ms in same metro).
# Setup: weeks/months (physical provisioning). Cost: port fee from roughly $220/month (1 Gbps) to about $16,000/month (100 Gbps).
# Use for: high bandwidth (>1 Gbps), compliance (no internet), financial workloads
# Resilient hybrid architecture:
# Primary: Direct Connect (dedicated 10 Gbps)
# Secondary: Site-to-Site VPN (automatic BGP failover if DX fails)
# DX route advertised with better MED → VPN used only when DX is down
# Direct Connect Gateway:
# One DX circuit → VPCs in multiple AWS regions
# Attach to TGW in each region → all VPCs reach on-premises
# Pricing comparison (100GB/month data transfer to on-prem):
# VPN: $36 + $9 data = $45/month
# DX 1 Gbps: $220 port + $2 data = $222/month (better at higher volumes)
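The pricing comparison above can be turned into a break-even calculation. The sketch below uses only the figures quoted above ($36 fixed and $0.09/GB for VPN, $220 fixed and $0.02/GB for a 1 Gbps Direct Connect port); real bills vary by region and exclude items such as cross-connect fees, so treat the numbers as illustrative.

```python
VPN = {"fixed": 36.0, "per_gb": 0.09}   # figures from the comparison above
DX = {"fixed": 220.0, "per_gb": 0.02}   # 1 Gbps port, figures from the comparison above


def monthly_cost(option, gb_out):
    """Fixed monthly fee plus per-GB data transfer out to on-premises."""
    return option["fixed"] + option["per_gb"] * gb_out


def break_even_gb(cheap_fixed, cheap_variable):
    """Monthly volume at which the high-fixed, low-variable option becomes cheaper."""
    return (cheap_variable["fixed"] - cheap_fixed["fixed"]) / (
        cheap_fixed["per_gb"] - cheap_variable["per_gb"]
    )


print(round(monthly_cost(VPN, 100), 2), round(monthly_cost(DX, 100), 2))
print(round(break_even_gb(VPN, DX)))
```

Under these assumptions Direct Connect only becomes cheaper on data transfer alone at a little over 2.6 TB per month; below that, the case for it rests on bandwidth, latency, and compliance rather than cost.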
# Route 53 routing policies:
# Simple: single value, no health checks
# Weighted: A/B testing, gradual migrations
# api.example.com → v1-alb: weight=80, v2-alb: weight=20
# Latency-based: route to lowest-latency region
# US users → us-east-1 ALB; EU users → eu-west-1 ALB
# Geolocation: data residency (GDPR)
# EU users → EU endpoint; North America → US endpoint
# Default record required for unmatched locations
# Failover: active-passive
# Primary: prod-alb (healthy = serves all traffic)
# Secondary: dr-alb (used only when primary health check fails)
# Health check: HTTP 200 on /health every 30s from 3 regions
# Geoproximity: traffic shifting with bias adjustment
# Multivalue: up to 8 healthy IPs (basic client-side load balancing)
# Health check configuration:
aws route53 create-health-check --caller-reference mycheck \
--health-check-config '{
"IPAddress": "1.2.3.4",
"Port": 443, "Type": "HTTPS",
"ResourcePath": "/health",
"RequestInterval": 30,
"FailureThreshold": 3
}'
# TTL recommendation: 60s (balance responsiveness vs DNS caching)
# Private hosted zones: internal DNS within VPC (api.internal → 10.0.1.15)
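Two pieces of arithmetic sit behind the weighted routing and health check settings above. In this sketch the function names are illustrative, and the detection time is a rough lower bound (interval × threshold) that ignores how Route 53 aggregates results from multiple checkers and any DNS caching on top.

```python
def traffic_share(weights):
    """Weighted routing: each record gets weight / sum(weights) of the traffic."""
    total = sum(weights.values())
    if total == 0:
        # With every weight at 0, Route 53 spreads traffic equally.
        return {name: 1 / len(weights) for name in weights}
    return {name: weight / total for name, weight in weights.items()}


def detection_seconds(request_interval, failure_threshold):
    """Approximate time for consecutive failed checks to mark an endpoint unhealthy."""
    return request_interval * failure_threshold


print(traffic_share({"v1-alb": 80, "v2-alb": 20}))
print(detection_seconds(30, 3))
```

With a 30-second interval and a threshold of 3, failover takes on the order of 90 seconds plus the record TTL, which is the reasoning behind keeping TTLs around 60 seconds.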
# CloudFront request flow:
# User → Nearest Edge (450+ PoPs) → Cache hit? Serve. Miss? → Origin (ALB/S3/API GW)
# Origins:
# S3 with OAC (Origin Access Control): only CloudFront can read S3 (not public)
# ALB: restrict via security group allowing only CloudFront IP ranges
# Cache behaviors by path:
# /api/* → ALB origin, TTL=0 (dynamic, no caching), forward all headers
# /static/* → S3 origin, TTL=86400 (1 day), cache headers/cookies stripped
# Default → ALB origin, TTL=300
# Lambda@Edge vs CloudFront Functions:
# CF Functions: <1ms, run at every edge location, viewer request/response only
# → URL rewrites, header manipulation, simple auth token check
# Lambda@Edge: full Lambda, ~13 regional edge caches, all 4 events
# → JWT validation, A/B testing, SSR, geo-based content
# Common patterns:
# 1. SPA hosting: S3 → CloudFront (HTTPS, custom domain, DDoS protection)
# 2. API caching: cache GET responses at edge, skip POST/PUT
# 3. Signed URLs: time-limited access to private S3 content
# 4. WAF at edge: block attacks before they reach origin
# 5. Geographic restriction: deny specific countries at edge
# ALB (Application LB) — Layer 7, HTTP/HTTPS/gRPC:
# Content-based routing: path (/api → service-a, /v2 → service-b)
# Host-based routing: api.example.com vs www.example.com
# Target types: EC2, IPs (containers), Lambda
# WAF integration, request/response headers, redirect rules
# Use for: web applications, microservices, REST APIs, gRPC services
# NLB (Network LB) — Layer 4, TCP/UDP:
# Extreme performance: millions req/s, ~100µs latency
# Static IPs per AZ (whitelist in customer firewalls)
# Preserves source IP address (ALB passes it in the X-Forwarded-For header instead)
# TLS passthrough (ALB terminates TLS; NLB can pass through to backend)
# Use for: high-performance APIs, PrivateLink services, gaming, IoT, VoIP
# GWLB (Gateway LB) — Layer 3, for network appliances:
# Transparent bump-in-the-wire for security appliances (Palo Alto, Fortinet)
# GENEVE protocol encapsulation; traffic inspected then forwarded
# Use for: centralized ingress/egress inspection, IDS/IPS
# Target group health check tips:
# Reduce interval to 10s for faster failure detection
# Deregistration delay: reduce from 300s to 30s for faster deployments
# Evaluate target health: ALB/NLB check actual backend health, not just port
GCP's VPC is fundamentally global, while AWS VPC is regional. A single GCP VPC spans all regions — subnets are regional but the network is not. This means VMs in different regions can communicate privately without VPC peering or transit gateways.
# GCP VPC — one VPC, multiple regional subnets:
gcloud compute networks create my-vpc --subnet-mode=custom
gcloud compute networks subnets create us-subnet \
--network=my-vpc --region=us-central1 --range=10.0.1.0/24
gcloud compute networks subnets create eu-subnet \
--network=my-vpc --region=europe-west1 --range=10.0.2.0/24
# VM in us-central1 and VM in europe-west1 communicate over internal IPs with no peering (once firewall rules allow the traffic)
# Shared VPC (equivalent to AWS RAM for subnets):
# Host project owns the VPC and subnets
# Service projects can create resources using host project's subnets
# Centralized network control, distributed app ownership
# Common: one network team manages Shared VPC; dev teams deploy apps in it
# GCP Firewall rules (equivalent to AWS security groups):
gcloud compute firewall-rules create allow-web \
--network=my-vpc \
--allow=tcp:443 \
--source-ranges=0.0.0.0/0 \
--target-tags=web-server # apply to VMs with this network tag
# Cloud Interconnect (≈ Direct Connect):
# Dedicated: 10/100 Gbps to GCP PoP
# Partner: through telco partner (lower bandwidth options)
# Cloud Router: BGP routing between GCP and on-prem
# GCP Global Load Balancer:
# Single anycast IP serves all regions
# Routes to nearest healthy backend globally
# HTTP(S) LB → backends in multiple regions = global HA with one IP
Compute & Serverless
# Purchasing options:
# On-Demand: full price, no commitment → dev/test, spiky unpredictable loads
# Spot: up to 90% off, 2-min interruption notice → batch, ML training, CI/CD
# Reserved (1 or 3 year): 40-72% off → steady-state production workloads
# Savings Plans (Compute): commit to $/hr, covers EC2+Fargate+Lambda → most flexible
# Dedicated Host: physical server (BYOL licensing, compliance)
# Graviton (ARM): m7g/c7g/r7g — 20-40% better price/perf for most workloads
# Instance families:
# m-series: general purpose; c-series: compute (CPU-bound)
# r-series: memory (in-memory DBs, caches); i-series: storage (local NVMe)
# p/g-series: GPU (ML); inf2: ML inference
# Right-sizing process:
# 1. Enable CloudWatch detailed monitoring + CW Agent for memory
# 2. AWS Compute Optimizer: ML-based recommendations from actual utilization
# 3. Target: p95 CPU 40-70%, p95 memory 60-80%
# 4. Downsize if p95 CPU < 20% consistently over 14 days
# 5. Benchmark: test Graviton equivalent — usually 20-40% cheaper, same performance
# 6. Switch to Savings Plans AFTER right-sizing (commit to right size)
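The thresholds in the right-sizing process above can be expressed as a small decision function. This is a sketch, not Compute Optimizer's logic: the `recommend` function and its return labels are hypothetical, the "upsize" branch is an assumption (the steps above only mention downsizing), and the percentile uses the simple nearest-rank method.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def recommend(cpu_samples, mem_samples):
    """Classify an instance from utilization samples (percent) over the review window."""
    cpu95 = percentile(cpu_samples, 95)
    mem95 = percentile(mem_samples, 95)
    if cpu95 > 70 or mem95 > 80:
        return "upsize"    # assumption: above the target band means more capacity
    if cpu95 < 20:
        return "downsize"  # p95 CPU under 20% over the window
    if 40 <= cpu95 <= 70:
        return "keep"      # inside the target CPU band
    return "review"        # between 20% and 40%: a judgment call


print(recommend([12, 15, 9, 18, 11], [35, 40, 38, 42, 37]))
```

Feeding it 14 days of CloudWatch datapoints per instance gives a first-pass list to compare against Compute Optimizer's own recommendations.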
# Lambda execution environment lifecycle:
# Cold start: new Firecracker microVM → runtime init → function init → handler
# Warm: reuse existing environment → handler only
# Cold start durations by runtime:
# Node.js/Python: ~200-500ms Go: ~100-200ms
# Java (no SnapStart): 1-3s Java (SnapStart): ~200ms
# .NET: ~300-600ms
# Minimizing cold starts:
# 1. Provisioned concurrency (guaranteed warm environments):
aws lambda put-provisioned-concurrency-config \
--function-name my-fn --qualifier prod \
--provisioned-concurrent-executions 10
# Use auto-scaling for provisioned concurrency: scale on schedule or metric
# 2. Lambda SnapStart (Java): snapshot initialized JVM, resume from snapshot
# 3. Reduce package size: smaller zip = faster download and init
# 4. Move init outside handler:
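import boto3
_s3 = boto3.client('s3')  # initialized once per execution environment, reused while warm
# Database connections belong here too, so they are reused across invocations:
# _db_conn = connect_db()
def handler(event, context):
    # Reuses the warm client; the "bucket" and "key" event fields are assumed for illustration
    return _s3.get_object(Bucket=event["bucket"], Key=event["key"])["ContentLength"]
# 5. Use Lambda layers for shared dependencies (cached separately)
# 6. Lambda Function URLs: direct HTTPS endpoint without API GW overhead
# Lambda limits: 15 min max, 10GB memory, 50MB zipped / 250MB unzipped deployment package (10GB as a container image)
# Concurrency: each function can scale up by 1,000 concurrent executions every 10 seconds, up to the account limit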
# ECS (Elastic Container Service): AWS-native orchestration
# Simpler than Kubernetes; integrates deeply with ALB, IAM, CloudWatch
# Task Definition: container spec (image, CPU, memory, env vars, IAM role)
# Service: maintains desired count, integrates with load balancer
# Cluster: logical grouping; launch type: EC2 or Fargate
# EKS (Elastic Kubernetes Service): managed Kubernetes control plane
# AWS manages: API server, etcd, scheduler, controller manager (HA, auto-patched)
# You manage: worker nodes, add-ons (CoreDNS, VPC CNI, Load Balancer Controller)
# Choose when: existing K8s expertise, complex scheduling needs, multi-cloud portability
# Fargate: serverless containers (no EC2 management)
# Each pod/task gets its own Firecracker microVM (strong isolation)
# Per-vCPU-second billing; ~20-30% premium over equivalent EC2
# No daemonsets, no privileged containers, limited observability
# Best for: teams avoiding node management, variable workloads
# Karpenter (EKS node autoscaler — recommended over Cluster Autoscaler):
# Watches for unschedulable pods → provisions right-sized EC2 in ~30s
# Bin-packing: consolidates underutilized nodes automatically
# Mixed instance types: automatically picks cheapest/best fit
# Decision:
# AWS-native, no K8s expertise → ECS on Fargate
# Scale, cost optimization, K8s ecosystem → EKS + EC2 + Karpenter
# Avoid all node management, lower scale → ECS/EKS Fargate
# Event-driven, short-lived, ≤15 min → Lambda
# Target Tracking (recommended — like a thermostat):
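aws autoscaling put-scaling-policy \
--auto-scaling-group-name my-asg \
--policy-name cpu-target-60 \
--policy-type TargetTrackingScaling \
--estimated-instance-warmup 60 \
--target-tracking-configuration '{
"PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
"TargetValue": 60.0
}'
# EC2 Auto Scaling target tracking takes no cooldown fields; a short instance warmup keeps scale-out fast.
# Application Auto Scaling policies (ECS, DynamoDB) accept "ScaleOutCooldown": 60 and "ScaleInCooldown": 300
# for fast scale-out and slow scale-in (stability).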
# Scale out aggressively, scale in conservatively:
# Scale out: fire alarm approach — add capacity fast at first sign of load
# Scale in: wait for sustained low load (avoid thrashing)
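The thermostat analogy can be made concrete with the proportional rule that target tracking approximates: scale capacity by the ratio of the observed metric to the target. AWS does not publish the exact algorithm, so the function below is a simplified model with illustrative names and defaults.

```python
import math


def desired_capacity(current, metric, target, min_size=1, max_size=100):
    """Proportional estimate: capacity needed to bring the average metric back to target."""
    if target <= 0:
        raise ValueError("target must be positive")
    estimate = math.ceil(current * metric / target)
    return max(min_size, min(max_size, estimate))


# 10 instances at 90% average CPU with a 60% target → 15 instances
print(desired_capacity(10, 90, 60))
# 10 instances at 30% average CPU → 5 instances (applied slowly on scale-in)
print(desired_capacity(10, 30, 60))
```

Rounding up on scale-out and clamping to the group's min/max mirrors the "add capacity fast, remove it cautiously" guidance above.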
# Scheduled scaling (known traffic patterns); the recurrence below is 8am on weekdays, UTC:
aws autoscaling put-scheduled-update-group-action \
--auto-scaling-group-name my-asg \
--scheduled-action-name workday-start \
--recurrence "0 8 * * MON-FRI" \
--desired-capacity 20 --min-size 10
# Predictive scaling (ML-based — pre-scales before load arrives):
# Analyzes 14 days of history → forecasts and pre-warms instances
# Avoids lag from reactive scaling — instances ready before traffic arrives
# Best for: regular diurnal patterns (office hours, business cycles)
# ECS Service Auto Scaling:
# Scale on: ALB RequestCountPerTarget (per-task), SQS queue depth (for workers)
# Application Auto Scaling API — works for ECS, Lambda (concurrency), DynamoDB
# Scale-to-zero with Lambda and Fargate spot:
# Lambda: inherently scales to zero, pay only when invoked
# Fargate: min-tasks=0 + SQS trigger to scale from 0 → saves cost for batch
# Spot: spare EC2 capacity at 70-90% discount; 2-min interruption notice
# Interruption rate: ~1-5%/day depending on instance type and AZ
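# ASG mixed instances (best practice):
# Base of 2 On-Demand instances; above the base, 20% On-Demand and 80% Spot.
# Multiple instance types in Overrides = more Spot pools = higher availability.
aws autoscaling create-auto-scaling-group \
--auto-scaling-group-name my-asg \
--min-size 2 --max-size 20 \
--vpc-zone-identifier "subnet-xxx,subnet-yyy" \
--mixed-instances-policy '{
"InstancesDistribution": {
"OnDemandBaseCapacity": 2,
"OnDemandPercentageAboveBaseCapacity": 20,
"SpotAllocationStrategy": "price-capacity-optimized"
},
"LaunchTemplate": {
"LaunchTemplateSpecification": {"LaunchTemplateName": "my-template", "Version": "$Latest"},
"Overrides": [
{"InstanceType": "m5.large"},
{"InstanceType": "m5a.large"},
{"InstanceType": "m6i.large"},
{"InstanceType": "m6a.large"}
]
}
}'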
# Handle 2-minute interruption notice:
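import requests

IMDS = "http://169.254.169.254/latest"

def check_interruption():
    """Return True when a Spot interruption notice has been issued (uses IMDSv2)."""
    try:
        # IMDSv2 requires a session token before reading metadata
        token = requests.put(
            f"{IMDS}/api/token",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
            timeout=1,
        ).text
        r = requests.get(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
            timeout=1,
        )
        return r.status_code == 200  # 404 until an interruption is scheduled
    except requests.RequestException:
        return False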
# On interruption: drain from ALB, checkpoint state, graceful shutdown
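A check such as `check_interruption` above is usually wrapped in a polling loop that triggers the drain and checkpoint steps once. The sketch below is one possible shape; `watch_for_interruption`, its parameters, and the 5-second default are hypothetical, and the check and the shutdown hook are passed in so the loop can be exercised without an EC2 instance.

```python
import time


def watch_for_interruption(check, on_interrupt, poll_seconds=5.0,
                           max_polls=None, sleep=time.sleep):
    """Poll `check` until it returns True, then run `on_interrupt` once.

    Returns True if an interruption was handled, False if max_polls ran out.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        if check():
            on_interrupt()  # e.g. deregister from the ALB, checkpoint, shut down
            return True
        sleep(poll_seconds)
    return False


# Dry run with a fake check that reports an interruption on the third poll.
answers = iter([False, False, True])
handled = watch_for_interruption(
    check=lambda: next(answers),
    on_interrupt=lambda: print("draining and checkpointing"),
    sleep=lambda seconds: None,  # skip real waiting in the dry run
)
print(handled)
```

Polling every few seconds leaves most of the 2-minute notice for the actual drain and checkpoint work.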
# EC2 Capacity Blocks for ML: reserve GPU capacity in advance for ML training runs (a separate offering from Spot)
# Cloud Run: serverless containers, deploy any HTTP container
# --min-instances 0 allows scale to zero; --concurrency is requests per container instance
gcloud run deploy my-service \
--image gcr.io/my-project/my-app:latest \
--region us-central1 \
--min-instances 0 \
--max-instances 100 \
--concurrency 80 \
--memory 512Mi \
--cpu 1
# Cloud Run vs Lambda:
# Cloud Run: any container, any language, 60-min timeout, up to 32GB/8vCPU
# Lambda: function-as-a-service, 15-min max, 10GB memory, tighter AWS integration
# Cloud Run concurrency: multiple requests per container (like a web server)
# Lambda: one request at a time per execution environment (scale = more environments)
# Cold starts: both have them, and duration depends on runtime and image size; Cloud Run min-instances keeps instances warm
# Cloud Run Jobs: batch/scheduled tasks (not HTTP-triggered)
# Cloud Run for Anthos: deploy to GKE using same Cloud Run API
# Azure Container Apps (equivalent):
# Serverless containers with KEDA-based scaling (HTTP, queue depth, custom metrics)
# Scale to zero, VNet integration, Dapr support for microservices
# Managed Kubernetes underneath, fully abstracted
# AWS App Runner (closest to Cloud Run on AWS):
# Source: container image or GitHub repo → fully managed HTTP service
# Auto-scales, no VPC/IAM complexity to set up
# More limited than ECS Fargate but much simpler to get started
# Step Functions: serverless state machines (Amazon States Language)
# States: Task, Choice, Parallel, Map, Wait, Pass, Succeed, Fail
# Example: order processing with parallel stock check + payment + error handling
{
"StartAt": "ValidateOrder",
"States": {
"ValidateOrder": {
"Type": "Task",
"Resource": "arn:aws:lambda:::function:validate",
"Catch": [{"ErrorEquals": ["ValidationError"], "Next": "Reject"}],
"Next": "ProcessInParallel"
},
"ProcessInParallel": {
"Type": "Parallel",
"Branches": [
{"StartAt": "ProcessPayment", "States": {"ProcessPayment": {"Type": "Task", "Resource": "...", "End": true}}},
{"StartAt": "ReserveStock", "States": {"ReserveStock": {"Type": "Task", "Resource": "...", "End": true}}}
],
"Next": "Ship"
},
"Ship": {"Type": "Task", "Resource": "...", "End": true},
"Reject": {"Type": "Fail", "Error": "ValidationError"}
}
}
# Express vs Standard:
# Standard: exactly-once, up to 1 year, full audit history → business processes
# Express: at-least-once, up to 5 min, high throughput → event processing
# SDK integrations: call 200+ AWS services directly without Lambda glue code
# Built-in retry with exponential backoff, error handling, timeouts
# Use cases:
# Multi-step ETL pipelines (Glue + EMR + Athena)
# Human approval workflows (SES email → approval callback)
# Saga pattern for distributed transactions
# ML pipelines (SageMaker training + evaluation + deployment)
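The built-in retry with exponential backoff mentioned above is driven by three fields on a state's `Retry` block: `IntervalSeconds`, `MaxAttempts`, and `BackoffRate`. The sketch below computes the resulting wait times using the documented defaults (1 second, 3 attempts, rate 2.0); the function name is illustrative and jitter is ignored.

```python
def retry_delays(interval_seconds=1, max_attempts=3, backoff_rate=2.0):
    """Wait before each retry: IntervalSeconds * BackoffRate ** (retry number - 1)."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


print(retry_delays())          # [1.0, 2.0, 4.0]
print(retry_delays(2, 4, 3))   # [2, 6, 18, 54]
print(sum(retry_delays(5, 5, 2.0)))
```

Summing the delays is a quick check that the worst-case retry time still fits inside the state's timeout, and inside the 5-minute limit for Express workflows.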
# AWS manages (EKS control plane):
# Kubernetes API server, etcd, scheduler, controller manager
# Multi-AZ deployment, automatic failover, version patches
# etcd backups
# Control plane HA SLA: 99.95%
# You manage (data plane):
# Worker nodes (EC2 instances in your VPC)
# Node OS patching and AMI updates (managed node groups automate the rolling replacement once you start an update)
# EKS add-ons: CoreDNS, kube-proxy, VPC CNI, EBS CSI driver
# Karpenter or Cluster Autoscaler (node provisioning)
# Container images, application configs, Kubernetes RBAC
# Cluster networking: security groups, VPC CIDR planning
# EKS VPC CNI: each pod gets a VPC IP (no overlay network)
# → Pods directly routable within VPC → simpler security group rules
# → Limitation: IP exhaustion if CIDR too small (plan /19 per subnet minimum)
# Prefix delegation: /28 prefixes per ENI → 16 IPs per prefix → more pods per node
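The effect of prefix delegation on pod density can be estimated with the formula commonly used for the VPC CNI: ENIs × (IPs per ENI − 1) + 2. The sketch below assumes an m5.large (3 ENIs with 10 IPv4 addresses each, per the EC2 instance limits); the function name and the optional `cap` parameter are illustrative, and 110 is a typical recommended ceiling for smaller instances.

```python
def max_pods(enis, ips_per_eni, prefix_delegation=False, cap=None):
    """Estimate schedulable pods on a node using the VPC CNI."""
    slots_per_eni = ips_per_eni - 1  # the first address is the ENI's own primary IP
    if prefix_delegation:
        slots_per_eni *= 16          # each slot holds a /28 prefix = 16 addresses
    total = enis * slots_per_eni + 2  # +2 for host-network pods (aws-node, kube-proxy)
    return min(total, cap) if cap is not None else total


print(max_pods(3, 10))                                   # 29
print(max_pods(3, 10, prefix_delegation=True))           # 434
print(max_pods(3, 10, prefix_delegation=True, cap=110))  # 110
```

With prefix delegation the limit moves from IP slots to CPU and memory, so the subnet's free address space becomes the thing to watch.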
# EKS Fargate profiles (no EC2 nodes):
# Each pod runs in its own Firecracker microVM
# Profile selector: namespace + label → Fargate pod
# No daemonsets, no privileged pods, no node-level access
# Billing: per pod vCPU-second (more expensive than EC2 for consistent workloads)
# Version management:
# EKS supports n-3 Kubernetes versions
# Control plane upgrade: 30 min, in-place (but coordinate with data plane)
# Worker node upgrade: rolling update via managed node groups
Storage & Databases
# Storage classes (cost vs access tradeoff):
# Standard: $0.023/GB — frequent access, ms latency
# Standard-IA: $0.0125/GB — infrequent access, retrieval fee
# One Zone-IA: $0.010/GB — single AZ, recreatable data
# Glacier Instant: $0.004/GB — quarterly access, ms retrieval
# Glacier Flexible: $0.0036/GB — minutes-hours retrieval
# Glacier Deep Archive: $0.00099/GB — years retention, 12h retrieval
# Intelligent-Tiering: auto-moves between tiers based on access patterns
# Lifecycle policy (automated cost management):
{
"Rules": [{
"ID": "archive-logs",
"Status": "Enabled",
"Filter": {"Prefix": "logs/"},
"Transitions": [
{"Days": 30, "StorageClass": "STANDARD_IA"},
{"Days": 90, "StorageClass": "GLACIER_IR"},
{"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
],
"Expiration": {"Days": 2555}
}]
}
# Expiration after 2555 days = deletion at roughly 7 years
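To see what the lifecycle rule above does to an object over time, the sketch below maps an object's age to its storage class and monthly cost, using the per-GB prices listed earlier in this section. It ignores retrieval fees, request charges, and minimum storage durations, and the function names are illustrative.

```python
# (minimum age in days, storage class, $ per GB-month) from the rule and price list above
TRANSITIONS = [
    (0, "STANDARD", 0.023),
    (30, "STANDARD_IA", 0.0125),
    (90, "GLACIER_IR", 0.004),
    (365, "DEEP_ARCHIVE", 0.00099),
]
EXPIRATION_DAYS = 2555


def storage_class_at(age_days):
    """Storage class for an object of the given age, or None once it has expired."""
    if age_days >= EXPIRATION_DAYS:
        return None
    current = TRANSITIONS[0]
    for rule in TRANSITIONS:
        if age_days >= rule[0]:
            current = rule
    return current[1]


def monthly_cost(size_gb, age_days):
    """Storage-only cost per month for size_gb of data at the given age."""
    storage_class = storage_class_at(age_days)
    if storage_class is None:
        return 0.0
    price = next(p for _, name, p in TRANSITIONS if name == storage_class)
    return size_gb * price


for age in (10, 45, 120, 400, 3000):
    print(age, storage_class_at(age), round(monthly_cost(1000, age), 2))
```

For 1 TB of logs the storage bill drops from about $23 a month to under $1 once the data reaches Deep Archive.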
# Performance optimization:
# Throughput: 3,500 PUT/5,500 GET per second PER PREFIX
# High-throughput: spread keys across multiple prefixes → more parallelism
# Large objects (>100MB): multipart upload (parallelizes, allows resume)
# Global reads: byte-range fetches, CloudFront caching, Transfer Acceleration
# S3 Express One Zone: 10x throughput, single-digit ms (2023) — for ML data loading
# RDS: managed MySQL/PostgreSQL/SQL Server/Oracle
# Multi-AZ: synchronous standby → automatic failover (~1-2 min)
# Read replicas: async, up to 15 for MySQL/MariaDB/PostgreSQL (5 for Oracle/SQL Server), cross-region supported
# Use for: existing relational workloads, complex SQL, ACID transactions
# Aurora (MySQL/PostgreSQL compatible):
# Up to 5x the throughput of standard MySQL and 3x of PostgreSQL (AWS's figures); distributed storage (6 copies, 3 AZs)
# Aurora Serverless v2: auto-scales compute in 0.5 ACU increments
# Aurora Global Database: cross-region, <1s replication lag
# Use instead of RDS when: production workload, need 99.99%+ availability
# DynamoDB (NoSQL key-value/document):
# Single-digit ms at any scale, serverless; SLA 99.99% (99.999% with Global Tables)
# Global Tables: multi-region active-active
# Use for: high-scale simple access patterns, gaming, IoT, session storage
# ElastiCache:
# Redis: sub-ms in-memory, persistence, pub/sub, sorted sets, cluster mode
# Memcached: simple distributed cache, multi-threaded
# Use for: database query caching, session storage, leaderboards, rate limiting
# Decision:
# Complex SQL + ACID → Aurora (new) or RDS (existing)
# Simple access patterns + massive scale → DynamoDB
# In-memory caching + pub/sub → ElastiCache Redis
# Time-series → Amazon Timestream
# Full-text search → OpenSearch Service
# Data warehouse → Redshift
# DynamoDB: access patterns drive the model — opposite of relational
# Identify ALL access patterns first, then design keys around them
# Primary key: PK (partition key) alone, or PK + SK (sort key)
# PK determines partition; SK enables range queries within a partition
# GSI: alternate PK+SK — enables new access patterns on existing data
# Single-table design (e-commerce example):
# Entity type prefix pattern:
# PK=USER#alice, SK=PROFILE → user profile
# PK=USER#alice, SK=ORDER#2024-01-15#001 → user's order
# PK=ORDER#001, SK=ITEM#widget-a → order line item
# PK=PRODUCT#widget, SK=METADATA → product info
# GSI for inverted access:
# GSI1PK=ORDER#2024-01-15, GSI1SK=STATUS#SHIPPED
# Query: "all orders on this date with status SHIPPED"
# Access patterns supported:
# GetUser(userId): PK=USER#alice, SK=PROFILE
# GetUserOrders(userId): PK=USER#alice, SK begins_with ORDER#
# GetOrderItems(orderId): PK=ORDER#001, SK begins_with ITEM#
# GetOrdersByDate(date, status): GSI1 query
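The key scheme above can be exercised without DynamoDB. This is a minimal in-memory sketch: `TABLE` and `query` are stand-ins invented for illustration (not the boto3 API), and `query` only imitates an exact partition-key match plus `begins_with` on the sort key.

```python
# In-memory stand-in for a DynamoDB table: a list of items with PK and SK attributes.
TABLE = [
    {"PK": "USER#alice", "SK": "PROFILE", "name": "Alice"},
    {"PK": "USER#alice", "SK": "ORDER#2024-01-15#001", "total": 40},
    {"PK": "USER#alice", "SK": "ORDER#2024-02-03#002", "total": 15},
    {"PK": "ORDER#001", "SK": "ITEM#widget-a", "qty": 2},
    {"PK": "PRODUCT#widget", "SK": "METADATA", "price": 20},
]


def query(pk, sk_prefix=""):
    """Mimic Query: exact match on PK, optional begins_with on SK, sorted by SK."""
    items = [i for i in TABLE if i["PK"] == pk and i["SK"].startswith(sk_prefix)]
    return sorted(items, key=lambda i: i["SK"])


def get_user(user_id):
    items = query(f"USER#{user_id}", "PROFILE")
    return items[0] if items else None


def get_user_orders(user_id):
    return query(f"USER#{user_id}", "ORDER#")


def get_order_items(order_id):
    return query(f"ORDER#{order_id}", "ITEM#")
```

Because the order sort keys embed the date, `get_user_orders` returns orders in chronological order with no extra index.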
# Hot partition prevention:
# Avoid low-cardinality PKs that funnel all writes to one partition
# Add shard suffix: PK=ORDERS#3 (random 1-10) for write-heavy entities
# Write sharded, read from all shards and merge
# Capacity: 1 WCU = one 1KB write per second; 1 RCU = one 4KB strongly consistent read per second (or two eventually consistent reads)
# Item size limit: 400KB max
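The write-sharding and capacity rules above reduce to a few lines of arithmetic. This is a rough sketch with hypothetical helper names; the shard count of 10 comes from the example suffix, and the capacity helpers cover only single-item, non-transactional operations.

```python
import math
import random

SHARD_COUNT = 10  # matches the "random 1-10" suffix above


def sharded_pk(base, rng=random):
    """Pick a random shard so writes spread over SHARD_COUNT partition keys."""
    return f"{base}#{rng.randint(1, SHARD_COUNT)}"


def all_shard_pks(base):
    """Readers must query every shard and merge the results."""
    return [f"{base}#{n}" for n in range(1, SHARD_COUNT + 1)]


def wcu_per_write(item_bytes):
    """1 WCU covers a write of up to 1 KB; larger items round up."""
    return max(1, math.ceil(item_bytes / 1024))


def rcu_per_read(item_bytes, strongly_consistent=True):
    """1 RCU covers a strongly consistent read of up to 4 KB; eventual costs half."""
    units = max(1, math.ceil(item_bytes / 4096))
    return units if strongly_consistent else units / 2
```

A 3.5 KB item therefore costs 4 WCU per write, which is one reason to keep hot items small.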
# Aurora Global Database:
# Primary region: all writes
# Secondary regions (up to 5): read-only replicas, <1s replication lag
# RPO: <1 second | RTO: <1 minute (managed failover)
# Use cases:
# 1. Read scale: route regional users to nearest region for low-latency reads
# 2. Disaster recovery: fast failover if primary region fails
# 3. Compliance: data in specific regions for data residency
# Switchover (planned, e.g., region maintenance; formerly called managed planned failover):
# Promotes secondary in <35s, no data loss
# vs DynamoDB Global Tables:
# DynamoDB: active-active (all regions write), eventual consistency
# Aurora Global: active-passive (writes only in primary); strongly consistent in the primary, secondary reads lag by <1s
# Multi-region write patterns:
# Single primary (Aurora Global): consistent, writes bottleneck to one region
# Sharded by region: EU users → EU DB, US users → US DB (no cross-region reads)
# Active-active (DynamoDB / CockroachDB): all regions accept writes,
# eventual consistency, conflict resolution required
# RDS Proxy: connection pooling for Lambda/serverless → Aurora
# Prevents connection exhaustion when Lambda scales to thousands of concurrent executions
# Also provides: IAM authentication for DB, automatic failover handling
# Lake house: S3 as central store, query in place (no ETL to data warehouse)
# Zones: raw (landing) → curated (cleaned/partitioned) → consumption (aggregated)
# Ingestion:
# Batch: AWS Glue ETL (Spark) or EMR → S3
# Streaming: Kinesis Firehose → S3 (buffered micro-batch, 1-15 min)
# CDC from databases: DMS → Kinesis → S3 or direct to S3
# Storage (open table formats — replacing Hive):
# Apache Iceberg: ACID transactions on S3, time travel, schema evolution
# Delta Lake: Databricks-native, now open source, same capabilities
# Both supported by Athena v3, Glue, EMR natively (Athena reads Delta tables but writes only Iceberg)
# Querying:
# Athena: serverless SQL on S3, $5/TB scanned → use Parquet + partition pruning
# Redshift Spectrum: query S3 from Redshift (join warehouse + lake)
# EMR: Spark/Hive for complex transformations
# Catalog and governance:
# Glue Catalog: Hive-compatible metastore (schemas, partitions)
# Lake Formation: column/row-level access control on the lake
# Glue Crawler: auto-discovers schemas from S3
# Performance best practices:
# Parquet/ORC format: columnar → 10x less data scanned for analytical queries
# Partition by date: WHERE dt='2024-01-15' → prune non-matching partitions
# Compaction: merge many small files into fewer large files (Iceberg auto-compaction)
# Predicate pushdown: Athena pushes filters into Parquet reader
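The payoff of partitioning and columnar formats can be estimated with simple arithmetic. The sketch below assumes data is spread evenly across partitions and columns (real tables rarely are), so treat the result as an order-of-magnitude estimate; the function names are illustrative.

```python
PRICE_PER_TB = 5.0  # Athena price per TB scanned, as quoted above
TB = 1024 ** 4


def athena_cost(bytes_scanned):
    """Dollar cost of one query that scans the given number of bytes."""
    return bytes_scanned / TB * PRICE_PER_TB


def scanned_bytes(total_bytes, partitions_total, partitions_read, column_fraction=1.0):
    """Estimate bytes scanned after partition pruning and column projection."""
    return total_bytes * (partitions_read / partitions_total) * column_fraction


if __name__ == "__main__":
    full = athena_cost(scanned_bytes(10 * TB, 365, 365))
    pruned = athena_cost(scanned_bytes(10 * TB, 365, 1, column_fraction=0.1))
    print(f"full scan: ${full:.2f}, one day + 10% of columns: ${pruned:.4f}")
```

For a 10 TB table with daily partitions, a query that touches one day and a tenth of the columns costs cents rather than tens of dollars.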
# Cache-aside (lazy loading — most common):
def get_user(user_id):  # assumes `import json` plus configured `redis` and `db` clients
    user = redis.get(f"user:{user_id}")
    if user: return json.loads(user)  # cache hit
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))  # populate with a 1h TTL
    return user
# Pro: only caches accessed data; Con: cache miss = 2 trips
# Write-through: write DB and cache on every update
def update_user(user_id, data):
    db.update(user_id, data)
    redis.setex(f"user:{user_id}", 3600, json.dumps(data))
# Pro: cache always fresh; Con: extra write latency
# Cache stampede prevention (thundering herd):
# On cache miss, hundreds of requests hit DB simultaneously
# Fix: probabilistic early expiration or mutex lock on key
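def get_user_with_lock(uid):  # assumes `import time` and the cache-aside get_user
    if redis.exists(f"user:{uid}") or redis.set(f"lock:user:{uid}", 1, nx=True, ex=10):
        return get_user(uid)  # cache hit, or we hold the lock and repopulate the cache
    time.sleep(0.1); return get_user_with_lock(uid)  # another caller is rebuilding: retry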
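The other fix named above, probabilistic early expiration, can be sketched as a single decision function. This follows the commonly described approach in which each reader refreshes early with a probability that rises as expiry approaches; the parameter names and the injectable `now` and `rand` arguments are assumptions made for testability.

```python
import math
import random
import time


def should_refresh_early(expiry_ts, recompute_seconds, beta=1.0, now=None, rand=random.random):
    """Probabilistic early expiration: the closer to expiry, the likelier a refresh.

    recompute_seconds is how long rebuilding the value takes; beta > 1 refreshes
    earlier, beta < 1 later. rand must return a float in [0, 1).
    """
    now = time.time() if now is None else now
    # -log(u) is a non-negative random stretch (u in (0, 1]), scaled by the rebuild cost.
    u = 1.0 - rand()
    return now - recompute_seconds * beta * math.log(u) >= expiry_ts
```

A reader that gets `True` rebuilds the entry while everyone else keeps serving the cached value, so the rebuild is unlikely to coincide across many callers.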
# CloudFront as API cache:
# Cache GET /products/* for 5 minutes at edge (300 ALB requests → 1)
# Vary cache by Accept-Language header for localized content
# Tiered caching:
# L1: Local in-process cache (LRU dict, 100 entries) — microseconds
# L2: ElastiCache Redis (shared across instances) — sub-ms
# L3: Database query result — ms
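A minimal sketch of the three tiers follows. It assumes a dict-like object stands in for Redis and a `loader` callable stands in for the database query; the class name and parameters are illustrative, and TTLs and invalidation are left out.

```python
from collections import OrderedDict


class TieredCache:
    """L1: small in-process LRU. L2: shared cache. L3: loader (e.g., a DB query)."""

    def __init__(self, l2, loader, l1_capacity=100):
        self.l1 = OrderedDict()
        self.l2 = l2              # any dict-like store; stands in for Redis here
        self.loader = loader
        self.l1_capacity = l1_capacity

    def _put_l1(self, key, value):
        self.l1[key] = value
        self.l1.move_to_end(key)
        if len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)   # evict the least recently used entry

    def get(self, key):
        if key in self.l1:                # L1 hit
            self.l1.move_to_end(key)
            return self.l1[key]
        if key in self.l2:                # L2 hit
            value = self.l2[key]
        else:                             # miss everywhere: load and fill L2
            value = self.loader(key)
            self.l2[key] = value
        self._put_l1(key, value)
        return value
```

Keeping L1 small bounds per-instance memory and limits how much stale data each instance can hold.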
# Cache invalidation on write:
# Simple TTL: accept up to TTL stale data
# Write-through: update cache synchronously
# Event-based: publish invalidation event → listeners delete cache keys
# Expand-Contract pattern (for schema changes):
# Phase 1 — Expand (backward-compatible):
ALTER TABLE users ADD COLUMN phone VARCHAR(20) NULL;
# Deploy code that writes to BOTH old and new columns
# Deploy code that reads from new column (fallback to old)
# Phase 2 — Backfill (batched to avoid table lock):
UPDATE users SET phone = legacy_phone
WHERE phone IS NULL AND id BETWEEN %s AND %s; # batch by primary key
# Run during off-peak hours; monitor replication lag
# Phase 3 — Contract (after all code reads new column only):
ALTER TABLE users DROP COLUMN legacy_phone;
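The batched backfill in Phase 2 can be tried locally. This sketch uses SQLite from the Python standard library purely as a stand-in database; the table layout follows the example above, while the batch size and the `demo` helper are illustrative assumptions.

```python
import sqlite3


def backfill_phone(conn, batch_size=1000):
    """Copy legacy_phone into phone in primary-key ranges, committing per batch."""
    max_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM users").fetchone()[0]
    updated = 0
    for start in range(1, max_id + 1, batch_size):
        cur = conn.execute(
            "UPDATE users SET phone = legacy_phone "
            "WHERE phone IS NULL AND id BETWEEN ? AND ?",
            (start, start + batch_size - 1),
        )
        conn.commit()          # short transactions keep locks brief
        updated += cur.rowcount
    return updated


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, legacy_phone TEXT, phone TEXT)")
    conn.executemany(
        "INSERT INTO users (id, legacy_phone) VALUES (?, ?)",
        [(i, f"555-{i:04d}") for i in range(1, 26)],
    )
    conn.commit()
    updated = backfill_phone(conn, batch_size=10)
    remaining = conn.execute("SELECT COUNT(*) FROM users WHERE phone IS NULL").fetchone()[0]
    return updated, remaining
```

The `phone IS NULL` guard makes the job safe to re-run after an interruption, since rows already copied are skipped.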
# AWS DMS for version upgrades or engine migrations:
# Full load: copy all existing data to target
# CDC (Change Data Capture): replicate ongoing changes during cutover window
# Cutover: stop writes to source → wait for DMS lag = 0 → switch connection strings
# RDS Blue-Green Deployments (for major version upgrades):
# Creates a green RDS with the new version
# Replicates from blue via logical replication (binlog for MySQL/MariaDB)
# Switchover: promotes green, renames endpoints (~30s downtime)
# After switchover the old blue environment is kept (renamed with an -old1 suffix) until you delete it; there is no automatic rollback
# Flyway / Liquibase in CI/CD:
# Versioned migrations: V001__initial.sql, V002__add_column.sql
# Runs automatically in deployment pipeline
# Tracks executed migrations in a history table (flyway_schema_history / DATABASECHANGELOG)
# Already-applied versions are skipped, so re-running the pipeline is safe
# Cloud SQL (≈ RDS): managed MySQL/PostgreSQL, single-region, familiar
# $50-500/month depending on instance size
# Use for: standard OLTP, lift-and-shift, most relational workloads
# Cloud Spanner: globally distributed, strongly consistent RDBMS
# ACID transactions across rows, tables, AND continents
# Horizontal read/write scaling (unlike traditional RDBMS)
# 99.999% SLA multi-region, auto-sharding
# Cost: ~$650/month per node; smaller instances start at 100 processing units (1/10 of a node)
# Use for: global financial systems, inventory consistency, >1 TB relational data
# Powers: Google Ads, Google Play, Pokémon GO
CREATE TABLE Orders (
customer_id INT64 NOT NULL,
order_id INT64 NOT NULL,
amount NUMERIC,
) PRIMARY KEY (customer_id, order_id),
INTERLEAVE IN PARENT Customers ON DELETE CASCADE;
# Interleaved tables: Orders physically co-located with Customers → fast joins
# Cloud Bigtable: wide-column NoSQL, petabyte-scale
# HBase-compatible API, consistent low latency at scale
# Schema: rows keyed by row key; columns in column families
# Use for: time-series (IoT, metrics), ML features, click stream, Google Maps tiles
# Not for: ad-hoc queries, SQL joins, small datasets
# Comparison:
# Cloud SQL: familiar, cheap, single-region → most web apps
# Spanner: global ACID, expensive → fintech, global inventory
# Bigtable: massive scale, simple access patterns → IoT, analytics
# BigQuery: serverless OLAP → analytics, dashboards (not OLTP)
Messaging & Events
7 questions
# SQS: point-to-point work queue, at-least-once:
# Standard: nearly unlimited throughput, best-effort ordering; FIFO: ordered, exactly-once processing, 300 msg/s (3,000 with batching; more in high-throughput mode)
# Visibility timeout: message hidden while processing, requeued on failure
# DLQ: messages failing N times go to Dead Letter Queue
# Use for: decoupling services, async work queues, rate-limiting downstream
# SNS — pub/sub fan-out, push delivery:
# One publish → many subscribers (SQS, Lambda, HTTP, email, mobile push)
# Message filtering: subscribers receive only matching messages
# Fan-out pattern: SNS topic → multiple SQS queues → independent consumers
# Use for: notifications, triggering multiple actions from one event
# EventBridge — event bus with content-based routing:
# Default bus: all AWS service events; Custom bus: your app events
# Rules match events by pattern → 20+ target types
# Schema registry, archive/replay, cross-account routing
# Use for: event-driven architecture, reacting to AWS service changes
# Kinesis Data Streams — ordered, replayable, real-time:
# Shard: 1 MB/s write, 2 MB/s read; Retention: 24h to 365 days
# Enhanced fan-out: 2 MB/s per consumer (no sharing)
# Use for: real-time analytics, ordered events, replay needed, IoT
# Decision guide:
# Work queue: SQS Standard/FIFO
# Fan-out notifications: SNS → SQS
# AWS service event routing: EventBridge
# Real-time streaming + replay: Kinesis Data Streams
# Managed Kafka workloads: Amazon MSK
Distributed transactions across microservices can't use traditional 2PC reliably at scale. The Saga pattern breaks a transaction into a sequence of local transactions. If one fails, compensating transactions undo previous steps.
# Choreography saga (event-driven — no central coordinator):
# OrderService: creates order → publishes OrderCreated
# PaymentService: receives OrderCreated → charges card → publishes PaymentProcessed
# InventoryService: receives PaymentProcessed → reserves stock → publishes StockReserved
# ShippingService: receives StockReserved → creates shipment → publishes OrderShipped
# Compensation on failure:
# InventoryService: out of stock → publishes StockFailed
# PaymentService: receives StockFailed → refunds card → publishes PaymentRefunded
# OrderService: receives PaymentRefunded → marks order Cancelled
# Orchestration saga (Step Functions — centralized):
# State machine calls each service explicitly
# On failure: state machine triggers compensation steps
# Clearer flow, easier debugging; harder to scale to many services
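The orchestration flow can be sketched without Step Functions as a loop over (action, compensation) pairs. This is an illustrative model only: the step names in `demo` are hypothetical, and a real orchestrator would also persist state and retry transient failures.

```python
def run_saga(steps, context):
    """Run (action, compensation) pairs in order; on failure undo completed steps in reverse."""
    completed = []
    for action, compensation in steps:
        try:
            action(context)
            completed.append(compensation)
        except Exception as exc:
            for compensate in reversed(completed):
                compensate(context)
            context["status"] = "CANCELLED"
            context["error"] = str(exc)
            return context
    context["status"] = "COMPLETED"
    return context


def demo():
    log = []

    def reserve_stock(ctx):
        raise RuntimeError("out of stock")

    steps = [
        (lambda ctx: log.append("create_order"), lambda ctx: log.append("cancel_order")),
        (lambda ctx: log.append("charge_card"), lambda ctx: log.append("refund_card")),
        (reserve_stock, lambda ctx: log.append("release_stock")),
    ]
    result = run_saga(steps, {"order_id": "001"})
    return log, result["status"]
```

Only steps that completed are compensated, and in reverse order, so the card is refunded before the order is cancelled.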
# Idempotency requirement (events delivered at-least-once):
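processed = set()  # in-memory for illustration; production needs a durable store shared by all consumers
def handle_event(event):
    if event["event_id"] in processed: return  # deduplicate
    do_work(event)
    processed.add(event["event_id"])  # mark only after the work succeeds, so a failure is retried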
# Outbox pattern (ensures event published if DB write succeeds):
# Write event to outbox table IN SAME DB TRANSACTION as business data
# Separate process reads outbox → publishes to message bus → marks sent
# Complete streaming pipeline:
# App → Kinesis Data Streams → Kinesis Firehose → S3 → Glue Catalog → Athena
# Firehose: auto-scales, no shards to manage; delivers in micro-batches
# Buffer: flush when 128MB OR 300s reached (whichever first)
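# Compression: GZIP typically shrinks JSON/text by 70-80%; with Parquet conversion (as below),
# CompressionFormat stays UNCOMPRESSED because the Parquet serializer compresses the data itself (Snappy by default)
# Firehose with Lambda transformation
# (a complete config also needs RoleARN and, for format conversion, SchemaConfiguration + InputFormatConfiguration):
{
"DeliveryStreamName": "app-events",
"ExtendedS3DestinationConfiguration": {
"BucketARN": "arn:aws:s3:::my-data-lake",
"Prefix": "events/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
"ErrorOutputPrefix": "errors/",
"BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
"CompressionFormat": "UNCOMPRESSED",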
"DataFormatConversionConfiguration": {
"OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}}
},
"ProcessingConfiguration": {
"Processors": [{"Type": "Lambda", "Parameters": [...]}]
}
}
}
# Glue Crawler auto-discovers partitions and schema → Athena can query immediately
# Athena query cost: $5/TB scanned → partitioning + Parquet can cut data scanned by 10-100x
# Real-time dashboard:
# Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics) → tumbling windows, anomaly detection
# Output to DynamoDB or OpenSearch for sub-second dashboard queries
# MSK: managed Apache Kafka — runs real Kafka brokers in your VPC
# Provisioned: dedicated brokers; Serverless: pay per throughput (auto-scales)
# Choose Kafka/MSK when:
# ✅ Existing Kafka workloads (lift-and-shift)
# ✅ Kafka ecosystem: Debezium CDC, Kafka Connect, ksqlDB, Kafka Streams
# ✅ Exactly-once transactions (Kafka transactional API)
# ✅ Many consumers with high throughput (no Kinesis fan-out fees)
# ✅ Indefinite retention (replay old events without re-ingesting)
# Choose SQS/Kinesis when:
# ✅ Native AWS integrations: Lambda triggers, EventBridge, Firehose
# ✅ Operational simplicity: no broker sizing, no ZooKeeper/KRaft
# ✅ Cost: SQS scales to zero and Kinesis starts at one shard-hour; provisioned MSK carries a fixed multi-broker cost (roughly $700/month minimum)
# ✅ Simple patterns: one producer → one consumer (SQS is perfect)
# MSK Serverless vs Provisioned:
# Serverless: automatic capacity, pay per throughput → variable workloads
# Provisioned: predictable performance, better for sustained high throughput
# MSK Connect (managed Kafka Connect):
# Source: Debezium → capture MySQL/PostgreSQL changes → Kafka topics
# Sink: Kafka topics → S3 (S3 Sink Connector), OpenSearch, Redshift
# Fully managed connectors, no infrastructure to manage
# Backpressure: producer faster than consumer → queue grows → memory/latency problems
# Strategy 1: Scale consumers based on queue depth
# CloudWatch alarm: SQS ApproximateNumberOfMessagesVisible > 1000
# → Scale out ECS tasks or Lambda concurrency
# Strategy 2: Lambda + SQS (built-in backpressure management)
# Lambda polls SQS, scales up to concurrency limit automatically
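# Cap concurrency at the DB connection pool max (prevents connection exhaustion); prefer the event source mapping's maximum concurrency over reserved concurrency, whose throttling can push messages to the DLQ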
# Batch size: 10 messages per invocation; report batch item failures for partial retry
aws lambda create-event-source-mapping \
--function-name processor \
--event-source-arn arn:aws:sqs:::my-queue \
--batch-size 10 \
--function-response-types ReportBatchItemFailures
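With `ReportBatchItemFailures` enabled, the function returns the IDs of the messages that failed so that only those are retried. A minimal handler might look like this; `process` is a hypothetical placeholder for the business logic.

```python
import json


def process(body):
    """Hypothetical business logic: rejects messages without an order_id."""
    if "order_id" not in body:
        raise ValueError("missing order_id")


def handler(event, context):
    """SQS batch handler that reports only the failed message IDs for retry."""
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning an empty `batchItemFailures` list marks the whole batch as successful; raising an exception instead would make the entire batch visible again.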
# Strategy 3: Rate limiting at producer
# API Gateway throttling: 1000 req/s → 429 when exceeded
# Token bucket per user: prevents any single tenant from flooding the queue
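A per-tenant token bucket can be sketched in a few lines. This version keeps state in process memory and takes an injectable clock for testing; both are simplifying assumptions, since a shared limiter would normally keep its counters in something like Redis.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerTenantLimiter:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.buckets = {}

    def allow(self, tenant_id):
        if tenant_id not in self.buckets:
            self.buckets[tenant_id] = TokenBucket(self.rate, self.capacity, self.clock)
        return self.buckets[tenant_id].allow()
```

Each tenant drains only its own bucket, so one noisy tenant is rejected while others continue to enqueue.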
# Strategy 4: Circuit breaker
# If downstream DB is failing, stop consuming from queue
# Prevents cascade: queue → retry storm → DB overload → full outage
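A consumer-side circuit breaker could be sketched as below. The thresholds and class name are illustrative assumptions; the consumer would call `allow()` before polling the queue and record the outcome of each downstream call.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        """False while open; once reset_timeout has passed, trial calls are let through."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.reset_timeout

    def record_success(self):
        """A successful call closes the breaker and clears the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        """Open (or re-open) the breaker once the failure threshold is reached."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

While the breaker is open, messages simply stay in the queue, which acts as the buffer until the database recovers.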
# Dead Letter Queue management:
# After 3 failed attempts → DLQ
# CloudWatch alarm on DLQ depth > 0 → immediate page
# DLQ redrive: after fixing root cause, replay from DLQ to main queue
aws sqs start-message-move-task \
--source-arn arn:aws:sqs:::my-queue-dlq \
--destination-arn arn:aws:sqs:::my-queue
# GCP Pub/Sub (≈ SNS + SQS in one):
# Topic → multiple Subscriptions → each subscription gets every message
# Within a subscription: only one consumer gets each message (queue semantics)
# Global by default, at-least-once; exactly-once delivery is an opt-in setting on pull subscriptions (independent of ordering keys)
# Push mode: Pub/Sub calls your HTTPS endpoint (great for Cloud Run)
# Pull mode: consumer polls (standard queue consumer)
# BigQuery subscription: auto-stream messages to BigQuery table
# Ordering:
publisher.publish(topic, data=b"event", ordering_key="order-123")  # publisher created with enable_message_ordering=True
# Messages sharing an ordering_key arrive in publish order if the subscription has message ordering enabled
# Azure Service Bus (≈ SQS + partial SNS):
# Queues: point-to-point, FIFO, sessions (ordered groups), dead letter
# Topics + Subscriptions: pub/sub (each subscription = filtered copy)
# Sessions: group related messages → processed by single consumer in order
# Advanced: message deferral, scheduled delivery, duplicate detection
# Premium tier: dedicated capacity, VNet integration, 100MB messages
# Azure Event Hubs (≈ Kinesis Data Streams):
# Partitioned event streaming, consumer groups
# Kafka-compatible API: run existing Kafka code against Event Hubs!
# Capture: auto-write to Azure Blob Storage (≈ Kinesis → Firehose → S3)
# Cross-cloud messaging (hybrid scenarios):
# Confluent Cloud / Confluent Platform: Kafka on any cloud
# MQ services: Amazon MQ (ActiveMQ/RabbitMQ managed) for lift-and-shift
# EventBridge Pipes: point-to-point event pipeline (source → filter → enrich → target)
# Replaces Lambda glue code for common integration patterns
# Sources: SQS, Kinesis, DynamoDB Streams, MSK, MQ
# Filter: only forward matching events (e.g., only OrderCreated events)
# Enrichment: Lambda, API GW, Step Functions (transform/augment the event)
# Targets: Lambda, SQS, EventBridge bus, Step Functions, Kinesis, API GW, etc.
# Example: DynamoDB stream → Pipe → EventBridge bus
aws pipes create-pipe \
--name dynamo-to-eventbridge \
--role-arn arn:aws:iam:::role/... \
--source arn:aws:dynamodb:::stream/... \
--source-parameters '{
"DynamoDBStreamParameters": {
"StartingPosition": "LATEST",
"BatchSize": 10
},
"FilterCriteria": {
"Filters": [{"Pattern": "{\"dynamodb\": {\"NewImage\": {\"status\": {\"S\": [\"SHIPPED\"]}}}}"}]
}
}' \
--enrichment arn:aws:lambda:::function:enrich-order \
--target arn:aws:events:::event-bus/order-events
# EventBridge Archive + Replay:
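# Archive all events for 30 days → replay after a bug fix (the replay source is the archive's ARN; "my-archive" is a placeholder)
aws events start-replay \
--replay-name bug-fix-replay \
--event-source-arn arn:aws:events:::archive/my-archive \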
--event-start-time 2024-01-15T00:00:00 \
--event-end-time 2024-01-16T00:00:00 \
--destination '{"Arn": "arn:aws:events:::event-bus/my-bus", "FilterArns": [...]}'
Observability
6 questions
Metrics answer "Is my system healthy?", Logs answer "What happened?", and Traces answer "Where is the slowdown?"
# Metrics: CloudWatch
import boto3
cw = boto3.client('cloudwatch')
cw.put_metric_data(Namespace='MyApp', MetricData=[{
'MetricName': 'OrdersProcessed',
'Value': 1, 'Unit': 'Count',
'Dimensions': [{'Name': 'Environment', 'Value': 'prod'}]
}])
# Alarms on threshold or anomaly detection
# Dashboards: composite views with SLO burn rates
# Logs: CloudWatch Logs Insights
# Structured JSON logging essential for querying:
{"level":"ERROR","msg":"Payment failed","order_id":"123","error":"timeout","trace_id":"abc"}
# Query:
fields @timestamp, order_id, error
| filter level = "ERROR"
| stats count(*) as errors by bin(5m)
# Traces: AWS X-Ray + OpenTelemetry (ADOT)
from aws_xray_sdk.core import xray_recorder, patch_all
patch_all()  # auto-instrument supported libraries (boto3, requests, and others)
@xray_recorder.capture('process-order')
def process_order(order_id):
    with xray_recorder.in_subsegment('db-query'):
        return db.get_order(order_id)
# OpenTelemetry on AWS (vendor-neutral):
# ADOT Collector → X-Ray, CloudWatch, Prometheus, Grafana, Datadog
# Benefit: switch backends without code changes
# SLI/SLO alerting (SRE methodology):
# SLI: the metric you measure (e.g., p99 latency, availability)
# SLO: the target (99.9% availability ≈ 8.76 hours downtime/year)
# Error Budget: 1 - SLO = 0.1% = how much badness you can tolerate
# Key SLIs for web APIs:
# Availability: (successful / total) > 99.9%
# Latency: p99 < 500ms
# Error rate: (5xx / total) < 1% (4xx usually reflect client mistakes; track them separately)
# Throughput: requests/second (capacity planning)
# Error budget burn rate alerts:
# 1x burn = budget lasts exactly the SLO window (30 days)
# 14.4x burn for 1h = 2% of the 30-day budget consumed → P1 page
# 6x burn for 6h = 5% of the 30-day budget consumed → P2 notify
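The error-budget and burn-rate arithmetic can be checked directly. The sketch below assumes a 30-day SLO window for burn-rate alerts and a 365-day year for the downtime figure; the function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365


def error_budget(slo):
    """Fraction of requests (or time) allowed to fail, e.g. 0.001 for 99.9%."""
    return 1.0 - slo


def allowed_downtime_hours(slo, window_hours=HOURS_PER_YEAR):
    return error_budget(slo) * window_hours


def burn_rate(observed_error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are arriving."""
    return observed_error_ratio / error_budget(slo)


def budget_consumed(rate, alert_window_hours, slo_window_hours=30 * 24):
    """Fraction of the SLO window's error budget used by burning at `rate`."""
    return rate * alert_window_hours / slo_window_hours


if __name__ == "__main__":
    print(f"99.9% allows {allowed_downtime_hours(0.999):.2f} h/year of downtime")
    print(f"14.4x for 1h consumes {budget_consumed(14.4, 1):.0%} of a 30-day budget")
    print(f"6x for 6h consumes {budget_consumed(6, 6):.0%} of a 30-day budget")
```

A 1.44% error ratio against a 99.9% SLO is a 14.4x burn, which would exhaust a 30-day budget in about two days if left alone.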
aws cloudwatch put-metric-alarm \
--alarm-name "P99-Latency-High" \
--metric-name TargetResponseTime \
--namespace AWS/ApplicationELB \
--dimensions Name=LoadBalancer,Value=app/... \
--extended-statistic p99 \
--period 60 --evaluation-periods 5 \
--threshold 0.5 \
--comparison-operator GreaterThanThreshold
# Composite alarms (reduce noise):
aws cloudwatch put-composite-alarm \
--alarm-name "Service-Degraded" \
--alarm-rule "ALARM(ErrorRateHigh) AND ALARM(LatencyHigh)"
# Alert only when BOTH conditions are true simultaneously
# Alert hierarchy:
# P1 (page immediately): SLO burn rate 14x+, complete outage
# P2 (notify in 30 min): SLO burn rate 6x+, partial degradation
# P3 (review next day): capacity warning, non-critical anomaly
# Centralized logging architecture:
# Each account: App → CloudWatch Logs → Kinesis Firehose subscription
# Central Log Archive account: Firehose → S3 (compressed, partitioned)
# Organization-level trail (CloudTrail):
aws cloudtrail create-trail \
--name org-trail \
--s3-bucket-name central-cloudtrail-logs \
--is-multi-region-trail \
--enable-log-file-validation \
--is-organization-trail
# Application logs via Fluent Bit (EKS/ECS):
[OUTPUT]
    Name            kinesis_firehose
    Match           app.*
    region          us-east-1
    delivery_stream app-logs-${ACCOUNT_ID}
# Firehose → S3 with account prefix:
# s3://central-logs/account=111222333/year=2024/month=01/app-logs.gz
# OpenSearch for searchable logs:
# Firehose can also deliver to OpenSearch Service
# Fine-grained access: per-index permissions (team A sees only their logs)
# Log retention tiers:
# Application debug logs: 7 days (CloudWatch) → delete
# Application error logs: 30 days hot → 1 year S3 IA
# Security/audit logs: 1 year hot → 7 years S3 Glacier (compliance)
# Protect against log tampering:
# S3 Object Lock (WORM) on security logs
# SCPs deny s3:DeleteObject on the central logs bucket
# CloudTrail log file validation (SHA-256 hash per log file)
# CloudTrail: records every AWS API call
# Who (principal), what (API action), when, where (IP), on what (resource ARN)
# Critical CloudWatch Metric Filters on CloudTrail logs:
# Root account usage:
'{ $.userIdentity.type = "Root" && $.userIdentity.invokedBy NOT EXISTS }'
# IAM policy changes:
'{ $.eventName = "PutUserPolicy" || $.eventName = "AttachRolePolicy" }'
# Unauthorized API calls:
'{ $.errorCode = "AccessDenied" || $.errorCode = "UnauthorizedOperation" }'
# Console login without MFA:
'{ $.eventName = "ConsoleLogin" && $.additionalEventData.MFAUsed = "No" }'
# Security group changes:
'{ $.eventName = "AuthorizeSecurityGroupIngress" }'
# CloudTrail Lake (serverless SQL queries on API history):
SELECT userIdentity.arn, eventName, errorCode, COUNT(*) as count
FROM trail_event_data_store
WHERE errorCode IS NOT NULL
AND eventTime > '2024-01-01'
GROUP BY 1, 2, 3
ORDER BY count DESC
LIMIT 20;
# Retention: up to 7 years (up to 10 with the one-year extendable pricing option); SQL queries with no S3 setup required
# Protect against disabling CloudTrail:
# SCP: Deny cloudtrail:StopLogging, cloudtrail:DeleteTrail for all non-platform accounts
# SNS notification on any StopLogging event
# S3 Object Lock on trail bucket (WORM mode)
# Trace: unique ID for a complete request journey across services
# Span: one service's work within a trace (start/end time, attributes, errors)
# Context propagation: trace ID passed in HTTP headers between services
# W3C Trace Context header: traceparent: 00-{traceId}-{spanId}-{flags}
# AWS X-Ray header: X-Amzn-Trace-Id: Root=1-xxx;Parent=xxx;Sampled=1
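To make the W3C header format concrete, the sketch below parses a version-00 `traceparent` value and derives the header for an outgoing call. The function names are illustrative, and headers from future versions with extra fields are simply rejected here.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled) or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip().lower())
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def child_traceparent(header):
    """Build the header for an outgoing call: same trace ID, new span ID."""
    parsed = parse_traceparent(header)
    if parsed is None:
        return None
    trace_id, _, sampled = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{'01' if sampled else '00'}"
```

The trace ID stays constant across every hop while each service substitutes its own span ID, which is what lets a backend stitch the spans into one trace.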
# OpenTelemetry instrumentation:
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process-payment") as span:
    span.set_attribute("payment.amount", amount)
    span.set_attribute("payment.currency", "USD")
    try:
        result = charge_stripe(amount)
        span.set_attribute("payment.status", "success")
    except Exception as e:
        span.record_exception(e)
        span.set_status(Status(StatusCode.ERROR))
        raise
# Sampling strategies:
# Head-based: decision made at trace start (simple, cheaper)
# - Fixed rate: sample 5% → misses rare errors
# - Cannot guarantee 100% of 4xx/5xx: the outcome is unknown when the decision is made (needs tail-based sampling)
# Tail-based (OTEL Collector): decision after trace completes
# - Sample 100% of slow (>1s) and error traces; 1% of normal
# - More useful data, higher collector cost
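The tail-based policy described above amounts to a decision function over a completed trace. This sketch assumes each span is a dict with `duration_ms` and `error` keys, which is an invented shape for illustration rather than the OpenTelemetry Collector's data model.

```python
import random


def keep_trace(spans, slow_ms=1000, baseline_rate=0.01, rand=random.random):
    """Tail-based decision made once all spans of a trace are available."""
    if any(span["error"] for span in spans):
        return True                       # keep every trace containing an error
    slowest = max((span["duration_ms"] for span in spans), default=0)
    if slowest > slow_ms:
        return True                       # keep every slow trace
    return rand() < baseline_rate         # keep a small share of normal traces
```

Because the decision needs the whole trace, the collector must buffer spans until the trace completes, which is where the extra cost comes from.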
# Correlate traces with logs:
# Add trace_id to log entries → click log → see full trace
import logging
logging.info("Payment processed", extra={"trace_id": get_trace_id(), "order_id": oid})
# AWS CloudWatch:
# Strengths: deep AWS service integration, composite alarms, anomaly detection,
# Lambda Insights, Contributor Insights, cross-account dashboards
# Logs Insights: fast SQL-like queries; EMF (Embedded Metric Format) for custom metrics
# Gaps: Grafana integration needed for advanced visualization
# GCP Cloud Monitoring:
# Uptime checks: HTTP/TCP from global locations (≈ Route 53 health checks)
# SLO monitoring built-in: define SLOs, auto-track error budget
# Cloud Logging: structured logs, log-based metrics, BigQuery export for analytics
# Cloud Trace: auto-instrumented for GCP services, linked with logs
# Managed Prometheus: GCP manages Prometheus backend storage
# Azure Monitor:
# Application Insights: full APM — requests, exceptions, dependencies, live metrics
# Log Analytics Workspace: central log store with KQL (Kusto Query Language)
# KQL example: requests | where duration > 1000 | summarize count() by bin(timestamp, 5m)
# Azure Monitor Alerts: metric, log-based, resource health, activity log
# Azure Managed Grafana: hosted Grafana with native Microsoft Entra ID (formerly Azure AD) integration
# Third-party tools (cloud-agnostic):
# Datadog: best unified APM+infra, expensive
# New Relic: full-stack observability, competitive pricing
# Grafana Cloud: Prometheus + Loki + Tempo (metrics + logs + traces) — open-source based
# Honeycomb: best for high-cardinality event data, tail-based sampling
Security & Compliance
6 questions
# GuardDuty: ML-based threat detection
# Analyzes: CloudTrail, VPC Flow Logs, DNS logs, EKS audit logs
# Detects: compromised credentials, crypto-mining, unusual API patterns, port scans
# Enable with one click, no agents, no configuration
aws guardduty create-detector --enable --finding-publishing-frequency SIX_HOURS
# Finding types:
# UnauthorizedAccess:IAMUser/MaliciousIPCaller
# CryptoCurrency:EC2/BitcoinTool.B
# Recon:EC2/PortProbeUnprotectedPort
# PrivilegeEscalation:IAMUser/AdministrativePermissions
# Security Hub: aggregates findings from GuardDuty, Inspector, Macie + third-party
# Runs automated compliance checks (CIS AWS Foundations, AWS Foundational Security)
# Security score per account and per control
# Aggregates across all org accounts into one pane
# Macie: ML-based sensitive data discovery in S3
# Detects: PII, financial data, credentials, health records
# Inventory: which S3 buckets have sensitive data
# Alerts: unencrypted buckets, public buckets, anomalous access patterns
# AWS Config: continuous compliance monitoring
# Rule: "all S3 buckets must have encryption enabled"
# Auto-remediation: Lambda or SSM Document runs when non-compliant
aws config put-config-rule --config-rule '{
"ConfigRuleName": "s3-bucket-server-side-encryption-enabled",
"Source": {"Owner": "AWS", "SourceIdentifier": "S3_BUCKET_SERVER_SIDE_ENCRYPTION_ENABLED"}
}'
# Encryption at rest — enforce via SCP + Config rules:
# S3: default encryption with SSE-S3 (free) or SSE-KMS (audit trail)
# RDS/Aurora: enable storage encryption at creation (an existing instance can only be encrypted by restoring from an encrypted snapshot copy)
# DynamoDB: encrypted by default (SSE with AWS managed key)
# EBS: turn on encryption by default (opt-in account setting, per region)
# Secrets Manager / Parameter Store: encrypted by KMS
# Envelope encryption with KMS:
# KMS Customer Managed Key (CMK) encrypts DEK (Data Encryption Key)
# DEK encrypts the actual data locally (fast, no KMS API per record)
# Only one KMS call to decrypt DEK → decrypt all data with local DEK
import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
import os
kms = boto3.client('kms')
KEY_ID = 'arn:aws:kms:us-east-1:ACCOUNT:key/KEY-ID'
def encrypt(plaintext):
    dek = os.urandom(32)  # generate local DEK (kms.generate_data_key is the managed alternative)
    nonce = os.urandom(12)
    ciphertext = AESGCM(dek).encrypt(nonce, plaintext, None)
    encrypted_dek = kms.encrypt(KeyId=KEY_ID, Plaintext=dek)['CiphertextBlob']
    return nonce + ciphertext, encrypted_dek  # store both; kms.decrypt recovers the DEK later
# Encryption in transit:
# TLS 1.3 for all external traffic (ALB policy: ELBSecurityPolicy-TLS13-1-2-2021-06)
# mTLS for service-to-service (Istio/App Mesh in EKS)
# ACM (Certificate Manager): free TLS certs, auto-renewal for ALB/CloudFront/API GW
# VPC traffic: encrypted by default within AWS network for Nitro instances
# Top cloud misconfigurations and prevention:
# 1. Public S3 buckets (a frequent source of data leaks):
# Prevention: S3 Block Public Access at account level (on by default for new buckets since April 2023)
aws s3control put-public-access-block --account-id ACCOUNT --public-access-block-configuration \
'{"BlockPublicAcls":true,"IgnorePublicAcls":true,"BlockPublicPolicy":true,"RestrictPublicBuckets":true}'
# AWS Config rule: s3-bucket-public-read-prohibited
# 2. Overly permissive security groups (0.0.0.0/0 on admin ports):
# Prevention: AWS Config rule: restricted-ssh, restricted-common-ports
# SCPs cannot inspect a rule's CIDR; use Config auto-remediation or Firewall Manager policies to remove 0.0.0.0/0 on 22/3389
# Use SSM Session Manager instead of SSH (no port 22 needed)
# 3. IAM users with access keys (static, long-lived):
# Prevention: SCP deny iam:CreateAccessKey; require use of roles
# AWS Config: access-keys-rotated, iam-user-unused-credentials-check (plus iam-no-inline-policy-check, iam-user-no-policies-check for policy hygiene)
# IAM Access Analyzer: find unused access
# 4. Unencrypted databases and S3:
# Prevention: bucket policy or SCP denying s3:PutObject unless s3:x-amz-server-side-encryption = aws:kms
# AWS Config: rds-storage-encrypted, s3-bucket-server-side-encryption-enabled
# 5. IMDSv1 enabled on EC2 (SSRF → metadata credentials; the 2019 Capital One breach vector):
# Prevention: require IMDSv2 at account level:
aws ec2 modify-instance-metadata-defaults \
--http-tokens required \
--http-put-response-hop-limit 1
# Tools:
# AWS Security Hub: automated compliance checks
# Prowler: open-source CIS Benchmark scanner
# ScoutSuite: multi-cloud security auditing
# Checkov / tfsec: IaC security scanning in CI/CD
# AWS Network Firewall: managed stateful network firewall with IDS/IPS (Layers 3-7)
# Rules: Suricata-compatible (IDS/IPS rules), domain-based filtering
# Use cases: block malicious domains, inspect egress, compliance
# Centralized egress inspection architecture:
# All VPC spoke accounts → TGW → Inspection VPC (Network Firewall) → NAT GW → Internet
# Inspection VPC subnets: TGW attachment → firewall subnet (NFW endpoint) → public subnet (NAT GW)
# Firewall rule group:
{
"RuleGroupName": "block-malicious-domains",
"Type": "STATEFUL",
"RuleGroup": {
"RulesSource": {
"RulesSourceList": {
"Targets": [".malware-site.com", ".phishing.com"],
"TargetTypes": ["HTTP_HOST", "TLS_SNI"],
"GeneratedRulesType": "DENYLIST"
}
}
}
}
# Domain allowlist for egress (zero-trust egress):
# Only allow specific domains: api.stripe.com, s3.amazonaws.com, etc.
# Block everything else: GeneratedRulesType: "ALLOWLIST"
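The matching logic behind a domain allowlist can be illustrated in a few lines. This sketch assumes a leading dot matches the domain itself and all of its subdomains, while a bare name matches exactly; it is an approximation for reasoning about rules, not the firewall's implementation.

```python
def domain_allowed(hostname, allowlist):
    """Return True if hostname matches an allowlist entry."""
    host = hostname.lower().rstrip(".")
    for target in allowlist:
        target = target.lower()
        if target.startswith("."):
            base = target[1:]
            # ".example.com" covers example.com and any subdomain of it.
            if host == base or host.endswith(target):
                return True
        elif host == target:
            return True
    return False
```

For example, with `["api.stripe.com", ".amazonaws.com"]` the host `s3.amazonaws.com` is allowed, while `evil.stripe.com` and `notamazonaws.com` are not.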
# Compliance: PCI-DSS requires all outbound connections to be justified and controlled
# vs WAF:
# WAF: protects ingress (HTTP requests to your APIs)
# Network Firewall: protects egress (traffic leaving your VPCs) and can inspect ingress at Layer 4/7
# GCP equivalent: Cloud Next Generation Firewall (Cloud NGFW)
# Azure equivalent: Azure Firewall Premium
Shared Responsibility Model: The cloud provider is responsible for security of the cloud (physical hardware, network infrastructure, hypervisor). The customer is responsible for security in the cloud (OS, applications, data, IAM, network configuration).
# AWS handles: physical facilities, network, hardware, managed service security
# You handle: guest OS, application code, IAM, data encryption, network configuration
# Compliance frameworks and AWS:
# PCI-DSS: payment card data
# - AWS has PCI DSS Level 1 compliant services
# - You: encrypt cardholder data, access controls, audit logs, WAF
# - Use: Macie to detect card data in S3, VPC isolation for cardholder environment
# HIPAA: healthcare data (US)
# - Sign AWS Business Associate Agreement (BAA)
# - Use only HIPAA-eligible services
# - Encrypt PHI at rest + in transit; access controls; audit logs
# SOC 2: security, availability, confidentiality
# - AWS provides SOC 2 Type II report (covers their infrastructure)
# - You: application controls, change management, access reviews
# - AWS Audit Manager: automates evidence collection for SOC 2
# GDPR: EU personal data
# - Data residency: keep personal data in EU regions (e.g., eu-west-1, eu-central-1) unless an approved transfer mechanism covers data leaving the EU
# - Data subject rights: deletion, portability (design into your application)
# - DPA: sign AWS Data Processing Addendum
# Infrastructure as Code for compliance:
# Terraform Sentinel policies: enforce compliance rules on every apply
# AWS Config: continuous compliance monitoring with auto-remediation
# AWS Audit Manager: automated evidence collection → compliance reports
# AWS WAF: Layer 7 firewall, attach to ALB, CloudFront, API GW, AppSync
# Rule groups: AWS Managed (most at no extra charge; some, such as Bot Control, are paid), Marketplace (paid), Custom
# Web ACL with common protections:
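aws wafv2 create-web-acl --name my-web-acl --scope REGIONAL \
--default-action Allow={} \
--visibility-config SampledRequestsEnabled=true,CloudWatchMetricsEnabled=true,MetricName=my-web-acl \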
--rules '[
{
"Name": "AWSManagedRulesCommonRuleSet",
"Priority": 1,
"OverrideAction": {"None": {}},
"Statement": {"ManagedRuleGroupStatement": {
"VendorName": "AWS", "Name": "AWSManagedRulesCommonRuleSet"
}},
"VisibilityConfig": {...}
},
{
"Name": "RateLimit",
"Priority": 2,
"Action": {"Block": {}},
"Statement": {"RateBasedStatement": {
"Limit": 2000,
"AggregateKeyType": "IP"
}},
"VisibilityConfig": {...}
}
]'
# Managed rule groups:
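# AWSManagedRulesCommonRuleSet: core rule set for common web exploits (XSS, LFI, RFI, SSRF, oversized requests)
# AWSManagedRulesSQLiRuleSet: SQL injection patterns (separate from the common rule set)
# AWSManagedRulesKnownBadInputsRuleSet: Log4Shell, Spring4Shell, etc.
# AWSManagedRulesAmazonIpReputationList: known malicious IPs, botnets
# AWSManagedRulesBotControlRuleSet: bot detection (paid add-on: monthly fee plus per-request charges)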
# Geo-blocking:
# Block specific countries or allow-list only your target countries
# Note: VPN/proxy bypass is possible — defense in depth, not sole control
# WAF logs → Kinesis Firehose → S3 → Athena for analysis
# Detect attack patterns, tune rules, build custom block lists
Cost Optimization
6 questions
FinOps (Cloud Financial Operations) is a practice that brings engineering, finance, and business together to take ownership of cloud spending, enabling faster product delivery while maintaining financial control.
# FinOps lifecycle: Inform → Optimize → Operate
# Inform: make costs visible
# AWS Cost Explorer: visualize spending by service, account, tag
# Cost Allocation Tags: tag every resource (app, team, environment, cost-center)
# Budget alerts: SNS notification when actual or forecasted spend crosses a threshold (the example alerts at 80% of actual spend)
aws budgets create-budget --account-id ACCOUNT --budget '{
"BudgetName": "payment-service-monthly",
"BudgetLimit": {"Amount": "1000", "Unit": "USD"},
"TimeUnit": "MONTHLY",
"BudgetType": "COST",
"CostFilters": {"TagKeyValue": ["user:app$payment-service"]}
}' --notifications-with-subscribers '[{
"Notification": {"NotificationType": "ACTUAL", "ComparisonOperator": "GREATER_THAN",
"Threshold": 80, "ThresholdType": "PERCENTAGE"},
"Subscribers": [{"SubscriptionType": "SNS", "Address": "arn:aws:sns:us-east-1:ACCOUNT:cost-alerts"}]
}]'
# Unit economics: cost per transaction, cost per user, cost per API call
# Enables developers to understand the financial impact of their code
# Cloud billing accountability:
# Chargeback: allocate costs to teams/products (internal billing)
# Showback: show costs without internal charging (awareness without punishment)
# Both require consistent tagging and org structure
# Cost anomaly detection:
aws ce create-anomaly-monitor --anomaly-monitor '{
"MonitorName": "service-anomalies",
"MonitorType": "DIMENSIONAL",
"MonitorDimension": "SERVICE"
}'
# Alerts when a service's spend deviates significantly from historical pattern
# Savings Plans: commit to $/hr spend, automatic application
# Compute Savings Plans: most flexible — EC2 any family/region/OS + Fargate + Lambda
# EC2 Instance Savings Plans: locked to instance family + region, highest discount (up to 72%)
# SageMaker Savings Plans: ML training/inference
# Reserved Instances: commit to specific instance configuration
# Standard RI: exact instance type/AZ/OS, up to 72% discount
# Convertible RI: can exchange for different type, up to 54% discount
# Scope: regional (flexible) or zonal (capacity reservation)
# Decision framework:
# 1. Right-size BEFORE committing (otherwise you lock in spend on oversized instances)
# 2. Analyze 3+ months of usage: identify stable baseline
# 3. Commit baseline to Compute Savings Plans (most flexible)
# 4. On-demand for variable peak capacity above baseline
# Typical strategy:
# - 60-70% of normalized compute → 1-year Compute Savings Plans
# - Additional 10-20% for specific heavy workloads → EC2 Instance SP or RIs
# - Remainder → On-demand and Spot
# Expected savings: 30-50% reduction in compute costs
# Payment options:
# All Upfront: best discount but ties up capital
# Partial Upfront: balanced
# No Upfront: slightly lower discount, no capital required
# All Upfront ties up cash for the whole term: at an 8% cost of capital it only wins if its extra discount over No Upfront outweighs the cost of that capital
# AWS Cost Explorer recommendations:
# Analyzes your usage → recommends specific SPs/RIs with projected savings
# Start with recommendations but validate against growth plans
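The decision framework above (commit the stable baseline, leave peaks on-demand) can be sketched as a small calculation. This is a simplified model and not how Cost Explorer derives its recommendations: usage is expressed in on-demand dollars per hour, a single flat discount is assumed, and the sample figures and function names are invented for illustration.

```python
def hourly_cost(usage: float, covered: float, discount: float) -> float:
    """Cost of one hour under a commitment.

    The covered amount is always paid (at the discounted rate), even when
    usage is lower; usage above it is billed at on-demand rates.
    """
    return covered * (1 - discount) + max(0.0, usage - covered)


def total_cost(samples: list[float], covered: float, discount: float) -> float:
    """Total cost over all sampled hours for a given covered amount."""
    return sum(hourly_cost(u, covered, discount) for u in samples)


def best_commitment(samples: list[float], discount: float) -> float:
    """Try zero and every observed usage level; return the cheapest coverage."""
    candidates = [0.0] + sorted(set(samples))
    return min(candidates, key=lambda c: total_cost(samples, c, discount))


# Hypothetical hourly spend (on-demand $) with a stable baseline and one spike
samples = [10, 10, 12, 15, 30, 10, 11, 10]
discount = 0.30  # assumed flat Savings Plan discount

covered = best_commitment(samples, discount)
print("cover", covered, "$/hr of on-demand-equivalent usage")
print("on-demand only:", total_cost(samples, 0.0, discount))
print("with commitment:", round(total_cost(samples, covered, discount), 2))
```

Covering the spike would mean paying for idle commitment most hours, which is why the cheapest coverage lands on the baseline rather than the peak.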
# Top cost optimization opportunities:
# 1. EC2 right-sizing (often 30-50% over-provisioned):
# AWS Compute Optimizer: ML-based recommendations from actual utilization
# Action: downsize instances with <20% p95 CPU; switch to Graviton (20-40% saving)
# 2. Unattached resources (pure waste):
# EBS volumes not attached to instances
# Unused Elastic IPs (~$4/month each)
# Idle load balancers with no targets
# Old snapshots (set lifecycle policies)
aws ec2 describe-volumes --filters Name=status,Values=available # unattached volumes
aws ec2 describe-addresses --filters Name=domain,Values=vpc | jq '.Addresses[] | select(.AssociationId == null)'
# 3. Data transfer costs (often 20-30% of bill):
# S3 → CloudFront instead of S3 → Internet (CloudFront transfer is cheaper)
# VPC Gateway endpoints for S3/DynamoDB (free — avoids NAT Gateway fees)
# Same-AZ traffic: keep chatty app-to-DB traffic in one AZ where HA requirements allow (cross-AZ costs $0.01/GB each way)
# NAT Gateway: $0.045/GB → replace with Interface Endpoints for AWS services
# 4. RDS over-provisioned:
# Multi-AZ standby instance is running 24/7 — use Aurora Serverless v2 for variable load
# Read replicas not needed → remove during off-peak
# 5. Over-retained CloudWatch Logs (can be 5-15% of bill):
# Set log group retention: 7 days for debug, 30 days for app, 365 for audit
# Export to S3 after retention window instead of keeping in CW
# 6. S3 storage class optimization:
# S3 Intelligent-Tiering or lifecycle policies to move to IA/Glacier
# S3 Storage Lens: identify large buckets with infrequent access
# AWS data transfer pricing (approximate):
# Inbound to AWS: FREE
# EC2 → Internet: $0.09/GB (first 10 TB), then lower tiers
# EC2 → S3 (same region): FREE
# EC2 → S3 (different region): $0.02/GB
# EC2 AZ-to-AZ (same region): $0.01/GB each direction
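# CloudFront → Internet: ~$0.085/GB (first 10 TB, US/Europe; lower at higher tiers)
# Optimization strategies:
# 1. CloudFront for S3 content delivery:
# S3 → Internet: $0.09/GB; S3 → CloudFront (origin fetch): FREE; CF → Internet: ~$0.085/GB
# 10TB/month: S3 direct = ~$900, via CloudFront = ~$850 at list price; the larger savings come from the CloudFront free tier, fewer S3 GET requests thanks to caching, and committed-use discounts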
# 2. VPC Interface/Gateway Endpoints (eliminate NAT Gateway costs):
# NAT Gateway: $0.045/GB processed + $0.045/hr
# Interface endpoint: $0.01/GB + $0.01/hr
# For Lambda/ECS calling S3, Secrets Manager, ECR: big savings
# 3. Same-AZ for high-volume traffic:
# Put RDS Read Replica and app in same AZ for reads ($0 vs $0.01/GB)
# ElastiCache cluster nodes in same AZ as primary consumers
# 4. Compress data:
# Enable GZIP compression on Firehose deliveries to S3 (e.g., 70% smaller payloads = ~70% less data to store and transfer)
# Enable ALB/CloudFront compression for HTTP responses
# 5. Direct Connect for large on-prem transfers:
# DX egress: ~$0.02/GB vs Internet: $0.09/GB
# Break-even is typically a few TB/month of on-prem traffic (depends on DX port and circuit cost)
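To make strategy 2 above concrete, the NAT Gateway and interface endpoint prices quoted there can be dropped into a rough monthly comparison. This is a back-of-the-envelope sketch: the 730-hour month, the 5 TB volume, and the three-AZ endpoint count are assumptions, and in practice a NAT Gateway often stays in place for non-AWS destinations, so only its per-GB charge disappears.

```python
HOURS_PER_MONTH = 730  # common billing approximation


def monthly_cost(gb: float, per_gb: float, per_hour: float, count: int = 1) -> float:
    """Data processing charge plus hourly charge for `count` gateways/endpoints."""
    return gb * per_gb + per_hour * HOURS_PER_MONTH * count


traffic_gb = 5_000  # hypothetical monthly traffic to AWS services

nat = monthly_cost(traffic_gb, per_gb=0.045, per_hour=0.045)
# One interface endpoint per AZ, three AZs assumed
endpoint = monthly_cost(traffic_gb, per_gb=0.01, per_hour=0.01, count=3)

print(f"NAT Gateway:         ${nat:,.2f}/month")
print(f"Interface endpoints: ${endpoint:,.2f}/month")
```

Gateway endpoints for S3 and DynamoDB carry no hourly or per-GB charge, so for those two services the comparison is even more one-sided.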
# Lambda cost: $0.0000002/request + $0.0000166667/GB-second
# Optimization:
# 1. Right-size memory: more memory = faster = potentially cheaper
# Lambda Power Tuning tool: test 10 memory sizes → find optimal price/perf
# For CPU-bound functions, doubling memory (and with it CPU) can cut duration by more than half → net cost reduction
# 2. Reduce duration: move initialization outside handler, efficient libraries
# 3. ARM/Graviton2 (arm64): ~20% cheaper per GB-second, with up to 19% better performance on suitable workloads
# 4. Batch SQS messages: 10 messages per invocation vs 10 separate invocations
# 5. Reuse HTTP connections: requests.Session() reused across warm invocations
# Fargate cost: ~20-30% premium over equivalent EC2 (no management overhead)
# When EC2 is cheaper: steady-state workloads > 50% utilization
# When Fargate wins: variable workloads, Spot interruptions, ops complexity cost
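The Lambda pricing formula quoted above explains why more memory can be cheaper. The sketch below plugs in those two list prices; the invocation count and the two memory/duration pairs are hypothetical measurements of the kind Lambda Power Tuning would produce, and the free tier is ignored.

```python
def lambda_monthly_cost(
    invocations: int,
    avg_ms: float,
    memory_mb: int,
    per_request: float = 0.0000002,
    per_gb_second: float = 0.0000166667,
) -> float:
    """Request charge plus compute charge (GB-seconds) for one month."""
    gb_seconds = invocations * (avg_ms / 1000) * (memory_mb / 1024)
    return invocations * per_request + gb_seconds * per_gb_second


# Hypothetical CPU-bound function: doubling memory more than halves duration
small = lambda_monthly_cost(10_000_000, avg_ms=800, memory_mb=512)
large = lambda_monthly_cost(10_000_000, avg_ms=350, memory_mb=1024)

print(f"512 MB @ 800 ms:  ${small:.2f}")
print(f"1024 MB @ 350 ms: ${large:.2f}")
```

The larger configuration only wins because duration fell by more than half; for I/O-bound functions that spend their time waiting, extra memory just raises the bill.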
# DynamoDB cost optimization:
# On-demand vs Provisioned:
# On-demand: $1.25/million write request units, $0.25/million read request units; pay per use
# Provisioned: $0.00065/WCU-hour, $0.00013/RCU-hour; commit to capacity
# Per request, on-demand costs roughly 6-7x a fully utilized provisioned unit
# Break-even: provisioned wins once average utilization of the provisioned capacity exceeds roughly 15-20%
# DynamoDB cost killers:
# Full table scans (Scan API) consume capacity proportional to table size
# Large items: 10KB item = 10 WCUs to write vs 1KB item = 1 WCU
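# Returning all attributes when you need 2: ProjectionExpression trims the response payload but not consumed RCUs (capacity is charged on the item size read), so keep items small and project only needed attributes into GSIs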
# GSI storage: each GSI stores a copy of projected attributes
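The on-demand versus provisioned comparison above reduces to a utilization break-even. A minimal sketch using the write prices quoted in this section follows; prices vary by region and over time, so treat the inputs as placeholders, and note the function names are illustrative.

```python
SECONDS_PER_HOUR = 3600  # one WCU sustains one 1 KB write per second


def provisioned_cost_per_million(price_per_unit_hour: float) -> float:
    """Cost of a million writes if a provisioned WCU is busy every second."""
    return price_per_unit_hour / SECONDS_PER_HOUR * 1_000_000


def break_even_utilization(on_demand_per_million: float, price_per_unit_hour: float) -> float:
    """Average utilization above which provisioned capacity is cheaper."""
    return provisioned_cost_per_million(price_per_unit_hour) / on_demand_per_million


full_use = provisioned_cost_per_million(0.00065)
ratio = 1.25 / full_use
utilization = break_even_utilization(1.25, 0.00065)

print(f"provisioned at 100% utilization: ${full_use:.3f} per million writes")
print(f"on-demand premium: {ratio:.1f}x")
print(f"break-even utilization: {utilization:.1%}")
```

Rules of thumb usually round the break-even up, since auto scaling needs headroom and real traffic never fills provisioned capacity perfectly.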
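# Tagging strategy: required tags enforced via SCP
# One Deny statement per tag: keys inside a single "Null" block are ANDed,
# so combining them would deny only requests that are missing ALL three tags.
# In practice, ec2:RunInstances is usually scoped to arn:aws:ec2:*:*:instance/* so untagged sub-resources don't trip the deny.
{
"Version": "2012-10-17",
"Statement": [
{"Sid": "RequireAppTag", "Effect": "Deny", "Resource": "*",
"Action": ["ec2:RunInstances", "rds:CreateDBInstance", "lambda:CreateFunction"],
"Condition": {"Null": {"aws:RequestTag/app": "true"}}},
{"Sid": "RequireEnvTag", "Effect": "Deny", "Resource": "*",
"Action": ["ec2:RunInstances", "rds:CreateDBInstance", "lambda:CreateFunction"],
"Condition": {"Null": {"aws:RequestTag/env": "true"}}},
{"Sid": "RequireTeamTag", "Effect": "Deny", "Resource": "*",
"Action": ["ec2:RunInstances", "rds:CreateDBInstance", "lambda:CreateFunction"],
"Condition": {"Null": {"aws:RequestTag/team": "true"}}}
]
}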
# Required tags: app, env (prod/staging/dev), team, cost-center
# Enforce via SCP for creates; use Config rule for existing resources
# Budget hierarchy:
# Org total → account → team → application
# Budget action: auto-apply an SCP that denies non-essential creates once 80% of the budget is consumed
# Auto-remediation for waste:
# Lambda: find and terminate EC2 instances idle > 7 days
# EventBridge scheduled: daily scan for unattached EBS volumes → SNS alert
# AWS Config remediation: attach IAM password policy, enable S3 versioning
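# Tagging compliance report:
# --tag-filters Key=env would list resources that HAVE the tag; to find offenders, list everything and filter for the missing key
aws resourcegroupstaggingapi get-resources \
--resource-type-filters ec2:instance \
| jq -r '.ResourceTagMappingList[] | select(any(.Tags[]; .Key == "env") | not) | .ResourceARN'
# Resources missing required tags → send to team Slack channel
# Caveat: this API only returns resources that are, or previously were, tagged; use the AWS Config required-tags rule for full coverage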
# Cost allocation: AWS Cost and Usage Report (CUR) → S3 → Athena → QuickSight
# Dashboard: cost by team, cost per feature, savings vs last month
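The tagging compliance report above boils down to a set difference per resource. A minimal sketch follows, assuming input shaped like the `ResourceTagMappingList` entries returned by the tagging API; the ARNs and tag values are invented, and the function names are hypothetical.

```python
REQUIRED_TAGS = {"app", "env", "team", "cost-center"}


def missing_tags(resource: dict, required: set[str] = REQUIRED_TAGS) -> set[str]:
    """Return the required tag keys that are absent from one resource."""
    present = {tag["Key"] for tag in resource.get("Tags", [])}
    return required - present


def compliance_report(resources: list[dict]) -> dict[str, list[str]]:
    """Map each non-compliant resource ARN to its sorted missing tag keys."""
    report = {}
    for resource in resources:
        gaps = missing_tags(resource)
        if gaps:
            report[resource["ResourceARN"]] = sorted(gaps)
    return report


# Invented sample data in the tagging API's response shape
resources = [
    {"ResourceARN": "arn:aws:ec2:us-east-1:111111111111:instance/i-aaa",
     "Tags": [{"Key": "app", "Value": "payments"}, {"Key": "env", "Value": "prod"},
              {"Key": "team", "Value": "core"}, {"Key": "cost-center", "Value": "42"}]},
    {"ResourceARN": "arn:aws:ec2:us-east-1:111111111111:instance/i-bbb",
     "Tags": [{"Key": "app", "Value": "search"}]},
]

for arn, gaps in compliance_report(resources).items():
    print(arn, "is missing", ", ".join(gaps))
```

The resulting dictionary is easy to group by owning team before posting to a chat channel or dashboard.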
HA & Disaster Recovery
6 questions
RPO (Recovery Point Objective): Maximum acceptable data loss measured in time. "We can afford to lose up to 1 hour of orders." RPO determines backup/replication frequency.
RTO (Recovery Time Objective): Maximum acceptable downtime. "We must be back online within 4 hours." RTO determines how quickly you must fail over.
# DR strategies (cheapest to most expensive, slowest to fastest recovery):
# 1. Backup & Restore (RTO: hours; RPO: hours)
# Cost: low — just backup storage
# S3 backups + AMIs; restore from backup when disaster strikes
# Use for: non-critical systems, dev/test environments
# 2. Pilot Light (RTO: 30-60 min; RPO: minutes)
# Core DB replication running; minimal EC2 in DR region
# Scale out EC2 on disaster; redirect DNS
# Use for: important but not time-critical systems
# 3. Warm Standby (RTO: 10-30 min; RPO: seconds)
# Reduced-capacity replica running at all times in DR region
# Scale up on disaster (already running, just needs more capacity)
# Use for: important systems, moderate budget
# 4. Active-Active / Multi-Site (RTO: near-zero; RPO: near-zero)
# Full capacity in all regions simultaneously; traffic split
# Route 53 latency/geolocation routing
# Use for: critical revenue-generating systems
# Cost: 2x infrastructure
# Matching strategy to business requirements:
# RPO=0, RTO=0: Active-active (most expensive)
# RPO<1min, RTO<15min: Warm standby
# RPO<1hr, RTO<4hr: Pilot light
# RPO<24hr, RTO<24hr: Backup and restore
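The matching table above can be read as a lookup from business requirements to the cheapest adequate strategy. The sketch below encodes one interpretation of those bands, with the tighter of the two requirements deciding; the thresholds come from the table, while the function name and the tie-breaking rule are assumptions.

```python
from datetime import timedelta


def choose_dr_strategy(rpo: timedelta, rto: timedelta) -> str:
    """Pick the cheapest DR strategy whose band satisfies both objectives."""
    zero = timedelta(0)
    if rpo <= zero or rto <= zero:
        return "active-active"
    if rpo < timedelta(minutes=1) or rto < timedelta(minutes=15):
        return "warm-standby"
    if rpo < timedelta(hours=1) or rto < timedelta(hours=4):
        return "pilot-light"
    return "backup-and-restore"


requirements = {
    "internal wiki": (timedelta(hours=12), timedelta(hours=24)),
    "order history": (timedelta(minutes=30), timedelta(hours=2)),
    "checkout":      (timedelta(seconds=30), timedelta(minutes=10)),
    "payments core": (timedelta(0), timedelta(0)),
}

for system, (rpo, rto) in requirements.items():
    print(f"{system}: {choose_dr_strategy(rpo, rto)}")
```

A system with a relaxed RPO but a tight RTO still lands on the more expensive strategy, which is often the surprise in these conversations.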
# AZ: one or more physically separate data centers with independent power, cooling, and networking, in the same AWS region
# Correlated failures in same AZ → application continues from other AZs
# Multi-AZ HA architecture:
# ALB: spans 3 AZs automatically, fails over in seconds
# EC2 ASG: min=3, desired=6, spread across 3 AZs (AZRebalance policy)
# RDS Multi-AZ: synchronous standby in second AZ, ~1-2min failover
# ElastiCache: multi-AZ with auto-failover enabled
# ECS: spread placement strategy across AZs; EKS: pods spread with topologySpreadConstraints:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels: {app: my-service}
# Aurora: 6 storage copies across 3 AZs by default; 1-2 reader instances in other AZs
# Failure testing with AWS Fault Injection Simulator (FIS):
# Simulate AZ failure: terminate all instances in us-east-1b
# Verify: application continues from us-east-1a and us-east-1c
# Chaos engineering: validate theoretical HA with actual failure injection
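# Availability figures (AWS SLAs and common design targets):
# Single EC2 instance: 99.5% (~43.8 hrs downtime/year)
# Multi-AZ (2 AZ): 99.99% (~52.6 min downtime/year)
# Multi-AZ (3 AZ) design target: 99.999% (~5.3 min downtime/year); a design goal, not a published SLA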
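Availability percentages translate to downtime budgets with simple arithmetic, and redundancy compounds them. The sketch below shows both calculations; the redundancy formula assumes failures are independent, which real AZs only approximate, and the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime for a given availability fraction."""
    return (1 - availability) * HOURS_PER_YEAR


def redundant_availability(availability: float, copies: int) -> float:
    """Availability of N independent copies where any one surviving is enough."""
    return 1 - (1 - availability) ** copies


for label, a in [("99.5%", 0.995), ("99.99%", 0.9999), ("99.999%", 0.99999)]:
    hours = downtime_hours_per_year(a)
    print(f"{label}: {hours:.2f} hours/year ({hours * 60:.1f} minutes)")

# Two independent 99.5% instances behind a load balancer
print(f"2 x 99.5% in parallel: {redundant_availability(0.995, 2):.6f}")
```

Components in series multiply the other way: a request path through several dependencies is less available than its weakest link.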
# Multi-region failover architecture:
# Primary: us-east-1; Secondary: us-west-2
# Route 53 failover routing with health checks
# Data layer options:
# Aurora Global Database: <1s replication, managed failover (<1min)
# DynamoDB Global Tables: multi-master, eventually consistent
# S3 Cross-Region Replication: async (~minutes)
# RDS Read Replica: manual promotion on failure
# Infrastructure failover:
# Route 53 health check: monitors /health endpoint in primary region
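# On failure: health check marks the primary unhealthy (check interval × failure threshold), then cached DNS answers expire (TTL) → traffic shifts to secondary
# With 60s TTL: recovery in ~1.5-2.5 minutes (health-check detection time plus TTL expiry)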
# Challenges:
# 1. Data consistency during failover:
# Async replication → potential data loss (RPO = replication lag)
# Prevent split-brain: ensure primary is truly down before promoting secondary
# 2. Regional service dependencies:
# Does your app depend on services not available in DR region?
# Test full failover regularly (not just in theory)
# 3. Runbook complexity:
# DNS failover is automatic; DB promotion is often manual
# Automate everything: EventBridge → Lambda → promote RDS replica → update Secrets Manager
# 4. Cost:
# Warm standby = 50-100% additional cost for the standby region
# Balance against cost of downtime (revenue/hr * expected outage hours/year)
# AWS Global Accelerator (alternative to Route 53 failover):
# Anycast IPs, health checks, sub-minute failover (no DNS TTL issue)
# Better for latency-sensitive applications
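DNS-based failover time is roughly detection time plus cache expiry, which can be estimated up front. The sketch below assumes Route 53's standard 30-second and fast 10-second health check intervals with a failure threshold of 3; the function name is hypothetical and client-side DNS caching beyond the TTL is ignored.

```python
def dns_failover_seconds(interval_s: int = 30, failure_threshold: int = 3, ttl_s: int = 60) -> int:
    """Worst-case estimate: consecutive failed checks, then cached records expire."""
    detection = interval_s * failure_threshold
    return detection + ttl_s


standard = dns_failover_seconds()            # 30s checks, 60s TTL
fast = dns_failover_seconds(interval_s=10)   # fast health checks

print(f"standard checks: ~{standard}s")
print(f"fast checks:     ~{fast}s")
```

Lowering the TTL shortens the second term but increases DNS query volume, while Global Accelerator removes the TTL term entirely.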
# Chaos engineering: deliberately inject failures to validate resilience
# Principle: "Break things in controlled ways to find weaknesses before real failures do"
# AWS Fault Injection Simulator (FIS):
# Terminate EC2 instances (simulate AZ failure)
# Throttle EC2 API calls (simulate service degradation)
# Inject network latency (simulate slow dependencies)
# Force an RDS Multi-AZ failover
# Drain ECS container instances (tasks are rescheduled elsewhere)
# Sample FIS experiment: AZ failure simulation
{
"description": "Terminate instances in us-east-1b",
"actions": {
"TerminateInstances": {
"actionId": "aws:ec2:terminate-instances",
"parameters": {},
"targets": {"Instances": "ProductionInstances"}
}
},
"targets": {
"ProductionInstances": {
"resourceType": "aws:ec2:instance",
"resourceTags": {"env": "prod"},
"filters": [{"path": "Placement.AvailabilityZone", "values": ["us-east-1b"]}],
"selectionMode": "ALL"
}
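},
"stopConditions": [{"source": "aws:cloudwatch:alarm", "value": "arn:aws:cloudwatch:us-east-1:ACCOUNT:alarm:ErrorRateCritical"}],
"roleArn": "arn:aws:iam::ACCOUNT:role/FisExperimentRole"
}
# stopConditions.value must be the alarm ARN; roleArn (required) is the IAM role FIS assumes, and the role name here is a placeholder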
# Chaos engineering process:
# 1. Define steady state (normal system behavior with metrics)
# 2. Form hypothesis: "Terminating AZ-b instances will not affect availability"
# 3. Inject failure in staging, then production (during business hours, with paging)
# 4. Observe: does steady state hold? What failed unexpectedly?
# 5. Fix weaknesses found; repeat
# GameDay: scheduled chaos exercises with full team present
# Netflix Chaos Monkey → Chaos Monkey for AWS → FIS integration
# RDS / Aurora automated backups:
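# Daily snapshots + transaction logs → point-in-time recovery to any second within the retention window (up to the latest restorable time, typically within the last ~5 minutes)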
# Retention: 1-35 days; 0 = disable (not recommended for production)
# Cross-region copy: copy snapshots to DR region for additional protection
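aws rds copy-db-snapshot \
--source-db-snapshot-identifier arn:aws:rds:us-east-1:ACCOUNT:snapshot:my-db-snap \
--target-db-snapshot-identifier my-db-snap-dr \
--source-region us-east-1 \
--region us-west-2 # run the copy in the destination (DR) region; encrypted snapshots also need --kms-key-id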
# DynamoDB Point-in-Time Recovery (PITR):
aws dynamodb update-continuous-backups \
--table-name my-table \
--point-in-time-recovery-specification PointInTimeRecoveryEnabled=true
# Restore to any point in last 35 days (per-second granularity)
# AWS Backup: centralized backup across RDS, DynamoDB, EFS, EC2, S3, EKS
# Backup plan: schedule, retention, cross-region/cross-account copy
# The schedule below runs daily at 03:00 UTC (JSON does not allow inline comments)
aws backup create-backup-plan --backup-plan '{
"BackupPlanName": "daily-backup",
"Rules": [{
"RuleName": "daily",
"TargetBackupVaultName": "prod-backups",
"ScheduleExpression": "cron(0 3 * * ? *)",
"StartWindowMinutes": 60,
"CompletionWindowMinutes": 180,
"Lifecycle": {"DeleteAfterDays": 90},
"CopyActions": [{"DestinationBackupVaultArn": "arn:aws:backup:us-west-2:..."}]
}]
}'
# Restore testing (critical — validate backups actually work!):
# Monthly: restore DB snapshot to test environment and validate data integrity
# AWS Backup restore testing: automated restore + validation Lambda
Cell-based architecture (used by Amazon, Netflix, AWS itself) partitions workloads into independent cells that share nothing. A failure in one cell affects only the users mapped to that cell — not the entire system.
# Cell: complete, independent deployment with its own:
# - EC2/EKS cluster
# - Database (RDS or DynamoDB table)
# - Cache (ElastiCache)
# - Load balancer
# No shared infrastructure between cells
# Cell mapping: which users go to which cell?
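# Simple hashing: hash(user_id) % num_cells → cell assignment; prefer consistent hashing or an explicit mapping table so adding a cell does not remap most users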
# Stored in a lightweight "routing plane" (DynamoDB or Route 53 GeoDNS)
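The difference between modulo assignment and consistent hashing shows up when a cell is added. The sketch below builds a small hash ring with virtual nodes and compares how many users move under each scheme; the class and cell names are hypothetical, and a production routing plane would also persist assignments rather than recompute them.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class CellRing:
    """Consistent-hash ring with virtual nodes for smoother distribution."""

    def __init__(self, cells: list[str], vnodes: int = 100):
        self._points = sorted(
            (_hash(f"{cell}#{i}"), cell) for cell in cells for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def cell_for(self, user_id: str) -> str:
        # First ring point clockwise from the user's hash, wrapping at the end
        idx = bisect.bisect(self._keys, _hash(user_id)) % len(self._keys)
        return self._points[idx][1]


def modulo_cell(user_id: str, num_cells: int) -> str:
    return f"cell-{_hash(user_id) % num_cells}"


users = [f"user-{i}" for i in range(10_000)]
old_ring = CellRing([f"cell-{i}" for i in range(8)])
new_ring = CellRing([f"cell-{i}" for i in range(9)])  # one cell added

moved_ring = sum(old_ring.cell_for(u) != new_ring.cell_for(u) for u in users) / len(users)
moved_mod = sum(modulo_cell(u, 8) != modulo_cell(u, 9) for u in users) / len(users)

print(f"users remapped with hash ring: {moved_ring:.1%}")
print(f"users remapped with modulo:    {moved_mod:.1%}")
```

With the ring, every remapped user lands on the new cell, so existing cells only shed load and never exchange tenants with each other.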
# Benefits:
# Blast radius: a bug or infrastructure failure affects 1/N users (N = cell count)
# Deployments: roll out to 1 cell first, validate, then deploy to remaining cells
# Noisy neighbor: a misbehaving customer in cell 3 can't affect customers in cell 7
# How AWS itself uses cells:
# EC2 control plane: each AZ is a cell (launch instance in us-east-1a = cell 1)
# DynamoDB: each storage node is a cell
# S3: each partition (prefix range) served by independent cells
# When to consider cell-based architecture:
# You need better-than-multi-region availability (99.999%+)
# Large enough to justify the operational complexity
# Compliance: data locality requirements per tenant
# Examples: financial institutions, healthcare, large SaaS platforms
# AWS Shuffle Sharding (variant):
# Assign each customer a unique subset of cells (e.g., 2 of 8 servers)
# Blast radius: customer A failure only affects customers sharing the same servers
# Mathematically: probability of two customers sharing the same subset is very low
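The "2 of 8" example above can be worked through directly. The sketch below counts the possible shards, derives the overlap probabilities assuming shards are assigned uniformly at random, and assigns customers deterministically by hashing; enumerating every combination only suits small fleets, and the names are illustrative.

```python
import hashlib
from itertools import combinations
from math import comb


def shuffle_shard(customer_id: str, nodes: list[str], shard_size: int) -> tuple[str, ...]:
    """Deterministically pick one of the C(n, k) node subsets for a customer."""
    all_shards = list(combinations(sorted(nodes), shard_size))
    digest = int.from_bytes(hashlib.sha256(customer_id.encode()).digest()[:8], "big")
    return all_shards[digest % len(all_shards)]


nodes = [f"node-{i}" for i in range(8)]
total_shards = comb(8, 2)                      # distinct 2-node shards
p_same_shard = 1 / total_shards                # another customer has the identical pair
p_any_overlap = 1 - comb(6, 2) / total_shards  # shares at least one node

print("distinct shards:", total_shards)
print(f"P(identical shard) = {p_same_shard:.3f}")
print(f"P(share >= 1 node) = {p_any_overlap:.3f}")
print("customer-a ->", shuffle_shard("customer-a", nodes, 2))
```

Sharing a single node is common, but a customer is only fully taken down when both of its nodes fail, which requires another tenant with the identical pair; larger fleets and shard sizes push that probability down quickly.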
Well-Architected & Multi-Cloud
7 questions
The AWS Well-Architected Framework provides a consistent approach for evaluating architectures and implementing designs that scale over time.
- Operational Excellence: Run and monitor systems effectively. Infrastructure as code, small frequent reversible changes, anticipate failure, operations runbooks. Key services: CloudFormation, Systems Manager, CloudWatch.
- Security: Protect data and systems. Least privilege, defense in depth, encryption everywhere, incident response. Key services: IAM, KMS, GuardDuty, Security Hub.
- Reliability: Recover from failures, meet demand. Multi-AZ, auto scaling, chaos testing, DR. Key services: Route 53, ALB, ASG, RDS Multi-AZ.
- Performance Efficiency: Use resources efficiently. Right instance types, managed services, CDN, caching. Key services: CloudFront, ElastiCache, Kinesis.
- Cost Optimization: Avoid unnecessary costs. Right-sizing, Savings Plans, spot instances, serverless. Key services: Cost Explorer, Compute Optimizer, Trusted Advisor.
- Sustainability: (Added 2021) Minimize environmental impact. Right-size, use managed services, maximize utilization. Graviton for energy efficiency, serverless for on-demand compute.
# Well-Architected Tool: guided review of your architecture
# Create workload → answer questions per pillar → get recommendations
# Trusted Advisor: automated checks for security, cost, performance, fault tolerance
# Common review findings and fixes:
# Security: S3 bucket public, root account used, MFA not enabled
# Reliability: single-AZ, no health checks, manual scaling
# Cost: overprovisioned EC2, unattached EBS, no Savings Plans
# Performance: no caching layer, synchronous everything, wrong instance family
# CloudFormation: AWS-native IaC (JSON/YAML templates)
# Deep AWS integration: broad coverage of AWS services (support for brand-new features occasionally lags)
# StackSets: deploy same stack across multiple accounts/regions
# Drift detection: detect manual changes to stack resources
# CDK (Cloud Development Kit): define AWS infrastructure in Python/TypeScript/Java
# Synthesizes to CloudFormation under the hood
# Constructs: reusable components at L1 (raw), L2 (opinionated), L3 (patterns)
import aws_cdk as cdk
from aws_cdk import aws_s3 as s3, aws_lambda as lambda_

class MyStack(cdk.Stack):
    def __init__(self, scope, id, **kwargs):
        super().__init__(scope, id, **kwargs)
        bucket = s3.Bucket(self, "MyBucket", versioned=True, encryption=s3.BucketEncryption.S3_MANAGED)
        fn = lambda_.Function(self, "Handler",
            runtime=lambda_.Runtime.PYTHON_3_12,
            handler="index.handler",
            code=lambda_.Code.from_asset("lambda"),
            environment={"BUCKET": bucket.bucket_name})
        bucket.grant_read(fn) # CDK handles IAM policy automatically!
# Terraform (HCL): multi-cloud, largest community
# State management: remote state in S3 + DynamoDB lock
# Provider ecosystem: AWS, GCP, Azure, Kubernetes, Datadog, etc.
# Atlantis / Spacelift: Terraform GitOps automation
# Pulumi: general-purpose languages (Python, TypeScript, Go, Java)
# Same code can provision across clouds
# Strong typing, IDE support, unit testing of infrastructure
# Decision:
# AWS-only + simplicity: CDK (with CloudFormation backing)
# Multi-cloud / existing Terraform org: Terraform
# Complex abstractions in real code: Pulumi
# Avoid: ClickOps (console) and raw CloudFormation JSON for new projects
# GitOps: Git repository is the single source of truth for infrastructure
# Any change to infra → PR → review → merge → automated pipeline applies it
# Desired state in Git; operator ensures cluster matches desired state
# GitHub Actions CI/CD pipeline for EKS:
name: Deploy to EKS
on:
  push:
    branches: [main]
permissions:
  id-token: write # OIDC for AWS
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::ACCOUNT:role/GitHubActionsDeployRole
          aws-region: us-east-1
      - name: Build and push container
        run: |
          aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REPO
          docker build -t $ECR_REPO:$GITHUB_SHA .
          docker push $ECR_REPO:$GITHUB_SHA
      - name: Deploy to EKS
        run: |
          aws eks update-kubeconfig --name my-cluster --region us-east-1
          helm upgrade --install my-app ./charts/my-app \
            --set image.tag=$GITHUB_SHA \
            --wait --timeout 10m
# ArgoCD (Kubernetes GitOps operator):
# Watches Git repo → applies manifests to cluster automatically
# Drift detection: alert if cluster state diverges from Git
# Rollback: git revert → ArgoCD automatically reverts deployment
# AWS CodePipeline for multi-account deployments:
# Source: CodeCommit/GitHub → Build: CodeBuild → Deploy: CloudFormation StackSets
# Cross-account deployment: assume role in target account to deploy
Multi-cloud uses services from two or more cloud providers. It's a spectrum from "we use AWS but also Cloudflare" to "our core workload runs on both AWS and GCP simultaneously."
Legitimate reasons for multi-cloud:
- Best-of-breed services: GCP BigQuery for analytics, AWS for compute, Azure for M365 integration
- Negotiating leverage: Realistic ability to switch prevents vendor lock-in and improves pricing negotiations
- Regulatory: Some regions require specific local cloud providers
- M&A: Acquired a company on a different cloud
Multi-cloud pitfalls:
- Operational complexity: Two sets of tools, two sets of IAM, two sets of monitoring — double the skill requirements
- Data transfer costs: Egress between clouds is expensive ($0.08-0.09/GB). Workloads that talk cross-cloud will have high bills.
- Lowest-common-denominator: Using cloud-agnostic tools (plain Kubernetes, Terraform) forgoes managed services that save operational overhead
- "Multi-cloud resilience" is mostly a myth: A region outage (us-east-1) doesn't necessitate a different cloud provider — AWS has other regions. For true resilience, multi-region on one cloud is simpler and sufficient.
# Landing Zone: pre-configured, multi-account AWS environment with security best practices
# AWS Control Tower (managed Landing Zone):
# Sets up the following automatically:
# - Management account (billing, org policies)
# - Log Archive account (central CloudTrail, Config, S3 access logs)
# - Audit account (security tooling, GuardDuty aggregator, Security Hub)
# - Core OU structure (Security, Sandbox, Workloads)
# - Preventive guardrails (SCPs): cannot disable CloudTrail, GuardDuty
# - Detective guardrails (Config rules): detect non-compliant resources
# - Account Factory: provision new accounts with consistent baseline
# Account Factory for Terraform (AFT):
# Git-based account provisioning
# PR → new account request → pipeline creates account with CT baseline
# Customizations: additional SCPs, IAM Identity Center assignments
# Must-have baseline for every new account:
resource "aws_cloudtrail" "main" {
  name                          = "org-baseline-trail" # required argument; placeholder name
  is_multi_region_trail         = true
  enable_log_file_validation    = true
  include_global_service_events = true
  s3_bucket_name                = var.central_log_bucket
}
resource "aws_guardduty_detector" "main" { enable = true }
resource "aws_securityhub_account" "main" {}
resource "aws_config_configuration_recorder" "main" { ... }
# Account Vending Machine concept:
# 1. Developer requests new account via ServiceNow/Jira
# 2. Approval workflow
# 3. Control Tower + AFT provisions account in ~20 minutes
# 4. Developer gets access via IAM Identity Center
# 5. Account has full baseline: logging, security, networking connected to org TGW
# Complete microservices CI/CD pipeline:
# Git push → GitHub Actions → ECR → EKS (dev → staging → prod)
# Pipeline stages:
# 1. Code quality gate: lint, unit tests, SAST (Semgrep), secret scan (gitleaks)
# 2. Build: docker build → push to ECR (tagged with git SHA)
# 3. Deploy to dev: helm upgrade, smoke test
# 4. Integration tests: Postman/k6 against dev endpoint
# 5. Deploy to staging: same chart, prod-like config
# 6. Load test: k6 performance test, SLO validation
# 7. Approval gate: manual or auto (if SLOs pass)
# 8. Deploy to prod: canary (10%) → monitor 30min → full rollout
# Blue-Green deployment on EKS:
# Two deployments: blue (current) and green (new)
# Service selector switches traffic when green is healthy
# Rollback: flip selector back to blue (seconds)
# Canary with Argo Rollouts:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10 # 10% traffic to new version
        - pause: {duration: 30m} # wait and monitor
        - setWeight: 50
        - pause: {duration: 15m}
        - setWeight: 100 # full rollout
# Feature flags (decouple deploy from release):
# Rollout to 100% of infra but only show to 1% of users
# LaunchDarkly / AWS AppConfig / Unleash
# Instant rollback: turn off the flag, no re-deployment needed
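Percentage rollouts behind a feature flag are commonly implemented by hashing the user into a stable bucket. The sketch below shows that idea in isolation; it is not how LaunchDarkly, AppConfig, or Unleash work internally, and the flag name and function are hypothetical.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: float) -> bool:
    """Return True if this user falls inside the flag's rollout percentage.

    Hashing flag and user together gives each flag an independent, stable
    bucketing, so a user stays in or out across requests and deploys.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999, 0.01% steps
    return bucket < percent * 100


users = [f"user-{i}" for i in range(10_000)]
exposed = sum(in_rollout("new-checkout", u, 1.0) for u in users)
print(f"{exposed} of {len(users)} users see the feature at a 1% rollout")

# Kill switch: setting the percentage to 0 hides the feature for everyone
print(any(in_rollout("new-checkout", u, 0) for u in users))
```

Because the bucket is fixed per user, raising the percentage only adds users; nobody who already has the feature loses it mid-rollout.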
Platform engineering builds internal developer platforms (IDPs) that abstract cloud complexity — developers self-serve without needing deep cloud expertise. This is the "golden path" pattern.
# Internal Developer Platform components:
# 1. Service catalog (golden paths):
# Pre-built templates: "new microservice", "data pipeline", "ML workload"
# Developer creates new service → selects template → platform provisions:
# - AWS account or EKS namespace with RBAC
# - ECR repo, CodePipeline, GitHub repo with CI config
# - Monitoring dashboards, logging configuration, IAM roles
# Service catalog tools: Backstage (Spotify), Port, OpsLevel
# 2. Backstage (open-source developer portal):
# Software catalog: all services, owners, runbooks in one place
# Golden path scaffolding: generate new services from templates
# Kubernetes plugin: cluster and workload visibility
# TechDocs: auto-generated documentation from markdown in repos
# 3. Crossplane (Kubernetes-native infrastructure provisioning):
# Developers request infrastructure (RDS, S3) via Kubernetes manifests
# Platform team defines compositions (what developers can request)
# Crossplane calls AWS APIs to provision
apiVersion: database.example.org/v1alpha1
kind: PostgreSQLInstance
metadata: {name: my-app-db}
spec:
  parameters: {storageGB: 20, version: "14"}
  compositeDeletePolicy: Foreground
  writeConnectionSecretToRef: {name: db-credentials}
# 4. Metrics: platform adoption, mean-time-to-deploy, developer NPS
# Goal: reduce cognitive load → developers ship faster, more reliably
# Platform team as product team: build features developers actually use