☁️ Cloud AWS · GCP · Azure Serverless · Kubernetes ~90 questions

Cloud Architecture

A complete set of senior-level cloud architecture interview questions covering IAM and identity, networking and VPC design, compute and serverless, storage and databases, messaging and event-driven systems, observability, security, cost optimization, high availability, disaster recovery, and the Well-Architected Framework across AWS, GCP, and Azure.

IAM & Identity

8 questions
1. How does AWS IAM work? Explain users, roles, policies, and how permissions are evaluated.

AWS IAM controls who can do what across all AWS services. Access is denied by default — you must explicitly allow every action.

IAM Principals:

  • IAM User: Long-term identity for a person. Has username/password and/or access keys. Avoid for applications — use roles instead.
  • IAM Role: Identity assumed temporarily by EC2, Lambda, another account, or a federated user. STS issues short-lived credentials. The right choice for all workloads.
  • IAM Group: Collection of users that share policies. Not a principal — can't be assumed or referenced in resource policies.
# Identity-based policy (attached to user/role):
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject"],
    "Resource": "arn:aws:s3:::my-bucket/*",
    "Condition": {"StringEquals": {"aws:RequestedRegion": "us-east-1"}}
  }]
}

# Resource-based policy (e.g., S3 bucket policy — enables cross-account):
{
  "Principal": {"AWS": "arn:aws:iam::111122223333:role/DataTeam"},
  "Effect": "Allow",
  "Action": "s3:GetObject",
  "Resource": "arn:aws:s3:::analytics-bucket/*"
}

# Permission evaluation (an explicit Deny wins at every step):
# 1. Explicit Deny in ANY applicable policy → always denied
# 2. SCP (org ceiling) → must allow
# 3. Resource-based policy OR identity-based policy → either can grant within one account;
#    cross-account access needs an Allow on both sides
# 4. IAM permission boundary → caps what the identity policy can grant
# 5. Session policy → further restricts an assumed-role session

Best practices: Delete root access keys. Enable MFA on root and privileged users. Use roles for EC2/Lambda/ECS. Apply least privilege. Use IAM Access Analyzer to identify over-permissive policies and unused access.
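
The evaluation order listed above can be sketched as a small decision function. This is a simplified same-account model, not AWS's actual engine: the `Policy` class, the `is_allowed` function, and the glob-style matching are illustrative assumptions, and resources and conditions are ignored.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class Policy:
    """Hypothetical, flattened policy: lists of allowed and denied action patterns."""
    allow: list = field(default_factory=list)
    deny: list = field(default_factory=list)


def _matches(patterns, action):
    # Glob matching stands in for IAM wildcard matching (assumption).
    return any(fnmatchcase(action, p) for p in patterns)


def is_allowed(action, identity, scp=None, boundary=None, session=None, resource=None):
    """Same-account decision; None means no policy of that type applies."""
    present = [p for p in (identity, scp, boundary, session, resource) if p is not None]
    # An explicit Deny in any policy wins.
    if any(_matches(p.deny, action) for p in present):
        return False
    # Every ceiling that is present (SCP, boundary, session policy) must allow the action.
    for ceiling in (scp, boundary, session):
        if ceiling is not None and not _matches(ceiling.allow, action):
            return False
    # Within one account, either the identity policy or the resource policy can grant.
    granted = _matches(identity.allow, action)
    if resource is not None:
        granted = granted or _matches(resource.allow, action)
    return granted


identity = Policy(allow=["s3:GetObject", "s3:PutObject"])
scp = Policy(allow=["*"], deny=["s3:DeleteBucket"])
boundary = Policy(allow=["s3:Get*"])
print(is_allowed("s3:PutObject", identity, scp=scp, boundary=boundary))
```

With the boundary attached, `s3:PutObject` is refused even though the identity policy allows it, because the boundary only permits `s3:Get*`.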

2. What are SCPs and how do they enforce guardrails across an AWS Organization?

SCPs (Service Control Policies) set the maximum permissions for member accounts in an AWS Organization. They don't grant permissions; they restrict what IAM can grant. Even a member account's root user is bound by SCPs, though the management account itself is not affected by them.

# SCP: prevent disabling security services and leaving the org
{
  "Statement": [
    {"Effect": "Deny", "Action": "organizations:LeaveOrganization", "Resource": "*"},
    {"Effect": "Deny",
     "Action": ["cloudtrail:StopLogging","cloudtrail:DeleteTrail","guardduty:DeleteDetector"],
     "Resource": "*"},
    {"Effect": "Deny",
     "Action": ["iam:CreateUser", "iam:CreateAccessKey"],
     "Resource": "*",
     "Condition": {"StringNotEquals": {"aws:PrincipalTag/Team": "platform"}}}
  ]
}

# Restrict to approved regions only:
{
  "Effect": "Deny",
  "NotAction": ["iam:*","sts:*","support:*","trustedadvisor:*"],
  "Resource": "*",
  "Condition": {
    "StringNotEquals": {"aws:RequestedRegion": ["us-east-1","eu-west-1"]}
  }
}

# SCP hierarchy: Root OU → Production OU → Account
# Policies are inherited down the tree: an action must be allowed at every level, and a child cannot override a parent's Deny
# AWS Control Tower: managed Landing Zone with pre-built SCPs (guardrails)
3. How do IRSA and workload identity eliminate static credentials in Kubernetes?
# AWS IRSA (IAM Roles for Service Accounts):
# EKS cluster acts as an OIDC provider
# IAM role trust policy allows a specific Kubernetes service account:
{
  "Principal": {"Federated": "arn:aws:iam::ACCOUNT:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/CLUSTER"},
  "Action": "sts:AssumeRoleWithWebIdentity",
  "Condition": {
    "StringEquals": {
      "oidc.eks.us-east-1.amazonaws.com/id/CLUSTER:sub":
        "system:serviceaccount:my-namespace:my-service-account"
    }
  }
}

# Annotate the Kubernetes service account:
kubectl annotate serviceaccount my-service-account \
  eks.amazonaws.com/role-arn=arn:aws:iam::ACCOUNT:role/MyRole

# Pod SDK auto-gets credentials — no secret needed anywhere!
import boto3
s3 = boto3.client('s3')  # uses IRSA credentials automatically

# GKE Workload Identity: same concept
gcloud iam service-accounts add-iam-policy-binding gsa@project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:project.svc.id.goog[namespace/ksa-name]"

# Benefits:
# No static AWS_ACCESS_KEY_ID in secrets
# Credentials rotate automatically (temporary STS tokens)
# Per-pod IAM — each service has minimal permissions
# Audit trail: CloudTrail records each AssumeRoleWithWebIdentity call, including the service account (sub claim) and the role
4. How do you design a centralized identity architecture using AWS IAM Identity Center?
# Architecture: Corporate IdP (Azure AD / Okta) → IAM Identity Center → AWS Accounts
# One login portal → employees access all assigned accounts and roles
# Generates temporary credentials — no long-lived access keys

# Setup:
# 1. Connect external IdP via SCIM (auto-provision users/groups) + SAML (authentication)
# 2. Create Permission Sets (IAM role templates):
#    - AdministratorAccess (Platform team only)
#    - DeveloperAccess (read most things, write to own services)
#    - ReadOnly (security auditors, management)
# 3. Assign: Group "engineering-team" → Permission Set "DeveloperAccess" → Account "prod-app"

# Assignment example:
# --target-id is the AWS account ID
aws sso-admin create-account-assignment \
  --instance-arn arn:aws:sso:::instance/ssoins-xxx \
  --target-id 123456789012 \
  --target-type AWS_ACCOUNT \
  --permission-set-arn arn:aws:sso:::permissionSet/ssoins-xxx/ps-xxx \
  --principal-type GROUP \
  --principal-id group-id-from-idp

# User experience:
# Employee opens AWS access portal → selects account → selects role → opens console
# CLI: aws sso login → credentials cached for session
# Temporary credentials: 1-12hr session duration

# Key benefit: deprovisioning
# Employee leaves company → disable in AD → SCIM removes from IAM Identity Center
# → All AWS access revoked automatically within minutes
5. What are IAM permission boundaries and when do you use them?

A permission boundary caps the maximum permissions a user or role can have. Effective permissions = intersection of the identity policy AND the boundary. Key use case: allow developers to create IAM roles without being able to escalate privileges.

# Platform team creates a boundary: developers can use app services, not IAM admin
{
  "Statement": [
    {"Effect": "Allow", "Action": ["s3:*","dynamodb:*","lambda:*","logs:*","xray:*"], "Resource": "*"},
    {"Effect": "Deny",  "Action": ["iam:CreatePolicy","iam:AttachRolePolicy","organizations:*"], "Resource": "*"}
  ]
}

# Platform team's CI/CD role can create roles, BUT only with the boundary attached:
{
  "Effect": "Allow",
  "Action": ["iam:CreateRole","iam:PutRolePolicy"],
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "iam:PermissionsBoundary": "arn:aws:iam::ACCOUNT:policy/DeveloperBoundary"
    }
  }
}

# Result: developer CI/CD can create roles for their Lambda functions
# but those roles can NEVER have IAM admin permissions
# even if developer's policy accidentally grants it, boundary blocks it

# CDK bootstrap uses permission boundaries this way
# AWS CDK --cloudformation-execution-policies flag for CI/CD safety
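
The "intersection" rule for permission boundaries can be illustrated with a short sketch. The function name `effective_actions` and the glob-style matching are assumptions made for illustration; the boundary patterns mirror the example boundary above.

```python
from fnmatch import fnmatchcase


def effective_actions(requested, identity_allow, boundary_allow):
    """Return the requested actions allowed by BOTH the identity policy and the boundary."""
    def allowed(action, patterns):
        return any(fnmatchcase(action, p) for p in patterns)

    return [a for a in requested
            if allowed(a, identity_allow) and allowed(a, boundary_allow)]


# A developer role that was (accidentally) granted iam:* by its identity policy.
identity_allow = ["s3:*", "iam:*"]
boundary_allow = ["s3:*", "dynamodb:*", "lambda:*", "logs:*", "xray:*"]

result = effective_actions(
    ["s3:GetObject", "iam:CreateUser", "dynamodb:Query"],
    identity_allow,
    boundary_allow,
)
print(result)
```

Only `s3:GetObject` survives: `iam:CreateUser` is outside the boundary, and `dynamodb:Query` is inside the boundary but never granted by the identity policy.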
6. How does GCP IAM work? Explain the resource hierarchy and organization policies.
# GCP IAM: "member has role on resource"
# Binding: serviceAccount:sa@project.iam → roles/storage.objectViewer → bucket/my-bucket

# Resource hierarchy (policies inherit down):
# Organization → Folder → Project → Resource
# A binding at folder level applies to all projects in that folder

gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer" \
  --condition="expression=request.time < timestamp('2025-01-01T00:00:00Z'),title=temp-access"

# Role types:
# Basic: Owner > Editor > Viewer (too broad — avoid)
# Predefined: roles/storage.objectViewer (narrowly scoped)
# Custom: exact permission set you define

# Organization Policy constraints (≈ AWS SCPs):
# Enforce security baselines across all projects in the org
# Boolean constraints are enforced with enable-enforce; list constraints use allow/deny
gcloud resource-manager org-policies enable-enforce compute.requireOsLogin --organization=ORG_ID
gcloud resource-manager org-policies enable-enforce iam.disableServiceAccountKeyCreation --organization=ORG_ID
gcloud resource-manager org-policies allow gcp.resourceLocations in:us-locations --organization=ORG_ID

# Recommended policies:
# disableServiceAccountKeyCreation: force use of workload identity
# requireOsLogin: SSH key management via IAM
# iam.allowedPolicyMemberDomains: only allow your org's domain as IAM members
# storage.uniformBucketLevelAccess: no per-object ACLs (simpler security model)
7. How does Azure RBAC work? What is the difference between Azure AD roles and Azure resource roles?

Azure has two separate RBAC systems that are often confused — understanding the difference is critical.

  • Azure AD roles (Entra ID): Manage identities in Azure AD itself. Global Administrator, User Administrator, Application Administrator. Directory-level, not resource-level.
  • Azure resource roles: Control access to Azure resources. Owner, Contributor, Reader, Storage Blob Data Contributor. Applied at Management Group → Subscription → Resource Group → Resource scope.
# Assign resource role:
az role assignment create \
  --assignee my-managed-identity \
  --role "Storage Blob Data Reader" \
  --scope "/subscriptions/SUB-ID/resourceGroups/my-rg"

# Custom role:
az role definition create --role-definition '{
  "Name": "App Restarter",
  "Actions": ["Microsoft.Web/sites/restart/action"],
  "AssignableScopes": ["/subscriptions/SUB-ID"]
}'

# Azure PIM (Privileged Identity Management):
# Just-in-time role activation — no standing admin privileges
# Eligible vs Active assignments
# Approval workflows, MFA required, time-limited (8 hours by default, configurable up to 24)
# All activations audited to Azure AD audit log

# Managed Identity (≈ AWS IAM role for workloads):
# System-assigned: tied to the resource lifecycle (VM, App Service)
# User-assigned: standalone identity, can be assigned to multiple resources
# App Service/Function with managed identity → calls Key Vault without credentials
8. What is cross-account access in AWS and what are the multi-account patterns?
# Cross-account role assumption:
# Account B creates a role with trust policy allowing Account A:
{
  "Effect": "Allow",
  "Principal": {"AWS": "arn:aws:iam::ACCOUNT-A:root"},
  "Action": "sts:AssumeRole",
  "Condition": {"Bool": {"aws:MultiFactorAuthPresent": "true"}}
}
# The Condition block is an optional MFA requirement

# Account A assumes the role:
import boto3
assumed = boto3.client('sts').assume_role(
    RoleArn="arn:aws:iam::TARGET:role/CrossAccountRole",
    RoleSessionName="my-session",
    DurationSeconds=3600
)
# Use temp credentials to call services in Account B

# Multi-account strategy:
# 1. Account-per-environment: dev / staging / prod in separate accounts
#    → No accidental prod deployments from dev
#    → Blast radius containment

# 2. Account-per-team: each team owns their AWS account
#    → Clear billing attribution (cost per team)
#    → Independent IAM policies

# 3. Shared services account:
#    → Central DNS, logging, security tooling, transit gateway
#    → Spoke accounts trust hub for shared services

# AWS Resource Access Manager (RAM):
# Share resources across accounts without complex resource policies
# Share: VPC subnets, Transit Gateways, Route 53 Resolver rules, License Manager
aws ram create-resource-share \
  --name my-subnet-share \
  --resource-arns arn:aws:ec2:us-east-1:ACCOUNT:subnet/subnet-xxx \
  --principals arn:aws:organizations::ACCOUNT:organization/o-xxx

Networking & VPC

8 questions
1. Design a production-grade AWS VPC. What are the subnets, routing, and security layers?
# Three-tier VPC across 3 Availability Zones:
VPC: 10.0.0.0/16  (65,536 IPs)

# Public subnets (ALB, NAT Gateway, Bastion):
# 10.0.1.0/24  10.0.2.0/24  10.0.3.0/24
# Route: 0.0.0.0/0 → Internet Gateway

# Private app subnets (ECS tasks, EKS nodes, Lambda in VPC):
# 10.0.11.0/24  10.0.12.0/24  10.0.13.0/24
# Route: 0.0.0.0/0 → NAT Gateway (one per AZ for resilience)

# Isolated data subnets (RDS, ElastiCache — no internet):
# 10.0.21.0/24  10.0.22.0/24  10.0.23.0/24
# Route: local only

# Security Groups (stateful, per-resource):
# ALB SG:  inbound 443 from 0.0.0.0/0
# App SG:  inbound 8080 from ALB SG only
# DB SG:   inbound 5432 from App SG only

# Network ACLs (stateless, per-subnet — extra layer):
# Block known bad IPs at subnet level
# Must allow ephemeral ports 1024-65535 for return traffic

# VPC Flow Logs → S3/CloudWatch for network visibility
# Gateway endpoints for S3/DynamoDB (free, no NAT needed)
# Interface endpoints for ECR, CloudWatch, Secrets Manager (keep traffic private)
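
The CIDR plan above can be sanity-checked with Python's standard `ipaddress` module. This is a minimal sketch: the tier names and the `validate` helper are illustrative, and the "5 reserved addresses per subnet" figure is AWS's documented per-subnet reservation.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
tiers = {
    "public":   ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"],
    "private":  ["10.0.11.0/24", "10.0.12.0/24", "10.0.13.0/24"],
    "isolated": ["10.0.21.0/24", "10.0.22.0/24", "10.0.23.0/24"],
}
subnets = [ipaddress.ip_network(cidr) for cidrs in tiers.values() for cidr in cidrs]


def validate(vpc, subnets):
    """Raise ValueError if a subnet falls outside the VPC or overlaps another subnet."""
    for s in subnets:
        if not s.subnet_of(vpc):
            raise ValueError(f"{s} is not inside {vpc}")
    for i, a in enumerate(subnets):
        for b in subnets[i + 1:]:
            if a.overlaps(b):
                raise ValueError(f"{a} overlaps {b}")
    return True


# AWS reserves 5 addresses in every subnet (first four and the last).
usable_per_subnet = subnets[0].num_addresses - 5

print(validate(vpc, subnets), vpc.num_addresses, usable_per_subnet)
```

Running a check like this in CI before applying infrastructure changes catches overlapping ranges early, which matters once VPCs are peered or attached to a Transit Gateway.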
2. What is Transit Gateway vs VPC Peering? When do you use each?

VPC Peering: Direct one-to-one link between two VPCs. No transitive routing — if A↔B and B↔C, A cannot reach C via B. For n VPCs, requires n(n-1)/2 connections. Simple and cheap for a small number of VPCs.

Transit Gateway: Hub-and-spoke — all VPCs attach to one TGW, enabling transitive routing. Route tables on the TGW control which attachments can reach which. Scales to thousands of VPCs.

# VPC Peering: 10 VPCs = 45 connections; 100 VPCs = 4,950 connections
# Transit Gateway: 100 VPCs = 100 attachments — manageable

# TGW segmentation with multiple route tables:
# prod-rt:   production VPCs ↔ shared-services VPC (isolated from dev)
# dev-rt:    dev VPCs ↔ shared-services VPC (can't reach prod)
# → Strong isolation even though all use the same TGW

# Centralized egress via TGW (cost saving):
# All private VPCs → TGW → single egress VPC with NAT Gateway
# Avoids NAT Gateways in every VPC (each costs roughly $33/month in hourly charges at ~$0.045/hr, before data processing)

# On-premises connectivity via TGW:
# Direct Connect / VPN → TGW → all VPCs (single connection serves all)

# AWS RAM: share TGW across accounts in org
# Network account owns TGW, spoke accounts attach their VPCs
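
The scaling argument for a hub-and-spoke design comes straight from the n(n-1)/2 formula. A tiny sketch (function names are illustrative) reproduces the figures quoted above:

```python
def peering_connections(vpc_count: int) -> int:
    """Full-mesh VPC peering: one connection per pair of VPCs."""
    return vpc_count * (vpc_count - 1) // 2


def tgw_attachments(vpc_count: int) -> int:
    """Hub-and-spoke: one Transit Gateway attachment per VPC."""
    return vpc_count


for n in (3, 10, 100):
    print(n, peering_connections(n), tgw_attachments(n))
```

Peering grows quadratically while attachments grow linearly, which is why the crossover to Transit Gateway usually happens at a fairly small number of VPCs.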
3. What is AWS PrivateLink and what problems does it solve?
# PrivateLink: private connectivity to AWS services or your own services
# Traffic stays within the AWS network backbone — never touches the internet

# Gateway endpoints (S3 and DynamoDB — free):
# Route table entry points S3/DynamoDB traffic to endpoint directly
# Lambda/ECS in private subnet → S3 without NAT Gateway

# Interface endpoints (most other services; paid, ~$0.01/hr per AZ + $0.01/GB):
# Creates an ENI in your subnet with a private IP
# DNS resolution: secretsmanager.us-east-1.amazonaws.com → 10.0.1.x (private IP)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxx \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.secretsmanager \
  --subnet-ids subnet-xxx \
  --security-group-ids sg-xxx \
  --private-dns-enabled

# Why this matters:
# Lambda/ECS calling Secrets Manager without NAT Gateway → saves cost + latency
# Compliance: helps satisfy PCI-DSS/HIPAA controls that limit internet exposure for sensitive calls
# Security: internal traffic can't be intercepted at internet layer

# PrivateLink for your own services (cross-account/SaaS):
# Create NLB → VPC Endpoint Service
# Consumers create Interface Endpoints to reach your service privately
# Snowflake, Datadog, MongoDB Atlas all support PrivateLink connectivity
4. What is Direct Connect vs Site-to-Site VPN? How do you architect hybrid connectivity?
# Site-to-Site VPN:
# IPSec tunnel over the internet. Up to 1.25 Gbps per tunnel.
# Setup in minutes. Cost: ~$0.05/hr + data transfer.
# Use for: quick setup, moderate bandwidth, backup connectivity

# AWS Direct Connect:
# Dedicated fiber circuit from on-prem to AWS Direct Connect location
# 1/10/100 Gbps dedicated bandwidth. Consistent latency (~1ms in same metro).
# Setup: weeks/months (physical provisioning). Cost: port fee from roughly $220/month (1 Gbps) to $16,000+/month (100 Gbps).
# Use for: high bandwidth (>1 Gbps), compliance (no internet), financial workloads

# Resilient hybrid architecture:
# Primary: Direct Connect (dedicated 10 Gbps)
# Secondary: Site-to-Site VPN (automatic BGP failover if DX fails)
# AWS prefers DX routes over VPN routes for the same prefix; on-prem routers prefer DX via BGP local preference → VPN used only when DX is down

# Direct Connect Gateway:
# One DX circuit → VPCs in multiple AWS regions
# Attach to TGW in each region → all VPCs reach on-premises

# Pricing comparison (100GB/month data transfer to on-prem):
# VPN: $36 + $9 data = $45/month
# DX 1 Gbps: $220 port + $2 data = $222/month (better at higher volumes)
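
The pricing comparison above implies a break-even transfer volume. The sketch below uses the same rounded figures ($36 and $220 fixed per month, $0.09 and $0.02 per GB); the function names are illustrative and real prices vary by region and port speed.

```python
def monthly_cost(fixed_usd: float, per_gb_usd: float, gb: float) -> float:
    """Fixed monthly fee plus per-GB data transfer out to on-premises."""
    return fixed_usd + per_gb_usd * gb


def break_even_gb(vpn_fixed=36.0, vpn_per_gb=0.09, dx_fixed=220.0, dx_per_gb=0.02) -> float:
    """Monthly volume at which Direct Connect becomes cheaper than VPN."""
    return (dx_fixed - vpn_fixed) / (vpn_per_gb - dx_per_gb)


print(round(monthly_cost(36.0, 0.09, 100), 2))   # VPN at 100 GB
print(round(monthly_cost(220.0, 0.02, 100), 2))  # DX at 100 GB
print(round(break_even_gb()))                    # GB/month where the lines cross
```

Under these assumptions the crossover is around 2.6 TB per month; below that, bandwidth and latency requirements rather than cost are the reason to choose Direct Connect.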
5. How does Route 53 work? Explain routing policies and DNS failover.
# Route 53 routing policies:

# Simple: single value, no health checks
# Weighted: A/B testing, gradual migrations
#   api.example.com → v1-alb: weight=80, v2-alb: weight=20

# Latency-based: route to lowest-latency region
#   US users → us-east-1 ALB; EU users → eu-west-1 ALB

# Geolocation: data residency (GDPR)
#   EU users → EU endpoint; North America → US endpoint
#   Default record required for unmatched locations

# Failover: active-passive
#   Primary: prod-alb (healthy = serves all traffic)
#   Secondary: dr-alb (used only when primary health check fails)
#   Health check: HTTP 200 on /health every 30s from 3 regions

# Geoproximity: traffic shifting with bias adjustment
# Multivalue: up to 8 healthy IPs (basic client-side load balancing)

# Health check configuration:
aws route53 create-health-check --caller-reference mycheck \
  --health-check-config '{
    "IPAddress": "1.2.3.4",
    "Port": 443, "Type": "HTTPS",
    "ResourcePath": "/health",
    "RequestInterval": 30,
    "FailureThreshold": 3
  }'

# TTL recommendation: 60s (balance responsiveness vs DNS caching)
# Private hosted zones: internal DNS within VPC (api.internal → 10.0.1.15)
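
Weighted routing can be pictured as a weighted random draw per DNS query. The following is a rough simulation of that idea, not Route 53's implementation; the record names match the 80/20 example above and `pick_record` is a hypothetical helper.

```python
import random

# (record name, weight) pairs from the weighted-routing example
RECORDS = [("v1-alb", 80), ("v2-alb", 20)]


def pick_record(records, rng):
    """Choose one record with probability weight / sum(weights)."""
    names = [name for name, _ in records]
    weights = [weight for _, weight in records]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(42)  # seeded so the simulation is repeatable
picks = [pick_record(RECORDS, rng) for _ in range(10_000)]
share_v1 = picks.count("v1-alb") / len(picks)
print(f"v1 share: {share_v1:.2%}")
```

Because resolvers cache answers for the record's TTL, the split observed by real clients converges to the weights only over many resolvers and time.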
6. How does CloudFront work? Explain caching behaviors, origins, and Lambda@Edge.
# CloudFront request flow:
# User → Nearest Edge (450+ PoPs) → Cache hit? Serve. Miss? → Origin (ALB/S3/API GW)

# Origins:
# S3 with OAC (Origin Access Control): only CloudFront can read S3 (not public)
# ALB: restrict via a security group that allows only the CloudFront managed prefix list (com.amazonaws.global.cloudfront.origin-facing)

# Cache behaviors by path:
# /api/*  → ALB origin, TTL=0 (dynamic, no caching), forward all headers
# /static/* → S3 origin, TTL=86400 (1 day), cache headers/cookies stripped
# Default → ALB origin, TTL=300

# Lambda@Edge vs CloudFront Functions:
# CF Functions: <1ms, 250+ PoPs, viewer request/response only
#   → URL rewrites, header manipulation, simple auth token check
# Lambda@Edge: full Lambda, ~13 regional edge caches, all 4 events
#   → JWT validation, A/B testing, SSR, geo-based content

# Common patterns:
# 1. SPA hosting: S3 → CloudFront (HTTPS, custom domain, DDoS protection)
# 2. API caching: cache GET responses at edge, skip POST/PUT
# 3. Signed URLs: time-limited access to private S3 content
# 4. WAF at edge: block attacks before they reach origin
# 5. Geographic restriction: deny specific countries at edge
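
The path-based cache behaviors listed above are evaluated in order, with the default behavior as the fallback. A minimal sketch of that lookup, using glob matching as a stand-in for CloudFront path patterns (the table and `match_behavior` are illustrative):

```python
from fnmatch import fnmatchcase

# (path pattern, origin, TTL in seconds), checked in order
BEHAVIORS = [
    ("/api/*", "alb", 0),
    ("/static/*", "s3", 86400),
]
DEFAULT_BEHAVIOR = ("alb", 300)


def match_behavior(path: str):
    """Return (origin, ttl) for the first matching pattern, else the default behavior."""
    for pattern, origin, ttl in BEHAVIORS:
        if fnmatchcase(path, pattern):
            return origin, ttl
    return DEFAULT_BEHAVIOR


for p in ("/api/users", "/static/app.js", "/index.html"):
    print(p, match_behavior(p))
```

Ordering matters: a broad pattern placed before a narrower one would shadow it, so specific paths should come first.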
7. Compare ALB, NLB, and GWLB. When do you use each?
# ALB (Application LB) — Layer 7, HTTP/HTTPS/gRPC:
# Content-based routing: path (/api → service-a, /v2 → service-b)
# Host-based routing: api.example.com vs www.example.com
# Target types: EC2, IPs (containers), Lambda
# WAF integration, request/response headers, redirect rules
# Use for: web applications, microservices, REST APIs, gRPC services

# NLB (Network LB) — Layer 4, TCP/UDP:
# Extreme performance: millions req/s, ~100µs latency
# Static IPs per AZ (whitelist in customer firewalls)
# Preserves source IP address (ALB does not)
# TLS passthrough (ALB terminates TLS; NLB can pass through to backend)
# Use for: high-performance APIs, PrivateLink services, gaming, IoT, VoIP

# GWLB (Gateway LB) — Layer 3, for network appliances:
# Transparent bump-in-the-wire for security appliances (Palo Alto, Fortinet)
# GENEVE protocol encapsulation; traffic inspected then forwarded
# Use for: centralized ingress/egress inspection, IDS/IPS

# Target group health check tips:
# Reduce interval to 10s for faster failure detection
# Deregistration delay: reduce from 300s to 30s for faster deployments
# Evaluate target health: ALB/NLB check actual backend health, not just port
8. How does GCP's networking model differ from AWS? What is a shared VPC?

GCP's VPC is fundamentally global, while AWS VPC is regional. A single GCP VPC spans all regions — subnets are regional but the network is not. This means VMs in different regions can communicate privately without VPC peering or transit gateways.

# GCP VPC — one VPC, multiple regional subnets:
gcloud compute networks create my-vpc --subnet-mode=custom
gcloud compute networks subnets create us-subnet \
  --network=my-vpc --region=us-central1 --range=10.0.1.0/24
gcloud compute networks subnets create eu-subnet \
  --network=my-vpc --region=europe-west1 --range=10.0.2.0/24
# VM in us-central1 and VM in europe-west1 communicate privately by default!

# Shared VPC (equivalent to AWS RAM for subnets):
# Host project owns the VPC and subnets
# Service projects can create resources using host project's subnets
# Centralized network control, distributed app ownership
# Common: one network team manages Shared VPC; dev teams deploy apps in it

# GCP Firewall rules (equivalent to AWS security groups):
gcloud compute firewall-rules create allow-web \
  --network=my-vpc \
  --allow=tcp:443 \
  --source-ranges=0.0.0.0/0 \
  --target-tags=web-server    # apply to VMs with this network tag

# Cloud Interconnect (≈ Direct Connect):
# Dedicated: 10/100 Gbps to GCP PoP
# Partner: through telco partner (lower bandwidth options)
# Cloud Router: BGP routing between GCP and on-prem

# GCP Global Load Balancer:
# Single anycast IP serves all regions
# Routes to nearest healthy backend globally
# HTTP(S) LB → backends in multiple regions = global HA with one IP

Compute & Serverless

8 questions
1. What are EC2 purchasing options and how do you right-size instances?
# Purchasing options:
# On-Demand:  full price, no commitment → dev/test, spiky unpredictable loads
# Spot:       up to 90% off, 2-min interruption notice → batch, ML training, CI/CD
# Reserved (1 or 3 year): 40-72% off → steady-state production workloads
# Savings Plans (Compute): commit to $/hr, covers EC2+Fargate+Lambda → most flexible
# Dedicated Host: physical server (BYOL licensing, compliance)

# Graviton (ARM): m7g/c7g/r7g — 20-40% better price/perf for most workloads
# Instance families:
# m-series: general purpose; c-series: compute (CPU-bound)
# r-series: memory (in-memory DBs, caches); i-series: storage (local NVMe)
# p/g-series: GPU (ML); inf2: ML inference

# Right-sizing process:
# 1. Enable CloudWatch detailed monitoring + CW Agent for memory
# 2. AWS Compute Optimizer: ML-based recommendations from actual utilization
# 3. Target: p95 CPU 40-70%, p95 memory 60-80%
# 4. Downsize if p95 CPU < 20% consistently over 14 days
# 5. Benchmark: test Graviton equivalent — usually 20-40% cheaper, same performance
# 6. Switch to Savings Plans AFTER right-sizing (commit to right size)
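
The p95 thresholds in the right-sizing process can be turned into a simple rule. This sketch uses a nearest-rank percentile and the 20% / 70% cut-offs from the steps above; treating anything above 70% as an upsize candidate is an assumption derived from the target band, and the function names are illustrative.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of utilization samples."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def recommend(cpu_samples):
    """Classify an instance from its CPU utilization samples (percent)."""
    value = p95(cpu_samples)
    if value < 20:
        return "downsize"
    if value > 70:
        return "upsize"   # assumption: above the 40-70% target band
    return "keep"


idle = [8, 10, 12, 9, 11] * 20
busy = list(range(1, 101))
print(recommend(idle), recommend(busy))
```

In practice the samples would come from at least 14 days of CloudWatch data, and memory would be checked the same way before acting on a CPU-only signal.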
2. How does AWS Lambda work? What causes cold starts and how do you minimize them?
# Lambda execution environment lifecycle:
# Cold start: new Firecracker microVM → runtime init → function init → handler
# Warm: reuse existing environment → handler only

# Cold start durations by runtime:
# Node.js/Python: ~200-500ms    Go: ~100-200ms
# Java (no SnapStart): 1-3s     Java (SnapStart): ~200ms
# .NET: ~300-600ms

# Minimizing cold starts:
# 1. Provisioned concurrency (guaranteed warm environments):
aws lambda put-provisioned-concurrency-config \
  --function-name my-fn --qualifier prod \
  --provisioned-concurrent-executions 10
# Use auto-scaling for provisioned concurrency: scale on schedule or metric

# 2. Lambda SnapStart (Java): snapshot initialized JVM, resume from snapshot
# 3. Reduce package size: smaller zip = faster download and init
# 4. Move init outside handler:
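import boto3
_s3 = boto3.client('s3')         # initialized once per execution environment
_db_conn = connect_db()           # connect_db() is a placeholder helper; connection reused across invocations

def handler(event, context):
    # Reuses the warm client; the "bucket" and "key" event fields are illustrative
    obj = _s3.get_object(Bucket=event["bucket"], Key=event["key"])
    return obj["ContentLength"]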

# 5. Use Lambda layers for shared dependencies (cached separately)
# 6. Lambda Function URLs: direct HTTPS endpoint without API GW overhead

# Lambda limits: 15 min max, 10GB memory, deployment package 50MB zipped / 250MB unzipped
# (container images up to 10GB)
# Concurrency: each function can scale by 1,000 concurrent executions every 10 seconds,
# up to the account concurrency limit
3. How do you choose between ECS, EKS, and Fargate? What are the operational trade-offs?
# ECS (Elastic Container Service): AWS-native orchestration
# Simpler than Kubernetes; integrates deeply with ALB, IAM, CloudWatch
# Task Definition: container spec (image, CPU, memory, env vars, IAM role)
# Service: maintains desired count, integrates with load balancer
# Cluster: logical grouping; launch type: EC2 or Fargate

# EKS (Elastic Kubernetes Service): managed Kubernetes control plane
# AWS manages: API server, etcd, scheduler, controller manager (HA, auto-patched)
# You manage: worker nodes, add-ons (CoreDNS, VPC CNI, Load Balancer Controller)
# Choose when: existing K8s expertise, complex scheduling needs, multi-cloud portability

# Fargate: serverless containers (no EC2 management)
# Each pod/task gets its own Firecracker microVM (strong isolation)
# Per-vCPU-second billing; ~20-30% premium over equivalent EC2
# No daemonsets, no privileged containers, limited observability
# Best for: teams avoiding node management, variable workloads

# Karpenter (EKS node autoscaler — recommended over Cluster Autoscaler):
# Watches for unschedulable pods → provisions right-sized EC2 in ~30s
# Bin-packing: consolidates underutilized nodes automatically
# Mixed instance types: automatically picks cheapest/best fit

# Decision:
# AWS-native, no K8s expertise → ECS on Fargate
# Scale, cost optimization, K8s ecosystem → EKS + EC2 + Karpenter
# Avoid all node management, lower scale → ECS/EKS Fargate
# Event-driven, short-lived, ≤15 min → Lambda
4. How does auto scaling work? Compare target tracking, scheduled, and predictive scaling.
# Target Tracking (recommended — like a thermostat):
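aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-asg \
  --policy-name cpu-target-60 \
  --policy-type TargetTrackingScaling \
  --estimated-instance-warmup 60 \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
    "TargetValue": 60.0
  }'
# my-asg and cpu-target-60 are placeholder names.
# ScaleOutCooldown / ScaleInCooldown (e.g., 60s out, 300s in) are Application Auto Scaling
# settings (ECS, DynamoDB); EC2 Auto Scaling target tracking uses instance warmup instead.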

# Scale out aggressively, scale in conservatively:
# Scale out: fire alarm approach — add capacity fast at first sign of load
# Scale in: wait for sustained low load (avoid thrashing)

# Scheduled scaling (known traffic patterns):
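# Recurrence is a cron expression in UTC: 8am on weekdays (my-asg is a placeholder name)
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-asg \
  --scheduled-action-name workday-start \
  --recurrence "0 8 * * MON-FRI" \
  --desired-capacity 20 --min-size 10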

# Predictive scaling (ML-based — pre-scales before load arrives):
# Analyzes 14 days of history → forecasts and pre-warms instances
# Avoids lag from reactive scaling — instances ready before traffic arrives
# Best for: regular diurnal patterns (office hours, business cycles)

# ECS Service Auto Scaling:
# Scale on: ALB RequestCountPerTarget (per-task), SQS queue depth (for workers)
# Application Auto Scaling API — works for ECS, Lambda (concurrency), DynamoDB

# Scale-to-zero with Lambda and Fargate spot:
# Lambda: inherently scales to zero, pay only when invoked
# Fargate: min-tasks=0 + SQS trigger to scale from 0 → saves cost for batch
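
The thermostat analogy for target tracking can be approximated with a proportional rule: scale capacity by the ratio of the observed metric to the target. This is a simplified model for intuition only, not the algorithm AWS actually runs, and `desired_capacity` is a hypothetical helper.

```python
import math


def desired_capacity(current: int, metric: float, target: float,
                     min_size: int = 1, max_size: int = 100) -> int:
    """Proportional estimate: capacity needed to bring the average metric back to target."""
    raw = math.ceil(current * metric / target)
    return max(min_size, min(max_size, raw))


# 10 instances at 90% average CPU with a 60% target
print(desired_capacity(current=10, metric=90.0, target=60.0))
# 10 instances at 30% average CPU with a 60% target
print(desired_capacity(current=10, metric=30.0, target=60.0))
```

The asymmetric behavior described above (fast out, slow in) would sit on top of this calculation as separate cooldown or warmup timers.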
5. What are Spot Instances and how do you architect fault-tolerant workloads around them?
# Spot: spare EC2 capacity at 70-90% discount; 2-min interruption notice
# Interruption rate: ~1-5%/day depending on instance type and AZ

# ASG mixed instances (best practice):
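aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --min-size 2 --max-size 20 \
  --vpc-zone-identifier "subnet-xxx,subnet-yyy" \
  --mixed-instances-policy '{
  "InstancesDistribution": {
    "OnDemandBaseCapacity": 2,
    "OnDemandPercentageAboveBaseCapacity": 20,
    "SpotAllocationStrategy": "price-capacity-optimized"
  },
  "LaunchTemplate": {
    "LaunchTemplateSpecification": {"LaunchTemplateName": "my-template", "Version": "$Latest"},
    "Overrides": [
      {"InstanceType": "m5.large"},
      {"InstanceType": "m5a.large"},
      {"InstanceType": "m6i.large"},
      {"InstanceType": "m6a.large"}
    ]
  }
}'
# OnDemandBaseCapacity=2 keeps 2 On-Demand instances always; 20% On-Demand above base
# leaves 80% Spot. Multiple instance types = higher Spot availability.
# (my-asg, my-template, and the subnet IDs are placeholders.)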

# Handle 2-minute interruption notice:
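import requests

IMDS = "http://169.254.169.254/latest"

def check_interruption():
    try:
        # IMDSv2: fetch a short-lived session token first
        token = requests.put(
            f"{IMDS}/api/token",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
            timeout=1,
        ).text
        r = requests.get(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
            timeout=1,
        )
        return r.status_code == 200  # 404 means no interruption is scheduled
    except requests.RequestException:
        return False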

# On interruption: drain from ALB, checkpoint state, graceful shutdown
# EC2 Capacity Blocks for ML: reserve GPU capacity in advance for ML training runs (a separate offering, not Spot)
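
The effect of the base-capacity and percentage settings in a mixed instances policy can be worked out by hand. The sketch below assumes the On-Demand share above the base is rounded up; that rounding rule and the `split_capacity` helper are assumptions for illustration.

```python
def split_capacity(desired: int, on_demand_base: int, on_demand_pct_above_base: int) -> dict:
    """Split desired capacity into On-Demand and Spot counts."""
    base = min(desired, on_demand_base)
    above_base = desired - base
    # Integer ceiling of above_base * pct / 100 (rounding up is an assumption).
    on_demand_above = -(-above_base * on_demand_pct_above_base // 100)
    return {
        "on_demand": base + on_demand_above,
        "spot": above_base - on_demand_above,
    }


# Base of 2 On-Demand, then 20% On-Demand / 80% Spot above the base
print(split_capacity(desired=12, on_demand_base=2, on_demand_pct_above_base=20))
```

With 12 desired instances this yields 4 On-Demand and 8 Spot, so a complete Spot reclaim would still leave a third of the fleet running.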
6. What is GCP Cloud Run and how does it compare to AWS Lambda?
# Cloud Run: serverless containers — deploy any HTTP container
# --min-instances 0 = scale to zero; --concurrency = requests per container instance
gcloud run deploy my-service \
  --image gcr.io/my-project/my-app:latest \
  --region us-central1 \
  --min-instances 0 \
  --max-instances 100 \
  --concurrency 80 \
  --memory 512Mi \
  --cpu 1

# Cloud Run vs Lambda:
# Cloud Run: any container, any language, 60-min timeout, up to 32GB/8vCPU
# Lambda: function-as-a-service, 15-min max, 10GB memory, tighter AWS integration
# Cloud Run concurrency: multiple requests per container (like a web server)
# Lambda: one request per invocation (scale = containers)
# Cold starts: a few hundred ms to a few seconds on both, depending on runtime and image size; Cloud Run min-instances keeps instances warm

# Cloud Run Jobs: batch/scheduled tasks (not HTTP-triggered)
# Cloud Run for Anthos: deploy to GKE using same Cloud Run API

# Azure Container Apps (equivalent):
# Serverless containers with KEDA-based scaling (HTTP, queue depth, custom metrics)
# Scale to zero, VNet integration, Dapr support for microservices
# Managed Kubernetes underneath, fully abstracted

# AWS App Runner (closest to Cloud Run on AWS):
# Source: container image or GitHub repo → fully managed HTTP service
# Auto-scales, no VPC/IAM complexity to set up
# More limited than ECS Fargate but much simpler to get started
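
The concurrency difference between Cloud Run and Lambda shows up directly in how many instances a given load needs. A back-of-the-envelope sketch using Little's law (in-flight requests = arrival rate × latency); the function name and the traffic figures are illustrative assumptions.

```python
def instances_needed(rps: int, avg_latency_ms: int, concurrency: int) -> int:
    """Instances required when each instance handles `concurrency` requests at once."""
    in_flight_x1000 = rps * avg_latency_ms          # in-flight requests, scaled by 1000
    # Integer ceiling of in_flight / concurrency
    return -(-in_flight_x1000 // (1000 * concurrency))


# 400 requests/s at 200 ms average latency = 80 requests in flight
print(instances_needed(400, 200, concurrency=80))  # Cloud Run with --concurrency 80
print(instances_needed(400, 200, concurrency=1))   # Lambda: one request per environment
```

The same load fits in a single Cloud Run instance but needs 80 Lambda environments, which is why per-request init cost and connection pooling behave so differently on the two platforms.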
7. What is AWS Step Functions and when do you use it for workflow orchestration?
# Step Functions: serverless state machines (Amazon States Language)
# States: Task, Choice, Parallel, Map, Wait, Pass, Succeed, Fail

# Example: order processing with parallel stock check + payment + error handling
{
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:::function:validate",
      "Catch": [{"ErrorEquals": ["ValidationError"], "Next": "Reject"}],
      "Next": "ProcessInParallel"
    },
    "ProcessInParallel": {
      "Type": "Parallel",
      "Branches": [
        {"StartAt": "ProcessPayment", "States": {"ProcessPayment": {"Type": "Task", "Resource": "...", "End": true}}},
        {"StartAt": "ReserveStock",   "States": {"ReserveStock":   {"Type": "Task", "Resource": "...", "End": true}}}
      ],
      "Next": "Ship"
    },
    "Ship":   {"Type": "Task", "Resource": "...", "End": true},
    "Reject": {"Type": "Fail", "Error": "ValidationError", "Cause": "Order failed validation"}
  }
}

# Express vs Standard:
# Standard: exactly-once, up to 1 year, full audit history → business processes
# Express: at-least-once, up to 5 min, high throughput → event processing

# SDK integrations: call 200+ AWS services directly without Lambda glue code
# Built-in retry with exponential backoff, error handling, timeouts

# Use cases:
# Multi-step ETL pipelines (Glue + EMR + Athena)
# Human approval workflows (SES email → approval callback)
# Saga pattern for distributed transactions
# ML pipelines (SageMaker training + evaluation + deployment)
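
The built-in retry with exponential backoff is driven by three Retry fields in Amazon States Language: `IntervalSeconds`, `MaxAttempts`, and `BackoffRate`. The sketch below computes the resulting wait schedule; the helper name is illustrative, and the defaults shown (1 second, 3 attempts, rate 2.0) are the ASL defaults as far as I am aware.

```python
def retry_delays(interval_seconds: float = 1, max_attempts: int = 3,
                 backoff_rate: float = 2.0) -> list:
    """Seconds waited before each retry attempt: interval * rate^(attempt index)."""
    return [interval_seconds * backoff_rate ** i for i in range(max_attempts)]


print(retry_delays())                  # default retrier
print(retry_delays(2, 4, 1.5))         # custom: 2s base, 4 attempts, 1.5x growth
print(sum(retry_delays(2, 4, 1.5)))    # total time spent waiting across retries
```

Summing the schedule is a quick way to check that the worst-case retry time still fits inside the state's or the execution's timeout.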
8. What does AWS manage vs what do you manage in EKS?
# AWS manages (EKS control plane):
# Kubernetes API server, etcd, scheduler, controller manager
# Multi-AZ deployment, automatic failover, version patches
# etcd backups
# Control plane HA SLA: 99.95%

# You manage (data plane):
# Worker nodes (EC2 instances in your VPC)
# Node OS patching and AMI updates (managed node groups automate the rolling replacement, but you trigger it)
# EKS add-ons: CoreDNS, kube-proxy, VPC CNI, EBS CSI driver
# Karpenter or Cluster Autoscaler (node provisioning)
# Container images, application configs, Kubernetes RBAC
# Cluster networking: security groups, VPC CIDR planning

# EKS VPC CNI: each pod gets a VPC IP (no overlay network)
# → Pods directly routable within VPC → simpler security group rules
# → Limitation: IP exhaustion if CIDR too small (plan /19 per subnet minimum)
# Prefix delegation: /28 prefixes per ENI → 16 IPs per prefix → more pods per node

# EKS Fargate profiles (no EC2 nodes):
# Each pod runs in its own Firecracker microVM
# Profile selector: namespace + label → Fargate pod
# No daemonsets, no privileged pods, no node-level access
# Billing: per pod vCPU-second (more expensive than EC2 for consistent workloads)

# Version management:
# EKS supports n-3 Kubernetes versions
# Control plane upgrade: 30 min, in-place (but coordinate with data plane)
# Worker node upgrade: rolling update via managed node groups
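The VPC CNI limits above translate into a per-node pod ceiling. A rough sketch of the commonly used formula, assuming one address per ENI is reserved for the ENI itself and two extra pods run on the host network; the m5.large limits (3 ENIs, 10 IPv4 addresses each) are used as an example and should be checked against the current instance documentation:

```python
def max_pods(enis, ips_per_eni, prefix_delegation=False):
    """Pod capacity implied by ENI limits for the VPC CNI."""
    slots = enis * (ips_per_eni - 1)   # one IP per ENI is the ENI's own address
    if prefix_delegation:
        slots *= 16                    # each slot holds a /28 prefix = 16 addresses
    return slots + 2                   # +2 for host-network pods


print(max_pods(3, 10))                          # → 29
print(max_pods(3, 10, prefix_delegation=True))  # → 434
```

With prefix delegation the IP-derived number is usually far above what the node can actually run, so in practice a lower kubelet max-pods value becomes the effective limit.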

Storage & Databases

8 questions
1How does S3 work at scale? Explain storage classes, lifecycle policies, and performance.
# Storage classes (cost vs access tradeoff; approximate us-east-1 list prices per GB-month):
# Standard:              $0.023/GB  — frequent access, ms latency
# Standard-IA:           $0.0125/GB — infrequent access, retrieval fee
# One Zone-IA:           $0.010/GB  — single AZ, recreatable data
# Glacier Instant:       $0.004/GB  — quarterly access, ms retrieval
# Glacier Flexible:      $0.0036/GB — minutes-hours retrieval
# Glacier Deep Archive:  $0.00099/GB — years retention, 12h retrieval
# Intelligent-Tiering:   auto-moves between tiers based on access patterns

# Lifecycle policy (automated cost management):
{
  "Rules": [{
    "ID": "archive-logs",
    "Status": "Enabled",
    "Filter": {"Prefix": "logs/"},
    "Transitions": [
      {"Days": 30,  "StorageClass": "STANDARD_IA"},
      {"Days": 90,  "StorageClass": "GLACIER_IR"},
      {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
    ],
    "Expiration": {"Days": 2555}
  }]
}
# Expiration after 2,555 days ≈ 7-year retention, then deletion

# Performance optimization:
# Throughput: 3,500 PUT/5,500 GET per second PER PREFIX
# High-throughput: spread keys across multiple prefixes → limits apply per prefix (random prefixes no longer required since 2018)
# Large objects (>100MB): multipart upload (parallelizes, allows resume)
# Global reads: byte-range fetches, CloudFront caching, Transfer Acceleration
# S3 Express One Zone: 10x throughput, single-digit ms (2023) — for ML data loading
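To make the archive-logs lifecycle rule concrete, here is a small sketch that maps an object's age to the storage class it would sit in under those transitions. It only models the day thresholds from the rule; real lifecycle runs are asynchronous, so actual transitions can lag by a day or so. The function name is illustrative.

```python
# (minimum age in days, storage class), mirroring the archive-logs rule
TRANSITIONS = [(365, "DEEP_ARCHIVE"), (90, "GLACIER_IR"), (30, "STANDARD_IA")]
EXPIRATION_DAYS = 2555


def storage_class_for_age(age_days):
    """Return the class an object of this age sits in, or None once expired."""
    if age_days >= EXPIRATION_DAYS:
        return None
    for min_age, storage_class in TRANSITIONS:
        if age_days >= min_age:
            return storage_class
    return "STANDARD"


for age in (10, 45, 120, 400, 3000):
    print(age, storage_class_for_age(age))
```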
2How do you choose between RDS, Aurora, DynamoDB, and ElastiCache?
# RDS: managed MySQL/PostgreSQL/SQL Server/Oracle
# Multi-AZ: synchronous standby → automatic failover (~1-2 min)
# Read replicas: async, cross-region supported; up to 15 on MySQL/MariaDB/PostgreSQL (fewer on other engines)
# Use for: existing relational workloads, complex SQL, ACID transactions

# Aurora (MySQL/PostgreSQL compatible):
# AWS cites up to 5x MySQL / 3x PostgreSQL throughput; distributed storage (6 copies, 3 AZs)
# Aurora Serverless v2: auto-scales compute in 0.5 ACU increments
# Aurora Global Database: cross-region, <1s replication lag
# Use instead of RDS when: production workload, need 99.99%+ availability

# DynamoDB (NoSQL key-value/document):
# Single-digit ms at any scale, serverless, 99.99% SLA (99.999% with Global Tables)
# Global Tables: multi-region active-active
# Use for: high-scale simple access patterns, gaming, IoT, session storage

# ElastiCache:
# Redis: sub-ms in-memory, persistence, pub/sub, sorted sets, cluster mode
# Memcached: simple distributed cache, multi-threaded
# Use for: database query caching, session storage, leaderboards, rate limiting

# Decision:
# Complex SQL + ACID → Aurora (new) or RDS (existing)
# Simple access patterns + massive scale → DynamoDB
# In-memory caching + pub/sub → ElastiCache Redis
# Time-series → Amazon Timestream
# Full-text search → OpenSearch Service
# Data warehouse → Redshift
3How does DynamoDB data modeling work? Explain single-table design and GSIs.
# DynamoDB: access patterns drive the model — opposite of relational
# Identify ALL access patterns first, then design keys around them

# Primary key: PK (partition key) alone, or PK + SK (sort key)
# PK determines partition; SK enables range queries within a partition
# GSI: alternate PK+SK — enables new access patterns on existing data

# Single-table design (e-commerce example):
# Entity type prefix pattern:
# PK=USER#alice,     SK=PROFILE              → user profile
# PK=USER#alice,     SK=ORDER#2024-01-15#001 → user's order
# PK=ORDER#001,      SK=ITEM#widget-a        → order line item
# PK=PRODUCT#widget, SK=METADATA             → product info

# GSI for inverted access:
# GSI1PK=ORDER#2024-01-15, GSI1SK=STATUS#SHIPPED
# Query: "all orders on this date with status SHIPPED"

# Access patterns supported:
# GetUser(userId): PK=USER#alice, SK=PROFILE
# GetUserOrders(userId): PK=USER#alice, SK begins_with ORDER#
# GetOrderItems(orderId): PK=ORDER#001, SK begins_with ITEM#
# GetOrdersByDate(date, status): GSI1 query

# Hot partition prevention:
# Avoid low-cardinality PKs that funnel all writes into one partition
# Add shard suffix: PK=ORDERS#3 (random 1-10) for write-heavy entities
# Write sharded, read from all shards and merge

# Capacity: 1 WCU = one 1KB write/s; 1 RCU = one 4KB strongly consistent read/s (or two eventually consistent reads/s)
# Item size limit: 400KB max
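The write-sharding idea above ("write sharded, read from all shards and merge") can be sketched with an in-memory dictionary standing in for the table. This is only a simulation under that assumption: `table`, `put_order`, and `get_all_orders` are hypothetical names, and a real implementation would issue one DynamoDB `Query` per shard key.

```python
import random
from collections import defaultdict

SHARDS = 10
table = defaultdict(list)  # stand-in for a DynamoDB table: PK -> items


def put_order(order):
    """Spread writes across ORDERS#1..ORDERS#10 to avoid one hot partition."""
    shard = random.randint(1, SHARDS)
    table[f"ORDERS#{shard}"].append(order)


def get_all_orders():
    """Scatter-gather read: query every shard, then merge the results."""
    merged = []
    for shard in range(1, SHARDS + 1):
        merged.extend(table[f"ORDERS#{shard}"])
    return sorted(merged, key=lambda o: o["order_id"])


for i in range(100):
    put_order({"order_id": i})

print(len(get_all_orders()))  # → 100
```

The tradeoff is visible in the read path: every read now costs one query per shard, so the shard count should stay as small as the write rate allows.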
4How does Aurora Global Database work? When do you need multi-region databases?
# Aurora Global Database:
# Primary region: all writes
# Secondary regions (up to 5): read-only replicas, <1s replication lag
# RPO: <1 second | RTO: <1 minute (managed failover)

# Use cases:
# 1. Read scale: route regional users to nearest region for low-latency reads
# 2. Disaster recovery: fast failover if primary region fails
# 3. Compliance: data in specific regions for data residency

# Managed failover (planned, e.g., region maintenance):
# Promotes secondary in <35s, no data loss

# vs DynamoDB Global Tables:
# DynamoDB: active-active (all regions write), eventual consistency
# Aurora Global: active-passive (writes only in primary); secondary-region reads lag the primary by <1s

# Multi-region write patterns:
# Single primary (Aurora Global): consistent, writes bottleneck to one region
# Sharded by region: EU users → EU DB, US users → US DB (no cross-region reads)
# Active-active (DynamoDB / CockroachDB): all regions accept writes,
#   eventual consistency, conflict resolution required

# RDS Proxy: connection pooling for Lambda/serverless → Aurora
# Prevents connection exhaustion when Lambda scales to thousands of concurrent executions
# Also provides: IAM authentication for DB, automatic failover handling
5How do you design a data lake on AWS? What is the modern lake house architecture?
# Lake house: S3 as central store, query in place (no ETL to data warehouse)
# Zones: raw (landing) → curated (cleaned/partitioned) → consumption (aggregated)

# Ingestion:
# Batch: AWS Glue ETL (Spark) or EMR → S3
# Streaming: Kinesis Firehose → S3 (buffered micro-batch, 1-15 min)
# CDC from databases: DMS → Kinesis → S3 or direct to S3

# Storage (open table formats — replacing Hive):
# Apache Iceberg: ACID transactions on S3, time travel, schema evolution
# Delta Lake: Databricks-native, now open source, same capabilities
# Both supported by Athena v3, Glue, EMR natively

# Querying:
# Athena: serverless SQL on S3, $5/TB scanned → use Parquet + partition pruning
# Redshift Spectrum: query S3 from Redshift (join warehouse + lake)
# EMR: Spark/Hive for complex transformations

# Catalog and governance:
# Glue Catalog: Hive-compatible metastore (schemas, partitions)
# Lake Formation: column/row-level access control on the lake
# Glue Crawler: auto-discovers schemas from S3

# Performance best practices:
# Parquet/ORC format: columnar → 10x less data scanned for analytical queries
# Partition by date: WHERE dt='2024-01-15' → prune non-matching partitions
# Compaction: merge many small files into fewer large files (Iceberg auto-compaction)
# Predicate pushdown: Athena pushes filters into Parquet reader
6How do you implement caching strategies in cloud architectures?
# Cache-aside (lazy loading — most common):
def get_user(user_id):
    user = redis.get(f"user:{user_id}")
    if user: return json.loads(user)
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    redis.setex(f"user:{user_id}", 3600, json.dumps(user))
    return user
# Pro: only caches accessed data; Con: cache miss = 2 trips

# Write-through: write DB and cache on every update
def update_user(user_id, data):
    db.update(user_id, data)
    redis.setex(f"user:{user_id}", 3600, json.dumps(data))
# Pro: cache always fresh; Con: extra write latency

# Cache stampede prevention (thundering herd):
# On cache miss, hundreds of requests hit DB simultaneously
# Fix: probabilistic early expiration or mutex lock on key
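def get_user_with_lock(uid):
    cached = redis.get(f"user:{uid}")
    if cached: return json.loads(cached)
    # Only the request that wins the lock rebuilds the entry
    if not redis.set(f"lock:user:{uid}", 1, nx=True, ex=10):
        time.sleep(0.1)                      # another request is rebuilding; retry
        return get_user_with_lock(uid)
    try:
        return get_user(uid)                 # cache-aside fill from above
    finally:
        redis.delete(f"lock:user:{uid}")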

# CloudFront as API cache:
# Cache GET /products/* for 5 minutes at edge (300 ALB requests → 1)
# Vary cache by Accept-Language header for localized content

# Tiered caching:
# L1: Local in-process cache (LRU dict, 100 entries) — microseconds
# L2: ElastiCache Redis (shared across instances) — sub-ms
# L3: Database query result — ms

# Cache invalidation on write:
# Simple TTL: accept up to TTL stale data
# Write-through: update cache synchronously
# Event-based: publish invalidation event → listeners delete cache keys
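The "probabilistic early expiration" option for stampede prevention can be sketched as below. This follows one published form of the technique (often called XFetch); the function name, the `beta` default, and the injected `rand` parameter are illustrative choices, not an API from any cache library.

```python
import math
import random


def should_refresh_early(now, expiry, recompute_seconds, beta=1.0, rand=random.random):
    """Return True if this request should rebuild the entry before it expires.

    The closer `now` is to `expiry`, and the slower the rebuild, the more
    likely a single request volunteers early, so not every caller misses at once.
    """
    r = max(rand(), 1e-12)  # guard against log(0)
    return now - recompute_seconds * beta * math.log(r) >= expiry


# Entry expires at t=100 and takes about 2s to rebuild
early = sum(should_refresh_early(97, 100, 2) for _ in range(10_000))
print(f"{early} of 10000 requests at t=97 would refresh early")
```

Compared with the mutex approach, no request ever blocks; the cost is an occasional rebuild slightly before the TTL is reached.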
7How do you handle database migrations with zero downtime?
# Expand-Contract pattern (for schema changes):

# Phase 1 — Expand (backward-compatible):
ALTER TABLE users ADD COLUMN phone VARCHAR(20) NULL;
# Deploy code that writes to BOTH old and new columns
# Deploy code that reads from new column (fallback to old)

# Phase 2 — Backfill (batched to avoid table lock):
UPDATE users SET phone = legacy_phone
WHERE phone IS NULL AND id BETWEEN %s AND %s;  # batch by primary key
# Run during off-peak hours; monitor replication lag

# Phase 3 — Contract (after all code reads new column only):
ALTER TABLE users DROP COLUMN legacy_phone;

# AWS DMS for version upgrades or engine migrations:
# Full load: copy all existing data to target
# CDC (Change Data Capture): replicate ongoing changes during cutover window
# Cutover: stop writes to source → wait for DMS lag = 0 → switch connection strings

# RDS Blue-Green Deployments (for major version upgrades):
# Creates a green RDS with the new version
# Replicates from blue via binlog
# Switchover: promotes green, renames endpoints (~30s downtime)
# Rollback window: blue kept for 3 days after switchover

# Flyway / Liquibase in CI/CD:
# Versioned migrations: V001__initial.sql, V002__add_column.sql
# Runs automatically in deployment pipeline
# Tracks executed migrations in schema_history table
# Idempotent: running twice is safe
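The batched backfill from Phase 2 can be driven by a short script. The sketch below uses an in-memory SQLite database purely so it runs anywhere; the table and column names follow the example above, while `backfill_phone` and the batch size are assumptions for illustration.

```python
import sqlite3


def backfill_phone(conn, batch_size=1000):
    """Copy legacy_phone into phone in primary-key ranges; returns rows updated."""
    max_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM users").fetchone()[0]
    updated = 0
    for start in range(1, max_id + 1, batch_size):
        cur = conn.execute(
            "UPDATE users SET phone = legacy_phone "
            "WHERE phone IS NULL AND id BETWEEN ? AND ?",
            (start, start + batch_size - 1),
        )
        conn.commit()  # one short transaction per batch keeps locks brief
        updated += cur.rowcount
    return updated


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, legacy_phone TEXT, phone TEXT)")
conn.executemany("INSERT INTO users (legacy_phone) VALUES (?)",
                 [(f"555-{i:04d}",) for i in range(25)])
conn.commit()
print(backfill_phone(conn, batch_size=10))  # → 25
```

Because the `WHERE phone IS NULL` guard skips rows already copied, the script can be stopped and rerun safely; on a production database a pause between batches would normally be added to let replicas catch up.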
8What is GCP Spanner and Bigtable? When do you choose each over Cloud SQL?
# Cloud SQL (≈ RDS): managed MySQL/PostgreSQL, single-region, familiar
# $50-500/month depending on instance size
# Use for: standard OLTP, lift-and-shift, most relational workloads

# Cloud Spanner: globally distributed, strongly consistent RDBMS
# ACID transactions across rows, tables, AND continents
# Horizontal read/write scaling (unlike traditional RDBMS)
# 99.999% SLA multi-region, auto-sharding
# Cost: ~$650/month minimum (1 node)
# Use for: global financial systems, inventory consistency, >1 TB relational data
# Powers: Google Ads, Google Play, Pokémon GO

CREATE TABLE Orders (
  customer_id INT64 NOT NULL,
  order_id INT64 NOT NULL,
  amount NUMERIC,
) PRIMARY KEY (customer_id, order_id),
  INTERLEAVE IN PARENT Customers ON DELETE CASCADE;
# Interleaved tables: Orders physically co-located with Customers → fast joins

# Cloud Bigtable: wide-column NoSQL, petabyte-scale
# HBase-compatible API, consistent low latency at scale
# Schema: rows keyed by row key; columns in column families
# Use for: time-series (IoT, metrics), ML features, click stream, Google Maps tiles
# Not for: ad-hoc queries, SQL joins, small datasets

# Comparison:
# Cloud SQL: familiar, cheap, single-region → most web apps
# Spanner: global ACID, expensive → fintech, global inventory
# Bigtable: massive scale, simple access patterns → IoT, analytics
# BigQuery: serverless OLAP → analytics, dashboards (not OLTP)

Messaging & Events

7 questions
1Compare SQS, SNS, EventBridge, and Kinesis. When do you use each?
# SQS — point-to-point work queue, at-least-once:
# Standard: nearly unlimited throughput, best-effort ordering; FIFO: ordered, exactly-once processing, 300 msg/s (3,000/s with batching)
# Visibility timeout: message hidden while processing, requeued on failure
# DLQ: messages failing N times go to Dead Letter Queue
# Use for: decoupling services, async work queues, rate-limiting downstream

# SNS — pub/sub fan-out, push delivery:
# One publish → many subscribers (SQS, Lambda, HTTP, email, mobile push)
# Message filtering: subscribers receive only matching messages
# Fan-out pattern: SNS topic → multiple SQS queues → independent consumers
# Use for: notifications, triggering multiple actions from one event

# EventBridge — event bus with content-based routing:
# Default bus: all AWS service events; Custom bus: your app events
# Rules match events by pattern → 20+ target types
# Schema registry, archive/replay, cross-account routing
# Use for: event-driven architecture, reacting to AWS service changes

# Kinesis Data Streams — ordered, replayable, real-time:
# Shard: 1 MB/s write, 2 MB/s read; Retention: 24h to 365 days
# Enhanced fan-out: 2 MB/s per consumer (no sharing)
# Use for: real-time analytics, ordered events, replay needed, IoT

# Decision guide:
# Work queue:              SQS Standard/FIFO
# Fan-out notifications:   SNS → SQS
# AWS service event routing: EventBridge
# Real-time streaming + replay: Kinesis Data Streams
# Managed Kafka workloads: Amazon MSK
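For provisioned Kinesis Data Streams, the per-shard limits above determine how many shards a stream needs. A small sizing sketch, assuming the 1 MB/s write and 2 MB/s shared-read figures quoted above plus the 1,000 records/s per-shard write limit (an addition not stated in the text); the function name is illustrative:

```python
import math


def shards_needed(write_mb_per_s, records_per_s, read_mb_per_s):
    """Smallest shard count that satisfies every per-shard limit."""
    return max(
        math.ceil(write_mb_per_s / 1.0),   # 1 MB/s write per shard
        math.ceil(records_per_s / 1000),   # 1,000 records/s write per shard
        math.ceil(read_mb_per_s / 2.0),    # 2 MB/s shared read per shard
        1,
    )


print(shards_needed(write_mb_per_s=4.5, records_per_s=12000, read_mb_per_s=9))  # → 12
```

In this example the record rate, not the byte rate, is the binding constraint, which is common for streams of many small events.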
2What is the Saga pattern? How do you implement distributed transactions in microservices?

Distributed transactions across microservices can't use traditional 2PC reliably at scale. The Saga pattern breaks a transaction into a sequence of local transactions. If one fails, compensating transactions undo previous steps.
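
The compensation idea can be sketched in a few lines. This is a toy in-process coordinator, assuming each step is a plain function paired with its undo; `run_saga` and the step names are hypothetical, and a production saga would persist its progress so it can resume after a crash.

```python
def run_saga(steps):
    """Run (action, compensation) pairs; undo completed steps if one fails."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            for undo in reversed(completed):  # compensate newest first
                undo()
            return False
    return True


log = []


def reserve_stock():
    raise RuntimeError("out of stock")


ok = run_saga([
    (lambda: log.append("order created"), lambda: log.append("order cancelled")),
    (lambda: log.append("card charged"), lambda: log.append("card refunded")),
    (reserve_stock, lambda: log.append("stock released")),
])
print(ok, log)
# → False ['order created', 'card charged', 'card refunded', 'order cancelled']
```

Note that the failed step's own compensation never runs: only steps that completed are undone, in reverse order.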

# Choreography saga (event-driven — no central coordinator):
# OrderService: creates order → publishes OrderCreated
# PaymentService: receives OrderCreated → charges card → publishes PaymentProcessed
# InventoryService: receives PaymentProcessed → reserves stock → publishes StockReserved
# ShippingService: receives StockReserved → creates shipment → publishes OrderShipped

# Compensation on failure:
# InventoryService: out of stock → publishes StockFailed
# PaymentService: receives StockFailed → refunds card → publishes PaymentRefunded
# OrderService: receives PaymentRefunded → marks order Cancelled

# Orchestration saga (Step Functions — centralized):
# State machine calls each service explicitly
# On failure: state machine triggers compensation steps
# Clearer flow, easier debugging; harder to scale to many services

# Idempotency requirement (events delivered at-least-once):
processed = set()
def handle_event(event):
    if event["event_id"] in processed: return  # deduplicate
    processed.add(event["event_id"])
    do_work(event)

# Outbox pattern (ensures event published if DB write succeeds):
# Write event to outbox table IN SAME DB TRANSACTION as business data
# Separate process reads outbox → publishes to message bus → marks sent
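A minimal sketch of the outbox pattern described above, using SQLite so it is self-contained. The table layout, `create_order`, and `relay` are assumptions for illustration; in practice the relay is a separate poller or a CDC connector, and `publish` would send to a message bus.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, sent INTEGER DEFAULT 0)")


def create_order(total):
    """Business row and event row commit or roll back together."""
    with conn:  # one transaction
        order_id = conn.execute("INSERT INTO orders (total) VALUES (?)", (total,)).lastrowid
        event = {"type": "OrderCreated", "order_id": order_id}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))
    return order_id


def relay(publish):
    """Publish unsent events in order, then mark each one sent."""
    rows = conn.execute("SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # may be re-delivered after a crash
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)


published = []
create_order(42.0)
relay(published.append)
print(published)
```

Because the relay marks a row sent only after publishing, a crash in between causes a duplicate rather than a lost event, which is why the idempotent consumer above is still required.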
3How do you design a real-time data pipeline using Kinesis Firehose and S3?
# Complete streaming pipeline:
# App → Kinesis Data Streams → Kinesis Firehose → S3 → Glue Catalog → Athena

# Firehose: auto-scales, no shards to manage; delivers in micro-batches
# Buffer: flush when 128MB OR 300s reached (whichever first)
# Compression: GZIP reduces storage cost by 70-80%

# Firehose with Lambda transformation:
{
  "DeliveryStreamName": "app-events",
  "ExtendedS3DestinationConfiguration": {
    "BucketARN": "arn:aws:s3:::my-data-lake",
    "Prefix": "events/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
    "ErrorOutputPrefix": "errors/",
    "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
    "CompressionFormat": "UNCOMPRESSED",
    "DataFormatConversionConfiguration": {
      "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {"Compression": "GZIP"}}}
    },
    "ProcessingConfiguration": {
      "Processors": [{"Type": "Lambda", "Parameters": [...]}]
    }
  }
}
# With Parquet conversion enabled, compression is set on the serializer, not on the S3 destination

# Glue Crawler auto-discovers partitions and schema → Athena can query immediately
# Athena query cost: $5/TB scanned → partitioning + Parquet = 100x less data scanned

# Real-time dashboard:
# Kinesis Data Analytics (Apache Flink) → tumbling windows, anomaly detection
# Output to DynamoDB or OpenSearch for sub-second dashboard queries
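The "flush when 128MB OR 300s reached" buffering rule can be modelled with a small class. This is a simplified sketch: it only checks the thresholds when a record arrives, whereas the managed service also flushes on a timer; the class and method names are hypothetical.

```python
class Buffer:
    """Flush when either the size or the age threshold is reached."""

    def __init__(self, max_bytes=128 * 1024 * 1024, max_age_s=300):
        self.max_bytes, self.max_age_s = max_bytes, max_age_s
        self.size, self.opened_at = 0, None

    def add(self, nbytes, now):
        """Record nbytes arriving at time `now`; return True if a flush happened."""
        if self.opened_at is None:
            self.opened_at = now  # the age clock starts with the first record
        self.size += nbytes
        if self.size >= self.max_bytes or now - self.opened_at >= self.max_age_s:
            self.size, self.opened_at = 0, None
            return True
        return False


buf = Buffer(max_bytes=100, max_age_s=10)
print([buf.add(40, t) for t in (0, 5, 6)])  # → [False, False, True]
```

High-volume streams hit the size threshold and produce large objects; low-volume streams hit the time threshold and produce small ones, which is where compaction becomes relevant.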
4What is Amazon MSK? When do you choose Kafka over SQS/Kinesis?
# MSK: managed Apache Kafka — runs real Kafka brokers in your VPC
# Provisioned: dedicated brokers; Serverless: pay per throughput (auto-scales)

# Choose Kafka/MSK when:
# ✅ Existing Kafka workloads (lift-and-shift)
# ✅ Kafka ecosystem: Debezium CDC, Kafka Connect, ksqlDB, Kafka Streams
# ✅ Exactly-once transactions (Kafka transactional API)
# ✅ Many consumers with high throughput (no Kinesis fan-out fees)
# ✅ Indefinite retention (replay old events without re-ingesting)

# Choose SQS/Kinesis when:
# ✅ Native AWS integrations: Lambda triggers, EventBridge, Firehose
# ✅ Operational simplicity: no broker sizing, no ZooKeeper/KRaft
# ✅ Cost: SQS/Kinesis scales to zero; MSK minimum ~$700/month
# ✅ Simple patterns: one producer → one consumer (SQS is perfect)

# MSK Serverless vs Provisioned:
# Serverless: automatic capacity, pay per throughput → variable workloads
# Provisioned: predictable performance, better for sustained high throughput

# MSK Connect (managed Kafka Connect):
# Source: Debezium → capture MySQL/PostgreSQL changes → Kafka topics
# Sink: Kafka topics → S3 (S3 Sink Connector), OpenSearch, Redshift
# Fully managed connectors, no infrastructure to manage
5How do you handle backpressure and flow control in event-driven systems?
# Backpressure: producer faster than consumer → queue grows → memory/latency problems

# Strategy 1: Scale consumers based on queue depth
# CloudWatch alarm: SQS ApproximateNumberOfMessagesVisible > 1000
# → Scale out ECS tasks or Lambda concurrency

# Strategy 2: Lambda + SQS (built-in backpressure management)
# Lambda polls SQS, scales up to concurrency limit automatically
# Set reserved concurrency = DB connection pool max (prevent connection exhaustion)
# Batch size: 10 messages per invocation; report batch item failures for partial retry
aws lambda create-event-source-mapping \
  --function-name processor \
  --event-source-arn arn:aws:sqs:::my-queue \
  --batch-size 10 \
  --function-response-types ReportBatchItemFailures

# Strategy 3: Rate limiting at producer
# API Gateway throttling: 1000 req/s → 429 when exceeded
# Token bucket per user: prevents any single tenant from flooding the queue

# Strategy 4: Circuit breaker
# If downstream DB is failing, stop consuming from queue
# Prevents cascade: queue → retry storm → DB overload → full outage

# Dead Letter Queue management:
# After 3 failed attempts → DLQ
# CloudWatch alarm on DLQ depth > 0 → immediate page
# DLQ redrive: after fixing root cause, replay from DLQ to main queue
aws sqs start-message-move-task \
  --source-arn arn:aws:sqs:::my-queue-dlq \
  --destination-arn arn:aws:sqs:::my-queue
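With `ReportBatchItemFailures` enabled on the event source mapping, the function reports which messages failed by returning their IDs under `batchItemFailures`. A minimal handler sketch; `process` is a hypothetical stand-in for the business logic:

```python
import json


def process(body):
    """Hypothetical business logic; raises on bad input."""
    if "order_id" not in body:
        raise ValueError("missing order_id")


def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message returns to the queue; the rest are deleted
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


sample = {"Records": [
    {"messageId": "1", "body": json.dumps({"order_id": 7})},
    {"messageId": "2", "body": json.dumps({"oops": True})},
]}
print(handler(sample, None))
# → {'batchItemFailures': [{'itemIdentifier': '2'}]}
```

Without this response shape, one bad message causes the whole batch of 10 to be retried, which multiplies load exactly when the system is already struggling.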
6What is GCP Pub/Sub and Azure Service Bus? How do they compare to AWS equivalents?
# GCP Pub/Sub (≈ SNS + SQS in one):
# Topic → multiple Subscriptions → each subscription gets every message
# Within a subscription: only one consumer gets each message (queue semantics)
# Global by default, at-least-once; exactly-once delivery is an opt-in setting on pull subscriptions
# Push mode: Pub/Sub calls your HTTPS endpoint (great for Cloud Run)
# Pull mode: consumer polls (standard queue consumer)
# BigQuery subscription: auto-stream messages to BigQuery table

# Ordering:
publisher.publish(topic, data=b"event", ordering_key="order-123")
# All messages with same ordering_key delivered in order to same consumer

# Azure Service Bus (≈ SQS + partial SNS):
# Queues: point-to-point, FIFO, sessions (ordered groups), dead letter
# Topics + Subscriptions: pub/sub (each subscription = filtered copy)
# Sessions: group related messages → processed by single consumer in order
# Advanced: message deferral, scheduled delivery, duplicate detection
# Premium tier: dedicated capacity, VNet integration, 100MB messages

# Azure Event Hubs (≈ Kinesis Data Streams):
# Partitioned event streaming, consumer groups
# Kafka-compatible API: run existing Kafka code against Event Hubs!
# Capture: auto-write to Azure Blob Storage (≈ Kinesis → Firehose → S3)

# Cross-cloud messaging (hybrid scenarios):
# Confluent Cloud / Confluent Platform: Kafka on any cloud
# MQ services: Amazon MQ (ActiveMQ/RabbitMQ managed) for lift-and-shift
7What is EventBridge Pipes and how do you use it to build event pipelines?
# EventBridge Pipes: point-to-point event pipeline (source → filter → enrich → target)
# Replaces Lambda glue code for common integration patterns

# Sources: SQS, Kinesis, DynamoDB Streams, MSK, MQ
# Filter: only forward matching events (e.g., only OrderCreated events)
# Enrichment: Lambda, API GW, Step Functions (transform/augment the event)
# Targets: Lambda, SQS, EventBridge bus, Step Functions, Kinesis, API GW, etc.

# Example: DynamoDB stream → Pipe → EventBridge bus
aws pipes create-pipe \
  --name dynamo-to-eventbridge \
  --source arn:aws:dynamodb:::stream/... \
  --source-parameters '{
    "DynamoDBStreamParameters": {
      "StartingPosition": "LATEST",
      "BatchSize": 10
    },
    "FilterCriteria": {
      "Filters": [{"Pattern": "{\"dynamodb\": {\"NewImage\": {\"status\": {\"S\": [\"SHIPPED\"]}}}}"}]
    }
  }' \
  --enrichment arn:aws:lambda:::function:enrich-order \
  --target arn:aws:events:::event-bus/order-events

# EventBridge Archive + Replay:
# Archive all events for 30 days → replay after a bug fix
aws events start-replay \
  --replay-name bug-fix-replay \
  --event-source-arn arn:aws:events:::archive/my-event-bus-archive \
  --event-start-time 2024-01-15T00:00:00 \
  --event-end-time 2024-01-16T00:00:00 \
  --destination '{"Arn": "arn:aws:events:::event-bus/my-bus", "FilterArns": [...]}'

Observability

6 questions
1What are the three pillars of observability and how do you implement them on AWS?

Metrics answer "Is my system healthy?", Logs answer "What happened?", and Traces answer "Where is the slowdown?"

# Metrics: CloudWatch
import boto3
cw = boto3.client('cloudwatch')
cw.put_metric_data(Namespace='MyApp', MetricData=[{
    'MetricName': 'OrdersProcessed',
    'Value': 1, 'Unit': 'Count',
    'Dimensions': [{'Name': 'Environment', 'Value': 'prod'}]
}])
# Alarms on threshold or anomaly detection
# Dashboards: composite views with SLO burn rates

# Logs: CloudWatch Logs Insights
# Structured JSON logging essential for querying:
{"level":"ERROR","msg":"Payment failed","order_id":"123","error":"timeout","trace_id":"abc"}
# Query:
fields @timestamp, order_id, error
| filter level = "ERROR"
| stats count(*) as errors by bin(5m)

# Traces: AWS X-Ray + OpenTelemetry (ADOT)
from aws_xray_sdk.core import xray_recorder, patch_all
patch_all()  # auto-instrument boto3, requests, SQLAlchemy

@xray_recorder.capture('process-order')
def process_order(order_id):
    with xray_recorder.in_subsegment('db-query'):
        return db.get_order(order_id)

# OpenTelemetry on AWS (vendor-neutral):
# ADOT Collector → X-Ray, CloudWatch, Prometheus, Grafana, Datadog
# Benefit: switch backends without code changes
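Structured JSON logs like the example above can be produced with the standard `logging` module alone. A sketch of one way to do it; the formatter class, the logger name, and the fixed list of extra fields are assumptions for illustration:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        entry = {"level": record.levelname, "msg": record.getMessage()}
        # Fields passed via `extra=` become attributes on the record
        for key in ("order_id", "error", "trace_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


stream = logging.StreamHandler()
stream.setFormatter(JsonFormatter())
logger = logging.getLogger("payments")  # hypothetical logger name
logger.addHandler(stream)
logger.error("Payment failed", extra={"order_id": "123", "error": "timeout", "trace_id": "abc"})
```

Each line is then directly queryable in Logs Insights by field name (`filter level = "ERROR"`), with no regex parsing.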
2How do you design SLI/SLO-based alerting and avoid alert fatigue?
# SLI/SLO alerting (SRE methodology):
# SLI: the metric you measure (e.g., p99 latency, availability)
# SLO: the target (99.9% availability = 8.7 hours downtime/year)
# Error Budget: 1 - SLO = 0.1% = how much badness you can tolerate

# Key SLIs for web APIs:
# Availability:   (successful / total) > 99.9%
# Latency:        p99 < 500ms
# Error rate:     (4xx+5xx / total) < 1%
# Throughput:     requests/second (capacity planning)

# Error budget burn rate alerts:
# 1x burn = consuming budget at steady state (neutral)
# 14.4x burn for 1h = 2% of a 30-day budget consumed → P1 page
# 6x burn for 6h   = 5% of a 30-day budget consumed → P2 notify

aws cloudwatch put-metric-alarm \
  --alarm-name "P99-Latency-High" \
  --metric-name TargetResponseTime \
  --namespace AWS/ApplicationELB \
  --extended-statistic p99 \
  --period 60 --evaluation-periods 5 \
  --threshold 0.5 \
  --comparison-operator GreaterThanThreshold

# Composite alarms (reduce noise):
aws cloudwatch put-composite-alarm \
  --alarm-name "Service-Degraded" \
  --alarm-rule "ALARM(ErrorRateHigh) AND ALARM(LatencyHigh)"
# Alert only when BOTH conditions are true simultaneously

# Alert hierarchy:
# P1 (page immediately): SLO burn rate 14x+, complete outage
# P2 (notify in 30 min): SLO burn rate 6x+, partial degradation
# P3 (review next day): capacity warning, non-critical anomaly
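Burn-rate thresholds map to "fraction of error budget consumed" through simple arithmetic. A short sketch assuming a 30-day SLO period; both function names are illustrative:

```python
def budget_consumed(burn_rate, window_hours, slo_period_hours=30 * 24):
    """Fraction of the period's error budget spent at this burn rate."""
    return burn_rate * window_hours / slo_period_hours


def allowed_downtime_minutes(slo, period_hours=30 * 24):
    """Total error budget for the period, expressed as minutes of full outage."""
    return (1 - slo) * period_hours * 60


print(round(budget_consumed(14.4, 1), 4))         # → 0.02
print(round(budget_consumed(6, 6), 4))            # → 0.05
print(round(allowed_downtime_minutes(0.999), 1))  # → 43.2
```

Read the other way, the formula lets you pick a burn rate for any alert window: to page when 2% of the budget goes in one hour, the threshold is 0.02 × 720 / 1 = 14.4.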
3How do you implement centralized logging in a multi-account AWS organization?
# Centralized logging architecture:
# Each account: App → CloudWatch Logs → Kinesis Firehose subscription
# Central Log Archive account: Firehose → S3 (compressed, partitioned)

# Organization-level trail (CloudTrail):
aws cloudtrail create-trail \
  --name org-trail \
  --s3-bucket-name central-cloudtrail-logs \
  --is-multi-region-trail \
  --enable-log-file-validation \
  --is-organization-trail

# Application logs via Fluent Bit (EKS/ECS):
[OUTPUT]
    Name kinesis_firehose
    Match app.*
    region us-east-1
    delivery_stream app-logs-${ACCOUNT_ID}

# Firehose → S3 with account prefix:
# s3://central-logs/account=111222333/year=2024/month=01/app-logs.gz

# OpenSearch for searchable logs:
# Firehose can also deliver to OpenSearch Service
# Fine-grained access: per-index permissions (team A sees only their logs)

# Log retention tiers:
# Application debug logs: 7 days (CloudWatch) → delete
# Application error logs: 30 days hot → 1 year S3 IA
# Security/audit logs: 1 year hot → 7 years S3 Glacier (compliance)

# Protect against log tampering:
# S3 Object Lock (WORM) on security logs
# SCPs deny s3:DeleteObject on the central logs bucket
# CloudTrail log file validation (SHA-256 hash per log file)
4How do you use AWS CloudTrail for security auditing and compliance?
# CloudTrail: records every AWS API call
# Who (principal), what (API action), when, where (IP), on what (resource ARN)

# Critical CloudWatch Metric Filters on CloudTrail logs:
# Root account usage:
'{ $.userIdentity.type = "Root" && $.userIdentity.invokedBy NOT EXISTS }'
# IAM policy changes:
'{ $.eventName = "PutUserPolicy" || $.eventName = "AttachRolePolicy" }'
# Unauthorized API calls:
'{ $.errorCode = "AccessDenied" || $.errorCode = "UnauthorizedOperation" }'
# Console login without MFA:
'{ $.eventName = "ConsoleLogin" && $.additionalEventData.MFAUsed = "No" }'
# Security group changes:
'{ $.eventName = "AuthorizeSecurityGroupIngress" }'

# CloudTrail Lake (serverless SQL queries on API history):
SELECT userIdentity.arn, eventName, errorCode, COUNT(*) as count
FROM trail_event_data_store
WHERE errorCode IS NOT NULL
  AND eventTime > '2024-01-01'
GROUP BY 1, 2, 3
ORDER BY count DESC
LIMIT 20;
# Retention: 7 years; query in seconds; no S3 setup required

# Protect against disabling CloudTrail:
# SCP: Deny cloudtrail:StopLogging, cloudtrail:DeleteTrail for all non-platform accounts
# SNS notification on any StopLogging event
# S3 Object Lock on trail bucket (WORM mode)
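The metric filters above are pattern matches over CloudTrail's JSON records. The same checks can be expressed in Python, for example in a Lambda that inspects delivered log files. This is a sketch covering three of the filters; `findings` and the check names are hypothetical:

```python
def findings(event):
    """Return the names of the checks a CloudTrail record trips."""
    hits = []
    identity = event.get("userIdentity", {})
    if identity.get("type") == "Root" and "invokedBy" not in identity:
        hits.append("root-usage")
    if event.get("errorCode") in ("AccessDenied", "UnauthorizedOperation"):
        hits.append("unauthorized-call")
    if (event.get("eventName") == "ConsoleLogin"
            and event.get("additionalEventData", {}).get("MFAUsed") == "No"):
        hits.append("login-without-mfa")
    return hits


record = {
    "eventName": "ConsoleLogin",
    "userIdentity": {"type": "Root"},
    "additionalEventData": {"MFAUsed": "No"},
}
print(findings(record))  # → ['root-usage', 'login-without-mfa']
```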
5How does distributed tracing work in microservices? Explain context propagation and sampling.
# Trace: unique ID for a complete request journey across services
# Span: one service's work within a trace (start/end time, attributes, errors)
# Context propagation: trace ID passed in HTTP headers between services

# W3C Trace Context header: traceparent: 00-{traceId}-{spanId}-{flags}
# AWS X-Ray header: X-Amzn-Trace-Id: Root=1-xxx;Parent=xxx;Sampled=1

# OpenTelemetry instrumentation:
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process-payment") as span:
    span.set_attribute("payment.amount", amount)
    span.set_attribute("payment.currency", "USD")
    try:
        result = charge_stripe(amount)
        span.set_attribute("payment.status", "success")
    except Exception as e:
        span.record_exception(e)
        span.set_status(Status(StatusCode.ERROR))
        raise

# Sampling strategies:
# Head-based: decision made at trace start (simple, cheaper)
#   - Fixed rate: sample 5% → misses rare errors
#   - Cannot target errors: status is unknown at trace start, so "keep 100% of 4xx/5xx" needs tail-based sampling
# Tail-based (OTEL Collector): decision after trace completes
#   - Sample 100% of slow (>1s) and error traces; 1% of normal
#   - More useful data, higher collector cost

# Correlate traces with logs:
# Add trace_id to log entries → click log → see full trace
import logging
logging.info("Payment processed", extra={"trace_id": get_trace_id(), "order_id": oid})
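Context propagation comes down to parsing the incoming header and emitting a new one with the same trace ID. A sketch for the W3C `traceparent` format shown above; in practice the OpenTelemetry SDK does this for you, and the helper names here are illustrative. The spec's additional rule that all-zero IDs are invalid is not checked.

```python
import re
import secrets

TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Split a W3C traceparent header; returns None if malformed."""
    match = TRACEPARENT.match(header)
    if not match:
        return None
    _version, trace_id, span_id, flags = match.groups()
    return {"trace_id": trace_id, "parent_span_id": span_id,
            "sampled": bool(int(flags, 16) & 0x01)}


def child_traceparent(parent):
    """Same trace ID, fresh span ID: what a service sends downstream."""
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
print(ctx["trace_id"], ctx["sampled"])
print(child_traceparent(ctx))
```

The sampled flag travels with the header, which is how a head-based sampling decision made at the edge is honoured by every downstream service.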
6How do GCP Cloud Monitoring and Azure Monitor compare to CloudWatch?
# AWS CloudWatch:
# Strengths: deep AWS service integration, composite alarms, anomaly detection,
#   Lambda Insights, Contributor Insights, cross-account dashboards
# Logs Insights: fast SQL-like queries; EMF (Embedded Metric Format) for custom metrics
# Gaps: Grafana integration needed for advanced visualization

# GCP Cloud Monitoring:
# Uptime checks: HTTP/TCP from global locations (≈ Route 53 health checks)
# SLO monitoring built-in: define SLOs, auto-track error budget
# Cloud Logging: structured logs, log-based metrics, BigQuery export for analytics
# Cloud Trace: auto-instrumented for GCP services, linked with logs
# Managed Prometheus: GCP manages Prometheus backend storage

# Azure Monitor:
# Application Insights: full APM — requests, exceptions, dependencies, live metrics
# Log Analytics Workspace: central log store with KQL (Kusto Query Language)
# KQL example: requests | where duration > 1000 | summarize count() by bin(timestamp, 5m)
# Azure Monitor Alerts: metric, log-based, resource health, activity log
# Azure Managed Grafana: hosted Grafana with native Azure AD integration

# Third-party tools (cloud-agnostic):
# Datadog: best unified APM+infra, expensive
# New Relic: full-stack observability, competitive pricing
# Grafana Cloud: Prometheus + Loki + Tempo (metrics + logs + traces) — open-source based
# Honeycomb: best for high-cardinality event data, tail-based sampling

Security & Compliance

6 questions
1What are AWS GuardDuty, Security Hub, and Macie? How do you build a security posture?
# GuardDuty: ML-based threat detection
# Analyzes: CloudTrail, VPC Flow Logs, DNS logs, EKS audit logs
# Detects: compromised credentials, crypto-mining, unusual API patterns, port scans
# Enable with one click, no agents, no configuration
aws guardduty create-detector --enable --finding-publishing-frequency SIX_HOURS

# Finding types:
# UnauthorizedAccess:IAMUser/MaliciousIPCaller
# CryptoCurrency:EC2/BitcoinTool.B
# Recon:EC2/PortProbeUnprotectedPort
# PrivilegeEscalation:IAMUser/AdministrativePermissions

# Security Hub: aggregates findings from GuardDuty, Inspector, Macie + third-party
# Runs automated compliance checks (CIS AWS Foundations, AWS Foundational Security)
# Security score per account and per control
# Aggregates across all org accounts into one pane

# Macie: ML-based sensitive data discovery in S3
# Detects: PII, financial data, credentials, health records
# Inventory: which S3 buckets have sensitive data
# Alerts: unencrypted buckets, public buckets, anomalous access patterns

# AWS Config: continuous compliance monitoring
# Rule: "all S3 buckets must have encryption enabled"
# Auto-remediation: Lambda or SSM Document runs when non-compliant
aws config put-config-rule --config-rule '{
  "ConfigRuleName": "s3-bucket-server-side-encryption-enabled",
  "Source": {"Owner": "AWS", "SourceIdentifier": "S3_BUCKET_SERVER_SIDE_ENCRYPTION_ENABLED"}
}'
2How do you implement encryption at rest and in transit in cloud architectures?
# Encryption at rest — enforce via SCP + Config rules:
# S3: default encryption with SSE-S3 (free) or SSE-KMS (audit trail)
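# RDS/Aurora: enable storage encryption at creation (can't be toggled later; to encrypt an existing DB, restore from an encrypted snapshot copy)
# DynamoDB: encrypted by default (AWS owned key; AWS managed or customer managed KMS keys are optional)
# EBS: turn on "encryption by default" (opt-in, per-Region account-level setting)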
# Secrets Manager / Parameter Store: encrypted by KMS

# Envelope encryption with KMS:
# KMS Customer Managed Key (CMK) encrypts DEK (Data Encryption Key)
# DEK encrypts the actual data locally (fast, no KMS API per record)
# Only one KMS call to decrypt DEK → decrypt all data with local DEK

import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
import os

kms = boto3.client('kms')
KEY_ID = 'arn:aws:kms:us-east-1:ACCOUNT:key/KEY-ID'

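def encrypt(plaintext: bytes):
    """Envelope-encrypt plaintext; returns (nonce + ciphertext, KMS-encrypted DEK)."""
    dek = os.urandom(32)         # generate local 256-bit DEK
    nonce = os.urandom(12)       # 96-bit nonce; must never repeat for the same DEK
    ciphertext = AESGCM(dek).encrypt(nonce, plaintext, None)
    # Wrap the DEK with the KMS key; store the wrapped DEK next to the ciphertext
    encrypted_dek = kms.encrypt(KeyId=KEY_ID, Plaintext=dek)['CiphertextBlob']
    return nonce + ciphertext, encrypted_dek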

# Encryption in transit:
# TLS 1.2+ (prefer TLS 1.3) for all external traffic (ALB policy ELBSecurityPolicy-TLS13-1-2-2021-06 negotiates TLS 1.3 and 1.2)
# mTLS for service-to-service (Istio/App Mesh in EKS)
# ACM (Certificate Manager): free TLS certs, auto-renewal for ALB/CloudFront/API GW
# VPC traffic: encrypted by default within AWS network for Nitro instances
3What are the common cloud security misconfigurations and how do you prevent them?
# Top cloud misconfigurations and prevention:

# 1. Public S3 buckets (a recurring source of data leaks):
# Prevention: S3 Block Public Access at account level (new buckets also block public access by default since April 2023)
aws s3control put-public-access-block --account-id ACCOUNT --public-access-block-configuration \
  '{"BlockPublicAcls":true,"IgnorePublicAcls":true,"BlockPublicPolicy":true,"RestrictPublicBuckets":true}'
# AWS Config rule: s3-bucket-public-read-prohibited

# 2. Overly permissive security groups (0.0.0.0/0 on admin ports):
# Prevention: AWS Config rule: restricted-ssh, restricted-common-ports
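# SCPs cannot evaluate the CIDR or port of a security group rule; use Config auto-remediation or Firewall Manager security group policies to remove 0.0.0.0/0 on 22/3389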
# Use SSM Session Manager instead of SSH (no port 22 needed)

# 3. IAM users with access keys (static, long-lived):
# Prevention: SCP deny iam:CreateAccessKey; require use of roles
# AWS Config: access-keys-rotated, iam-user-unused-credentials-check (plus iam-user-no-policies-check to keep permissions off individual users)
# IAM Access Analyzer: find unused access

# 4. Unencrypted databases and S3:
# Prevention: deny s3:PutObject (bucket policy or SCP) unless s3:x-amz-server-side-encryption is aws:kms
# AWS Config: rds-storage-encrypted, s3-bucket-server-side-encryption-enabled

# 5. IMDSv1 enabled on EC2 (SSRF → metadata credentials; the vector in the 2019 Capital One breach):
# Prevention: require IMDSv2 at account level:
aws ec2 modify-instance-metadata-defaults \
  --http-tokens required \
  --http-put-response-hop-limit 1

# Tools:
# AWS Security Hub: automated compliance checks
# Prowler: open-source CIS Benchmark scanner
# ScoutSuite: multi-cloud security auditing
# Checkov / tfsec: IaC security scanning in CI/CD
4What is AWS Network Firewall and how do you implement centralized egress inspection?
# AWS Network Firewall: managed network firewall with stateless and stateful engines (Layers 3-7)
# Rules: Suricata-compatible (IDS/IPS rules), domain-based filtering
# Use cases: block malicious domains, inspect egress, compliance

# Centralized egress inspection architecture:
# Spoke VPCs (all accounts) → TGW → Inspection VPC (Network Firewall) → NAT GW → Internet
# Inspection/egress VPC subnets: TGW attachment subnet, firewall subnet (NFW endpoint), public subnet (NAT GW)

# Firewall rule group:
{
  "RuleGroupName": "block-malicious-domains",
  "Type": "STATEFUL",
  "RuleGroup": {
    "RulesSource": {
      "RulesSourceList": {
        "Targets": [".malware-site.com", ".phishing.com"],
        "TargetTypes": ["HTTP_HOST", "TLS_SNI"],
        "GeneratedRulesType": "DENYLIST"
      }
    }
  }
}

# Domain allowlist for egress (zero-trust egress):
# Only allow specific domains: api.stripe.com, s3.amazonaws.com, etc.
# Block everything else: GeneratedRulesType: "ALLOWLIST"
# Compliance: PCI-DSS requires all outbound connections to be justified and controlled

# vs WAF:
# WAF: protects ingress (HTTP requests to your APIs)
# Network Firewall: protects egress (traffic leaving your VPCs) and can inspect ingress at Layer 4/7

# GCP equivalent: Cloud Next Generation Firewall (Cloud NGFW)
# Azure equivalent: Azure Firewall Premium
5How do you achieve compliance in the cloud? What are shared responsibility models?

Shared Responsibility Model: The cloud provider is responsible for security of the cloud (physical hardware, network infrastructure, hypervisor). The customer is responsible for security in the cloud (OS, applications, data, IAM, network configuration).

# AWS handles: physical facilities, network, hardware, managed service security
# You handle: guest OS, application code, IAM, data encryption, network configuration

# Compliance frameworks and AWS:
# PCI-DSS: payment card data
#   - AWS has PCI DSS Level 1 compliant services
#   - You: encrypt cardholder data, access controls, audit logs, WAF
#   - Use: Macie to detect card data in S3, VPC isolation for cardholder environment

# HIPAA: healthcare data (US)
#   - Sign AWS Business Associate Agreement (BAA)
#   - Use only HIPAA-eligible services
#   - Encrypt PHI at rest + in transit; access controls; audit logs

# SOC 2: Trust Services Criteria (security, availability, processing integrity, confidentiality, privacy)
#   - AWS provides SOC 2 Type II report (covers their infrastructure)
#   - You: application controls, change management, access reviews
#   - AWS Audit Manager: automates evidence collection for SOC 2

# GDPR: EU personal data
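#   - Data residency: GDPR restricts transfers outside the EU/EEA rather than mandating EU hosting; keeping data in EU regions (eu-west-1, eu-central-1) is the simplest approach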
#   - Data subject rights: deletion, portability (design into your application)
#   - DPA: sign AWS Data Processing Addendum

# Infrastructure as Code for compliance:
# Terraform Sentinel policies: enforce compliance rules on every apply
# AWS Config: continuous compliance monitoring with auto-remediation
# AWS Audit Manager: automated evidence collection → compliance reports
6What is AWS WAF and how do you use it to protect APIs and web applications?
# AWS WAF: Layer 7 firewall, attach to ALB, CloudFront, API GW, AppSync
# Rule groups: AWS Managed (most at no extra charge), Marketplace (paid), Custom

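# Web ACL with common protections (rate limit counts requests per source IP over a 5-minute window by default):
aws wafv2 create-web-acl --name my-web-acl --scope REGIONAL \
  --default-action Allow={} \
  --visibility-config SampledRequestsEnabled=true,CloudWatchMetricsEnabled=true,MetricName=my-web-acl \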
  --rules '[
    {
      "Name": "AWSManagedRulesCommonRuleSet",
      "Priority": 1,
      "OverrideAction": {"None": {}},
      "Statement": {"ManagedRuleGroupStatement": {
        "VendorName": "AWS", "Name": "AWSManagedRulesCommonRuleSet"
      }},
      "VisibilityConfig": {...}
    },
    {
      "Name": "RateLimit",
      "Priority": 2,
      "Action": {"Block": {}},
      "Statement": {"RateBasedStatement": {
        "Limit": 2000,
        "AggregateKeyType": "IP"
      }},
      "VisibilityConfig": {...}
    }
  ]'

# Managed rule groups:
# AWSManagedRulesCommonRuleSet: broad OWASP-style protections (XSS, LFI, RFI, oversized requests); SQLi is covered by AWSManagedRulesSQLiRuleSet
# AWSManagedRulesKnownBadInputsRuleSet: Log4Shell, Spring4Shell, etc.
# AWSManagedRulesAmazonIpReputationList: known malicious IPs, botnets
# AWSManagedRulesBotControlRuleSet: bot detection (paid: roughly $10/month subscription plus $1 per million requests inspected)

# Geo-blocking:
# Block specific countries or allow-list only your target countries
# Note: VPN/proxy bypass is possible — defense in depth, not sole control

# WAF logs → Kinesis Firehose → S3 → Athena for analysis
# Detect attack patterns, tune rules, build custom block lists

Cost Optimization

6 questions
1What is FinOps and how do you build a cost-aware cloud culture?

FinOps (Cloud Financial Operations) is a practice that brings engineering, finance, and business together to take ownership of cloud spending — enabling faster product delivery while maintaining financial control.

# FinOps lifecycle: Inform → Optimize → Operate

# Inform: make costs visible
# AWS Cost Explorer: visualize spending by service, account, tag
# Cost Allocation Tags: tag every resource (app, team, environment, cost-center)
# Budget alerts: SNS notification when actual or forecasted spend crosses a threshold
aws budgets create-budget --account-id ACCOUNT --budget '{
  "BudgetName": "payment-service-monthly",
  "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST",
  "CostFilters": {"TagKeyValue": ["user:app$payment-service"]}
}' --notifications-with-subscribers '[{
  "Notification": {"NotificationType": "ACTUAL", "ComparisonOperator": "GREATER_THAN",
    "Threshold": 80},
  "Subscribers": [{"SubscriptionType": "SNS", "Address": "arn:aws:sns:us-east-1:ACCOUNT:cost-alerts"}]
}]'

# Unit economics: cost per transaction, cost per user, cost per API call
# Enables developers to understand the financial impact of their code

# Cloud billing accountability:
# Chargeback: allocate costs to teams/products (internal billing)
# Showback: show costs without internal charging (awareness without punishment)
# Both require consistent tagging and org structure

# Cost anomaly detection:
aws ce create-anomaly-monitor --anomaly-monitor '{
  "MonitorName": "service-anomalies",
  "MonitorType": "DIMENSIONAL",
  "MonitorDimension": "SERVICE"
}'
# Alerts when a service's spend deviates significantly from historical pattern
2What are Savings Plans and Reserved Instances? How do you decide what to commit to?
# Savings Plans: commit to $/hr spend, automatic application
# Compute Savings Plans: most flexible — EC2 any family/region/OS + Fargate + Lambda
# EC2 Instance Savings Plans: locked to instance family + region, highest discount (up to 72%)
# SageMaker Savings Plans: ML training/inference

# Reserved Instances: commit to specific instance configuration
# Standard RI: exact instance type/AZ/OS, up to 72% discount
# Convertible RI: can exchange for different type, up to 54% discount
# Scope: regional (flexible) or zonal (capacity reservation)

# Decision framework:
# 1. Right-size BEFORE committing (otherwise you lock in spend on oversized instances)
# 2. Analyze 3+ months of usage: identify stable baseline
# 3. Commit baseline to Compute Savings Plans (most flexible)
# 4. On-demand for variable peak capacity above baseline

# Typical strategy:
# - 60-70% of normalized compute → 1-year Compute Savings Plans
# - Additional 10-20% for specific heavy workloads → EC2 Instance SP or RIs
# - Remainder → On-demand and Spot
# Expected savings: 30-50% reduction in compute costs

# Payment options:
# All Upfront: best discount but ties up capital
# Partial Upfront: balanced
# No Upfront: slightly lower discount, no capital required
# With an 8% annual cost of capital, All Upfront only pays off if its extra discount over No Upfront exceeds what the prepaid cash would otherwise earn over the term

# AWS Cost Explorer recommendations:
# Analyzes your usage → recommends specific SPs/RIs with projected savings
# Start with recommendations but validate against growth plans
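The payment-option trade-off above can be checked with a small present-value calculation. This is a minimal sketch, assuming payments fall at the end of each month and a constant annual cost of capital; the function names and sample amounts are illustrative, not AWS pricing.

```python
def present_value_of_monthly(monthly_payment: float, annual_rate: float, months: int = 12) -> float:
    """Present value of paying `monthly_payment` at the end of each month."""
    # Convert the annual cost of capital into an equivalent monthly rate.
    monthly_rate = (1 + annual_rate) ** (1 / 12) - 1
    return sum(monthly_payment / (1 + monthly_rate) ** m for m in range(1, months + 1))


def upfront_beats_monthly(monthly_payment: float, upfront_total: float,
                          annual_rate: float, months: int = 12) -> bool:
    """True if paying `upfront_total` today is cheaper than the discounted monthly stream."""
    return upfront_total < present_value_of_monthly(monthly_payment, annual_rate, months)


if __name__ == "__main__":
    # Hypothetical commitment: $100/month No Upfront vs. a single All Upfront payment.
    pv = present_value_of_monthly(100.0, 0.08)
    print(f"PV of 12 monthly payments at 8%: ${pv:.2f}")
    print(f"Break-even extra discount for All Upfront: {1 - pv / 1200:.1%}")
    print(upfront_beats_monthly(100.0, 1140.0, 0.08))
```

For a one-year term the break-even extra discount comes out well below the annual rate itself, because on average the prepaid cash is only tied up for about half the term.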
3What are the biggest AWS cost optimization opportunities and how do you identify waste?
# Top cost optimization opportunities:

# 1. EC2 right-sizing (often 30-50% over-provisioned):
# AWS Compute Optimizer: ML-based recommendations from actual utilization
# Action: downsize instances with <20% p95 CPU; switch to Graviton (20-40% saving)

# 2. Unattached resources (pure waste):
# EBS volumes not attached to instances
# Unused Elastic IPs (~$4/month each)
# Idle load balancers with no targets
# Old snapshots (set lifecycle policies)
aws ec2 describe-volumes --filters Name=status,Values=available  # unattached volumes
aws ec2 describe-addresses --filters Name=domain,Values=vpc | jq '.Addresses[] | select(.AssociationId == null)'

# 3. Data transfer costs (often 20-30% of bill):
# S3 → CloudFront instead of S3 → Internet (S3 → CloudFront transfer is free; CloudFront egress is slightly cheaper and cacheable)
# VPC Gateway endpoints for S3/DynamoDB (free — avoids NAT Gateway fees)
# Same-AZ traffic: keep chatty app ↔ DB traffic in the same AZ where HA requirements allow
# NAT Gateway: $0.045/GB → replace with Interface Endpoints for AWS services

# 4. RDS over-provisioned:
# Multi-AZ standby instance runs (and bills) 24/7; consider Aurora Serverless v2 for variable load
# Read replicas that sit idle: remove them or scale them in during off-peak hours

# 5. Over-retained CloudWatch Logs (can be 5-15% of bill):
# Set log group retention: 7 days for debug, 30 days for app, 365 for audit
# Export to S3 after retention window instead of keeping in CW

# 6. S3 storage class optimization:
# S3 Intelligent-Tiering or lifecycle policies to move to IA/Glacier
# S3 Storage Lens: identify large buckets with infrequent access
4How do you optimize data transfer costs in AWS?
# AWS data transfer pricing (approximate):
# Inbound to AWS: FREE
# EC2 → Internet: $0.09/GB (first 10 TB), then lower tiers
# EC2 → S3 (same region): FREE
# EC2 → S3 (different region): $0.02/GB
# EC2 AZ-to-AZ (same region): $0.01/GB each direction
# CloudFront → Internet: ~$0.085/GB (first 10 TB, US/Europe), lower at higher volume tiers

# Optimization strategies:

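# 1. CloudFront for S3 content delivery (cheaper egress plus caching):
# S3 → Internet: $0.09/GB;  S3 → CloudFront: free;  CF → Internet: ~$0.085/GB
# 10TB/month: S3 direct ≈ $900, via CloudFront ≈ $850 at list price; larger savings come from
# the CloudFront free tier, committed-use discounts, and fewer S3 GET requests on cache hits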

# 2. VPC Interface/Gateway Endpoints (eliminate NAT Gateway costs):
# NAT Gateway: $0.045/GB processed + $0.045/hr
# Interface endpoint: $0.01/GB + $0.01/hr per endpoint per AZ; Gateway endpoints (S3, DynamoDB) are free
# For Lambda/ECS calling S3, Secrets Manager, ECR: big savings

# 3. Same-AZ for high-volume traffic:
# Put RDS Read Replica and app in same AZ for reads ($0 vs $0.01/GB)
# ElastiCache cluster nodes in same AZ as primary consumers

# 4. Compress data:
# Enable GZIP compression on Firehose deliveries to S3 (e.g., 70% less data = 70% less transfer and storage cost)
# Enable ALB/CloudFront compression for HTTP responses

# 5. Direct Connect for large on-prem transfers:
# DX egress: ~$0.02/GB vs Internet: $0.09/GB
# Break-even is typically a few TB/month of on-prem traffic (monthly DX port fee ÷ per-GB saving)
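To make the NAT Gateway versus interface endpoint comparison concrete, the sketch below plugs in the per-GB and per-hour figures quoted above. It assumes a 730-hour month and that each interface endpoint is billed per AZ; the function and parameter names are illustrative, and real prices vary by Region.

```python
HOURS_PER_MONTH = 730  # assumption: average month length used for rough estimates


def nat_gateway_monthly(gb_processed: float, per_gb: float = 0.045, per_hour: float = 0.045) -> float:
    """Monthly cost of one NAT Gateway for the given traffic volume."""
    return gb_processed * per_gb + HOURS_PER_MONTH * per_hour


def interface_endpoint_monthly(gb_processed: float, endpoints: int = 1, azs: int = 1,
                               per_gb: float = 0.01, per_hour: float = 0.01) -> float:
    """Monthly cost of interface endpoints; the hourly fee applies per endpoint per AZ."""
    return gb_processed * per_gb + HOURS_PER_MONTH * per_hour * endpoints * azs


if __name__ == "__main__":
    # Hypothetical workload: 1 TB/month to three AWS services, endpoints in two AZs.
    traffic_gb = 1000
    print(f"NAT Gateway:         ${nat_gateway_monthly(traffic_gb):.2f}")
    print(f"Interface endpoints: ${interface_endpoint_monthly(traffic_gb, endpoints=3, azs=2):.2f}")
```

If the NAT Gateway has to stay for other internet-bound traffic, only its per-GB portion is saved, so the hourly endpoint fees should be weighed against the traffic actually moved off the NAT path.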
5How do you optimize serverless costs? Lambda, Fargate, and DynamoDB cost patterns.
# Lambda cost: $0.0000002/request + $0.0000166667/GB-second
# Optimization:
# 1. Right-size memory: more memory = faster = potentially cheaper
#    Lambda Power Tuning tool: test 10 memory sizes → find optimal price/perf
#    For CPU-bound functions, doubling memory can roughly halve duration → similar or lower cost with better latency
# 2. Reduce duration: move initialization outside handler, efficient libraries
# 3. ARM/Graviton2 (arm64): ~20% lower price per GB-second; AWS cites up to 19% better performance (up to 34% better price-performance)
# 4. Batch SQS messages: 10 messages per invocation vs 10 separate invocations
# 5. Reuse HTTP connections: requests.Session() reused across warm invocations
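The memory right-sizing point can be sketched as a simple cost function using the request and GB-second prices quoted above. This is a rough estimate that ignores the free tier; the workload numbers are hypothetical.

```python
def lambda_monthly_cost(invocations: int, avg_duration_ms: float, memory_mb: int,
                        price_per_request: float = 0.0000002,
                        price_per_gb_second: float = 0.0000166667) -> float:
    """Estimate monthly Lambda cost from invocation count, average duration, and memory size."""
    # Billed compute is duration (seconds) times configured memory (GB).
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    return invocations * price_per_request + gb_seconds * price_per_gb_second


if __name__ == "__main__":
    # Hypothetical function: 10M invocations/month, measured at two memory settings.
    print(f"512 MB @ 200 ms: ${lambda_monthly_cost(10_000_000, 200, 512):.2f}")
    print(f"1024 MB @ 90 ms: ${lambda_monthly_cost(10_000_000, 90, 1024):.2f}")
```

In this made-up case the larger memory setting is cheaper because duration drops by more than half; whether that holds for a real function has to be measured, for example with Lambda Power Tuning.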

# Fargate cost: ~20-30% premium over equivalent EC2 (no management overhead)
# When EC2 is cheaper: steady-state workloads > 50% utilization
# When Fargate wins: variable workloads, Fargate Spot for interruption-tolerant tasks, teams where node management is a real cost

# DynamoDB cost optimization:
# On-demand vs Provisioned:
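#   On-demand: $1.25/million write request units, $0.25/million read request units (pay per use;
#     list prices change, and AWS cut on-demand rates in late 2024, so check current pricing)
#   Provisioned: $0.00065/WCU-hour, $0.00013/RCU-hour (commit to capacity)
#   At these prices on-demand costs roughly 6-7x more per request than fully utilized provisioned capacity
#   Break-even: provisioned wins once sustained utilization exceeds roughly 15-20%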

# DynamoDB cost killers:
# Full table scans (Scan API) consume capacity proportional to table size
# Large items: 10KB item = 10 WCUs to write vs 1KB item = 1 WCU
# Returning all attributes when you need 2: use ProjectionExpression
# GSI storage: each GSI stores a copy of projected attributes
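The on-demand versus provisioned break-even can be derived from the unit prices quoted above. The sketch below assumes one WCU sustains one write per second for items up to 1 KB and treats utilization as the fraction of provisioned capacity actually consumed; the prices are default parameters so they can be replaced with current ones.

```python
def provisioned_cost_per_million_writes(price_per_wcu_hour: float = 0.00065,
                                        utilization: float = 1.0) -> float:
    """Effective cost per million writes on provisioned capacity at a given utilization."""
    # One WCU provides up to 3600 writes per hour; unused capacity is still billed.
    writes_per_wcu_hour = 3600 * utilization
    return price_per_wcu_hour / writes_per_wcu_hour * 1_000_000


def break_even_utilization(on_demand_per_million: float = 1.25,
                           price_per_wcu_hour: float = 0.00065) -> float:
    """Utilization above which provisioned capacity is cheaper than on-demand."""
    return provisioned_cost_per_million_writes(price_per_wcu_hour) / on_demand_per_million


if __name__ == "__main__":
    print(f"Provisioned at 100% utilization: ${provisioned_cost_per_million_writes():.4f} per million writes")
    print(f"Break-even utilization: {break_even_utilization():.1%}")
```

Auto scaling and reserved capacity shift the break-even further, so this is a starting point for the decision rather than a precise threshold.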
6How do you implement cost governance with tagging, budgets, and auto-remediation?
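# Tagging strategy: required tags enforced via SCP.
# Keys listed under one "Null" operator are ANDed (deny only if ALL tags are missing),
# so use one Deny statement per required tag. Scope ec2:RunInstances to the instance ARN,
# otherwise the tag condition is also evaluated against AMIs, subnets, and so on.
{
  "Version": "2012-10-17",
  "Statement": [
    {"Sid": "RequireAppTag", "Effect": "Deny",
     "Action": ["ec2:RunInstances", "rds:CreateDBInstance", "lambda:CreateFunction"],
     "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*", "arn:aws:lambda:*:*:function:*"],
     "Condition": {"Null": {"aws:RequestTag/app": "true"}}},
    {"Sid": "RequireEnvTag", "Effect": "Deny",
     "Action": ["ec2:RunInstances", "rds:CreateDBInstance", "lambda:CreateFunction"],
     "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*", "arn:aws:lambda:*:*:function:*"],
     "Condition": {"Null": {"aws:RequestTag/env": "true"}}},
    {"Sid": "RequireTeamTag", "Effect": "Deny",
     "Action": ["ec2:RunInstances", "rds:CreateDBInstance", "lambda:CreateFunction"],
     "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*", "arn:aws:lambda:*:*:function:*"],
     "Condition": {"Null": {"aws:RequestTag/team": "true"}}}
  ]
}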

# Required tags: app, env (prod/staging/dev), team, cost-center
# Enforce via SCP for creates; use Config rule for existing resources

# Budget hierarchy:
# Org total → account → team → application
# Budget action: auto-apply an SCP that denies non-essential creates once 80% of the budget is consumed

# Auto-remediation for waste:
# Lambda: find and terminate EC2 instances idle > 7 days
# EventBridge scheduled: daily scan for unattached EBS volumes → SNS alert
# AWS Config remediation: attach IAM password policy, enable S3 versioning

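# Tagging compliance report: list EC2 instances that lack the 'env' tag
# (the Tagging API only returns resources that are or were tagged; use the Config rule
#  required-tags to catch resources that never had any tag)
aws resourcegroupstaggingapi get-resources \
  --resource-type-filters ec2:instance \
  | jq -r '.ResourceTagMappingList[] | select(any(.Tags[]?; .Key == "env") | not) | .ResourceARN'
# Send the resulting ARNs to the owning team's Slack channel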

# Cost allocation: AWS Cost and Usage Report (CUR) → S3 → Athena → QuickSight
# Dashboard: cost by team, cost per feature, savings vs last month
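A tagging compliance check like the one described above can be sketched as a pure function over entries shaped like the `ResourceTagMappingList` items returned by the Resource Groups Tagging API. The required tag set mirrors the list above; the helper names and sample ARNs are hypothetical.

```python
REQUIRED_TAGS = {"app", "env", "team", "cost-center"}


def missing_tags(resource: dict) -> set:
    """Return the required tag keys that are absent from one resource entry."""
    present = {tag["Key"] for tag in resource.get("Tags", [])}
    return REQUIRED_TAGS - present


def non_compliant(resources: list) -> dict:
    """Map each non-compliant resource ARN to its sorted list of missing tag keys."""
    report = {}
    for resource in resources:
        missing = missing_tags(resource)
        if missing:
            report[resource["ResourceARN"]] = sorted(missing)
    return report


if __name__ == "__main__":
    sample = [
        {"ResourceARN": "arn:aws:ec2:us-east-1:111122223333:instance/i-aaa",
         "Tags": [{"Key": k, "Value": "x"} for k in ("app", "env", "team", "cost-center")]},
        {"ResourceARN": "arn:aws:ec2:us-east-1:111122223333:instance/i-bbb",
         "Tags": [{"Key": "app", "Value": "payments"}, {"Key": "env", "Value": "prod"}]},
    ]
    for arn, missing in non_compliant(sample).items():
        print(arn, "is missing", ", ".join(missing))
```

The report could then be grouped by owning team and posted to a chat channel or turned into tickets.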

HA & Disaster Recovery

6 questions
1What are RPO and RTO? How do they determine your DR strategy?

RPO (Recovery Point Objective): Maximum acceptable data loss measured in time. "We can afford to lose up to 1 hour of orders." RPO determines backup/replication frequency.

RTO (Recovery Time Objective): Maximum acceptable downtime. "We must be back online within 4 hours." RTO determines how fast you must fail over.

# DR strategies (cheapest to most expensive, slowest to fastest recovery):

# 1. Backup & Restore (RTO: hours; RPO: hours)
#    Cost: low — just backup storage
#    S3 backups + AMIs; restore from backup when disaster strikes
#    Use for: non-critical systems, dev/test environments

# 2. Pilot Light (RTO: 30-60 min; RPO: minutes)
#    Core DB replication running; minimal EC2 in DR region
#    Scale out EC2 on disaster; redirect DNS
#    Use for: important but not time-critical systems

# 3. Warm Standby (RTO: 10-30 min; RPO: seconds)
#    Reduced-capacity replica running at all times in DR region
#    Scale up on disaster (already running, just needs more capacity)
#    Use for: important systems, moderate budget

# 4. Active-Active / Multi-Site (RTO: near-zero; RPO: near-zero)
#    Full capacity in all regions simultaneously; traffic split
#    Route 53 latency/geolocation routing
#    Use for: critical revenue-generating systems
#    Cost: 2x infrastructure

# Matching strategy to business requirements:
# RPO=0, RTO=0: Active-active (most expensive)
# RPO<1min, RTO<15min: Warm standby
# RPO<1hr, RTO<4hr: Pilot light
# RPO<24hr, RTO<24hr: Backup and restore
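The matching table above can be read as "pick the cheapest strategy whose achievable RPO and RTO both meet the requirement". The sketch below encodes that reading; the thresholds are taken from the table, while the function name and the tie-breaking at the boundaries are assumptions.

```python
from datetime import timedelta

# (strategy, smallest RPO it is expected to meet, smallest RTO it is expected to meet),
# ordered from cheapest to most expensive.
STRATEGIES = [
    ("backup-and-restore", timedelta(hours=1), timedelta(hours=4)),
    ("pilot-light", timedelta(minutes=1), timedelta(minutes=15)),
    # Warm standby covers any non-zero target (timedelta resolution is 1 microsecond).
    ("warm-standby", timedelta(microseconds=1), timedelta(microseconds=1)),
    ("active-active", timedelta(0), timedelta(0)),
]


def choose_dr_strategy(rpo: timedelta, rto: timedelta) -> str:
    """Return the cheapest strategy that satisfies both the RPO and the RTO requirement."""
    for name, min_rpo, min_rto in STRATEGIES:
        if rpo >= min_rpo and rto >= min_rto:
            return name
    raise ValueError("RPO and RTO must be non-negative")


if __name__ == "__main__":
    print(choose_dr_strategy(timedelta(hours=12), timedelta(hours=12)))
    print(choose_dr_strategy(timedelta(minutes=30), timedelta(hours=2)))
    print(choose_dr_strategy(timedelta(seconds=30), timedelta(minutes=10)))
    print(choose_dr_strategy(timedelta(0), timedelta(0)))
```

Note that the stricter of the two targets drives the choice: a 30-second RPO forces warm standby even if the RTO alone would allow backup and restore.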
2How do you design multi-AZ high availability in AWS?
# AZ: one or more physically separate data centers within an AWS region, with independent power and networking
# A failure confined to one AZ → application keeps serving from the other AZs

# Multi-AZ HA architecture:
# ALB: spans the AZs you enable (use 3), routes around unhealthy targets in seconds
# EC2 ASG: min=3, desired=6, spread across 3 AZs (AZRebalance policy)
# RDS Multi-AZ: synchronous standby in second AZ, ~1-2min failover
# ElastiCache: multi-AZ with auto-failover enabled
# EKS: spread pods with topologySpreadConstraints (ECS: use a spread placement strategy on attribute:ecs.availability-zone):

topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels: {app: my-service}

# Aurora: 6 storage copies across 3 AZs by default; 1-2 reader instances in other AZs

# Failure testing with AWS Fault Injection Simulator (FIS):
# Simulate AZ failure: terminate all instances in us-east-1b
# Verify: application continues from us-east-1a and us-east-1c
# Chaos engineering: validate theoretical HA with actual failure injection

# Availability targets (SLA-style figures):
# Single EC2:       99.5%   (43.8 hrs downtime/year)
# Multi-AZ (2 AZ):  99.99%  (52 min downtime/year)
# Multi-AZ (3 AZ):  99.999% (5.2 min downtime/year; a design target, not a published SLA)
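The downtime figures for each availability level follow directly from the percentage. The sketch below shows the conversion, plus the textbook formula for redundant copies; it assumes a 365-day year and fully independent failures, which real AZs only approximate.

```python
HOURS_PER_YEAR = 24 * 365


def downtime_hours_per_year(availability_pct: float) -> float:
    """Expected yearly downtime in hours for a given availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


def parallel_availability(single_pct: float, copies: int) -> float:
    """Availability of `copies` independent replicas where any single one is sufficient."""
    unavailability = (1 - single_pct / 100) ** copies
    return (1 - unavailability) * 100


if __name__ == "__main__":
    for pct in (99.5, 99.99, 99.999):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% -> {hours:.2f} h/year ({hours * 60:.1f} min/year)")
    print(f"Two independent 99.5% instances: {parallel_availability(99.5, 2):.4f}%")
```

Correlated failures (shared dependencies, bad deployments, regional control-plane issues) make real-world numbers lower than the independent-failure formula suggests.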
3How do you implement multi-region failover? What are the challenges?
# Multi-region failover architecture:
# Primary: us-east-1; Secondary: us-west-2
# Route 53 failover routing with health checks

# Data layer options:
# Aurora Global Database: <1s replication, managed failover (<1min)
# DynamoDB Global Tables: multi-active (writes in every region), eventually consistent across regions, last-writer-wins conflict resolution
# S3 Cross-Region Replication: async (~minutes)
# RDS Read Replica: manual promotion on failure

# Infrastructure failover:
# Route 53 health check: monitors /health endpoint in primary region
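# On failure: health check fails (default: 3 consecutive 30s checks) → cached DNS answers expire → traffic shifts to secondary
# With 60s TTL: recovery in roughly 2-3 minutes (detection + TTL); 10s fast health checks shorten detection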
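DNS-based failover time is roughly health-check detection time plus the record TTL. The sketch below adds the two, assuming the Route 53 defaults of a 30-second check interval and a failure threshold of 3; it ignores resolvers that cache beyond the TTL, so treat the result as an estimate.

```python
def dns_failover_seconds(check_interval_s: int = 30, failure_threshold: int = 3,
                         record_ttl_s: int = 60) -> int:
    """Approximate worst-case time until clients resolve to the secondary region."""
    # The health check must fail `failure_threshold` times in a row before it is marked unhealthy.
    detection_s = check_interval_s * failure_threshold
    # Clients may keep using the cached answer until the TTL runs out.
    return detection_s + record_ttl_s


if __name__ == "__main__":
    print("Standard checks:", dns_failover_seconds(), "seconds")
    print("Fast (10s) checks:", dns_failover_seconds(check_interval_s=10), "seconds")
```

Lowering the TTL shortens failover but increases DNS query volume, which is part of why anycast-based options such as Global Accelerator are attractive for latency-sensitive systems.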

# Challenges:
# 1. Data consistency during failover:
#    Async replication → potential data loss (RPO = replication lag)
#    Prevent split-brain: ensure primary is truly down before promoting secondary

# 2. Regional service dependencies:
#    Does your app depend on services not available in DR region?
#    Test full failover regularly (not just in theory)

# 3. Runbook complexity:
#    DNS failover is automatic; DB promotion is often manual
#    Automate everything: EventBridge → Lambda → promote RDS replica → update Secrets Manager

# 4. Cost:
#    Warm standby = 50-100% additional cost for the standby region
#    Balance against cost of downtime (revenue/hr * expected outage hours/year)

# AWS Global Accelerator (alternative to Route 53 failover):
# Anycast IPs, health checks, sub-minute failover (no DNS TTL issue)
# Better for latency-sensitive applications
4What is chaos engineering and how do you practice it on AWS?
# Chaos engineering: deliberately inject failures to validate resilience
# Principle: "Break things in controlled ways to find weaknesses before real failures do"

# AWS Fault Injection Simulator (FIS):
# Terminate EC2 instances (simulate AZ failure)
# Throttle EC2 API calls (simulate service degradation)
# Inject network latency (simulate slow dependencies)
# Force an RDS Multi-AZ failover
# Drain ECS container instances

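# Sample FIS experiment: AZ failure simulation
# (roleArn is required; the role name below is a placeholder)
{
  "description": "Terminate instances in us-east-1b",
  "roleArn": "arn:aws:iam::ACCOUNT:role/FISExperimentRole",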
  "actions": {
    "TerminateInstances": {
      "actionId": "aws:ec2:terminate-instances",
      "parameters": {},
      "targets": {"Instances": "ProductionInstances"}
    }
  },
  "targets": {
    "ProductionInstances": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": {"env": "prod"},
      "filters": [{"path": "Placement.AvailabilityZone", "values": ["us-east-1b"]}],
      "selectionMode": "ALL"
    }
  },
  "stopConditions": [{"source": "aws:cloudwatch:alarm", "value": "arn:aws:cloudwatch:us-east-1:ACCOUNT:alarm:ErrorRateCritical"}]
}

# Chaos engineering process:
# 1. Define steady state (normal system behavior with metrics)
# 2. Form hypothesis: "Terminating AZ-b instances will not affect availability"
# 3. Inject failure in staging, then production (during business hours, with paging)
# 4. Observe: does steady state hold? What failed unexpectedly?
# 5. Fix weaknesses found; repeat

# GameDay: scheduled chaos exercises with full team present
# Lineage: Netflix Chaos Monkey popularized random instance termination; FIS offers similar experiments as a managed AWS service
5How do you implement backup and restore strategies for databases and stateful workloads?
# RDS / Aurora automated backups:
# Daily snapshots + transaction logs → point-in-time recovery to any second
# Retention: 1-35 days; 0 = disable (not recommended for production)
# Cross-region copy: copy snapshots to DR region for additional protection
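aws rds copy-db-snapshot \
  --region us-west-2 \
  --source-db-snapshot-identifier arn:aws:rds:us-east-1:ACCOUNT:snapshot:my-db-snap \
  --target-db-snapshot-identifier my-db-snap-dr \
  --source-region us-east-1
# --region is the destination (DR) region; --source-region is where the snapshot lives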

# DynamoDB Point-in-Time Recovery (PITR):
aws dynamodb update-continuous-backups \
  --table-name my-table \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true
# Restore to any point in last 35 days (per-second granularity)

# AWS Backup: centralized backup across RDS, DynamoDB, EFS, EC2, S3, EKS
# Backup plan: schedule, retention, cross-region/cross-account copy
# (cron(0 3 * * ? *) below = 03:00 UTC daily; JSON does not allow inline comments)
aws backup create-backup-plan --backup-plan '{
  "BackupPlanName": "daily-backup",
  "Rules": [{
    "RuleName": "daily",
    "TargetBackupVaultName": "prod-backups",
    "ScheduleExpression": "cron(0 3 * * ? *)",
    "StartWindowMinutes": 60,
    "CompletionWindowMinutes": 180,
    "Lifecycle": {"DeleteAfterDays": 90},
    "CopyActions": [{"DestinationBackupVaultArn": "arn:aws:backup:us-west-2:..."}]
  }]
}'

# Restore testing (critical — validate backups actually work!):
# Monthly: restore DB snapshot to test environment and validate data integrity
# AWS Backup restore testing: automated restore + validation Lambda
6What is cell-based architecture and how does it achieve extreme resilience?

Cell-based architecture (used by Amazon, Netflix, AWS itself) partitions workloads into independent cells that share nothing. A failure in one cell affects only the users mapped to that cell — not the entire system.

# Cell: complete, independent deployment with its own:
# - EC2/EKS cluster
# - Database (RDS or DynamoDB table)
# - Cache (ElastiCache)
# - Load balancer
# No shared infrastructure between cells

# Cell mapping: which users go to which cell?
# Simple: hash(user_id) % num_cells (remaps most users when the cell count changes); better: consistent hashing or an explicit user → cell mapping table
# Stored in a lightweight "routing plane" (DynamoDB or Route 53 GeoDNS)
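The difference between modulo assignment and consistent hashing for cell mapping can be sketched as follows. This is an illustrative hash ring with virtual nodes, not a description of how any particular AWS service routes requests; the class and cell names are hypothetical.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is randomized per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def modulo_cell(user_id: str, num_cells: int) -> int:
    """Naive assignment: most users move when num_cells changes."""
    return stable_hash(user_id) % num_cells


class CellRing:
    """Consistent-hash ring: adding a cell only moves users onto the new cell."""

    def __init__(self, cells, vnodes: int = 100):
        # Each cell owns `vnodes` points on the ring to even out the distribution.
        self._ring = sorted((stable_hash(f"{cell}#{i}"), cell)
                            for cell in cells for i in range(vnodes))
        self._points = [point for point, _ in self._ring]

    def cell_for(self, user_id: str) -> str:
        # Walk clockwise to the first ring point after the user's hash (wrapping around).
        idx = bisect.bisect(self._points, stable_hash(user_id)) % len(self._ring)
        return self._ring[idx][1]


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    old = CellRing(["cell-1", "cell-2", "cell-3", "cell-4"])
    new = CellRing(["cell-1", "cell-2", "cell-3", "cell-4", "cell-5"])
    moved_ring = sum(old.cell_for(u) != new.cell_for(u) for u in users)
    moved_mod = sum(modulo_cell(u, 4) != modulo_cell(u, 5) for u in users)
    print(f"Users remapped going from 4 to 5 cells: ring={moved_ring}, modulo={moved_mod}")
```

With stateful cells, every remapped user implies a data migration, which is why many cell-based systems prefer an explicit mapping table even over consistent hashing.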

# Benefits:
# Blast radius: a bug or infrastructure failure affects 1/N users (N = cell count)
# Deployments: roll out to 1 cell first, validate, then deploy to remaining cells
# Noisy neighbor: a misbehaving customer in cell 3 can't affect customers in cell 7

# How AWS itself uses cells (simplified illustrations):
# EC2 control plane: each AZ is a cell (launch instance in us-east-1a = cell 1)
# DynamoDB: each storage node is a cell
# S3: each partition (prefix range) served by independent cells

# When to consider cell-based architecture:
# You need better-than-multi-region availability (99.999%+)
# Large enough to justify the operational complexity
# Compliance: data locality requirements per tenant
# Examples: financial institutions, healthcare, large SaaS platforms

# AWS Shuffle Sharding (variant):
# Assign each customer a pseudo-random subset (shard) of nodes, e.g., 2 of 8 servers → 28 possible pairs
# Blast radius: a failure triggered by customer A only fully affects customers assigned the exact same shard
# Mathematically: with 2 of 8 nodes, two customers get the identical pair with probability 1/28, and it shrinks quickly as nodes are added
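The shuffle-sharding probabilities can be worked out with binomial coefficients. The sketch below assumes each customer's shard is chosen uniformly at random and independently; the function names are illustrative.

```python
from math import comb


def shard_count(nodes: int, shard_size: int) -> int:
    """Number of distinct shards (node subsets) of the given size."""
    return comb(nodes, shard_size)


def identical_shard_probability(nodes: int, shard_size: int) -> float:
    """Probability that two customers are assigned exactly the same shard."""
    return 1 / comb(nodes, shard_size)


def any_overlap_probability(nodes: int, shard_size: int) -> float:
    """Probability that two customers share at least one node."""
    # Complement of "the second customer's shard avoids all of the first customer's nodes".
    return 1 - comb(nodes - shard_size, shard_size) / comb(nodes, shard_size)


if __name__ == "__main__":
    for nodes, size in ((8, 2), (100, 5)):
        print(f"{size} of {nodes} nodes: {shard_count(nodes, size)} shards, "
              f"P(identical)={identical_shard_probability(nodes, size):.2e}, "
              f"P(any overlap)={any_overlap_probability(nodes, size):.2%}")
```

Partial overlap is common with small fleets, so clients need retries or failover to the non-shared node for the isolation benefit to materialize.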

Well-Architected & Multi-Cloud

7 questions
1What are the six pillars of the AWS Well-Architected Framework?

The AWS Well-Architected Framework provides a consistent approach for evaluating architectures and implementing designs that scale over time.

  • Operational Excellence: Run and monitor systems effectively. Infrastructure as code, small frequent reversible changes, anticipate failure, operations runbooks. Key services: CloudFormation, Systems Manager, CloudWatch.
  • Security: Protect data and systems. Least privilege, defense in depth, encryption everywhere, incident response. Key services: IAM, KMS, GuardDuty, Security Hub.
  • Reliability: Recover from failures, meet demand. Multi-AZ, auto scaling, chaos testing, DR. Key services: Route 53, ALB, ASG, RDS Multi-AZ.
  • Performance Efficiency: Use resources efficiently. Right instance types, managed services, CDN, caching. Key services: CloudFront, ElastiCache, Kinesis.
  • Cost Optimization: Avoid unnecessary costs. Right-sizing, Savings Plans, spot instances, serverless. Key services: Cost Explorer, Compute Optimizer, Trusted Advisor.
  • Sustainability: (Added 2021) Minimize environmental impact. Right-size, use managed services, maximize utilization. Graviton for energy efficiency, serverless for on-demand compute.
# Well-Architected Tool: guided review of your architecture
# Create workload → answer questions per pillar → get recommendations
# Trusted Advisor: automated checks for security, cost, performance, fault tolerance

# Common review findings and fixes:
# Security: S3 bucket public, root account used, MFA not enabled
# Reliability: single-AZ, no health checks, manual scaling
# Cost: overprovisioned EC2, unattached EBS, no Savings Plans
# Performance: no caching layer, synchronous everything, wrong instance family
2What is Infrastructure as Code? Compare CloudFormation, CDK, Terraform, and Pulumi.
# CloudFormation: AWS-native IaC (JSON/YAML templates)
# Deep AWS integration: first-class support for every AWS service
# StackSets: deploy same stack across multiple accounts/regions
# Drift detection: detect manual changes to stack resources
# CDK (Cloud Development Kit): define AWS infrastructure in Python/TypeScript/Java
# Synthesizes to CloudFormation under the hood
# Constructs: reusable components at L1 (raw), L2 (opinionated), L3 (patterns)

import aws_cdk as cdk
from aws_cdk import aws_s3 as s3, aws_lambda as lambda_
class MyStack(cdk.Stack):
    def __init__(self, scope, id, **kwargs):
        super().__init__(scope, id, **kwargs)
        bucket = s3.Bucket(self, "MyBucket", versioned=True, encryption=s3.BucketEncryption.S3_MANAGED)
        fn = lambda_.Function(self, "Handler",
            runtime=lambda_.Runtime.PYTHON_3_12,
            handler="index.handler",
            code=lambda_.Code.from_asset("lambda"),
            environment={"BUCKET": bucket.bucket_name})
        bucket.grant_read(fn)  # CDK handles IAM policy automatically!

# Terraform (HCL): multi-cloud, largest community
# State management: remote state in S3 + DynamoDB lock
# Provider ecosystem: AWS, GCP, Azure, Kubernetes, Datadog, etc.
# Atlantis / Spacelift: Terraform GitOps automation

# Pulumi: general-purpose languages (Python, TypeScript, Go, Java)
# Same code can provision across clouds
# Strong typing, IDE support, unit testing of infrastructure

# Decision:
# AWS-only + simplicity: CDK (with CloudFormation backing)
# Multi-cloud / existing Terraform org: Terraform
# Complex abstractions in real code: Pulumi
# Avoid: ClickOps (console) and raw CloudFormation JSON for new projects
3What is GitOps and how do you implement it with AWS CodePipeline or GitHub Actions?
# GitOps: Git repository is the single source of truth for infrastructure
# Any change to infra → PR → review → merge → automated pipeline applies it
# Desired state in Git; operator ensures cluster matches desired state

# GitHub Actions CI/CD pipeline for EKS:
name: Deploy to EKS
on:
  push:
    branches: [main]
permissions:
  id-token: write   # OIDC for AWS
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    env:
      ECR_REPO: ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/my-app   # placeholder repository URI
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::ACCOUNT:role/GitHubActionsDeployRole
          aws-region: us-east-1
      - name: Build and push container
        run: |
          aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REPO
          docker build -t $ECR_REPO:$GITHUB_SHA .
          docker push $ECR_REPO:$GITHUB_SHA
      - name: Deploy to EKS
        run: |
          aws eks update-kubeconfig --name my-cluster --region us-east-1
          helm upgrade --install my-app ./charts/my-app \
            --set image.tag=$GITHUB_SHA \
            --wait --timeout 10m

# ArgoCD (Kubernetes GitOps operator):
# Watches Git repo → applies manifests to cluster automatically
# Drift detection: alert if cluster state diverges from Git
# Rollback: git revert → ArgoCD automatically reverts deployment

# AWS CodePipeline for multi-account deployments:
# Source: CodeCommit/GitHub → Build: CodeBuild → Deploy: CloudFormation StackSets
# Cross-account deployment: assume role in target account to deploy
4What are multi-cloud architectures? When do they make sense and what are the pitfalls?

Multi-cloud uses services from two or more cloud providers. It's a spectrum from "we use AWS but also Cloudflare" to "our core workload runs on both AWS and GCP simultaneously."

Legitimate reasons for multi-cloud:

  • Best-of-breed services: GCP BigQuery for analytics, AWS for compute, Azure for M365 integration
  • Negotiating leverage: Realistic ability to switch prevents vendor lock-in and improves pricing negotiations
  • Regulatory: Some regions require specific local cloud providers
  • M&A: Acquired a company on a different cloud

Multi-cloud pitfalls:

  • Operational complexity: Two sets of tools, two sets of IAM, two sets of monitoring — double the skill requirements
  • Data transfer costs: Egress between clouds is expensive (roughly $0.08-0.09/GB at list price). Workloads that exchange large volumes of data across clouds pay for every gigabyte that leaves.
  • Lowest-common-denominator: Restricting yourself to cloud-agnostic tools (plain Kubernetes, Terraform) forgoes managed services that reduce operational overhead
  • "Multi-cloud resilience" is mostly a myth: A region outage (e.g. us-east-1) doesn't require a second cloud provider, because AWS has other regions. For most workloads, multi-region on one cloud is simpler and sufficient.
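
To make the data transfer pitfall concrete, here is a rough back-of-the-envelope sketch. It assumes a flat $0.09/GB rate and a hypothetical 2 TB/day workload; real pricing is tiered and varies by provider and region.

```python
def monthly_egress_cost(gb_per_day: float, usd_per_gb: float = 0.09, days: int = 30) -> float:
    """Estimate monthly cross-cloud egress spend at a flat per-GB rate."""
    return gb_per_day * days * usd_per_gb


# Hypothetical workload: a service on one cloud streaming 2 TB/day to another.
cost = monthly_egress_cost(gb_per_day=2000)
print(f"${cost:,.0f} per month")
```

At these assumed numbers the transfer alone costs about $5,400 a month, before any compute or storage.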
5. What is AWS Landing Zone and how do you bootstrap a new AWS Organization?
# Landing Zone: pre-configured, multi-account AWS environment with security best practices

# AWS Control Tower (managed Landing Zone):
# Sets up the following automatically:
# - Management account (billing, org policies)
# - Log Archive account (central CloudTrail, Config, S3 access logs)
# - Audit account (security tooling, GuardDuty aggregator, Security Hub)
# - Core OU structure: Security OU, plus an optional Sandbox OU (add a Workloads OU yourself)
# - Preventive guardrails (SCPs): e.g. block disabling CloudTrail or AWS Config (protect GuardDuty with a custom SCP)
# - Detective guardrails (Config rules): detect non-compliant resources
# - Account Factory: provision new accounts with consistent baseline

# Account Factory for Terraform (AFT):
# Git-based account provisioning
# PR → new account request → pipeline creates account with CT baseline
# Customizations: additional SCPs, IAM Identity Center assignments

# Must-have baseline for every new account:
resource "aws_cloudtrail" "main" {
  name                          = "org-baseline-trail" # hypothetical name; "name" is a required argument
  is_multi_region_trail         = true
  enable_log_file_validation    = true
  include_global_service_events = true
  s3_bucket_name                = var.central_log_bucket
}

resource "aws_guardduty_detector" "main" { enable = true }
resource "aws_securityhub_account" "main" {}
resource "aws_config_configuration_recorder" "main" { ... }

# Account Vending Machine concept:
# 1. Developer requests new account via ServiceNow/Jira
# 2. Approval workflow
# 3. Control Tower + AFT provisions account in ~20 minutes
# 4. Developer gets access via IAM Identity Center
# 5. Account has full baseline: logging, security, and networking attached to the organization's Transit Gateway (TGW)
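
A vending pipeline usually ends with a check that the baseline actually landed. The sketch below shows the idea in Python; the service labels and the hard-coded inventory are hypothetical stand-ins for data you would pull from the AWS APIs.

```python
REQUIRED_BASELINE = {"cloudtrail", "guardduty", "securityhub", "config"}


def missing_baseline(enabled_services: set[str]) -> list[str]:
    """Return required baseline services that the account has not enabled, sorted."""
    return sorted(REQUIRED_BASELINE - enabled_services)


# Hypothetical inventory for a freshly vended account.
new_account = {"cloudtrail", "config"}
print(missing_baseline(new_account))
```

An empty list means the account meets the baseline. Anything else should fail the provisioning run before developers get access.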
6. How do you design a cloud-native CI/CD pipeline for microservices?
# Complete microservices CI/CD pipeline:
# Git push → GitHub Actions → ECR → EKS (dev → staging → prod)

# Pipeline stages:
# 1. Code quality gate: lint, unit tests, SAST (Semgrep), secret scan (gitleaks)
# 2. Build: docker build → push to ECR (tagged with git SHA)
# 3. Deploy to dev: helm upgrade, smoke test
# 4. Integration tests: Postman/k6 against dev endpoint
# 5. Deploy to staging: same chart, prod-like config
# 6. Load test: k6 performance test, SLO validation
# 7. Approval gate: manual or auto (if SLOs pass)
# 8. Deploy to prod: canary (10%) → monitor 30 min → 50% → full rollout

# Blue-Green deployment on EKS:
# Two deployments: blue (current) and green (new)
# Service selector switches traffic when green is healthy
# Rollback: flip selector back to blue (seconds)

# Canary with Argo Rollouts:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  # replicas, selector, and template omitted; they look the same as in a Deployment
  strategy:
    canary:
      steps:
      - setWeight: 10       # 10% traffic to new version
      - pause: {duration: 30m}  # wait and monitor
      - setWeight: 50
      - pause: {duration: 15m}
      - setWeight: 100      # full rollout

# Feature flags (decouple deploy from release):
# Roll out to 100% of the infrastructure but expose the feature to only 1% of users
# LaunchDarkly / AWS AppConfig / Unleash
# Instant rollback: turn off the flag, no re-deployment needed
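
The percentage rollout behind a feature flag can be sketched with deterministic hashing, so a given user always gets the same answer. This is a simplified illustration, not how LaunchDarkly, AppConfig, or Unleash implement it. The flag name and user IDs are hypothetical.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: float) -> bool:
    """Deterministically place a user in a bucket from 0 to 9999 and compare to the rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # 0..9999, i.e. 0.01% granularity
    return bucket < rollout_percent * 100


# Hypothetical flag name and user IDs.
users = [f"user-{i}" for i in range(20_000)]
exposed = sum(is_enabled("new-checkout", u, 1.0) for u in users)
print(f"{exposed} of {len(users)} users see the feature")
```

Setting `rollout_percent` to 0 turns the feature off for everyone without a redeploy, which is the instant rollback described above. Hashing the flag name together with the user ID keeps different flags from always selecting the same users.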
7. How do you implement platform engineering and developer portals with cloud infrastructure?

Platform engineering builds internal developer platforms (IDPs) that abstract cloud complexity so developers can self-serve without deep cloud expertise. The opinionated, supported defaults a platform offers are its "golden paths".

# Internal Developer Platform components:

# 1. Service catalog (golden paths):
# Pre-built templates: "new microservice", "data pipeline", "ML workload"
# Developer creates new service → selects template → platform provisions:
#   - AWS account or EKS namespace with RBAC
#   - ECR repo, CodePipeline, GitHub repo with CI config
#   - Monitoring dashboards, logging configuration, IAM roles
# Service catalog tools: Backstage (Spotify), Port, OpsLevel

# 2. Backstage (open-source developer portal):
# Software catalog: all services, owners, runbooks in one place
# Golden path scaffolding: generate new services from templates
# Kubernetes plugin: cluster and workload visibility
# TechDocs: auto-generated documentation from markdown in repos

# 3. Crossplane (Kubernetes-native infrastructure provisioning):
# Developers request infrastructure (RDS, S3) via Kubernetes manifests
# Platform team defines compositions (what developers can request)
# Crossplane calls AWS APIs to provision
apiVersion: database.example.org/v1alpha1
kind: PostgreSQLInstance
metadata: {name: my-app-db}
spec:
  parameters: {storageGB: 20, version: "14"}
  compositeDeletePolicy: Foreground
  writeConnectionSecretToRef: {name: db-credentials}

# 4. Metrics: platform adoption, mean-time-to-deploy, developer NPS
# Goal: reduce cognitive load → developers ship faster, more reliably
# Platform team as product team: build features developers actually use
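
Two of the metrics above are simple to compute once the raw data exists. The sketch below uses hypothetical sample data, and assumes "time to deploy" means merge-to-production time and that NPS uses the standard 0-10 survey scale.

```python
from datetime import datetime, timedelta


def mean_time_to_deploy(events: list[tuple[datetime, datetime]]) -> timedelta:
    """Average time from merge to production deploy."""
    total = sum((deployed - merged for merged, deployed in events), timedelta())
    return total / len(events)


def nps(scores: list[int]) -> float:
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(s >= 9 for s in scores)
    detractors = sum(s <= 6 for s in scores)
    return 100 * (promoters - detractors) / len(scores)


# Hypothetical sample data.
t0 = datetime(2024, 1, 1, 9, 0)
deploys = [(t0, t0 + timedelta(minutes=30)), (t0, t0 + timedelta(minutes=90))]
print(mean_time_to_deploy(deploys))
print(nps([10, 9, 8, 7, 6, 3]))
```

Tracking both over time shows whether the platform is actually reducing friction or only adding another layer.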