Senior DevOps / Kubernetes
A complete set of senior-level DevOps and Kubernetes interview questions covering container orchestration, CI/CD pipelines, infrastructure as code, observability, networking, security, reliability engineering, and platform engineering for cloud-native systems.
Containers & Docker
Virtual machines emulate full hardware stacks and run a complete guest OS through a hypervisor. Each VM includes its own kernel, which makes VMs heavy (GBs) and slow to start (minutes).
Containers share the host kernel and isolate processes using two Linux kernel primitives:
- Namespaces: provide process isolation. The six classic namespaces are pid (process tree), net (network stack), mnt (filesystem), uts (hostname), ipc (shared memory), and user (UID/GID mappings); newer kernels also add cgroup and time namespaces. A container sees only the resources in its namespaces.
- cgroups (control groups): enforce resource limits (CPU, memory, disk I/O). They prevent any single container from starving others.
Container images use a union filesystem (OverlayFS) — a stack of read-only layers with a writable layer on top. Layers are content-addressed and shared across images, making images small and pulls fast.
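The layered lookup can be illustrated with a toy model. This is only a sketch of the idea, not how OverlayFS is implemented: the layer contents, the `WHITEOUT` marker, and the `resolve` helper are all hypothetical names chosen for illustration.

```python
# Toy model of a union filesystem: read-only layers plus a writable top layer.
WHITEOUT = object()  # marker meaning "this path was deleted in this layer"

def resolve(path, layers):
    """Look up a path in a stack of layers; the last list element is the top layer."""
    for layer in reversed(layers):
        if path in layer:
            entry = layer[path]
            return None if entry is WHITEOUT else entry
    return None

base = {"/etc/os-release": "debian", "/bin/sh": "dash"}        # shared base layer
app = {"/app/app.jar": "v1"}                                   # application layer
writable = {"/bin/sh": WHITEOUT, "/tmp/cache": "scratch data"}  # container's writable layer

layers = [base, app, writable]
print(resolve("/app/app.jar", layers))  # found in the application layer
print(resolve("/bin/sh", layers))       # hidden by the whiteout in the top layer
```

Because lower layers are never modified, two images built on the same base can share it on disk; only the top layers differ.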
Key trade-offs: Containers start in milliseconds, use far less memory, and achieve near-native performance. The downside is weaker isolation — a kernel vulnerability can affect all containers on the host. VMs provide stronger isolation (critical for multi-tenant environments where you don't trust the workload).
# Inspect the namespaces a container is placed in
docker run --rm -it ubuntu ls -la /proc/1/ns
# On the host, confirm process isolation
CID=$(docker run -d nginx)
ps aux | grep nginx # visible on the host, with host PIDs
docker exec "$CID" cat /proc/1/cmdline # inside the container nginx is PID 1 (the stock image may not include ps)
Multi-stage builds use multiple FROM instructions in one Dockerfile. Each stage can use a different base image, and you selectively copy artifacts from earlier stages into the final image.
Why it matters: Build tools (JDK, Maven, GCC, npm) are only needed at compile time. Including them in production images increases attack surface, image size, and pull times. Multi-stage builds give you a minimal runtime image with only what the application needs.
# Stage 1 — build (heavy image)
FROM maven:3.9-eclipse-temurin-21 AS builder
WORKDIR /app
COPY pom.xml .
RUN mvn dependency:go-offline # cache dependencies as a separate layer
COPY src ./src
RUN mvn package -DskipTests
# Stage 2 — runtime (minimal image)
FROM eclipse-temurin:21-jre-alpine
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
USER appuser
WORKDIR /app
COPY --from=builder /app/target/app.jar app.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar"]
Best practices:
- Order COPY and RUN statements from least to most frequently changing. Docker caches layers and invalidates everything below a changed layer.
- Prefer -alpine or -slim base images, or scratch/distroless for compiled languages (Go, Rust).
- Never run as root: create a dedicated non-root user.
- Pin base image digests (FROM ubuntu@sha256:...) for reproducibility.
Each Dockerfile instruction produces an image layer. Docker caches layers by comparing the instruction string and, for COPY/ADD, the checksums of the copied files. When a layer cache is invalidated, every subsequent layer is rebuilt.
Common cache-busting pitfalls:
- COPY . . early in the Dockerfile: any source file change (including unrelated ones) invalidates the cache for all subsequent steps, including dependency downloads.
- RUN apt-get update alone on one line: if cached, the package index becomes stale. Always combine: RUN apt-get update && apt-get install -y pkg && rm -rf /var/lib/apt/lists/*
- Using ADD with URLs: always re-downloads. Use curl inside RUN with explicit checksums instead.
- Embedding build timestamps or git SHAs via ARG or ENV early in the file: invalidates everything below.
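# Bad: COPY . . before installing dependencies kills caching.
# Any file change invalidates the COPY layer, so npm install reruns every time.
COPY . .
RUN npm install
# Good: copy the manifests first, install, then copy the source.
# The npm ci layer stays cached until package.json or package-lock.json changes;
# a source change only invalidates the final COPY and anything below it.
# (Comments sit on their own lines: a trailing comment after COPY is parsed as extra arguments.)
COPY package.json package-lock.json ./
RUN npm ci
COPY . .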
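The cascade effect of cache invalidation can be sketched with a simplified model in which each layer's key is a hash of its parent key, the instruction text, and the copied file contents. This is an illustrative approximation, not Docker's actual cache-key algorithm; `layer_keys` and `reused_layers` are hypothetical helpers.

```python
import hashlib

def layer_keys(steps):
    """steps: list of (instruction, copied_bytes). Returns one chained key per layer."""
    keys = []
    parent = ""
    for instruction, content in steps:
        h = hashlib.sha256()
        h.update(parent.encode())       # a changed parent changes every key below it
        h.update(instruction.encode())
        h.update(content)               # stands in for the checksum of COPY/ADD inputs
        parent = h.hexdigest()
        keys.append(parent)
    return keys

def reused_layers(old_keys, new_keys):
    """Count how many leading layers can be served from cache."""
    count = 0
    for old, new in zip(old_keys, new_keys):
        if old != new:
            break
        count += 1
    return count

lock = b"package-lock v1"

def good(src):
    return [("COPY package.json package-lock.json ./", lock),
            ("RUN npm ci", b""),
            ("COPY . .", src)]

def bad(src):
    return [("COPY . .", lock + src),
            ("RUN npm install", b"")]

# Simulate a source-only change between two builds.
print(reused_layers(layer_keys(good(b"src v1")), layer_keys(good(b"src v2"))))
print(reused_layers(layer_keys(bad(b"src v1")), layer_keys(bad(b"src v2"))))
```

In the manifest-first ordering the dependency install layer is reused after a source-only change; in the other ordering nothing is.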
A container registry stores and serves OCI-compliant images. Options: Docker Hub, AWS ECR, GCP Artifact Registry, GitHub Container Registry, Harbor (self-hosted).
Image promotion strategy: Build the image once, promote the same immutable artifact through environments. Never rebuild for staging/prod — different builds are different artifacts.
# Build once, tagged with the short git commit SHA (unique per commit)
IMAGE=registry/myapp:$(git rev-parse --short HEAD)
docker build -t "$IMAGE" .
docker push "$IMAGE"
# In CI: tag the same image for dev
docker tag "$IMAGE" registry/myapp:dev
docker push registry/myapp:dev
# After dev validation, promote the same image to staging
docker tag "$IMAGE" registry/myapp:staging
docker push registry/myapp:staging
# After staging sign-off, promote to prod: same bits, new tags
docker tag "$IMAGE" registry/myapp:latest
docker tag "$IMAGE" registry/myapp:v1.4.2
# docker push takes a single image reference, so push each tag separately
docker push registry/myapp:latest
docker push registry/myapp:v1.4.2
Image scanning: Integrate vulnerability scanning (Trivy, Snyk, AWS ECR scanning) into CI. Fail builds on critical CVEs. Scan continuously, not just at build time — new CVEs are discovered against already-deployed images.
Kubernetes communicates with container runtimes through the Container Runtime Interface (CRI) — a gRPC API that kubelet uses to manage pod lifecycle without caring about the specific runtime.
- containerd — the most widely deployed CRI runtime. Originally part of Docker, now a CNCF project. Manages image pull, storage, and low-level container operations via the runc OCI runtime.
- CRI-O — lightweight CRI implementation from Red Hat, purpose-built for Kubernetes. Used by OpenShift by default. Supports any OCI-compliant runtime.
- Docker Engine: no longer supported directly as a Kubernetes runtime since Kubernetes 1.24, when dockershim was removed (it can still be used through the external cri-dockerd adapter). Docker itself uses containerd underneath, so images built with Docker run fine.
The runtime stack when a pod is created: kubelet → CRI (containerd/CRI-O) → OCI runtime (runc/kata/gVisor). Kata Containers and gVisor provide stronger isolation by running containers inside lightweight VMs or sandboxed kernels.
# Check which runtime a node uses
kubectl get node <node-name> -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}'
# e.g., containerd://1.7.2
# Check runtime on the node directly
crictl info | jq .config.containerdEndpoint
Distroless images (Google's gcr.io/distroless series) contain only the application and its runtime dependencies — no shell, no package manager, no coreutils. They are purpose-built for production security.
Alpine uses musl libc and busybox, producing small images (~5 MB base). It has a shell and apk package manager, which aids debugging but increases attack surface.
Comparison:
- Attack surface: Distroless wins — no shell means an attacker who gains code execution can't easily explore or download additional tools.
- Size: Similar in practice for JVM apps (the JRE dominates). For Go/Rust, distroless :static or scratch can produce single-binary images.
- Debugging: Alpine wins. You can docker exec and run commands. With distroless you need ephemeral debug containers (kubectl debug).
- Compatibility: Alpine's musl libc can cause subtle issues with glibc-linked binaries. Distroless uses glibc, which is the standard for most Linux software.
# Distroless for Java
FROM gcr.io/distroless/java21-debian12
COPY --from=builder /app/target/app.jar /app.jar
# The distroless Java image already sets its entrypoint to "java -jar", so pass the JAR path as CMD
CMD ["/app.jar"]
# Debug a distroless container in Kubernetes
kubectl debug -it <pod> --image=busybox --target=<container>
Recommendation: Use distroless in production for any internet-facing workload. Use Alpine in development or for internal tooling where debuggability matters more.
Wrong ways:
- Hardcoding secrets in the Dockerfile (ENV DB_PASSWORD=secret): baked into every image layer, visible in docker inspect, persisted in registries forever.
- Copying secrets into the image via COPY .env . : same problem, plus the file remains in the layer even if you RUN rm .env later (the original layer still exists).
- Passing secrets as build args (ARG SECRET): build args are visible in the image history via docker history --no-trunc.
Right ways at build time:
- BuildKit secret mounts: secrets are mounted into the build container as tmpfs for a single RUN step and are never written to any layer.
- SSH forwarding: for private git repos, use RUN --mount=type=ssh.
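# syntax=docker/dockerfile:1
# The syntax directive above must be the first line of the Dockerfile.
# BuildKit secret mount: the secret never appears in image history
FROM node:20-alpine
RUN --mount=type=secret,id=npm_token \
NPM_TOKEN=$(cat /run/secrets/npm_token) npm install
# Build with (the token is read from the NPM_TOKEN environment variable on the build host):
docker build --secret id=npm_token,env=NPM_TOKEN .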
Right ways at runtime: Inject secrets at runtime from a secrets manager (AWS Secrets Manager, HashiCorp Vault, Kubernetes Secrets). Never bake runtime secrets into images. In Kubernetes, prefer mounting secrets as files over environment variables where possible, because env vars can be leaked by the application in logs or error pages.
What is the difference between CMD and ENTRYPOINT? What is the exec form vs shell form?
ENTRYPOINT defines the executable that always runs. It cannot be overridden from docker run without the --entrypoint flag. Use it for the main process.
CMD provides default arguments to ENTRYPOINT, or the default command if no ENTRYPOINT is set. It is easily overridden from docker run <image> <command>.
Exec form (["executable", "arg1"]) — runs the process directly. PID 1 is the application. Signals (SIGTERM for graceful shutdown) are delivered directly to the process. Always prefer exec form in production.
Shell form (CMD command arg1) — runs via /bin/sh -c. The shell is PID 1, and the actual process is a child. SIGTERM is received by the shell which may not forward it, causing unclean shutdowns.
# Correct: exec form, signals work properly
ENTRYPOINT ["java", "-jar", "app.jar"]
# Overridable default arguments (keep the comment on its own line; a trailing comment breaks the JSON array)
CMD ["--spring.profiles.active=prod"]
# Running:
docker run myapp # uses CMD default
docker run myapp --spring.profiles.active=dev # CMD overridden
# Bad: shell form, signals not forwarded
ENTRYPOINT java -jar app.jar
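Exec form only helps if the application itself reacts to SIGTERM. The following is a minimal sketch of a process that stops cleanly when it receives the signal; the `serve` loop and the `stop` event are hypothetical stand-ins for a real server, and the code assumes it runs in the main thread on a POSIX system.

```python
import os
import signal
import threading

stop = threading.Event()  # set once a shutdown has been requested

def handle_sigterm(signum, frame):
    # Only record the request here; the main loop does the actual cleanup.
    stop.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def serve(max_iterations):
    """Process work items until told to stop; returns how many were handled."""
    done = 0
    while not stop.is_set() and done < max_iterations:
        done += 1  # stand-in for handling one request
    return done

print("handled", serve(3), "items; pid", os.getpid())
```

When the process is PID 1 in the container (exec form), `docker stop` or a pod termination delivers SIGTERM straight to this handler; with shell form the shell may swallow it and the process is killed after the grace period instead.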
Kubernetes Core
The control plane is the brain of Kubernetes. It runs on master nodes (or as a managed service in EKS/GKE/AKS).
- etcd — the distributed key-value store that is the single source of truth for all cluster state. All control plane components are stateless; only etcd persists data. Losing etcd without a backup means losing the cluster.
- kube-apiserver — the only component all others communicate with. It validates and processes REST requests, persists state to etcd, and serves the Kubernetes API. Horizontally scalable.
- kube-scheduler — watches for unbound Pods and assigns them to nodes by running a scoring algorithm (considering resource requests, affinity, taints/tolerations, topology spread). Does not start pods itself — writes the node name to the Pod spec.
- kube-controller-manager — runs controller loops that reconcile actual state to desired state. Includes the ReplicaSet controller, Deployment controller, Node controller (detects node failures), Job controller, etc.
- cloud-controller-manager — optional; runs cloud-provider-specific controllers (load balancer provisioning, node lifecycle). Separates cloud logic from core Kubernetes.
On each worker node:
- kubelet — the node agent. Receives PodSpecs from the API server and ensures containers are running. Reports node and pod status. Runs health checks (liveness/readiness probes).
- kube-proxy: maintains network rules (iptables or IPVS) for Service routing. Some clusters replace it with an eBPF-based implementation such as Cilium.
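The scheduler's filter-then-score approach described above can be sketched in a few lines. This is a toy model under simplifying assumptions: the node and pod dictionaries, their field names, and the scoring rule (prefer the node with the most headroom left) are hypothetical and far simpler than the real scheduling plugins.

```python
def schedule(pod, nodes):
    """Pick a node name for the pod, or return None if no node fits."""
    # Filter: drop nodes without enough free resources or with untolerated taints.
    feasible = [
        n for n in nodes
        if n["free_cpu_m"] >= pod["cpu_m"]
        and n["free_mem_mi"] >= pod["mem_mi"]
        and n.get("taints", set()) <= pod.get("tolerations", set())
    ]
    if not feasible:
        return None  # the pod would stay Pending

    # Score: fraction of capacity still free after placing the pod (higher is better).
    def score(n):
        cpu_left = (n["free_cpu_m"] - pod["cpu_m"]) / n["cpu_m"]
        mem_left = (n["free_mem_mi"] - pod["mem_mi"]) / n["mem_mi"]
        return min(cpu_left, mem_left)

    return max(feasible, key=score)["name"]

nodes = [
    {"name": "node-a", "cpu_m": 4000, "mem_mi": 8192, "free_cpu_m": 500, "free_mem_mi": 4096},
    {"name": "node-b", "cpu_m": 4000, "mem_mi": 8192, "free_cpu_m": 3000, "free_mem_mi": 6144},
    {"name": "node-c", "cpu_m": 8000, "mem_mi": 16384, "free_cpu_m": 8000,
     "free_mem_mi": 16384, "taints": {"gpu"}},
]
pod = {"cpu_m": 1000, "mem_mi": 1024}
print(schedule(pod, nodes))
```

The function only returns a decision, mirroring how the scheduler writes a node name and leaves starting the containers to the kubelet.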
What happens when you run kubectl apply -f deployment.yaml?
Understanding this flow is essential for debugging and knowing where failures can occur:
- kubectl reads and validates the YAML client-side. With client-side apply it computes a patch against the last-applied-configuration annotation (server-side apply tracks field managers instead). It sends a PATCH/POST to the kube-apiserver.
- kube-apiserver authenticates the request (client cert, token, OIDC), authorizes it (RBAC), runs admission controllers (webhooks, PodSecurity, ResourceQuota), validates the schema, and persists to etcd.
- Deployment controller (in kube-controller-manager) watches for Deployment changes. It creates or updates a ReplicaSet.
- ReplicaSet controller creates Pod objects with no nodeName, so they are "Pending".
- kube-scheduler watches for unscheduled Pods, runs its algorithm, and writes a nodeName into the Pod spec via the API server.
- kubelet on the assigned node watches for Pods assigned to it. It instructs the container runtime (containerd) to pull the image and start the container.
- kubelet reports Pod status back to the API server (ContainerCreating → Running). Readiness probes start after initialDelaySeconds.
- If a Service exists, kube-proxy / Cilium updates routing rules so traffic can reach the new Pod.
- Deployment — manages stateless applications. Handles rolling updates and rollbacks by creating new ReplicaSets and gradually shifting traffic. Use for web servers, APIs, microservices — anything where pods are interchangeable.
- ReplicaSet — ensures N identical pod replicas are running. Not used directly; Deployments manage ReplicaSets. You might reference a ReplicaSet directly during a rollback investigation.
- StatefulSet: for stateful applications needing stable identities. Pods get stable DNS names (db-0, db-1), stable persistent volumes, and ordered startup/shutdown. Use for databases (PostgreSQL, Kafka, Elasticsearch, Redis Cluster). Never use Deployments for clustered databases: they don't preserve identity.
- DaemonSet: runs exactly one pod per node (or per selected nodes). Use for infrastructure components: log collectors (Fluentd, Filebeat), monitoring agents (node-exporter, Datadog agent), CNI plugins, CSI drivers, security scanners.
- Job: runs a pod to completion. Use for batch processing, database migrations, report generation. completions controls how many successful completions are needed; parallelism controls concurrency.
- CronJob: creates Jobs on a cron schedule. Use for periodic tasks: backups, cache warming, cleanup jobs. Be aware of the concurrencyPolicy (Forbid vs Replace vs Allow) to handle schedule overlaps.
- Liveness probe — answers "is this container alive?" Failure → kubelet restarts the container. Use to recover from deadlocks or unrecoverable states. Danger: A liveness probe that's too aggressive (low timeout, probes a slow dependency) causes restart loops. Never probe external dependencies in a liveness check — if your DB is down, restarting the app doesn't fix it.
- Readiness probe — answers "is this container ready to receive traffic?" Failure → pod is removed from Service endpoints (no traffic). The container is not restarted. Use to signal when an app has finished startup, is under load, or temporarily unavailable. Danger: Wrong path returns 404 → pod never receives traffic.
- Startup probe — answers "has the container finished starting?" While failing, kubelet waits and does not run liveness/readiness probes. Use for slow-starting applications (JVM warm-up, loading large models) to avoid premature liveness restarts.
livenessProbe:
  httpGet:
    path: /actuator/health/liveness # must only check internal state
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health/readiness # can check DB connectivity
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  failureThreshold: 30 # 30 * 10s = 5 minutes to start
  periodSeconds: 10
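The timing budgets implied by these probe settings are simple arithmetic. The helper below is a rough sketch that ignores timeoutSeconds and scheduling jitter; `probe_budget_seconds` is a hypothetical name.

```python
def probe_budget_seconds(failure_threshold, period_seconds, initial_delay_seconds=0):
    """Approximate time before a continuously failing probe triggers its action."""
    return initial_delay_seconds + failure_threshold * period_seconds

# Startup probe from the example: how long the app may take to start.
print(probe_budget_seconds(failure_threshold=30, period_seconds=10))

# Liveness probe from the example: roughly how long a hung app runs before a restart.
print(probe_budget_seconds(failure_threshold=3, period_seconds=10))
```

Sizing the startup budget above the slowest observed start time, and keeping the liveness budget well above normal pauses such as GC, is what prevents restart loops.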
A Service is a stable virtual IP (ClusterIP) and DNS name that load-balances traffic to a set of pods matching its selector. kube-proxy maintains iptables/IPVS rules on every node to implement this routing.
- ClusterIP (default): reachable only within the cluster. Use for internal service-to-service communication. DNS: service-name.namespace.svc.cluster.local.
- NodePort: opens a port (30000–32767) on every node that forwards to the ClusterIP. Useful for direct external access without a cloud load balancer, or for on-premise clusters. Not recommended for production because it exposes a port on every node.
- LoadBalancer: provisions an external cloud load balancer (for example an AWS NLB or a GCP load balancer) pointing to NodePorts. Used for production ingress in cloud environments. Each Service gets its own LB and IP, which is expensive at scale (use Ingress instead).
- ExternalName — CNAME alias to an external DNS name. Use to make external services (RDS, third-party APIs) accessible within the cluster via a Kubernetes service name, allowing you to swap the actual endpoint without changing application code.
Headless Service (clusterIP: None) — returns pod IPs directly from DNS instead of a virtual IP. Required by StatefulSets for stable per-pod DNS, and useful for client-side load balancing (e.g., gRPC).
Ingress is a Kubernetes API object that defines HTTP/HTTPS routing rules (host-based and path-based) to backend Services. An Ingress Controller is a pod that watches Ingress objects and implements the rules — it's not built into Kubernetes.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx
  tls:
    - hosts: [api.example.com]
      secretName: api-tls
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /v1
            pathType: Prefix
            backend:
              service: { name: api-v1, port: { number: 80 } }
NGINX Ingress Controller — the most common implementation. Uses nginx.conf under the hood. Highly configurable via annotations, but annotations are vendor-specific and not portable.
Gateway API (the successor to Ingress) — a more expressive, role-oriented API (GatewayClass, Gateway, HTTPRoute, TCPRoute). Supports traffic splitting natively (no annotations), multiple protocols, and separates infrastructure concerns (Gateway) from routing concerns (HTTPRoute managed by developers). Gateway API is the future; adopt it for new deployments.
ConfigMap stores non-sensitive configuration (key-value pairs, config files, env files). Secret stores sensitive data — base64-encoded by default (not encrypted). For real security, enable etcd encryption at rest, or use external secrets operators (External Secrets Operator with AWS Secrets Manager / Vault).
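To make the "base64 is not encryption" point concrete, here is a small sketch showing that anyone who can read a Secret's data field can recover the value with no key at all. The password string is a made-up example.

```python
import base64

plaintext = "s3cr3t-password"  # hypothetical value for illustration

# What ends up in the Secret's data field: an encoding, not a cipher.
encoded = base64.b64encode(plaintext.encode()).decode()

# Recovering the original needs no key or passphrase.
decoded = base64.b64decode(encoded).decode()

print(encoded)
print(decoded)
```

This is why read access to Secrets in RBAC, and encryption of etcd at rest, matter more than the encoding itself.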
Consumption methods:
- Environment variables: envFrom (all keys as env vars) or env.valueFrom (individual keys). Simple, but values are static at pod startup (not updated if the ConfigMap changes).
- Volume mounts: mount as files in a directory. Files are updated automatically when the ConfigMap changes (with a short delay, ~1min). Ideal for config files the app reads at runtime.
- Command args: reference $(VAR_NAME) in the args array after defining it as an env var.
spec:
  containers:
    - name: app
      envFrom:
        - configMapRef:
            name: app-config
        - secretRef:
            name: app-secrets
      volumeMounts:
        - name: config-vol
          mountPath: /etc/config
  volumes:
    - name: config-vol
      configMap:
        name: app-config
Best practices: Prefer volume mounts over env vars for secrets (harder to accidentally log). Use immutable: true for ConfigMaps and Secrets that should not change — Kubernetes stops watching them, reducing API server load.
Namespaces are a logical partitioning mechanism within a single cluster. They provide scoping for names, RBAC policies, resource quotas, and network policies.
What namespaces provide:
- Name scoping — same Service name can exist in multiple namespaces.
- RBAC boundaries — roles and bindings are namespace-scoped by default.
- ResourceQuota — limit CPU, memory, and object counts per namespace.
- LimitRange — set default requests/limits for pods in the namespace.
- Network isolation — via NetworkPolicy (not automatic; must be implemented).
Limitations: Namespaces do not provide strong isolation. ClusterRoles can span namespaces. Some resources (Nodes, PersistentVolumes, StorageClasses, ClusterRoles) are cluster-scoped, not namespace-scoped. A misconfigured RBAC binding can grant cross-namespace access. For true multi-tenant isolation of untrusted workloads, use separate clusters, or virtual cluster tools (vcluster).
A rolling update replaces pods incrementally by creating a new ReplicaSet with the updated template and gradually scaling it up while scaling the old ReplicaSet down.
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1 # at most 1 pod down at a time (absolute or %)
      maxSurge: 2 # at most 2 extra pods above desired (absolute or %)
- maxUnavailable: 0 + maxSurge: 1 — zero-downtime update. Always keeps full capacity; slightly slower.
- maxUnavailable: 1 + maxSurge: 0 — conserves resources (useful when cluster is at capacity). Briefly reduced capacity.
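The capacity bounds these two settings impose during a rollout can be worked out with a short calculation. The sketch below assumes the documented rounding behaviour for percentages (maxUnavailable rounds down, maxSurge rounds up); `to_count` and `rollout_bounds` are hypothetical helper names.

```python
import math

def to_count(value, replicas, round_up):
    """Convert an absolute number or a percentage string like '25%' into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)

def rollout_bounds(replicas, max_unavailable, max_surge):
    """Minimum available pods and maximum total pods at any point in the rollout."""
    unavailable = to_count(max_unavailable, replicas, round_up=False)
    surge = to_count(max_surge, replicas, round_up=True)
    return {"min_available": replicas - unavailable, "max_total": replicas + surge}

print(rollout_bounds(10, 1, 2))          # the values from the example above
print(rollout_bounds(10, "25%", "25%"))  # percentage form
```

The max_total figure is what the cluster needs spare capacity for; if it is not available, new pods stay Pending and the rollout stalls.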
Rollback:
# Monitor rollout
kubectl rollout status deployment/myapp
# Rollback to previous version immediately
kubectl rollout undo deployment/myapp
# Rollback to a specific revision
kubectl rollout history deployment/myapp
kubectl rollout undo deployment/myapp --to-revision=3
Recreate strategy — kills all old pods before starting new ones. Causes downtime. Use only when two versions of the app cannot run simultaneously (e.g., a schema migration that breaks backwards compatibility — though better to avoid this with expand-contract migrations).
The controller pattern is the heart of Kubernetes. A controller watches the current state of resources (via Informers / list-watch on the API server) and continuously reconciles it toward the desired state. Controllers are edge-triggered (watch events) but level-driven (they re-reconcile based on current state, not the delta, making them idempotent).
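A level-driven reconcile step can be sketched without any Kubernetes machinery: it looks only at desired versus actual state and returns the actions that close the gap. The `reconcile` function and its action tuples are hypothetical and only illustrate the idea.

```python
def reconcile(desired_replicas, running_pods):
    """Return the actions needed to move the actual state toward the desired state."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        return [("create", None)] * diff
    if diff < 0:
        # Deterministic victim selection keeps repeated runs predictable.
        return [("delete", name) for name in sorted(running_pods)[:-diff]]
    return []  # already converged: reconciling again is a no-op

print(reconcile(3, ["pod-a"]))
print(reconcile(1, ["pod-b", "pod-a", "pod-c"]))
print(reconcile(2, ["pod-a", "pod-b"]))
```

Because the function depends only on current state, a missed or duplicated watch event does no harm: the next run computes the same answer.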
Writing a custom controller / operator:
- Define a Custom Resource Definition (CRD) — extends the Kubernetes API with your own resource type.
- Write a controller that watches your CRD and reconciles. Use controller-runtime (Go) or the Operator SDK.
- Package as a Deployment in the cluster.
// controller-runtime reconciler skeleton (Go)
func (r *MyAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    myApp := &myv1.MyApp{}
    if err := r.Get(ctx, req.NamespacedName, myApp); err != nil {
        // A NotFound error means the resource was deleted; nothing to reconcile
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }
    // Create or update a Deployment to match desired state
    dep := buildDeployment(myApp)
    // CreateOrUpdate is idempotent; the mutate function sets the desired fields
    _, err := controllerutil.CreateOrUpdate(ctx, r.Client, dep, func() error {
        dep.Spec = buildDeploymentSpec(myApp)
        // The owner reference lets garbage collection remove the Deployment with its MyApp
        return ctrl.SetControllerReference(myApp, dep, r.Scheme)
    })
    if err != nil {
        return ctrl.Result{}, err
    }
    // Requeue after 5 minutes for drift detection
    return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
}
Use the Operator Pattern (CRD + Controller) to automate complex operational knowledge — day 2 operations like backups, failover, scaling decisions for stateful systems.
Admission controllers intercept requests to the API server after authentication and authorization but before persistence. They are the enforcement layer for cluster policy.
Two types of webhooks:
- MutatingAdmissionWebhook — can modify the request (e.g., inject sidecar containers, add labels, set default resource limits, inject image pull secrets). Runs before validating webhooks.
- ValidatingAdmissionWebhook — can approve or reject the request (cannot modify). Used for policy enforcement.
Common uses:
- Istio / Linkerd — mutating webhook injects the sidecar proxy container into every pod.
- OPA/Gatekeeper — validating webhook that evaluates Rego policies (block pods without resource limits, enforce label requirements, restrict image registries).
- Kyverno — policy engine that can both validate and mutate. More Kubernetes-native policy language than Rego.
- Pod Security Admission (built-in, stable since 1.25): enforces Pod Security Standards (Privileged, Baseline, Restricted) per namespace without an external webhook.
# Kyverno policy: require all pods to have resource limits
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-limits
      match:
        resources: { kinds: [Pod] }
      validate:
        message: "Resource limits are required."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    memory: "?*"
                    cpu: "?*"
HPA adjusts the replica count of a Deployment/StatefulSet based on metrics. It polls the Metrics API every 15 seconds and applies: desiredReplicas = ceil(currentReplicas × (currentMetric / targetMetric))
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  scaleTargetRef: { kind: Deployment, name: api }
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }
    - type: Resource
      resource:
        name: memory
        target: { type: AverageValue, averageValue: 512Mi }
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # prevents flapping
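The scaling formula can be turned into a small worked example. This sketch adds two things beyond the bare formula: clamping to minReplicas/maxReplicas and a 10% tolerance band within which no scaling happens (the tolerance value is an assumption based on the HPA default). It ignores stabilization windows and pods that are not yet ready.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), clamped."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do nothing
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# 4 replicas averaging 90% CPU against a 70% target
print(desired_replicas(4, 90, 70, min_replicas=2, max_replicas=20))
# 10 replicas averaging 35% CPU against a 70% target
print(desired_replicas(10, 35, 70, min_replicas=2, max_replicas=20))
```

When several metrics are configured, as in the manifest above, the HPA evaluates each one and uses the largest resulting replica count.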
HPA limitations: Out of the box it scales on CPU and memory from the resource Metrics API. Custom and external metrics (queue depth, Kafka lag) need a separate metrics adapter to be installed and configured. It cannot scale to zero by default.
KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with 60+ event source scalers: Kafka lag, RabbitMQ queue depth, AWS SQS, Prometheus queries, cron schedules. KEDA can scale to zero and back — ideal for event-driven workloads where running idle pods wastes resources.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
spec:
  scaleTargetRef: { name: consumer }
  minReplicaCount: 0 # scale to zero when no messages
  maxReplicaCount: 50
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka:9092
        topic: orders
        consumerGroup: order-processor
        lagThreshold: "100" # 1 replica per 100 messages lag
Networking
Kubernetes enforces a flat networking model with three rules:
- Every pod gets its own unique IP address.
- Pods can communicate with all other pods without NAT.
- Agents on a node (kubelet) can communicate with all pods on that node.
This is implemented by a CNI (Container Network Interface) plugin. Popular choices: Flannel (simple overlay, VXLAN), Calico (BGP routing, powerful NetworkPolicy), Cilium (eBPF-based, no iptables, high performance, built-in observability), Weave Net.
Cross-node communication flow (Calico BGP example):
- Pod A on Node 1 sends a packet to Pod B on Node 2 (using Pod B's IP).
- The packet leaves the pod's network namespace via a veth pair into the node's network namespace.
- The node's routing table (maintained by Calico via BGP) routes the packet to Node 2's IP.
- Node 2's routing table delivers it to Pod B's veth interface.
With Flannel/VXLAN, packets are encapsulated in UDP at the node level (overlay). With Calico BGP or Cilium eBPF, packets are routed natively without encapsulation overhead.
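The routing step in the native-routing case boils down to a longest-prefix match from pod IP to the node that owns that pod CIDR. The sketch below models that lookup; the CIDRs, node IPs, and the `next_hop` helper are hypothetical examples, not values from any real cluster.

```python
import ipaddress

# Hypothetical routing table: pod CIDR -> IP of the node that owns it
ROUTES = {
    "10.244.1.0/24": "192.168.0.11",  # Node 1
    "10.244.2.0/24": "192.168.0.12",  # Node 2
}

def next_hop(pod_ip, routes):
    """Return the node IP for the most specific route containing pod_ip, or None."""
    addr = ipaddress.ip_address(pod_ip)
    best = None
    for cidr, node_ip in routes.items():
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, node_ip)
    return best[1] if best else None

print(next_hop("10.244.2.7", ROUTES))  # a pod on Node 2
```

In an overlay such as VXLAN the same lookup still happens, but the result is used as the outer destination of an encapsulated packet rather than as a plain next hop.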
By default, Kubernetes allows all pod-to-pod communication. NetworkPolicy is a firewall rule for pods. A NetworkPolicy selects pods and defines allowed ingress/egress traffic.
Zero-trust approach — start with a default-deny policy in every namespace, then explicitly allow required traffic:
# Step 1: Default deny all ingress and egress in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {} # selects ALL pods in namespace
  policyTypes: [Ingress, Egress]
---
# Step 2: Allow specific traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-to-db
spec:
  podSelector:
    matchLabels: { app: postgres }
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels: { app: api }
      ports:
        - protocol: TCP
          port: 5432
Important: NetworkPolicies are only enforced if the CNI plugin supports them. Flannel does not. Calico, Cilium, and Weave do. Cilium extends NetworkPolicies with L7 awareness (HTTP method, path-based filtering).
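How default-deny and allow rules combine can be modelled in a few lines. This is a simplified sketch of ingress evaluation only: it ignores namespaces, egress, and ipBlock rules, and the policy dictionaries and helper names are hypothetical.

```python
def matches(selector, labels):
    """An empty selector matches every pod, like podSelector: {}."""
    return all(labels.get(key) == value for key, value in selector.items())

def ingress_allowed(src_labels, dst_labels, port, policies):
    selecting = [p for p in policies if matches(p["pod_selector"], dst_labels)]
    if not selecting:
        return True  # no policy selects the pod: default allow
    # Policies are additive: traffic passes if any rule of any selecting policy allows it.
    for policy in selecting:
        for rule in policy.get("ingress", []):
            if matches(rule["from"], src_labels) and port in rule["ports"]:
                return True
    return False

policies = [
    {"pod_selector": {}, "ingress": []},  # default deny for every pod
    {"pod_selector": {"app": "postgres"},
     "ingress": [{"from": {"app": "api"}, "ports": [5432]}]},
]

print(ingress_allowed({"app": "api"}, {"app": "postgres"}, 5432, policies))
print(ingress_allowed({"app": "web"}, {"app": "postgres"}, 5432, policies))
```

There is no "deny" rule in this model, which matches the API: isolation comes from a pod being selected, and access comes only from explicit allows.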
A service mesh intercepts all network traffic between services using sidecar proxies (or eBPF), providing: mutual TLS (mTLS), observability (L7 metrics, traces), traffic management (retries, circuit breaking, traffic splitting), and fine-grained authorization — all without application code changes.
Istio: The most feature-rich. Uses Envoy as the data plane. Extremely capable (traffic shifting, fault injection, JWT validation, WebAssembly extensions). Complex to operate. Control plane (Istiod) has historically been resource-heavy.
Linkerd: Simpler and more opinionated. Uses its own Rust-based micro-proxy (ultra-light). Easier to install and operate. Excellent for mTLS and observability out of the box. Less feature-rich than Istio for advanced traffic management.
When is a service mesh worth it?
- You need automatic mTLS between all services (compliance, zero-trust).
- You want consistent L7 observability (latency histograms, per-route error rates) without instrumenting every service.
- You need advanced traffic management (canary deployments, traffic mirroring, circuit breaking) without application changes.
- When NOT to use it: Small clusters (<10 services), teams without the operational capacity to learn and manage a mesh, or when simpler solutions (Kubernetes NetworkPolicy + application-level retries) suffice. A service mesh adds non-trivial operational complexity.
CoreDNS is the DNS server deployed as a Deployment in the kube-system namespace. Every pod's /etc/resolv.conf is configured to use the CoreDNS ClusterIP as its nameserver.
DNS record format for Services:
# Full form
my-service.my-namespace.svc.cluster.local
# Within the same namespace (search domains allow short form)
my-service
# Pod DNS (rarely used directly)
pod-ip-dashes.namespace.pod.cluster.local
# e.g., 10-244-1-5.default.pod.cluster.local
StatefulSet pods get stable DNS: pod-name.service-name.namespace.svc.cluster.local (e.g., postgres-0.postgres-headless.default.svc.cluster.local). This is how StatefulSet replicas discover each other (required for consensus protocols like etcd, Zookeeper, Kafka).
Common DNS debugging:
# Run a debug pod and test DNS resolution
kubectl run dns-test --image=busybox:1.36 --rm -it -- nslookup kubernetes.default
kubectl run dns-test --image=busybox:1.36 --rm -it -- nslookup my-service.my-namespace
# Check CoreDNS logs for resolution failures
kubectl logs -n kube-system -l k8s-app=kube-dns
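# Inspect the pod's resolver settings, including ndots
kubectl exec <pod> -- cat /etc/resolv.conf
# With the default ndots:5, names with fewer than five dots are tried against every search domain first, causing extra lookups for external FQDNs; tune it via the pod's dnsConfig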
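The cost of a high ndots value is easier to see by listing the names a resolver would try. The sketch below models the usual stub-resolver behaviour (search domains first when the name has fewer dots than ndots); the search list shown is the typical one for a pod in the default namespace and is an assumption, as is the `candidate_names` helper.

```python
def candidate_names(name, search, ndots=5):
    """Return the fully qualified names tried, in order, for a lookup of `name`."""
    if name.endswith("."):
        return [name]  # already absolute: the search list is skipped
    searched = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        return [absolute] + searched
    return searched + [absolute]

search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for candidate in candidate_names("api.example.com", search):
    print(candidate)
```

An external name like this one costs three failed cluster-internal lookups before the real query is sent; a trailing dot in the application's config, or a lower ndots, avoids them.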
eBPF (extended Berkeley Packet Filter) allows running sandboxed programs in the Linux kernel without modifying kernel source code. eBPF programs are attached to kernel hooks (network events, syscalls, tracepoints) and execute with near-native performance.
Traditional kube-proxy uses iptables rules for Service routing. At scale (thousands of Services and Endpoints), iptables is a linear rule chain — O(n) lookups that become a bottleneck. Every rule change also requires a full table lock.
Cilium with eBPF replaces iptables with BPF maps (hash tables in kernel memory) — O(1) lookups regardless of cluster size. Benefits:
- Significantly lower latency at scale (10,000+ services).
- L7-aware networking (HTTP, gRPC, Kafka protocol awareness) without a sidecar proxy.
- Hubble (built-in observability) — real-time network flow visibility with zero application instrumentation.
- ClusterMesh — cross-cluster service discovery and load balancing.
- Transparent encryption (WireGuard or IPSec) without sidecars.
Cilium now underpins several managed Kubernetes networking offerings (GKE Dataplane V2 is built on it, and AKS offers Azure CNI Powered by Cilium) and can be installed on EKS alongside or instead of the VPC CNI. For high-scale or security-sensitive clusters, Cilium has become a common choice.
cert-manager is a Kubernetes controller that automates certificate lifecycle — issuance, renewal, and storage as Secrets. It supports Let's Encrypt (ACME), Vault PKI, Venafi, and self-signed CAs.
# 1. Define a ClusterIssuer for Let's Encrypt
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef: { name: letsencrypt-prod }
    solvers:
      - http01:
          ingress:
            class: nginx # or dns01 solver for wildcard certs
---
# 2. Request a certificate (or annotate your Ingress)
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-tls
spec:
  secretName: api-tls-secret
  issuerRef: { name: letsencrypt-prod, kind: ClusterIssuer }
  dnsNames: [api.example.com]
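By default, cert-manager renews a certificate once two-thirds of its lifetime has elapsed, which for a 90-day Let's Encrypt certificate is 30 days before expiry (tunable via renewBefore). For internal services, use a private CA (cert-manager can manage it) to issue mTLS certificates. Service meshes such as Istio and Linkerd run their own CAs for automatic mTLS, and cert-manager can be integrated to issue or rotate the CA certificates they rely on.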
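As a rough illustration of the renewal timing, the sketch below assumes the default behavior of renewing when one third of the lifetime remains unless an explicit renew-before window is given. The function and parameter names are illustrative, not cert-manager's API.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before, not_after, renew_before=None):
    """Return the moment renewal would be attempted (simplified model)."""
    lifetime = not_after - not_before
    if renew_before is None:
        renew_before = lifetime / 3   # assumed default: last third of the lifetime
    return not_after - renew_before


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)   # typical Let's Encrypt lifetime
print("renew at:", renewal_time(issued, expires))
print("custom window:", renewal_time(issued, expires, timedelta(days=15)))
```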
ExternalDNS automatically synchronizes Kubernetes Services and Ingresses with external DNS providers (Route53, Cloud DNS, Cloudflare, Azure DNS). It watches for Services of type LoadBalancer and Ingress objects, and creates/updates DNS records accordingly.
# Ingress annotation triggers ExternalDNS
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: api.example.com
    external-dns.alpha.kubernetes.io/ttl: "60"
spec:
  rules:
    - host: api.example.com
      ...
ExternalDNS should run with minimal IAM permissions: on Route53, ChangeResourceRecordSets scoped to the specific hosted zone, plus the read-only list permissions it needs to discover zones and records. Combined with cert-manager, you get fully automated DNS + TLS for every new service, which is foundational for self-service developer platforms.
By default, pods can reach any external IP. Controlling egress is important for security, compliance, and cost (unexpected data transfer charges).
Options:
- NetworkPolicy egress rules: restrict which external CIDRs pods can reach. Standard NetworkPolicy is IP-based only (no hostname matching); some CNIs, such as Cilium, add DNS-aware policies through their own CRDs.
- Egress gateway (Istio): all outbound traffic routes through a dedicated egress gateway pod. You get observability, mTLS to external services, and a single egress IP for firewall rules.
- Cilium Egress Gateway: eBPF-based egress SNAT to a stable IP per namespace or pod group. Lower overhead than a sidecar-based solution.
- Node-level egress via NAT gateway (AWS/GCP): all pod traffic exits via the node's NAT gateway IP. Simple but coarse-grained; doesn't give pod-level control.
- HTTP proxy / forward proxy: set the HTTP_PROXY (and HTTPS_PROXY) env vars so that HTTP/HTTPS traffic routes through a proxy that enforces an allowlist of domains. Tools: Squid, mitmproxy, or cloud-native solutions.
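The domain allowlist a forward proxy enforces can be sketched as follows. This is a minimal illustration of the matching logic only; the function name and the wildcard convention are assumptions rather than the syntax of Squid or any other proxy.

```python
def host_allowed(host, allowlist):
    """Return True if host matches an exact entry or a '*.domain' wildcard."""
    host = host.lower().rstrip(".")
    for pattern in allowlist:
        pattern = pattern.lower()
        if pattern.startswith("*."):
            # '*.example.com' matches subdomains only, not 'example.com' itself
            if host.endswith(pattern[1:]):
                return True
        elif host == pattern:
            return True
    return False


# Hypothetical allowlist for a payments workload
allowlist = ["api.stripe.com", "*.amazonaws.com"]
print(host_allowed("s3.eu-west-1.amazonaws.com", allowlist))
print(host_allowed("evil.example.net", allowlist))
```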
Storage & State
6 questions
- PersistentVolume (PV): a piece of storage provisioned by an admin or dynamically. It exists independently of any pod and has a lifecycle that outlasts pods.
- PersistentVolumeClaim (PVC): a request for storage by a user. Specifies size, access mode, and StorageClass. Kubernetes binds a PVC to a matching PV.
- StorageClass: defines the provisioner, parameters, and reclaim policy. Enables dynamic provisioning: when a PVC references a StorageClass, the CSI driver automatically creates the underlying volume (EBS, GCP PD, Azure Disk, Ceph RBD).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  storageClassName: gp3-encrypted # references a StorageClass
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 50Gi
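Access modes: ReadWriteOnce (mounted read-write by a single node; several pods on that node can share it), ReadOnlyMany (multiple readers, e.g., NFS), ReadWriteMany (multiple writers, which requires shared storage like EFS, NFS, or CephFS), and ReadWriteOncePod (a single pod, CSI volumes only). Most block storage (EBS, Azure Disk) supports only RWO for filesystem volumes; GCP PD can additionally be attached ReadOnlyMany.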
Reclaim policy: Retain (PV kept after PVC deletion, must be manually reclaimed — safe for production data), Delete (underlying storage deleted — default for dynamic provisioning), Recycle (deprecated).
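The way a claim is matched against pre-provisioned volumes can be sketched like this. It is a simplified model (class, capacity, and access modes only) with made-up volume names, not the actual binder in the control plane.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    capacity_gi: int
    access_modes: frozenset
    storage_class: str


def pick_volume(volumes, request_gi, access_modes, storage_class):
    """Pick the smallest volume that satisfies the claim, or None."""
    candidates = [
        v for v in volumes
        if v.storage_class == storage_class
        and v.capacity_gi >= request_gi
        and set(access_modes) <= v.access_modes
    ]
    return min(candidates, key=lambda v: v.capacity_gi, default=None)


volumes = [
    Volume("pv-small", 20, frozenset({"ReadWriteOnce"}), "gp3-encrypted"),
    Volume("pv-large", 100, frozenset({"ReadWriteOnce"}), "gp3-encrypted"),
    Volume("pv-fit", 60, frozenset({"ReadWriteOnce"}), "gp3-encrypted"),
]
match = pick_volume(volumes, 50, ["ReadWriteOnce"], "gp3-encrypted")
print(match.name if match else "no match: dynamic provisioning would be needed")
```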
In-tree volume plugins were storage drivers compiled directly into the Kubernetes binary. This meant storage vendors had to contribute code to the Kubernetes repository, wait for releases, and users had to upgrade Kubernetes to get storage fixes.
CSI is a standardized gRPC interface that allows storage vendors to develop and ship drivers independently. A CSI driver runs as pods in the cluster and implements three services:
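- Controller Plugin: manages volume lifecycle (CreateVolume, DeleteVolume, ControllerPublishVolume for attach). Typically runs as a Deployment or StatefulSet with a single active instance per cluster.
- Node Plugin: mounts/unmounts volumes on nodes (NodeStageVolume, NodePublishVolume). Runs as a DaemonSet.
- Identity service: reports the driver's name and capabilities. Implemented by both the controller and node plugins.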
CSI drivers can be updated independently of Kubernetes. All major cloud providers (EBS CSI, GCP PD CSI, Azure Disk CSI) and storage vendors (Portworx, Longhorn, OpenEBS, Rook/Ceph) ship CSI drivers. VolumeSnapshots (backup and restore via CSI snapshot API) are also CSI-specific.
Running databases in Kubernetes is viable using StatefulSets with appropriate operators (CloudNativePG for PostgreSQL, MySQL Operator, Strimzi for Kafka, Redis Operator). Operators encode operational knowledge: cluster formation, failover, backup, point-in-time recovery.
Trade-offs vs managed services (RDS, Cloud SQL, ElastiCache):
- Control: In-cluster gives you full control — tuning, extensions, custom configurations. Managed services are opinionated and may not support all Postgres extensions (e.g., pgvector availability varies).
- Operational burden: Managed services handle patching, backups, failover, monitoring. In-cluster databases require your team to own this. Running a production database is significantly harder than deploying a stateless service.
- Cost: In-cluster can be cheaper (shared node resources). Managed services have premium pricing for HA configurations.
- Portability: In-cluster databases are cloud-agnostic — useful for multi-cloud or on-premise deployments.
- Performance: In-cluster databases on nodes with local NVMe SSD can outperform managed services at high throughput.
Recommendation: For most teams, start with managed databases. Move in-cluster only when you have dedicated team capacity to operate them, and use a mature operator with a strong community.
Ephemeral containers are temporary containers you can inject into a running pod for debugging, without modifying the pod spec or restarting it. They share the pod's network and IPC namespaces, and with --target they also join the target container's process namespace, which exposes its filesystem through /proc/<pid>/root.
They solve the distroless / minimal image problem — production images have no shell or debugging tools, but ephemeral containers let you attach a debugging image into the running pod's context.
# Attach a debug container to a running pod
# netshoot contains dig, curl, tcpdump, ss, strace
# --target joins the named container's process namespace
# (--share-processes applies only together with --copy-to, so it is omitted here)
kubectl debug -it <pod-name> \
  --image=nicolaka/netshoot \
  --target=<container-name>
# Inside the debug container you can:
# - curl internal services using the pod's network
# - run strace on the application process
# - inspect /proc/<pid>/fd for open file descriptors
# - run tcpdump on the pod's network interface
# Debug a node
kubectl debug node/<node-name> -it --image=ubuntu
Ephemeral containers cannot be removed once added; they stay in the pod spec after they exit and disappear only when the pod is deleted. They cannot define ports, probes, or resource requests and limits.
VolumeSnapshots are a Kubernetes API for point-in-time copies of PersistentVolumes, implemented by CSI drivers that support snapshotting (EBS, GCP PD, Azure Disk, Ceph).
# Create a snapshot
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-snap-2024-01
spec:
  volumeSnapshotClassName: csi-aws-vsc
  source:
    persistentVolumeClaimName: postgres-data
---
# Restore: create a PVC from the snapshot
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-restored
spec:
  storageClassName: gp3-encrypted
  dataSource:
    name: postgres-snap-2024-01
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 50Gi
Full backup solution (Velero): Velero backs up Kubernetes objects by querying the API server (it does not copy etcd directly) and can snapshot the PVs they reference. It supports schedule-based backups, cross-cluster migration, and disaster recovery. Velero is a widely used choice for full cluster backup, because volume snapshots alone don't capture the Kubernetes object state.
- emptyDir: temporary storage created when a pod is scheduled, deleted when the pod is removed. Survives container restarts within the pod. Use for: caches, scratch space, sharing files between containers in the same pod (e.g., sidecar reading logs written by the main container). Can use medium: Memory to back it with tmpfs (RAM disk) for low-latency scratch space.
- hostPath: mounts a directory from the host node. Persistent across pod restarts but tied to a specific node. Dangerous in multi-node clusters (pod can be rescheduled to a different node). Use only for DaemonSet infrastructure components that need host access (log agents reading /var/log, container runtime socket under /run/containerd). Never use hostPath for application data.
- configMap / secret: mounts ConfigMap or Secret data as files. The mount is always read-only; defaultMode only sets the file permission bits.
- projected: combines multiple volume types (configMap, secret, serviceAccountToken, downwardAPI) into a single mount.
- CSI volumes: for persistent storage backed by a CSI driver. The production-grade choice for any data that must outlast pod lifecycle.
Scheduling & Resources
7 questions
Requests: guaranteed resources. Used by the scheduler to select a node with sufficient available capacity. Defines the pod's resource "reservation".
Limits: the maximum the container can use. Enforced at runtime by cgroups.
CPU limit exceeded: The container is throttled (CPU time is restricted by the CFS scheduler). The process continues running but at reduced speed. CPU throttling is a common source of high latency and is often misinterpreted as a memory problem.
Memory limit exceeded: The container is killed by the kernel OOM (Out of Memory) killer and reports an OOMKilled status. Under the pod's restart policy (Always for Deployments), it is restarted. Repeated OOM kills lead to CrashLoopBackOff.
resources:
  requests:
    cpu: "250m" # 0.25 cores reserved for scheduling
    memory: "256Mi" # 256 MiB reserved
  limits:
    cpu: "1000m" # throttled if exceeds 1 core (burstable)
    memory: "512Mi" # OOM killed if exceeds 512 MiB
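To make the enforcement concrete, the sketch below shows how these CPU values are commonly translated into cgroup v1 settings: a CFS quota per 100 ms period for the limit and relative shares for the request. The helper names are illustrative, and cgroup v2 uses different files (cpu.max, cpu.weight).

```python
def cfs_quota_us(limit_millicores, period_us=100_000):
    """CPU time (microseconds) the container may use per CFS period."""
    return limit_millicores * period_us // 1000


def cpu_shares(request_millicores):
    """Relative weight under contention; 1 core maps to 1024 shares."""
    return max(2, request_millicores * 1024 // 1000)


print("quota for a 1000m limit:", cfs_quota_us(1000), "us per 100000 us period")
print("shares for a 250m request:", cpu_shares(250))
```

Once the quota is used up within a period, the container's threads wait for the next period, which is the throttling described above.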
QoS classes:
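- Guaranteed: every container sets CPU and memory requests equal to its limits. Highest priority; last to be evicted.
- Burstable: at least one container sets a request or limit, but the pod does not meet the Guaranteed criteria (for example, requests < limits).
- BestEffort: no requests or limits on any container. First to be evicted under memory pressure.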
Setting CPU limits is controversial — many performance engineers recommend setting CPU requests but not limits to avoid unnecessary throttling (use VPA to tune requests instead).
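The QoS classification rules can be expressed as a short function. This is a sketch of the rules as summarized above, not the kubelet's code, and it compares quantities as plain strings, so "1" and "1000m" would be treated as different.

```python
def qos_class(containers):
    """containers: list of dicts with optional 'requests' and 'limits' maps."""
    any_set = False
    guaranteed = True
    for c in containers:
        limits = c.get("limits", {})
        # A missing request defaults to the limit for that resource
        requests = {**limits, **c.get("requests", {})}
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            if resource not in limits or requests.get(resource) != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


burstable = [{"requests": {"cpu": "250m", "memory": "256Mi"},
              "limits": {"cpu": "1000m", "memory": "512Mi"}}]
print(qos_class(burstable))
```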
Taints & Tolerations — Taints on nodes repel pods. Only pods with a matching toleration can be scheduled on tainted nodes. Use to dedicate nodes to specific workloads (GPU nodes, spot instances), or to prevent general workloads from landing on system nodes.
# Taint a GPU node
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
# Pod tolerates the taint and CAN run on GPU nodes
tolerations:
  - key: gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
Node Affinity — scheduling preferences/requirements based on node labels. More expressive than nodeSelector. requiredDuringSchedulingIgnoredDuringExecution is a hard requirement; preferredDuringScheduling... is a soft preference.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values: [us-east-1a, us-east-1b]
Pod Affinity / Anti-Affinity — schedule pods relative to other pods. Use anti-affinity to spread replicas across nodes or zones (resilience). Use affinity to co-locate tightly coupled services (reduced latency).
# Spread replicas across nodes: prefer not to place on a node that already has this app
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels: { app: api }
          topologyKey: kubernetes.io/hostname
Pod anti-affinity is coarse: a rule only says whether a pod may share a topology domain with matching pods, so it cannot express an even spread. Topology Spread Constraints declaratively specify how pods should be distributed across topology domains (zones, nodes, racks).
topologySpreadConstraints:
  - maxSkew: 1 # max difference in pod count between domains
    topologyKey: topology.kubernetes.io/zone # spread across AZs
    whenUnsatisfiable: DoNotSchedule # or ScheduleAnyway
    labelSelector:
      matchLabels: { app: api }
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname # also spread across nodes
    whenUnsatisfiable: ScheduleAnyway # soft constraint
    labelSelector:
      matchLabels: { app: api }
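With 9 replicas across 3 AZs and maxSkew: 1 enforced via DoNotSchedule, the scheduler places 3 pods in each AZ (assuming every AZ has capacity), rather than letting all 9 land in one AZ because node-level anti-affinity happened to be satisfied. The constraints are evaluated only at scheduling time, so later scale-downs or node failures can leave the spread skewed until pods are rescheduled. Topology Spread Constraints are the preferred mechanism for HA workloads in multi-AZ clusters.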
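The maxSkew check can be sketched as follows. This is a simplified model of the hard (DoNotSchedule) constraint for a single topology key; the function names are illustrative, and the real scheduler also considers node eligibility and resources.

```python
def can_place(counts, zone, max_skew=1):
    """counts: matching pods per domain, with every eligible domain present."""
    # Skew = pods in the target domain after placement minus the global minimum
    return counts[zone] + 1 - min(counts.values()) <= max_skew


def spread(replicas, zones, max_skew=1):
    """Greedily place replicas in the first zone the constraint allows."""
    counts = {z: 0 for z in zones}
    for _ in range(replicas):
        zone = next(z for z in zones if can_place(counts, z, max_skew))
        counts[zone] += 1
    return counts


print(spread(9, ["us-east-1a", "us-east-1b", "us-east-1c"]))
print(can_place({"a": 3, "b": 2, "c": 2}, "a"))  # a 4th pod in "a" would give skew 2
```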
VPA automatically adjusts container resource requests (and optionally limits) based on historical usage. It analyzes metrics and recommends right-sized requests, addressing one of the hardest problems in Kubernetes: setting accurate resource requests for applications you haven't profiled.
VPA modes:
- Off — only provides recommendations (no changes). Use to audit current resources.
- Initial — only sets resources at pod creation. Does not modify running pods.
- Auto / Recreate — evicts and recreates pods to apply updated recommendations. Causes brief disruptions.
VPA + HPA conflict: Do not use VPA in Auto mode targeting CPU alongside HPA targeting CPU, because the two controllers fight over the same signal. Safe combinations: VPA for memory + HPA for CPU, or VPA in Off mode (recommendations only) + manual tuning + HPA.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa # illustrative name; metadata.name is required
spec:
  targetRef: { apiVersion: apps/v1, kind: Deployment, name: api }
  updatePolicy:
    updateMode: "Off" # just give recommendations first
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed: { cpu: 100m, memory: 128Mi }
        maxAllowed: { cpu: 4, memory: 4Gi }
The Cluster Autoscaler (CA) automatically adjusts the number of nodes in a node group/ASG when pods cannot be scheduled (scale up) or nodes are underutilized (scale down).
Scale up: CA detects unschedulable pods (Pending due to insufficient resources). It simulates which node group addition would allow the pods to schedule, then triggers a node group scale-out (adds nodes). Pods are scheduled once the new node joins.
Scale down: CA identifies nodes whose utilization (the sum of pod requests divided by allocatable capacity) is below a threshold (50% by default) and whose pods could all be moved to other nodes. It evicts those pods (respecting PodDisruptionBudgets) and removes the node. A node must stay unneeded for a stabilization delay (10 minutes by default) before removal, to avoid thrashing.
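The scale-down utilization check can be sketched like this. It models only the threshold comparison, using requests rather than live usage; the dictionary keys and function names are assumptions for illustration.

```python
def node_utilization(requested, allocatable):
    """Utilization = the larger of the CPU and memory request ratios."""
    return max(requested[r] / allocatable[r] for r in ("cpu_m", "memory_mi"))


def is_scale_down_candidate(requested, allocatable, threshold=0.5):
    return node_utilization(requested, allocatable) < threshold


allocatable = {"cpu_m": 4000, "memory_mi": 16384}
quiet_node = {"cpu_m": 1000, "memory_mi": 6144}   # 25% CPU, 37.5% memory requested
busy_node = {"cpu_m": 3000, "memory_mi": 2048}    # 75% CPU requested

print(is_scale_down_candidate(quiet_node, allocatable))
print(is_scale_down_candidate(busy_node, allocatable))
```

A node passing this check is only a candidate; the autoscaler must also confirm that its pods can be rescheduled elsewhere.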
Important interactions:
- Pods with no resource requests don't help CA estimate node requirements — always set requests.
- Pods pinned by affinity, hostPath, or local PVs block scale-down of their node.
- PodDisruptionBudgets prevent CA from evicting too many pods at once — must be set for stateful services.
- Karpenter (AWS) is a more flexible alternative to CA — it provisions nodes based on the exact pod requirements (right-sized instances) rather than predefined node groups, and consolidates nodes aggressively.
A PodDisruptionBudget (PDB) limits the number of pods from a deployment that can be disrupted simultaneously during voluntary disruptions (node drains, Cluster Autoscaler scale-down, rolling node upgrades). It does not protect against involuntary disruptions (node failures, OOM kills).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2 # at least 2 pods must always be running
  # OR: maxUnavailable: 1 # at most 1 pod can be down
  selector:
    matchLabels: { app: api }
When a node is drained (kubectl drain), Kubernetes checks PDBs before evicting each pod. If evicting a pod would violate the PDB, the eviction is blocked and the drain waits.
Critical for: Kubernetes version upgrades (nodes are drained one by one), cloud provider maintenance windows, Cluster Autoscaler scale-downs, and Karpenter node consolidation.
Without a PDB, a node drain can evict all replicas of a service simultaneously — causing complete downtime even with multiple replicas configured.
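The eviction decision reduces to simple arithmetic, sketched below for integer budgets. The function name is illustrative, and percentage values, which a PDB also accepts, are left out.

```python
def disruptions_allowed(healthy, desired, min_available=None, max_unavailable=None):
    """How many more pods may be voluntarily evicted right now."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available / max_unavailable")
    if min_available is not None:
        required = min_available
    else:
        required = desired - max_unavailable
    return max(0, healthy - required)


# 3 replicas, minAvailable: 2 -> one eviction at a time
print(disruptions_allowed(healthy=3, desired=3, min_available=2))
# One pod is already down -> the drain has to wait
print(disruptions_allowed(healthy=2, desired=3, min_available=2))
```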
PriorityClass assigns a numeric priority to pods. When cluster resources are constrained, the scheduler can preempt (evict) lower-priority pods to make room for higher-priority ones.
# Critical infrastructure: highest priority
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-system
value: 1000000
globalDefault: false
description: "Critical infrastructure pods"
---
# Normal application workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: app-default
value: 100
globalDefault: true # assigned to pods without explicit class
---
# Batch jobs: preemptable
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-low
value: 10
preemptionPolicy: Never # won't preempt others, but can be preempted
Use cases: Ensure monitoring agents (Prometheus, Datadog) are never preempted in favor of application pods. Allow batch and ML training jobs to use spare capacity but yield resources to production workloads immediately. System-critical pods (CoreDNS, kube-proxy) use the built-in system-cluster-critical and system-node-critical classes.
Security
8 questions
RBAC grants permissions based on roles bound to subjects (users, groups, ServiceAccounts).
- Role — namespace-scoped. Grants permissions on resources within a specific namespace.
- ClusterRole — cluster-scoped. Can grant permissions on cluster-scoped resources (nodes, PVs, namespaces), or on namespaced resources across all namespaces.
- RoleBinding — binds a Role (or ClusterRole) to subjects within a namespace.
- ClusterRoleBinding — binds a ClusterRole to subjects cluster-wide.
# Give a ServiceAccount read-only access to pods in the 'production' namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: pod-reader
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: production
  name: pod-reader-binding
subjects:
  - kind: ServiceAccount
    name: monitoring-agent
    namespace: production
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
Least privilege principle: Give workloads dedicated ServiceAccounts with minimal permissions. Avoid using the default ServiceAccount (it may accumulate permissions over time). Never grant cluster-admin except to actual administrators. Audit with kubectl auth can-i --list --as=system:serviceaccount:production:monitoring-agent.
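How a set of RBAC rules answers a "can I" question can be sketched as follows. This is a simplified model: it ignores resourceNames, non-resource URLs, and aggregation, and the function names are illustrative.

```python
def rule_allows(rule, api_group, resource, verb):
    """A rule allows a request when group, resource, and verb all match."""
    def matches(allowed, value):
        return "*" in allowed or value in allowed
    return (matches(rule["apiGroups"], api_group)
            and matches(rule["resources"], resource)
            and matches(rule["verbs"], verb))


def can_i(rules, api_group, resource, verb):
    # RBAC is purely additive: any matching rule grants access, nothing denies
    return any(rule_allows(r, api_group, resource, verb) for r in rules)


pod_reader = [{"apiGroups": [""],
               "resources": ["pods", "pods/log"],
               "verbs": ["get", "list", "watch"]}]
print(can_i(pod_reader, "", "pods", "list"))
print(can_i(pod_reader, "", "secrets", "get"))
```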
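Pod Security Standards (PSS), enforced by the built-in Pod Security Admission controller, replace PodSecurityPolicy, which was deprecated in Kubernetes 1.21 and removed in 1.25. Three built-in levels: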
- Privileged — unrestricted. For trusted system workloads (CNI plugins, storage drivers).
- Baseline — prevents known privilege escalations. Allows most workloads. Blocks: hostNetwork, hostPID, privileged containers, dangerous capabilities.
- Restricted — heavily hardened. Requires running as non-root, dropping all capabilities, using seccomp profiles, disabling privilege escalation.
Enforce via Pod Security Admission using namespace labels:
# Label the namespace to enforce the restricted standard
kubectl label namespace production \
pod-security.kubernetes.io/enforce=restricted \
pod-security.kubernetes.io/enforce-version=v1.28 \
pod-security.kubernetes.io/warn=restricted \
pod-security.kubernetes.io/audit=restricted
# A pod violating the policy is rejected at admission
# The warn mode sends a warning but allows it; audit mode logs it
For more fine-grained policies, use OPA/Gatekeeper or Kyverno. Example policies: require non-root UIDs, restrict container registries to your private registry, require read-only root filesystems.
Traditionally, pods accessed cloud services (S3, RDS, etc.) using long-lived IAM credentials stored as Kubernetes Secrets. This is dangerous: credentials can be compromised, don't rotate automatically, and violate least privilege if shared.
IRSA (IAM Roles for Service Accounts) on EKS / Workload Identity on GKE — allows pods to assume a cloud IAM role based on their Kubernetes ServiceAccount identity, without any static credentials. Uses the Kubernetes OIDC provider to federate identities.
# 1. Annotate the ServiceAccount with the IAM role ARN (AWS IRSA)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader
  namespace: production
  annotations:
    # placeholder 12-digit account ID
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/S3ReaderRole
# 2. Pods using this ServiceAccount automatically receive short-lived
# AWS credentials via the projected token volume, so no Secrets are needed.
# The AWS SDK automatically picks up the token:
# AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
Benefits: Credentials are short-lived (the projected ServiceAccount token and the STS session credentials both expire and are refreshed automatically), scoped to the pod's ServiceAccount, auditable in CloudTrail per service, and never stored as Kubernetes Secrets. This is the standard approach for cloud-native credential management.
Kubernetes Secrets are base64-encoded but not encrypted by default in etcd (enable etcd encryption at rest as a baseline). The main challenge is getting secrets into clusters without storing plaintext in Git.
- Sealed Secrets: asymmetrically encrypts secrets into SealedSecret CRDs that are safe to commit to Git. The Sealed Secrets controller decrypts them in-cluster. Simple and GitOps-friendly. Limitation: secrets live only in Kubernetes; no centralized secret management across multiple clusters/environments.
- External Secrets Operator (ESO): syncs secrets from external stores (AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault, Azure Key Vault) into Kubernetes Secrets on a schedule. The source of truth is the external secret manager. Best for organizations already using a cloud secrets manager.
- HashiCorp Vault + Vault Agent Injector: the most feature-rich option. Vault issues dynamic, short-lived credentials (database passwords that auto-rotate, PKI certs). The Vault Agent sidecar (via mutating webhook) injects secrets as files into pods. Most complex to operate.
Recommendation: ESO with AWS Secrets Manager or GCP Secret Manager is the pragmatic choice for cloud-native teams. Use Vault when you need dynamic secret generation, fine-grained audit trails, or a multi-cloud/on-premise hybrid environment.
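The point that base64 is an encoding rather than encryption, which motivates all of the options above, is easy to demonstrate. The password value is a made-up example.

```python
import base64

# What ends up in the 'data' field of a Secret manifest
encoded = base64.b64encode(b"s3cr3t-password").decode()
print("stored value:", encoded)

# Anyone who can read the manifest or the etcd entry can reverse it without a key
decoded = base64.b64decode(encoded).decode()
print("recovered:", decoded)
```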
Supply chain security ensures that the code you build and the images you run haven't been tampered with. After the SolarWinds and Log4Shell incidents, this became a board-level concern.
SLSA (Supply-chain Levels for Software Artifacts) is a framework of security levels. Higher SLSA levels require more rigorous provenance attestation (who built it, how, from what source).
Cosign (Sigstore project) signs container images and stores signatures in the OCI registry. Combined with Fulcio (keyless OIDC-based signing) and Rekor (transparency log), you get cryptographic proof that an image was built by your CI pipeline from your source code.
# Sign an image in CI after push (keyless signing with OIDC)
cosign sign --yes ghcr.io/myorg/myapp:v1.2.3
# Verify an image before deployment
cosign verify ghcr.io/myorg/myapp:v1.2.3 \
--certificate-identity="https://github.com/myorg/myapp/.github/workflows/build.yml@refs/heads/main" \
--certificate-oidc-issuer="https://token.actions.githubusercontent.com"
Policy enforcement: Use Kyverno or OPA/Gatekeeper with a mutating/validating webhook to require image signature verification at admission — if an image isn't signed by your CI, it's rejected. This ensures no unsigned or externally pulled images run in production.
Falco is a CNCF runtime security tool that detects anomalous behavior in containers and Kubernetes by monitoring Linux system calls (via eBPF or kernel module). It provides a threat detection layer that operates at runtime — after an attacker is already inside your container.
How it works: Falco attaches to the kernel and inspects every syscall. It evaluates them against a rule engine. Rules match on syscall type, process name, file paths, network connections, and container context.
Built-in rule examples:
- Shell spawned inside a container (execve of bash/sh by a non-shell process).
- Write to sensitive file paths (/etc/passwd, /etc/shadow).
- Outbound connection to an unexpected IP.
- Container privilege escalation attempt.
- Crypto mining process signatures.
# Custom Falco rule (allowed_payment_gateways is a list you define alongside the rule)
- rule: Unexpected Outbound Connection
  desc: Detect unexpected egress from the payments service
  condition: >
    outbound and container.image.repository = "myorg/payments"
    and not fd.sip in (allowed_payment_gateways)
  output: >
    Unexpected outbound connection from payments service
    (command=%proc.cmdline connection=%fd.name)
  priority: CRITICAL
Falco outputs alerts to stdout, syslog, HTTP webhooks, or Falcosidekick (which forwards to Slack, PagerDuty, Elasticsearch). Pair with Kubernetes audit logs for a comprehensive security posture.
The CIS Kubernetes Benchmark is the authoritative hardening guide. Key controls:
- API server: Disable anonymous authentication (--anonymous-auth=false). Enable audit logging. Use OIDC for human authentication (not static tokens). Restrict access to the API server endpoint (private endpoint + allowlisted CIDR).
- etcd: Enable encryption at rest for Secrets. Enable TLS for peer and client communication. Restrict etcd access to only the API server.
- Node hardening: Use CIS-hardened OS images (Bottlerocket, Flatcar). Enable AppArmor/seccomp profiles on nodes. Restrict SSH access. Use IMDSv2 only (AWS) to prevent SSRF metadata access from pods.
- RBAC: Avoid issuing credentials in the system:masters group (membership bypasses RBAC checks entirely). Audit ClusterRoleBindings regularly. Prefer namespace-scoped Roles over ClusterRoles.
- Network: Enable NetworkPolicy default-deny. Restrict external access to the control plane. Use private node groups with NAT.
- Workloads: Enforce Pod Security Standards (Restricted). Require non-root containers, read-only root FS, seccomp profiles.
# Run kube-bench to audit against CIS benchmark
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl logs job/kube-bench
These are Linux kernel security mechanisms that restrict what system calls and file operations a container process can make.
seccomp (Secure Computing Mode): filters syscalls. A container needs only a subset of the 300+ Linux syscalls, and a seccomp profile blocks the rest, substantially reducing the kernel attack surface. The RuntimeDefault profile (provided by the container runtime, e.g. containerd) is safe for most workloads, and the Restricted Pod Security Standard requires RuntimeDefault or a Localhost profile to be set.
securityContext:
  seccompProfile:
    type: RuntimeDefault # use container runtime's default profile
  allowPrivilegeEscalation: false
  runAsNonRoot: true
  runAsUser: 1000
  readOnlyRootFilesystem: true
  capabilities:
    drop: [ALL] # drop all Linux capabilities
    add: [NET_BIND_SERVICE] # re-add only what's needed (bind to port <1024)
AppArmor — mandatory access control that restricts file system access, network access, and capabilities based on profiles loaded on the node. More granular than seccomp but host-dependent (profiles must be loaded on every node). Used extensively on Ubuntu nodes.
Together, seccomp + AppArmor + dropping capabilities provide defense-in-depth: even if an attacker gains code execution inside a container, they face additional kernel-level restrictions that make container escape, lateral movement, and privilege escalation harder.
CI/CD & GitOps
10 questions
GitOps is an operational model where Git is the single source of truth for declarative infrastructure and application state. Every change to the system is made by committing to Git. The running system continuously reconciles toward the Git state.
Push-based CI/CD: CI pipeline builds the image, then pushes changes to the cluster (kubectl apply, helm upgrade) from the pipeline. The pipeline needs cluster credentials. State drift (someone ran kubectl directly) is not detected or corrected.
Pull-based GitOps: An agent running inside the cluster (Argo CD, Flux) watches the Git repository. It detects drift between Git state and live cluster state and reconciles. No external entity needs cluster credentials — the cluster pulls its own config.
GitOps benefits:
- Audit trail — every cluster change is a Git commit with author, timestamp, and PR review.
- Disaster recovery — rebuild any cluster to a known state by pointing it at the Git repo.
- Automatic drift correction — manual kubectl changes are detected and reverted.
- Reduced blast radius — cluster credentials don't leave the cluster perimeter.
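The pull-based reconcile loop can be sketched with plain dictionaries standing in for the Git state and the live cluster state. This is a toy model of drift detection and correction, not how Argo CD or Flux are implemented; the field names are made up.

```python
def diff(desired, live):
    """Return the fields where the live state drifted from Git."""
    drift = {}
    for key in desired.keys() | live.keys():
        if desired.get(key) != live.get(key):
            drift[key] = {"desired": desired.get(key), "live": live.get(key)}
    return drift


def reconcile(desired, live):
    """One reconcile pass: Git wins, so the live state is reset to desired."""
    return dict(desired)


git_state = {"replicas": 3, "image": "api:v2"}
cluster_state = {"replicas": 5, "image": "api:v2"}   # someone ran kubectl scale

print("drift:", diff(git_state, cluster_state))
cluster_state = reconcile(git_state, cluster_state)
print("drift after reconcile:", diff(git_state, cluster_state))
```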
Both are CNCF GitOps tools, but with different philosophies.
Argo CD:
- Monolithic controller with a rich web UI, RBAC, and SSO integration.
- Application CRD as the central concept. UI-first — excellent for teams that want visibility and manual sync triggers.
- Multi-cluster management from a single Argo CD instance.
- ApplicationSets for templating across many clusters/environments.
- Argo Rollouts (separate project) for advanced deployment strategies (canary, blue-green).
Flux:
- Toolkit of composable controllers (source-controller, kustomize-controller, helm-controller, notification-controller). Each does one thing.
- CLI-first, GitOps-native. Bootstraps itself into a cluster from a Git repo.
- Better multi-tenancy model — each team can own their own Flux objects.
- Image automation (auto-update image tags in Git when new images are pushed).
Choose Argo CD when you want a polished UI and multi-cluster visibility. Choose Flux when you prefer a composable, CLI-first, pure GitOps experience or need advanced multi-tenancy. Both are production-ready — many organizations use both (Flux for infrastructure, Argo CD for applications).
Helm is the Kubernetes package manager. It templatizes Kubernetes manifests into reusable Charts, manages versioned releases, and handles upgrades and rollbacks.
# Install a chart (e.g., cert-manager from the community)
helm repo add jetstack https://charts.jetstack.io
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --set installCRDs=true # newer chart versions use crds.enabled=true instead
# Upgrade with values override (releases are namespaced, so pass --namespace again)
helm upgrade cert-manager jetstack/cert-manager \
  --namespace cert-manager --reuse-values \
  --set global.logLevel=2
# Rollback
helm rollback cert-manager 1 --namespace cert-manager # to revision 1
Helm limitations:
- Templating complexity: Helm templates use Go templating — they get unreadable quickly. Complex conditionals are a maintenance burden.
- Immutable release names: You can't rename a Helm release; migrating releases is painful.
- Testing: helm test is basic. Integration testing Helm charts is hard.
- Drift detection: Helm doesn't reconcile drift (someone applies a change outside Helm). Use with Argo CD/Flux for GitOps-style reconciliation.
- Alternatives: Kustomize (overlay-based, no templating, built into kubectl), cdk8s (TypeScript/Python), Timoni (CUE-based, strongly typed).
A canary deployment routes a small percentage of traffic to a new version while the rest goes to the stable version. You monitor the canary, then progressively increase traffic if metrics are healthy.
Options:
- Kubernetes-native (label-based): Run two Deployments (stable and canary) with the same app label. A Service selects both by label. Traffic split is proportional to replica counts — crude but simple (1 canary pod + 9 stable pods = 10% canary traffic).
- Ingress-based: NGINX Ingress supports canary via annotations (nginx.ingress.kubernetes.io/canary-weight: "10"). Canary Ingress routes 10% of traffic to the canary Service.
- Argo Rollouts: A Deployment replacement that supports sophisticated canary strategies with automated analysis (Prometheus, Datadog metrics). Integrates with traffic providers (Istio, AWS ALB, NGINX) for weight-based routing independent of replica counts.
- Istio VirtualService: Traffic split at the L7 level by percentage, header values, or user segments.
# Argo Rollouts canary: automated progressive delivery
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10 # 10% to canary
        - pause: {duration: 10m}
        - analysis: # automated metric analysis
            templates: [{templateName: error-rate-check}]
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100 # promote if all checks pass
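For the Kubernetes-native (label-based) option, the traffic share follows directly from replica counts. The sketch below assumes the Service spreads requests evenly across ready pods; the function names are illustrative.

```python
def canary_share(stable_replicas, canary_replicas):
    """Fraction of traffic reaching the canary under even load balancing."""
    total = stable_replicas + canary_replicas
    return canary_replicas / total if total else 0.0


def smallest_step(total_replicas):
    """The finest traffic granularity replica counts alone can express."""
    return 1 / total_replicas


print(canary_share(9, 1))    # 1 canary pod next to 9 stable pods
print(smallest_step(4))      # with 4 pods in total, steps are 25% wide
```

This coarse granularity is why weight-based routing through an ingress controller or service mesh is preferred when fine-grained percentages are needed.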
A well-structured CI pipeline for containers provides fast feedback and security gates before code reaches production.
# GitHub Actions example pipeline
jobs:
  lint-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: mvn test -pl :unit-tests
      - run: mvn checkstyle:check spotbugs:check
  build-image:
    needs: lint-and-test
    runs-on: ubuntu-latest
    steps:
      - uses: docker/build-push-action@v5
        with:
          push: false
          tags: myapp:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
  security-scan:
    needs: build-image
    runs-on: ubuntu-latest
    steps:
      # Each job runs on a fresh runner: rebuild from cache or load the image from an artifact before scanning
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: myapp:${{ github.sha }}
          exit-code: 1 # fail pipeline on HIGH/CRITICAL CVEs
          severity: HIGH,CRITICAL
  push-and-sign:
    needs: security-scan
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - run: docker push registry/myapp:${{ github.sha }}
      - uses: sigstore/cosign-installer@v3
      - run: cosign sign --yes registry/myapp:${{ github.sha }}
  update-gitops-repo: # triggers Argo CD / Flux sync
    needs: push-and-sign
    runs-on: ubuntu-latest
    steps:
      - run: |
          yq e '.image.tag = "${{ github.sha }}"' -i k8s/deployment.yaml
          git commit -am "chore: bump image to ${{ github.sha }}"
          git push
Key principle: fail fast — lint and unit tests first (seconds), integration tests after image build (minutes), security scan before push. Never push an unsigned or unscanned image.
Kustomize uses a base + overlay model instead of templating. You write standard Kubernetes YAML (the base), then overlay environment-specific patches. No template syntax — just YAML transformations.
k8s/
├── base/
│ ├── deployment.yaml
│ ├── service.yaml
│ └── kustomization.yaml
└── overlays/
├── staging/
│ ├── kustomization.yaml # patches: replicas=1, image tag
│ └── patch-replicas.yaml
└── production/
├── kustomization.yaml # patches: replicas=10, resource limits
└── patch-resources.yaml
# overlays/production/kustomization.yaml
resources: [../../base] # "bases" is deprecated; list the base under resources
images:
  - name: myapp
    newTag: v1.4.2 # override image tag
patches:
  - path: patch-resources.yaml
Choose Kustomize when: Your YAML is straightforward and doesn't need complex templating logic. You want no DSL — just YAML patches. You're managing your own internal services. kubectl has Kustomize built in (kubectl apply -k ./overlays/production).
Choose Helm when: You're distributing to external users (community chart ecosystem). You need complex conditional logic. You want release management (rollbacks, revision history). Many teams use both: Helm for third-party charts, Kustomize for their own services.
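To make the base + overlay idea concrete, here is a minimal Python sketch of a patch being merged over a base manifest. It is a simplification and not Kustomize's actual algorithm: `apply_patch` is a hypothetical helper that merges dictionaries key by key and replaces lists wholesale, whereas a real strategic merge patch merges some lists (such as containers) by name.

```python
import copy
from typing import Any, Dict


def apply_patch(base: Dict[str, Any], patch: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of base with patch merged in.

    Nested dicts are merged key by key; every other value, including
    lists, is replaced by the patch value (a deliberate simplification).
    """
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_patch(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


base_deployment = {
    "kind": "Deployment",
    "metadata": {"name": "myapp", "labels": {"app": "myapp"}},
    "spec": {"replicas": 1},
}
production_patch = {
    "metadata": {"labels": {"env": "production"}},
    "spec": {"replicas": 10},
}

production = apply_patch(base_deployment, production_patch)
```

The base stays untouched and remains valid YAML on its own, which is the property that distinguishes overlays from templates.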
Database migrations in Kubernetes require careful orchestration to avoid downtime and data loss. The key principle is the expand-contract pattern (also called parallel change): make schema changes backward compatible across two deployments.
Expand (deploy 1): Add the new column/table as nullable, keeping the old schema intact. Both old and new application code work. Deploy and migrate.
Contract (deploy 2): After all old app instances are gone, backfill data, add NOT NULL constraints, and remove old columns.
Running migrations in Kubernetes:
- Init containers — run the migration as an init container before the application container starts. If the migration fails, the pod doesn't start. Simple, but doesn't coordinate across replicas.
- Kubernetes Job — run migration as a one-off Job before updating the Deployment. In Argo CD, use PreSync hooks; in Helm, use pre-upgrade hooks.
# Helm pre-upgrade hook for migrations
annotations:
  "helm.sh/hook": pre-upgrade
  "helm.sh/hook-weight": "-1"
  "helm.sh/hook-delete-policy": before-hook-creation
# Argo CD PreSync hook
annotations:
  argocd.argoproj.io/hook: PreSync
  argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
Always use a migration tool (Flyway, Liquibase, Alembic) with checksums. Never edit existing migrations — only add new ones. Test rollback paths explicitly.
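The expand phase can be exercised end to end with an in-memory SQLite database. This is a sketch under assumptions: the `users` table, the `display_name` column, and the helper names are hypothetical, and a real migration would be a versioned script run by a tool such as Flyway or Alembic.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    # Expand: add the new column as nullable so old writers keep working.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def backfill(conn: sqlite3.Connection) -> None:
    # Fill rows written before, or by old code during, the transition.
    conn.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute("INSERT INTO users (name) VALUES ('ada')")  # pre-migration row
    expand(conn)
    # Old application version: unaware of display_name, insert still succeeds.
    conn.execute("INSERT INTO users (name) VALUES ('grace')")
    # New application version: writes both columns during the transition.
    conn.execute(
        "INSERT INTO users (name, display_name) VALUES ('linus', 'Linus T.')"
    )
    backfill(conn)
    rows = conn.execute(
        "SELECT name, display_name FROM users ORDER BY id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo())
```

The contract step (adding NOT NULL and dropping the old column) is left out on purpose: it should only ship in a later deploy, once no running pod still depends on the old schema.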
In a GitOps model, environment promotion is a Git operation — updating the image tag or configuration in the environment's Git path, which Argo CD or Flux then reconciles.
Repository structure options:
- Folder-per-environment (monorepo): k8s/dev/, k8s/staging/, k8s/production/. Simple. Promotion = PR to update the image tag in the target environment folder.
- Branch-per-environment: main = production, staging branch, dev branch. Promotion = merge or cherry-pick. Harder to track drift between environments.
- Multi-repo: Separate repos for app code and manifests. CI updates the manifest repo; GitOps agent deploys from it. Clean separation of concerns.
# Promotion automation in CI (after staging validation passes)
- name: Promote to production
  if: github.event_name == 'workflow_dispatch' && github.ref == 'refs/heads/main'
  run: |
    cd gitops-repo
    # Update the production image tag
    yq e '.images[0].newTag = "${{ env.IMAGE_TAG }}"' \
      -i overlays/production/kustomization.yaml
    git add -A
    git commit -m "chore(prod): promote ${{ env.IMAGE_TAG }}"
    git push
    # Argo CD detects the change and syncs
Gate production promotions with: automated integration tests on staging, manual approval via PR review, optional chaos engineering / load testing on staging, and automated rollback triggers (watch error rate in production for 10 minutes post-deploy).
Tekton is a CNCF project that runs CI/CD pipelines as Kubernetes native resources. Each pipeline step runs as a container in a Kubernetes pod. Tekton is cloud-provider agnostic and designed for platforms that need to host CI/CD for multiple tenants.
Tekton vs GitHub Actions vs Jenkins:
- Tekton: Fully Kubernetes-native CRDs (Task, Pipeline, PipelineRun). Runs entirely in your cluster. Maximum portability and control. Steeper learning curve. Best for platform teams building internal developer platforms.
- GitHub Actions: Hosted SaaS — no infrastructure to manage. Rich marketplace ecosystem. Tightly coupled to GitHub. Excellent for teams already using GitHub and wanting simplicity. Limited control over runner environments.
- Jenkins: Mature, massive plugin ecosystem. High operational overhead. Jenkins X is a separate, Kubernetes-native project in the same family. Declining adoption for greenfield projects.
For most teams: GitHub Actions (or GitLab CI) for CI, Argo CD / Flux for CD. Tekton when you're building a self-service platform where many teams share a CI/CD infrastructure.
Feature flags decouple deployment from release. Code ships to production disabled, and features are enabled for specific users/groups without a new deployment. This is sometimes called "dark launching".
Benefits in Kubernetes context:
- Continuous delivery without continuous exposure — merge to main and deploy daily without every user seeing every change.
- Instant rollback — turn off a feature flag in seconds vs a Kubernetes rollback taking minutes.
- Targeted rollouts — enable for internal users → 1% → 10% → 100% (gradual rollout without Kubernetes traffic splitting).
- A/B testing — expose different variations to measure business impact.
Implementation options:
- OpenFeature — CNCF standard SDK that abstracts flag evaluation backends. Vendor-neutral.
- Flagsmith, Unleash — open-source flag management, self-hostable in Kubernetes.
- LaunchDarkly, Split — SaaS options with advanced targeting and experimentation.
Store flag state in the feature flag service, not in ConfigMaps — ConfigMap changes require pod restarts to pick up (unless using volume mounts with periodic refresh), while feature flag SDKs use streaming connections for instant updates.
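A targeted rollout such as 1% → 10% → 100% is usually implemented by hashing each user into a stable bucket. The sketch below shows the idea using only the standard library; `bucket`, `is_enabled`, and the flag key are hypothetical names and do not correspond to any particular vendor SDK.

```python
import hashlib
from typing import FrozenSet


def bucket(flag_key: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_key: str,
               user_id: str,
               rollout_percent: int,
               allowlist: FrozenSet[str] = frozenset()) -> bool:
    """Enable for allowlisted users, then for the first rollout_percent buckets."""
    if user_id in allowlist:
        return True  # e.g. internal users see the feature first
    return bucket(flag_key, user_id) < rollout_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    enabled = sum(is_enabled("new-checkout", u, 10) for u in users)
    print(f"{enabled} of {len(users)} users enabled at 10%")
```

Because the bucket depends only on the flag key and the user ID, a user who sees the feature at 10% keeps seeing it at 50%, and different flags slice the user base independently.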
Infrastructure as Code
8 questions
Terraform and Pulumi both provision cloud infrastructure declaratively, but they take fundamentally different approaches to the language.
Terraform (HCL): Uses HashiCorp Configuration Language — a domain-specific, declarative language. Large community, provider ecosystem (3000+ providers), and ecosystem tooling (Terragrunt, Atlantis, Terraform Cloud). HCL is purpose-built for infrastructure and readable by non-engineers. Loops and conditionals are possible but feel awkward.
Pulumi (TypeScript/Python/Go/C#): Uses real programming languages. Full language features — loops, functions, classes, package managers, unit tests. Better for infrastructure that requires complex logic. More approachable for application developers. Smaller provider ecosystem (but imports Terraform providers).
# Terraform — HCL (declarative)
resource "aws_eks_cluster" "main" {
  name     = "prod-cluster"
  role_arn = aws_iam_role.eks.arn
  version  = "1.28"
  vpc_config {
    subnet_ids = module.vpc.private_subnets
  }
}
// Pulumi — TypeScript (imperative-feeling but declarative execution)
const cluster = new eks.Cluster("prod-cluster", {
  version: "1.28",
  vpcId: vpc.id,
  privateSubnetIds: vpc.privateSubnetIds,
  instanceType: "t3.medium",
});
Choose Terraform for teams with mixed engineering levels, when community modules save time, or when standardizing across a large organization. Choose Pulumi for complex infrastructure logic, when your team is developer-heavy and wants to use existing languages and testing frameworks, or when building reusable infrastructure libraries.
Terraform state maps your configuration to real-world resources. It records which Terraform resource corresponds to which cloud resource (ARNs, IDs). Without state, Terraform can't know what already exists and would try to recreate everything.
Local state (default) is a terraform.tfstate file. In a team: engineers run terraform apply from their laptops, each with a different (stale) state file. Catastrophic conflicts occur. Never use local state in a team.
Remote state + locking (S3 + DynamoDB, Terraform Cloud, GCS) ensures:
- Single source of truth — one state file in a shared, versioned location.
- State locking — only one plan/apply runs at a time. DynamoDB lock prevents concurrent modifications.
- State history — S3 versioning allows recovery from accidental state corruption.
terraform {
  backend "s3" {
    bucket         = "mycompany-tf-state"
    key            = "production/eks/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true            # server-side encryption
    dynamodb_table = "tf-state-lock" # locking table
  }
}
State best practices: Never commit state to Git. Use workspaces or separate state files per environment. Restrict state bucket access to CI/CD roles only. Enable MFA delete on the S3 bucket.
Terraform modules help with DRY, but managing backend configuration and provider config across many environments and accounts still requires repetition. Terragrunt is a thin wrapper that adds DRY backend configuration, dependency management, and multi-module orchestration.
# terragrunt.hcl at the root — shared backend config
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite"
  }
  config = {
    bucket         = "mycompany-tf-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "tf-state-lock"
  }
}
# environments/production/eks/terragrunt.hcl
include "root" { path = find_in_parent_folders() }
terraform {
  source = "git::git@github.com:myorg/infra-modules.git//eks?ref=v2.1.0"
}
dependency "vpc" {
  config_path = "../vpc" # auto-reads VPC outputs
}
inputs = {
  cluster_name    = "prod"
  vpc_id          = dependency.vpc.outputs.vpc_id
  private_subnets = dependency.vpc.outputs.private_subnets
}
Terragrunt allows running terragrunt run-all apply across all modules in dependency order, making it practical to manage large multi-account, multi-region Terraform configurations.
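"Dependency order" here means a topological sort of the module graph. The following sketch shows the ordering with Python's standard `graphlib` (3.9+); the module names and the `apply_order` helper are hypothetical, and this says nothing about how Terragrunt itself schedules or parallelizes work.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def apply_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return modules ordered so that every dependency comes before its dependents.

    dependencies maps a module to the set of modules it depends on.
    Raises graphlib.CycleError when the graph contains a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


modules = {
    "vpc": set(),
    "eks": {"vpc"},
    "rds": {"vpc"},
    "app": {"eks", "rds"},
}

if __name__ == "__main__":
    print(apply_order(modules))
    try:
        apply_order({"a": {"b"}, "b": {"a"}})
    except CycleError as err:
        print("cycle detected:", err.args[1])
```

Destroying runs the same order in reverse, which is why a cycle between modules has to be rejected up front.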
A Terraform module is a directory of .tf files that encapsulates a reusable infrastructure component. Modules take input variables and expose output values.
Good module design principles:
- Single responsibility — one module per logical resource group (VPC module, EKS module, RDS module).
- Sensible defaults, override-able — production-safe defaults that developers don't have to think about.
- Expose outputs — expose IDs, ARNs, and endpoints so downstream modules can reference them.
- Semantic versioning — tag module versions in Git. Callers pin to a version, not main.
# modules/eks/variables.tf
variable "cluster_name" { type = string }
variable "kubernetes_version" {
  type    = string
  default = "1.28" # sensible default
}
variable "node_groups" {
  type = map(object({
    instance_types = list(string)
    min_size       = number
    max_size       = number
    desired_size   = number
  }))
}
# modules/eks/outputs.tf
output "cluster_endpoint" { value = aws_eks_cluster.this.endpoint }
output "cluster_ca_certificate" {
  value     = aws_eks_cluster.this.certificate_authority[0].data
  sensitive = true
}
output "oidc_provider_arn" { value = aws_iam_openid_connect_provider.this.arn }
The community terraform-aws-modules/eks module is excellent — read its source before writing your own; it handles many edge cases.
Testing IaC is harder than application testing — you're provisioning real infrastructure. A multi-layer approach is needed:
- Static analysis: terraform validate (syntax), tflint (linting, best practices, provider-specific rules), checkov / tfsec (security scanning — detect public S3 buckets, unencrypted volumes, etc.), terraform fmt (formatting in CI).
- Plan review: terraform plan in CI on every PR. Use tfcmt or Atlantis to post plan output as a PR comment. Require human review of the plan before apply.
- Unit tests (Terraform native): terraform test (GA in 1.6, with provider mocking added in 1.7) runs plan/apply against mock providers or real providers in a test workspace.
- Integration tests (Terratest): Go library that runs terraform apply, makes assertions against the real provisioned infrastructure, then destroys it. Expensive but catches real bugs. Run in a dedicated test account.
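// Terratest example (Go)
package test

import (
	"fmt"
	"testing"

	"github.com/gruntwork-io/terratest/modules/random"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestEKSModule(t *testing.T) {
	t.Parallel()
	terraformOptions := &terraform.Options{
		TerraformDir: "../modules/eks",
		Vars: map[string]interface{}{
			"cluster_name": fmt.Sprintf("test-%s", random.UniqueId()),
		},
	}
	// Registered before apply so resources are destroyed even if an assertion fails
	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)
	clusterEndpoint := terraform.Output(t, terraformOptions, "cluster_endpoint")
	assert.Contains(t, clusterEndpoint, "amazonaws.com")
}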
Crossplane is a CNCF project that extends Kubernetes to provision and manage cloud infrastructure using the Kubernetes control plane and GitOps workflows. Instead of running Terraform in CI, you declare an RDS instance or S3 bucket as a Kubernetes Custom Resource — Crossplane's controllers provision and reconcile it.
apiVersion: database.aws.crossplane.io/v1beta1
kind: RDSInstance
metadata:
  name: production-postgres
spec:
  forProvider:
    region: us-east-1
    dbInstanceClass: db.t3.medium
    engine: postgres
    engineVersion: "15.4"
    allocatedStorage: 100
    multiAZ: true
  writeConnectionSecretToRef:
    namespace: production
    name: postgres-conn # Crossplane writes connection details as a K8s Secret
Benefits: Infrastructure and applications share the same GitOps pipeline. Developers can self-service infrastructure via Composite Resource Claims (abstractions over raw cloud resources) without needing cloud console access. Drift detection and automatic remediation via Kubernetes reconciliation.
Trade-offs: Crossplane adds cluster complexity. The team must understand both Kubernetes and the cloud provider. Terraform has a larger ecosystem and is more mature for complex IaC. Crossplane shines in platform engineering — providing a self-service infrastructure API for developers.
Terraform is optimized for provisioning infrastructure (creating/modifying/destroying cloud resources). It's declarative and manages state. Ansible is optimized for configuration management — installing software, configuring files, managing OS state on existing servers. It's procedural (playbooks run top-to-bottom) and agentless (uses SSH).
Complementary roles:
- Terraform provisions the VM, VPC, security groups, load balancer.
- Ansible configures the VM: installs Docker, configures nginx, deploys application files, manages cron jobs.
In a Kubernetes world, Ansible's role shrinks — containers solve configuration management for application code. Ansible remains relevant for: configuring Kubernetes nodes before joining the cluster, managing bare-metal infrastructure, configuring network devices, and managing non-cloud legacy systems.
# Ansible playbook — configure a new server
- hosts: k8s_nodes
  become: yes
  tasks:
    - name: Install containerd
      package: name=containerd state=present
    - name: Configure cgroup driver
      template:
        src: containerd-config.toml.j2
        dest: /etc/containerd/config.toml
      notify: restart containerd
  handlers:
    - name: restart containerd
      service: name=containerd state=restarted
Scaling Terraform requires solving state isolation, team workflows, and drift detection.
State isolation: Separate state files per environment, per account, per region. Use Terragrunt or Terraform workspaces (workspaces are simpler but share the same backend bucket — prefer separate state files for strong isolation).
Multi-account strategy: Use AWS Organizations / GCP Folders. Each account/project has its own Terraform state. A "management" account/project runs Terraform to create other accounts (account vending machine pattern).
Workflow automation (Atlantis): Atlantis is a self-hosted pull-request automation tool for Terraform. Opening or updating a PR triggers terraform plan and posts the plan as a PR comment. Once the PR is approved, an atlantis apply comment runs terraform apply, by default before the PR is merged. No engineer runs Terraform locally against production; everything goes through PR review.
# atlantis.yaml — per-repo configuration
version: 3
projects:
  - name: production-eks
    dir: environments/production/eks
    workspace: default
    autoplan:
      when_modified: ["**/*.tf", "**/*.tfvars", "../../modules/**"]
    apply_requirements: [approved, mergeable] # require PR approval before apply
Module registry: Publish internal modules to Terraform Cloud Private Registry or a private GitHub repo with versioned releases. Enforce module versions in CI (tflint --enable-rule=terraform_module_pinned_source).
Drift detection: Run terraform plan nightly. Alert if plan shows changes (someone modified infrastructure outside Terraform). Auto-remediate with terraform apply or alert the team, depending on risk tolerance.
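A nightly drift check can lean on `terraform plan -detailed-exitcode`, which exits with 0 when there is nothing to change, 1 on error, and 2 when the plan contains changes. The wrapper below is a sketch: `PlanStatus`, `classify_exit_code`, and `check_drift` are hypothetical names, and alert routing is left out.

```python
import subprocess
from enum import Enum


class PlanStatus(Enum):
    NO_CHANGES = "no_changes"
    DRIFT = "drift"
    ERROR = "error"


def classify_exit_code(code: int) -> PlanStatus:
    """Interpret the exit code of `terraform plan -detailed-exitcode`."""
    if code == 0:
        return PlanStatus.NO_CHANGES
    if code == 2:
        return PlanStatus.DRIFT  # plan succeeded and found a non-empty diff
    return PlanStatus.ERROR      # 1, or anything unexpected


def check_drift(workdir: str) -> PlanStatus:
    """Run a read-only plan in workdir (requires terraform on PATH and an initialized directory)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_exit_code(result.returncode)
```

Treating an error as distinct from drift matters in practice: an expired credential should not be reported as someone changing infrastructure by hand.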
Observability
10 questions
- Metrics — numerical measurements aggregated over time (request rate, error rate, latency percentiles, CPU usage). Cheap to store and query. Answer "what is happening?" and "how often?" Use for alerting and dashboards.
- Logs — discrete, timestamped records of events. Rich in context but expensive to store and slow to query at scale. Answer "what exactly happened at time T?" Use for debugging specific incidents.
- Traces — end-to-end journey of a single request through distributed services. Each trace is composed of spans. Answer "where did this request spend its time?" Use for performance analysis, understanding service dependencies, and root-causing latency.
How they complement each other: Metrics alert you that error rate is high. Traces show which service in the call chain is failing. Logs in that service show the exact error. The workflow is: metrics → traces → logs (funnel from aggregate to specific).
Modern addition — Continuous Profiling: Tools like Parca, Pyroscope, or Datadog Continuous Profiler add a fourth signal — always-on CPU/memory profiling. Completes the picture when you need to understand why code is slow at the function level.
Prometheus uses a pull model — it scrapes metrics from HTTP endpoints (/metrics) on a configured interval (usually 15-30 seconds). This is opposite to push-based systems (StatsD, InfluxDB line protocol).
Architecture:
- Prometheus server — scrapes targets, stores time series in its local TSDB (time-series database, 2-hour blocks with compaction), evaluates alerting rules.
- Service Discovery — in Kubernetes, Prometheus uses the API server to discover pods, services, and nodes dynamically. Annotations or ServiceMonitor CRDs (kube-prometheus-stack) configure which targets to scrape.
- Alertmanager — receives firing alerts from Prometheus, deduplicates, groups, silences, and routes to receivers (PagerDuty, Slack, OpsGenie).
- Grafana — visualization layer, reads from Prometheus via PromQL.
# PromQL examples
# Per-second rate of 5xx responses over the last 5 minutes
rate(http_requests_total{job="api", status=~"5.."}[5m])
# 99th percentile latency (aggregate buckets by le across instances first)
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m])))
# Kubernetes container CPU usage in cores
rate(container_cpu_usage_seconds_total{namespace="production"}[5m])
Long-term storage: Prometheus local TSDB is not designed for multi-year retention. Use Thanos or Cortex (or Grafana Mimir) to shard Prometheus, add global query layer, and store metrics in object storage (S3) for years.
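Since most of the queries above rely on rate(), it helps to see what it computes from a scraped counter. The function below is a simplified sketch, not Prometheus's implementation: `simple_rate` is a hypothetical name, and it skips the extrapolation to window boundaries that the real rate() performs.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> Optional[float]:
    """Per-second increase of a counter across the samples in a window.

    A drop in value is treated as a counter reset (for example a pod
    restart), so the post-reset value counts as new increase.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two scrapes
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            increase += current  # counter restarted from zero
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


if __name__ == "__main__":
    scrapes = [(0, 100), (15, 130), (30, 160), (45, 190), (60, 220)]
    print(simple_rate(scrapes))  # requests per second over the window
```

This is also why the range in `[5m]` should span several scrape intervals: with fewer than two samples in the window there is nothing to compute.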
OpenTelemetry (OTel) is a CNCF standard for telemetry collection — a vendor-neutral SDK and protocol (OTLP) for traces, metrics, and logs. It replaces a fragmented ecosystem where every vendor had a proprietary agent and SDK.
Why it won: Instrument once, send anywhere. Switch observability backends (Datadog → Honeycomb → Grafana stack) by changing the exporter configuration, not the application code. This is a major operational advantage.
# Java auto-instrumentation — zero code changes
# Add the OTel Java agent at startup (shell)
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=payments-api \
  -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
  -jar app.jar
// Manual instrumentation for custom spans (Java)
Tracer tracer = GlobalOpenTelemetry.getTracer("payments");
Span span = tracer.spanBuilder("process-payment")
    .setAttribute("payment.amount", amount)
    .startSpan();
try (Scope scope = span.makeCurrent()) {
    // business logic
} catch (Exception e) {
    span.recordException(e);
    span.setStatus(StatusCode.ERROR);
} finally {
    span.end();
}
OTel Collector — a vendor-agnostic agent that receives telemetry, processes it (filter, batch, enrich), and exports to multiple backends simultaneously. Deploy as a DaemonSet (one per node) or a Deployment (central gateway). Decouples applications from observability backends.
Grafana Loki is a log aggregation system inspired by Prometheus. Key design: Loki only indexes log labels (not the full text). Log content is compressed and stored in object storage (S3). This makes Loki dramatically cheaper to operate than Elasticsearch at scale.
Loki stack: Promtail (DaemonSet that tails container logs and ships to Loki) → Loki (stores and serves) → Grafana (query with LogQL, visualize alongside metrics from Prometheus).
ELK stack (Elasticsearch, Logstash, Kibana / Filebeat): Full-text indexes every log line. Fast full-text search. Powerful aggregations. High resource consumption (Elasticsearch is JVM-based, requires significant RAM and storage). Expensive to operate at scale.
Comparison:
- Cost: Loki wins significantly — no full-text indexing, object storage is cheap.
- Query power: Elasticsearch wins for ad-hoc full-text search. Loki's LogQL is label-first — efficient if you label well (namespace, pod, level, trace_id).
- Kubernetes integration: Loki wins — Kubernetes labels map naturally to Loki labels. Auto-discovery via Promtail.
- Grafana integration: Loki + Grafana allows correlating logs with Prometheus metrics in the same dashboard (click a metric spike → see logs from that time period).
- SLI (Service Level Indicator) — a metric that measures service quality. Example: the ratio of successful requests to total requests (availability SLI).
- SLO (Service Level Objective) — a target for the SLI. Example: 99.9% of requests succeed over a 30-day rolling window. SLOs are internal targets.
- SLA (Service Level Agreement) — the contractual commitment to customers, usually less strict than the internal SLO (buffer for incidents).
- Error budget — the allowed unreliability. 99.9% SLO = 0.1% budget = ~43 minutes downtime per month. When the budget is exhausted, stop feature work and focus on reliability.
# PromQL — 30-day availability SLI
1 - (
  sum(rate(http_requests_total{status=~"5.."}[30d]))
  /
  sum(rate(http_requests_total[30d]))
)
# Remaining error budget (1 = untouched, 0 = exhausted, negative = overspent)
1 - (
  (
    sum(rate(http_requests_total{status=~"5.."}[30d]))
    /
    sum(rate(http_requests_total[30d]))
  )
  / (1 - 0.999) # observed error ratio divided by allowed error ratio
)
Use Sloth or OpenSLO to define SLOs declaratively and generate multi-window, multi-burn-rate alerts automatically. Multi-burn-rate alerts catch both fast burns (big incident) and slow burns (small steady degradation).
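The error budget arithmetic is easy to verify by hand. A small Python sketch, with hypothetical function names and an assumed 30-day window:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def remaining_budget_fraction(slo: float, total_requests: int, failed_requests: int) -> float:
    """Share of the error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means overspent.
    """
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1 - failed_requests / allowed_failures


if __name__ == "__main__":
    print(f"{error_budget_minutes(0.999):.1f} minutes per 30 days")
    print(f"{remaining_budget_fraction(0.999, 1_000_000, 250):.2f} of budget left")
```

With one million requests and a 99.9% SLO, 1,000 failures are allowed, so 250 failures consume a quarter of the budget.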
RED Method (Tom Wilkie) — for services (request-driven microservices, APIs):
- Rate — how many requests per second is the service handling?
- Errors — what fraction of requests are failing?
- Duration — how long are requests taking? (latency histograms: p50, p95, p99)
USE Method (Brendan Gregg) — for resources (CPU, memory, disk, network, queues):
- Utilization — what fraction of the resource's capacity is being used? (e.g., CPU at 80%)
- Saturation — how much is the resource over-loaded? (e.g., CPU run queue depth, memory swap usage)
- Errors — are there hardware errors, packet drops, disk errors?
Apply RED to your microservices dashboards (are users experiencing problems?). Apply USE to your infrastructure dashboards (is the hardware/OS under stress?). They complement each other: RED tells you there's a problem, USE helps find whether it's resource-related.
Google's Four Golden Signals (from the SRE book) are very similar: Latency, Traffic, Errors, Saturation — a practical synthesis of both methods.
Distributed tracing requires propagating trace context (trace ID, span ID) across service boundaries via HTTP headers or message metadata, so the entire call chain can be reconstructed.
Implementation choices:
- Auto-instrumentation: OTel Java/Python/Node.js agents instrument HTTP clients and servers automatically. No code changes — just add the agent. Propagates W3C TraceContext headers by default.
- Manual instrumentation: Add custom spans for business operations (payment processing, inventory check) that aren't captured by framework instrumentation.
- Service mesh traces: Istio and Linkerd capture L4/L7 traces between services automatically (no application changes). Less detailed than application traces (don't show internal function calls) but useful for service topology maps.
# OTel Collector pipeline — receive traces, send to Jaeger and Tempo
receivers:
  otlp:
    protocols:
      grpc: { endpoint: "0.0.0.0:4317" }
      http: { endpoint: "0.0.0.0:4318" }
processors:
  batch: { timeout: 5s }
  tail_sampling: # sample 100% of traces with errors, 1% of success
    decision_wait: 30s
    policies:
      - name: error-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: probabilistic
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }
exporters:
  # Recent Collector releases removed the dedicated jaeger exporter; Jaeger ingests OTLP directly
  otlp/jaeger: { endpoint: "jaeger:4317" }
  otlp/tempo: { endpoint: "tempo:4317" }
service: # components are inactive until wired into a pipeline
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/jaeger, otlp/tempo]
Backends: Jaeger (open-source, Cassandra/Elasticsearch storage), Grafana Tempo (Loki-like — label index only, trace content in object storage, very cheap), Honeycomb, Datadog APM.
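Context propagation comes down to reading and writing one header. The sketch below parses a W3C `traceparent` value and derives the header a service would send downstream; the helper names are hypothetical, only version `00` is handled strictly, and in practice an OTel SDK does this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version - trace-id (32 hex) - parent/span-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the trace context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # values the specification declares invalid
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceContext) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    context = parse_traceparent(incoming)
    print(context)
    print(child_traceparent(context))
```

The trace ID stays constant across every hop while each service contributes its own span ID, which is what lets a backend stitch the spans into one call chain.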
Kubernetes Events are objects that record what happened to other objects — pod scheduling, image pulls, OOMKills, probe failures, node pressure. They are the first place to look when a deployment is misbehaving.
# Watch events for a specific pod
kubectl describe pod <pod-name> # events section at the bottom
# Watch cluster-wide events in real time
kubectl get events -A --sort-by='.lastTimestamp' -w
# Filter for warnings
kubectl get events -A --field-selector type=Warning
Problem: Events are stored in etcd and are only retained for 1 hour by default (configurable). They're not queryable historically without exporting them.
Solutions:
- Kubernetes Event Exporter — ships events to Elasticsearch, Loki, or webhook endpoints for persistent storage and alerting.
- Botkube — sends Kubernetes events to Slack/Teams with filtering rules. Immediate notification when a pod OOMKills or a deployment fails.
- kube-state-metrics — exposes Kubernetes object state as Prometheus metrics (deployment available replicas, pod phase, PVC bound status). Alert on deployment rollout failures, pending pods exceeding 5 minutes, PVCs in Pending state.
Good alerting principles:
- Alert on symptoms, not causes. Alert when users are impacted (high error rate, high latency) — not on CPU at 80% which may be normal. Cause-based alerts generate noise.
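- Alert on SLO burn rate, not raw thresholds. Burn rate is the observed error ratio divided by the ratio the SLO allows. Against a 99.9% SLO, a sustained 5% error rate burns budget 50× faster than allowed and would exhaust a 30-day budget in about 14 hours, while 0.5% burns at 5× and takes about 6 days.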
- Every alert must be actionable — if an engineer can't do anything about it, it's not an alert; it's a dashboard.
- Include runbook links in alert annotations. Alert fatigue kills on-call quality.
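Burn-rate alerting can be reasoned about with a few lines of arithmetic. In this sketch the function names are hypothetical, and the 14.4 threshold is the commonly cited fast-burn value (2% of a 30-day budget consumed in one hour), used here as an assumption rather than a universal setting.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return error_ratio / (1 - slo)


def hours_to_exhaustion(error_ratio: float, slo: float, window_days: int = 30) -> float:
    """Hours until the whole budget is gone if the current error ratio persists."""
    rate = burn_rate(error_ratio, slo)
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


def should_page(long_window_ratio: float,
                short_window_ratio: float,
                slo: float,
                threshold: float = 14.4) -> bool:
    """Multi-window check: page only if both windows exceed the threshold.

    The long window (e.g. 1h) shows the burn is significant; the short
    window (e.g. 5m) shows it is still happening right now.
    """
    return (burn_rate(long_window_ratio, slo) >= threshold
            and burn_rate(short_window_ratio, slo) >= threshold)


if __name__ == "__main__":
    print(f"{burn_rate(0.05, 0.999):.0f}x burn, "
          f"{hours_to_exhaustion(0.05, 0.999):.1f} hours to exhaustion")
```

Requiring the short window as well keeps the alert from firing long after an incident has already been mitigated.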
Common anti-patterns:
- Alert overload / noise: Too many alerts → engineers start ignoring them → miss real incidents. Tune aggressively. Merge related alerts. Eliminate flapping alerts.
- Alerting on every metric: CPU, memory, disk I/O — all generate noise without indicating user impact. Use USE for dashboards, not for paging alerts.
- Missing alerting windows: Alerting on a 1-minute rate misses slow degradations. Use multi-window burn rates.
- No deduplication or grouping: A single bad deployment triggers 100 pod-level alerts. Alertmanager group_by should aggregate these into one actionable page.
# Alertmanager — group related alerts
route:
  group_by: [alertname, namespace, cluster]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: pagerduty-team
  routes:
    - match: { severity: warning }
      receiver: slack-team # warnings go to Slack
    - match: { severity: critical }
      receiver: pagerduty-team # critical pages on-call
Kubernetes audit logging records every request to the API server with who made it, what they did, what resource was affected, and when. Critical for incident investigation and compliance (SOC 2, PCI, HIPAA).
Audit policy levels: None, Metadata (request metadata only — no body), Request (metadata + request body), RequestResponse (full request and response — very verbose, use sparingly).
# audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Don't log read-only requests to non-sensitive resources
  - level: None
    verbs: [get, list, watch]
    resources: [{group: "", resources: [pods, services]}]
  # Log secret access at metadata level (no values in logs)
  - level: Metadata
    resources: [{group: "", resources: [secrets, configmaps]}]
  # Log all exec and port-forward — high-risk operations
  - level: RequestResponse
    resources: [{group: "", resources: [pods/exec, pods/portforward]}]
  # Log all other requests at metadata level
  - level: Metadata
High-value security events to alert on:
- Any access to secrets (especially from unexpected ServiceAccounts).
- kubectl exec or port-forward into production pods.
- New ClusterRoleBinding creations (privilege escalation).
- Deletion of audit logs themselves.
- API calls from unexpected source IPs.
Ship audit logs to a separate, write-only storage (S3 with object lock, or SIEM) that cluster admins cannot delete — immutable audit trails for forensics.
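Several of the high-value events listed above can be detected from fields every audit event carries (`verb`, `user.username`, `objectRef.resource`, `objectRef.subresource`). The classifier below is a sketch: the finding labels, the `classify` function, and the trusted-reader set are hypothetical, and a real deployment would express these rules in a SIEM or a tool such as Falco.

```python
from typing import Any, Dict, FrozenSet, List


def classify(event: Dict[str, Any],
             trusted_secret_readers: FrozenSet[str] = frozenset()) -> List[str]:
    """Return labels for the security-relevant patterns found in one audit event."""
    ref = event.get("objectRef") or {}
    resource = ref.get("resource", "")
    subresource = ref.get("subresource", "")
    verb = event.get("verb", "")
    username = (event.get("user") or {}).get("username", "")

    findings: List[str] = []
    if (resource == "secrets" and verb in {"get", "list", "watch"}
            and username not in trusted_secret_readers):
        findings.append("secret-read-by-unexpected-identity")
    if resource == "pods" and subresource in {"exec", "portforward"}:
        findings.append("interactive-pod-access")
    if resource == "clusterrolebindings" and verb == "create":
        findings.append("new-clusterrolebinding")
    return findings


if __name__ == "__main__":
    sample = {
        "verb": "create",
        "user": {"username": "jane@example.com"},
        "objectRef": {"resource": "pods", "subresource": "exec",
                      "namespace": "production", "name": "api-0"},
    }
    print(classify(sample))
```

Running such checks on the shipped copy of the log, rather than inside the cluster, means an attacker with cluster-admin rights cannot silence them.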
Reliability & SRE
9 questions
Site Reliability Engineering (SRE) is Google's approach to operations: treat operations as a software engineering problem. SREs write code to automate operational work, rather than performing manual toil.
Key principles:
- 50% toil cap: SREs spend at most 50% of their time on toil (manual, repetitive, automatable work). The rest goes to engineering — improving systems, writing automation, reducing toil permanently.
- Error budgets: If a service has 99.9% availability SLO, it has 0.1% error budget. When budget is healthy, feature velocity is prioritized. When budget is exhausted, reliability work takes precedence. This aligns developers and operators around a shared objective.
- Blameless postmortems: After incidents, focus on system failures and process improvements, not individual mistakes. Blame prevents people from reporting problems and learning from them.
- Toil budgets: When toil exceeds 50%, SREs push back on developers or automate. This creates an incentive to build reliable, self-healing systems.
The goal is to limit the blast radius of failures — a downstream service being down should degrade functionality, not crash the entire system.
Key patterns:
- Circuit breaker: After N consecutive failures to a downstream service, "open" the circuit — return an immediate fallback response instead of calling the failing service. Prevents cascading failures and allows recovery time. Implement with Resilience4j, or use Istio/Linkerd circuit breakers.
- Retry with exponential backoff + jitter: Retry transient failures, but not immediately. Add randomized jitter to prevent thundering herd (all retriers hitting the recovering service simultaneously).
- Timeout: Every external call must have a deadline. Without timeouts, threads block indefinitely, causing resource exhaustion.
- Bulkhead: Isolate resources (thread pools, connection pools) for different downstream services. A slow downstream doesn't exhaust the shared thread pool and crash unrelated features.
- Fallback: Return cached data, degraded response, or a user-friendly error instead of propagating the failure.
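Two of the patterns above, retry with jittered exponential backoff and the circuit breaker, can be sketched in a few lines. This is a minimal single-threaded illustration with hypothetical names and defaults (`threshold`, `reset_after`, `base`, `cap`). It is not a substitute for a library such as Resilience4j or for mesh-level circuit breaking.

```python
import random
import time


def backoff_delays(attempts, base=0.1, cap=5.0, rng=None):
    """'Full jitter' delays: each one is uniform in [0, min(cap, base * 2**n)]."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; allows a trial call
    once `reset_after` seconds have passed (the half-open state)."""

    def __init__(self, threshold=3, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: fail fast, do not touch the dependency
            # half-open: fall through and let one trial call reach the dependency
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The breaker takes a `clock` argument so that tests can advance time without sleeping. The fallback is a callable so that cached or degraded responses are only computed when needed.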
Chaos engineering (Chaos Monkey, Chaos Mesh, LitmusChaos) — intentionally inject failures (kill pods, add network latency, exhaust CPU) in staging and production to validate fault tolerance. "Design for failure" is only proven when you actually test failure.
A postmortem (also called incident review or retrospective) is a structured analysis of an incident to understand its causes, impact, and how to prevent recurrence.
Effective postmortem structure:
- Impact statement: Duration, affected users/services, business impact (revenue, SLO budget burned).
- Timeline: Precise event sequence — when the incident started, when it was detected, key actions and their outcomes, time to mitigation, time to resolution.
- Root cause analysis: Use "5 Whys" to get past symptoms to underlying causes. Multiple contributing factors are common.
- Action items: Specific, owned, and time-bound. "Improve monitoring" is not an action item. "Add PagerDuty alert for payment service error rate > 1% — assigned to @alice, due 2024-02-15" is.
Common pitfalls:
- Blaming individuals — kills psychological safety and prevents honest reporting in future incidents.
- Treating symptoms as root causes — stopping at "the deployment was bad" instead of "our deployment pipeline lacked automated regression testing for X."
- Action items with no owner or deadline — they never get done. Track in a ticketing system with SLOs on completion.
- Not sharing postmortems broadly — incidents are learning opportunities for the whole organization, not just the team involved.
Blue-green deployments run two identical production environments (blue = current, green = new version). Traffic switches instantly from blue to green by updating a Service selector or load balancer. Rollback is instant — switch back.
# Blue Deployment (current production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-blue
spec:
  replicas: 10
  selector:
    matchLabels: { app: api, version: blue }
  template:
    metadata:
      labels: { app: api, version: blue }
    spec:
      containers:
        - name: api              # container name is required
          image: myapp:v1.3.0
---
# Green Deployment (new version, initially not receiving traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-green
spec:
  replicas: 10
  selector:
    matchLabels: { app: api, version: green }
  template:
    metadata:
      labels: { app: api, version: green }
    spec:
      containers:
        - name: api
          image: myapp:v1.4.0
---
# Service: change the selector to switch traffic
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  selector:
    app: api
    version: blue   # ← change to "green" to switch traffic (instant!)
  ports:
    - port: 80
Trade-offs: Instant cutover and rollback, but requires 2x the resources (both deployments must run simultaneously). Database schema changes must be backward compatible with both versions. Use Argo Rollouts to automate and integrate with metric analysis before promoting.
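The cutover itself is a one-field change to the Service selector. The sketch below builds that patch as JSON. The function names are hypothetical, and the `app: api` label and blue/green values are taken from the manifests above.

```python
import json


def selector_patch(color):
    """Build a merge patch that points the Service at one color."""
    if color not in {"blue", "green"}:
        raise ValueError(f"unknown color: {color!r}")
    return json.dumps({"spec": {"selector": {"app": "api", "version": color}}})


def rollback_color(current):
    """The color to switch back to if the new version misbehaves."""
    return "blue" if current == "green" else "green"


if __name__ == "__main__":
    print(selector_patch("green"))
```

The printed JSON can be passed to `kubectl patch service api -p '<patch>'`. Rollback is the same call with `rollback_color("green")`, which is only instant while the blue Deployment is still running at full replica count.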
Chaos engineering is the practice of deliberately injecting failures to verify that a system handles them gracefully. It turns unknown unknowns (does our retry logic actually work?) into known knowns.
Principles for safe chaos (from the Principles of Chaos Engineering):
- Define a "steady state" hypothesis — what does normal look like? (e.g., error rate < 0.1%, p99 latency < 200ms).
- Vary real-world events — pod failures, network latency, CPU/memory pressure, region outage.
- Run in production — staging doesn't have production traffic patterns. But start small (blast radius = 1 pod, 1% of traffic).
- Automate — manual chaos experiments are forgotten. Schedule GameDays with automated tooling.
- Have a kill switch — ability to stop the experiment instantly if the hypothesis is wrong and real damage is occurring.
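# Chaos Mesh: inject pod failures in a specific namespace
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payments-pod-kill    # illustrative name
  namespace: staging
spec:
  action: pod-kill
  mode: one                  # kill one pod at a time
  selector:
    namespaces: [staging]
    labelSelectors: { app: payments }
  scheduler:                 # Chaos Mesh 1.x field; 2.x schedules experiments with a separate Schedule resource
    cron: "@every 1h"        # scheduled experiment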
Tools: Chaos Mesh (CNCF, Kubernetes-native), LitmusChaos (CNCF, workflow-based), Gremlin (SaaS). Start with killing pods (the simplest experiment), then network faults, then resource exhaustion.
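The steady-state hypothesis and the kill switch can be tied together in code: while the experiment runs, compare observed traffic against the thresholds and abort when they are violated. The sketch below assumes request samples are available as `(latency_ms, succeeded)` tuples. How they are collected (Prometheus, access logs) is left open, and the default thresholds mirror the example values in the list above.

```python
import math


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


def steady_state_ok(requests, max_error_rate=0.001, max_p99_ms=200):
    """requests: non-empty list of (latency_ms, succeeded) tuples
    observed while the experiment is running."""
    errors = sum(1 for _, ok in requests if not ok)
    error_rate = errors / len(requests)
    latency = p99([lat for lat, _ in requests])
    return error_rate < max_error_rate and latency < max_p99_ms
```

An experiment runner would call `steady_state_ok` on a rolling window and delete the chaos resource as soon as it returns `False`.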
Control plane HA requires multiple control plane nodes spread across availability zones, an HA etcd cluster (3 or 5 members; use an odd number, because an even-sized cluster tolerates no more failures than the next smaller odd size), and load-balanced API server endpoints.
Self-managed HA cluster (kubeadm):
- Minimum 3 control plane nodes (one per AZ) + 3 etcd nodes (can be co-located or dedicated). 3 nodes tolerate 1 failure; 5 nodes tolerate 2.
- External load balancer in front of multiple API server instances (HAProxy, AWS NLB, or cloud-managed).
- etcd requires low-latency, high-IOPS storage (NVMe SSD). Network latency between etcd members must be <10ms for stable operation.
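The quorum arithmetic behind the "3 tolerate 1, 5 tolerate 2" rule is short enough to write down. The function names are illustrative. The majority rule comes from Raft, which etcd uses for consensus.

```python
def quorum(members):
    """Votes needed for an etcd cluster of this size to commit a write."""
    return members // 2 + 1


def failures_tolerated(members):
    """Members that can be lost while the cluster still has quorum."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} members: quorum={quorum(n)}, "
              f"tolerates {failures_tolerated(n)} failure(s)")
```

Going from 3 to 4 members raises the quorum to 3 but still tolerates only one failure, which is why even cluster sizes add cost and replication traffic without adding resilience.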
Managed Kubernetes (EKS/GKE/AKS): The cloud provider manages control plane HA. EKS runs the API server across 3 AZs automatically. Your responsibility is ensuring worker nodes span AZs, and that critical system pods (CoreDNS, CNI) run on multiple nodes.
Losing the control plane does not immediately kill running workloads: pods keep running and the kubelet keeps managing containers on each node. However, no new pods can be scheduled, pods lost with a failed node are not rescheduled, and no changes can be applied until the control plane is restored. That independence of the data plane buys time, not immunity, which is why an HA control plane remains critical for production.
Capacity planning is the practice of ensuring sufficient resources exist (now and in the future) to run workloads within performance and cost targets.
Right-sizing process:
- Measure actual usage: VPA recommendations, kubectl top, Goldilocks (runs VPA in recommendation mode for all workloads and displays results in a dashboard).
- Set accurate requests: Requests should reflect actual sustained usage (not peaks). Use p95 CPU and p99 memory from the past 7-14 days.
- Headroom: Don't fill nodes to 100% requested; leave 20-30% for burst and system components. Node allocatable capacity is less than total (kubelet, OS, kube-proxy reserve resources).
- Node type selection: Memory-optimized instances for JVM workloads. Compute-optimized for CPU-intensive jobs. Spot/preemptible for batch. Node size: fewer, larger nodes reduce overhead (less kubelet overhead) but increase blast radius per failure. Find the right balance.
- Forecast: Use traffic trends + growth projections to anticipate when you'll need to expand. Set up Cluster Autoscaler or Karpenter to handle elasticity automatically.
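The headroom and node-sizing steps reduce to simple arithmetic. The sketch below is a rough estimate under stated assumptions: one pod shape, homogeneous nodes, and allocatable values that already exclude kubelet and OS reservations. Units are millicores and MiB, and all the numbers in the usage example are hypothetical.

```python
import math


def pods_per_node(pod_cpu_m, pod_mem_mi, alloc_cpu_m, alloc_mem_mi, headroom=0.25):
    """How many pods of one shape fit on a node while keeping `headroom` free."""
    usable_cpu = alloc_cpu_m * (1 - headroom)
    usable_mem = alloc_mem_mi * (1 - headroom)
    # The scarcer of the two resources decides how many pods fit.
    return int(min(usable_cpu // pod_cpu_m, usable_mem // pod_mem_mi))


def nodes_needed(replicas, per_node):
    """Nodes required to place all replicas at the given density."""
    if per_node < 1:
        raise ValueError("pod does not fit on this node type")
    return math.ceil(replicas / per_node)


if __name__ == "__main__":
    density = pods_per_node(500, 1024, alloc_cpu_m=8000, alloc_mem_mi=32000)
    print(density, nodes_needed(40, density))
```

With 500m / 1024Mi pods on a node with 8000m / 32000Mi allocatable and 25% headroom, CPU is the binding resource. Here that suggests a compute-optimized node type would pack more densely.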
Tools: Kubecost (cost allocation per namespace/team), OpenCost (the open-source CNCF project that originated from Kubecost), and cloud provider cost explorer with Kubernetes labels for cost attribution.
A GameDay (or Disaster Recovery drill) is a scheduled practice incident. The team deliberately simulates failure scenarios and practices detection, response, and recovery — before a real incident forces them to.
How to run one:
- Choose a scenario: AZ failure, database outage, certificate expiry, DDoS attack, data corruption, accidental deletion of a Deployment.
- Set a hypothesis: "We can recover from an AZ failure with <5 minutes of service degradation."
- Brief participants: Engineers know a GameDay is happening; they don't know the specific scenario until it starts (realistic detection practice).
- Inject the failure and measure how long it takes to detect, diagnose, and remediate.
- Debrief and postmortem: Document gaps — runbook was missing a step, alert didn't fire, the on-call rotation had a gap, recovery took 30 min not 5.
GameDays validate runbooks, test monitoring and alerting, identify knowledge gaps, and build team confidence. Schedule quarterly at minimum. Common finding: the alert existed but nobody knew what to do with it (no runbook, or runbook was stale).
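The measurements from a drill can be checked against the hypothesis mechanically. The sketch below assumes three timestamps are recorded by hand or by tooling. The function name and report keys are illustrative, and the 5-minute default mirrors the example hypothesis above.

```python
from datetime import datetime, timedelta


def gameday_report(injected, detected, recovered, target=timedelta(minutes=5)):
    """Compare measured timings of a drill against the recovery hypothesis."""
    time_to_detect = detected - injected
    time_to_recover = recovered - injected
    return {
        "time_to_detect": time_to_detect,
        "time_to_recover": time_to_recover,
        "hypothesis_held": time_to_recover < target,
    }


if __name__ == "__main__":
    report = gameday_report(
        injected=datetime(2024, 1, 10, 14, 0),
        detected=datetime(2024, 1, 10, 14, 3),
        recovered=datetime(2024, 1, 10, 14, 30),
    )
    print(report)
```

Recording the numbers this way gives the debrief a concrete gap to explain (30 minutes measured against a 5-minute target) rather than a general impression that recovery was slow.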
On-call is a critical reliability function, but unsustainable on-call kills morale and causes attrition. Designing it well is as important as designing the system itself.
Elements of healthy on-call:
- Rotation size: Minimum 6-8 engineers so each person is on-call at most once every 6-8 weeks. Smaller rotations mean burnout.
- Actionable alerts only: The worst outcome is paging someone for something they can't fix or that resolves itself. Every page should require action and have a runbook.
- Alert volume budget: The Google SRE book suggests a maximum of two incidents per 12-hour on-call shift. More than that indicates systemic reliability problems that require engineering work, not more on-call heroics.
- Runbooks: For every alert, a runbook explains: what is alerting, what does it mean, steps to diagnose, steps to remediate, escalation path. Keep runbooks in the same Git repo as code (owned, versioned, tested in GameDays).
- Escalation policy: Who do you call if the primary can't resolve it? Clear escalation chains prevent decision paralysis during incidents.
- Compensation: On-call engineers should be compensated for their availability. This varies by company but is non-negotiable for sustainable on-call.
- Post-incident toil budget: After a hard incident, give the on-call engineer time to recover and write the postmortem — don't immediately return them to feature work.
Platform Engineering
9 questions
DevOps is a culture and practice: breaking down silos between development and operations teams, enabling faster, more reliable delivery through automation and shared responsibility.
Platform engineering is a discipline that builds internal developer platforms (IDPs) — self-service infrastructure and tooling that enable development teams to ship software independently, without deep infrastructure knowledge.
The problem it solves: As DevOps scaled, "you build it, you run it" created cognitive overload. Every team had to learn Kubernetes, Terraform, observability stacks, secret management — a full platform expertise on top of their domain expertise. Platform engineering centralizes this complexity into a product that developer teams consume through simple abstractions.
Key characteristics of a good platform:
- Treated as a product — the platform team has a roadmap, does user research with developer teams, measures adoption and developer satisfaction (DORA metrics, SPACE metrics).
- Self-service — developers provision environments, deploy services, and get observability without filing tickets.
- Golden paths — opinionated, paved roads for common tasks (service scaffolding, deployment, observability) that are easy to follow but not mandatory.
- Reduces cognitive load — developers think about their domain, not about Kubernetes internals.
An Internal Developer Platform is the product built by a platform team that abstracts infrastructure complexity. It's the interface between developers and the underlying infrastructure.
Five planes of an IDP (Humanitec model):
- Developer Control Plane — where developers interact: Backstage portal, CLIs, APIs. Service catalog, self-service workflows, golden path templates.
- Integration and Delivery Plane — CI/CD pipelines (GitHub Actions, Tekton), GitOps (Argo CD, Flux), image building and signing.
- Monitoring and Logging Plane — pre-configured observability: Grafana dashboards per service, Loki log access, distributed tracing, pre-built SLO dashboards.
- Security and Identity Plane — automated certificate management (cert-manager), secrets injection (ESO), RBAC, workload identity (IRSA/Workload Identity), policy enforcement (Kyverno/OPA).
- Resource Plane — self-service infrastructure: databases (Crossplane or operators), queues, caches — provisioned via Kubernetes CRDs or Backstage workflows.
A developer on the platform should be able to: scaffold a new service, set up CI/CD, deploy to staging, provision a database, see their service's metrics and logs — all without a ticket to the platform team.
Backstage (CNCF, originally Spotify) is an open-source platform for building developer portals. Its core is a software catalog that provides a centralized view of all services, their owners, documentation, CI/CD status, and infrastructure dependencies.
Core concepts:
- Software Catalog: every service, library, website, and ML model is registered as an entity with a catalog-info.yaml file in its repo. The catalog aggregates metadata across thousands of components.
- Software Templates (Scaffolder): golden path templates for new services. A developer fills in a form, and Backstage creates a GitHub repo, generates code structure, registers the service in the catalog, and sets up CI/CD, all automated.
- TechDocs — docs-as-code rendered in Backstage. Markdown in the service repo → searchable documentation portal.
- Plugins — 200+ community plugins (Kubernetes, Argo CD, PagerDuty, Datadog, GitHub Actions). Backstage becomes the single pane of glass for developer tooling.
# catalog-info.yaml: register a service in Backstage
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  description: Payment processing service
  annotations:
    github.com/project-slug: myorg/payments-api
    backstage.io/techdocs-ref: dir:.
    argocd/app-name: payments-api-production
    pagerduty.com/service-id: P1234ABC
spec:
  type: service
  lifecycle: production
  owner: payments-team
  system: checkout-system
  dependsOn:
    - component:postgres-payments
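A catalog stays useful only if every descriptor carries the fields the portal relies on. The sketch below is a hypothetical pre-merge check, not Backstage's own validation. It assumes the YAML has already been parsed into a dict and only tests the Component fields that the descriptor format marks as required (`spec.type`, `spec.lifecycle`, `spec.owner`) plus `metadata.name`.

```python
REQUIRED_SPEC_FIELDS = ("type", "lifecycle", "owner")


def validate_component(entity):
    """Return a list of problems found in a parsed catalog-info.yaml Component."""
    problems = []
    if entity.get("kind") != "Component":
        problems.append("kind must be Component")
    if not (entity.get("metadata") or {}).get("name"):
        problems.append("metadata.name is required")
    spec = entity.get("spec") or {}
    for field in REQUIRED_SPEC_FIELDS:
        if not spec.get(field):
            problems.append(f"spec.{field} is required")
    return problems


if __name__ == "__main__":
    entity = {
        "apiVersion": "backstage.io/v1alpha1",
        "kind": "Component",
        "metadata": {"name": "payments-api"},
        "spec": {"type": "service", "lifecycle": "production"},
    }
    print(validate_component(entity))
```

Running a check like this in CI catches ownerless services before they reach the catalog, where a missing owner undermines on-call routing and cost attribution.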
DORA (DevOps Research and Assessment) metrics are the four metrics that most strongly predict software delivery performance and organizational performance, validated by years of research in the State of DevOps Report.
- Deployment Frequency — how often do you deploy to production? Elite performers deploy multiple times per day. Measure via CI/CD pipeline data.
- Lead Time for Changes — time from code commit to running in production. Elite: less than one hour. Measure via Git commit timestamp to deployment timestamp.
- Change Failure Rate — percentage of deployments that cause a production failure requiring hotfix/rollback. Elite: 0-15%. Measure via incidents tagged to deployments.
- Time to Restore Service — how long to recover when a failure does occur. Elite: less than one hour. Measure via incident management tools (PagerDuty, incident.io).
The first two measure velocity (speed); the last two measure stability. Elite teams achieve high velocity AND high stability — they are not in tension when engineering practices are mature. A fifth metric was added recently: Reliability (operational performance against SLOs).
Track DORA metrics in your engineering metrics dashboard (LinearB, Cortex, Backstage plugins, or custom Grafana dashboards using data from GitHub, PagerDuty, and your CD system).
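For a custom dashboard, the four metrics can be derived from a list of deployment records. The sketch below assumes a hypothetical `Deployment` record assembled from Git, CD, and incident data. Field names are illustrative, and the function expects at least one deployment in the window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once a failed deploy is recovered


def dora(deployments, window_days):
    """Compute the four DORA metrics for a non-empty list of deployments."""
    failures = [d for d in deployments if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore": (
            sum(restores, timedelta()) / len(restores) if restores else None
        ),
    }
```

The median is used for lead time so that one long-lived branch does not dominate the figure. Whether to report median or mean for restore time is a choice to make consistently across teams.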
Reasons for multiple clusters:
- Environment isolation: dev, staging, production in separate clusters. Strong blast radius isolation — a bad change in dev can't affect production.
- Regulatory/compliance: PCI scope, HIPAA, data residency requirements (EU data stays in EU clusters).
- Scale limits: A single Kubernetes cluster has practical limits (5,000 nodes, 150,000 pods per cluster). At hyperscale, you need multiple clusters.
- Multi-region HA: Active-active deployments across regions for disaster recovery and reduced latency for global users.
- Team isolation: Separate clusters per business unit for strong multi-tenancy (stronger than namespace isolation).
Multi-cluster management tools:
- Argo CD — single Argo CD instance managing N clusters. ApplicationSets for templated deployments across clusters.
- Fleet (Rancher) — GitOps at scale, designed for 1000+ clusters.
- Cluster API (CAPI) — Kubernetes-native cluster lifecycle management (provisioning, upgrading, deleting clusters via CRDs).
- Cilium ClusterMesh — cross-cluster service discovery and load balancing at the network layer.
Platform maturity can be assessed across several dimensions, each on a scale from ad-hoc to optimized:
- Provisioning: Level 1 — manual, ticket-based. Level 2 — scripted (Terraform in CI). Level 3 — self-service via portal (Backstage templates). Level 4 — policy-governed, automated, cost-optimized.
- Deployment: Level 1 — manual. Level 2 — automated push. Level 3 — GitOps with drift detection. Level 4 — progressive delivery with automated metric gates.
- Observability: Level 1 — logs only. Level 2 — metrics + alerting. Level 3 — traces, SLOs, on-call runbooks. Level 4 — automated anomaly detection, AIOps.
- Security: Level 1 — ad-hoc. Level 2 — automated scanning. Level 3 — policy as code, SBOM, image signing. Level 4 — zero-trust, continuous compliance validation.
Measuring platform success:
- DORA metrics of platform consumers (are teams deploying faster?)
- Time to first deployment for new service (minutes vs days)
- Platform adoption rate (% of teams using golden paths)
- Developer satisfaction surveys (NPS of the platform)
- Number of platform-related support tickets (trending down = good)
Multi-cluster service meshes enable services in different clusters to communicate with mTLS, observability, and load balancing — as if they were in the same cluster.
Istio multi-cluster models:
- Multi-primary (replicated control plane): Each cluster has its own Istiod. Clusters share a root CA so mTLS works cross-cluster. Services are discovered via ServiceEntry or multi-cluster service discovery. Most resilient: control plane failure in one cluster doesn't affect others.
- Primary-remote: One cluster has Istiod; remote clusters connect to it. Simpler, but the primary Istiod is a single point of failure for all clusters.
# Enable multi-cluster service discovery in Istio
# In cluster 1, create a secret with cluster 2's kubeconfig
istioctl create-remote-secret --context=cluster2 \
--name=cluster2 | kubectl apply -f - --context=cluster1
# Services in cluster1 can now reach services in cluster2
# via the east-west gateway with automatic mTLS
Cilium ClusterMesh is a simpler alternative at the network layer — no sidecars, pod IPs are routable across clusters, Kubernetes Services are globally discoverable. Pairs well with a Linkerd or no-mesh architecture for teams that want cross-cluster routing without full Istio complexity.
FinOps is the practice of bringing financial accountability to cloud spending. It involves engineering, finance, and product working together to maximize cloud value. The phases: Inform (understand spending) → Optimize (reduce waste) → Operate (make cost a continuous practice).
Kubernetes cost optimization tactics:
- Right-size resource requests: Over-provisioned requests waste reserved capacity. Use VPA recommendations or Goldilocks. This is often the highest-impact optimization.
- Spot/preemptible instances: Run stateless workloads and batch jobs on spot nodes (60-80% cheaper). Use node taints + tolerations to segment workloads. Configure PodDisruptionBudgets and retry logic to handle spot interruptions.
- Cluster consolidation: Karpenter or Cluster Autoscaler scale-down. Bin-pack pods onto fewer, larger nodes during off-peak hours.
- Namespace/team cost allocation: Label all resources with team/service/environment. Use Kubecost or OpenCost to show each team their spend. Visibility creates accountability.
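- Scale to zero: Use KEDA for event-driven workloads and for non-production environments. A plain HPA cannot go below one replica unless the alpha HPAScaleToZero feature gate is enabled; KEDA handles the 0-to-1 activation itself when the ScaledObject sets minReplicaCount: 0.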
- Reserved/committed use discounts: For baseline compute, use 1 or 3-year reserved instances/committed use contracts (30-60% savings). Spot for burst on top.
# Goldilocks: shows VPA recommendations for all deployments (requires the VPA recommender in the cluster)
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks --namespace goldilocks --create-namespace
kubectl label namespace production goldilocks.fairwinds.com/enabled=true
# Access dashboard on http://localhost:8080
kubectl -n goldilocks port-forward svc/goldilocks-dashboard 8080:80
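Two of the tactics above lend themselves to quick estimates: how much requested capacity is wasted, and what a committed baseline plus spot burst costs per month. The sketch below uses hypothetical prices and discount rates chosen from within the ranges quoted in the list. Real figures vary by provider, region, and instance type.

```python
def waste_fraction(requested, used):
    """Share of requested capacity that is paid for but not used."""
    return max(0.0, (requested - used) / requested)


def monthly_compute_cost(baseline_nodes, burst_node_hours, on_demand_hourly,
                         commit_discount=0.4, spot_discount=0.7, hours=730):
    """Baseline nodes on committed-use pricing, burst capacity on spot.
    Discount values are illustrative assumptions, not quoted prices."""
    baseline = baseline_nodes * hours * on_demand_hourly * (1 - commit_discount)
    burst = burst_node_hours * on_demand_hourly * (1 - spot_discount)
    return baseline + burst


if __name__ == "__main__":
    print(f"waste: {waste_fraction(requested=4000, used=1000):.0%}")
    print(f"monthly: ${monthly_compute_cost(10, 1000, 0.40):,.2f}")
```

Showing each team its own `waste_fraction` next to its monthly cost is usually more persuasive than a cluster-wide total, because it points at a specific request value they can change.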
Running ML training and inference on Kubernetes adds specific challenges: GPU resource management, large model artifacts, distributed training coordination, and cost optimization for expensive GPU instances.
GPU resource management:
# Request GPU resources in pod spec
resources:
  limits:
    nvidia.com/gpu: 1   # request 1 GPU
  requests:
    nvidia.com/gpu: 1   # requests must equal limits for GPUs
# Node taint for GPU nodes
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
# Pod toleration
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
Key considerations:
- NVIDIA Device Plugin — DaemonSet that exposes GPU as a Kubernetes resource. Required on every GPU node.
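- GPU time-slicing / MIG (Multi-Instance GPU): Time-slicing lets several pods share one GPU without memory or fault isolation, while MIG partitions a single A100 into up to 7 hardware-isolated GPU instances. Both are crucial for inference serving, where a single model often doesn't saturate a full GPU.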
- Distributed training (Kubeflow, Volcano): Multi-node distributed training (PyTorch DDP, TensorFlow) requires coordinated pod scheduling — all pods must start simultaneously and have fast networking (RDMA/InfiniBand on bare metal). Kubeflow's Training Operator handles this with PyTorchJob, TFJob CRDs.
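- Model storage: Large model weights (10-100GB) are stored in S3/GCS and pulled at startup. Use an init container that downloads into a shared emptyDir for model caching; a memory-backed (tmpfs) emptyDir counts against the pod's memory limit, so size it accordingly. Serving stacks such as KServe and Triton Inference Server can load models from object storage themselves.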
- Inference serving: KServe (formerly KFServing) provides standardized model serving APIs, autoscaling (including scale-to-zero), canary deployments for models, and A/B testing.
- Cost: GPU nodes are expensive (A100 = $10-30/hour). Use KEDA to scale inference deployments to zero during off-hours. Use spot GPU instances for training (checkpoint frequently).