System Design
A complete senior-level system design interview guide covering scalability fundamentals, distributed systems theory, databases and storage, caching strategies, messaging and event-driven architecture, API design, reliability patterns, and real-world case studies like designing URL shorteners, feeds, and payment systems.
Fundamentals
9 questionsA disciplined framework prevents you from jumping to solutions before understanding the problem. A proven approach:
- Clarify requirements (5 min): Ask about scale (DAU, QPS, data size), must-have vs nice-to-have features, read/write ratio, latency SLAs, consistency requirements. Never assume.
- Capacity estimation (3 min): Back-of-envelope math β requests per second, storage per day, bandwidth. This drives architectural choices.
- High-level design (10 min): Draw the main components β clients, load balancers, services, databases, caches. Cover the happy path end-to-end.
- Deep dive (15 min): Pick 2β3 critical components and go deep. The interviewer will guide you here based on what they care about.
- Identify bottlenecks and trade-offs (5 min): Where does the system fail? What did you trade off? What would you do differently at 10x scale?
Key habits: Think aloud β interviewers evaluate reasoning, not just answers. State trade-offs explicitly. Ask clarifying questions rather than assuming. It's a conversation, not a monologue.
// Quick capacity estimation template
Daily Active Users (DAU): 10M
Requests/user/day: ~10 reads, ~1 write
Total RPS: (10M Γ 10) / 86400 β 1,160 read RPS
: (10M Γ 1) / 86400 β 116 write RPS
Peak (3Γ): ~3,500 read RPS, ~350 write RPS
Storage per record: ~1KB
New records/day: 10M Γ 1 = 10M
Storage/day: 10M Γ 1KB = 10GB
5-year storage: 10GB Γ 365 Γ 5 β 18TB
The CAP theorem states that a distributed system can guarantee at most two of three properties simultaneously:
- Consistency (C): Every read receives the most recent write or an error. All nodes see the same data at the same time.
- Availability (A): Every request receives a (non-error) response, though it may not be the most recent data.
- Partition Tolerance (P): The system continues operating even if network messages are dropped or delayed between nodes.
The practical insight: Network partitions are inevitable in distributed systems, so you must tolerate P. The real choice is between C and A when a partition occurs.
- CP systems (Zookeeper, HBase, etcd): Return an error or wait rather than serve potentially stale data. Used for coordination, leader election, financial transactions.
- AP systems (Cassandra, DynamoDB, CouchDB): Continue serving possibly stale data during partitions and reconcile later. Used for social feeds, shopping carts, DNS.
Nuances: CAP describes a worst-case (during a partition). The PACELC theorem extends it: even without partitions, there's a trade-off between Latency and Consistency. Most modern systems offer tunable consistency levels rather than a binary choice.
// Cassandra tunable consistency example
// Write consistency
session.execute(
QueryBuilder.insertInto("users")
.value("id", userId)
.value("name", name)
.setConsistencyLevel(ConsistencyLevel.QUORUM) // majority of replicas must ack
);
// Read consistency
session.execute(
QueryBuilder.selectFrom("users").all()
.whereColumn("id").isEqualTo(literal(userId))
.setConsistencyLevel(ConsistencyLevel.ONE) // fast, may be stale
);
Vertical scaling (scale up): Add more resources (CPU, RAM, faster disks) to an existing machine. Simple β no code changes needed. Has a hard ceiling (largest available instance). Single point of failure. Works well for stateful workloads like traditional RDBMS.
Horizontal scaling (scale out): Add more machines and distribute load across them. Theoretically unlimited scale. Requires the application to be stateless or to externalize state. Adds complexity: load balancing, distributed coordination, data partitioning, network overhead.
Decision factors:
- Vertical first when it's cheaper and simpler β don't over-engineer prematurely.
- Horizontal when you hit vertical limits, need high availability (no SPOF), or need geographic distribution.
- Stateless services (web servers, API gateways) are natural horizontal scaler candidates.
- Databases are harder to scale horizontally β requires sharding, replication, or moving to a distributed DB.
// Horizontal scaling requires statelessness
// Bad: session stored in-process
HttpSession session = request.getSession();
session.setAttribute("cart", cartItems); // breaks with multiple instances
// Good: session stored externally
@Bean
public RedisIndexedSessionRepository sessionRepository(RedisOperations redisOperations) {
return new RedisIndexedSessionRepository(redisOperations);
// now any instance can serve any user
}
Latency is the time to complete a single operation (e.g., p99 response time of 50ms). Throughput is how many operations the system can complete per unit of time (e.g., 10,000 RPS).
They are often in tension: Batching increases throughput but increases latency per item. Parallelism can improve both but adds coordination overhead. Caching dramatically improves both until the cache is cold.
Little's Law: L = Ξ»W β the average number of items in a system equals arrival rate Γ average time in system. This links latency and throughput: if latency increases (e.g., a slow dependency), items pile up, reducing throughput unless you add capacity.
Design implications:
- Always measure P50, P95, P99 β averages hide tail latency disasters.
- A system with great average latency but terrible P99 will frustrate your heaviest users.
- Async processing trades latency (user waits less for ack) for deferred throughput (background workers process at their own pace).
- Connection pools, thread pools, and queues all buffer between producers and consumers to absorb throughput spikes at the cost of latency under load.
// Tail latency matters β one slow call tanks perceived performance
// With 10 parallel microservice calls, P99 of the chain β 1 - (1 - P99_single)^10
// If each service has P99=10ms, the combined P99 β 100ms (worst-case fan-out)
// Hedged requests: send duplicate requests, use first response
CompletableFuture primary = callService();
CompletableFuture hedge = primary.orTimeout(50, MILLISECONDS)
.exceptionallyCompose(_ -> callService()); // retry after deadline
hedge.join();
Synchronous: The caller blocks waiting for a response. Simple to reason about, easy error handling, natural request-response semantics. Tight coupling β if the downstream is slow or down, the caller is blocked.
Asynchronous: The caller sends a message and continues. Loose coupling, better fault isolation, natural for workflows that span long durations. Harder to debug, eventual consistency is the norm, requires idempotency.
Use synchronous when:
- The response is needed to continue (e.g., auth check, payment authorization).
- Operations are fast (<100ms) and the dependency is reliable.
- You need immediate confirmation (e.g., form validation).
Use asynchronous when:
- The operation is slow (email sending, video encoding, ML inference).
- You want to absorb bursts β queue acts as a buffer.
- You need to fan out to multiple consumers (event-driven).
- You want temporal decoupling β sender and receiver don't need to be up simultaneously.
// Synchronous: user waits for the whole chain
POST /checkout β PaymentService β InventoryService β EmailService β 200 OK
// Total latency: sum of all calls. Email failure fails the checkout.
// Asynchronous: user gets immediate ack
POST /checkout β PaymentService β 202 Accepted
β publishes OrderPaid event
InventoryService subscribes β reserves stock
NotificationService subscribes β sends email
// Faster response, better isolation, but eventual consistency.
A load balancer distributes incoming traffic across multiple backend instances to improve availability, throughput, and reliability. It also performs health checks and removes unhealthy instances from rotation.
Layer 4 vs Layer 7: L4 (transport) routes based on IP/TCP without inspecting content β fast but less intelligent. L7 (application) can route based on HTTP path, headers, cookies β enables URL-based routing, SSL termination, and advanced features.
Algorithms:
- Round Robin: Requests distributed sequentially. Simple; poor for heterogeneous requests (some much heavier than others).
- Weighted Round Robin: Servers with more capacity get proportionally more requests. Good when instances differ in size.
- Least Connections: Route to the instance with fewest active connections. Better for long-lived connections (WebSockets, gRPC streams).
- Least Response Time: Route to fastest-responding server. Adapts to real-time performance differences.
- IP Hash / Consistent Hashing: Same client always hits same server. Enables server-side session affinity (sticky sessions). Required for stateful protocols.
- Random with Two Choices (Power of Two): Pick two random servers, route to the less loaded one. Near-optimal distribution with low coordination overhead. Used by Nginx, Envoy.
# Nginx upstream configuration
upstream backend {
least_conn; # algorithm
server 10.0.0.1:8080 weight=3; # 3x more traffic
server 10.0.0.2:8080 weight=1;
server 10.0.0.3:8080 backup; # only used if others are down
keepalive 32; # connection pool to backends
}
server {
location / {
proxy_pass http://backend;
proxy_next_upstream error timeout http_502;
}
}
A Content Delivery Network is a geographically distributed network of edge servers that cache and serve content closer to users, reducing latency and offloading origin traffic.
How it works: When a user requests a resource, DNS resolves to the nearest edge PoP (Point of Presence). If the edge has the content (cache hit), it serves it directly. On a cache miss, the edge fetches from origin, caches it, and serves it. Subsequent users in that region get the cached version.
What to put on a CDN:
- Static assets: JS, CSS, images, fonts, video files.
- Cacheable API responses with appropriate Cache-Control headers.
- Software downloads, large media files.
Advanced CDN features: SSL termination at the edge, DDoS mitigation, WAF (Web Application Firewall), image optimization (resize/format conversion), edge compute (Cloudflare Workers, Lambda@Edge) for dynamic logic at the edge without hitting origin.
Cache invalidation: The hardest part. Strategies: versioned URLs (/app.v4.js), short TTLs, CDN purge APIs on deploy, stale-while-revalidate for gradual rollout.
# Effective Cache-Control headers
# Static assets (versioned by hash in filename)
Cache-Control: public, max-age=31536000, immutable
# HTML β must be fresh
Cache-Control: no-cache # revalidate every request, use ETag
# API response β cacheable but short-lived
Cache-Control: public, max-age=60, stale-while-revalidate=300
# Never cache
Cache-Control: no-store # sensitive data, user-specific pages
The problem with naive modulo hashing: If you assign keys to nodes with node = hash(key) % N, adding or removing a node changes N and remaps nearly every key β causing a massive cache invalidation or data migration storm.
Consistent hashing: Place nodes and keys on a virtual ring (hash space 0 to 2Β³Β²). Each key is assigned to the first node clockwise from its hash position. When a node is added or removed, only the keys between the new/removed node and its predecessor need to be remapped β O(K/N) keys instead of O(K).
Virtual nodes: Each physical node maps to multiple positions on the ring (virtual nodes). This improves load distribution and minimizes hotspots when nodes have different capacities. Used by Cassandra, Amazon DynamoDB, Riak.
// Simplified consistent hashing concept
TreeMap<Long, String> ring = new TreeMap<>();
int virtualNodes = 150;
// Add servers to ring
for (String server : servers) {
for (int i = 0; i < virtualNodes; i++) {
long hash = hash(server + "#" + i);
ring.put(hash, server);
}
}
// Find server for a key
String getServer(String key) {
long hash = hash(key);
Map.Entry<Long, String> entry = ring.ceilingEntry(hash);
if (entry == null) entry = ring.firstEntry(); // wrap around
return entry.getValue();
}
// Adding a server: only keys between previous and new node remapped
// Removing a server: only that server's keys remapped to successor
Real-world users: Memcached client libraries, Redis Cluster, Cassandra, Chord DHT, Nginx upstream hashing.
Monolith: All functionality in one deployable unit. Simple to develop, test, and deploy initially. As it grows: slow builds, risky deploys (one bug can take everything down), hard to scale individual components, organizational friction (hundreds of devs touching one codebase).
SOA (Service-Oriented Architecture): Services communicate via an Enterprise Service Bus (ESB). Chunked into larger, shared services. Common in enterprises circa 2000β2010. ESB became a bottleneck and single point of failure; services were often too coarse-grained.
Microservices: Small, independently deployable services each owning their data. Communicate via APIs or events. Enables independent scaling, deployment, and team ownership (Conway's Law). The cost: distributed systems complexity β network calls, eventual consistency, distributed tracing, service discovery, operational overhead.
The uncomfortable truth: Most startups should start with a monolith. Microservices pay off only once you have the scale problem and the team to manage the complexity. A well-structured monolith beats a poorly implemented microservices architecture every time.
- Start with monolith if: team <20 engineers, product-market fit not yet proven, domain not well understood.
- Move to microservices when: teams are stepping on each other, need to scale individual components 10Γ differently, or deployment velocity is bottlenecked by a large shared codebase.
- Strangler Fig pattern: Extract services incrementally from a monolith rather than a big-bang rewrite.
Scalability
8 questionsA rate limiter controls the rate of requests a client can make to prevent abuse, ensure fair usage, and protect downstream services.
Algorithms:
- Token Bucket: A bucket holds tokens (max capacity = burst limit). Tokens refill at a fixed rate. Each request consumes one token; rejected if empty. Allows bursts up to bucket capacity. Used by AWS API Gateway.
- Leaky Bucket: Requests enter a queue (the bucket) and are processed at a fixed rate β smooths out bursts completely. Good for rate-shaping outbound traffic, bad if you want to allow bursts.
- Fixed Window Counter: Count requests per time window (e.g., 1000/min). Simple. Vulnerable to boundary attacks β 1000 at 00:59 + 1000 at 01:00 = 2000 in 2 seconds.
- Sliding Window Log: Keep a log of request timestamps. Count only timestamps within the rolling window. Accurate but memory-intensive for high QPS.
- Sliding Window Counter: Approximation: combine current + previous window count weighted by overlap. Memory-efficient, near-accurate. Used by Cloudflare, Redis-based solutions.
-- Redis sliding window rate limiter (Lua atomic script)
local key = KEYS[1] -- "rate:user:{id}"
local limit = tonumber(ARGV[1]) -- 100
local window = tonumber(ARGV[2]) -- 60 (seconds)
local now = tonumber(ARGV[3])
-- Remove expired entries
redis.call('ZREMRANGEBYSCORE', key, 0, now - window * 1000)
local count = redis.call('ZCARD', key)
if count < limit then
redis.call('ZADD', key, now, now) -- add current timestamp
redis.call('EXPIRE', key, window)
return 1 -- allowed
else
return 0 -- rejected, return 429
end
Distributed rate limiting: Use Redis as a shared counter. For multi-region, use local counters with periodic sync (eventual) or a dedicated rate-limiting service (e.g., Envoy's global rate limiting via gRPC).
Apply these techniques in roughly this order β each adds complexity, so go as far as you need:
- Index optimization: Missing indexes are the #1 cause of slow queries. Use
EXPLAIN ANALYZEto find sequential scans. Add composite indexes for common query patterns. Drop unused indexes (they slow writes). - Query optimization: Avoid N+1 queries, SELECT *, and functions on indexed columns in WHERE. Use query caching and connection pooling (PgBouncer, HikariCP).
- Read replicas: Replicate to one or more read-only replicas. Route read traffic there, writes stay on primary. Solves read-heavy workloads. Replication lag is a trade-off.
- Caching layer: Cache hot query results in Redis/Memcached. Reduces DB hits by 80β90% for read-heavy workloads. Cache invalidation is the challenge.
- Vertical scaling: Upgrade to a larger instance. Buys time cheaply.
- Partitioning: Range or list partition large tables (e.g., events by month). Each partition is a separate file β queries against one partition scan less data.
- Sharding: Horizontal partitioning across multiple database servers. Each shard holds a subset of rows (e.g., by user ID range or hash). Complex β cross-shard queries, shard rebalancing, and distributed transactions are hard.
- CQRS: Separate write model (normalized, transactionally consistent) from read model (denormalized, fast). Allows the read store to be a different technology optimized for queries.
-- Read replica routing in Spring Boot
@Configuration
public class DataSourceConfig {
@Bean @Primary
public DataSource writeDataSource() { /* primary */ }
@Bean
public DataSource readDataSource() { /* replica */ }
@Bean @Primary
public DataSource routingDataSource() {
var routing = new AbstractRoutingDataSource() {
@Override protected Object determineCurrentLookupKey() {
return TransactionSynchronizationManager.isCurrentTransactionReadOnly()
? "read" : "write";
}
};
routing.setTargetDataSources(Map.of("write", writeDataSource(), "read", readDataSource()));
return routing;
}
}
// Usage: @Transactional(readOnly = true) routes to replica automatically
Sharding splits data across multiple database instances (shards), each owning a subset. Each shard is an independent database server; together they hold the full dataset.
Sharding strategies:
- Range-based: Shard by value range (user IDs 1β1M on shard 1, 1Mβ2M on shard 2). Easy to reason about; can cause hotspots (new users go to the last shard).
- Hash-based:
shard = hash(key) % N. Even distribution; resharding requires moving ~all data when N changes. Mitigated by consistent hashing. - Directory-based: A lookup service maps keys to shards. Flexible β can rebalance without rehashing. The lookup service is a critical dependency.
- Geographic: Users in EU on EU shard, US users on US shard. Natural for data residency requirements.
Challenges:
- Cross-shard queries: JOINs across shards require scatter-gather (query all shards and merge in application code) β slow and expensive.
- Distributed transactions: Two-phase commit or saga patterns needed for transactions spanning shards.
- Hotspot / celebrity problem: A single very active entity (huge user, viral post) overloads its shard. Solution: sub-shard that entity or use a separate dedicated shard.
- Rebalancing: Adding shards requires migrating data. Non-trivial to do without downtime.
- Schema changes: Must be applied to every shard, often serially β slow for large shard counts.
When a user with millions of followers (celebrity) posts, delivering that post to all followers' feeds requires massive fan-out. At Twitter's scale, a single BeyoncΓ© tweet needed to be written to 30M+ feeds within seconds.
Push model (fan-out on write): When a post is created, immediately write it to all followers' feed caches. Feed reads are O(1). Writes are expensive for celebrities (30M writes per post). Wastes compute for inactive followers who never load their feed.
Pull model (fan-out on read): When a user loads their feed, fetch the latest posts from everyone they follow and merge. Works for small follow lists. At scale (following 1000 people), merging 1000 streams per load is too slow.
Hybrid model (Twitter's approach):
- For regular users: push model β fan-out on write to followers' cached feeds.
- For celebrities (followers > threshold): pull model β do NOT fan-out at write time. Instead, at read time, merge the user's pre-built feed with a real-time fetch of any celebrity posts they follow.
- Result: fast reads for everyone, bounded write amplification, and fresh celebrity content.
// Hybrid feed generation (pseudocode)
List<Post> getFeed(long userId) {
// Pre-built feed (push model for regular users)
List<Post> cachedFeed = redisCache.get("feed:" + userId);
// Fetch celebrities the user follows (pull model)
List<Long> celebrities = followGraph.getCelebrities(userId);
List<Post> celebPosts = postStore.getRecentPosts(celebrities, limit=20);
// Merge and sort by timestamp
return Stream.concat(cachedFeed.stream(), celebPosts.stream())
.sorted(Comparator.comparing(Post::createdAt).reversed())
.limit(50)
.toList();
}
Forward proxy: Sits between clients and the internet. Clients configure it explicitly. Hides client identity from the server (anonymization). Used for content filtering, corporate internet access control, privacy (VPNs, Tor).
Reverse proxy: Sits in front of backend servers. Clients don't know it exists β they think they're talking to the server. Hides the backend topology from clients. Used for load balancing, SSL termination, caching, compression, authentication, and as an API gateway.
Reverse proxy use cases:
- SSL termination: Decrypt HTTPS once at the proxy; backends communicate over plain HTTP internally. Centralizes certificate management.
- Caching: Cache static content at the proxy layer to reduce backend load.
- Compression: Gzip responses at the proxy, saving bandwidth without backend code changes.
- Security: Hide backend IP addresses, apply WAF rules, rate limiting.
- Blue-green deployments: Switch traffic between old and new backend versions by updating proxy config.
# Nginx as reverse proxy with SSL termination + caching
server {
listen 443 ssl;
ssl_certificate /etc/ssl/certs/app.crt;
ssl_certificate_key /etc/ssl/private/app.key;
# Serve cached static content directly
location /static/ {
root /var/cache/nginx;
expires 1y;
add_header Cache-Control "public, immutable";
}
# Proxy dynamic requests to backend cluster
location /api/ {
proxy_pass http://backend_cluster;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
gzip on;
}
}
WebSocket connections are persistent, bidirectional TCP connections. Each consumes a file descriptor and memory. Handling 1M concurrent connections requires event-driven, non-blocking I/O β not a thread-per-connection model.
Key design decisions:
- Connection servers: Use event-loop based servers (Node.js, Netty, Go goroutines, Nginx). A single Nginx/Node process can hold 100K+ connections. Need 10β20 connection servers for 1M. These are I/O bound, not CPU bound.
- Separate connection handling from business logic: Connection servers should only manage the socket β forwarding messages to application servers via a message bus (Kafka, Redis pub/sub). Don't put business logic in connection servers.
- Presence and routing: Track which connection server holds which user (store in Redis:
user:{id} β server:{id}). When sending a message to a user, look up their server and forward via pub/sub. - Heartbeats: Detect dead connections with periodic pings. Purge stale presence records. TCP keepalive alone is insufficient across load balancers.
// Message routing between connection servers
// Server A holds user Alice's WebSocket
// Server B holds user Bob's WebSocket
// Alice sends Bob a message:
// 1. Alice's message arrives at Server A
// 2. Server A publishes to channel "user:bob" on Redis
redis.publish("user:" + bobId, message.toJson());
// 3. Server B (holding Bob's socket) is subscribed
redis.subscribe("user:" + bobId, (channel, msg) -> {
WebSocketSession bobSession = connectionRegistry.get(bobId);
if (bobSession != null) bobSession.sendMessage(new TextMessage(msg));
});
// Horizontal scaling: 20 servers Γ 50K connections = 1M total
Push: The producer actively sends data to consumers as soon as it's available. Low latency for consumers. Producer must know about consumers and manage delivery. Risk of overwhelming slow consumers (backpressure needed). Examples: WebSockets, SSE, webhooks, push notifications, email.
Pull: Consumers request data from producers when they're ready. Consumers control their ingestion rate naturally. Higher latency (polling interval). Producer doesn't need to know consumers. Simpler delivery semantics. Examples: REST polling, Kafka consumers, database read replicas, RSS.
Backpressure in push systems: If the producer sends faster than the consumer can handle, you need a buffer (queue) and a signal to slow the producer. Reactive Streams (Project Reactor, RxJava) define a standard backpressure protocol. Without backpressure, consumers crash or drop messages.
Long polling: A hybrid β the consumer sends a request, the server holds it open until data is available (or times out). Simulates push over HTTP without WebSockets. Higher latency than true push but works through firewalls. Used by older chat systems, Comet applications.
Multi-region design is fundamentally about trading latency against consistency and operational complexity.
DNS-based geo-routing: Route users to the nearest region via GeoDNS (AWS Route 53 latency-based routing, Cloudflare). The edge is the first hop β CDN PoPs handle static content globally.
Data residency strategies:
- Active-passive: One region handles all writes; others are replicas. Simple, strong consistency. Failover takes seconds to minutes. Users far from primary region suffer write latency.
- Active-active: Multiple regions accept writes. Lower write latency globally. Requires conflict resolution for concurrent writes to the same entity. Very complex. Used by DynamoDB Global Tables, CockroachDB, Spanner.
- Region-pinning: Each user's data lives in their home region (EU users' data stays in EU β GDPR). Reads are local and fast. Cross-region requests (user travels) are slower or redirected.
Operational concerns: Replication lag monitoring, inter-region traffic costs (expensive), split-brain scenarios, region failover runbooks, data sovereignty compliance (GDPR, CCPA).
# AWS Route 53 latency-based routing
# Automatically routes users to lowest-latency region
aws route53 create-record --hosted-zone-id Z123 --change-batch '{
"Changes": [{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "api.myapp.com",
"Type": "A",
"Region": "us-east-1", β AWS measures latency from client
"SetIdentifier": "us-east",
"AliasTarget": { "DNSName": "us-east-alb.amazonaws.com" }
}
}]
}'
Distributed Systems
10 questionsEventual consistency guarantees that if no new updates are made to a data item, all replicas will eventually converge to the same value. There is no guarantee on when. Between writes and convergence, readers may see stale or conflicting data.
Handling it in code:
- Read-your-own-writes consistency: After a user updates their profile, route their next read to the same primary or use a "write token" β a timestamp or version that the read service must respect.
- Idempotent operations: Make all writes idempotent with a client-generated UUID. Safe to retry without double-processing.
- Optimistic locking: Include a version/etag in updates; reject if the version has changed since you read it.
- Conflict resolution: Last-Write-Wins (LWW) by timestamp (risky if clocks skew), semantic merging (CRDT β Conflict-free Replicated Data Types), or user-prompted resolution.
- UI patterns: Optimistic UI β show the user their write immediately, reconcile if a conflict is detected later.
// Read-your-own-writes: attach a replication token
String updateProfile(long userId, ProfileUpdate update) {
String replicationToken = profileService.update(userId, update);
// Token encodes the replication lag watermark
return replicationToken;
}
// Subsequent read passes the token to ensure it sees the write
Profile getProfile(long userId, String replicationToken) {
// Replica waits until it has applied at least up to replicationToken
return profileService.read(userId, replicationToken);
}
In microservices, a single business transaction (e.g., place an order) spans multiple services (Order, Payment, Inventory, Shipping), each with its own database. You can't use a traditional ACID transaction across them. The Saga pattern solves this.
A saga is a sequence of local transactions, each publishing an event or calling the next service. If any step fails, compensating transactions undo the previous steps.
Choreography-based saga: Services react to events. No central coordinator. Decentralized. Hard to visualize the overall flow; debugging is harder. Good for simple flows.
Orchestration-based saga: A saga orchestrator (a dedicated service or workflow engine) tells each participant what to do and handles failures. The flow is explicit and in one place. Easier to monitor. Single point of coordination (not a true SPOF since it can be replicated). Used by Apache Camel, Temporal, AWS Step Functions, Conductor.
// Temporal orchestration-based saga (Java SDK)
@WorkflowImpl
public class OrderSagaImpl implements OrderSaga {
private final PaymentService paymentService =
Workflow.newActivityStub(PaymentService.class, options);
@Override
public void placeOrder(Order order) {
String paymentId = null;
try {
paymentId = paymentService.charge(order.userId(), order.amount());
inventoryService.reserve(order.items());
shippingService.schedule(order);
} catch (Exception e) {
// Compensate in reverse order
if (paymentId != null) paymentService.refund(paymentId);
inventoryService.release(order.items()); // idempotent
throw new OrderFailedException(e);
}
}
}
Key requirements: All saga steps must be idempotent (retries are inevitable). Compensating transactions must always succeed (they can't fail without a saga for the saga). The system will be temporarily inconsistent between steps β design the UI and downstream services to tolerate this.
2PC is a distributed protocol to achieve atomicity across multiple participants (databases). It has two phases:
- Prepare phase: The coordinator asks all participants "can you commit?" Each participant acquires locks, writes to a durable log, and responds Yes or No.
- Commit phase: If all say Yes, coordinator sends Commit. If any say No, coordinator sends Abort. Participants release locks after applying the decision.
Problems:
- Blocking protocol: If the coordinator fails after Prepare but before Commit, participants are stuck holding locks indefinitely β they can't commit or abort without the coordinator's decision.
- Single point of failure: Coordinator failure can halt the entire system.
- Performance: Two network round trips plus log flushes; locks held across the network. Terrible latency under contention.
- Not suitable for microservices: Requires all participants to support XA-style distributed transactions. Most message brokers, NoSQL stores, and modern services don't.
3PC adds a pre-commit phase to reduce the blocking window but adds another round trip and still can't handle network partitions cleanly.
Modern alternatives: Saga pattern for microservices. Google Spanner uses TrueTime + 2PC but within a tightly controlled infrastructure with atomic clocks. Most teams avoid 2PC and design for eventual consistency instead.
Leader election ensures that exactly one node in a cluster acts as the "leader" for coordination tasks β accepting writes in a primary-replica DB, assigning tasks to workers, or holding a distributed lock.
Common algorithms:
- Raft: The most widely used modern consensus algorithm. Nodes are followers or candidates. If a follower doesn't hear from a leader within an election timeout, it becomes a candidate and requests votes. A node wins if it gets a majority. Leader sends heartbeats to maintain authority. Used by etcd, CockroachDB, TiKV.
- Paxos: The original consensus algorithm. Provably correct but hard to understand and implement. Multi-Paxos variants used in Google Chubby, Spanner.
- Bully algorithm: The node with the highest ID wins. Simple but produces leader churn when the highest-ID node is flaky.
In practice β use ZooKeeper or etcd: Don't implement Raft yourself. Use a battle-tested coordination service.
// Leader election with etcd (Java)
// Use an etcd lease β if the leader dies, lease expires and a new election runs
EtcdClient client = Client.builder().endpoints("http://etcd:2379").build();
Lease leaseClient = client.getLeaseClient();
// Create a lease with TTL
long leaseId = leaseClient.grant(10).get().getID(); // 10-second TTL
// Try to win election by setting key
KV kv = client.getKVClient();
String electionKey = "/leader/my-service";
PutOption option = PutOption.newBuilder().withLeaseId(leaseId).build();
kv.put(ByteSequence.from(electionKey, UTF_8),
ByteSequence.from(myNodeId, UTF_8), option).get();
// Keep lease alive (heartbeat)
leaseClient.keepAlive(leaseId, observer); // sends keep-alive every 3s
// If this node dies: lease expires β key deleted β other nodes detect absence β new election
Strong (Linearizable) consistency: Any read always returns the most recent write. The system behaves as if there is a single copy of the data. Reads and writes happen atomically in real-time order. Slow β requires coordination before every operation. Example: ZooKeeper znodes, Google Spanner, relational DB on a single primary.
Sequential consistency: Operations appear to execute in some sequential order consistent with each program order, but not necessarily real-time order. Weaker than linearizability. Easier to implement. Example: multi-core CPU memory models (without memory barriers).
Causal consistency: Operations that are causally related appear in the same order to all nodes. Concurrent (causally independent) operations may appear in different orders. Example: "reply to a message" must appear after the original message. MongoDB sessions offer causal consistency.
Eventual consistency: All replicas will converge to the same value, eventually, if writes stop. No guarantee on when. Example: DNS propagation, Cassandra with ONE consistency, S3 object reads after overwrite.
Weak consistency: No guarantees at all β after a write, reads may or may not see it. Used in caches and best-effort delivery systems where performance trumps accuracy. Example: Memcached, UDP-based metrics.
The problem: You want to update your database and publish a message to Kafka/RabbitMQ atomically. But these are two separate systems β you can't span them in a single transaction. If you update the DB and the app crashes before publishing the event, the event is lost. If you publish first and the DB update fails, you've sent a phantom event.
The Outbox Pattern: Write the event into an outbox table in the same database transaction as the business operation. A separate relay process reads uncommitted events from the outbox and publishes them to the message broker, then marks them as published. The database transaction guarantees atomicity of the business data + event; the relay guarantees delivery.
// In one ACID transaction:
@Transactional
public void placeOrder(Order order) {
orderRepository.save(order); // business operation
// Persist event to outbox β same transaction
OutboxEvent event = new OutboxEvent(
UUID.randomUUID(),
"OrderPlaced",
objectMapper.writeValueAsString(new OrderPlacedEvent(order.id()))
);
outboxRepository.save(event);
// Transaction commits: order + outbox entry saved atomically
}
// Separate relay (runs every second, or CDC-based):
@Scheduled(fixedDelay = 1000)
public void relayOutboxEvents() {
List<OutboxEvent> pending = outboxRepo.findByPublishedFalse();
for (OutboxEvent event : pending) {
kafkaTemplate.send("orders", event.payload()).get();
event.setPublished(true);
outboxRepo.save(event);
}
}
CDC-based Outbox (better): Use Debezium to capture the outbox table changes from the DB's write-ahead log. Zero polling overhead, near-real-time delivery, no need for a scheduled job.
A distributed lock ensures mutual exclusion across processes on different machines β only one holder can perform a critical section at a time (e.g., cron job, payment processing, leader-level tasks).
Simple Redis implementation:
// SET NX EX: atomic "set if not exists with expiry"
// Returns OK if lock acquired, null if already held
String token = UUID.randomUUID().toString(); // unique per lock holder
Boolean acquired = redis.set("lock:resource", token,
SetArgs.Builder.nx().ex(30)); // 30s TTL prevents deadlock if holder crashes
if (acquired == null) throw new LockUnavailableException();
try {
doWork();
} finally {
// MUST only delete if WE still hold the lock (compare-and-delete, atomic via Lua)
String script = "if redis.call('get',KEYS[1])==ARGV[1] then " +
"return redis.call('del',KEYS[1]) else return 0 end";
redis.eval(script, List.of("lock:resource"), List.of(token));
}
Critical pitfalls:
- TTL too short: Lock expires while holder is still working β two holders simultaneously. Use lock extension (refresh TTL if still working) or a fencing token to detect stale lock holders.
- Deleting someone else's lock: Always use a unique token and compare-and-delete, never a plain DEL.
- Single Redis node failure: If the Redis node dies, the lock is lost. RedLock algorithm acquires the lock on N/2+1 independent nodes for better fault tolerance. Martin Kleppmann has a famous critique of RedLock β for maximum safety, use ZooKeeper or etcd.
- GC pause problem: A JVM GC pause can cause the lock TTL to expire while the thread is paused but thinks it still holds the lock. Use fencing tokens (monotonically increasing IDs) and have the resource validate the token to detect this.
Physical clocks can't be trusted for ordering events across distributed nodes (clock skew, NTP drift). Logical clocks solve this.
Lamport timestamps: Each node maintains a counter. It increments the counter on every send, and on receive, sets its counter to max(local, received) + 1. Gives a partial causal order: if A happened-before B, then timestamp(A) < timestamp(B). The converse is not true β equal timestamps don't mean concurrent events.
Vector clocks: Each node maintains a vector of counters (one per node). On send, increment own counter. On receive, merge vectors element-wise with max, then increment own. Two events are concurrent if neither vector dominates the other. Used by DynamoDB, Riak, CouchDB to detect conflicting writes.
// Vector clock merge
Node A: [A=3, B=1, C=0] // after writing
Node B: [A=2, B=4, C=0] // after its own write
// Both updated same key concurrently β CONFLICT detected:
// A's vector does not dominate B's (A=3 > B=2 but B=1 < B=4)
// Requires conflict resolution: LWW, semantic merge, or user prompt
// vs. causal sequence:
Node A writes: [A=1, B=0]
Node B reads A's version, then writes: [A=1, B=1]
// B's vector dominates A's β B's write supersedes A's, no conflict
Practical use: Vector clocks add metadata overhead and complexity. Many systems (DynamoDB, Cassandra) use simpler Last-Write-Wins with physical timestamps instead, accepting occasional lost writes as an acceptable trade-off.
At-most-once: Message delivered zero or one times. Sender fires and forgets. Fast, low overhead. If the consumer crashes before processing, the message is lost. Acceptable for metrics, logs, best-effort notifications. UDP is the canonical example.
At-least-once: Message delivered one or more times. Sender retries until it gets an ack. Duplicates are possible if the consumer processes but crashes before acking. Requires consumers to be idempotent (processing the same message twice has the same effect as once). This is the default in most message brokers (Kafka, RabbitMQ).
Exactly-once: The ideal, but the hardest. True exactly-once across two systems requires distributed consensus. Options:
- Idempotency + deduplication: Use at-least-once + a dedup store (Redis set, DB unique constraint on message ID). Semantically behaves like exactly-once.
- Kafka transactional producers + idempotent consumers: Kafka's transactions ensure atomic produce + consumer offset commit within Kafka. Exactly-once within the Kafka stream, but side effects (DB writes) still need idempotency.
- Transactional outbox: Write the event and business operation in one DB transaction, relay atomically.
// Idempotent consumer β deduplication with a unique constraint
@KafkaListener(topics = "orders")
@Transactional
public void handle(OrderPlaced event) {
// DB unique constraint prevents double-processing
if (processedEventRepo.existsById(event.eventId())) {
log.info("Duplicate event {}, skipping", event.eventId());
return;
}
orderService.process(event);
processedEventRepo.save(new ProcessedEvent(event.eventId()));
}
Service discovery solves the problem of services finding each other in a dynamic environment where instances appear and disappear, and IP addresses change.
Client-side discovery: Clients query a service registry (Consul, Eureka) to get a list of healthy instances and apply their own load balancing logic. More control, more complexity in the client. Netflix Ribbon + Eureka.
Server-side discovery: The client makes a request to a router (load balancer or service mesh). The router queries the registry and forwards. Client doesn't need to know about the registry. AWS ALB + ECS service discovery, Kubernetes Services.
In Kubernetes: Service discovery is built in via DNS. Each Service object gets a stable DNS name (my-service.my-namespace.svc.cluster.local). kube-proxy maintains iptables rules to load-balance across pod IPs. Pods register themselves via labels; the Endpoints object tracks healthy pod IPs. No external registry needed β etcd (the Kubernetes control plane store) is the registry.
# Kubernetes Service β stable DNS name regardless of pod IP changes
apiVersion: v1
kind: Service
metadata:
name: order-service
spec:
selector:
app: order-service # matches pods with this label
ports:
- port: 8080
targetPort: 8080
# DNS: order-service.default.svc.cluster.local β round-robin across matched pods
# Pods come and go; DNS entry always resolves to healthy pods
# Traditional (Consul):
# Service registers on startup β Consul knows its IP + health check
# Client queries: curl http://consul/v1/health/service/order-service
# Gets back: [{Address: "10.0.1.5", Port: 8080}, ...]
Databases & Storage
10 questionsChoose NoSQL when: schema is dynamic or evolving rapidly; horizontal scale is a requirement from day one; access patterns are predictable and simple (no complex ad-hoc queries); you need very high write throughput; data is naturally hierarchical or graph-shaped.
Choose SQL when: data relationships are complex; ACID transactions are required; you need flexible ad-hoc querying; team knows SQL well; data integrity constraints are important.
NoSQL types:
- Document stores (MongoDB, CouchDB, Firestore): Store JSON/BSON documents. Great for content, user profiles, catalogs. Flexible schema. Secondary indexes available. Not great for highly relational data.
- Key-Value stores (Redis, DynamoDB, Riak): Ultra-fast lookup by key. No query flexibility. Perfect for caching, sessions, leaderboards, feature flags.
- Wide-column stores (Cassandra, HBase, BigTable): Rows with dynamic columns grouped into column families. Optimized for high-volume time-series, event logs, IoT. Design queries first β data model follows access patterns.
- Graph databases (Neo4j, Amazon Neptune): Nodes and edges with properties. Efficient for multi-hop relationship traversal. Social networks, recommendation engines, fraud detection, knowledge graphs.
- Time-series databases (InfluxDB, TimescaleDB, Prometheus): Optimized for append-heavy, time-ordered numerical data. Automatic downsampling and retention policies. Metrics, monitoring, financial tick data.
- Search engines (Elasticsearch, OpenSearch): Inverted indexes for full-text search, faceting, aggregations. Used alongside primary DBs for search features.
B-tree: A self-balancing tree where each node contains sorted keys and pointers to children or data pages. Reads and writes are O(log N). Leaf nodes are linked for efficient range scans. The height of a B-tree for 1 billion rows might be only 4β5 levels β that's 4β5 disk I/Os to find any record.
Clustered index: The table rows are physically stored in the order of the index key. There can only be one per table (InnoDB always clusters on the primary key). Range scans on the PK are extremely fast β sequential disk reads. Inserting rows out of order causes page splits (expensive).
Non-clustered (secondary) index: Separate structure with the indexed columns and a pointer back to the clustered index key (not the raw heap row in InnoDB). A secondary index lookup does two B-tree traversals: first the secondary index, then the PK to fetch the full row (index scan + heap fetch). Covering indexes avoid the second lookup by including all needed columns in the index.
-- Covering index: includes all columns the query needs
-- No "back to table" step needed
CREATE INDEX idx_orders_user_status
ON orders (user_id, status, created_at)
INCLUDE (total_amount); -- covering column
-- Query can be satisfied entirely from the index
SELECT status, created_at, total_amount
FROM orders
WHERE user_id = 123 AND status = 'PAID'
ORDER BY created_at DESC;
-- EXPLAIN shows: "Index Only Scan using idx_orders_user_status"
When to add an index: Columns in WHERE, JOIN ON, ORDER BY, GROUP BY with high cardinality and frequent queries. Beware: every index slows INSERT/UPDATE/DELETE (must be maintained). Rule of thumb: don't index columns with <10 distinct values (low cardinality) β the optimizer may prefer a full table scan anyway.
ACID: Atomicity (all-or-nothing transaction), Consistency (DB invariants are maintained), Isolation (concurrent transactions appear sequential), Durability (committed data survives crashes).
Isolation anomalies:
- Dirty read: Reading uncommitted data from another transaction.
- Non-repeatable read: Reading the same row twice within a transaction and getting different values (another transaction committed between reads).
- Phantom read: Repeating a range query yields different rows (another transaction inserted/deleted rows matching the predicate).
- Lost update: Two transactions read-then-write the same row; one's update overwrites the other's.
- Write skew: Two transactions read overlapping data and make decisions that together violate a constraint (classic: on-call doctor scheduling).
SQL isolation levels and what they prevent:
- READ UNCOMMITTED: Allows dirty reads. Fastest. Almost never use.
- READ COMMITTED: Prevents dirty reads. Default in PostgreSQL, Oracle, SQL Server. Non-repeatable reads possible.
- REPEATABLE READ: Prevents dirty + non-repeatable reads. Default in MySQL/InnoDB. Phantoms still possible (InnoDB uses gap locks to prevent them too).
- SERIALIZABLE: Full isolation. Slowest. Prevents all anomalies. PostgreSQL uses Serializable Snapshot Isolation (SSI) β optimistic, no locks until conflict detected.
-- PostgreSQL: detect and prevent write skew with SERIALIZABLE
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN;
SELECT count(*) FROM doctors ON CALL WHERE shift_id = 42; -- returns 2
-- Both Dr. A and Dr. B make this check simultaneously
-- Both see count=2 and decide to go off-call
UPDATE oncall SET active = false WHERE doctor_id = :me AND shift_id = 42;
COMMIT;
-- PostgreSQL SSI detects the conflict and aborts one transaction
-- β ERROR: could not serialize access due to read/write dependencies
Cassandra is a wide-column store optimized for high write throughput and low-latency reads on known access patterns. The golden rule: design tables for queries, not for entities.
Partition key: Determines which node(s) store the data. All rows with the same partition key are stored together on disk. Querying by partition key = single-node lookup (fast). Querying without it = scatter-gather across all nodes (slow, avoid).
Clustering columns: Determine the sort order of rows within a partition. They enable efficient range queries within a partition (e.g., "messages in conversation 123, ordered by timestamp descending, limit 20").
-- Design for the query: "get last N messages in a conversation"
CREATE TABLE messages (
conversation_id UUID, -- partition key: all messages in one convo on one node
created_at TIMESTAMP,-- clustering col: sorted within partition
message_id UUID, -- clustering col: breaks ties
sender_id UUID,
content TEXT,
PRIMARY KEY ((conversation_id), created_at, message_id)
) WITH CLUSTERING ORDER BY (created_at DESC);
-- Query is fast (single partition, sorted):
SELECT * FROM messages
WHERE conversation_id = :id
LIMIT 20;
-- NEVER do this (no partition key β full cluster scan):
SELECT * FROM messages WHERE sender_id = :id ALLOW FILTERING; -- β
Cassandra pitfalls: Partition size β keep partitions under 100MB / 100K rows (large partitions cause GC pressure and slow reads). Tombstones accumulate from deletes and expire after gc_grace_seconds β excessive tombstones slow reads dramatically. Anti-patterns: ALLOW FILTERING, unbounded partitions, and updating primary key components (impossible β delete + reinsert).
B-tree: Updates in-place on disk pages. Random I/O for writes. Excellent for reads β point queries and range scans. Better read performance, worse write performance. Used by PostgreSQL, MySQL, SQLite, most traditional RDBMS.
LSM-tree: All writes go to an in-memory buffer (MemTable). When full, flush to an immutable SSTable (Sorted String Table) on disk β sequential I/O, extremely fast. Background compaction merges SSTables and removes old versions. Reads must check MemTable + multiple SSTables (mitigated by Bloom filters and sparse indexes). Used by Cassandra, RocksDB, LevelDB, HBase, InfluxDB.
Trade-offs:
- LSM pros: High write throughput (sequential I/O), space-efficient (compression), handles write-heavy workloads (IoT, logging, time-series).
- LSM cons: Read amplification (check multiple levels), write amplification from compaction, compaction I/O can compete with user queries.
- B-tree pros: Predictable read performance, no read amplification for point queries, simpler implementation.
- B-tree cons: Random write I/O, write amplification from page updates, fragmentation over time.
// Bloom filter in LSM: avoid disk reads for missing keys
// Check if a key might exist in an SSTable before reading from disk
// False positives possible (say "might exist" when it doesn't)
// No false negatives (if it says "definitely absent", trust it)
BloomFilter<String> filter = BloomFilter.create(
Funnels.stringFunnel(UTF_8), 1_000_000, 0.01); // 1% FPR
filter.put("user:12345");
filter.mightContain("user:12345"); // β true (might exist, do disk read)
filter.mightContain("user:99999"); // β false (definitely absent, skip SSTable)
CDC captures every row-level change (INSERT, UPDATE, DELETE) from a database's write-ahead log (WAL) and streams them as events. No queries needed β it reads the DB's internal replication log.
How it works: Tools like Debezium connect to the DB's replication slot (PostgreSQL), binlog (MySQL), or redo log (Oracle). Changes are emitted as structured events to Kafka in near real-time.
Use cases:
- Event publishing without dual writes: Replace the Outbox pattern's scheduled polling with Debezium watching the outbox table β real-time, no polling overhead.
- Cache invalidation: Automatically invalidate or update Redis caches when the DB changes.
- Search index sync: Keep Elasticsearch in sync with the primary DB without dual-write code.
- Data warehousing / ETL: Stream changes to a data warehouse (BigQuery, Snowflake) for analytics.
- Audit logging: Capture a complete history of every data change for compliance.
- Materialized view maintenance: Keep denormalized read models up to date.
# Debezium PostgreSQL connector config (Kafka Connect)
{
"name": "postgres-cdc",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "postgres",
"database.dbname": "mydb",
"database.user": "debezium",
"plugin.name": "pgoutput", # PostgreSQL logical replication plugin
"table.include.list": "public.orders,public.outbox_events",
"topic.prefix": "myapp"
}
}
# β emits to Kafka topics: myapp.public.orders, myapp.public.outbox_events
# Each message: {"before": {...}, "after": {...}, "op": "u"}
A time-series database (TSDB) is optimized for data that is indexed primarily by time β metrics, events, logs, sensor readings, financial tick data. Writes are almost always appends (past data rarely changes).
What makes TSDBs special:
- Chunk-based storage: Data for a time range stored contiguously β efficient range scans and compression (delta encoding + Gorilla compression can achieve 12Γ compression on float metrics).
- Automatic downsampling: Aggregate old data (e.g., keep 1-second resolution for 24h, 1-minute for 30d, 1-hour for 2y). Dramatically reduces storage costs.
- Retention policies: Auto-expire old data without expensive DELETE operations.
- Purpose-built query languages: PromQL, Flux β designed for time-window aggregations, rate calculations, anomaly detection.
When to use a TSDB (InfluxDB, TimescaleDB, Prometheus, Victoria Metrics):
- Storing application metrics, infrastructure metrics, IoT sensor data.
- Financial tick data, order book snapshots.
- High-cardinality append-only data with time-range query patterns.
When a general-purpose DB suffices: Low volume (<10K events/sec), you already have PostgreSQL (TimescaleDB is a PG extension), or the data has complex relationships beyond time-series.
-- TimescaleDB: PostgreSQL extension, familiar SQL with TSDB optimizations
CREATE TABLE metrics (
time TIMESTAMPTZ NOT NULL,
device_id TEXT,
temperature DOUBLE PRECISION
);
SELECT create_hypertable('metrics', 'time'); -- partitioned by time chunks
-- Automatic time-window aggregation
SELECT time_bucket('1 hour', time) AS bucket,
device_id,
avg(temperature),
max(temperature)
FROM metrics
WHERE time > NOW() - INTERVAL '7 days'
GROUP BY bucket, device_id
ORDER BY bucket;
Full-text search requires an inverted index: for each unique word (term), a posting list of documents containing it with position data. This enables O(1) term lookup and efficient multi-term intersection.
Built-in DB search (PostgreSQL tsvector, MySQL FULLTEXT):
- Sufficient for basic search on small datasets (<10M rows).
- No additional infrastructure. Transactional β data is always in sync.
- Limited features: no fuzzy matching, poor ranking, no facets, no analytics.
Elasticsearch / OpenSearch: Purpose-built for search. Supports fuzzy matching, stemming, synonyms, multi-language analyzers, faceted search, aggregations, vector search (k-NN for semantic search). Horizontally scalable. Eventually consistent with the primary DB (sync via CDC or app-level dual write).
Architecture with Elasticsearch:
// Sync strategy: write-through via CDC (Debezium) or event-driven
@EventListener
public void onProductUpdated(ProductUpdatedEvent event) {
Product p = event.product();
IndexRequest request = new IndexRequest("products")
.id(String.valueOf(p.id()))
.source(Map.of(
"name", p.name(),
"description", p.description(),
"tags", p.tags(),
"price", p.price()
));
elasticsearchClient.index(request);
}
// Search with fuzzy + filters
SearchRequest search = SearchRequest.of(s -> s
.index("products")
.query(q -> q.bool(b -> b
.must(m -> m.multiMatch(mm -> mm
.query(userQuery)
.fields("name^3", "description", "tags") // boost name 3x
.fuzziness("AUTO")))
.filter(f -> f.range(r -> r.field("price").lte(JsonData.of(100))))
))
.aggregations("by_category", a -> a.terms(t -> t.field("category")))
);
Operational concerns: Keep index mappings locked (changing field types requires reindex). Use index aliases for zero-downtime reindexing. Monitor shard count β too many small shards are expensive (default 1000 shards per node limit). Use ILM (Index Lifecycle Management) for time-based data.
Graph databases store data as nodes (entities) and edges (relationships) with properties on both. Traversing relationships is O(1) per hop β no JOINs that scale with table size. Performance doesn't degrade as the graph grows in size, only in the depth/breadth of the traversal.
Use graph databases when:
- Your queries require multi-hop traversals: "friends of friends who bought X", "shortest path between nodes", "all ancestors in a hierarchy".
- The relationship itself has properties (e.g., an FOLLOWS edge with a
sincedate, a TRANSACTION edge withamount). - The schema is dynamic β nodes can have different properties without requiring schema migration.
Classic use cases: Social networks (LinkedIn's economic graph), fraud detection (ring of connected accounts), recommendation engines ("people who bought X also bought Y"), knowledge graphs, access control (RBAC hierarchies), network topology.
// Neo4j Cypher β find friends-of-friends who like the same genre
MATCH (user:User {id: $userId})-[:FOLLOWS]->(:User)-[:FOLLOWS]->(fof:User),
(fof)-[:LIKES]->(:Genre {name: 'Sci-Fi'})
WHERE NOT (user)-[:FOLLOWS]->(fof) // not already following
RETURN fof.name, count(*) AS mutual_connections
ORDER BY mutual_connections DESC
LIMIT 10;
// Equivalent SQL requires self-joins that get exponentially more complex
// at depth 3+ and don't scale with table size
When to stick with SQL: Most OLTP applications don't have deeply connected traversal queries. A relational DB with proper foreign keys and indexes handles typical "related records" queries efficiently. Only move to a graph DB when relationship traversal is the primary query pattern.
CQRS separates the write model (Commands β operations that change state) from the read model (Queries β operations that return data). They can be different classes, different services, or even different databases.
Why: Read and write workloads often have very different requirements. Writes need to maintain consistency and integrity (normalized, transactional). Reads need to be fast and flexible (denormalized, pre-joined, often from replicas). Optimizing one model for both is a compromise. CQRS lets you optimize each independently.
Simple CQRS: Same database, but separate objects β Command handlers write via domain model, Query handlers read via optimized DTOs directly from SQL. Low complexity. Good first step.
Full CQRS + Event Sourcing: Write side stores events (not current state). Read side builds multiple materialized views from the event stream. Maximum flexibility β add a new read model by replaying events. Very complex. Only for systems where audit history and multiple read model shapes are truly needed.
// Simple CQRS: separate command and query paths
// Command: goes through domain model, enforces business rules
@CommandHandler
public void handle(PlaceOrderCommand cmd) {
Order order = Order.place(cmd.userId(), cmd.items()); // domain logic
orderRepository.save(order); // writes to normalized DB
eventPublisher.publish(new OrderPlacedEvent(order));
}
// Query: reads directly, no domain model overhead
@QueryHandler
public OrderSummaryDTO handle(GetOrderSummaryQuery query) {
// Reads from a denormalized view/replica, no joins needed
return jdbcTemplate.queryForObject(
"SELECT o.id, o.status, u.name, SUM(oi.price) total " +
"FROM orders o JOIN users u ON... WHERE o.id = ?",
(rs, row) -> new OrderSummaryDTO(...), query.orderId());
}
When NOT to use CQRS: Simple CRUD applications where reads and writes are symmetric. It adds two codepaths, sync complexity, and eventual consistency. Apply only when there's a real read/write impedance mismatch.
Caching
8 questions- Cache-aside (Lazy loading): The application checks the cache first. On miss, it reads from the DB, populates the cache, returns data. Most flexible. Data is only cached when actually needed. Risk: cold start, cache stampede, stale data on DB updates (need explicit invalidation or TTL).
- Read-through: The cache sits in front of the DB. On miss, the cache itself fetches from DB and populates. Application always talks to the cache. Consistent cache population. Cache library/provider must support it (e.g., NCache, Ehcache with a CacheLoader).
- Write-through: Every write goes to the cache AND the DB synchronously. Cache is always in sync. Write latency doubles. Unnecessary caching of data that's never read again. Good when reads are frequent and data must always be fresh after a write.
- Write-behind (Write-back): Writes go to the cache immediately; DB is updated asynchronously later in a batch. Very fast writes. Risk: data loss if cache fails before the DB write. Complex to implement correctly. Used in CPU L1/L2 caches and some high-throughput systems.
// Cache-aside pattern (most common in practice)
public Product getProduct(String id) {
String key = "product:" + id;
Product cached = redis.get(key, Product.class);
if (cached != null) return cached;
Product product = db.findById(id); // DB read on miss
redis.set(key, product, Duration.ofMinutes(10)); // populate cache
return product;
}
public void updateProduct(Product product) {
db.save(product);
redis.delete("product:" + product.id()); // invalidate, not update
// Next read will miss β repopulate from fresh DB data
}
A cache stampede occurs when a popular cache entry expires and many concurrent requests simultaneously miss the cache, all racing to query the DB and repopulate it. Under heavy load, this can overwhelm the DB.
Prevention strategies:
- Locking / mutex: Only one requester refreshes the cache; others wait or return stale data while the refresh is in flight.
- Probabilistic early expiration (PER): Slightly before the TTL expires, with increasing probability, start refreshing the cache proactively. Prevents the cliff-edge expiry moment.
- Stale-while-revalidate: Serve the stale value immediately and refresh in the background. Clients never wait for a cache miss.
- Jitter on TTL: Add random jitter to TTLs so mass-populated caches (e.g., after a deploy) don't all expire simultaneously:
ttl = baseTTL + random(0, baseTTL * 0.2).
// Mutex-based stampede prevention with Redis
public Product getProductSafe(String id) {
String key = "product:" + id;
Product cached = redis.get(key, Product.class);
if (cached != null) return cached;
String lockKey = "lock:product:" + id;
boolean locked = redis.set(lockKey, "1", SetArgs.Builder.nx().ex(10)) != null;
if (locked) {
try {
Product product = db.findById(id);
redis.set(key, product, Duration.ofMinutes(10));
return product;
} finally {
redis.delete(lockKey);
}
} else {
// Another thread is refreshing; wait briefly and retry
Thread.sleep(50);
return redis.get(key, Product.class); // hopefully populated now
}
}
When a cache is full and a new item must be added, an eviction policy decides which existing item to remove.
- LRU (Least Recently Used): Evict the item not accessed for the longest time. Good for temporal locality β recently used items are likely to be used again. Default for most caches. Redis:
maxmemory-policy allkeys-lru. Implemented with a doubly linked list + hash map in O(1). - LFU (Least Frequently Used): Evict the item accessed the fewest times. Better for workloads with stable hot/cold access patterns. Avoids LRU's problem where a one-off large scan evicts the entire working set. Redis supports LFU with a Morris counter approximation.
- FIFO: Evict the oldest inserted item regardless of access. Simple but ignores access patterns β may evict hot items. Rarely used in practice.
- Random: Evict a random item. Surprisingly effective in some workloads, O(1), and avoids worst cases of LRU. Redis uses approximated LRU via random sampling (samples 5β10 keys, evicts least recently used among them) β good enough in practice.
- TinyLFU (Caffeine): Modern algorithm used by Caffeine (Java's best in-process cache). Maintains a frequency sketch and an admission window. Outperforms LRU and LFU on most real-world traces. Use Caffeine for in-process caching.
// Caffeine (TinyLFU) β best in-process cache for Java
LoadingCache<String, Product> cache = Caffeine.newBuilder()
.maximumSize(10_000)
.expireAfterWrite(10, MINUTES)
.refreshAfterWrite(5, MINUTES) // async background refresh
.recordStats() // hit rate monitoring
.build(id -> productRepo.findById(id)); // auto-load on miss
// Redis eviction config
// redis.conf: maxmemory 4gb
// maxmemory-policy allkeys-lfu (use LFU for caching workloads)
Cache penetration: Requests for data that doesn't exist in either the cache or the DB. Each request passes through the cache (miss) and hits the DB (also miss). An attacker can deliberately query non-existent keys to DDoS your DB.
Defenses against penetration:
- Cache negative results: Cache a sentinel value (e.g.,
"NULL") with a short TTL for keys that don't exist in the DB. Next request hits the cache and gets the sentinel. - Bloom filter: Before hitting the cache, check a Bloom filter that contains all valid keys. If the filter says "absent," return 404 immediately. False positives are harmless (extra cache miss). No false negatives.
- Input validation: Reject malformed IDs before they reach the cache/DB layer.
Cache poisoning: Malicious or incorrect data is written into the cache, causing subsequent reads to receive bad data. Can happen via CSRF attacks on cache management endpoints, BGP hijacking (for CDN caches), or HTTP response splitting attacks.
Defenses against poisoning: Authenticate cache management APIs. Validate cache entries before serving (e.g., schema validation). Use HTTPS to prevent in-transit manipulation. For CDNs: use Vary headers correctly, sanitize user-controlled cache keys (Host, X-Forwarded-For).
Redis: Rich data structures (strings, hashes, lists, sets, sorted sets, streams, HyperLogLog, geospatial). Persistence (RDB snapshots, AOF log). Pub/Sub. Lua scripting. Cluster mode. Transactions (MULTI/EXEC). Redis Streams for message queuing. Can be used as a primary store, not just a cache.
Memcached: Simple key-value only. Multi-threaded (can use all CPU cores). No persistence. No replication. Pure cache β nothing else. Simpler protocol, slightly lower latency for pure cache-get workloads. Supports CAS (check-and-set).
Choose Redis when: You need data structures (leaderboards with sorted sets, session storage with hashes), pub/sub messaging, Lua atomicity, distributed locks, persistence, Streams, or you might want more than a cache later.
Choose Memcached when: Pure cache, extreme simplicity, multi-threaded performance on many-core machines, and team prefers a minimal operational footprint. Very rare choice today β Redis handles all Memcached use cases and more.
// Redis use cases beyond simple caching
// Sorted set leaderboard
redis.zadd("leaderboard:game:1", score, userId);
redis.zrevrange("leaderboard:game:1", 0, 9); // top 10
// Atomic rate limiting with Lua
// HyperLogLog for unique visitor counts (uses ~12KB for 10M items)
redis.pfadd("visitors:2026-05-25", userId);
long uniqueVisitors = redis.pfcount("visitors:2026-05-25");
// Geo search
redis.geoadd("restaurants", longitude, latitude, restaurantId);
redis.georadius("restaurants", userLon, userLat, 5, GeoUnit.KM);
Redis Cluster shards data across multiple nodes using hash slots (16384 total). Each key maps to a slot via CRC16(key) % 16384. Each master node owns a range of slots. Clients can connect to any node β if they send a command for a key on a different node, they receive a MOVED redirect and query the correct node directly (clients cache the slot map).
Setup: Minimum 3 masters (for majority quorum in failover). Each master should have at least one replica for HA. Cluster automatically promotes a replica to master if a master fails.
Limitations:
- Multi-key operations restricted: MGET, MSET, transactions (MULTI/EXEC), Lua scripts β all keys must map to the same slot. Use hash tags to force co-location:
{user:123}:profileand{user:123}:settingshash onuser:123β same slot. - Pub/Sub: Messages only go to the connected node's subscribers β not cluster-wide. Use Redis Streams or a dedicated pub/sub node instead.
- Cross-slot operations: No JOINs or cross-slot set operations. Application must orchestrate.
# Redis Cluster hash tags β force related keys to same slot
# Without tags: {user:profile:123} and {user:session:123} β different slots
# With tags: use {} to designate the hashing part
SET {user:123}:profile "..." EX 3600
SET {user:123}:session "..." EX 3600
GET {user:123}:profile
# Both keys hash on "user:123" β guaranteed same slot β MGET works
MGET {user:123}:profile {user:123}:session
Write-through: Every write updates both the cache and the DB. Cache is always current. Good for read-heavy workloads where you want every written record to be immediately cache-warm. Downside: cache fills with data that may never be read (wasted memory, cache churn).
Write-around: Writes go directly to the DB, bypassing the cache. Cache is only populated on the next read (lazy). Avoids cache pollution for write-heavy data. Better when newly written data is rarely immediately re-read (e.g., large file ingestion, bulk imports, log data). The next read pays the cache miss cost.
Decision framework:
- Data is written and frequently read immediately β write-through.
- Data is written in bulk and may not be read for a long time (or never) β write-around.
- Most web app session data and user profile updates β write-through or cache-aside with invalidation.
- Reporting data, analytics exports, bulk data pipeline outputs β write-around (don't pollute the cache).
Key metrics:
- Cache hit rate:
hits / (hits + misses). The most important metric. A general cache hit rate below 80% is a problem β revisit TTLs, key design, and cache size. For critical hot paths, target 95%+. - Eviction rate: If items are being evicted before their TTL expires, the cache is too small. Scale up memory or review what's being cached.
- Miss latency vs. hit latency: A cache hit should be <1ms (Redis). A miss triggers a DB query β measure this separately to understand the cost of misses.
- Memory utilization: Keep under 80β85% to leave headroom and avoid aggressive eviction.
- Connection count: High connection counts can cause Redis latency spikes. Use connection pooling (Lettuce pool size = CPU cores Γ 2).
- Command latency: Redis
SLOWLOGcaptures commands over a threshold. Large O(N) commands (SMEMBERS on huge sets, KEYS in production) are common culprits.
# Redis monitoring commands
redis-cli info stats # keyspace hits/misses, evictions
redis-cli info memory # used_memory, mem_fragmentation_ratio
redis-cli slowlog get 10 # last 10 slow commands
redis-cli monitor # real-time command stream (debug only)
# Caffeine (in-process cache) stats
CacheStats stats = cache.stats();
System.out.printf("Hit rate: %.2f%%, Evictions: %d, Load time: %dns%n",
stats.hitRate() * 100,
stats.evictionCount(),
stats.averageLoadPenalty());
Messaging & Events
8 questionsKafka is a distributed, append-only log. Producers write to topics; consumers read from them at their own pace. Data is retained for a configurable period (not deleted on consume).
Topics and partitions: A topic is divided into partitions β ordered, immutable, append-only sequences of records. Partitions enable parallelism. Messages with the same key always go to the same partition (hash partitioning), ensuring ordering per key. Without a key, messages are distributed round-robin.
Offsets: Each record in a partition has a monotonically increasing offset. Consumers track their position by committing offsets. On restart, they resume from the last committed offset. Kafka retains all records regardless β consumers can replay from any offset.
Consumer groups: A group of consumers that jointly consume a topic. Each partition is assigned to exactly one consumer in the group β parallelism up to partition count. Different consumer groups each get their own offset β one Kafka topic can be consumed independently by multiple services.
// Kafka producer with key-based partitioning
ProducerRecord<String, OrderEvent> record = new ProducerRecord<>(
"orders", // topic
order.id(), // key β same order always on same partition
new OrderEvent(order)
);
producer.send(record, (metadata, ex) -> {
log.info("Sent to partition={} offset={}", metadata.partition(), metadata.offset());
});
// Consumer group β 3 consumers Γ 6 partitions = 2 partitions each
@KafkaListener(topics = "orders", groupId = "order-processor",
concurrency = "3") // 3 consumer threads
public void handleOrder(OrderEvent event, Acknowledgment ack) {
orderService.process(event);
ack.acknowledge(); // manual offset commit after successful processing
}
Key design decisions: More partitions = more parallelism but more overhead. Replication factor = 3 for production (ISR). Retention: 7 days is common, longer for replay use cases. Compacted topics: keep only the latest value per key β useful for materializing current state.
RabbitMQ is a traditional message broker β messages are pushed to consumers, acknowledged, and deleted on consumption. Rich routing (exchanges, bindings, headers, topics). Request-reply patterns. Dead letter queues. Excellent for task queues, job distribution, and RPC. Message ordering not guaranteed across consumers. Not designed for replay or long-term storage.
Kafka is an append-only distributed log. Pull-based β consumers read at their own pace. Messages are retained (configurable). Replay historical events. Multiple independent consumer groups. Massive throughput (millions of events/sec). Event sourcing, stream processing, audit logs, real-time pipelines.
| Dimension | Kafka | RabbitMQ |
|---|---|---|
| Message retention | Days/weeks (configurable) | Until consumed |
| Throughput | Millions/sec | Tens of thousands/sec |
| Routing | Topic + partition key | Rich (exchanges, headers) |
| Best for | Event streaming, pipelines, replay | Task queues, RPC, routing |
Rule of thumb: If you need to replay events, have multiple independent consumers of the same events, or need massive throughput β Kafka. If you need rich routing logic, request-reply, or simple task queue semantics β RabbitMQ.
Instead of storing current state, event sourcing stores the full sequence of events that led to the current state. The current state is derived by replaying events. The event log is immutable and append-only.
Benefits:
- Complete audit trail: Every state change is recorded with who did it and when. Perfect for compliance (finance, healthcare, legal).
- Time travel: Reconstruct state at any point in time by replaying events up to that timestamp.
- Event-driven integration: Events are naturally publishable to other systems. No dual-write needed.
- Projection flexibility: Build new read models by replaying the event log. Add a new query model without migrating existing data.
- Debuggability: Reproduce any bug by replaying the exact event sequence that caused it.
Challenges:
- Eventual consistency: Read models are built asynchronously β they lag behind.
- Schema evolution: Immutable events with evolving schemas are hard. Use upcasting β transform old event formats to new on read.
- Replay performance: An entity with 10,000 events takes a long time to reconstitute. Snapshots (periodic state checkpoints) solve this.
- Complexity: Significant conceptual overhead. Only justified for domains where audit, replay, and temporal queries are genuinely needed.
// Event sourcing: reconstitute order from events
public Order reconstitute(List<DomainEvent> events) {
Order order = new Order(); // empty state
for (DomainEvent event : events) {
switch (event) {
case OrderPlaced e -> order.apply(e);
case ItemAdded e -> order.apply(e);
case PaymentReceived e-> order.apply(e);
case OrderShipped e -> order.apply(e);
}
}
return order;
}
// Snapshot optimization: store state at event 500, replay only from 500 onward
A DLQ is a special queue/topic where messages land when they cannot be successfully processed after a configured number of retries. Instead of losing failed messages or blocking the consumer, the DLQ preserves them for inspection, alerting, and manual or automated remediation.
Retry strategy: Immediate retry β short delay retry β longer delay retry β DLQ. Use exponential backoff with jitter to avoid hammering a failing downstream service.
// Spring Kafka: retry with exponential backoff β DLQ
@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<?, ?> template) {
// Publish to DLQ after 3 attempts
DeadLetterPublishingRecoverer recoverer =
new DeadLetterPublishingRecoverer(template,
(record, ex) -> new TopicPartition(record.topic() + ".DLQ", -1));
// Exponential backoff: 1s, 2s, 4s (max 3 attempts)
ExponentialBackOffWithMaxRetries backOff = new ExponentialBackOffWithMaxRetries(3);
backOff.setInitialInterval(1000);
backOff.setMultiplier(2.0);
DefaultErrorHandler handler = new DefaultErrorHandler(recoverer, backOff);
// Don't retry on deserialization errors β they'll never succeed
handler.addNotRetryableExceptions(DeserializationException.class);
return handler;
}
// DLQ consumer: alerting + manual replay tooling
@KafkaListener(topics = "orders.DLQ")
public void handleDLQ(ConsumerRecord<?, ?> record, Exception ex) {
alerting.sendAlert("Message processing failed", record, ex);
dlqRepository.save(new FailedMessage(record)); // store for ops review
}
DLQ best practices: Always have alerts on DLQ depth. Include original exception, stack trace, and retry count as message headers. Build a replay tool to re-enqueue fixed messages after a bug fix. DLQ messages should never be silently discarded.
Batch processing: Process a large, bounded dataset all at once. High throughput, efficient for large-scale transformations. Results are available after the entire batch completes β latency is hours or days. Examples: Apache Spark, Hadoop MapReduce, nightly ETL jobs, monthly billing.
Stream processing: Process an unbounded, continuous stream of events in real-time (milliseconds to seconds). Results are continuously updated. Lower throughput than batch for the same data volume due to per-event overhead. Examples: Apache Flink, Kafka Streams, Apache Spark Structured Streaming, fraud detection, live dashboards, alerting.
Lambda architecture: Combines both β a batch layer for accurate historical processing and a speed layer for real-time, with a serving layer merging both. Complex to operate (two codebases). Mostly replaced by Kappa architecture.
Kappa architecture: Stream processing only. Historical reprocessing is done by replaying the Kafka log. Simpler (one codebase). Works when your streaming framework can handle batch-like reprocessing at scale (Flink, Kafka Streams). Modern default.
// Kafka Streams: real-time fraud detection pattern
KStream<String, Transaction> transactions = builder.stream("transactions");
// Tumbling window: count transactions per user per 5 minutes
KTable<Windowed<String>, Long> txnCount = transactions
.groupBy((key, tx) -> tx.userId())
.windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
.count();
// Alert if >10 transactions in 5 minutes
txnCount.toStream()
.filter((windowedKey, count) -> count > 10)
.map((windowedKey, count) ->
KeyValue.pair(windowedKey.key(), new FraudAlert(windowedKey.key(), count)))
.to("fraud-alerts");
Backpressure is a flow control mechanism where a slow consumer signals to a fast producer to slow down, preventing buffer overflow and system collapse. Without backpressure, fast producers overwhelm slow consumers, causing out-of-memory errors, dropped messages, or cascading failures.
Reactive Streams: The Reactive Streams specification (Project Reactor, RxJava, Akka Streams) standardizes backpressure. Subscribers signal how many items they can handle via request(n). The publisher sends at most n items. The subscriber requests more when ready.
Backpressure strategies when the consumer can't keep up:
- Buffer: Queue items (risky if unbounded β OOM).
- Drop: Discard new items when buffer is full (acceptable for metrics/sampling).
- Latest: Keep only the most recent item (live data feeds, stock prices).
- Error: Fail fast and propagate the error to the caller.
- Slow down producer: Propagate the backpressure signal upstream (ideal β demand-driven).
// Project Reactor backpressure strategies
Flux<Event> eventStream = eventSource.hotStream();
// Buffer with bounded limit β fail if exceeded
eventStream
.onBackpressureBuffer(1000, e -> log.warn("Buffer overflow, dropping: {}", e))
.flatMap(this::processEvent, 10) // concurrency = 10
.subscribe();
// Drop strategy for metrics (losing a few data points is acceptable)
eventStream
.onBackpressureDrop(e -> metrics.increment("dropped.events"))
.subscribe(this::collectMetric);
// Demand-driven: subscriber requests only what it can handle
eventStream.subscribe(new BaseSubscriber<>() {
@Override protected void hookOnSubscribe(Subscription s) { request(10); }
@Override protected void hookOnNext(Event e) {
process(e);
request(1); // request next item only after processing current
}
});
Point-to-point (queue): A message is delivered to exactly one consumer. Multiple competing consumers on the same queue share the load β each message processed once. Good for task distribution, work queues, load leveling. RabbitMQ queues, Amazon SQS.
Publish-subscribe (topic): A message is delivered to all interested subscribers. Each subscriber gets its own copy. Good for broadcasting events to multiple systems. Event-driven architectures. Kafka topics, SNS topics, Redis pub/sub.
Hybrid (Kafka consumer groups): Kafka topics provide pub/sub semantics between consumer groups, but point-to-point semantics within a consumer group (each partition goes to one consumer). This gives you both: multiple services each independently consume the full topic, but within each service, work is distributed across instances.
// Kafka: one topic, multiple independent consumer groups
// OrderPlaced topic consumed by:
// 1. inventory-service (group: inventory-consumers) β reserves stock
// 2. notification-service (group: notification-consumers) β sends emails
// 3. analytics-service (group: analytics-consumers) β updates dashboards
// Each group gets every message independently
// vs. point-to-point: only one service ever processes each message
// Good for: distributed task queue with worker pool
// Bad for: broadcasting events to multiple different services
Message ordering is guaranteed within a Kafka partition for messages with the same key. The challenge is that parallelism (multiple partitions, multiple consumers) breaks global ordering.
Strategies:
- Single partition: All messages on one partition β total order. Zero parallelism. Only for very low-throughput scenarios.
- Partition by entity key: Route all events for the same entity (same order ID, same user ID) to the same partition via a message key. Events for one entity are ordered; events across different entities may be interleaved. This is almost always the right approach.
- Sequence numbers: Embed a per-entity sequence number. Consumers detect gaps and buffer out-of-order messages until the missing sequence arrives (resequencer pattern).
- Single consumer per entity: Use consistent hashing to route processing for an entity to the same consumer thread. Used in actor frameworks (Akka) and virtual actor models (Orleans).
// Kafka: partition by entity key to preserve per-entity order
producer.send(new ProducerRecord<>(
"order-events",
order.id(), // β KEY: same order always on same partition
new OrderEvent(order)
));
// All events for order-789 go to the same partition in creation order:
// [OrderCreated, ItemAdded, PaymentReceived, OrderShipped] β in order β
// Events for order-790 go to its partition β possibly different consumer β
// Cross-order ordering not preserved β acceptable for most use cases
API Design
8 questionsGood REST design principles:
- Resources, not actions: URIs represent nouns (resources), HTTP methods represent verbs.
POST /ordersnotPOST /createOrder. - Correct HTTP semantics: GET (read, safe, idempotent), POST (create, non-idempotent), PUT (full replace, idempotent), PATCH (partial update), DELETE (remove, idempotent).
- Consistent naming: Plural nouns (
/users, not/user). Lowercase with hyphens (/user-profiles). Hierarchical for owned resources (/users/{id}/orders). - Proper status codes: 200 OK, 201 Created (POST), 204 No Content (DELETE), 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found, 409 Conflict, 422 Unprocessable Entity, 429 Too Many Requests, 500 Internal Server Error.
- Versioning: URL versioning (
/v1/users) is most common. Header versioning (Accept: application/vnd.app.v2+json) is cleaner but harder to test in a browser. - Pagination: Cursor-based pagination for large datasets (efficient); offset pagination for fixed, small datasets. Always include total count and next/previous links.
Common mistakes:
- Using GET for mutations (cache-hostile, dangerous).
- Returning 200 with
{"success": false, "error": "..."}β use proper status codes. - Chatty APIs requiring many calls to complete one user action β consider aggregation endpoints or GraphQL.
- No idempotency keys for POST operations β clients can't safely retry.
- Returning too much data β use field selection (
?fields=id,name) or sparse fieldsets.
# Good REST API design
GET /orders # list orders (paginated)
POST /orders # create order β 201 + Location header
GET /orders/{id} # get order by ID
PATCH /orders/{id} # partial update (status, items)
DELETE /orders/{id} # cancel/delete order
# Idempotency key for safe POST retry
POST /payments
Idempotency-Key: client-uuid-abc-123
{ "amount": 99.99, "currency": "USD" }
# Server stores result by key; duplicate requests return same response
GraphQL advantages:
- No over/under-fetching: Clients request exactly the fields they need. One query can fetch nested related data that would require multiple REST calls.
- Single endpoint: All queries through
/graphql. Simpler client integration for multiple resource types. - Strongly typed schema: Schema-first development. Auto-generated docs. Type-safe client code generation.
- Ideal for diverse clients: Mobile apps needing minimal data, desktop apps needing full data β same API, different queries. Avoids versioning proliferation.
GraphQL challenges:
- N+1 problem: Nested resolvers can trigger N queries for N entities. Must use DataLoader (batching + caching) to solve.
- Caching complexity: REST GET responses cache trivially (URL-based). GraphQL POST queries require client-side normalized caches (Apollo, Relay) or persisted queries at the CDN.
- Security: Unbounded query depth/complexity attacks. Must implement depth limiting and complexity analysis.
- Overkill for simple APIs: REST is simpler for CRUD with stable access patterns and few clients.
// GraphQL with DataLoader to prevent N+1
@Component
public class OrderDataLoader {
// Batch: load all users for a set of orders in one DB query
public DataLoader<Long, User> userDataLoader() {
return DataLoader.newMappedDataLoader(userIds ->
CompletableFuture.supplyAsync(() ->
userRepository.findAllById(userIds).stream()
.collect(toMap(User::id, identity()))
)
);
}
}
// Query: get 100 orders + their users = 1 order query + 1 batched user query (not 101)
query {
orders(first: 100) {
id status
user { name email } # DataLoader batches all user fetches
items { sku quantity }
}
}
gRPC is a high-performance, contract-first RPC framework using Protocol Buffers (protobuf) for serialization and HTTP/2 for transport. Developed by Google.
Advantages over REST:
- Performance: Protobuf binary serialization is 3β10Γ smaller and faster than JSON. HTTP/2 multiplexing means multiple calls over one connection with no head-of-line blocking.
- Streaming: Native bidirectional streaming (server stream, client stream, bidi stream) over a single connection. Impossible with standard REST.
- Strong typing + code generation: Define services in
.protofiles, generate type-safe clients and servers in 10+ languages. Eliminates a class of integration bugs. - Deadlines + cancellation: First-class propagation of timeouts and cancellation across service boundaries.
When to use gRPC: Internal microservice-to-microservice communication where performance matters. Polyglot environments. Real-time streaming (chat, live updates, ML model serving). Mobile backends where bandwidth is precious.
When REST is better: Public APIs (browser support, firewall compatibility). Simple CRUD with no streaming. Teams unfamiliar with protobuf toolchain. When HTTP caching is critical.
// gRPC service definition (proto3)
syntax = "proto3";
service OrderService {
rpc PlaceOrder (PlaceOrderRequest) returns (Order);
rpc StreamOrderUpdates (OrderId) returns (stream OrderStatusUpdate); // server streaming
rpc ProcessOrders (stream Order) returns (BatchResult); // client streaming
}
message PlaceOrderRequest {
string user_id = 1;
repeated OrderItem items = 2;
}
// Java Spring Boot gRPC server (grpc-spring-boot-starter)
@GrpcService
public class OrderGrpcService extends OrderServiceGrpc.OrderServiceImplBase {
@Override
public void placeOrder(PlaceOrderRequest req, StreamObserver<Order> resp) {
Order order = orderService.place(req.getUserId(), req.getItemsList());
resp.onNext(order.toProto());
resp.onCompleted();
}
}
API versioning allows you to evolve the API without breaking existing clients. The key is distinguishing between backwards-compatible changes (additive, no version bump needed) and breaking changes (require a new version).
Backwards-compatible (non-breaking) changes: Adding new optional fields, adding new endpoints, adding new enum values, making required fields optional, relaxing validation rules.
Breaking changes: Removing or renaming fields/endpoints, changing field types, making optional fields required, changing authentication schemes, changing error response shapes.
Versioning strategies:
- URL versioning (
/v1/users): Most common, explicit, easy to route at the gateway, easy to test. Pollutes the URI semantically (version isn't a resource property). Easy to forget to update version in bookmarked URLs. - Header versioning (
API-Version: 2): Clean URIs. Harder to test in browser. Invisible to caches unless you set Vary header. - Accept header (
Accept: application/vnd.api.v2+json): Semantically correct per HTTP spec. Most complex. Used by GitHub. - Query parameter (
?version=2): Easy but considered an anti-pattern β query params should filter resources, not select API versions.
Deprecation strategy: Announce deprecation timelines publicly. Add Deprecation and Sunset headers to deprecated endpoints. Log usage of old versions to notify lingering clients. Maintain old versions for at least 12β18 months after deprecation announcement.
An API gateway is the single entry point for all client requests to a microservices backend. It handles cross-cutting concerns centrally.
What belongs in the gateway:
- SSL termination and TLS certificate management.
- Authentication (validate JWT tokens, API keys) β not authorization.
- Rate limiting per client.
- Request routing to the correct backend service.
- Load balancing across service instances.
- Request/response logging and distributed tracing context injection.
- Response caching for cacheable endpoints.
- Request transformation (header injection, path rewriting).
- Circuit breaking and timeout enforcement.
What does NOT belong in the gateway:
- Business logic: The gateway is infrastructure. Business rules belong in services. A fat gateway becomes a bottleneck and a deployment risk.
- Authorization: Fine-grained "can user X access resource Y?" belongs in the service, not the gateway. The gateway only validates that the token is genuine.
- Data aggregation: If the gateway is composing responses from multiple services, it's becoming a BFF (Backend for Frontend) β that's a separate pattern with its own service.
# Kong API Gateway declarative config
services:
- name: order-service
url: http://order-service:8080
routes:
- name: orders-route
paths: [/v1/orders]
plugins:
- name: jwt # validate JWT token
- name: rate-limiting
config:
minute: 100 # 100 req/min per consumer
- name: request-transformer
config:
add:
headers: [X-Request-ID:$(uuid)]
An operation is idempotent if applying it multiple times produces the same result as applying it once. Critical because in distributed systems, networks fail and clients must retry. Without idempotency, retries cause duplicate charges, double-shipped orders, or duplicate records.
Naturally idempotent: GET, PUT, DELETE. A GET repeated returns the same data. A DELETE on an already-deleted resource should return 204 (or 404), not an error.
Not naturally idempotent: POST β creating a resource twice creates two. Requires explicit idempotency keys.
Implementation:
// Idempotency key pattern for payment API
@PostMapping("/payments")
public ResponseEntity<Payment> createPayment(
@RequestHeader("Idempotency-Key") String idempotencyKey,
@RequestBody PaymentRequest request) {
// Check if we've seen this key before
Optional<Payment> existing = idempotencyStore.get(idempotencyKey);
if (existing.isPresent()) {
return ResponseEntity.ok(existing.get()); // return cached result
}
// Validate and process
Payment payment = paymentService.charge(request);
// Store result atomically with idempotency key (with TTL, e.g. 24h)
idempotencyStore.store(idempotencyKey, payment, Duration.ofHours(24));
return ResponseEntity.status(201).body(payment);
}
// Redis-backed idempotency store
// SET idempotency:{key} {serializedResponse} EX 86400 NX
// NX = only set if not exists (atomic check-and-set)
Key considerations: The idempotency key + user/client ID must be validated together (one client can't use another's key). Store the full response, not just "processed" β repeated calls must return identical responses. Use database unique constraints as a backstop for true atomic idempotency.
Offset pagination (?page=3&limit=20): Simple. Can jump to any page. Unstable β if items are inserted before the current page, results shift (items skipped or repeated). Expensive for large offsets: OFFSET 1000000 requires scanning 1M rows before returning 20. DB must count total rows for page count. Good for small, stable datasets.
Cursor-based pagination (?after=cursor123&limit=20): A cursor is an opaque token encoding the position (often base64-encoded last item's ID + timestamp). No total count possible. Can't jump to page N. Stable under inserts/deletes β no skip/repeat problem. Used by Twitter, Facebook, Stripe. Best for infinite scroll and real-time feeds.
Keyset pagination (?after_id=1234&limit=20): The efficient implementation of cursor pagination. Uses an indexed column (ID, timestamp) in the WHERE clause instead of OFFSET. Extremely fast regardless of depth β index seek, not scan. The "cursor" is just the value of the last seen column.
-- Offset: slow at depth (must scan and skip N rows)
SELECT * FROM orders ORDER BY created_at DESC
OFFSET 100000 LIMIT 20; -- scans 100020 rows, returns 20
-- Keyset: O(log N) regardless of depth
SELECT * FROM orders
WHERE created_at < :lastSeenCreatedAt -- or (created_at, id) for ties
OR (created_at = :lastSeenCreatedAt AND id < :lastSeenId)
ORDER BY created_at DESC, id DESC
LIMIT 20;
-- Uses index on (created_at, id) β seeks directly, scans only 20 rows
-- Cursor = base64("{lastSeenCreatedAt}_{lastSeenId}")
-- Response includes: {"data": [...], "nextCursor": "eyJjcm..."}
Never make a client hold an HTTP connection open for minutes. Instead, use the async job pattern: accept the request, return immediately with a job ID, allow polling or webhooks for completion.
Polling pattern (simple):
// 1. Client submits job
POST /reports/generate
Body: { "type": "monthly", "month": "2026-04" }
β 202 Accepted
Location: /jobs/job-uuid-789
Body: { "jobId": "job-uuid-789", "status": "PENDING" }
// 2. Client polls until done
GET /jobs/job-uuid-789
β { "jobId": "...", "status": "PROCESSING", "progress": 45 }
GET /jobs/job-uuid-789
β { "jobId": "...", "status": "COMPLETE",
"resultUrl": "/reports/rep-uuid-456" }
// 3. Client fetches result
GET /reports/rep-uuid-456
Webhook pattern (push-based): Client registers a callback URL when submitting. Server POSTs to the callback when complete. No polling overhead. Requires the client to expose a public HTTPS endpoint. Used by Stripe, GitHub, Twilio.
Server-Sent Events (SSE): Keep a one-way HTTP connection open; server pushes status updates as events. Good middle ground β simpler than WebSockets, supported by all browsers. Client reconnects automatically on disconnect. Used for progress bars, live dashboards.
Best practices: Job IDs should be UUIDs (unpredictable). Store job state in a durable store (DB, not in-memory). Include retry-after hints in polling responses. Set job TTL and clean up old jobs. Support cancellation (DELETE /jobs/{id}).
Reliability & Resilience
10 questionsThe circuit breaker wraps a call to a remote service and monitors for failures. When failure rate exceeds a threshold, the circuit "opens" β subsequent calls fail immediately without trying the downstream service, giving it time to recover and preventing cascade failures.
Three states:
- Closed (normal): Requests pass through. Failures are counted. If failure rate exceeds threshold (e.g., 50% in a 10-second window), transition to Open.
- Open (tripped): All requests fail immediately with a fallback (cached response, default value, error). No calls to the downstream. After a configured wait time (e.g., 60s), transition to Half-Open.
- Half-Open: Allow a limited number of test requests through. If they succeed β Close. If they fail β Open again.
// Resilience4j circuit breaker
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
.failureRateThreshold(50) // open if 50% of calls fail
.slidingWindowSize(10) // in last 10 calls
.waitDurationInOpenState(Duration.ofSeconds(30))
.permittedNumberOfCallsInHalfOpenState(3)
.build();
CircuitBreaker cb = CircuitBreakerRegistry.of(config)
.circuitBreaker("payment-service");
// Decorate the call
Supplier<PaymentResult> decorated = CircuitBreaker
.decorateSupplier(cb, () -> paymentClient.charge(request));
// Execute with fallback
Try.ofSupplier(decorated)
.recover(CallNotPermittedException.class,
ex -> PaymentResult.cached("Circuit open, try again later"))
.get();
Circuit breaker metrics to monitor: State transitions (log every open/close event), failure rate, number of calls in each state, fallback activation rate.
Named after ship hull compartments that can be sealed to prevent flooding from spreading. The bulkhead pattern isolates resources (thread pools, semaphores, connection pools) so that a failure in one component can't exhaust resources needed by others.
Thread pool bulkhead: Each downstream service gets its own bounded thread pool. If the inventory service is slow and its 10 threads are all waiting, the payment service's 10 threads are unaffected. Without bulkheads, a slow service monopolizes the shared thread pool and starves everything else.
Semaphore bulkhead: Limits concurrent calls without a separate thread pool. Lighter weight. The calling thread is blocked on a permit. Used for short, fast calls.
// Resilience4j bulkhead β separate thread pool per downstream service
ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
.maxThreadPoolSize(10)
.coreThreadPoolSize(5)
.queueCapacity(20) // queue up to 20 before rejecting
.build();
ThreadPoolBulkhead inventoryBulkhead =
ThreadPoolBulkheadRegistry.of(config).bulkhead("inventory-service");
// Calls to inventory use isolated pool; payment service unaffected
CompletableFuture<Inventory> result =
inventoryBulkhead.executeSupplier(() -> inventoryClient.check(sku));
result.exceptionally(BulkheadFullException.class::isInstance
? e -> Inventory.UNAVAILABLE
: e -> { throw (RuntimeException) e; });
// Semaphore bulkhead for lighter isolation
BulkheadConfig semConfig = BulkheadConfig.custom()
.maxConcurrentCalls(20)
.maxWaitDuration(Duration.ofMillis(100))
.build();
Fail-fast: The system immediately reports an error when it detects a problem rather than attempting to continue in an invalid state. Errors surface quickly and at the point of origin β easier to debug. Prevents propagating corrupted state deeper into the system.
Fail-safe: The system continues operating with degraded functionality when a component fails, providing a fallback or default. Better user experience β partial functionality beats total outage. Risk: degraded state may cause subtle data integrity issues if not designed carefully.
When to use each:
- Fail-fast for: Data validation at boundaries (reject bad input immediately), resource exhaustion (queue full β 429, don't block), configuration errors on startup.
- Fail-safe for: Non-critical features (search suggestions failing shouldn't break the search results page), external dependencies (recommendation engine down β show popular items instead), analytics and monitoring (never let observability code crash the main app).
The two coexist: fail-fast at the component boundary (detect the failure), fail-safe at the system level (degrade gracefully). A circuit breaker is the canonical combination β it fails fast (rejects calls when open) to achieve system-level fail-safe behavior.
SLA math: 99.9% = ~8.7 hours downtime/year. 99.99% = ~52 minutes/year. 99.999% (five nines) = ~5 minutes/year. Each nine is dramatically harder and more expensive to achieve.
Strategies for high availability:
- Eliminate single points of failure: Every component must be redundant. Load balancers in pairs (active-passive or active-active). Databases with replicas. Multi-AZ deployments. Redundant network paths.
- Health checks and automatic failover: Load balancers must detect and route around unhealthy instances. DB replicas must auto-promote on primary failure. Use short health check intervals (5β10s) but with enough consecutive failures before removal to avoid flapping.
- Graceful degradation: Non-critical features degrade, core functionality continues. Feature flags to disable broken features without a deploy.
- Zero-downtime deploys: Rolling deployments, blue-green, or canary. Never take the whole fleet down at once.
- Chaos engineering: Deliberately inject failures in production (Netflix Chaos Monkey) to find single points of failure before they find you.
- Runbooks and on-call: Automation can only do so much. You need humans with runbooks for the failures automation can't handle.
# Availability of a system with multiple components:
# Availability = product of all component availabilities
# Two components in series: A Γ B
# Two components in parallel (redundant): 1 - (1-A)(1-B)
# Example: 3-tier app
Web tier: 99.9% Γ App tier: 99.9% Γ DB tier: 99.9% = 99.7% (series)
β 26 hours downtime/year β unacceptable!
# With redundancy at each tier (2 instances each):
Web tier: 1-(0.001Β²) = 99.9999%
App tier: 99.9999%
DB tier (primary + replica): 99.9999%
Combined: ~99.9997% β 16 minutes/year
Blue-green deployment: Two identical production environments (blue = current, green = new). Switch traffic from blue to green all at once. Instant rollback (switch back). Requires 2Γ infrastructure cost. No gradual rollout. Database schema changes are tricky β both versions must be able to run against the same DB simultaneously.
Rolling deployment: Replace instances one at a time (or in small batches). Gradual rollout. Only needs N+1 capacity (one extra instance during rollout). Old and new versions run simultaneously during the rollout window β requires backwards compatibility. Slow rollback (need to roll forward or roll backwards batch by batch).
Canary release: Route a small % of traffic (1β5%) to the new version. Monitor error rates, latency, business metrics. If healthy, gradually increase to 100%. If issues detected, route all traffic back to stable version instantly. Requires sophisticated traffic splitting (weighted load balancing, feature flags, service mesh like Istio). Best for high-risk changes. Named after the canary-in-a-coal-mine metaphor.
# Kubernetes canary with Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
strategy:
canary:
steps:
- setWeight: 5 # 5% to canary
- pause: {duration: 5m}
- analysis: # automated metrics check
templates: [{templateName: success-rate}]
- setWeight: 25
- pause: {duration: 10m}
- setWeight: 100 # promote to full traffic if healthy
canaryService: order-service-canary
stableService: order-service-stable
Observability is the ability to understand the internal state of a system from its external outputs. The three pillars:
- Metrics: Numeric measurements over time. Aggregatable, queryable, alertable. Low storage cost. No context on individual requests. Use: Prometheus + Grafana. Key metrics: RED (Rate, Errors, Duration) for services; USE (Utilization, Saturation, Errors) for resources.
- Logs: Structured records of discrete events. Full context (request details, stack traces). High storage cost. Hard to correlate across services without trace IDs. Use structured logging (JSON) with consistent fields. Avoid logging in tight loops.
- Traces: Records the path of a request through distributed services, with timing for each span. Answers "why is this specific request slow?" Sampling required at scale (typically 1β10%). Use: Jaeger, Zipkin, OpenTelemetry.
How they connect: A metric spike alerts you to a problem. You drill down with logs to find affected request IDs. You use a trace ID from the logs to see the full distributed call tree and identify which service/query is slow.
// OpenTelemetry β instrument once, export to any backend
// Automatic instrumentation for Spring Boot (HTTP, DB, Kafka auto-instrumented)
// Manual spans for business-critical operations
@WithSpan("checkout.process")
public Order processCheckout(@SpanAttribute("user.id") String userId,
Cart cart) {
Span span = Span.current();
span.setAttribute("cart.item_count", cart.size());
try {
Order order = placeOrder(userId, cart);
span.setAttribute("order.id", order.id());
return order;
} catch (Exception e) {
span.setStatus(StatusCode.ERROR, e.getMessage());
span.recordException(e);
throw e;
}
}
// Trace propagation: W3C TraceContext headers (traceparent, tracestate)
// inject into all outgoing HTTP/Kafka calls automatically
Retrying failed requests increases resilience against transient failures (network blips, brief overload). Poorly designed retries can amplify problems.
What to retry: Transient errors β network timeouts, 503 Service Unavailable, 429 Too Many Requests (with respect to Retry-After header). Never retry: 4xx errors (bad request, auth failure β retrying won't help), non-idempotent operations without idempotency keys.
Exponential backoff: Double the wait time between retries β attempt 1 immediately, attempt 2 after 1s, attempt 3 after 2s, attempt 4 after 4s, cap at ~30s. Reduces thundering herd.
Jitter: Add randomness to the backoff: wait = min(cap, random(0, base Γ 2^attempt)). Without jitter, all retrying clients hit the recovering service simultaneously. With jitter, retries spread out in time β dramatically reduces synchronized load spikes. AWS research shows "full jitter" is the most effective strategy.
// Resilience4j retry with exponential backoff + jitter
RetryConfig config = RetryConfig.custom()
.maxAttempts(4)
.intervalFunction(IntervalFunction.ofExponentialRandomBackoff(
Duration.ofMillis(500), // initial
2.0, // multiplier
0.5, // randomization factor (jitter)
Duration.ofSeconds(30) // max wait
))
.retryOnException(e -> e instanceof ServiceUnavailableException
|| e instanceof TimeoutException)
.build();
// Wait times with jitter: ~500ms, ~750-1250ms, ~1500-2500ms, cap at 30s
// Multiple clients all jittered β no synchronized thundering herd
Retry budget: Set a maximum total retry time (deadline propagation). If the original request deadline has 500ms left, don't retry with a 2-second backoff. Use Resilience4j's TimeLimiter or gRPC's deadline propagation to enforce end-to-end deadlines.
A service mesh is a dedicated infrastructure layer for handling service-to-service communication, implemented as a set of network proxies (sidecars) deployed alongside each service pod.
Problems it solves (without changing application code):
- mTLS (mutual TLS): Encrypt all service-to-service traffic and authenticate both sides automatically. No certificates to manage in application code.
- Traffic management: Fine-grained canary routing (route 5% of traffic to v2), retry policies, timeout enforcement, circuit breaking β all in config, not code.
- Observability: Automatic distributed traces and service-level metrics for every service call β no code instrumentation needed.
- Authorization policies: Service-level firewall β "only the order-service is allowed to call the payment-service" enforced at the mesh level, not in application code.
Popular implementations: Istio (with Envoy sidecar), Linkerd (lighter weight), AWS App Mesh, Consul Connect.
Cost: Sidecars add memory (~50MB/pod) and latency (~1ms per hop). Operational complexity of the control plane. Worth it at scale; overkill for small microservices deployments.
# Istio VirtualService: canary routing without code changes
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata: { name: payment-service }
spec:
http:
- match: [{ headers: { x-user-segment: { exact: beta } } }]
route: [{ destination: { host: payment-service, subset: v2 } }]
- route: # default
- destination: { host: payment-service, subset: v1 }
weight: 95
- destination: { host: payment-service, subset: v2 }
weight: 5
RPO (Recovery Point Objective): How much data loss is acceptable? An RPO of 1 hour means you can tolerate losing up to 1 hour of transactions. Drives backup/replication frequency.
RTO (Recovery Time Objective): How long can the system be down? An RTO of 15 minutes means you must restore service within 15 minutes of a disaster. Drives failover automation and infrastructure design.
DR strategies (ascending cost and complexity):
- Backup and restore: RPO = hours (backup frequency), RTO = hours (restore time). Cheapest. Suitable for non-critical systems.
- Pilot light: Minimal standby in DR region (just core data layer). RPO = minutes (continuous replication), RTO = 10β30 min (scale up on failover). Medium cost.
- Warm standby: Scaled-down but fully functional DR environment always running. RPO = minutes, RTO = minutes (just redirect traffic, scale up). Higher cost.
- Active-Active (multi-region): Both regions serve traffic. RPO β 0 (or replication lag), RTO β 0 (automatic traffic routing). Highest cost and complexity. For financial systems, healthcare, e-commerce at scale.
The uncomfortable truth: DR plans that aren't tested don't work. Schedule regular DR drills. Chaos engineering (simulate region failure, db failover) is the only way to know your RTO is achievable.
SLI (Service Level Indicator): A quantitative measure of service behavior. Examples: request success rate (%), P99 latency (ms), availability (%). The raw measurement.
SLO (Service Level Objective): A target for an SLI. "99.9% of requests succeed" or "P99 latency < 200ms, measured over a 30-day rolling window." An internal goal β the team's commitment to itself and stakeholders.
SLA (Service Level Agreement): A contractual commitment to customers, often with financial penalties for breach. SLAs are typically looser than internal SLOs (buffer for when things go wrong). You should breach your SLO internally before breaching the customer SLA.
Error budget: The allowed amount of unreliability within the SLO period. SLO = 99.9% β error budget = 0.1% of requests or ~43 minutes of downtime/month. If the error budget is healthy, the team can take risks (faster deploys, experiments). If it's burning fast, freeze risky changes and focus on reliability. Error budgets align the incentive between velocity (engineering) and reliability (operations).
// Error budget calculation
double sloTarget = 0.999; // 99.9%
double errorBudget = 1 - sloTarget; // 0.1%
long totalRequests = 1_000_000;
long allowedFailures = (long)(totalRequests * errorBudget); // 1,000 failures
long actualFailures = 800;
double budgetConsumed = (double) actualFailures / allowedFailures; // 80%
double budgetRemaining = 1 - budgetConsumed; // 20%
// 20% budget remaining β safe to proceed with risky deploys
// Alert threshold: burn remaining budget within 2 hours of current rate
// Multi-window burn rate alerting: Google SRE workbook recommends
// fast burn (1h window) + slow burn (6h window) for comprehensive coverage
Case Studies
9 questionsRequirements: Shorten a URL β 7-char code. Redirect short β original. ~100M URLs stored. Read-heavy (1000:1 read/write). Redirection latency <50ms. Analytics (click count, referrer).
Short code generation: Use base62 (a-zA-Z0-9) encoding of an auto-increment ID or hash. 7 chars of base62 = 62β· β 3.5 trillion unique codes. Use a distributed ID generator (Snowflake-style) to avoid ID collisions across multiple write servers without coordination.
Storage: Simple key-value: shortCode β {originalUrl, createdAt, userId, clickCount}. Use a DB (Cassandra or DynamoDB) for primary storage. Redis cache for hot redirections (cache the shortβlong mapping with TTL).
Redirection flow:
GET /abc1234
β Check Redis cache: hit β 301/302 redirect (immediate)
β Cache miss β DB lookup β cache result β redirect
// 301 vs 302:
// 301 Permanent: browser caches permanently β no future requests to our servers
// β analytics lose visibility after first redirect
// 302 Temporary: browser requests our server every time β analytics work
// β more traffic to handle. Use 302 for analytics.
Analytics: Don't write analytics synchronously on each redirect (adds latency). Publish click events to Kafka asynchronously. A Flink/Kafka Streams job aggregates counts per URL per time window and writes to a time-series store.
Custom aliases: Allow user-specified short codes. Check uniqueness before storing. Store in the same table β the short code is always the primary key.
Scaling: Read path is trivially scalable with more Redis nodes + CDN (redirect responses are cacheable). Write path is low volume β a single DB with replicas handles it easily. The hot read/cold write asymmetry makes this system easier to scale than it appears.
Problem: A rate limit of 100 req/min per user must be enforced consistently across 10 server instances. Each instance seeing ~10 req/min per user would not enforce the limit if they each have independent counters.
Centralized counter (Redis): All instances share a Redis counter. Atomic increment + expire. Simple, consistent. Redis becomes a single point of coordination (mitigated by Redis Cluster or Sentinel).
-- Redis sliding window (Lua script, atomic):
local key = "rate:" .. KEYS[1] -- "rate:user:123"
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2]) -- seconds
local now = tonumber(ARGV[3]) -- current timestamp ms
redis.call('ZREMRANGEBYSCORE', key, 0, now - window * 1000)
local count = tonumber(redis.call('ZCARD', key))
if count < limit then
redis.call('ZADD', key, now, now .. math.random())
redis.call('EXPIRE', key, window)
return {1, limit - count - 1} -- allowed, remaining
else
return {0, 0} -- rejected
end
Local + global hybrid: Each instance maintains a local counter. Periodically sync with Redis (every 100ms or N requests). The instance allows requests as long as localCount < localLimit (e.g., 20% of global limit). Reduces Redis calls by 80β90%. Slightly imprecise β may allow up to (syncInterval Γ rate) excess requests during sync lag. Acceptable for most use cases.
Envoy + gRPC global rate limiter: Production systems often use Envoy's rate limiting filter with a dedicated gRPC rate limit service (e.g., Ratelimit by Lyft). Centralized policy management, per-service configuration, low-latency gRPC calls.
Scale: 1B/day = ~11,600/sec average, peaks 5β10Γ higher. Each channel has different throughput, cost, and latency characteristics.
Architecture:
- Notification API: Accepts notification requests. Validates, enriches, and publishes to Kafka. Returns immediately (async). Horizontally scaled.
- Kafka topics per channel:
notifications.push,notifications.email,notifications.sms. Partitioned by user ID for ordering guarantees per user. - Channel workers: Dedicated consumer groups per channel. Workers fetch user preferences, perform deduplication, and call the channel provider (APNs/FCM for push, SendGrid/SES for email, Twilio for SMS).
- User preferences service: Cached in Redis. "User X has push enabled, email disabled, quiet hours 22:00-08:00." Workers check before sending.
- Delivery tracking: Store delivery status per notification. Allow clients to poll or subscribe via webhooks.
// Notification worker with batching and rate limiting
@KafkaListener(topics = "notifications.email", concurrency = "20")
public void processEmailNotification(List<NotificationRequest> batch) {
// Batch calls to SendGrid (up to 1000 per API call)
Map<Long, UserPrefs> prefs = prefsCache.getAll(
batch.stream().map(r -> r.userId()).collect(toSet())
);
List<Email> toSend = batch.stream()
.filter(r -> prefs.get(r.userId()).isEmailEnabled())
.filter(r -> !isDuplicate(r.dedupKey()))
.map(this::buildEmail)
.collect(toList());
sendGridClient.sendBatch(toSend); // batch API call
trackDelivery(toSend);
}
Critical features: Deduplication (idempotency key per notification event). Rate limiting per channel (APNs has per-app limits, SMS providers charge per message). Priority queues β transactional notifications (OTP) ahead of marketing. Quiet hours enforcement. Opt-out handling. Dead letter queues for failed deliveries with alerting.
Components:
- URL Frontier: The queue of URLs to crawl. Prioritized by importance (PageRank, recency, domain freshness). Per-domain queues to enforce politeness delays. Backed by a distributed queue (Kafka) with priority tiers.
- Fetcher workers: Download HTML. Respect robots.txt. Enforce politeness (min 1s between requests to the same domain). Distributed across many IPs. Handle redirects, compression, encoding.
- Parser: Extract links, metadata, content. Add new URLs to the frontier. Compute a content hash for deduplication.
- Deduplication store: Bloom filter for URL-level dedup (fast, probabilistic β no false negatives). Content hash store (SHA-256) for near-duplicate detection (canonicalize URL, check hash before storing).
- Content store: Raw HTML in object storage (S3). Parsed content + metadata in a document store for search indexing.
// URL dedup with Bloom filter (memory-efficient)
BloomFilter<String> seenUrls = BloomFilter.create(
Funnels.stringFunnel(UTF_8),
10_000_000_000L, // 10B URLs
0.001); // 0.1% false positive rate
String canonicalize(String url) {
// Remove tracking params, normalize scheme, lowercase host
return url.replaceAll("[?&](utm_|fbclid|gclid)[^&]*", "")
.toLowerCase().replaceAll("/$", "");
}
void enqueueIfNew(String url) {
String canonical = canonicalize(url);
if (!seenUrls.mightContain(canonical)) {
seenUrls.put(canonical);
frontier.enqueue(canonical);
}
}
Politeness: Respect robots.txt (cache per domain, refetch weekly). Minimum delay between requests to the same host (crawl-delay directive or default 1s). Honor Crawl-Delay directive. Set a descriptive User-Agent. Don't crawl during peak hours for the site if possible (considerate crawling).
The core challenge: multiple users editing the same document concurrently without conflicts.
Operational Transformation (OT): Each edit is expressed as an operation (insert at position 5, delete at position 10). When two concurrent operations are received, transform them against each other so both apply correctly. Used by Google Docs. Complex to implement correctly for all edge cases.
CRDTs (Conflict-free Replicated Data Types): Mathematical structures that automatically merge concurrent edits without conflicts. No central server needed for conflict resolution. Examples: Logoot, LSEQ, Yjs (JavaScript), Automerge. Modern systems (Notion, Figma) increasingly use CRDTs.
Architecture:
// Simplified OT flow
// User A inserts "X" at position 5 (op: {type:"insert", pos:5, char:"X"})
// User B deletes char at position 3 (op: {type:"delete", pos:3})
// Both sent to server simultaneously
// Server transforms A's op against B's op:
// B deleted at position 3, so A's position 5 β 4 (shift left by 1)
// Transformed A: {type:"insert", pos:4, char:"X"}
// Both clients apply both ops in the same transformed order β consistent state
// WebSocket-based real-time sync
// Connection server: holds WebSocket per user per document
// Each user's edits β broadcast to all other users in the same document session
// Document state: event sourced β full history of operations
// Snapshot every N ops for fast document loading
// Presence: who's currently in the document, cursor positions (ephemeral, via pub/sub)
Consistency model: Strong eventual consistency β all clients eventually converge to the same document state. Offline editing supported β client queues ops locally, syncs when reconnected, server transforms queued ops against ops that occurred during offline period.
Payment systems are the poster child for reliability engineering. A double charge or missed payment is a serious business failure.
Key requirements: Exactly-once processing (no double charges). Idempotency for all operations. Audit log of every state transition. Reconciliation with external payment providers. Strong consistency for balance operations.
Core design patterns:
- Idempotency keys: Every payment request carries a client-generated UUID. Server stores the result indexed by this key. Retries return the same result.
- State machine: Payment goes through explicit states:
PENDING β PROCESSING β COMPLETED / FAILED / REFUNDED. State transitions are atomic DB updates with optimistic locking (version column). - Outbox + event sourcing: Every state change stored as an immutable event. The outbox publishes events to downstream systems (notification, analytics, ledger). Complete audit trail for compliance.
- Reconciliation: Nightly (or continuous) reconciliation job compares internal payment records against payment provider statements. Automatically flags discrepancies for investigation.
@Transactional
public Payment processPayment(PaymentRequest request) {
// Idempotency check
Optional<Payment> existing = paymentRepo.findByIdempotencyKey(request.idempotencyKey());
if (existing.isPresent()) return existing.get();
Payment payment = Payment.create(request); // PENDING state
paymentRepo.save(payment);
try {
// Call payment provider (Stripe, Adyen, etc.)
ProviderResponse resp = stripeClient.charge(payment.asStripeRequest());
payment.complete(resp.chargeId()); // COMPLETED state
outboxRepo.save(new PaymentCompletedEvent(payment)); // same transaction
} catch (PaymentDeclinedException e) {
payment.fail(e.declineCode());
outboxRepo.save(new PaymentFailedEvent(payment));
}
paymentRepo.save(payment);
return payment; // transaction commits: payment + outbox event atomically
}
Never: Charge the card before creating the DB record (if DB write fails, money taken but no record). Store card numbers β use tokenization (PCI DSS). Allow payment state to go backwards (enforce valid transitions).
Requirements: Globally unique IDs. Sortable by creation time (so DB indexes stay efficient β sequential inserts are fast). High throughput (100K+ IDs/sec). No coordination (no DB round trip). Fits in 64 bits.
Snowflake format (64 bits):
// 64-bit Snowflake ID layout
[1 bit: 0] [41 bits: millisecond timestamp] [10 bits: machine ID] [12 bits: sequence]
// 2^41 ms = 69 years from epoch 1024 machines 4096 IDs/ms/machine
// 1024 machines Γ 4096 IDs/ms = ~4M IDs/ms = ~4B IDs/sec globally
long generateId() {
long timestamp = System.currentTimeMillis() - EPOCH; // custom epoch (e.g. 2020-01-01)
if (timestamp == lastTimestamp) {
sequence = (sequence + 1) & 0xFFF; // 12 bits, max 4095
if (sequence == 0) {
// Sequence exhausted β wait for next millisecond
while (timestamp == lastTimestamp)
timestamp = System.currentTimeMillis() - EPOCH;
}
} else {
sequence = 0;
}
lastTimestamp = timestamp;
return (timestamp << 22) // shift timestamp to top 41 bits
| (machineId << 12) // machine ID in middle 10 bits
| sequence; // sequence in bottom 12 bits
}
Machine ID assignment: Use ZooKeeper/etcd to assign unique machine IDs to each service instance at startup. Or use a hash of the container ID / IP. Must handle machine ID recycling when instances are replaced.
Alternatives: UUIDv7 (time-ordered UUIDs, 128-bit, standardized, no coordination, collision-resistant). ULID (Universally Unique Lexicographically Sortable Identifier β 128 bits, base32 encoded). Use UUIDv7 for simplicity unless you specifically need 64-bit IDs for storage efficiency.
Requirements: Return top 5β10 suggestions for a query prefix. <100ms response time. Millions of QPS. Personalized vs. global suggestions. Updated as search trends change.
Data structure β Trie: A prefix tree where each node represents a character. Traversing "app" reaches a node with children representing all searches starting with "app". Each node stores the top-K suggestions for that prefix (ranked by search frequency). Querying any prefix is O(prefix length). In-memory trie for fast access.
Scale challenges: A trie for billions of searches doesn't fit in one machine's memory. Solutions:
- Partition the trie by first character (a-z: 26 shards).
- Or use a consistent-hashing-based distribution by prefix.
- Cache top prefixes in Redis (most queries are for common short prefixes β cache "a", "ap", "app" covers huge traffic).
Building and updating the trie:
// Offline trie building (runs hourly or daily):
// 1. Aggregate search log events from Kafka into hourly counts
// 2. Compute weighted score: recency Γ frequency
// 3. Rebuild trie with top-K suggestions per prefix
// 4. Serialize and distribute to all autocomplete servers
// Real-time updates (for trending queries):
// Maintain a separate "trending" Redis sorted set
// ZINCRBY trending 1 "query" on each search event
// Autocomplete servers merge trie results with trending results
// Redis sorted set for prefix lookup (simpler alternative to trie for moderate scale)
// Store all query strings; on prefix "app", use ZRANGEBYLEX:
redis.zrangebylex("suggestions", "[app", "[app\xff", 0, 10)
// Returns all strings starting with "app" alphabetically
// Rank by score (separate ZSCORE lookup) β return top-K
Personalization: Blend global suggestions with user's personal search history. Store the user's recent and frequent searches in Redis. At serve time, merge global trie results with personal history using a weighted ranking.
Upload pipeline:
- Upload API: Accept the raw video file. Stream directly to object storage (S3) using pre-signed upload URLs β never buffer the whole file in the API server.
- Transcoding service: Video transcoding is CPU-intensive. Use a message queue (SQS/Kafka). A fleet of transcoding workers (GPU or CPU instances) pick up jobs. Transcode to multiple resolutions (2160p, 1080p, 720p, 480p, 360p) and formats (H.264 for compatibility, AV1 for efficiency). Output to S3 per resolution.
- Thumbnail generation: Extract frames at intervals, rank by visual quality, store top-N as thumbnails.
- Metadata service: Store title, description, tags, duration, transcoding status in a DB. Trigger search indexing on publish.
Delivery pipeline:
- Adaptive bitrate streaming (ABR): Videos served as segments (2β6 second chunks) via HLS or DASH. The client player monitors bandwidth and switches quality levels dynamically. A 10-minute video at 1080p is split into ~100 segments stored in S3.
- CDN: Video segments are served from CDN edge nodes (Cloudflare, Akamai, CloudFront). First viewer of a segment triggers an origin pull from S3 and caches it at the edge. Subsequent viewers in the same region get the cached segment. CDN is critical β video delivery without CDN is infeasible at scale.
- Resumability: S3 pre-signed URLs per segment. Client tracks which segment it's on. Seeking = fetch segments starting from the seek position.
# HLS manifest (m3u8) β served to the player
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
https://cdn.example.com/videos/abc123/1080p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1280x720
https://cdn.example.com/videos/abc123/720p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=854x480
https://cdn.example.com/videos/abc123/480p/index.m3u8
# Player switches resolution based on measured bandwidth
# Each resolution index.m3u8 lists 2-6 second .ts segments
Storage cost optimization: Tier storage by video age/popularity. Hot (recently uploaded, popular) on S3 Standard. Cold (old, low-view) on S3 Glacier. Deduplicate identical video uploads (content hash comparison). YouTube stores ~500 hours of video uploaded per minute β storage optimization is essential.