Designing for Failure: Lessons from Building Distributed Systems

After years of designing distributed systems at scale, the most important mindset shift I've made is this: stop designing for the happy path and start designing for failure.

Every network call will eventually fail. Every database will eventually become unavailable. Every downstream service will eventually time out. The question isn't if these things happen — it's whether your system degrades gracefully when they do.

The Fallacies of Distributed Computing

Peter Deutsch's famous list of fallacies is worth revisiting for anyone new to this space:

The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
Topology doesn't change
There is one administrator
Transport cost is zero
The network is homogeneous

Every one of these gets violated in production. Your architecture should acknowledge them from day one.

Circuit Breakers Are Not Optional

One of the most common failure patterns I see is cascading failures. Service A calls Service B, which calls Service C. Service C starts timing out, so Service B's thread pool fills up, so Service A's thread pool fills up, and now everything is down — even though the original problem was isolated.

The circuit breaker pattern prevents this. When failures exceed a threshold, the circuit "opens" and subsequent calls fail fast instead of waiting for the timeout. This keeps upstream services healthy and gives the downstream service time to recover.

// Pseudocode: circuit breaker logic
if circuitBreaker.isOpen() {
  throw CircuitOpenException("Service C unavailable")
}
try {
  result = callServiceC()
  circuitBreaker.recordSuccess()
} catch (TimeoutException e) {
  circuitBreaker.recordFailure()
  throw
}

Embrace Eventual Consistency

Many engineers resist eventual consistency because it feels like giving up correctness. But strong consistency has a cost: it requires coordination, which introduces latency and reduces availability (see: CAP theorem).

For most business operations — order placement, notification delivery, analytics — eventual consistency is perfectly acceptable. The key is being explicit about which operations require strong consistency (e.g., financial transactions) and designing those with appropriate isolation levels.

Idempotency Is Your Best Friend

In a distributed system, "at-least-once delivery" is far easier to guarantee than "exactly-once." This means your consumers must be idempotent — processing the same message twice should produce the same result as processing it once.

The simplest approach: assign a unique ID to every operation and store processed IDs. Before processing, check if the ID has been seen before.

Observability from Day One

You cannot fix what you cannot see. By the time a distributed system reaches production, you need:

Structured logs with correlation IDs that span service boundaries
Distributed tracing (OpenTelemetry is the standard now)
RED metrics: Rate, Errors, Duration — for every service boundary
Meaningful alerts based on user-facing SLOs, not just raw CPU

Adding observability after the fact is painful. Build it in from the start.

Closing Thoughts

Distributed systems are genuinely hard. The good news is that the patterns are well-understood — circuit breakers, sagas, outbox pattern, idempotent consumers. The discipline is in applying them consistently and resisting the temptation to cut corners under deadline pressure.

Design for failure, and you'll sleep better at night.