The One Question That Made Me a Better Engineer

A senior engineer once derailed my entire design review with five words.

I'd spent a week on an elegant architecture. Diagrams, abstractions, the works. I walked everyone through it, proud. When I finished, she didn't poke at my abstractions or question my database choice. She just asked:

"What happens when this fails?"

I didn't have an answer. I'd designed the entire thing for the path where everything works. That question has lived in my head ever since, and it's done more for my engineering than any framework, book, or course. Now I ask it everywhere.

Quick Answer

The question is: "What happens when this fails?" Most engineers design for the happy path — the case where everything works. Asking what happens when each part fails surfaces the missing error handling, the silent data corruption, the cascading outage, and the bad assumptions before they reach production. Ask it of every component, every external call, and every design doc.

We're trained to build the happy path

Tutorials, demos, and deadlines all push us toward the path where everything works.

The example code never shows the network timing out. The demo never has the disk fill up. So we learn to think of the success case as "the code" and error handling as an annoying afterthought we'll add if there's time. There's never time.

But production is mostly the unhappy path. Networks fail. Disks fill. Services return garbage. Users do things no sane person would do. The difference between code that works in a demo and code that works in production is almost entirely in how it handles things going wrong — and internalizing that gap is a big part of the brutal truth about becoming a senior developer. Even the MDN reference on error handling frames exceptions and rejected promises as the normal shape of real programs, not edge cases.

Anyone can write the success case. Engineering is what you do about everything else.

Code and error messages on a developer's screen Photo by Ilya Pavlov on Unsplash

What the question actually surfaces

"What happens when this fails?" isn't one question — it unpacks into a dozen.

Point it at any external call and watch the gaps appear:

What if it times out? Do we retry, and if so, will the retry cause a double-charge or duplicate record?
What if it returns an error? Do we handle it, or does it crash three layers up with a confusing stack trace?
What if it returns success but garbage data? Do we validate, or trust it and corrupt our own state?
What if it's slow but doesn't fail? Does our whole system back up waiting on it?
What if it fails halfway through? Are we left in a consistent state, or a partial one nobody can recover?

Every one of those is a real incident I've seen happen. Every one of them is invisible until someone asks the question out loud. That's the magic of it — it converts vague unease into a specific checklist.

How I use it in design docs

I now write a "failure modes" section into every design doc, and it's reliably the most valuable part.

Before, my design docs described the system working. Now, for each major component, I write down how it fails and what we do about it. This does two things. It forces me to actually think it through before I've written a line of code, when changes are cheap. And it makes the review far better, because reviewers can poke at my failure handling instead of just admiring my boxes and arrows.

The shift in mindset shows up in concrete decisions:

Happy-path thinking	"What happens when this fails?" thinking
Call the service, use the result	Handle timeout, error, and bad data
Assume the write succeeds	Plan for half-finished writes
One retry fixes flakiness	Retries must be safe to repeat
Errors bubble up somehow	Errors have an owner and a plan

A design document and notes on a desk Photo by Priscilla Du Preez on Unsplash

It changed how I review other people's code

The question turned my code reviews from cosmetic to substantive.

I used to comment on naming and style — the easy stuff. Now my first pass through any change is just hunting for unhandled failure. Where does this trust an external call it shouldn't? What happens if this loop gets an empty list, a null, a value ten times bigger than expected? Where does an error get swallowed silently so we'll never know it happened?

These questions catch the bugs that actually page you at night. Naming is real, but a misnamed variable never took down production. An unhandled timeout did. Reviewing for failure modes is how you stop the expensive bugs at the cheapest possible moment.

It also models the thinking for everyone else. When the same question keeps showing up in reviews, the whole team starts pre-empting it, and the unhappy-path thinking spreads on its own.

The deeper habit: assume nothing is reliable

Underneath the question is a worldview, and adopting it was the real upgrade.

Everything your code talks to will eventually fail. The database, the network, the third-party API, the disk, the other team's service, even your own code under unexpected load. Treating every dependency as fundamentally unreliable isn't pessimism — it's accuracy. The systems that survive are the ones built by people who assumed their parts would betray them and planned accordingly.

You don't have to handle every failure exhaustively for every component — some failures are acceptable to log and move on from. But you have to decide, consciously, for each one. The question forces the decision into the open instead of leaving it to a 2 a.m. surprise.

A little automation helps make this real rather than theoretical: failure cases that are easy to forget become tests, so "what happens when this fails" has an answer that runs on every build.

The outage that proved the point

I learned how much this mattered the expensive way, on a system I'd built before I started asking the question.

We had a feature that called an external service to enrich some data. In every test, every demo, every day for months, that service answered quickly and correctly. I'd written the code assuming it always would. There was no timeout, no fallback, no plan for the service being slow rather than down.

Then one afternoon the service didn't fail — it just got slow. Responses that normally took a few milliseconds started taking thirty seconds. My code waited politely for every one of them. Requests piled up behind those waits. Threads got consumed holding open connections that were going nowhere. Within minutes, a problem in someone else's system had taken down ours entirely, because I'd never asked what happens when this is slow but doesn't fail.

The fix, once I understood it, was small: a timeout, a fallback to cached data, and a circuit breaker that stopped hammering the struggling service. Maybe twenty lines. Twenty lines that would have prevented the whole outage if I'd written them up front, when the change was free instead of an incident.

That's the thing about the question. The cost of asking it is a few minutes of thinking and a handful of lines. The cost of not asking it is a war room, an apology, and a postmortem with your name on it.

The failures you don't plan for don't go away. They just wait for the worst possible moment to introduce themselves.

Now "what happens when this is slow?" is right there next to "what happens when this fails?" — because down is obvious, and slow is the one that quietly takes everything with it.

Why this one question beats a hundred rules

You might wonder why I lean on a single question instead of a long checklist of resilience patterns. The answer is that questions travel and checklists don't.

A checklist of failure-handling rules is long, intimidating, and easy to skip when you're busy. You read it once, nod, and forget it under deadline pressure. A question is small enough to actually carry around in your head. "What happens when this fails?" fits in the gap between writing a line and moving to the next one. It costs nothing to ask and it works in any language, any framework, any layer of the stack.

It also scales with your experience. A junior asking it might just remember to add error handling. A senior asking it thinks about retries, idempotency, partial failure, cascading load, and recovery — the same five words unlock progressively deeper answers as you grow. The question doesn't get stale; you just bring more to it. It's also one of the quiet reasons some junior developers get stuck and others don't: the ones who climb fastest start interrogating the failure path early.

And it spreads. When you ask it consistently in reviews and design docs, the people around you start pre-empting it. The question becomes a shared reflex, and a team that reflexively designs for failure builds systems that survive production. That cultural ripple is something no static checklist ever achieves, because a checklist is a document and a question is a habit.

Give a team a checklist and they'll skip it. Give them one good question and they'll ask it forever.

Try carrying this one question into your next design review or pull request, and follow along if you want more small habits that quietly compound into better engineering.

FAQ

Q: Won't handling every failure make the code bloated and slow to write? You don't handle every failure heavily — you consciously decide what to do for each one, which can be "log it and move on." The cost is a little thinking up front; the saving is the production incidents you never have.

Q: When's the best time to ask this question? At design time, when changes are cheapest — write a failure-modes section into the design doc. Then again in code review as your first pass. Catching a failure gap on paper is far cheaper than catching it in an outage.

Q: How is this different from just writing tests? The question drives the tests. First you ask how each part fails, then you write tests for those failure modes. Without the question, people only test the happy path and call it covered.

Q: Is this just for senior engineers? No — it's exactly how you become more senior. The habit of designing for failure is most of what separates engineers whose systems survive production from those whose systems surprise everyone.

The bottom line

One question reshaped how I build software: "What happens when this fails?" It pulls the unhappy path out of the shadows, turns vague worry into a concrete checklist, and catches the expensive bugs while they're still cheap.

The happy path is the part anyone can write. Everything that makes you a real engineer is in the answer to that one question.

What's one place in your current system where you've never actually asked it?