Engineering · 9 min read · February 26, 2024

Error Handling Is Not Exception Handling: The Patterns That Actually Scale

Most error handling in production codebases is reactive rather than intentional — try/catch blocks that were added after something broke, not designed to make failure modes observable and recoverable.

Production incidents follow a consistent pattern in most systems: something fails, the error is caught somewhere in the call stack, a generic error message is logged, the system continues operating in an undefined state, and an engineer spends hours reconstructing the failure from insufficient evidence. The cause of the incident was in the code. The duration of the incident was in the error handling — or rather, the absence of intentional error handling.

The distinction between exception handling and error handling is more than semantic. Exception handling treats errors as exceptional, unexpected events to be caught and suppressed. Error handling treats errors as first-class parts of the system design — expected, modeled, and made observable.

Model Errors as Values

The most impactful shift in error handling philosophy is treating errors as values rather than exceptions — returning them from functions rather than throwing them. A function that can fail should return a result that encodes both the success case and the failure case, forcing the caller to handle both explicitly rather than optionally wrapping the call in a try/catch.

This pattern is enforced by the type system in Rust (Result) and Haskell (Either), and available as a convention in most other languages. In TypeScript, a simple result type — { ok: true; value: T } | { ok: false; error: E } — makes error handling visible at every call site and eliminates the category of bugs where errors are silently swallowed by an empty catch block.

Distinguish Error Classes Explicitly

Not all errors are the same, and treating them as the same produces error handling code that can't respond appropriately to different failure modes. The most useful error taxonomy distinguishes three classes:

- Expected failures (user input validation, network timeouts, rate limits): handle gracefully and return to the caller.
- Unexpected failures (null references, assertion violations, invariant breaches): these indicate a programming error and should crash loudly.
- System failures (database unavailability, disk full, out of memory): these require operational response rather than code-level handling.

Expected failures should be explicit in function return types. Unexpected failures should never be caught silently — if you catch an unexpected failure, log everything useful and then rethrow or crash. System failures should be detectable before they cause data corruption and surfaced to monitoring immediately.
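These rules can be sketched as distinct error types with a boundary handler that treats each class differently. The class names, status codes, and monitoring hook below are illustrative assumptions, not a prescribed API:

```typescript
// Hypothetical error classes, one per category from the taxonomy above.
class ValidationError extends Error {}  // expected: return to caller
class InvariantError extends Error {}   // unexpected: crash loudly
class DependencyError extends Error {}  // system: surface to monitoring

// A boundary handler that responds to each class appropriately.
function handleError(err: unknown): { status: number; body: string } {
  if (err instanceof ValidationError) {
    // Expected failure: graceful response, no alarm.
    return { status: 400, body: err.message };
  }
  if (err instanceof DependencyError) {
    // System failure: alert operations, degrade gracefully.
    // alertMonitoring(err);  // assumed monitoring hook
    return { status: 503, body: "dependency unavailable" };
  }
  // Unexpected failure: log everything useful, then rethrow.
  // Never convert a programming error into a quiet 500.
  console.error("unexpected failure", err);
  throw err;
}
```

The key property is the fall-through: anything not explicitly modeled as expected or operational is treated as a programming error and propagates, rather than being absorbed by a generic catch.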

What Gets Logged Determines What Gets Debugged

The quality of your error handling is measured at 3 AM during an incident, when the information logged at the time of failure is all you have. A log message that says "Error processing request" tells you nothing. A log message that includes the request identifier, the user identifier, the operation being performed, the specific error code and message from the dependency that failed, and the retry count tells you almost everything you need to reconstruct the failure sequence.

Log at the boundary where context is richest, not at the boundary where the error is first detected. The error is often detected deep in a call stack where only the immediate context is available. The boundary with the full context — the request handler, the background job, the webhook processor — is where a useful log entry can be created by aggregating the context from the full call chain.
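One way to sketch this in TypeScript is a structured log entry assembled at the boundary, where the request-level identifiers live. The `RequestContext` shape and field names here are assumptions for illustration:

```typescript
// The boundary (request handler, job runner, webhook processor) owns
// this context; code deep in the call stack does not.
interface RequestContext {
  requestId: string;
  userId: string;
  operation: string;
}

// Build one structured entry with everything needed to reconstruct
// the failure: identifiers, operation, dependency error, retry count.
function logFailure(ctx: RequestContext, err: Error, retryCount: number): string {
  return JSON.stringify({
    level: "error",
    requestId: ctx.requestId,
    userId: ctx.userId,
    operation: ctx.operation,
    errorMessage: err.message,
    retryCount,
  });
}

// At the boundary: catch, enrich with full context, then log once.
const ctx = { requestId: "req-123", userId: "u-456", operation: "charge-card" };
console.error(logFailure(ctx, new Error("card processor timeout"), 2));
```

Compare that entry to "Error processing request": the structured version answers who, what, and how many attempts without a single follow-up query.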

Error Handling in Code Review

A checklist for evaluating error handling in code review:

- Does every function that can fail make that fact visible in its signature?
- Are expected failures handled explicitly rather than caught generically?
- Does every catch block do something specific rather than swallowing the error?
- Are logged errors rich enough to reconstruct the failure context?
- Is there a recovery path for transient failures (retries with backoff) distinct from the handling of permanent failures?

These questions catch the majority of error handling antipatterns before they become production incidents.
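The transient-versus-permanent distinction in that last question can be sketched as a small retry helper with exponential backoff. The `isTransient` predicate is an assumed caller-supplied classifier, leaning on the taxonomy from earlier:

```typescript
// Retry transient failures with exponential backoff; let permanent
// failures propagate immediately instead of burning retry attempts.
async function withRetry<T>(
  fn: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Permanent failure, or out of attempts: propagate to the boundary.
      if (!isTransient(err) || attempt >= maxAttempts) throw err;
      // Exponential backoff: 100ms, 200ms, 400ms, ...
      const delay = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

A validation error passed through this helper fails on the first attempt, as it should: retrying a permanent failure only delays the inevitable and hides it from the logs.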

Try CodeMouse on your next PR

Free AI code review on every pull request. Bring your own API key — no subscription needed.

Install on GitHub — Free