Structured Logging in Production: The Patterns That Make Incidents Solvable
Most logging is written to make the system appear to be working, not to help engineers understand what's happening when it isn't. Here's the difference — and the logging practices that actually help.
There is a type of log that engineers write to satisfy the feeling that they're logging things, and a type of log that engineers write because they've been on call and know what information they'll need at 3 AM. The difference between these two types of logs is the difference between incidents that take 20 minutes to resolve and incidents that take four hours.
Most production logging falls into the first category: log entries that confirm that the system did what it was supposed to do, written while the system was working, by engineers who didn't yet know what would go wrong. The second category requires a different mindset: logging written as if you're leaving a message for the future engineer who needs to understand a failure you haven't experienced yet.
Structured vs. Unstructured Logs
Unstructured log output — human-readable strings like "Processing order 12345 for user john@example.com" — is easy to write and readable in a terminal. It's difficult to query, difficult to aggregate, and impossible to build dashboards from. When you have millions of log lines and need to find all requests that failed for users in a specific region with a specific error code, unstructured logs require regex searches that are slow, fragile, and can't be pre-indexed.
Structured logging outputs log entries as machine-readable key-value objects — JSON in most implementations. Each field is explicitly named and typed. The trade-off is terminal readability: structured logs are harder to scan by eye but far easier to query, aggregate, and index in a log aggregation tool. At production scale, you're always querying log aggregation tools, never tailing terminals. Structured logging is the right default for production systems.
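As a minimal sketch of the difference, here is a structured JSON formatter built on Python's stdlib logging module. The field names (`timestamp`, `level`, `message`) and the convention of passing structured fields through the `extra` argument are illustrative choices, not a standard schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S")
                         + f".{int(record.msecs):03d}Z",
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge any structured fields passed via logger's `extra` argument.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Unstructured: "Processing order 12345 for user u-789" — a string to regex later.
# Structured: every value is a named, pre-indexable field.
logger.info("order_processed", extra={"fields": {"order_id": "12345", "user_id": "u-789"}})
```

In production you would more likely reach for a library like structlog or your platform's logging SDK, but the output shape — one flat JSON object per event — is the same.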
The Fields That Matter
Every log entry at the INFO level or above should include: a timestamp with millisecond precision, a severity level, a request or trace ID that correlates all log entries for a single user request, the operation name, and the duration for operations that have meaningful duration. Error log entries should additionally include: the full error message and stack trace, the specific identifier of the resource that failed (user ID, order ID, request path), the dependency that failed if the error originated externally, and any retry information if the operation was retried.
These fields feel like overhead when you're writing the code. They feel like lifelines when you're debugging an incident and every piece of context is available in the log rather than requiring a database query to reconstruct.
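The error-entry fields above can be made concrete with a small helper. The function name and field names here are hypothetical — the point is that every field listed earlier has an explicit, queryable slot:

```python
import json
import time
import traceback
import uuid

def error_log_entry(operation, error, *, trace_id, resource_id,
                    dependency=None, retry_count=0, duration_ms=None):
    """Build an ERROR entry carrying the context fields described above.

    Field names are illustrative, not a standard schema.
    """
    return {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S", time.gmtime())
                     + f".{int(time.time() * 1000) % 1000:03d}Z",
        "level": "ERROR",
        "trace_id": trace_id,          # correlates all entries for one request
        "operation": operation,
        "duration_ms": duration_ms,
        "error": str(error),
        "stack": "".join(traceback.format_exception(
            type(error), error, error.__traceback__)),
        "resource_id": resource_id,    # the specific thing that failed
        "dependency": dependency,      # external system, if the error came from one
        "retry_count": retry_count,
    }

try:
    raise TimeoutError("payment gateway timed out after 5000ms")
except TimeoutError as e:
    entry = error_log_entry("charge_card", e,
                            trace_id=str(uuid.uuid4()),
                            resource_id="order-12345",
                            dependency="payments-api",
                            retry_count=2,
                            duration_ms=5003)
    print(json.dumps(entry))
```

An entry like this answers "what failed, for whom, against what, and after how many retries" without a single follow-up database query.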
The PII Problem
Logging enough context to debug incidents creates a tension with privacy: the user ID and email address that make a log entry debuggable are personally identifiable information that may have regulatory implications if stored in a logging system. The practical resolution: log user IDs (opaque identifiers) rather than email addresses (direct PII), ensure your logging infrastructure has appropriate retention policies and access controls, and add a log scrubbing step for any fields that might inadvertently contain PII (request bodies, error messages that might echo back user input).
Code review for logging changes should explicitly check for PII exposure. A log that captures a request body to debug a parsing error might also be capturing a password or credit card number. The rule: log identifiers that refer to sensitive data, never the sensitive data itself.
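A scrubbing step can be sketched as a recursive pass over the entry before it is emitted. The denylist and regex patterns below are deliberately simplistic examples — real systems tune these per domain and treat them as defense in depth, not a substitute for not logging sensitive data in the first place:

```python
import re

# Illustrative denylist and patterns, not an exhaustive PII catalog.
SENSITIVE_KEYS = {"password", "email", "ssn", "card_number", "authorization"}
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def scrub(value):
    """Recursively redact sensitive keys and PII-looking substrings."""
    if isinstance(value, dict):
        return {
            k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else scrub(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [scrub(v) for v in value]
    if isinstance(value, str):
        # Free-text fields (error messages, request bodies) may echo PII.
        value = EMAIL_RE.sub("[EMAIL]", value)
        value = CARD_RE.sub("[CARD]", value)
        return value
    return value

entry = {
    "user_id": "u-789",  # opaque identifier: safe to keep
    "error": "could not parse body for john@example.com",
    "request_body": {"password": "hunter2", "amount": 42},
}
print(scrub(entry))
```

Note that the opaque `user_id` passes through untouched while the direct PII is redacted — exactly the "log identifiers, never the sensitive data itself" rule.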
What Reviewers Should Check
When reviewing code that touches logging: Are structured fields used rather than string concatenation? Does every error log include enough context to identify what operation failed and for which user or resource? Are there any log entries that might capture PII? Are log levels used appropriately (DEBUG for verbose development information, INFO for significant operational events, WARN for recoverable unexpected conditions, ERROR for failures that require attention)? Logs are too important to production operations to skip in review.
Try CodeMouse on your next PR
Free AI code review on every pull request. Bring your own API key — no subscription needed.
Install on GitHub — Free