Alert Fast

A tension I see in many teams and projects is the choice between “failing fast” and “failing gracefully.” Here’s a simplified example of a pattern I see far too often, and it gets to the core of the dilemma.

public String generateGreeting() {
  try {
    return "Hello " + getName();
  } catch (Exception e) {
    // Just in case getName() throws an exception
    return "Hello";
  }
}

and its close cousin:

public String generateGreeting() {
  String name = getName();
  // Not sure if getName() will return null??
  if (name == null) { return "Hello"; }
  else { return "Hello " + name; }
}

On the face of it, this looks extremely reasonable. When working with complex legacy systems, it’s not always clear whether getName() may produce an exception or return a null value. Every good engineer should aim to please their users, and users certainly hate seeing errors or “nulls”. “So of course we should check for errors/nulls and add fallback logic to handle them.”

There are a couple of problems with this approach. First, it makes your system brittle. Instead of debugging and fixing the root cause of your exceptions and nulls, you’re allowing them to proliferate. Instead of having one definitive way of accomplishing a specific piece of functionality, you’ve got multiple implementations, used only under specific fallback conditions, and it’s unclear when, or even if, they are being exercised. Fast forward 5 years, and you’ve got a legacy system that is bloated in size, with twice the cyclomatic complexity.

“Tough luck, that’s your job,” you might say. But there’s a second problem as well, one that customers and product owners do care about. Every piece of fallback logic is, by definition, not as good as the real thing. It worsens the user experience in some way, perhaps through latency, inaccurate data, or the disabling of useful features. By using fallbacks everywhere, you might prevent short-term outages, but you are slowly choking the long-term user experience.

Limitations of Failing Fast

This is why many developers ardently recommend Fail-Fast as a design philosophy. To quote an article on this topic: “Preventing something from failing while it’s going to fail doesn’t solve anything. It does not solve the problem, it just hides the problems. And the longer it takes for the problems to appear on the surface, the harder it is to fix and the more it costs… If an error occurs, fail immediately and visibly. If something unusually or unexpectedly occurs, let the software fail immediately instead of postponing the failure or working around the failure.”

To go back to the original code snippet, here’s the alternative they would recommend instead:

public String generateGreeting() {
  return "Hello " + Preconditions.checkNotNull(getName());
}

Ah, so much cleaner!

As much as I would love to wholeheartedly embrace the above approach, there are some situations where it is simply not practical. When working on mission-critical software, or software that is used by hundreds of millions of users, failing “immediately and visibly” is sometimes far too disastrous.

And then there’s the “human” aspect to consider. Imagine you have just joined a new team, this is one of the first features you’re working on, and you barely understand the monstrosity of legacy code that you’ve inherited. “Failing visibly” might be good for the long-term project, but it’s hardly going to make a positive first impression on the rest of the team. The same applies if you have a risk-averse manager, are up for promotion, or your team is under particularly harsh scrutiny from C-level executives. In an ideal world, such “political” factors shouldn’t play into tech decisions. In the real world, you can ill afford to ignore such complications.

“But wait!” you say. “This is exactly why we have unit tests, integration tests, and a QA process!” If you truly have a bulletproof test suite and QA process, you are absolutely right. Feel free to take all the risks you want. I’ve personally worked at many different companies, including the more prestigious ones, and I’ve never seen such a bulletproof test suite. In most teams, tests are an exercise in risk-mitigation, not risk-elimination.

A Third Way

Fortunately, there is a third way, one that many developers don’t consider nearly as often as they should: combining fallbacks with operational alerts. Going back to the earlier code snippet, here’s what the “alert fast” approach would look like:

public String generateGreeting() {
  try {
    return "Hello " + Preconditions.checkNotNull(getName());
  } catch (Exception e) {
    alert("Unexpected error when fetching user's name", e);
    return "Hello";
  }
}

On the surface, it looks almost identical to the fail-gracefully approach, except for the alert method call. The idea here is that any time you use fallback logic, you alert someone or something, so that it can be looked at and debugged. This way, all bugs in the main logic are surfaced and fixed, instead of hiding in nooks and crannies. This ensures that any user-experience degradation caused by fallbacks is temporary rather than permanent.

This approach can also be used as an intermediate step towards eliminating the fallback logic entirely. When making code changes that you think are risky, add fallback logic and programmatically keep track of how often it gets triggered. Every time it gets triggered, debug the root cause and fix the error in your main logic. Once you’ve confirmed that the fallback is never triggered anymore, you can safely remove it. This gives you all the long-term benefits of not having to maintain fallback code, without the short-term dangers of failing fast.
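
To make “programmatically keep track of how often it gets triggered” concrete, here is a minimal sketch. The class FallbackTracker and its methods are hypothetical names invented for illustration, not part of any library, and it assumes the main logic signals failure by throwing a runtime exception.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Hypothetical helper (not from any library): counts how often each named
// fallback path is taken, so you can tell when it is safe to delete it.
final class FallbackTracker {
  private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

  // Runs the main logic; if it throws a runtime exception, records the hit
  // under the given name and returns the fallback's result instead.
  <T> T withFallback(String name, Supplier<T> mainLogic, Supplier<T> fallback) {
    try {
      return mainLogic.get();
    } catch (RuntimeException e) {
      counts.computeIfAbsent(name, k -> new LongAdder()).increment();
      return fallback.get();
    }
  }

  // Number of times the named fallback has been triggered so far.
  long timesTriggered(String name) {
    LongAdder adder = counts.get(name);
    return adder == null ? 0 : adder.sum();
  }
}
```

A call might look like tracker.withFallback("greeting", () -> "Hello " + getName(), () -> "Hello"). Once timesTriggered("greeting") has stayed at zero for long enough to cover your real traffic patterns, the fallback is a candidate for removal.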

Ways to Alert

You might be wondering what exactly the alert method does. It does seem a little magical. The implementation details of this method can vary wildly based on the project you’re working on. There’s only one real requirement of the alert method: it has to ensure that if it gets called too frequently, it will get looked at by a real human being.

At major companies like Amazon, there already exists internal tooling that allows for efficient collection of code “metrics”, i.e., counts of how often specific events occur, via a simple method call. You can also set up alerts on these metrics, such that your team’s on-call gets automatically paged if specific events happen more frequently than expected. This allows for efficiently tracking how often fallbacks are being used, and alerting if they are happening unusually frequently due to a bug in the main code.
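
As a rough illustration of the “page if it happens more often than expected” idea, here is a small sliding-window sketch. It is not how any particular company’s tooling works; ThresholdAlerter, its parameters, and the pageOnCall callback are all hypothetical stand-ins for a real metrics and paging integration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: pages a human only when a fallback fires more than
// `threshold` times within a sliding time window.
final class ThresholdAlerter {
  private final int threshold;
  private final long windowMillis;
  private final Runnable pageOnCall; // stand-in for a real paging integration
  private final Deque<Long> recentEvents = new ArrayDeque<>();

  ThresholdAlerter(int threshold, long windowMillis, Runnable pageOnCall) {
    this.threshold = threshold;
    this.windowMillis = windowMillis;
    this.pageOnCall = pageOnCall;
  }

  // Records one fallback event at the given timestamp.
  // Returns true if this event pushed the count over the threshold and paged.
  synchronized boolean record(long nowMillis) {
    recentEvents.addLast(nowMillis);
    // Drop events that have aged out of the window.
    while (!recentEvents.isEmpty()
        && nowMillis - recentEvents.peekFirst() >= windowMillis) {
      recentEvents.removeFirst();
    }
    if (recentEvents.size() > threshold) {
      pageOnCall.run();
      recentEvents.clear(); // avoid paging repeatedly for the same burst
      return true;
    }
    return false;
  }
}
```

Passing the timestamp in explicitly, rather than reading the clock inside record, keeps the sketch easy to test. In production code you would typically call it with System.currentTimeMillis().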

There are probably cloud services or open-source tools that offer similar functionality, but if you’re working at a startup or a small or medium-sized business, you can probably get away with something more lightweight. For example, a startup that I used to work for had tooling to automatically cut YouTrack tickets via a simple method call.

And then there are even simpler ways to do it for personal projects. For example, you could use a tool that constantly scans your application logs and emails you a summary of all log.error or log.fatal messages. The approach I’ve used in many personal projects is to do alerts via simple SNS calls. Just publish error messages to a specific topic, and depending on the severity level associated with the topic, you can configure SNS to forward all alerts in the form of emails, SMS messages, and/or PagerDuty alarms.
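
The log-scanning idea can be sketched in a few lines. This is only an illustration under an assumed log format (“timestamp LEVEL message”); LogErrorSummary is a hypothetical name, and actually emailing the summary is left out.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: summarizes ERROR and FATAL log lines by message so a
// digest can be sent to a human. Assumes lines look like
// "<timestamp> <LEVEL> <message>".
final class LogErrorSummary {
  private static final String[] LEVELS = {"ERROR ", "FATAL "};

  // Returns a count per distinct "LEVEL message" string, in first-seen order.
  static Map<String, Integer> summarize(List<String> logLines) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String line : logLines) {
      for (String level : LEVELS) {
        int idx = line.indexOf(level);
        if (idx >= 0) {
          // Strip the timestamp so identical errors group together.
          counts.merge(line.substring(idx).trim(), 1, Integer::sum);
          break;
        }
      }
    }
    return counts;
  }
}
```

Grouping by message rather than listing every line keeps the digest short enough that someone will actually read it, which is the one real requirement of an alert.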

Right Tool for the Job

I certainly don’t mean to imply that you should always alert-fast. Depending on your specific project and task, fail-fast is certainly a valid alternative if you have a good understanding of the code’s invariants, or can afford to take some risks. The fail-fast code is certainly simpler and easier to maintain, so that is the platonic ideal you should strive for.

But when you do find yourself absolutely needing a fallback mechanism, the alert-fast approach is far superior to blind fallbacks. It gives you all the benefits of having a fallback, while also giving you visibility into why, when, and how often the fallback is needed. It surfaces potential bugs in the main logic and, if desired, gives you a safe way to transition towards a simpler fail-fast implementation. Sunlight is the best disinfectant, and alerts are absolutely vital to sniff out bugs in your code.


Related Links

Humorous war story about the dangers of using fallback logic:

The program’s purpose was to compute certain contribution rates for certain kinds of pension funds. It did this by writing out a big CSV file. The results of this CSV file were inputs to other programs. Another program, the benefits distributor, was supposed to alert people when contributions weren’t enough for projections.

Noticing that there was no output from the first program since it had crashed, it treated this case as “all contributions are 0”. This immediately caused a massive cascade of alert emails to the internal pension fund managers. They promptly started flipping out, because one reason contributions might show up as insufficient is if projections think the economy is about to tank.

We were able to resolve the issue by hotpatching the script. But by then, substantive damage had already been done because contributions hadn’t been processed that day. It cost about $1.7M to manually catch up over the next two weeks.
