I recently came across a story that is just as amusing as it is shocking:
One of my clients is responsible for several of the world’s top 100 pension funds. They had a nightly batch job that … crashed. No one knew what was wrong at first. This batch job had never, ever crashed before, as far as anyone remembered or had logs for.
The person who originally wrote it had been dead for at least 15 years, and in any case hadn’t been employed by the firm for decades. The program was not that big… but it was fairly impenetrable — written in a style that favored computational efficiency over human readability. And of course, there were zero tests.
As luck would have it, a change in the orchestration of the scripts that ran in this environment had been pushed the day before. This was believed to be the culprit. Engineering rolled things back to the previous release. Unfortunately, this made the problem worse.
Another program, the benefits distributor, was supposed to alert people when contributions weren’t enough for projections… Noticing that there was no output from the first program since it had crashed, it treated this case as “all contributions are 0”. This, of course, was not what it should do. But no one knew it behaved this way since, again, the first program had never crashed.
I got an unexpected text from the CIO. “sorry to bother you, we have a huge problem. s1X. Can you fly in this afternoon?” S1X is their word for “worse than severity 1 because it’s cascading other unrelated parts of the business”.
Fortunately all of our pensions are safe and the above story has a happy ending. But it is hardly reassuring that our mission-critical financial systems are relying on antiquated software that literally no one alive understands.
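One takeaway from the benefits distributor’s behavior: when a downstream program finds no output from an upstream job, that should be a hard error, not an empty dataset. A minimal sketch in Java of that fail-fast pattern (the class name and file path here are invented for illustration; the actual pension systems were not described in this detail):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ContributionLoader {
    // Fail fast: a missing input file almost always means the upstream
    // batch job failed -- not that every contribution was zero.
    static List<String> loadContributions(Path upstreamOutput) throws IOException {
        if (!Files.exists(upstreamOutput)) {
            throw new IllegalStateException(
                "Upstream output missing: " + upstreamOutput
                + " -- refusing to treat absent data as zero contributions");
        }
        return Files.readAllLines(upstreamOutput);
    }

    public static void main(String[] args) throws IOException {
        try {
            loadContributions(Path.of("/nonexistent/contributions.csv"));
        } catch (IllegalStateException e) {
            // Surface the failure loudly instead of emitting misleading alerts.
            System.out.println("Aborted: " + e.getMessage());
        }
    }
}
```

The point is not the three lines of code; it is that the “missing input” case gets a deliberate, documented behavior instead of an accidental one that no one discovers for decades.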
Another story, this one with a less happy ending: 16,000 coronavirus cases went unreported in England because an agency relied on an obsolete, 30-year-old file format.
Businesses for Bit Rot
It turns out the above is no isolated example. Back in 2012, when I left Intel to join Sun, I realized just how badly their SPARC product line was doing. Once the golden goose of the dot-com server era, it had since fallen incredibly far behind Intel’s Xeon product line. My own manager literally told me to run our simulations on the Intel Xeon servers and not the SPARC server farm, because “it is very slow.” Even worse, not only did Intel’s CPUs perform better; their manufacturing edge meant that they were significantly cheaper to make as well.
The obvious next question I had: Why on earth are people even buying our SPARC chips, if they are so far behind the competition? The answer I got from a senior architect boggled my mind. Our customers had software systems so ossified that they could only ever run on SPARC/Solaris. Migrating to x86/Linux would be too herculean a task for them to take on. In many cases, they had even lost their source code, so they could not so much as recompile their applications. The best they could do was upgrade to the latest generation of the same SPARC processors, regardless of how slow or expensive those were.
That’s right, our entire division’s business model revolved around corporate America’s rotting software systems.
The Cost of Keeping The Lights On
When I first joined Amazon, I found myself working on the perfect archetype of a legacy system. It was initially developed with tons of tech debt… by another team… which had long since disbanded. Ownership of the project had then been transferred over to our team… and it proved to be so unpopular, that developers started transferring out to other teams in droves. Of the ~10 team members when I joined, not a single one still remained a year later.
On the surface, the system had many things going for it. It was written in a modern language and tech stack (Java 8). It had an entire team of six-figure-income developers supporting it daily. And it was constantly being updated to fix bugs and/or add new features.
And yet, despite all this, it was easy to see how the turnover was weighing the system down like an albatross. With the ownership change and team churn, a tremendous amount of institutional knowledge had been lost: overall code design, end-to-end functionality, best practices, debugging techniques. We worked hard to keep things chugging along. And yet, it felt like we were bogged down in a swamp, constantly second-guessing every change, surrounded on all sides by a fog of war.
Can you imagine how much worse things would have been if the project was running on Java 1, with zero active development, and no developers assigned to own it?
Preventing Catastrophic Failure
As software developers, we strive to produce robust, bug-free systems that can simply be left to run on their own, for years and years, without any manual intervention. By that metric, the pension fund’s script is surely a smashing success.
And yet, the harsh reality is that everything will break… one day. Everything will eventually need to be updated.
- Maybe your system relies on hardware that is no longer being produced
- Or your system has dependencies that are now defunct
- Or dependencies with major security vulnerabilities, whose only patches are in backwards-incompatible versions
- Or the application was developed based on certain assumptions that are no longer true
- Or the world has simply changed, and the application needs to change with it
No matter the reason, change is inevitable. The only question is how painful it will be when it finally arrives.
If you have a system that is being actively worked on, change need not be all that painful. But if you have a system that has been completely ignored for multiple years or even decades, there are so many things that can go catastrophically wrong.
- The people who built the system may no longer be at the company
- The source code may get lost
- People may not know how to properly compile the source code to produce the executable
- Or deploy the new executable
- Or correctly run the executable, with all flags correctly configured
- Or make sense of the code architecture and implementation
- Or the invariants and implicit assumptions that the code relies on, in order to function correctly
- Or run the automated tests
- Or debug test failures
- Or debug production failures
- Or access production logs and metrics
Trying to exhaustively document all of the above will help. But your documentation will always be incomplete. It will always contain gaping holes. Comprehensive documentation is no substitute for getting your hands dirty.
An Idle Mind
Having a dedicated developer who is responsible for owning and maintaining all of the above is definitely a good first step. But it is not sufficient.
People will only “read the docs” so many times before getting bored. They will not gain the understanding that comes only from experience, and from solving real problems.
If asked to “audit” things, there’s a very good chance that they will simply rubber-stamp it and sweep it under the rug, while focusing on other newer and shinier projects… or simply goofing off. Without any real deliverables or challenges, many will simply follow the path of least resistance.
If you truly want to avoid software rot, the only way is to keep moving, continuously. Even if it seems unnecessary or risky. Because the best way to build, maintain, and verify your institutional knowledge and capabilities is to keep making changes, and to keep testing your ability to execute on those changes successfully. The day you stop moving is the day your institutional knowledge starts decaying and falling apart.
Even moving in circles, as ridiculous as it sounds, would be an improvement over neglect. But more realistically, there is always something the maintainers can do to move the needle forward, even if it is in tiny amounts.
You could update your environment to use the latest versions of all dependencies:
- Such as migrating from JDK 8 to 11
- Or updating your JVM to use the G1 garbage collector instead of CMS
- Or updating your GCC compiler from version 5 to 7
- Or upgrading your Database from Postgres 9.5 to Postgres 11
- Or updating your AWS SDK from version 1.10 to 1.11
- Or installing the latest Linux distro on your production fleet
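Environment upgrades like these stay deliberate when the application itself checks what it is running on. A hedged sketch of such a startup guard (the version-parsing logic and the chosen minimum are illustrative, not from any system described above):

```java
public class RuntimeGuard {
    // Parse the JVM's major version: the property reads "1.8" on Java 8
    // and earlier, and plain "11", "17", etc. on later releases.
    static int majorJavaVersion() {
        String v = System.getProperty("java.specification.version");
        return v.startsWith("1.") ? Integer.parseInt(v.substring(2))
                                  : Integer.parseInt(v);
    }

    public static void main(String[] args) {
        int major = majorJavaVersion();
        // Refuse to start on a runtime the team has never tested against,
        // so "it happens to run" never silently becomes the new baseline.
        if (major < 11) {
            throw new IllegalStateException("Expected Java 11+, found " + major);
        }
        System.out.println("Running on Java " + major);
    }
}
```

A guard like this turns an environment drift from a latent surprise into an immediate, debuggable failure at deploy time.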
In severe cases where your dependencies are hopelessly obsolete, you can investigate migrating to a newer stack entirely:
- Such as migrating from SPARC to x86.
- Or Solaris to Linux.
You can keep your developers sharp by updating your application as well:
- Such as fixing any lingering bugs and edge cases
- Or strengthening your automated test suite
- Or cleaning up tech debt
- Or making performance optimizations
- Or implementing new features
- Or simply refactoring your code incrementally to make it more readable
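For a legacy codebase with no tests, the cheapest way to start strengthening the test suite is a characterization test: pin down what the code does today, before refactoring, without judging whether that is what it should do. A sketch with an invented stand-in for the legacy logic:

```java
public class LegacyRounding {
    // Stand-in for impenetrable legacy logic (invented for illustration).
    static long toCents(double dollars) {
        return Math.round(dollars * 100);
    }

    public static void main(String[] args) {
        // Characterization tests: record the current observable behavior.
        // Any refactoring that changes these outputs gets flagged for
        // review, whether the old behavior was intended or a bug.
        // (Run with `java -ea` so assertions are enabled.)
        assert toCents(12.00) == 1200;
        assert toCents(0.99) == 99;
        assert toCents(19.99) == 1999;
        System.out.println("characterization passed");
    }
}
```

With behavior pinned this way, the refactoring, bug-fixing, and performance work above can proceed with a safety net instead of a fog of war.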
The above changes are bound to introduce transient risks and incur seemingly “unnecessary” expenses. Inevitably, your developers will make mistakes and introduce bugs. When faced with such costs, it is very tempting to simply step back and do nothing at all. “If it ain’t broke, don’t fix it.”
If the system in question provides minimal business value, that may even be the rational thing to do. But for any mission-critical system, neglect merely trades away small transient risks in exchange for a permanent, catastrophic one: the risk that one day your system will need to be urgently debugged or updated, and your organization will find itself utterly incapable of doing so.
For any mission-critical system, it is vital to preserve institutional knowledge and capability. And the only way to do that is through continual exercise. Your (company’s) brain is a muscle. Use it or lose it.