Rethinking Software Testing: Perspectives from the world of Hardware

The hardware and software worlds may seem poles apart, and in many ways they are. But there’s a wealth of knowledge each can learn from the other. Despite the seemingly massive differences in the final product, the two disciplines have more in common than you might expect.

Computer engineers at places like Intel, just like software engineers, spend most of their time sitting at their desks, writing (Verilog) code that implements the desired system behavior. They then compile (synthesize) their code to generate lower-level outputs (digital circuits and physical layouts). And finally, they write automated tests that exercise their SUT (system under test), to ensure that the code is functionally correct. Sound familiar?

I know all this intimately, given my own past as a hardware engineer and my later transition into software development. After finishing my Master’s degree in Computer Architecture, I spent over 5 years working at Intel and Sun as a Hardware Verification engineer, before turning down a senior-staff role at Apple in order to reboot my career as a software developer.

In the past 5 years, I’ve worked on some great software teams at places like Google, and have also led development for multiple personal projects in my free time. The programmers I’ve met and worked with are undoubtedly smart, and possess a number of skills that my hardware colleagues should strive to emulate. However, one thing I’ve noticed is that when it comes to testing, their instincts have been… off. Way off.

Here’s my attempt to distill the lessons I’ve learnt from my Hardware days, and how they can be applied to improve our Software testing methodologies and outcomes.

Disclaimer: this post is focused on non-UI programming, where functionality can be 100% covered without the need for “eyeballing”. Front-end/UI testing is a whole other beast that I wouldn’t comment on, or touch with a ten-foot pole.

Raising the Stakes

The elephant in the room in most software companies is the perceived importance of testing. In hardware, pre-silicon verification is a first-class citizen in the development process. Dedicated verification engineers earn six-figure salaries, sit next to their RTL design counterparts in all planning meetings, and enjoy careers that are just as prestigious and lucrative. In comparison, at most software companies I’ve come across, testing is treated as a second-class citizen – being a “test engineer” (or worse, “tester”) is often maligned as less prestigious and less lucrative.

This difference in culture isn’t an accident of nature. It’s a natural consequence of the much higher stakes in the hardware world. Because the tapeout process is so expensive and time-consuming, finding even a single bug can delay your product launch by months, and cost you millions of dollars in additional expenses. Or worse: finding a bug after your customers have already purchased and installed the chips can result in extremely expensive product recalls – even if the fix is a simple one-line code change.

The consequences of software bugs can certainly be disastrous. But at least the fix is logistically cheap – code deployments and software patches are vastly faster and cheaper than manufacturing and distributing new silicon. This is why hardware organizations take testing much more seriously than comparable software companies.

The results speak for themselves. Hardware products in the hands of customers have an order of magnitude fewer bugs. The percentage of bugs caught prior to release is vastly higher in the hardware industry than in the software industry.

A Better Way

It is tempting to say that hardware teams are better at testing purely because of their greater financial investment – that software teams are already testing as well as they possibly can, and that improvements in test quality can only be bought with sacrifices in time and cost.

Such a view is unjustifiably optimistic about the current state of affairs, and pessimistic about our potential for improvement. Over the past decades we have vastly improved our software-development practices and methodologies. There’s no reason to believe that we have now achieved a state of nirvana where no further improvements are possible.

Even though many programmers tend to short-change it, testing methodology is itself a skill set – one that an entire industry learns over time, at a rate proportional to its level of investment. And in this sense, the hardware industry is miles ahead when it comes to testing best-practices. Not because its engineers are “smarter” in any way, but simply because their survival depends on it.

You wouldn’t expect a football player to be able to jump as high as a basketball player.
You wouldn’t expect a restaurant to take cleanliness as seriously as a hospital.
You definitely shouldn’t expect a software organization to master testing best-practices, the way a hardware company has.

If you want to master the art of testing, talk to a hardware verification engineer.

Lessons to Learn

The 0th Law of Testing: Only the Paranoid Survive

If it hasn’t been tested, it doesn’t work. If that isn’t absolutely true, it is certainly a good working assumption for project work. This rule forms the foundation for most of the other lessons listed here.

Word of Warning: The universe of all possible inputs and corner cases is infinite, so you will never attain 100% coverage via empirical testing. You will never cross the finish line. If you ever think that you are “done” with testing, you’re in for a surprise. All you can do is chase as much coverage as can be attained with the time and resources available.

Manual Testing is not good enough

Things I’ve heard developers say:

“This is so important, that we have to test it manually. I don’t trust an automated test to do the job.”
“Don’t worry about trying to build automated tests. We’ve been manually testing these changes.”

What a Hardware engineer would say instead:

“This is so important, that we have to build an automated test suite for it. I don’t trust human testers to do the job.”
“Maybe run a few tests manually as a final sanity check, but don’t spend too much time since it’s been auto-tested pretty well.”

Running a couple of tests by hand and eyeballing the results might work in a college VLSI class, but it will get you laughed at in industry. Manual testing cannot be code-reviewed on GitHub. Manual testing is subject to human error, whether due to oversight or laziness. And manual testing is extremely time- and labor-intensive when repeated for every single release.

There might be specific cases where a test cannot be automated. But these should be the exception – not the norm. Reliability ultimately comes from the strength of your automated test suite, not from how much manual testing you do. Anything important enough to test by hand is important enough to build an automated test suite for.

Testing Two Inputs in Isolation != Testing Them Together

Suppose your team is implementing and testing the following method:

public static double myCustomDivider(double numerator, double divisor);

Alice: “Do we have tests checking correct behavior for negative inputs?”
Bob: “I have a test where the numerator is negative, and another test where the divisor is negative.”
Alice: “Do you have a 3rd test where they are both negative?”
Bob: “No, and we don’t need that. We’ve already covered both cases individually.”

You laugh, but I’ve heard variations of this said far too many times, by far too many senior developers.

If the 2 inputs are completely decoupled, maybe it makes sense to skip testing them in combination. But oftentimes, 2 inputs that are assumed to be decoupled aren’t nearly as decoupled as people think. And even if the implementation is indeed decoupled at the time the test is written, it can evolve to become coupled later. As the saying goes: “It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.”

As a general heuristic: If 2 inputs are both being parsed within the same method, there is definitely value in testing them in combination.
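To make this concrete, Alice’s third test is nearly free to write. Here’s a sketch, assuming the myCustomDivider declared above is meant to behave like ordinary division:

@Test
public void divide_numeratorAndDivisorBothNegative() {
  // If the sign handling of the two inputs is coupled anywhere in the
  // implementation, this is the test that will catch it.
  double result = myCustomDivider(-27.0, -3.0);
  Truth.assertThat(result).isEqualTo(9.0);
}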

Perhaps you’ve decided that the risk-reward tradeoff merits not writing tests to cover some combinations. This is certainly a reasonable decision to make, depending on the particular project circumstances and the events being considered. But do so cognizant of the risk you’re taking on. Do not delude yourself into thinking that there’s no value in testing combinations of multiple events.

Testing Output_A for Event_1 != Testing Output_A for Event_2

Suppose you find yourself needing to test the following addPerson method (alongside the getters used to verify it):

public int getAge(String name);
public int getHeight(String name);
public int getWeight(String name);

// Returns true if a previous value was overwritten
public boolean addPerson(Person person);

And so you write the following tests:

@Test
public void addNewPerson_shouldReturnFalse() {
  Person person = new Person("john", 30, 175, 70);
  boolean result = system.addPerson(person);
  Truth.assertThat(result).isFalse();
  getAndCheckPerson(person);
}

@Test
public void addPerson_alreadyExists_shouldReturnTrue() {
  Person originalJohn = new Person("john", 30, 175, 70);
  system.addPerson(originalJohn);

  Person updatedJohn = new Person("john", 31, 174, 71);
  boolean result = system.addPerson(updatedJohn);

  Truth.assertThat(result).isTrue();
  getAndCheckPerson(updatedJohn);
}

private void getAndCheckPerson(Person person) {
  Truth.assertThat(system.getAge(person.name))
    .isEqualTo(person.age);
  Truth.assertThat(system.getHeight(person.name))
    .isEqualTo(person.height); 
  Truth.assertThat(system.getWeight(person.name))
    .isEqualTo(person.weight);
}

The following conversation then ensues.
John: “Hey, why are you checking the age/height/weight all over again in the second test?”
You: “Why not?”
John: “We’ve already verified in the first test that we are populating the age/height/weight correctly. The only incremental change that needs to be checked in the second test is the returned boolean. You can delete the other checks.”

You can convince yourself using various logical arguments that it’s impossible for the additional checks in the second test to ever fail if the first test has passed – and therefore, these additional checks aren’t needed. “I’ve inspected the code and we don’t even check to see if the person exists before blindly populating and overwriting whatever is there already!”

That’s great, but the whole point of writing automated tests is to minimize (potentially faulty) assumptions. It’s to put safety nets in place, for when someone later decides to refactor this code that you had inspected and made verbal guarantees about.

Depending on the specific implementation, your project’s priorities, and the amount of effort involved in doing the checks, skipping them may be a reasonable risk to take on. However, don’t fool yourself into thinking there’s no risk involved. Just because something works fine for one event does not mean that it will work fine for all other events. If a one-line helper method is all it takes to avoid making such assumptions, there’s little reason not to do it.

Test Every Corner Case You Can Think Of

I never realized how paranoid I was until I found myself having conversations like the following over and over again.

@Test
public void removeElementFromCollection() {
  var collection = MyCustomCollection.of(1);
  collection.remove(1);
  Truth.assertThat(collection).isEmpty();
}

“How do you know that the method isn’t clearing the entire collection, instead of removing just the 1 element?”

@Test
public void removeElementFromCollection_otherElementsRemaining() {
  var collection = MyCustomCollection.of(1, 2);
  collection.remove(1);
  Truth.assertThat(collection).contains(2);
}

“How do you know that the method isn’t just popping the 1st element every time?”

@Test
public void removeElementFromCollection_lastElementRemoved() {
  var collection = MyCustomCollection.of(1, 2);
  collection.remove(2);
  Truth.assertThat(collection).contains(1);
}

“How do you know that this will work fine for cases where you’re removing the middle element?”

This may seem nitpicky, but that is the level of paranoia needed to build a truly rock-solid test suite. One that can safeguard your release process against a wide variety of bugs and refactoring-induced errors. One where you feel comfortable deploying-on-success, no matter how invasive the changes are.

If a method returns a whole collection of elements, don’t just write tests that produce a single-element output. Write tests that are expected to produce a whole bunch of outputs, and check that every one of them shows up. Write tests that are expected to produce no outputs at all. Each individual check may seem trivial and minuscule, but together, it all adds up. Test every corner case you can think of.
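As a sketch of what those collection-output tests might look like (filterEvens is a hypothetical method that returns every even element of its input):

@Test
public void filterEvens_multipleMatches_returnsEveryOne() {
  // Verifies that all matching elements show up, not just the first.
  Truth.assertThat(filterEvens(List.of(1, 2, 3, 4, 6)))
      .containsExactly(2, 4, 6);
}

@Test
public void filterEvens_noMatches_returnsEmptyList() {
  // Verifies the no-output case explicitly.
  Truth.assertThat(filterEvens(List.of(1, 3, 5))).isEmpty();
}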

White-Box Testing to Enhance Test Coverage

A quick primer:
Black-box testing: Testing a method purely on the basis of its specs, without any regard to the specific implementation used.
White-box testing: Using specific implementation details to guide your testing priorities.

White-box testing, when done right, can greatly improve your test coverage by better probing for correct behavior at key edge cases. For example, suppose you’re testing the following class:

public class MyCustomSet<T> implements Set<T> { … }

Pure black-box testing: I will try adding and removing different elements, making sure that these operations cover all the various requirements of the Set interface, regardless of the specific implementation used.

Better white-box testing: From looking at the implementation, I know that it uses hashing and linear probing to achieve the desired functionality. The trickiest corner cases occur when 2 different elements collide at the same array offset – especially when one of those previously-inserted elements is later removed, producing a tombstone entry. Hence, in addition to the above black-box tests, I will write specific tests with specific inputs that trigger these tricky corner cases.

The first approach may be adequate if a sufficiently large test suite is used. But the second approach is more likely to find bugs with a much smaller test suite, by identifying and triggering the specific corner cases that are most at risk.
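Here’s a sketch of one such white-box test. It leans on implementation details you’d confirm by reading the code – I’m assuming an initial capacity of 16 and an array offset computed as hashCode() modulo capacity, so that the Integers 1 and 17 collide:

@Test
public void contains_probesPastTombstoneOfRemovedElement() {
  Set<Integer> set = new MyCustomSet<>();
  set.add(1);    // lands at offset 1
  set.add(17);   // collides at offset 1, linear-probes to the next slot
  set.remove(1); // leaves a tombstone at offset 1

  // The lookup must probe past the tombstone rather than stop at it.
  Truth.assertThat(set.contains(17)).isTrue();
}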

The more controversial uses of white-box testing arise when it is used to weaken, rather than strengthen, the test suite. Most commonly, you’ll hear something like the following: “I know that the implementation is accomplishing XYZ functionality using ABC implementation, and it’s obvious from looking at ABC implementation that it will work as intended. Hence, we don’t need to worry about writing tests to cover XYZ functionality.”

Typically in such cases, ABC is assumed to be bulletproof, either because it is very simple or because it uses reliable libraries. When done right, such reasoning can help identify areas of higher and lower risk, and prioritize appropriately. When done wrong, it can leave dangerous holes in your test coverage – there’s always a risk of having missed something, or of someone later refactoring the code in a way that breaks your assumptions.

Using white-box testing as justification for neglecting certain corner cases is a double-edged sword. It can be a necessary evil when under a time crunch, but it’s best not to rely on it too much. On the flip side, using white-box testing to enhance your test suite can pay huge dividends and make your codebase truly bulletproof.

Integration Tests are Worth their Weight in Gold

“You know the difference between theory and practice? In theory, there is none. But in practice, there is.” The exact same thing can be said of unit tests and integration tests. In theory, unit tests can give you the exact same coverage you can get from integration tests. In practice, they don’t.

This is something hardware teams have learnt painfully over the years, which is why no hardware project ever skimps on integration testing. No matter how thorough you try to be in your unit tests, you WILL find bugs when running integration tests.

This seems to be something that many software developers, especially the smarter ones, haven’t accepted yet. “If we just do a REALLY good job at unit testing, we will never need integration tests!” Sorry, no. On any project with sufficient complexity, you will never do a good-enough job. You will repeatedly build fakes that differ from the real component, in ways that turn out to be subtle but crucial. You will repeatedly fail to anticipate the disastrous emergent behavior that can result from seemingly innocuous changes.

I was once on a team staffed by brilliant and very accomplished developers, who went all-in on unit tests and literally banned integration tests. We had near-perfect test coverage metrics, but somehow, things would keep breaking in production every now and then. In ways that were sometimes disastrous.

With every break, we became more and more paranoid, clamped down on “unnecessary” changes, and spent more and more time on manual testing. Nothing seemed to do the trick. It was only once we put together an end-to-end test suite that things finally improved. At that point, we went nuts with all sorts of new features and refactoring changes, and our test suite never let us down.

The above is no isolated example either. The Rust compiler devs wrote a great article about how they manage to produce a new stable release every 6 weeks, even though most other compilers have much longer release cycles. They credit end-to-end tests for much of their success. They had indeed built a solid suite of unit tests – and yet, they still leaked a number of major bugs that were only found through end-to-end tests. By improving the effectiveness of their test suite, they were able to both prevent major production bugs and speed up their development cycles – a true win-win that we should all aim for.

Strengths and Limitations

Ironically enough, despite all my proselytizing above, you will find that most hardware testing is done at the unit (cluster) level. Why is this? Surely this validates the software industry’s norm of prioritizing unit testing as well?

Context is vital here. In hardware, a unit (cluster) test can finish in 5–15 minutes, whereas integration (full-chip) tests take many hours, sometimes even days, to finish. This is why most hardware testing is done at the unit level.

Compare this to software, where a whole suite of end-to-end tests can run in ~30 seconds, and the entire project-wide suite can run in 5 minutes – barely enough time for a dev to grab some coffee. This is every hardware tester’s dream. “You mean to tell me that in 30 seconds, I can run an entire collection of end-to-end tests, without having to invest gobs of time and effort writing tests for every single sub-component and building/setting-up/debugging mocks and fakes that do a piss-poor job of mimicking the nuances of the real thing??”

Unit tests certainly have their place in your test suite, especially for their value in reproducing obscure error conditions (e.g., network timeouts) and other rare corner cases that are hard to induce in a real system.
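For example, here’s a sketch of such a unit test. FakeHttpClient, Uploader and their methods are hypothetical names, standing in for whatever fake/SUT pair your project uses:

@Test
public void upload_networkTimeout_reportsRetryableFailure() {
  // A fake makes a rare, hard-to-induce condition trivially reproducible.
  FakeHttpClient fakeClient = new FakeHttpClient();
  fakeClient.failNextRequestWith(new SocketTimeoutException("simulated timeout"));

  Uploader uploader = new Uploader(fakeClient);
  UploadResult result = uploader.upload("some-payload");

  Truth.assertThat(result.isRetryable()).isTrue();
}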

However, the bread and butter of your test suite should be integration tests. Not only can you cover your entire codebase with a far smaller and simpler test suite, but you also gain rock-solid coverage of the nuanced interactions between different components – interactions that we’re far too likely to overlook or oversimplify at the unit level. Write tests. Not too much. Mostly integration.

Random Testing: What Separates the Amateurs from the Pros

At this point, you might be wondering how all of the above advice could possibly be achieved. “Test every combination of events? Test every output for every combination of events? Test every possible corner case? You would need an absolutely enormous number of tests in order to achieve that!”

That’s true… but only because of other limitations that software developers impose on themselves, such as the rule that all tests must be 100% deterministic, with no room for randomness.

In the hardware world, making such a statement would get you laughed out of the room. In any major hardware project, “directed” tests are table stakes for making sure your chip isn’t DOA. But if you really want to avoid billion dollar recalls, you need to up your game and adopt randomized testing as well. This is a lesson that most software teams have yet to learn, though some companies like Dropbox are rapidly catching on.

Why Random Testing Works

There are 2 main benefits of randomized testing – benefits that explain why it is such an essential part of every hardware-verification toolkit.

The first comes from minimizing test verbosity. Consider the same custom-divider method we discussed briefly earlier. To test it exhaustively, you could write tens of directed tests covering a variety of scenarios. Chances are, the majority of them can be replaced by a single test that picks a random numerator and a random divisor, and compares the output against that produced by a reference model. You can then run this test in a loop a thousand times, and end up with something that gives you just as much confidence as most of your painstakingly written directed tests.
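A sketch of what that could look like, using the myCustomDivider from earlier. I’m assuming the custom divider is meant to match the JVM’s built-in division (which doubles here as the reference model) within a small relative tolerance, and that RNG is a shared java.util.Random instance:

@Test
public void divide_randomInputs_matchReferenceModel() {
  for (int i = 0; i < 1000; i++) {
    double numerator = RNG.nextDouble() * 200 - 100;  // random in [-100, 100)
    double divisor = RNG.nextDouble() * 200 - 100;    // exactly zero is vanishingly unlikely

    double expected = numerator / divisor;  // reference model: built-in division
    Truth.assertThat(myCustomDivider(numerator, divisor))
        .isWithin(Math.abs(expected) * 1e-9)  // illustrative relative tolerance
        .of(expected);
  }
}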

The second is more subtle. When writing directed tests, you’re mitigating the risks that come from known-unknowns. You first enumerate all the corner cases you can think of, and then write tests for each of them. This works great for mitigating your known risks, but it fails utterly to address the unknown-unknowns. By definition, you cannot write directed tests to address unknown risks – because they are unanticipated, you wouldn’t have thought to write a test for them.

For example, perhaps you thought to test for the case where the numerator is zero, and also for the case where the denominator is zero, but you failed to consider the case where they are both zero.

This is where randomized testing provides a great deal of value. By randomizing your inputs, you are testing an extremely wide variety of input combinations – including combinations that you didn’t anticipate being problematic, but actually are. This will boost your test coverage significantly, even against the unknown risks hiding beneath the surface.

But What About Consistency?

A common pushback against randomness is that it can result in flaky tests. Such criticism misses the point of testing. The end goal of testing isn’t to have a deterministic test suite. The end goal is to catch bugs. A flaky test is an annoyance. A test that consistently passes despite the presence of bugs is disastrous. Anything that reduces the risk of the latter is fair game.

If you notice a flaky failure in your test suite, resolve it by debugging and fixing the root cause. If the existing failure message and logs aren’t sufficient, update your assertions and logging to get the debug information you need. If needed, you can also manually trigger the failure by running the test in a loop until it fails. This way, you’ll have all the debug information you need to root-cause the bug, fix it, and clean up your test results.

A quick note about consistency in test coverage: this is indeed a worthy goal, and it can be addressed in two ways. The first is to run each individual test in a loop X times, where X is the minimum number that gives you the coverage you feel you need from that specific test. The second is to run the entire test suite X times, achieving the same effect in a coarse-grained manner. Indeed, companies like Intel do exactly this, running their test suites in an endless loop and assigning engineers to debug and fix any failures that pop up. By combining both techniques, you can ensure robust coverage prior to deploying any changes.

Footnote: Hardware test suites usually seed all RNGs with a consistent seed, and then output this seed for any failures. This way, you can reproduce any failure by re-using the failing seed. This is done because “running the test in a loop” isn’t practical when a single test can take many hours to run. I have personally not needed this fixed-seed functionality in the software projects I’ve worked on, but it would certainly be a nice-to-have.
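A sketch of that fixed-seed pattern in Java (runRandomizedScenario is a hypothetical stand-in for the actual randomized test body):

@Test
public void randomizedScenario_withReproducibleSeed() {
  long seed = System.nanoTime();
  Random rng = new Random(seed);
  try {
    runRandomizedScenario(rng);
  } catch (AssertionError e) {
    // Surface the seed, so the exact failure can be replayed deterministically.
    throw new AssertionError("Failed with seed " + seed, e);
  }
}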

Battle Scars

Here’s an embarrassing example of a real bug we found thanks to random testing. We had an S3 uploader that takes in a user-supplied file, converts it into a FileInputStream, invokes the AWS S3 SDK using this InputStream, figures out the S3 URL by concatenating the bucket, path and file name, and returns this URL to the caller.

In the initial directed tests, everything worked fine. Only once we started randomizing the test inputs and writing integration tests that downloaded the contents of the returned URL did we start seeing flaky failures. Debugging these flaky failures revealed a true facepalm bug: the above scheme couldn’t handle user-provided files with names containing spaces. How could it, when URLs aren’t allowed to contain spaces?

In retrospect, the problem seems blindingly obvious: of course file names can contain spaces but URLs cannot – you have to account for that! However, most software bugs don’t occur in scenarios you’ve accounted for. They occur in scenarios you had never given thought to. If we had used only directed tests with an S3 mock, we would never have found this bug prior to release. It took an integration test with randomized inputs to uncover and fix it.

Degrees of Randomness

There are many different “levels” of randomization you can apply, each with its own complexity cost and coverage benefit. You can decide on a case-by-case basis how far you want to randomize things, in order to maximize coverage and reliability without too much complexity.

For example: suppose you’re building a custom list implementation, and you want to verify that the contains method works correctly for all successful cases. Here is the kind of directed test that I would see in most software projects:

@Test
public void contains_directedTest_noRandomization() {
  List<Integer> list = MyCustomList.of(4, 5, 6);
  list.add(7);
  Truth.assertThat(list).contains(7);
}

Suppose we decided to apply some randomized inputs. Here’s one very simple way to get started:

@Test
public void contains_randomizeElements() {
  List<Integer> list = MyCustomList.of(RNG.nextInt(), RNG.nextInt(), RNG.nextInt());
  int valueToAdd = RNG.nextInt();
  list.add(valueToAdd);
  Truth.assertThat(list).contains(valueToAdd);
}

It is still mostly similar to the directed test, except that we have replaced the hard-coded numbers with randomized numbers. On the surface, this doesn’t buy us all that much coverage. But it’s a start, and costs us almost nothing. And even though we may not realize it, it is providing us coverage for duplicate elements, and for obscure corner cases such as comparisons behaving differently for large ints than for small ones (Java only caches boxed Integers in the -128 to 127 range, so == comparisons silently break for larger values).

@Test
public void contains_randomizeElementsAndSize() {
  List<Integer> list = MyCustomList.of();
  int size = pickRandomSize();   // Biased RNG that equally weights empty/small/large sizes
  for (int i=0; i<size; i++) {
    list.add(RNG.nextInt());
  }
  int valueToAdd = RNG.nextInt();
  list.add(valueToAdd);
  Truth.assertThat(list).contains(valueToAdd);
}

Now we’re getting somewhere. Not only have we randomized the list contents, but we are now also randomizing the size of the list, covering the gamut from single-element lists to very large lists. If there are any bugs in the size checks, this is far more likely to find them. We’re even getting coverage for corner cases like naively recursive implementations that would produce stack overflow errors on large inputs.
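The pickRandomSize() helper referenced above could look something like this (a sketch – the bucket boundaries are arbitrary choices):

private static int pickRandomSize() {
  // Weight the empty/small/large buckets equally, so rare sizes get
  // exercised just as often as typical ones.
  switch (RNG.nextInt(3)) {
    case 0: return 0;                        // empty
    case 1: return 1 + RNG.nextInt(9);       // small: 1 to 9
    default: return 10 + RNG.nextInt(9991);  // large: 10 to 10,000
  }
}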

@Test
public void contains_randomizeElementsSizeAndPosition() {
  List<Integer> list = MyCustomList.of();
  int size = pickRandomSize();   // Biased RNG that equally weights empty/small/large sizes
  for (int i=0; i<size; i++) {
    list.add(RNG.nextInt());
  }
  int valueToAdd = RNG.nextInt();
  int index = RNG.nextInt(size + 1);  // any valid insertion position, 0 through size
  list.add(index, valueToAdd);
  Truth.assertThat(list).contains(valueToAdd);
}

Why stop at just randomizing the list size, when we can also randomize the position of the element that we’re searching for? Now we have the coverage we need to catch an even wider variety of off-by-one bugs. But we’re not done yet:

@Test
public void contains_randomizeElementsSizeAndPosition_moreCoverage() {
  for (int i=0; i<1000; i++) {
    contains_randomizeElementsSizeAndPosition();
  }
}

A single invocation of the base test will not give us the coverage we need. There are too many combinations of empty/small/large sizes, with small/large entries that are unique/duplicate, and searches for something at the start/middle/end. That’s why we wrap it in a meta-test that runs it in a loop a thousand times, ensuring that a single successful run provides the confidence we need to commit our changes.

Putting It All Together

With each degree of randomization, your test complexity increases further, but so does your test suite’s reliability. If you’re used to tests that catch most but not all bugs, these techniques may seem unnecessary. But if you’re aiming for higher levels of reliability, such techniques are essential.

Regardless of where you choose to draw the line though, outlawing all randomization by fiat is almost never the right answer. In most cases, you can randomize some inputs for a coverage boost with only a minimal increase in complexity. For instance, the very last test shown above accomplishes, in a very compact manner, what would otherwise require tens of directed tests. And even then, those directed tests would likely miss corner cases that you had never thought of.

My proudest moment as a verification engineer came when I uncovered an extremely obscure bug in the system I was testing. The bug only manifested during a small set of overlapping corner cases: you had to perform a very specific operation, with a specific flag enabled, the operation size had to be above a certain threshold, and the memory address involved had to just barely cross over into a different page.

I could have spent years writing hardcoded tests, and I would never have thought to test this particular combination of scenarios. But because I wrote tests with randomized inputs, we were eventually able to hit this bug and get it fixed before it reached production.

Use of Reference Models

One of the biggest questions to come up when you start doing more and more random testing: How can the test figure out what the right answer should be?

Going back to the divider example: if we wrote a directed test for Divider.divide(27.0, 3.0), we can manually derive the answer and check that it is 9.0. But if we use random inputs, how can we figure out what the correct answer should be? The answer is often to use a reference model – generated or updated dynamically – that tells you what the correct answer should be.

I’ve seen many testing guidelines strictly advocate against having any sort of dynamic reference model that provides the expected result. They are certainly right on one point: if your test’s reference model resembles the actual production code, then it will contain the exact same bugs as your production code, and those shared bugs will sail through your tests undetected.

However, the solution is not to abandon reference models entirely. The solution is to use reference models that are sufficiently different from your production code, so as to avoid replicating the same bugs.

For instance, suppose you’re testing a CRUD API for creating an event, inviting other users to the event, and fetching all RSVPs. The real API will perform all this using database creates/updates/lookups, as well as various parsing of database results – plenty of room for errors, at many different levels of the stack. A reference model can instead use simple POJOs and in-memory storage such as HashMaps. This reference model can then be cross-checked against the actual data returned by the integration test.
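A minimal sketch of such a reference model (the class and method names are illustrative, not a real API):

// In-memory reference model for the event/RSVP API described above.
class EventServiceReferenceModel {
  private final Map<String, Set<String>> rsvpsByEvent = new HashMap<>();

  void createEvent(String eventId) {
    rsvpsByEvent.put(eventId, new HashSet<>());
  }

  void invite(String eventId, String user) {
    rsvpsByEvent.get(eventId).add(user);
  }

  Set<String> getRsvps(String eventId) {
    return rsvpsByEvent.get(eventId);
  }
}

The integration test drives the real API and this model with the same sequence of (possibly randomized) operations, then cross-checks the model’s state against the real API’s responses.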

It is true that using reference models increases the complexity of your tests. Unfortunately, it is often a necessary evil: the coverage benefits we gain from random testing are far too great to ban reference models universally. This is why they are a common pattern in major hardware projects.

Test an Entire Path, not just a Single Output

Suppose you have a system where the following chain of coupled events can happen in sequence: Ai -> Bi -> Ci -> Di -> Ei

And you want to test that the above specific sequence of inputs produces the following chain of outputs: Ao -> Bo -> Co -> Do -> Eo

You could either write the following sequence of tests:

@Test
public void testA() {
  System system = new System();
  Output output = system.apply(A_I);
  Truth.assertThat(output).isEqualTo(A_O);
}

@Test
public void testB() {
  System system = new System();
  Output output = system.apply(A_I);
  output = system.apply(B_I);
  Truth.assertThat(output).isEqualTo(B_O);
}

@Test
public void testC() {
  System system = new System();
  Output output = system.apply(A_I);
  output = system.apply(B_I);
  output = system.apply(C_I);
  Truth.assertThat(output).isEqualTo(C_O);
}

@Test
public void testD() {
  System system = new System();
  Output output = system.apply(A_I);
  output = system.apply(B_I);
  output = system.apply(C_I);
  output = system.apply(D_I);
  Truth.assertThat(output).isEqualTo(D_O);
}

@Test
public void testE() {
  System system = new System();
  Output output = system.apply(A_I);
  output = system.apply(B_I);
  output = system.apply(C_I);
  output = system.apply(D_I);
  output = system.apply(E_I);
  Truth.assertThat(output).isEqualTo(E_O);
}

Or you could just write one test that covers it all:

@Test
public void testABCDE() {
  System system = new System();

  Output output = system.apply(A_I);
  Truth.assertThat(output).isEqualTo(A_O);

  output = system.apply(B_I);
  Truth.assertThat(output).isEqualTo(B_O);

  output = system.apply(C_I);
  Truth.assertThat(output).isEqualTo(C_O);

  output = system.apply(D_I);
  Truth.assertThat(output).isEqualTo(D_O);

  output = system.apply(E_I);
  Truth.assertThat(output).isEqualTo(E_O);
}

If you were to follow the “one assert per test” rule that many developers preach, you would be forced to choose the former option. I don’t know about you, but I far prefer the latter. It is infinitely more scalable, especially as you get to complex systems with long chains of events and multiple things that need to be checked at each stage. With rules like these, it’s little wonder that developers take shortcuts when it comes to testing – following all the prescribed rules is far too burdensome!

If you’re working with a testing framework where the only information provided is the name of the failed test, that rule might have some merit. Fortunately, most modern testing frameworks provide far more debug information. A well-written test should produce error messages that clearly indicate where in the test it failed, why it failed, and the differences between the expected and actual outputs. Debugging the failure should then be a simple matter of reading the error message.
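For instance, Google Truth’s assertWithMessage lets each check in the combined test above pinpoint exactly which step failed:

output = system.apply(B_I);
Truth.assertWithMessage("output after applying B_I")
    .that(output)
    .isEqualTo(B_O);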

Conciseness in your test codebase is extremely valuable, for many of the same reasons as in your production codebase. This is why hardware tests often involve hundreds of different checks, run at different times and checking different things, all within a single test. Given the vast amount of coverage needed, it is unrealistic to write dedicated tests for every single event-outcome combination. Abandon dogmatic rules that produce an explosion in verbosity. It’s okay for a single test to check multiple things along a single code path.

How Many 9s Are You Aiming For?

“Writing automated integration tests for all features? Testing every possible corner case? Randomized inputs and reference models? Is all this really necessary??”

That’s a good question, and the answer is: It depends.

In system design, the first question to ask is how many 9s of reliability we are aiming for. And if the answer is high enough, we design fantastically complex systems to meet those goals. The exact same principle applies to testing: the more reliable you want your test suite to be, the more complex the techniques you’ll need to achieve that goal.

If having corner-case bugs leak into production regularly isn’t a major problem for your project, then you can get by with the same testing methodologies being used by most software projects. But if you want to build a truly bulletproof test suite, one that makes production bugs an extremely rare occurrence, you’ll need to aim for multiple 9s of reliability. You’ll need to incorporate integration tests, randomized inputs, and reference models. You’ll need to be paranoid about testing anything and everything that could possibly go wrong.

In many cases, you’ll find that the use of integration tests and randomized inputs will actually improve your test coverage, while simultaneously reducing your development time and test verbosity.

But in other instances, as you try to squeeze out the last few drops of reliability, your test suite will become more complex. On the plus side, you’ll have so much trust in your test suite that you’ll feel confident launching major code changes with minimal manual testing or fretting.

There is no right or wrong answer here. Depending on how much reliability you’re aiming for, you can make complexity-verbosity-coverage tradeoffs on a case by case basis, using many of the techniques discussed above. Be honest with yourself about your project’s priorities, and then decide what sacrifices you’re willing to make to achieve them.


A German translation of this article has been published in golem.de magazine.

Related Links:
Dropbox’s use of randomized testing, in order to improve coverage of their sync functionality
New compiler bugs found every month, using fuzzing
jqwik – Property-based testing library for Java (kudos to Dan Turner for recommending this)
QuickTheories – Another property-based testing library for Java
A fun rant against unit tests


Online discussion threads:
HackerNews – 2021/07
/r/programming – 2019/05
/r/programming – 2019/11
