To Hardcode, or Not to Hardcode: That Is the (Unit) Test-ion

In my last blog post, there was a bit of a discussion in the comments about whether data in unit tests should be hardcoded or not.  Tom Metro scored the last point with this comment:

“We always coach developers to use hard coded test data to the extent practical. When writing tests you have to unlearn a lot of DRY principles. We tolerate a lot more repetition, and factor it out sparingly.”

and then I said that this really required a longer discussion than blog post comments would allow.  This, then, is that longer discussion.

So, first of all, let me agree wholeheartedly with Tom’s final statement there: “We tolerate a lot more repetition, and factor it out sparingly.” Absolutely good advice.  The only problem is that, as I noted previously on my Other Blog, we humans have a tendency to want to simplify things in general, and rules and “best practices” in particular.  I don’t think that Tom was saying “always hardcode everything in your unit tests!” ... but I think some people will read it that way nonetheless.  I would like to make the argument that this is a more nuanced topic, and hopefully present some reasons why you want to consider this question very carefully.[1]

For an example of how hardcoding the right answers for all your unit tests can be taken too far, imagine a system where you need to test several scripts that produce output.  One simple way would be to run the scripts and capture the output, then examine that output very carefully, once.  Once you’ve verified by hand that the output is good, you save said output to a file.  Now all your unit test has to do is run the script, generate new output, diff against the Last Known Good output, and throw a fit if there are any differences.  Easy peasy.

Now, if you make a change to a script that affects its output, you will generate a diff, but then you can verify (again, manually) only the change to make sure it’s correct, then resave the new output as Last Known Good.  Still pretty basic, right?  All the correct “answers” are hardcoded, in files, and you’re checking against them all.  As your system grows, you’ll get more and more scripts, and therefore more and more LKG files to diff against, but no biggie.  As your team grows and changes, though, it can be harder to know what the problem is if an unexpected diff pops up during your testing ... you didn’t think your change could have possibly affected that test, and yet there it is.  You’ll have to track down the right person to ask about this ... assuming they still work here.
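A minimal sketch of that LKG workflow in Perl might look like the following.  Everything in it is made up for illustration: generate_report stands in for “run the script and capture its output,” and save_lkg and matches_lkg are hypothetical helper names, not part of any real module.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Test::More;

# Hypothetical stand-in for running a script and capturing its output.
sub generate_report {
    my (@amounts) = @_;
    my ($out, $total) = ('', 0);
    for my $amount (@amounts) {
        $out   .= "item: $amount\n";
        $total += $amount;
    }
    return $out . "total: $total\n";
}

# Hypothetical helper: overwrite the Last Known Good file with $output.
sub save_lkg {
    my ($output, $lkg_file) = @_;
    open my $fh, '>', $lkg_file or die "can't write $lkg_file: $!";
    print {$fh} $output;
    close $fh or die "can't close $lkg_file: $!";
    return;
}

# Hypothetical helper: 1 if $output is identical to the saved LKG file,
# 0 if there is any difference at all.
sub matches_lkg {
    my ($output, $lkg_file) = @_;
    open my $fh, '<', $lkg_file or die "can't read $lkg_file: $!";
    my $expected = do { local $/; <$fh> };
    close $fh;
    return $output eq $expected ? 1 : 0;
}

my $dir = tempdir(CLEANUP => 1);
my $lkg = "$dir/report.lkg";

# Step 1: verify the output by hand, once, then save it.
save_lkg(generate_report(5, 10), $lkg);

# Step 2: every later test run just regenerates and compares.
ok(matches_lkg(generate_report(5, 10), $lkg),  'unchanged output matches LKG');
ok(!matches_lkg(generate_report(5, 11), $lkg), 'changed output is flagged');

# Step 3: after an intentional change, check the diff and resave.
save_lkg(generate_report(5, 11), $lkg);
ok(matches_lkg(generate_report(5, 11), $lkg),  'resaved output matches again');

done_testing;
```

The comparison itself knows nothing about what the output means; all of that knowledge lives in whoever checked the file by hand.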

As this “system” grows to dozens or even hundreds of LKG files, the chances that any given person knows what any given diff actually means (and whether it’s an actual problem or just a symptom of the lack of foresight someone had when setting up the test in the first place) begin to drop dramatically, and the chances that any given person will just “fix” a failing test by simply replacing the LKG file with the new output start going up.  Pretty soon you’ve invented “unit testing theater,” a system whereby everyone feels great about the number of passing tests and nothing is really being tested in a way that will ever catch any actual errors.  I would say this is an example of the failure of hardcoded tests.

Now, before you attempt the classic rebuttal of “yes, but your example is so extreme that that would never happen in the real world,” let me assure you that any coworkers of mine from $lastjob who are reading this have now buried their faces in their hands, because they know exactly what I’m talking about, and it was all too real.  By the time I got there, very few people had any real understanding of what all that hardcoded data was supposed to represent and whether it still made any sense and/or still represented the actual goals of the business.  I was brand new, so I certainly didn’t have any idea.  When I asked what I should be doing when a diff unexpectedly caused a test to fail, it was explained to me how simple the fix was: you guessed it, just regenerate the LKG file.  Eventually I got sick of this unit test theater and ripped out the whole thing.

What did I replace it with?  Simple: a system whereby output was generated by a different algorithm.  The great thing about algorithms is that, like mathematical proofs (and like Perl), There’s More Than One Way To Do It.  If you’re lucky, there’s another way to calculate that answer that is just dumber and/or slower than the way the real code is doing it, and that’s the way your unit tests should be calculating it.  Why is that better than hardcoding the right answers (in this particular instance)?  Well, there are a few advantages:

  • As long as your unit test algorithm is very simple, it can be much easier to verify than the algorithm in your actual code.  An ideal unit testing algorithm can be fairly quickly verified just by looking at it.
  • Even if the unit test algorithm takes longer than you’d like to verify when things stop matching up, at least you have something direct to compare it to: the algorithm in the code.  If the two algorithms don’t produce the same answer, you can figure out why, even if you have zero clue what the hell the output means or anything about its business purpose.  You can trace through the two algorithms, figure out where they differ, then go to someone more knowledgeable and ask the simple question “which of these two is correct?”
  • Conversely, hardcoded “right” answers depend not only on the algorithm, but also on the input.  The original input may have made sense once upon a time, but perhaps now it doesn’t.  Instead of asking a businessperson “should the program be doing this or this?” you’re now forced to ask “if I put in this, what should I be getting out?” Not only is it much harder to get a definitive answer to such an open-ended question, but you may find the answer is “why on earth would you put that in?”
  • Your comparison algorithm is also much better at dealing with change.  As new, perhaps completely unforeseen, inputs come along, the LKG method isn’t providing any value at all: it can only help you validate algorithmic changes against the old inputs.  But you can feed any inputs into your comparison algorithm and it should manage to spit out the right answers.  If it doesn’t, then either your comparison algorithm is incomplete, or your actual algorithm is actually wrong.
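
As a minimal sketch of the idea (not the system described above, whose details aren’t given here), suppose the real code counts the weekdays in a run of consecutive days using a bit of arithmetic.  Both sub names are hypothetical, as is the convention that days of the week run from 0 (Sunday) to 6 (Saturday).  The test recomputes each answer the dumb way, by walking the days one at a time, and compares the two:

```perl
use strict;
use warnings;
use Test::More;

# The "real" code (hypothetical): count Monday-Friday days in a run of
# $days consecutive days that starts on day-of-week $start_dow.
sub count_weekdays {
    my ($start_dow, $days) = @_;
    my $count = 5 * int($days / 7);              # every full week has 5
    for my $offset (0 .. ($days % 7) - 1) {      # then the leftover days
        my $dow = ($start_dow + $offset) % 7;
        $count++ if $dow >= 1 && $dow <= 5;
    }
    return $count;
}

# The test's version: dumber and slower, but easy to verify by eye.
sub count_weekdays_slowly {
    my ($start_dow, $days) = @_;
    my ($count, $dow) = (0, $start_dow);
    for (1 .. $days) {
        $count++ unless $dow == 0 || $dow == 6;  # skip Sunday and Saturday
        $dow = ($dow + 1) % 7;
    }
    return $count;
}

# No hardcoded answers: any input will do, including ones nobody foresaw.
for my $start_dow (0 .. 6) {
    for my $days (0 .. 30) {
        is(
            count_weekdays($start_dow, $days),
            count_weekdays_slowly($start_dow, $days),
            "$days days starting on day $start_dow",
        );
    }
}

done_testing;
```

If the two ever disagree, the name of the failing test says exactly which input to trace through both subs.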

So is this the perfect solution in all cases?  No, of course not.  Over three decades ago, Frederick Brooks told us there was no silver bullet for software engineering problems, and that’s why it’s so dangerous for us to give in to our natural human instincts to simplify.  It’s awesome to be able to tell ourselves (and our new employees) “never do this!” and “always do that!” As soon as we have to start adding a bunch of caveats and exceptions, we’ve made the whole thing more complex, and that just makes it much harder to get it right.  But as Einstein once sort-of-said: things should be made as simple as possible, but no simpler.[2]  As with many things in computer science, you just can’t make this one simpler.

To be clear: should you always use hardcoded data for unit tests?  No. Should you never use hardcoded data for unit tests?  No. You’ve got to look at a number of aspects of the problem, including what your long-term effects are going to be (or your best guess at them anyhow), and try to come up with the best solution you can.  And sometimes you’ll still be wrong.  But you do your best.

To try to bring this discussion full circle to where it started, the blog post that started it all was discussing the unit testing for Date::Easy.  What Tom specifically asked me (later in the same comment I quote above) was:

“Are you not able to explicitly set environment variable or other flags so that your tests are always executed in a known environment?”

And the answer is, often, “yes.” But I don’t want to do that.  Take timezone as a simple example.  It’s not too hard to force a timezone, at least on Unixoid systems: just set $ENV{TZ}.  If I wanted to, I could force every single one of the unit tests for Date::Easy to run in the U.S. Pacific timezone, which is where I live.  That way, all my tests would pass nearly all the time, because, if they didn’t pass there, I wouldn’t have released the version in the first place.  But if I did that I’d be missing out on a golden opportunity: there are CPAN Testers machines all around the world, in all different timezones, with all different locale settings, in all different languages.  I want to know if my unit tests pass with all those parameters.  I already know they pass in Los Angeles, in English, with the standard C locale.  I want to find out where they fail ... so I can fix it.
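
To make the two options concrete, here is a sketch using only core modules.  It deliberately avoids guessing at Date::Easy’s interface: start_of_local_day is a hypothetical sub standing in for the code under test, and is_start_of_same_local_day is a made-up helper.  The TZ value is a POSIX-style rule for U.S. Pacific time (chosen so the example doesn’t depend on the system’s zoneinfo files), and the pinned half assumes a Unixoid system.

```perl
use strict;
use warnings;
use POSIX qw(tzset strftime);
use Time::Local qw(timelocal);
use Test::More;

# Hypothetical code under test: the epoch second at which the local
# calendar day containing $epoch began.
sub start_of_local_day {
    my ($epoch) = @_;
    my ($mday, $mon, $year) = (localtime $epoch)[3, 4, 5];
    return timelocal(0, 0, 0, $mday, $mon, $year + 1900);
}

# Hypothetical helper: a check that needs no hardcoded answer.  Whatever
# the local timezone is, the result should format as 00:00:00 on the
# same local date as the input.
sub is_start_of_same_local_day {
    my ($epoch) = @_;
    my $start = start_of_local_day($epoch);
    my $got   = strftime('%Y-%m-%d %H:%M:%S', localtime($start));
    my $want  = strftime('%Y-%m-%d', localtime($epoch)) . ' 00:00:00';
    return $got eq $want ? 1 : 0;
}

my $epoch = 1_500_000_000;    # 2017-07-14 02:40:00 UTC

# Option 1: pin the timezone, which allows a hardcoded expected value.
{
    local $ENV{TZ} = 'PST8PDT,M3.2.0,M11.1.0';
    tzset();
    is(start_of_local_day($epoch), 1_499_929_200, 'midnight, Pacific time');
}
tzset();    # TZ is restored when the block exits; tell the C library

# Option 2: leave the environment alone, so the same test exercises
# whatever timezone the tester's machine happens to be in.
ok(is_start_of_same_local_day($epoch), 'midnight, local time');

done_testing;
```

Option 1 only confirms that the code works in one place; Option 2 is the version that gives machines in other timezones a chance to find failures.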

I hope this has been a useful discussion of why you should sometimes avoid hardcoding your unit test data, even though it deliberately hesitates to pick a side in all situations.  Definitely don’t sweat a little repetition in your unit tests, and “factor it out sparingly” (as Tom suggests); that’s absolutely the way to go.  But hardcoding unit test data also has downsides; just be aware of what they are and make the best decision you can for the situation you have to deal with in the moment.



__________

[1] Note: This discussion is not limited to Perl, of course.  But it’s far more of a technical topic than my Other Blog is prepared to handle, and the audience here will appreciate it far more.

[2] Actually, what Einstein said was “It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience.” But, you know, things get paraphrased.

About Buddy Burden

12 years in California, 23 years in Perl, 32 years in computers, 53 years in bare feet.