Your Test Suite is Broken
Here are some indicators that your test suite is broken:
- It has any failing tests.
- It emits anything other than test information.
- It takes too long to run.
People are going to have issues with the second item on that list but they're definitely going to argue with that last one. I don't care. It's time to throw the gauntlet down. Let's look at these three points in detail.
Any failing tests
If a test suite has any failing tests, even "known" failures, developers learn to overlook failures in general. Those who've been testing for any length of time will generally agree with this. It's a bad thing. In fact, it's enough of a bad thing that every developer (well, except one) that I've spoken with agrees that leaving broken tests in place is bad, so I won't belabour this point.
Any "non-test" output
This one throws a few people off and I confess I'm guilty of ignoring it myself; it's easy to just shrug at a warning which is hard to suppress. However, like the "any failing tests" rule, this should be dealt with quickly. Specifically, if you grow accustomed to output which isn't test related, sooner or later you'll see non-test output which indicates a real bug, but you'll have learned to ignore it. I still remember cleaning up a test suite warning which had existed for months, only to discover that the warning pointed to a bug in the test program itself, making those tests useless.
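To make that concrete, here is a minimal sketch of one way to stop warnings from becoming background noise in a Perl test: promote them to failures. The module name below is made up, and CPAN modules such as Test::NoWarnings package up the same idea more thoroughly.

    #!/usr/bin/perl
    # Sketch only: fail the test file if anything warns unexpectedly.
    # My::Module is a stand-in for whatever you're actually testing.
    use strict;
    use warnings;
    use Test::More tests => 2;

    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };

    use_ok('My::Module');
    # ... exercise the code under test here ...

    is( scalar @warnings, 0, 'no stray warnings from the test run' )
        or diag("Got: @warnings");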
Long running tests
People will agree that this is annoying, but they won't necessarily agree that their test suite is broken. I argue that it is. I strongly argue that it is. The first problem is the "downtime" problem: while a long test suite runs, the developer who kicked it off is more or less idle.
Sure, many argue "but the developer can choose another task to handle at that point", but it's often very hard to find a meaningful task that only takes an hour or two. Plus, many developers are simply going to wander off for a bit. If you expect seven hours of productivity a day and you have seven developers who run their full one-hour test suite once a day, you've lost a day of work every day. It's like having six developers!
No, I'm lying. It's like having less than six developers, because they won't come back the second the test suite is done, nor will they be mentally rarin'-to-go when it finishes. A single failure means rerunning that suite, so that's even more time lost. Long-running test suites cause developers to lose productivity, and serious amounts of it. (I really should start tracking this at work so I can show management the costs involved. They're staggering!)
So what happens then? Forget for a moment what you think should happen. I think employers should double my paycheck every year. Ain't gonna happen. Instead, what happens is reality. Even if management is not pushing the programmers to be more productive, the developers themselves will want to be more productive and they've found a trick to make this happen. I personally have experienced this trick at five teams across three companies with long-running test suites. I've had many developers in other companies tell me that their teams use the same trick themselves (remember, I'm "that testing guy". People talk to me about this stuff all the time).
Here's the trick: don't always run the full test suite.
Remember: I've had this happen at every company I've worked at with long-running test suites. Many developers have told me their teams do the same thing with their long-running test suites. They don't always run those tests. Forget what "should" be done. This is reality. When you're under pressure to deliver, those deadlines are looming and you just know that X only affects Y and X is well-tested and you can slip this in really quickly and no one will notice, it's easy to commit now and not wait an hour or two. I've seen it eventually happen to every team with a long-running test suite. I'm sure there are exceptions, but I've never seen one first-hand. Sooner or later that failed test is going to creep in.
At my current team, you can't commit to trunk unless you have a stuffed camel on your desk. One of the most common questions on our team is "who has the camel?"
Possibly the next most popular question is "do you mind if I commit this small change to trunk anyway? It doesn't affect anything."
It doesn't affect anything? Really? So why is it being committed? Of course, I've asked this question, too.
Mind you, these aren't lazy developers. These aren't bad or unconscientious developers. These are developers who have deadlines and have to get things done. They have to make individual value calls on whether or not the risk is worth the reward and usually they're right. When they're wrong, the effect cascades across everyone's work. As a result, I've lost a huge amount of time today trying to debug test failures that I didn't realize were in trunk. This is not the first time this has happened to me.
Maybe your team is different. Maybe your developers are so incredibly careful and meticulous that your three-hour test suite is run every single time it should be. Maybe your developers are so conscientious that they immediately find a task which should take two hours and fifty minutes when they run that three hour test suite. And maybe your developers are so anal-retentive that I'd want to hang myself after working with them for more than a week.
For the rest of us, a long-running test suite means a significant loss in productivity and a drop in quality. I've seen this too many times to think that my experiences are anomalous. There's a lot of interesting stuff which needs to be done around testing, but I think more stuff needs to be done around speeding up test suites. For the vast majority of Perl users, this is not a problem. For those of us suffering from this problem, it's pretty damned serious. I've done a lot of work in this area, but more needs to be done. Much more.
Does a long-running test suite imply that your system needs to be broken up into components (sub-projects)? The hope would be that the component you are currently working on can have its own (smaller) set of low-level tests and that the other components with which it interfaces can be treated as black boxes. This is sort of making "not running all the tests" official by creating a clear boundary between unit testing and system testing.
I think it's a bit too black and white to suggest that a test suite that takes too long to run is "broken". If you have a large complex system it's going to take a long time to test.
In my $work project our test suite takes around 40 minutes to run. Obviously that means a developer can't run all the tests on every commit but we've addressed that in two ways.
Our test suite is modular. We have a tree of test directories and a developer can run individual tests, a directory full of tests or a subtree of directories of tests. This would typically be done as part of getting a code change ready to commit.
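In practice, that looks something like this with prove (the paths here are invented for illustration, but the flags are standard):

    prove -l  t/orders/basic.t    # one test file
    prove -lr t/orders/           # one subtree of tests
    prove -lr t/                  # the whole suite, when you have time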
We use continuous integration. Nothing fancy, we just have a script running on a dedicated test 'server' looping through the whole test suite repeatedly. When the tests pass, the script builds a set of .deb packages and deploys them to our staging servers.
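Conceptually, that kind of "poor man's CI" loop is little more than the sketch below; the update, build and deploy commands are placeholders rather than anything from the real script.

    #!/usr/bin/perl
    # Sketch of a minimal CI loop: update, test, and only build/deploy on success.
    # The svn/build/deploy commands are placeholders, not a real setup.
    use strict;
    use warnings;

    while (1) {
        system('svn', 'update') == 0
            or die "update failed: $?";

        # prove exits non-zero if any test fails
        if ( system('prove', '-lr', 't/') == 0 ) {
            system('./build_debs.sh')     == 0 or warn "build failed\n";
            system('./deploy_staging.sh') == 0 or warn "deploy failed\n";
        }
        else {
            warn "tests failed; skipping build and deploy\n";
        }

        sleep 60;    # don't spin when nothing has changed
    }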
So developers run a subset of the tests before committing and then if the commit broke something outside that subset, they'll get notified within about an hour. That's typically a small enough time frame to make apportioning blame easy :-)
My personal rule is that any test suite which takes longer than ten minutes to run in full is unusable and any core tests which take longer than a minute to run are too cumbersome.
I dislike the "We'll just use continuous integration!" approach because I tend to get notifications only after I've moved on to something else. Context switching is even worse for wetware than hardware.
@Denis: breaking things up into subprojects sounds like a great idea but it's been very problematic for us. You could look at our project as having two subprojects, Dynamite (public read-only interface) and Databridge (the writer), which are now tightly coupled. Changing one often means changing the other and this causes synchronisation issues. It's not been fun. Further, not many people know anything about the Databridge side of things (I haven't even been able to get it running), thus fragmenting our knowledge along with the code. In fact, it probably wouldn't be too hard to justify breaking our team into two teams entirely ... thus ensuring that when anything has to be done, even more lines of communication have to be managed and work progresses at a slower rate.
@Grant: Your test suite only takes forty minutes to run? I'm very envious of how fast it is!
You state that part of your approach is a modular test suite, running only a subset of tests and letting CI catch what falls through the gaps. Your solution is our problem. First of all, I firmly believe that CI should not be there to find problems; like your QA department, it should merely be verifying quality, not assessing it. Otherwise, developers get in the habit of letting CI or QA find the problems and that's when things start to get messy.
Consider our two hour test suite (it usually runs for longer on my box): I decide to run a subset of tests which appear to reasonably cover the functionality and, 20 minutes later, commit to trunk. Meanwhile, someone else creates a new branch. CI doesn't pick up on my commit right away because it's busy finishing off the last test suite run[1]. So when CI gets done running those tests, it sees a new checkin, builds it, and runs a new test suite. An hour and a half into it, there's a failure, but since my commit waited an hour for the previous suite, we now have a two and a half hour gap between the failing commit and the notification (not unusual). Not only does the first developer have to stop their work; anyone who has branched off of trunk in the interim now has potentially broken tests.
With your shorter 40 minute test runs, I suspect that running a subset of tests is more manageable, but it doesn't scale. As you add more functionality, you'll add more tests and the suite will take longer to run and it will take longer for you to be notified of errors. Our test suite simply takes too long for this to be a comfortable proposition.
While I accept that you feel that "long-running test suites are broken" is too black and white of an assertion, I still have to hold to that position. From the business side, you want to maximize productivity of your developers. If they're constantly fiddling with the scaffolding instead of working on solutions, you're not getting the best value you can.
--
1. Whether our Hudson server should see a new checkout and run test suites in parallel is a separate issue, not easily resolved. It's complicated by occasionally backwards-incompatible schema changes and the fact that we're actually two projects now and those need to be kept in synch on CI for them to run correctly.
@chromatic: the context switching nightmare is worse because often several developers are notified of the test failure, causing several of them to have to stop while they sort out what caused it. Or several developers worked on the branch and now we need to know which one is responsible. Or several developers have checked out broken code in the interim because CI took too long to see the test failure.
Ten minutes, I think, is a good rule of thumb, but it's very hard to achieve that with the size of the systems I tend to work on. Parallelisation seems to be the only way to go.
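For what it's worth, newer versions of prove and Test::Harness make the crude form of parallelisation easy; whether a given suite's tests are actually safe to run in parallel is the hard part:

    prove -lr -j 9 t/            # run up to nine test files at once
    HARNESS_OPTIONS=j9 make test # same idea via the standard harness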
Seems like there are two problems compounded into one.
1) Monolithic codebase
So you have DataBridge which depends on Dynamite. Sounds like there should be a common library that both projects share.
How does making a 3rd dependency help?
The common library consists of commonly used components (DBIC classes, middleware interface classes, etc.) and therefore will have the majority of the tests, due to its interfacing with all of the other systems. It will therefore also take the longest time to run, but it should be quite stable, as core interfaces should change less often than external ones.
On the other hand, the test suites for the main projects (where the majority of the business logic resides) will be much faster.
Only at integration will the entire suite require running.
2) Branching and merging strategy
This is commonly overlooked as one of the major causes of delay.
If a developer needs to "wait" for a camel before merging, maybe there is a flaw in your branching strategy.
If a developer can branch from broken (tests failing) code, maybe there is a flaw in your branching strategy.
Poor branching strategies REALLY affect productivity.
There is a trade off between "continuous integration" and "known stable code". I think you'll find your team has chosen the former as being more important.
"So you have DataBridge which depends on Dynamite. Sounds like there should be a common library that both projects share."
@xover: there is a common set of classes which both projects share and yes, there is only one set of tests for it. We're not duplicating that.
"If a developer need to "wait" for a camel before merging maybe there is a flaw in your branching strategy."
I don't understand your point. If the tests take a long time to run, it's easy for another developer to come in and commit while you're running them. That means you need to run the full test suite again with the new code, and if yet another developer commits in the interim (something tempting to do because the test suite takes so long to run), then you have to run the test suite yet again.
Prior to the camel, a developer could waste an entire day trying to commit code, only to find out that others were sick of waiting for the test suite to finish and committed that tiny little change that surely couldn't break anything. And that's if all goes well and no tests fail. If they fail, it's even worse.
Wouldn't branchy development help there, i.e. branching from code known to pass tests, and having CI or a maintainer merge branches and test the results of the merge?
@jnareb: I don't see how it would help. We already rely very heavily on branching. Our CI server would be overloaded if it tried to run tests on every branch commit, so it only does so on trunk commits. Just yesterday I saw about 25 commits in 8 hours, and we can easily have far more on a good day. Of course, this wouldn't be a problem if our test suite didn't take so long to run, but then, that's the problem we're facing in the first place :/
Can't developers run tests on their branches (on their machines) before submitting them for inclusion in trunk (master branch)? This should reduce CI failures.
@jnareb: We're already running the tests on our own machines, but when we merge them to trunk (also on our own machines!), we need to run them again before the commit to make sure we don't break trunk. It's the fact that the test suite takes too long to run which makes this the problem.
I just can't see #3 as so black and white. Yes, at my $job we have a large test suite that tests some things that take a long time to run (they even take a long time in production, so I don't see why the tests shouldn't emulate what's in production). And yes, we use CI to help developers so that they don't have to run the full test suite on each commit.
There are ways to work around lots of people investigating CI failures. But if you have a really large project where some things just take a long time to run, you reach a point where you can't make your tests any faster. Sure, you could buy beefier hardware and try to make things run in parallel (our tests already keep our 2 cores at about 80%), but if the project is large enough, at some point you will hit a wall. I see CI as a "good enough" solution for getting over that wall.
I think you've got a bit of a chicken-and-egg problem. To make your test suite smaller, the only way to go is to break up the monolithic project into several independent projects, each with their own test suite. You don't re-run the tests on all your CPAN dependencies every time you commit (I hope!) so why run tests on DataBridge for a change to Dynamite? The problem is the tight coupling: you can't separate the dependencies from the code, so you can't separate the test suites. I think that if you want to fix this, at some point there's going to have to be a lot of pain while you take a chainsaw to the code base.
@Mike: we don't want to make our test suite smaller unless we're removing duplicate or useless tests. And I don't believe I said we run the DataBridge tests for Dynamite. DataBridge has a separate set of tests. I wrote "there is only one set of tests" for the shared code (primarily the DBIC layer), not that we're running tests for both projects.
We want to make our test suite faster, not smaller. That's a different beast entirely. We've identified several more areas we can do this in, but it's honestly hard work.