How to be agile without testing
What's a bug?
Fair warning: if you're someone who has the shining light of the converted in your eyes and you've discovered the One True Way of writing software, you might feel a bit challenged by this post.
Your newest developer just pushed some code, but it has a bug. She screwed up the CSS and one of the links is a bright, glaring red instead of the muted blue that your company requires. While you're sitting with her, counseling her on the importance of testing, you get a call from marketing congratulating you on the last release. Sales have jumped 50%.
You know that the only change is the link color.
Was that change really a bug? Are you honestly going to roll it back?
More importantly, and this is the question that many people get wrong: what are you going to learn from this?
Why we write tests
Why do we go agile? Because we believe we can improve the software process. Because we believe we can share information better and, crucially for agile, get feedback early and often (hint: that theme is going to recur). Another interesting thing about agile is that every agile methodology, without exception, says that you need to adjust that methodology to meet your particular needs. I've said this before and it bears repeating: you can't be agile unless you're, well, agile. So some agile teams don't have fixed iterations. Others (most?) don't do pair programming. Code review is only done on the tricky bits or maybe for newer programmers. And guess what? These companies often do very well, despite doing things differently.
But none of them talk about getting rid of testing. They just don't.
And yet in our "red link" example above, testing might well have made it harder to discover a 50% increase in sales.
For most of us, we learned about testing years ago and it was good. Then we learned about TDD and realized we had found testing Nirvana. FIT testing was a nifty idea that's heavily evangelized by those who offer FIT testing consulting services — and pretty much no one else. And now BDD leaves some breathless while others yawn. It's just the next craze, right?
But what is testing for? From a technical perspective, we might argue that it's to make sure the software does what we want it to do. But is that the most important thing? Remember that past a certain size (mumble lines of code), all software has bugs. All of it.
Instead, I think it's better to say that all software has unexpected behavior. We write tests because we hope the software will do what we want it to do, but isn't it more important that the customers do what we want them to do? You can build a better mousetrap, but there's no guarantee they will come. And if there's anything to learn from Digg or other tech disasters, it's this: customers are going to do what they damned well please, regardless of whether or not your software "works", which experts you've consulted, or how many focus groups you've held.
So rather than introduce software testing as some proxy for customer behavior, let's think about the consumers of our software for a moment.
A list of undesirable things
Considering our "bright red link" example above, I ask again: is it a bug? In that example (which was not chosen at random), it's easy to argue that it's a software bug, but that's only because the software exhibited unexpected behavior. In this case, it was a 50% increase in sales.
So now, instead of bugs — always bad! — we can think in terms of "unexpected behavior", sometimes good, sometimes bad.
So how do you know which is which?
You make lists of undesirable things. 500 errors on your Web site are bad, but are 302s? Tough to say. Maybe you want to keep RAM usage below a certain level, or not see a significant drop in sales. And you probably want to make sure that responses never take more than $x milliseconds.
Make a list of everything that's unequivocally undesirable (for example, a Facebook "like" button going away doesn't count as unequivocally undesirable) and add monitoring for all of those behaviors. Every time you change or add technologies, go over your list of undesirable things again. Are they up to date? Is there anything you need to change?
Some of those undesirable things are reversible (dropping sales), and their opposite is good news. So monitor those, too. Maybe you want to get notified when a release improves response time by 10%.
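To make that concrete, here's a minimal sketch (in Perl, naturally) of what a periodic check against such a list might look like. The fetch_metrics() and notify() routines are hypothetical stand-ins for whatever your monitoring stack and alerting actually provide, and the thresholds are invented for illustration.

    #!/usr/bin/env perl
    # Sketch: compare current metrics against a list of undesirable things.
    use strict;
    use warnings;

    my %undesirable = (
        error_rate_500  => { max => 0.01, desc => '500s exceed 1% of requests' },
        response_ms_p95 => { max => 800,  desc => '95th percentile response time over 800ms' },
        sales_per_hour  => { min => 40,   desc => 'significant drop in hourly sales' },
    );

    my $metrics = fetch_metrics();

    for my $name ( sort keys %undesirable ) {
        my $rule  = $undesirable{$name};
        my $value = $metrics->{$name};
        next unless defined $value;

        notify("UNDESIRABLE: $rule->{desc} ($name = $value)")
            if ( defined $rule->{max} && $value > $rule->{max} )
            || ( defined $rule->{min} && $value < $rule->{min} );
    }

    # Hypothetical stubs: wire these up to your real metrics source and pager.
    sub fetch_metrics {
        return { error_rate_500 => 0.002, response_ms_p95 => 650, sales_per_hour => 35 };
    }
    sub notify { warn "$_[0]\n" }

Run something like this on a schedule, or feed the same rules into whatever alerting you already have, and revisit the list whenever the technology changes.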
And then?
Well, this is great, but it doesn't replace testing. Not by a long shot. You've made your bi-weekly release, RAM consumption has skyrocketed, you're swapping like mad and now you have a 3,000 line diff to go through. Finding a memory leak can be hard at the best of times and normal testing often misses them (but check out Test::LeakTrace), so now you have to roll back a huge change and go through 3,000 lines of code to find your problem.
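(For the curious, here's roughly what a Test::LeakTrace check looks like. build_cycle() is a deliberately leaky throwaway sub I've invented so the test has something to catch; in this form the test fails and reports the leak, which is exactly the point.)

    use strict;
    use warnings;
    use Test::More;
    use Test::LeakTrace;

    # A deliberately leaky sub: the circular reference keeps the hash alive
    # after the sub returns, so Perl never frees it.
    sub build_cycle {
        my $node = { name => 'leaky' };
        $node->{self} = $node;
        return;
    }

    # no_leaks_ok() runs the block and fails, reporting the leaked SVs,
    # if anything created inside it is still alive afterwards.
    no_leaks_ok { build_cycle() } 'build_cycle() does not leak memory';

    done_testing();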
So you don't put yourself in that position. Instead, you switch to continuous deployment. With this model, you push code to production the moment it's ready. Of course, it's better if you actually push it to a single box first, watch it, push it to a cluster, watch it, and then push it to all servers. With your extensive monitoring, undesirable things usually show up pretty quickly and your memory leak is a 30 line diff instead of a 3,000 line diff.
Which one do you want to deal with?
(Naturally, I used a memory leak as an example even though that's one of the things which often takes longer to show up, but I'm too lazy to change the example. Pretend I wrote "a 5% increase in 404s.")
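As for the mechanics of pushing to one box, then a cluster, then everyone: it doesn't take much. The sketch below shows the general shape, with deploy_to() and metrics_look_bad() as hypothetical hooks into your own deploy tooling and monitoring, and made-up host names.

    #!/usr/bin/env perl
    # Sketch of a staged (canary) rollout for continuous deployment.
    use strict;
    use warnings;

    my @stages = (
        { name => 'canary',   hosts => ['app01'] },
        { name => 'cluster',  hosts => [ 'app02', 'app03', 'app04' ] },
        { name => 'everyone', hosts => [ map { sprintf 'app%02d', $_ } 5 .. 20 ] },
    );

    for my $stage (@stages) {
        deploy_to($_) for @{ $stage->{hosts} };

        print "Deployed to $stage->{name}; watching for undesirable things...\n";
        sleep 600;    # or however long your monitoring needs to complain

        die "Undesirable behavior after '$stage->{name}' stage; rolling back\n"
            if metrics_look_bad();
    }

    # Hypothetical stubs: replace with your real deploy tooling and monitoring.
    sub deploy_to        { print "pushing current release to $_[0]\n" }
    sub metrics_look_bad { return 0 }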
In my experience, customers are fairly forgiving about minor quirks, and most unexpected behaviors are things like "this image isn't showing up" or "these search results are ordered incorrectly." Those tend not to be catastrophic. In fact, many times this unexpected behavior goes unnoticed. Most of the time the unexpected behaviors will turn out to be neutral or bad, in terms of your list of undesirable things, but sometimes they turn out to be good. You'll never know if you don't try.
As you may expect, this technique works very well with A/B testing and if you have the courage to look for unexpected behaviors instead of bugs, A/B testing is the next logical step.
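If you want to go there, the bucketing side of A/B testing is almost embarrassingly small. Here's a sketch that deterministically assigns a user to a variant; the experiment name and user id are made up, and the interesting part, measuring what each bucket actually does, is the monitoring described above.

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    # Deterministic A/B bucketing: the same user always lands in the same
    # bucket for a given experiment, so behavior can be compared over time.
    sub ab_bucket {
        my ( $user_id, $experiment ) = @_;
        my $hash = md5_hex("$experiment:$user_id");
        return hex( substr $hash, 0, 8 ) % 2 ? 'B' : 'A';
    }

    my $variant = ab_bucket( 12345, 'short_vs_long_synopsis' );
    print "User 12345 sees variant $variant\n";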
Note: None of the above precludes writing tests. None of it. I've seen the "monitoring undesirable things" strategy work extremely well and I firmly believe that it can work in conjunction with software testing. However, it's a different way of testing software, one that's more reliant on customer behavior than exacting specifications. So the title of this post is actually a bit of a lie; it's just a different way of looking at testing.
And that's really the most interesting idea of this entire post: your customer's behavior is more important than your application's behavior.
See also: when must you test your code?
I was A/B-tested
Isn't this the same argument some people use against wearing safety belts in cars?
"I read about an accident where the person could not escape the car because of the safety belt, so I won't use it."
You're conflating two totally different ideas. When we talk about testing we're talking about correctness, or what you call "testing for unequivocally undesirable things".
Everything else you're talking about in this post is called analytics.
And, yes, good analytics and A/B testing are important.
Gabor, that's a great question and the answer is a resounding "no!" I was planning on writing a follow-up to this post entitled "when you must test" and the central criterion is simple: harm. It's a matter of simple ethics. For example, if you risk double-charging or mischarging a customer, you must test. If you risk giving someone a lethal dose of radiation, you must test (and far more rigorously than most do). By using "potential for actual or perceived harm" as your minimum benchmark, you can make a clear distinction between testing that you're selling the right medication and testing whether or not you've shown the short or long synopsis of a toy.
I am not arguing that you should not test, but I am arguing that sometimes we're focusing so much on our software that we're forgetting our customers.
Sushisource: I'm very deliberately conflating monitoring and testing because they're different in the way that motorcycles and planes are different: they are different means of accomplishing the same general goal and each has strengths and weaknesses. But first ...
You will note that I never said that tests assure correctness, because they can't. But let's ignore that for a moment and pretend that tests can demonstrate correctness. What then? Let's say you've created online book-selling software that is provably correct. In no way does adherence to a specification guarantee that your customers are going to use it, and that's the crux of the problem I'm getting at.
The above post was in the context of customer-facing software, and perhaps my failure to state that explicitly is the problem here, but let's take a look at that. What if I write a test to guarantee that a mobile user always sees the short synopsis of a product instead of the long synopsis that you'll see on a non-mobile system? That test is a binary assertion: either I see the behavior that I desire or I don't. But all it shows is that the software behaves the way I desire. How does that translate to what my customers do? It doesn't. Tests verifying that my code matches a specification are orthogonal to whether or not my customers are going to take the actions that I desire. The Digg release that caused the collapse of Digg may have been the best tested Digg release in history (I don't know that, of course), but it failed miserably. I don't recall anyone saying that it failed just because it was so buggy (though that was part of it). Everyone said it failed because consumers were unhappy with it.
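To be concrete about how small and how binary that synopsis test is, it amounts to something like the sketch below, where synopsis_for() is a hypothetical function standing in for whatever actually renders the product page.

    use strict;
    use warnings;
    use Test::More;

    # Hypothetical function under test: which synopsis does a client get?
    sub synopsis_for {
        my (%args) = @_;
        return $args{is_mobile}
            ? $args{product}{short_synopsis}
            : $args{product}{long_synopsis};
    }

    my $product = {
        short_synopsis => 'A red toy truck.',
        long_synopsis  => 'A die-cast red toy truck with opening doors, a tilting bed, and so on.',
    };

    # A binary assertion: pass or fail, it says nothing about whether
    # anyone actually buys the truck.
    is( synopsis_for( is_mobile => 1, product => $product ),
        $product->{short_synopsis},
        'mobile users see the short synopsis' );

    done_testing();

It passes, and it should, but nothing in it measures whether mobile customers put the truck in their carts.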
So monitoring/analytics is a non-binary form of testing because it requires humans to think about it rather than push a button and wait for a yes or no answer. You, the human, have to ask yourself, is a 10% increase in response time a fair trade-off for the 1% increase in sales? There is no "one size fits all" solution to that question. If you have plenty of capacity, the answer might be yes. If you've maxed out the performance of your boxes and you don't have the cash to buy more, the answer might be no.
In the end, what I'm trying to do is to convince people that for customer-facing code, simply writing a test that proves you can actually sign up for the corporate newsletter has nothing to do with whether or not people will sign up for that newsletter. Monitoring is a form of testing that tests human behavior rather than computer behavior and no, it's not going to be an either/or result.
For me, there are two very good reasons for writing tests.
I am lazy and I write web applications. Clicking my way through the application until I find the feature I'm working on, just to be greeted by a stupid error such as "Cannot use an undefined value as a hash-ref", is way too exhausting for me.
I will have to add features and modify existing ones in the future. You make it sound like tests are written so that we can "prove" that our existing code works. That's not the point. I can use those tests to check whether what I wrote a couple of weeks or months ago still does what it's supposed to do, even though I've added features and you can now sign up for another corporate newsletter.
I don't test automatically. I have precisely 0 unit tests, yet I have a successful website getting 2 million requests daily. I'm not worried about adding new features or breaking existing code. I add a feature, manually test that it works, and if it does, deploy to live.
What I do is live by these rules. 1) The system must be fast to deploy/restart. It is essential that when I add a change, I can see that change within seconds, not have to wait for an ant/maven rebuild or for caches to rebuild. Instant feedback.
2) Steer clear of anything complex. If I write something complex that I can't understand fully, or that won't fit into my mind without paging, it could be considered a prime candidate for testing to verify it's doing as required. I don't want to spend time writing tests, so instead I find another, less complex way of doing what I want.
3) If things break (and they do occasionally break!), then fix them. When I put things live, sometimes bugs do get out. My users tell me about bugs within minutes. Because of the first two points, I can get a fix out very quickly, usually within the following few minutes. The bugs that have arisen have typically been because of things I didn't anticipate, and so wouldn't have written tests for anyway.
I'm happy. I have a clean codebase and very fast turnarounds. I kind of believe that unit tests would get in the way, or encourage code complexity. Though that's just my opinion.
Monitoring is a kind of testing. Application monitoring is doubly so. We select how to monitor our applications from the set of behaviors exhibited by the system and write code that validates the presence of the behavior. If we choose, for example, to monitor that our style guide is adhered to, and we test that no glaring red appears in links, we lose the opportunity to realize the serendipitous increase in sales due to a violation of the style guide. The opportunity would never have occurred.
When we test, and when we monitor, and when we do all the other good things that make our software stable, usable and predictable, we must realize that we are also locking down dimensions of variability that could lead to serendipitous outcomes. We choose to test and monitor and automate so that we avoid common bad outcomes. In the process we throw out a few potentially good outcomes.
I do so love the continual delivery stuff. It makes building a culture of experimentation so much easier.
That said - you seem to be assuming that the only reason to write tests is to find bugs / behaviour that the customer doesn't want. There are other reasons to write tests.
For example:
I write the vast majority of my code using TDD. Here I'm writing tests to help my design.
Sometimes I write story level tests / acceptance tests / customer tests. Here I'm writing tests to help me figure out when I've done something.
... and so on...
Adrian,
I like to make a distinction between customer and non-customer code. For non-customer code (such as for an open source project or a back end reporting system), there's a far different strategy which is applied. However, by "customer" code, I mean customer-facing code where the success of the business depends on customers using the code in the manner that the business needs to stay afloat. Did you buy the mug that's being sold? Did you sign up for the newsletter? Do you become a repeat visitor?
Most companies I've worked for are desperate to have those answers, but do nothing to either acquire these answers or help the development team "improve" these answers. Instead, companies have a marketing department which tells the devs to change 32 different things at once, though there is never (and I really mean never) any evidence presented that shows that customers will actually respond in the desired fashion. The devs, in turn, dutifully churn out the features that they're expected to churn out without considering whether or not those features are really useful. (This is part of the reason that the P.O.P. strategy I recommend suggests hiring devs who have business awareness, but I didn't have time to fit that in the slides).
The business is actually the forgotten party here because if you ask any business person with a brain "wouldn't you like to be able to see in real-time whether or not your customers are doing what you want them to do?", they're going to say "oh god, yes!" But few companies think about that question or try to answer it in any meaningful way. I want to change that and these posts are my thoughts on what is needed here.
So I don't dispute what you say about tests, but I will cheerfully dispute whether or not currently recommended best practices in testing are the elusive Holy Grail of performant software.
Side note: to get around the annoying "session expired" bit, copy your response, go back to the page with the post and hit "reload" and then submit quickly. Spending too much time on rewriting the post causes that annoying timeout.
Hmmm.. whose best practices are you reading ;-)
I've been running workshops on the advantages of a more experimental metric-driven approach for a couple of years now. Jez Humble's CD book was published nearly three years ago. The classic "The Deployment Production Line" paper from Agile 2006 is more than six years old now.
We've got the Lean Startup folk ranting about CD, metrics, split testing, etc. for the last three years.
Maybe it's just that I spend more time with new product development and startups - but this sort of stuff is best practice now.
It's just orthogonal to testing.