Sunday, October 19, 2008

When Code Coverage Is Necessary But Not Sufficient

In "Can unit testing be a waste?", pbielicki argues that it is far more efficient to test at the highest possible level of abstraction and rely on your code coverage tools to ensure that all code paths have been tested. That might work in his world, but just a couple of days after reading that article I ran into a situation where developers had been doing just that and missed a bug completely.

The C++ code implemented an algorithm from an open specification for encoding an identity string into a binary format for transmission to a remote piece of equipment. The identity string is a globally unique fifteen-digit decimal number. The string is broken up into several separate fields, and each field is encoded from decimal into binary before being stuffed into a protocol packet destined for the remote equipment. There are several encoding algorithms, chosen by the length of the specific field, which ranges from one to three decimal digits.

Looking at the couple of dozen lines of C++ code in the code base, I thought "It would be reasonable to assume that if any valid identity string were to be encoded into binary, then decoded back into a decimal string, the result would match exactly the original fifteen-digit string." So knowing exactly nothing yet about the specific algorithms from the spec, I wrote a little unit test to do precisely that.

How many of the possible values did I test for each encoding algorithm? All of them, of course. On a 2.8 gigahertz Pentium 4, cycles are easy to come by.
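A round-trip test of this kind can be sketched in a few lines. This is only an illustration: `encodeField` and `decodeField` are hypothetical stand-ins that use plain decimal-to-binary conversion, not the algorithms from the specification, and `countRoundTripFailures` is a name invented here. The point is the shape of the harness: generate every possible field of a given width, encode it, decode it, and compare.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in codec, NOT the algorithm from the specification:
// a field of one to three decimal digits becomes its plain binary value.
static std::uint16_t encodeField(const std::string& digits) {
    assert(!digits.empty() && digits.size() <= 3);
    std::uint16_t value = 0;
    for (char c : digits) {
        value = static_cast<std::uint16_t>(value * 10 + (c - '0'));
    }
    return value;
}

// Inverse of the stand-in encoder: produce exactly `width` decimal digits.
static std::string decodeField(std::uint16_t value, std::size_t width) {
    std::string digits(width, '0');
    for (std::size_t i = width; i-- > 0; ) {
        digits[i] = static_cast<char>('0' + value % 10);
        value = static_cast<std::uint16_t>(value / 10);
    }
    return digits;
}

// Encode then decode every possible field of the given width (1 to 3
// digits) and count how many fail to come back unchanged.
static int countRoundTripFailures(std::size_t width) {
    int limit = 1;
    for (std::size_t i = 0; i < width; ++i) {
        limit *= 10;
    }

    int failures = 0;
    for (int n = 0; n < limit; ++n) {
        // Build the zero-padded decimal string independently of the codec.
        std::string original = std::to_string(n);
        original.insert(0, width - original.size(), '0');

        if (decodeField(encodeField(original), width) != original) {
            ++failures;
        }
    }
    return failures;
}
```

Even for the widest field this is only a thousand iterations, so the whole input range costs next to nothing to cover.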

What did I find? Twenty percent of the possible values failed this simple unit test.

I am pleased to report, however, that code coverage was one hundred percent. In fact, testing any single input value for any of the algorithms would have yielded one hundred percent code coverage. Given that eighty percent of the possible values passed, it would be easy, even likely, to miss this bug by testing just a few selected values. Conceivably you could test a lot of values and still see no failures, if you just happened to stay within the eighty percent that worked.
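To see how full coverage and a large failure rate can coexist, consider a contrived example. It is not the code or the bug from the project described here. The two-digit codec below is invented for illustration, with a deliberate mistake chosen so that the numbers come out similar: the ones digit is given three bits where a decimal digit needs four. The names `encode2`, `decode2`, `roundTrips`, and `countFailures` are all hypothetical.

```cpp
#include <cassert>

// Contrived two-digit codec with a deliberate bug; NOT the real code.
// The ones digit gets three bits where a decimal digit needs four, so
// ones digits 8 and 9 spill into the tens field.
static unsigned encode2(unsigned tens, unsigned ones) {
    assert(tens <= 9 && ones <= 9);
    return (tens << 3) | ones;
}

static void decode2(unsigned value, unsigned& tens, unsigned& ones) {
    tens = value >> 3;
    ones = value & 0x7u;
}

// True if the field survives an encode/decode round trip unchanged.
static bool roundTrips(unsigned tens, unsigned ones) {
    unsigned t = 0;
    unsigned o = 0;
    decode2(encode2(tens, ones), t, o);
    return t == tens && o == ones;
}

// Exhaustively round-trip all 100 two-digit fields and count mismatches.
static int countFailures() {
    int failures = 0;
    for (unsigned tens = 0; tens <= 9; ++tens) {
        for (unsigned ones = 0; ones <= 9; ++ones) {
            if (!roundTrips(tens, ones)) {
                ++failures;
            }
        }
    }
    return failures;
}
```

A single call such as `roundTrips(4, 2)` succeeds and executes every line of the encoder and decoder, so a coverage tool would report one hundred percent. Only the exhaustive `countFailures()` reveals that every field whose ones digit is 8 or 9 fails, which is 20 of the 100 possible values.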

It wasn't a matter of code coverage. It was a matter of input range coverage.

I have written unit tests that ran for many minutes, testing a huge range of input values. For an algorithm that did time and date calculations, I had a unit test that ran for days. Fortunately, I didn't have to run it very often. But when I did run it, and it completed successfully, I was pretty darn sure that the underlying code worked.

Cycles are cheap. When dealing with what are fundamentally mathematical algorithms, there is no reason not to test a lot of values. When in doubt, test all of them.

5 comments:

mcjoe said...

Hey Chip,

While I agree in this case it was probably easier to just test every value, this approach doesn't scale. What if your identity string was a twenty digit value (equivalent to a 64-bit binary value)? Suddenly, even if you can test a million combinations a second, it is not possible to test all combinations in under half a million years!

The first question that came to mind regarding your 20% failure rate was, what was the common theme to these failures? Did the test miss an edge case?

If so, then the reason 100% code coverage was insufficient is that the code to handle the failed cases did not exist.

When the system under test gets more complex, unit tests are not a waste, but you have to figure out what aspects of the system are likely to fail and write tests that target those expected situations.

And over time, as issues arise, add tests to cover those aspects that were not originally thought of. This way, a future developer is protected from reintroducing a known issue.

Chip Overclock said...

In the particular case I was describing, the 15 digit string was split into smaller fields, each of which could be tested for its full range, because several fields were the same length and hence used the same encoding algorithm.

I agree with you in principle. You have to be smart about picking your test cases, testing boundary and edge conditions in particular.

In the example I gave, there were no edge cases. It was a straightforward mathematical algorithm involving shifts and adds and such. It did have some conditional logic, but really, we're talking about maybe eight lines of code for each of the encoding and decoding algorithms. How can eight lines of code have a 20% failure rate?

I think it was the shortness of the algorithms that may have misled the developers into thinking they were simpler than they were, the result being the 20 percent failure rate. But mathematical algorithms are subtle and quick to anger.

I still say: if possible, test all of them.

Embedded systems, where I spend most of my time, are expensive and difficult to upgrade once they are in the field. Many embedded systems, particularly telecommunications systems where I make most of my income, are in applications in which lives are potentially at stake. The stakes can be high.

How many times have we heard someone say "we don't have time to do unit testing"? Yet unit testing is actually a form of cost control, even more so in embedded systems. And for communications or medical systems, the cost of failure is even higher.

Chip Overclock said...

Forgot to say: as always, thanks for your insightful comments, Joe!

Anonymous said...

Great post. I too am an embedded system developer, and performing unit tests without testing the full range of input values is not very trustworthy in my view.
This, however, presents a problem with running the tests themselves. CPU cycles can become quite expensive, especially when testing functions with interfaces like these: foo(U32 i, U32 u)... this makes for a total of about a gazillion tests to reach full range coverage. The limits of TDD are to me all too obvious. Because without full range tests, you need to look INSIDE the code to find the boundary values, so you cannot rely on the unit test at any time without doing a code review as well.
/Per

Chip Overclock said...

I agree, it's not always practical. But when it is, it only makes sense to test all of them. And when you can't, you definitely need to understand the risk you're taking on.