3 Steps to Find the Source of a Bug Quickly

Published September 22, 2021

Do you like doing maintenance as a software activity? Most people don’t. Maintenance is often associated with trudging through lines of code with the debugger in a desperate search for bugs, in software that someone else wrote.

All in all, maintenance gets the reputation of being an unrewarding activity, with low intellectual stimulation and not as worthy of a developer’s time as building a new feature, for example.

I love doing maintenance. Not because I like feeling lost in code that I don’t know. Nor because I like spending hours running in circles. And even less because I like the feel of the F10 key.

I love doing maintenance because, if you have the right technique, maintenance can be fun.

In this article I try to explain that technique, with the goal that you come to like fixing bugs in the application you’re working on too (if you do, please leave a comment!). It’s not rocket science, and a part of it is in Code Complete. But it has enormous value because it can save you a large amount of time and frustration when working on that unjustly ill-famed activity.

Now if you’re wondering why we talk about that on Fluent C++, which is normally concerned with expressive code, the link is that this technique will save you from looking at a lot of code. So even if that code is not expressive and has poor design, it won’t be in your way to slow you down.

The slowest way to find the source of a bug

Before getting to the best way to identify the source of a bug, let’s see the natural way. The natural way goes like this: you get a bug report related to feature X, you look around the code of feature X, and potentially step through it line by line with the debugger, looking for the cause of the problem.

This is about the least efficient approach to finding the cause of a bug. But this is what we do naturally and, like pretty much everyone, that’s what I was doing as a young sprout.

Why is this approach doomed to failure (or to a very long and painful road to eventual success)? It’s because if you start by looking at the code, you don’t know what you’re looking for. You hope to stumble upon the source of the problem by chance. It’s like looking for a specific street in a city, just by methodically walking around town until you run into that street.

And if you’re in a large codebase, it’s like walking in a big city. You may find it, but chances are you’ll be dehydrated before then.

So the first piece of advice is: don’t start by looking at the code. In fact, you want to spend as much of your analysis time as possible in the application.

But what to look for in the application then?

The quickest way to find the source of a bug

Step #1: Reproduce the issue

The first thing to do in the application is to check that the bug is there. It sounds stupid, but sometimes the development environment is not quite in the same configuration as the one where the bug appears, and in that case any further analysis would be a waste of time.

Step #2: Do differential testing

Ok, now let’s assume that you do reproduce the bug. The next step is then to reduce the test case. This consists of trying slight variations of the original test case in order to refine the scope of the bug.

Step #2a: Start with a tiny difference

It’s going to be a little abstract here, but we’ll get to a concrete example later on. Say that the bug is appearing in feature X when it is in config A1. Other possible configs of the feature X are A2, which is very close to A1, and B which is fairly different from A1. And A2 is simpler than A1.

Since A1 and A2 are so close, the bug will likely be reproduced with A2 too. But let’s test A2 anyway.

If the bug is NOT reproduced in A2 then great, it means that the bug is specific to A1, and lies in the difference between A1 and A2. If you can refine the test by checking another config A11 versus A12, then by all means do. But say that you can’t go further than A1 versus A2. Go to Step #3.

But if the bug is also reproduced in A2, you know that the bug is not specific to A1 and does not lie in the difference between A1 and A2. But you don’t know where the source of the bug is yet.

Step #2b: Continue with larger differences

So you test more distant configs, and simpler ones if possible. B, for example. Since B is not close to A1, it’s likely that you don’t reproduce the bug in B.

But if you DO reproduce the bug in B, it means that you’ve been lied to: the bug has nothing to do with A1. But it’s ok, business people didn’t do it on purpose.

This discovery brings you two things:

it simplifies the test case, if you did find a simpler config B where you reproduce the issue,
it tells you that the bug is probably not related to feature X after all. So you need to do differential testing between feature X and another, close feature X2. And then a remote feature Y. And so on.
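The decision logic of Steps #2a and #2b can be sketched in a few lines of C++. This is only an illustration under assumptions: the `Finding` enum, the `differentialTest` function and the `reproduces` callback (which stands for you trying a config by hand in the application) are hypothetical names, and A1, A2 and B are the configs discussed above.

```cpp
#include <functional>
#include <string>

// Hypothetical outcomes of the differential testing described above.
enum class Finding {
    NotReproduced,        // Step #1 failed: check your environment first
    InDifferenceA1A2,     // bug is specific to A1: go to Step #3
    SharedByA1AndA2,      // bug is in what A1 and A2 have in common, but B lacks
    NotSpecificToConfig   // bug also appears in B: compare feature X with other features
};

// `reproduces` is an assumption of this sketch: it returns true if the bug
// shows up when the application runs with the named config.
using Reproduces = std::function<bool(const std::string&)>;

Finding differentialTest(const Reproduces& reproduces) {
    if (!reproduces("A1")) return Finding::NotReproduced;     // Step #1
    if (!reproduces("A2")) return Finding::InDifferenceA1A2;  // Step #2a
    if (!reproduces("B"))  return Finding::SharedByA1AndA2;   // Step #2b
    return Finding::NotSpecificToConfig;
}
```

Each call to `reproduces` is a check made in the application, not in the code, and each answer rules out a whole region where the bug could have been.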

Step #3: Formulate and check a hypothesis

You now have a pretty accurate location for the bug. This is now the time to formulate a hypothesis about what is causing the incorrect behaviour. What could go wrong in this confined space of the application? If you see several things that go wrong, what’s your gut feeling for which one is the most likely?

Then, and only then, you can look at the code. The point of looking at the code is to confirm (or refute) your hypothesis. So you go directly to the portion of code that your differential testing pinpointed. It should be fairly small. Fire up the debugger (or run the code mentally if you can), and check whether your hypothesis is confirmed.

If it is, congratulations, you identified the source of the bug. If it’s not, do Step #3 again until a hypothesis is confirmed.

A binary search for the root cause of a bug

If you don’t practice this technique or something resembling it yet, it probably sounds somewhat complicated. In that case, a nice analogy is to compare this with linear search versus binary search.

Starting by looking at the code and searching for what’s wrong in it is like linear search: you walk your way through the code, function by function or line by line, until you encounter the source of the issue.

However, the method we described, operating with differential testing and hypotheses, is like binary search: it consists of making checks at a few targeted locations, and deciding each time which new direction to look into. And the same way binary search eliminates huge chunks of the collection from the analysis, differential testing and hypotheses discard huge portions of the codebase that you won’t have to look into.
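To make the analogy concrete, here is a small sketch that counts how many elements each kind of search inspects in a sorted collection. The function names `linearChecks` and `binaryChecks` are made up for this illustration, and the binary search is written by hand (rather than with `std::lower_bound`) only so that the checks can be counted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Number of elements inspected by a linear search for `target`.
std::size_t linearChecks(const std::vector<int>& sorted, int target) {
    std::size_t checks = 0;
    for (int value : sorted) {
        ++checks;
        if (value == target) break;
    }
    return checks;
}

// Number of elements inspected by a binary search for `target`.
std::size_t binaryChecks(const std::vector<int>& sorted, int target) {
    assert(std::is_sorted(sorted.begin(), sorted.end())); // precondition
    std::size_t checks = 0;
    std::size_t low = 0;
    std::size_t high = sorted.size(); // search in the half-open range [low, high)
    while (low < high) {
        const std::size_t mid = low + (high - low) / 2;
        ++checks;
        if (sorted[mid] == target) break;
        if (sorted[mid] < target) low = mid + 1; // discard the lower half
        else high = mid;                         // discard the upper half
    }
    return checks;
}
```

On a sorted vector holding the 1000 integers from 0 to 999, finding 999 takes 1000 checks with the linear search and 9 with the binary search.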

Sure enough, binary search takes more mental effort to implement than linear search. But the method has two advantages: A) it’s way faster and B) it requires you to think.

This last advantage is what makes maintenance fun. Every bug becomes a challenge for your mind, a sort of puzzle. But one for which you have a method, and of which the resolution is only a matter of time.

Next up, we will go through a concrete example of bug-finding to get some practice at applying that method.