The Art Of Debugging

My first taste of commercial software development came using Adabas / Natural - a database with a 3GL programming language. There were no debuggers available, but it had an interpreted development environment, so you could write code and use print statements to debug it. I was amazed when another developer asked me to help fix a bug in their code and I saw that they did not use the same technique. This developer's debugging technique was to (apparently randomly) choose a line or two of code, change something, and rerun the application to see if the outcome changed. A couple of print statements quickly narrowed down the problem and had their program working. I found my productivity was many times better just because I had mastered that simple debugging technique. It took me minutes to fix a problem rather than days (or never!).

One day I volunteered to port an IMAP implementation to OS/2. It was written for Unix, and my OS/2 development machine had a port of GCC along with many Unix-compatible libraries, so I thought it should be possible. I did not expect to find that there would be no tests, no install option and nothing explaining how to debug the compiled program. I ended up with an application that would run, but it did not respond to connections and did not write anything to the screen or to any logs that I could find. Whatever it needed was beyond my ability (or patience!) to debug. I gave up after a few days of reading through the source code, adding print statements and attempting to map the flow of the program. I decided I had wasted enough time on a program I was not going to use myself!

So what went wrong? Why was I unable to get it working? With the benefit of hindsight and many more years of experience, here is what I think were the main reasons I was doomed to fail.

I had no idea what IMAP was and I had no experience with socket programming. Not knowing anything about it meant I had no idea how to approach debugging it. The complete lack of any tests meant that I would have to understand the application inside and out to work out why nothing was working. How should I test it? How could I confirm what was working and what wasn't? How should I configure it, and how could I enable logging - was it expecting some system logging service such as syslog? I just did not know enough to even start troubleshooting.

That was the first time I failed to port something. I had ported more than a few different applications to AmigaOS, Windows and OS/2, including a flight simulator from MS-DOS to the Amiga, to Windows and to OS/2. That port required using three different graphics libraries, so I believed I could handle anything!

Fixing a bug sometimes involves a compromise - for example, when the bug is not in your code but in someone else's code, a third party library or even the compiler. Sometimes I come across comments in code that indicate it was rewritten to work around a compiler bug, and you can see the original code - commented out or conditionally compiled. Code in a project usually follows certain patterns, either because that is how most of the project is written or because that is what seems logical for the programming language being used. So if something doesn't fit or looks unusual, it is often highlighted in some way to prevent it being changed and breaking again! Sometimes, if you support two or more different platforms (e.g. Linux and Windows), then the person who changes it may not notice it has broken the other platform.

An example of this happened to me the other day. A unit test failed on Linux and when I investigated I found that a print statement was producing different output between Linux and Windows. My solution was to break the print statement into multiple print statements. There did not appear to be anything wrong with the original code but it used a custom "printf" routine and debugging that would have taken some time. At the expense of possible technical debt - there may be a subtle bug in the custom code that may bite me in the future - I had a quick fix for the failing unit test. One day I may have to revisit the original problem but for now it was resolved.

I do not advocate any particular debugging style - if it works for you then stick with it! The problem is recognizing when what you are trying is not working. Do not persevere with something that is not working. Step back from the problem and consider another way to approach it. If possible, ask someone else for their perspective!

When I start to investigate a problem I try to follow these steps.

Understand the problem as much as I can. What are the inputs and outputs? For example, what did the user do and what was the outcome? If you have a support team working with the customer, have they managed to reproduce the problem? Was it a one-off or can it be reproduced easily?
If it is not reproducible, then what were the circumstances that caused the problem? Perhaps the network went down. Sometimes watching what the user is doing will show what is wrong.

This is a true story. A load process kept failing because the work files it created were disappearing. When we sent an engineer on site to investigate, he discovered that the user was monitoring the file system and deleting any large files that appeared. The user did this in front of the engineer! It was not an automated process. The user started the program, then switched to another shell to check for large files and deleted the work files as they were being created!

When debugging a problem you should understand as much as possible about the process. If you can create a unit test that displays the problem, then this will help you fix it very quickly. If you rely on a regression (or end-to-end) test, then it will be harder and will require understanding a much larger part of the process.

There are many tools and techniques to help when debugging: putting print statements in the code, logging entry and exit from functions, using a debugger to see the values of variables as the code executes, enabling core dumps or getting the call stack from a crash (for example from a Java application), and enabling bounds checking and debug options when compiling. The best results come from understanding the problem and having the smallest possible reproduction so you can iterate quickly. Sometimes it also helps to take a break or get a second opinion.
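
Logging entry and exit from functions can be sketched in Python with a decorator. This is only one possible approach, not a recommendation of a particular tool; the logger name `trace` and the function `apply_discount` are assumptions made up for the example.

```python
import functools
import logging

# Hypothetical logger name chosen for this sketch.
logger = logging.getLogger("trace")


def traced(func):
    """Log entry (with arguments) and exit (with result or exception)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("enter %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Record the traceback, then let the error propagate unchanged.
            logger.exception("error in %s", func.__name__)
            raise
        logger.debug("exit %s -> %r", func.__name__, result)
        return result
    return wrapper


@traced
def apply_discount(price, percent):
    # Hypothetical function being debugged.
    return price - price * percent / 100
```

Calling `logging.basicConfig(level=logging.DEBUG)` before `apply_discount(200, 10)` makes the enter and exit lines visible. Removing the decorator (or raising the log level) switches the tracing off again without touching the function body.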

I have diagnosed a problem from a core dump. As long as you can get a stack trace showing the flow of the program and the values of variables, you can analyze the source code and gain an understanding of what caused the problem. It is a skill that you can learn, but it does require a good understanding of how the process is meant to work.
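
A core dump belongs to native programs, but the same idea of a stack trace plus variable values can be sketched in Python using the standard `traceback` module. The functions `unit_price`, `build_invoice` and `crash_report` below are hypothetical and exist only to produce a failure to inspect.

```python
import traceback


def format_crash(exc):
    """Return a stack trace that also lists each frame's local variables."""
    report = traceback.TracebackException.from_exception(
        exc, capture_locals=True
    )
    return "".join(report.format())


def unit_price(total, quantity):
    # Hypothetical failing code path: divides by zero for an empty order.
    return total / quantity


def build_invoice(order):
    return unit_price(order["total"], order["quantity"])


def crash_report(order):
    # Run the failing path and capture the trace instead of crashing.
    try:
        build_invoice(order)
    except Exception as exc:
        return format_crash(exc)
    return ""
```

Printing `crash_report({"total": 10, "quantity": 0})` shows each frame in the call chain together with its local variables, which is usually enough to see which value was wrong and where it came from.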

In summary, debugging is a skill: it is something you can learn and become good at. It may require a certain way of approaching a problem. I have known more than a few programmers who were bad at debugging, even their own programs. They had not yet learnt how to debug efficiently!

Programming By Numbers
