The Art Of Debugging

My first taste of commercial software development came using Adabas / Natural - a database with a 4GL programming language. There were no debuggers available, but it had an interpreted development environment, so you could write code and use print statements to debug it. I was amazed when another developer asked me to help fix a bug in his code and I saw that he did not use the same technique. His approach was to (apparently randomly) choose a line or two of code, change something, and rerun the application to see if the outcome changed. A couple of print statements quickly narrowed down the problem and had his program working. I found my productivity was many times better just because I had mastered that simple debugging technique. It took me minutes to fix a problem rather than days (or never!).

One day I volunteered to port an IMAP implementation to OS/2. It was written for Unix, and my OS/2 development machine had a port of GCC along with many Unix-compatible libraries, so I thought it should be possible. I did not expect to find that there would be no tests, no install option and nothing explaining how to debug the compiled program. I ended up with an application that would run, but it did not respond to connections and did not write anything to the screen or to any logs that I could find. Whatever it needed was beyond my ability (or patience!) to debug. I gave up after a few days of reading through the source code, adding print statements and attempting to map the flow of the program. I decided I had wasted enough time on a program I was not going to use myself!

So what went wrong? Why was I unable to get it working? With the benefit of hindsight and many more years of experience, here is what I think were the main reasons I was doomed to fail.

I had no idea what IMAP was and I had no experience with socket programming. Not knowing anything about it meant I had no idea how to approach debugging it. The complete lack of tests meant that I would have to understand the application inside and out to work out why nothing was working. How should I test it? How could I confirm what was working and what wasn't? How should I configure it and enable logging - was it expecting some system logging service such as syslog? I just did not know enough to even start troubleshooting.

That was the first time I failed to port something. I had ported more than a few different applications to AmigaOS, Windows and OS/2, including a fight simulator that I took from MS-DOS to the Amiga, to Windows and to OS/2. That required using three different graphics libraries, so I believed I could handle anything!

Fixing a bug sometimes involves a compromise - for example, when the bug is not in your code but in someone else's code, a third party library or even the compiler. Sometimes I come across comments in code that indicate it was rewritten to work around a compiler bug, and you can see the original code - commented out or conditionally compiled. Code in a project usually follows certain patterns, either because that is how most of the project is written or because that is what seems logical for the programming language being used. So if something doesn't fit or looks unusual, it is often highlighted in some way to prevent it being changed and breaking again! If you support two or more different platforms (e.g. Linux and Windows), then the person who changes it may not notice that it has broken the other platform.

An example of this happened to me the other day. A unit test failed on Linux and when I investigated I found that a print statement was producing different output between Linux and Windows. My solution was to break the print statement into multiple print statements. There did not appear to be anything wrong with the original code but it used a custom "printf" routine and debugging that would have taken some time. At the expense of possible technical debt - there may be a subtle bug in the custom code that may bite me in the future - I had a quick fix for the failing unit test. One day I may have to revisit the original problem but for now it was resolved.

I do not advocate any particular debugging style - if it works for you then stick with it! The problem is recognizing when what you are trying is not working. Do not persevere with something that is not working. Step back from the problem and consider another way to approach it. If possible, ask someone else for their perspective!

When I start to investigate a problem I try to follow these steps.

Understand the problem as much as I can. What are the inputs and outputs? For example, what did the user do and what was the outcome? If you have a support team working with the customer, have they managed to reproduce the problem? Was it a one-off or can it be reproduced easily?
If it is not reproducible, then what were the circumstances that caused the problem? Perhaps the network went down. Sometimes watching what the user is doing will show what is wrong.

This is a true story. A load process kept failing because the work files it created were disappearing. When we sent an engineer on site to investigate, he discovered that the user was monitoring the file system and deleting any large files that appeared. The user did this in front of the engineer! It was not an automated process. The user started the program, then switched to another shell to check for large files and deleted the work files as they were being created!

When debugging a problem you should understand as much as possible about the process. If you can create a unit test that displays the problem, it will help you fix the problem very quickly. If you have to rely on a regression (or end-to-end) test, it will be harder and will require understanding a much larger part of the process.
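As a minimal sketch of what such a unit test might look like, here is a Python example. The function `average` and the reported failure (a crash on empty input) are hypothetical and only stand in for whatever code and bug you are actually chasing.

```python
def average(values):
    """Hypothetical function under test: the mean of a list of numbers."""
    if not values:
        # The fix. Before it, an empty list raised ZeroDivisionError.
        return 0.0
    return sum(values) / len(values)


def test_empty_input_reported_by_user():
    # The smallest input that reproduced the (hypothetical) failure.
    assert average([]) == 0.0


def test_normal_input_still_works():
    # Guard against the fix breaking the ordinary case.
    assert average([1, 2, 3]) == 2.0


if __name__ == "__main__":
    test_empty_input_reported_by_user()
    test_normal_input_still_works()
    print("reproduction tests passed")
```

The first test is written before the fix and fails until the bug is gone; after that it stays in the test suite so the problem cannot quietly return.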

There are many tools and techniques to help when debugging: putting print statements in the code, logging entry and exit from functions, using a debugger to see the values of variables as the code executes, enabling core dumps or getting the call stack from a crash (for example from a Java application), and enabling bounds checking and debug options when compiling. The best results come from understanding the problem and having the smallest possible reproduction so you can iterate quickly. Sometimes it also helps to take a break or get a second opinion.
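Logging entry and exit from functions can be sketched in Python with a small decorator. This is only one way to do it; the names `trace` and `parse_port` are made up for the illustration.

```python
import functools
import logging

logger = logging.getLogger("trace")


def trace(func):
    """Log entry (with arguments) and exit (with result or exception)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("enter %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Record the failure with its stack trace, then let it propagate.
            logger.exception("error in %s", func.__name__)
            raise
        logger.debug("exit %s result=%r", func.__name__, result)
        return result
    return wrapper


@trace
def parse_port(text):
    """Hypothetical function being debugged."""
    return int(text.strip())


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
    parse_port(" 8080 ")
```

Because the tracing goes through the logging module, it can stay in the code and be switched on only when needed by changing the log level.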

I have diagnosed a problem from a core dump. As long as you can get a stack trace showing the flow and the values of variables, you can analyze the source code and gain an understanding of what caused the problem. It is a skill that you can learn. It does require a good understanding of how the process is meant to work.
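Python does not produce a core dump for an ordinary exception, but the same idea (a stack trace together with the values of variables) can be approximated with the standard `traceback` module. The sketch below is an analogy rather than core dump analysis, and `load_row` and `describe_failure` are hypothetical names.

```python
import traceback


def load_row(row):
    """Hypothetical step that fails on malformed input."""
    name, qty = row.split(",")
    return name, int(qty)


def describe_failure(func, *args):
    """Run func; on failure return the stack trace including local variables."""
    try:
        func(*args)
    except Exception as exc:
        report = traceback.TracebackException.from_exception(
            exc, capture_locals=True
        )
        return "".join(report.format())
    return ""


if __name__ == "__main__":
    print(describe_failure(load_row, "widget,ten"))
```

The report lists each frame followed by its local variables, so you can see that `qty` held the text `'ten'` at the point where the conversion failed, without rerunning the program.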


In summary, debugging is a skill: it is something you can learn and become good at. It may require a certain way of approaching a problem. I have known more than a few programmers who were bad at debugging, even their own programs. They had not yet learnt how to debug efficiently!
