Why some bugs are harder to catch than others

Posted by admin on May 12, 2008 in launch, news

About monaSoftware

I guess that the creation of this blog comes just in time. I have finally been able to catch one of the most elusive bugs I have suffered as a developer, and I’d like to take the opportunity to tell a little story about how I caught it.

The purpose of this blog is primarily to supplement the information on the web site of my software company, MomSoft. That means not just announcing new releases and explaining obscure, or at least unusual, ways of using the software, but also commenting on computing in general (usually Windows related, but not necessarily).

The bug. A succinct description

Your typical software bug

For those of you who don’t know it, Control Runner is our most beloved product. It was our first product (it was called MomShell at the time) and we have been selling and enhancing it for nearly 15 years now.

Control Runner, being as mature as it is, works very nicely. There are just a couple of glitches that we have not yet been able to solve, mostly related to the handling of the icons of the program buttons’ targets. Fixing them requires a complete change in how icons are managed, so it will have to wait until we release version 4 in a few months.

But from time to time users reported a bug that caused the configuration of Control Runner to be lost. Some months we received one of these reports; some months we received none.

Now, this is a very serious problem. A typical Control Runner user has a great many items configured. At the latest count I have nearly 200 of them on the development computer (far fewer on my normal working computers). Reconfiguring Control Runner from scratch is, to say the least, a pain.

For anyone familiar with programming, the worst kind of bug is the irreproducible bug. This is a bug that seems to be alive and unrelated to your own code. It happens, apparently, when the bug so wishes, as opposed to when the program flow reaches a statement that has an error, or when the algorithm you are using reaches the point where it is wrong. For the software developer, bug hunting usually involves setting up the exact conditions that cause the bug to appear and using a magnifying lens (also known as a debugger) to see what is happening.

That was not possible here. Our users were unable to describe what conditions caused the configuration to be lost. No error message, no computer hang beforehand. Nothing. Nada.

So, I had to figure it out.

The history of a frustration

I made an assumption. For me it was clear that the problem was that, somehow, Control Runner was not saving the configuration files correctly. Therefore they would become corrupted, and the next time the computer started, Control Runner would encounter an error and restart with a default configuration.

I examined the code used to save the configuration and tried several tricks to catch this elusive bug. But nothing worked. It looked as if, after all, Control Runner was able to save the configuration files without problems (which turned out to be true). But I thought I knew better. I was sure there had to be a problem because the configuration files were getting corrupted.

I asked my users to send me their configuration files to try to reproduce the problem. I wanted to see what kind of items could be causing the files to become corrupt. Unfortunately, when the error happened, the old (possibly corrupted) configuration files were overwritten with the new, empty files.

So I made Control Runner back up the configuration files before saving them, and asked users to send me the backup files. It didn’t work either. The backups contained just normal items, and I was able to open and save them without any problem whatsoever.
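The backup step is simple to sketch. The following is only an illustration in Python, not Control Runner’s actual code (which is written in Delphi); the function name `save_with_backup` and the `.bak` suffix are assumptions made for the example.

```python
import shutil
from pathlib import Path
from typing import Optional


def save_with_backup(config_path: Path, new_contents: str) -> Optional[Path]:
    """Copy the current file to '<name>.bak', then overwrite it.

    Returns the backup path, or None if there was no old file to back up.
    The name and the '.bak' suffix are hypothetical, chosen for this sketch.
    """
    backup_path = None
    if config_path.exists():
        backup_path = config_path.with_name(config_path.name + ".bak")
        # Keep the previous contents so they can be inspected later.
        shutil.copy2(config_path, backup_path)
    config_path.write_text(new_contents, encoding="utf-8")
    return backup_path
```

With something like this in place, the file a user sends in is the one that existed before the suspect save, not the empty one written afterwards.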

The solution. So simple

Then, one day I received the following message from a user:

I find at times CR loses all of it’s settings and just opens with the two default buttons. The only way to get things back is to re-load a backup set of files. I thought this might be due to some failure in writing the data files out at the end of a session but I now find that it is a problem at startup, not closedown. For some reason the program fails to read the data files and, of course, when I close it down then it wipes all the data away. If I make the data files read only then I can restart CR and it reads them perfectly.

And that was it. The problem was not that Control Runner was corrupting the configuration files when writing them, but rather that it sometimes failed to read them. The result was very similar: the old configuration was lost. But the cause was completely different.
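The failure mode the user describes (a failed read at startup, followed by a save at shutdown that wipes the good files) can be illustrated with a small sketch. This is Python rather than Delphi, and it is not how Control Runner is implemented; the `Settings` class, the `load_failed` flag and `DEFAULT_CONFIG` are all hypothetical names. The guard in `save` plays the same role as the user’s read-only trick: it keeps defaults from overwriting files that exist but could not be read.

```python
from pathlib import Path

DEFAULT_CONFIG = "two default buttons"  # hypothetical stand-in for the defaults


class Settings:
    """Hypothetical settings holder that remembers whether loading failed."""

    def __init__(self, path: Path):
        self.path = path
        self.contents = DEFAULT_CONFIG
        self.load_failed = False

    def load(self) -> None:
        if not self.path.exists():
            return  # genuine first run: defaults are the right answer
        try:
            self.contents = self.path.read_text(encoding="utf-8")
        except OSError:
            # The file is there but could not be read. Remember that,
            # so the defaults in memory are never written over it.
            self.load_failed = True

    def save(self) -> bool:
        if self.load_failed:
            return False  # refuse to wipe a configuration we never read
        self.path.write_text(self.contents, encoding="utf-8")
        return True
```

The point of the sketch is only that "no file yet" and "file present but unreadable" are different situations, and that treating the second like the first is what destroys the data.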

The lesson I have learned is: “When you are completely unable to find any problem in the particular code that you are staring at, no matter how hard you try, maybe the problem lies elsewhere.”

I still don’t know what is causing the problem. I suspect that some systems have too many processes running at startup, and the file-handling routines used by Delphi stop waiting and simply return without reading the file. I will investigate the matter further in the future.

Whatever the reason, working around an issue that had puzzled me for years was a matter of minutes. Control Runner now notifies the user if the configuration files seem to be wrong. This introduces the delay needed for the computer to finish its other tasks and, on the second try, the configuration files are read correctly.
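The idea of "wait a moment, then try again" can be sketched as follows. This is a Python illustration under stated assumptions, not the Delphi code in Control Runner: there the pause comes from the user dismissing the notification, while here a plain sleep stands in for it. The function name `read_with_retry` and its parameters are hypothetical.

```python
import time
from typing import Callable, Optional


def read_with_retry(
    path,
    attempts: int = 2,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Try to read a configuration file, pausing between failed attempts.

    'path' is anything with a read_text() method, such as pathlib.Path.
    Returns the file contents, or None if every attempt failed.
    """
    for attempt in range(attempts):
        try:
            return path.read_text(encoding="utf-8")
        except OSError:
            # Give a busy system time to settle before the next try,
            # but do not wait after the final attempt.
            if attempt < attempts - 1:
                sleep(delay_seconds)
    return None
```

A caller that gets `None` back knows the read failed, as opposed to the file being empty, and can decide what to tell the user.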

Easy when you know where to look.


4 Comments on Why some bugs are harder to catch than others

By Louis Vane on May 13, 2008 at 11:20 am

Interesting story. But I don’t think that you have solved the bug. I think that showing a message informing of an error is rude.

By admin on May 15, 2008 at 10:53 am

Louis,

Yes, you are right. This is not a permanent solution to the problem. But it is convenient at the moment because I have a few volunteers who are going to report to me when they get the message. That will hopefully show me what is causing the error.

By Brian David on May 24, 2008 at 9:47 am

Got the configuration splash screen you speak of in the text shown above.

By admin on May 24, 2008 at 2:41 pm

Brian,
How many times? Did anything else happen?
Thank you for letting me know, anyway.
