In part one of this series on bug reproduction, we discussed the value of taking the time to find the pattern behind an issue in order to eliminate the potential future costs of a bug in a production setting. I provided a real-life example of an issue that users were reporting but that the developers themselves couldn’t reproduce.
In this post, let’s explore another real-life example of a production bug where I had the pleasure of finding the pattern that directly led to a fix — one that showcases the value of considering how real users will interact with a product, and how that impacts their experience.
Trouble reproducing the issue
Prior to joining uTest, I was working in the QA department of a large book retailer. I got called into the office of the development manager, who was dealing with an in-store system issue being reported to the support desk by store personnel. His team could not reproduce the issue in their lab.
Stores were complaining about the in-store kiosk that customers used to search for information, such as the shelf location of books, or to place a special order if the book was out of stock at the location. Every morning, the kiosks ran smoothly. But as the day went on, the kiosks would slow down to a crawl, eventually becoming useless to both employees and customers. This was a huge problem and was leading to lost sales.
Because I had not been involved in the development of the kiosk, I first had to learn how it worked. It was basically a locked-down version of the company’s web store with a few added applications. A sleep timer reset the kiosk back to the home page after a certain amount of time, and there were security features to stop anyone from tampering with it.
Come up with a theory
After the meeting with the development manager, I went to the test kiosk in the lab. I had a theory on how I could reproduce the issue and set out to put it to the test. My theory was that the development team could not reproduce the issue in the lab because they were not testing for a real-life scenario.
With what I now knew about how the kiosks were set up, how they were used in the stores and the description of the problem being witnessed, I surmised that every search would take a bit longer than the last, until the kiosk eventually slowed to a crawl. Since the systems were rebooted nightly, that would also explain why they worked fine at the start of the day.
This theory was based on how customers used the kiosks. A customer would walk into the store, run their search, get their information and walk away. Then the sleep timer would kick in, resetting the kiosk to the home page, until the next customer walked up and the process started over again.
Put the theory to the test
The first step to reproduce this bug in the lab was to take the time to mimic this real-life scenario. Armed with a stopwatch timer, I tapped a key on the keyboard to wake the device and began timing my search. The first search took about 15 seconds, and I recorded my timing in a log while I waited for the kiosk to reset itself.
Once the kiosk reset, I timed the exact same search again: 18 seconds this time. I ran the test again and again, recording my time-per-search each time. Each run added a few more seconds to the search time. Eventually, the kiosk slowed to a crawl on every search.
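I did all of this by hand with a stopwatch, but the same measurement is easy to sketch in code. The Python below is only an illustration of the idea, not anything we used at the time: the function names are hypothetical, and `action` stands in for whatever step you want to repeat, such as one search followed by a wait for the reset.

```python
import time
from typing import Callable, List


def time_repeated_runs(
    action: Callable[[], None],
    runs: int,
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Run the same action several times and record how long each run takes."""
    durations = []
    for _ in range(runs):
        start = clock()
        action()  # hypothetical stand-in for "run the search"
        durations.append(clock() - start)
    return durations


def average_growth_per_run(durations: List[float]) -> float:
    """Average change in duration from one run to the next (0.0 if fewer than 2 runs)."""
    if len(durations) < 2:
        return 0.0
    return (durations[-1] - durations[0]) / (len(durations) - 1)
```

A growth figure that stays clearly above zero across many runs is the same pattern I saw in my handwritten log: each search a little slower than the one before it.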
I walked back to the development team to tell them I had reproduced the issue. The look on their faces was priceless — they were flabbergasted since I had only been in the lab about an hour.
We went into the lab and I showed them the reproduced issue. I restarted the system to simulate a nightly reboot and ran them through the multiple searches, timing each one, to show them what was happening. However, even though I had managed to reproduce the issue, I couldn’t tell them why it was happening.
Understand the root cause
I asked for the development team’s help in removing some variables from the system, namely the added applications. One-by-one, we removed the applications and retested until there were no added applications left on the kiosk. The issue persisted. With no variables left, we surmised that whatever was creating the problem was part of the web store code.
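The one-by-one removal we did can be pictured as a simple loop. This is a rough Python sketch under my own assumptions: `issue_reproduces` is a hypothetical callback that retests the system with the components still installed, and the component names are placeholders.

```python
from typing import Callable, List, Optional, Sequence


def find_culprit_by_removal(
    components: Sequence[str],
    issue_reproduces: Callable[[List[str]], bool],
) -> Optional[str]:
    """Remove components one at a time, retesting after each removal.

    Returns the component whose removal made the issue disappear, or None
    if the issue persists with every component removed, which points at
    whatever is left underneath (in our case, the web store code).
    """
    remaining = list(components)
    for component in components:
        remaining.remove(component)
        if not issue_reproduces(remaining):
            return component
    return None
```

In our case the equivalent of this loop came back empty: the slowdown was still there with every added application gone.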
Armed with this information, I went to my personal desktop and pulled up the web store to see if I could reproduce the issue there as well. Sure enough, it was there. I then ran the test again, this time with a network traffic sniffer, and isolated the file that was causing the issue.
We took this evidence to the Web team and they identified the issue: a garbage-collection file.
Normally this file wouldn’t be a problem: when a home user ends their session, the file gets cleared. However, in the store kiosks, the sessions were never cleared, and the garbage-collection file simply got larger and larger until it was so big the system would lock up. Because the development team hadn’t been replicating how the technology was used in real life, they couldn’t figure out why there was a bug. The problem was now clear and the fix was simple: clear the session after each use.
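To make the difference concrete, here is a toy Python model of the two behaviors. It is purely illustrative and assumes nothing about how the web store was actually built: `ToySession` and `simulate_customers` are made-up names, and the "size" is just a count of entries standing in for the growing file.

```python
from typing import List


class ToySession:
    """Toy stand-in for per-session state that grows with every search."""

    def __init__(self) -> None:
        self.entries: List[str] = []

    def record_search(self, query: str) -> int:
        """Store the query and return how much state now has to be carried."""
        self.entries.append(query)
        return len(self.entries)

    def end(self) -> None:
        """Clear the accumulated state, as happens when a home user's session ends."""
        self.entries.clear()


def simulate_customers(num_customers: int, clear_between_customers: bool) -> List[int]:
    """Return the session size seen by each customer in turn."""
    session = ToySession()
    sizes = []
    for i in range(num_customers):
        sizes.append(session.record_search(f"query {i}"))
        if clear_between_customers:
            session.end()  # the fix: clear the session after each use
    return sizes
```

With `clear_between_customers` set to `False`, each customer inherits everything left behind by the previous ones, so the size keeps climbing. With it set to `True`, every customer starts from a clean slate.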
With this fix, the kiosk didn’t routinely slow down for customers, and the retailer wasn’t losing out on sales.
Reproducing bugs takes patience, investigation and the ability to put yourself in the mindset of a real user. The closer you can bring your testing to replicating real user behavior and understanding the context of in-the-wild use, the more likely you are to identify and reproduce issues.