Bugs make their way to production. It’s a reality I’ve witnessed too often in my career. One of the common reasons for this is simply that testers miss them because the test environment doesn’t mimic the intended production environment. The bugs I remember most vividly were the most costly to the company – costs that could have been avoided with a small investment in the pre-production test environment, making it as close to production as possible. This is the story of one such bug.
A simple customer survey goes wrong
The marketing department at the retailer I worked for had an idea. The team wanted to get more information about their customers and decided to append a survey to customers’ receipts. As an incentive, those who filled out the survey were entered in a prize drawing. To collect the most shopper data, the department wanted these surveys to run during the holiday season when store traffic was at its zenith.
The requirements called for the survey to appear only on every twenty-fifth transaction at the point-of-sale (POS) terminal. The POS development team made the necessary code changes and completed unit testing, then sent the code to the QA department for more stringent testing. QA created test suites to cover all possible variables, not only for the designated twenty-fifth transaction when the survey should be presented, but for all transactions leading up to it and beyond. There were a lot of variables in play.
The POS terminal was the company’s lifeblood; without it there was no revenue. It was imperative that any code changes, no matter how minor, did not disrupt that flow. Every transaction captured a mix of total dollar amounts and payment types. Customers had multiple payment options: cash, check, credit, gift cards, coupons, loyalty points, or a mix. As a result, tests needed to consider all of these variables, verify that transactions were correctly recorded and ensure they were sent to the other store systems that relied on POS data. This is where the problem started.
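To give a sense of how quickly those variables multiply, here is a minimal Python sketch of such a test matrix. It is an illustration only, not the retailer's actual test suite: the payment-type labels, the `mixed` placeholder for split tenders, and the function names are all assumptions made for this example.

```python
from itertools import product

# Payment types named in the article; "mixed" is a placeholder for any split tender.
PAYMENT_TYPES = ["cash", "check", "credit", "gift_card", "coupon", "loyalty_points", "mixed"]
SURVEY_INTERVAL = 25  # the survey prints on every twenty-fifth transaction


def is_survey_transaction(sequence_number: int) -> bool:
    """Return True when this transaction should carry the survey."""
    return sequence_number % SURVEY_INTERVAL == 0


def build_test_matrix(sequence_numbers):
    """Cross every transaction position with every payment type."""
    return [
        {"sequence": seq, "payment": pay, "survey": is_survey_transaction(seq)}
        for seq, pay in product(sequence_numbers, PAYMENT_TYPES)
    ]


# Boundary positions just before, on, and just after the survey transaction.
matrix = build_test_matrix([24, 25, 26, 49, 50, 51])
survey_cases = [case for case in matrix if case["survey"]]
```

Even with only six transaction positions, the matrix holds 42 cases, 14 of which carry the survey. Each case still has to be checked at every system that consumes POS data, not just at the terminal.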
A small oversight leads to significant loss
While the POS systems in the test lab mimicked a store environment exactly, they didn’t mimic all the company systems that interacted with them. Testers ran scenarios and all transactions, regardless of type, were recorded correctly. Every twenty-fifth transaction added the survey to the printed receipt. All tests passed and we rolled out the change to all the production POS terminals in hundreds of stores.
The marketing department was pleased as the data started to roll in. Teams around the company considered the project a success… until about three weeks later when the accounting department found a large mismatch between actual and expected revenue. This was a big problem.
Teams investigated the root cause of the discrepancy. After some time and major effort, they traced the issue to the code change that Marketing had requested. Whenever a transaction that generated a survey was paid for by credit card, the POS terminal failed to send the data to the accounting department’s reconciliation system. A bigger problem was that the store systems only kept digital records of the transactions for 24 hours after the corporate systems confirmed receipt of the data.
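The kind of end-to-end check that could have exposed this can be sketched in a few lines of Python. Everything here is hypothetical: `ReconciliationStub` stands in for the accounting system, and `buggy_send` only simulates the defect as described above; neither reflects the real POS code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    sequence: int
    payment: str
    amount_cents: int


class ReconciliationStub:
    """Hypothetical stand-in for the accounting reconciliation system."""

    def __init__(self):
        self.received = []

    def receive(self, txn: Transaction) -> None:
        self.received.append(txn)


def buggy_send(txn: Transaction, downstream: ReconciliationStub, survey_interval: int = 25) -> None:
    """Simulate the defect: survey transactions paid by credit are never forwarded."""
    if txn.sequence % survey_interval == 0 and txn.payment == "credit":
        return
    downstream.receive(txn)


def find_unreconciled(recorded, downstream: ReconciliationStub):
    """Return transactions recorded at the terminal that never arrived downstream."""
    received = set(downstream.received)
    return [txn for txn in recorded if txn not in received]


stub = ReconciliationStub()
recorded = [Transaction(seq, pay, 1000) for seq in (24, 25, 26) for pay in ("cash", "credit")]
for txn in recorded:
    buggy_send(txn, stub)

missing = find_unreconciled(recorded, stub)
```

In this sketch `missing` contains exactly one entry, the credit-card payment on transaction 25. A comparison like this only works when the test environment includes the downstream system, or at least a faithful stand-in for it.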
Manual intervention wastes thousands of hours
Because our internal test environments could not reproduce this scenario, we rolled back the code instead of trying to fix the problem. While this solved the problem going forward, it did not recover the weeks of data missing from Accounting. The only records we had for reconciliation were the hundreds of rolls of printed POS transactions located in the back rooms of all our stores. Company leaders sent an urgent notice to all stores to gather these physical records and send them to the home office. Once they arrived, teams of accountants had to manually pore over every transaction from every store to find and restore the missing data in order to reconcile the books.
In the end, Marketing received only a fraction of the expected data and the company burned thousands of personnel hours to remedy the situation. All this for a single bug that might have been found if the company had made a small investment in the test environment.