Why One Bug Caused Widespread Internet Chaos

Anybody familiar with the concept of the butterfly effect is likely to have some sympathy for the Amazon Web Services engineer who inadvertently caused an Internet meltdown on February 28, 2017.

The cloud-based infrastructure provider suffered an hours-long outage that crippled a significant proportion of the Internet, with a plethora of high-profile websites and apps stopped in their tracks. The outage was centered in the Northern Virginia region and ensured that AWS was unable to service requests for around five hours.

Although there was no evidence that this was linked to a malicious attack, people (unsurprisingly) wanted to know what was going on.

And it turns out that human error was to blame. To be more specific, a member of the Simple Storage Service (S3) team typed in the wrong command as part of a debugging process intended to speed up the S3 billing process.

Yes, a typo brought parts of the Internet to its knees.

According to Amazon Web Services, the debugging was only supposed to affect a limited number of servers. The incorrect command not only took out a larger set of servers than intended in the US-EAST-1 datacenter—a huge location that is also Amazon’s oldest, ZDNet reported—but also removed support for two other S3 subsystems—the index subsystem and the placement subsystem. Both of these subsystems required a full restart and safety checks that took “longer than expected.”

“At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process,” said AWS, in a press release. “Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.”

The full AWS explanation for the outage can be read here.

Never Hurts To Plan For An Unexpected Event

The outage only lasted for a few hours but it highlights the ripples that can be created when the human factor comes into play.

AWS said that it builds its systems with the assumption that things might fail but admitted that it had not restarted either of these subsystems in its larger regions for years. S3 has experienced massive growth, which means that any outage or bug is always going to have a significant effect on the digital world.

To be fair to AWS, it has already made several changes to its operational practices as a result of this unforeseen event.

These changes include modifying the tools used to remove capacity and adding safeguards to ensure that other subsystems are not affected by an incorrect input. AWS has also begun auditing its other operational tools to speed up recovery time, splitting services into small partitions or cells and reducing the dependency that the AWS Service Health Dashboard—which was also taken out by the incorrect command—has on S3.

AWS has apologized for the impact that this event had for its thousands of customers, but it acknowledged that one small bug caused a whole heap of problems. With that in mind, this widespread outage through human error should be a wakeup call for any companies that test for bugs on an infrequent basis.

Want to see more like this?
David Bolton
Former ARC Writer
Reading time: 5 min

How to Assess Testing Resource Allocation

If you can’t measure the value of your efforts, you can’t explain or even justify your testing investment

Using Questionable Datasets to Train AI Could Come With High Costs

As companies look to capitalize on AI development, they must stay mindful of how they source training data — AI algorithms developed from private or non-consensual data may cost businesses in the long run.

Why Payment Testing is a Constant for Media Companies

Simulated transactions and pre-production testing won’t ensure friction-free subscriber payment flows

How Mobile Network Operators Can Leverage e- and iSIMs

We explain what e- and iSIMs are, what they mean for the industry and how MNOs and MVNOs can leverage them to their advantage.

Jumpstart Your Software Testing Education

Testers have a variety of options to upskill and grow professionally

The Future of Generative AI: An Interview with ChatGPT

We ask ChatGPT about where it sees itself in the future, what needs to happen for it to get there and how Applause can help.