Amazon finally has an answer to what caused the huge S3 outage that took down roughly a third of the internet: human error. In fact, the entire episode can well be described as a freak incident.
Something as trivial as entering the wrong input to a command ended up taking more servers offline than anticipated.
Elaborating further, Amazon said an authorized engineer was working to remove a small number of servers to sort out issues with the S3 billing system. While the engineer was following established procedures to accomplish the task, trouble started when he entered the wrong input to a command. That took down far more servers than intended and kick-started a domino effect.
Amazon, however, hasn’t stated whether the act was purely accidental or there was more to it. The fate of the engineer is also unknown as of now.
Meanwhile, more servers were rendered inaccessible for at least four hours. With such large portions of capacity removed, the affected systems had to undergo a full restart before they could be brought back to life. This turned out to be more time consuming than expected, as many of those systems hadn’t been restarted in several years.
Further, while the systems were being restarted, S3 was unable to serve any requests, a situation that continued for around four hours. Naturally, all AWS services that store their data in S3 were also adversely affected.
Now, with systems restored, Amazon stated it has learned its lesson and has committed to putting enough safeguards in place to ensure nothing of this scale is repeated. One of these is a modification to the capacity-removal tool so that it can no longer delete more than a set number of servers at once.
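Amazon’s public statement doesn’t include code, but the idea behind such a safeguard is easy to sketch. The following is a minimal, hypothetical illustration in Python of a capacity-removal check; the function names, limits, and fleet structure are invented for illustration and do not reflect Amazon’s internal tooling.

```python
# Hypothetical sketch of a capacity-removal safeguard.
# All names and thresholds here are assumptions for illustration only.

MAX_REMOVAL_PER_RUN = 5     # assumed cap on servers removed by one command
MIN_FLEET_CAPACITY = 100    # assumed minimum servers the fleet must keep


def remove_servers(fleet: list[str], requested: int) -> list[str]:
    """Remove at most `requested` servers, refusing unsafe requests."""
    if requested > MAX_REMOVAL_PER_RUN:
        raise ValueError(
            f"Refusing to remove {requested} servers; "
            f"limit is {MAX_REMOVAL_PER_RUN} per run."
        )
    if len(fleet) - requested < MIN_FLEET_CAPACITY:
        raise ValueError(
            f"Removal would drop the fleet below the minimum "
            f"capacity of {MIN_FLEET_CAPACITY} servers."
        )
    # Only reached when both checks pass: return the remaining fleet.
    return fleet[requested:]


# Example: a mistyped "50" instead of "5" now fails loudly
# instead of silently draining most of the fleet.
fleet = [f"server-{i}" for i in range(120)]
try:
    fleet = remove_servers(fleet, 50)
except ValueError as err:
    print(err)
```

The point of a guard like this is that a fat-fingered input fails immediately with an error, rather than quietly removing capacity that whole subsystems depend on.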
Changes are also being introduced to the AWS Service Health Dashboard, which itself was knocked out for several hours during the incident. Similarly, Amazon pledged to introduce measures that will speed up recovery in the event of a future outage by ensuring faster restarts.