Facebook admits ‘cock-up’ during routine maintenance work was to blame for 5-hour outage of social network
The Facebook outage which took the social network, as well as Instagram and WhatsApp, offline for more than five hours was caused by an error during a routine maintenance job, the company has said.
Billions of the platforms’ users had been left unable to get online on Monday by the fault, which the company said was “an outage caused not by malicious activity, but an error of our own making”.
Santosh Janardhan, Facebook’s vice president of infrastructure, said that during what was “routine maintenance work” on the firm’s backbone network “a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centres globally”.
Writing in a blog post, he said: “Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.
“This change caused a complete disconnection of our server connections between our data centres and the internet. And that total loss of connection caused a second issue that made things worse.”
Mr Janardhan said the outage also took time to fix because Facebook’s servers are designed to offer better physical security.
“They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them,” he said.
He confirmed that Facebook then had to bring the servers back online slowly, to avoid any further issues.
“We knew that flipping our services back on all at once could potentially cause a new round of crashes due to a surge in traffic,” he said.
“Every failure like this is an opportunity to learn and get better, and there’s plenty for us to learn from this one.
“After every issue, small and large, we do an extensive review process to understand how we can make our systems more resilient. That process is already under way.”
As well as sparking debate about the public use of social media, the outage saw EU competition commissioner Margrethe Vestager repeat calls for greater competition in the tech sector, saying the incident highlighted the negative impact of big tech firms controlling large swathes of the online world.
“We need alternatives and choices in the tech market, and must not rely on a few big players, whoever they are,” she wrote on Twitter.