05 October 2021

What caused the Facebook outage and why did it take so long to fix?

Facebook, Instagram and WhatsApp have come back online after an outage took the social media giants offline for several hours late on Monday.

The company has blamed the issue on a “faulty configuration change” within its network infrastructure which had a “cascading effect” that brought the firm’s platforms “to a halt”.

Here is a closer look at the incident.

– What happened?

Just before 5pm in the UK on Monday, people began noticing they could not access Facebook, or other services it owns and runs like Instagram and WhatsApp.

It would be more than five hours before service began to return.

Service outages on major platforms are not uncommon, but ones of this length are unusual, and it became clear Facebook was struggling to fix the problem.

In the meantime, other platforms such as Twitter and messaging app Signal saw huge surges in traffic as people turned to them to get back online, with some Twitter users even reporting issues at one point as the platform strained under the weight of the sudden burst of additional users.

By late Monday evening, access to Facebook and Instagram had returned for most users, while WhatsApp said it was back up and running “at 100%” as of 3.30am on Tuesday morning.

– What caused the issue?

In a statement, Facebook said the problem had been caused by a configuration change to the “backbone routers” that coordinate traffic between the firm’s data centres. This caused the cascading effect which brought the company’s various services down.

The company has not yet offered any further insight on what specifically caused the issue or how it was fixed.

But web infrastructure and security firm Cloudflare has provided a detailed breakdown of the incident as it saw it unfold, and said it revolved around two key mechanisms which make the internet work – Domain Name System (DNS) and Border Gateway Protocol (BGP).

In essence, DNS is the internet's address book and BGP its roadmap: together they help people navigate the vast mesh of connected networks that make up the internet, first finding the address of the website they want and then the quickest route to it.

Cloudflare said that, through a series of updates on Monday, Facebook had seemingly accidentally told the rest of the internet via BGP that the paths to everything it runs were no longer there – meaning people could no longer find a way to the social network.
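To illustrate why withdrawing routes also breaks name lookups, here is a minimal sketch in Python. It is a toy model, not a description of Facebook's or Cloudflare's actual systems: the `RoutingTable` class, the `resolve` function, the `example.com` name and the reserved documentation addresses are all assumptions made for illustration. The idea is simply that a DNS answer can only be fetched if there is still a route to the server that holds it.

```python
import ipaddress


class RoutingTable:
    """Toy stand-in for the routes a network has learned via BGP (hypothetical)."""

    def __init__(self):
        self.routes = {}  # maps an IP prefix to a next-hop label

    def announce(self, prefix, next_hop):
        """Record that a prefix is reachable through next_hop."""
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        """Forget a prefix, as happens when its owner withdraws the route."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the next hop for the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


# Hypothetical data: where the name server for a site lives, and what it would answer.
NAMESERVERS = {"example.com": "192.0.2.53"}
ZONE = {"example.com": "198.51.100.10"}


def resolve(name, table):
    """Return the site's address only if its name server is still reachable."""
    nameserver = NAMESERVERS[name]
    if table.lookup(nameserver) is None:
        return None  # no route to the name server, so the lookup fails
    return ZONE[name]


table = RoutingTable()
table.announce("192.0.2.0/24", "peer-A")
before = resolve("example.com", table)  # name server reachable

table.withdraw("192.0.2.0/24")
after = resolve("example.com", table)   # route gone, lookup fails
```

In this sketch the website's own servers never change; the lookup fails purely because the route to the name server has disappeared, which is the kind of knock-on effect the Cloudflare account describes.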

Experts have said this was most likely caused by a software bug in the updates or by human error. Some have noted that Facebook's statement did not rule out foul play, but there is currently no evidence to suggest that is the case.

– Why did it take so long to fix?

It appears that the problem not only took down the social media platforms, but everything Facebook runs, including its own internal systems – with reports that staff were locked out of offices as internet-connected keycard entry systems went down, and were also unable to access their internal communications platform.

As a result, it was initially hard for staff to diagnose the problem and coordinate a fix.

There were even reports in the US of Facebook having to send a team to one of its data centres to reset the servers manually to fix the issue.

One expert noted that ongoing social distancing measures because of the pandemic, along with remote working, may also have played a part.

Software testing expert Adam Leon Smith of BCS, The Chartered Institute for IT, said: “It is unlikely the issues were directly caused by people working from home, however it is quite possible that it took so long to restore the service because of reduced staffing within the data centre.

“This would compound the problem because the nature of the failure meant that remote access to the data centre was also unavailable.”

– Can anything be done to prevent this from happening again?

This latest incident, after the major outages linked to Cloudflare in 2020 and Fastly earlier this year, will again highlight the potential problems with having large portions of the internet reliant on just a handful of large companies, where one small issue can bring down huge segments of online services.

There are currently no obvious solutions to this, but this latest outage is likely to reignite the debate around internet infrastructure.

For many individuals and businesses too, the incident showed just how much they depend on Facebook and its services not just to communicate, but also to log in to other platforms.

In response, people have been encouraged to consider using other credentials beyond their Facebook log-in details to access other online services.
