It's always DNS, except when it's BGP
Facebook’s day-long outage is by far its longest and most extreme in years. At around 9 a.m. PDT on the U.S. West Coast — where the social giant is headquartered — Facebook, WhatsApp, Instagram and Facebook Messenger seemed to vanish from the internet.
The outage continued through market close, with the company’s stock dropping around 5% below its opening price on Monday. By midafternoon, services were beginning to resume after Facebook reportedly dispatched a team to its Santa Clara data center to “manually reset” the company’s servers.
But what makes the outage unique is just how completely offline Facebook was.
In the morning, Facebook sent a brief tweet to apologize that “some people are having trouble accessing our apps and products.” Then, reports emerged that the outage was affecting not just its users, but the company itself. Employees were reportedly unable to enter their office buildings, and staff called it a “snow day” — they couldn’t get any work done because the outage also affected internal collaboration apps.
Facebook hasn’t commented on the cause of the outage, though security experts said evidence pointed to a problem with the company’s network that cut Facebook off from the wider internet and also from itself.
The first signs of trouble came around 8:50 a.m. PDT in California, according to John Graham-Cumming, CTO at networking giant Cloudflare, who said Facebook “disappeared from the internet in a flurry of BGP updates” over a two-minute window. BGP, or Border Gateway Protocol, is the system that networks use to work out the best route for sending data across the internet to another network.
The updates were specifically BGP route withdrawals. Essentially, Facebook had sent a message to the internet that it was closed for business, like raising the drawbridge of its castle. Without any routes into the network, Facebook was basically isolated from the rest of the internet, and because of the way Facebook’s network is structured, the route withdrawals also took out WhatsApp, Instagram, Facebook Messenger and everything inside its digital walls.
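To make the idea of a route withdrawal concrete, here is a minimal sketch in Python. It is a toy model, not how real BGP routers are implemented: the RoutingTable class, the "example-network" name and the 192.0.2.0/24 prefix (an address range reserved for documentation) are all assumptions made for illustration.

```python
import ipaddress


class RoutingTable:
    """Toy model of the routes one network has learned from its neighbors."""

    def __init__(self):
        # Maps an announced prefix to the (hypothetical) network that announced it.
        self.routes = {}

    def announce(self, prefix, origin):
        """Record that `origin` says it can deliver traffic for `prefix`."""
        self.routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        """Forget a previously announced prefix."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin of the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no route: traffic for this address has nowhere to go
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", "example-network")
before = table.lookup("192.0.2.10")  # a route exists

table.withdraw("192.0.2.0/24")
after = table.lookup("192.0.2.10")   # the route is gone
```

Once the prefix is withdrawn, the lookup finds no match at all, which is roughly the position other networks were in when they tried to send traffic to Facebook.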
A few minutes after the BGP routes were withdrawn, users began to notice issues. Internet traffic that should have gone to Facebook essentially got lost on the internet and went nowhere, Rob Graham, founder of Errata Security, said in a tweet thread.
Users found that their Facebook apps had stopped working and that the websites weren’t loading, and many reported problems with DNS, or the domain name system, another critical part of how the internet works. DNS converts human-readable web addresses into machine-readable IP addresses so that a browser or app can find where a web page is located on the internet. Without a way into Facebook’s servers, apps and browsers would keep kicking back what looked like DNS errors.
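The lookup step described above can be sketched in a few lines of Python using the standard library's socket.getaddrinfo. The resolve function and its resolver parameter are hypothetical names introduced here for illustration; this is a rough picture of what an app does before connecting, not the code any Facebook app actually runs.

```python
import socket


def resolve(hostname, resolver=socket.getaddrinfo):
    """Translate a hostname into IP addresses.

    Returns (addresses, error): a sorted list of addresses and None on
    success, or an empty list and a message when the lookup fails.
    `resolver` is an illustrative hook so the lookup can be swapped out.
    """
    try:
        results = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # This is the kind of failure that surfaces to users as a DNS error.
        return [], f"DNS lookup failed: {exc}"
    # Each result is (family, type, proto, canonname, sockaddr);
    # the IP address is the first element of sockaddr.
    addresses = sorted({info[4][0] for info in results})
    return addresses, None
```

Calling resolve("example.com") on a working connection should return one or more addresses and no error. When no answer for a name can be obtained, the same call returns an empty list and an error message instead.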
It’s not known exactly why the BGP routes were withdrawn. BGP, which has been in use since 1989, can be manipulated and maliciously exploited in ways that can lead to massive outages.
What’s more likely is that a Facebook configuration update went terribly wrong and its failure cascaded throughout the internet. A now-deleted Reddit thread from a Facebook engineer described a BGP configuration error well before the cause was widely known.
But while the fix might be simple, recovery could take anywhere from a few hours to several days because of how DNS records are cached across the internet. Internet providers usually refresh their DNS records every few hours, but changes can take several days to propagate fully.
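The delay comes from caching: resolvers hold on to an answer for a set time-to-live (TTL) before asking again. The Python sketch below illustrates that behavior under simplified assumptions. The DnsCache class, the 300-second TTL and the 192.0.2.10 address are hypothetical, and real resolvers are considerably more involved.

```python
import time


class DnsCache:
    """Toy resolver cache: keeps each answer until its TTL runs out."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock   # injectable so the example can simulate time passing
        self.entries = {}    # hostname -> (address, expiry time)

    def put(self, hostname, address, ttl_seconds):
        """Store an answer along with the moment it stops being valid."""
        self.entries[hostname] = (address, self.clock() + ttl_seconds)

    def get(self, hostname):
        """Return the cached address, or None if missing or expired."""
        entry = self.entries.get(hostname)
        if entry is None:
            return None
        address, expires_at = entry
        if self.clock() >= expires_at:
            del self.entries[hostname]  # expired: the resolver must ask again
            return None
        return address


now = [0.0]                               # simulated clock, in seconds
cache = DnsCache(clock=lambda: now[0])
cache.put("example.com", "192.0.2.10", ttl_seconds=300)

fresh = cache.get("example.com")          # still cached
now[0] = 300.0
expired = cache.get("example.com")        # TTL has elapsed
```

Because every resolver's cache expires on its own schedule, different users can see a service come back at different times.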
“To the huge community of people and businesses around the world who depend on us: we’re sorry,” Facebook tweeted around 3:30 p.m. local time. “We’ve been working hard to restore access to our apps and services and are happy to report they are coming back online now. Thank you for bearing with us.”