Facebook Will Happen Again: DNS Outages and Interconnected Systems

Francesco Altomare Technical Sales Lead for Southern Europe, GlobalDots

11th October, 2021

Last week’s massive outage across the Facebook-Instagram-WhatsApp ecosystem left many of us puzzled and concerned: How did our entire social communication (and, for many, our news source) become so dependent on a single, non-regulated conglomerate? How can this conglomerate fail for a seemingly trivial reason such as DNS? And what are the dangers of over-relying on such interconnected entities as our connection to the world?

What caused the Facebook outage?

“The Facebook case was actually more than just a DNS Failure: The root cause seems to be BGP (Border Gateway Protocol) failures underlying the DNS Protocol, which then caused the DNS to start failing,” says Francesco Altomare, GlobalDots’ chief EU-based expert for web performance solutions and business continuity strategies.


“But in essence, the DNS failed because something wasn’t maintained as it should have been, to the point that it required manual intervention and resulted in an hours-long denial of service. A global corporation that controls most of the world’s means of social communication has a responsibility to minimize that risk.

“DNS was the cause of most recent major global outages, including the latest Facebook and Slack ones. It happens because DNS is the most overlooked protocol on the web. And it can happen to any online business, not just the biggest global ones, and cause monetary and reputational damage beyond repair.

“I keep seeing commentaries saying ‘it’s always DNS’ as if nothing can be done about it, and this simply isn’t true. A modest investment in a resilient, performant, SLA-backed DNS technology with 100% uptime can prevent all this, and we’ve been doing this for decades.”

Asked about the probability of such future events in global interconnected services, Francesco explains:

“Relying on interconnected systems carries an inherent risk of system or even service failure. To counter this risk, companies use practices such as SRE (Site Reliability Engineering), DR (Disaster Recovery) and BCP (Business Continuity Planning), all of which deal with varying levels of redundancy built into every layer of the systems infrastructure. So-called ‘compound SLAs’ (Service-Level Agreements) are used to calculate that risk for a system made up of more than one component, each of which carries its own availability SLA. The same goes for the notion of ‘error budgets’ (as explained by Google), where you, as an organization, live with a budget for your systems’ downtime and maintenance windows. If an entity can afford enough system downtime, a solution can always be found to assess and put in place the technology that minimizes the risk and, if the exercise is repeated, potentially takes the risk off the agenda altogether.

“Yet despite these defensive, preventative and protective tools, and the mounting literature on the subject, there is no magic formula for determining a user’s SLAs without actively consulting its key stakeholders. Moreover, even a system backed by a 100% availability SLA is subject to failures, and when more than one component contributes to the availability percentage, calculating the risk of failure becomes even more complex. Simply stated, the question is not how likely systems are to fail, or whether over-reliance on them may lead to more problems. Rather, the question is how long it will take for these systems to fail without constant maintenance and updates, and what can be done to delay the inevitable failure while getting the most out of the system. Beyond that, the question turns to the human aspect: whether having machine-learning AI, rather than people, update system code and configuration in the future will lower the risk and extend the period of efficient system operation.

“DNS is probably the most overlooked web protocol, which is why even the world’s giants aren’t immune, unless they implement a multi-DNS strategy. This could happen to any website, and multi-DNS solutions are highly affordable, so no one should really go without them nowadays.”
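
To make the compound SLA idea concrete, here is a minimal sketch of the arithmetic. It assumes component failures are independent, which real outages often are not, and every availability figure in it is hypothetical, chosen only for illustration. The function names are ours, not part of any library.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def redundant_availability(availabilities):
    """Availability of a redundant set in which any one component is enough."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical availability figures, for illustration only.
dns_provider = 0.999
web_tier = 0.9995
database = 0.9995

# One DNS provider in front of the stack: everything must be up at once.
single_dns_stack = serial_availability([dns_provider, web_tier, database])

# Two independent DNS providers: DNS is down only if both are down.
dual_dns = redundant_availability([dns_provider, dns_provider])
multi_dns_stack = serial_availability([dual_dns, web_tier, database])

print(f"single-DNS stack: {single_dns_stack:.6f}")
print(f"multi-DNS stack:  {multi_dns_stack:.6f}")
```

Under these assumptions, chaining components always lowers the compound figure below the weakest link, while adding a second, independent DNS provider raises the DNS layer well above either provider on its own.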

Steven Puddephatt, GlobalDots’ chief UK solution architect, adds:

“The probability of these systems failing is 100%. We know this because no service provider will offer more than ‘7 nines’ uptime in their SLA. Undoubtedly Facebook have redundancy built into their core platform, but in this case it was a configuration change that caused the outage. As long as humans are involved in updating code and configurations, there’ll always be outages. I don’t believe an over-reliance on them will increase outages. There were far more system outages overall when systems were less consolidated; you just didn’t hear about them as they were less public-facing.”
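
For a sense of scale, the sketch below converts an availability expressed in ‘nines’ into the downtime it still permits, which is essentially the error budget mentioned earlier. It assumes a 365-day year and a simple count of nines; the helper names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def nines(n):
    """Availability with n nines, e.g. nines(3) is 99.9%."""
    return 1 - 10 ** -n


def allowed_downtime_seconds(availability, period_seconds=SECONDS_PER_YEAR):
    """Downtime a given availability still permits over the period."""
    return (1 - availability) * period_seconds


for n in (3, 5, 7):
    budget = allowed_downtime_seconds(nines(n))
    print(f"{n} nines: about {budget:.2f} seconds of downtime per year")
```

Even at seven nines the budget is a few seconds per year rather than zero, which is the point being made: an SLA narrows the window for failure but never closes it.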

Watch Steven’s whiteboard explainer below

GlobalDots is happy to be leading the Multi-DNS front, keeping business customers out of outages for nearly 20 years.

Read the Lifewire article: Facebook’s Failure Shows Why We Shouldn’t Rely on It for Everything

Watch the Multi-DNS webinar
