Facebook, WhatsApp and Instagram all went offline due to maintenance

5 October 2021

The massive outages that shut down Facebook, its associated services (Instagram, WhatsApp, Oculus, Messenger), its platforms for businesses and the company’s own internal networks all began with routine maintenance.

According to Santosh Janardhan, Facebook’s vice president of infrastructure, a command issued during maintenance inadvertently took down the backbone that connects all of Facebook’s data centers around the world.

That in itself is bad enough, but as we’ve already explained, the reason people couldn’t use Facebook was that the DNS and BGP routing information pointing to its servers suddenly disappeared. According to Janardhan, that was a secondary effect: Facebook’s DNS servers detected the loss of connection to the backbone and stopped advertising the BGP routing information that helps every computer on the Internet find its servers. The DNS servers were still working, but they were unreachable.
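To make that chain of events concrete, here is a minimal toy model in Python. It is only a sketch of the behavior described above, not Facebook's actual software: the class names, the health-check method and the addresses (taken from ranges reserved for documentation) are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingTable:
    """Toy stand-in for the Internet's view of which prefixes are reachable."""
    announced: set = field(default_factory=set)

    def announce(self, prefix: str) -> None:
        self.announced.add(prefix)

    def withdraw(self, prefix: str) -> None:
        self.announced.discard(prefix)


@dataclass
class DnsNode:
    """Hypothetical DNS server that ties its BGP advertisement to backbone health."""
    prefix: str
    records: dict
    serving: bool = True  # the DNS process itself stays up throughout

    def health_check(self, backbone_up: bool, table: RoutingTable) -> None:
        # Advertise the prefix only while the backbone is reachable.
        if backbone_up:
            table.announce(self.prefix)
        else:
            table.withdraw(self.prefix)

    def answer(self, name: str):
        return self.records.get(name) if self.serving else None


def resolve(name: str, node: DnsNode, table: RoutingTable):
    """A client can only query the node if a route to its prefix exists."""
    if node.prefix not in table.announced:
        return None  # no route: the server is up but unreachable
    return node.answer(name)


# Placeholder prefix, hostname and address, all from documentation ranges.
table = RoutingTable()
node = DnsNode(prefix="192.0.2.0/24", records={"example.com": "203.0.113.10"})

node.health_check(backbone_up=True, table=table)
before = resolve("example.com", node, table)   # route exists, lookup succeeds

node.health_check(backbone_up=False, table=table)
after = resolve("example.com", node, table)    # route withdrawn, lookup fails
still_working = node.answer("example.com")     # the server itself can still answer
```

In this model the lookup fails after the backbone goes down even though the server can still answer when asked directly, which is the "working but unreachable" state Janardhan describes.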

Our products were bad yesterday, so we’re sharing some more details here on exactly what happened, how it happened and what we’re learning from it: https://t.co/IXRt572h4c — Mike Schroepfer (@schrep) 5 October 2021

The loss of network connections and DNS cut engineers trying to fix the problem off from the servers and disabled many of the tools they normally use for repair and communication – as we heard yesterday.

The blog post notes that engineers faced additional obstacles because of the physical and system security around this critical hardware. Once they “activated the secure access protocol” (apparently not a code word for “open the server door with an angle grinder”), they were able to get the backbone online and slowly restore services under increasing load. This is why it took longer for some people to get service back yesterday: bringing everything up at once would have created a surge in demand for power and computing that could have caused more failures.
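The idea of restoring service in stages rather than all at once can be sketched in a few lines of Python. This is a generic illustration, not the procedure Facebook followed: the function names, the starting fraction and the doubling factor are assumptions chosen for the example.

```python
def ramp_schedule(start: float = 0.05, factor: float = 2.0):
    """Yield the fraction of traffic to admit at each stage, ending at 1.0."""
    if not 0 < start <= 1 or factor <= 1:
        raise ValueError("need 0 < start <= 1 and factor > 1")
    fraction = start
    while fraction < 1.0:
        yield fraction
        fraction *= factor
    yield 1.0


def restore(stages, is_healthy):
    """Admit more traffic stage by stage; stop at the last healthy level."""
    admitted = 0.0
    for fraction in stages:
        if not is_healthy(fraction):
            break  # hold at the previous level instead of pushing on
        admitted = fraction
    return admitted


# Hypothetical health check: systems cope with up to half of normal traffic.
level = restore(ramp_schedule(0.25, 2.0), lambda fraction: fraction <= 0.5)
```

Here the ramp stops at half of normal traffic because the imaginary health check fails beyond that point; a real rollout would retry later rather than give up.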

So that’s it. No conspiracy to take down Mark Zuckerberg’s baby, and no technicians taking an ax to secure facilities. Just a bug in a command that an audit tool missed, and for six hours, services connecting billions of people disappeared.