2021-10-06
Facebook says outage was caused by a cascade of errors
Facebook said in a blog post published on Tuesday that its services went offline on Monday because of an error during maintenance of its network.
Facebook’s family of apps, which includes Instagram, WhatsApp and Messenger, went offline for more than five hours as employees scrambled to repair the damage. More than 3.5 billion people worldwide use Facebook’s services to communicate with friends and family, deliver political messages, and expand their businesses through advertising and outreach.
Santosh Janardhan, Facebook’s vice president of infrastructure, wrote in a blog post that the initial problem occurred in a network that Facebook calls its “backbone,” which connects its data centers around the world.
During maintenance of the network, a command was issued to assess how much capacity was available. But the command backfired, disconnecting the network and blocking Facebook’s data centers from communicating with one another, Mr Janardhan said. He said an audit tool designed to catch erroneous commands failed to detect the error.
But this was just the beginning of the problems. “This change completely cut off our server connection between our data centers and the Internet,” Mr Janardhan wrote. “And that total loss of connection created a second issue that made things worse.”
Because Facebook’s data centers were offline, the company’s servers that manage its Internet addresses were also unavailable. “This made it impossible for the rest of the Internet to find our servers,” Mr Janardhan said.
As the extent of the outage became clear, Facebook’s engineers struggled to restore service because the company’s data centers are so tightly secured that employees could not get immediate access, the company said.
“We have done extensive work on hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity but by an error of our own making,” Mr Janardhan wrote.
Once the engineers were inside Facebook’s data centers and started working, they were able to restore the network. But the servers had to be brought back online gradually so as not to overload the system, Mr Janardhan said.
The company plans to study how the outage happened and to design exercises that will allow employees to practice fixing Facebook’s systems more quickly, he said.
