BGP Explained: The Protocol That Could Be Behind Facebook’s Disappearance

2021-10-05

On Monday, Facebook went completely offline, taking Instagram and WhatsApp down with it (not to mention a few other websites). Many people were quick to say that the incident was related to BGP, or Border Gateway Protocol, citing sources inside Facebook, traffic analysis, and the gut instinct that “it’s always DNS or BGP.” Facebook is on its way back up, but it all raises the question:

What is BGP?

At a very basic level, BGP is one of the systems the Internet uses to get your traffic to where it needs to go as fast as possible. Because there are many different Internet service providers, backbone routers, and servers responsible for getting your data to, say, Facebook, there are many different routes your packets could end up taking. BGP’s job is to show them the way and make sure it is the best route.

I’ve heard BGP described as a system of post offices, an air traffic controller, and more, but I think my favorite interpretation was the one that compared it to a map. Imagine BGP as a group of people creating and updating maps that show you how to access YouTube or Facebook.

BGP is like a map telling your computer which bridges it needs to cross to get to Facebook

When it comes to BGP, the Internet is broken up into large networks, known as autonomous systems. You can imagine them as island nations – they are networks controlled by a single entity, which could be an ISP such as Comcast, a company such as Facebook, or any other large organization, such as a government or major university. It would be extremely difficult to build bridges connecting each island to all the others, so BGP is responsible for telling you which islands (or autonomous systems) you need to pass through to reach your destination.

Since the Internet is always changing, maps need to be updated – you don’t want your ISP to lead you down an old road that no longer leads to Google. Because it would be a huge undertaking to map the entire Internet at all times, autonomous systems share their maps. They occasionally talk to their island neighbors to see and copy any updates made to their maps.
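To make the map-sharing idea a bit more concrete, here is a minimal sketch in Python. It is a toy model, not how real routers are implemented: the island names are made up, each island only remembers one path to a single destination, and "better" simply means "fewer islands to cross".

```python
# Toy model of islands (autonomous systems) copying map updates from neighbours.
# All names are hypothetical; real BGP carries far more information per route.

def share_maps(neighbours, origin):
    """Spread routes to `origin` until no island learns anything new."""
    # Every island starts with no route, except the origin, which knows itself.
    paths = {asn: None for asn in neighbours}
    paths[origin] = [origin]

    changed = True
    while changed:
        changed = False
        for asn, peers in neighbours.items():
            for peer in peers:
                peer_path = paths[peer]
                # Skip neighbours with no route, and routes that already
                # pass through this island (that would be a loop).
                if peer_path is None or asn in peer_path:
                    continue
                candidate = [asn] + peer_path
                if paths[asn] is None or len(candidate) < len(paths[asn]):
                    paths[asn] = candidate
                    changed = True
    return paths


islands = {
    "IslandA": ["IslandB"],
    "IslandB": ["IslandA", "IslandC"],
    "IslandC": ["IslandB", "IslandD"],
    "IslandD": ["IslandC"],
}
maps = share_maps(islands, origin="IslandD")
print(maps["IslandA"])  # ['IslandA', 'IslandB', 'IslandC', 'IslandD']
```

Notice that IslandA never talks to IslandD directly. It only learns the way there because each neighbor passes along what it knows.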

Using maps as a framework, it’s easy to imagine how things can go wrong. Back when consumers first got access to GPS, there were always jokes about it telling you to drive off a cliff or into the middle of a desert. The same can happen with BGP – if someone makes a mistake, it can end up routing traffic somewhere it isn’t meant to go, causing problems. If not caught, that mistake can end up on everyone’s map. There are other ways this can go wrong, but we’ll get to those in a bit.

Yes, yes, maps. Give me an example.

Sure! This is massively simplified, but imagine you want to connect to a fictional tech news website called Convergence. Convergence uses NetSend as its ISP, and you use DecadeConnect. In this example, DecadeConnect and NetSend cannot talk to each other directly, but your ISP can talk to Border Communications, which can talk to Forms, which can talk to NetSend. If this is the only route, then BGP will make sure you and Convergence can communicate through it. But if, alternatively, both DecadeConnect and NetSend were connected to ThirdLevel, BGP may choose to route your traffic through it, as that path has fewer hops.
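If you want to poke at that example yourself, here is a rough sketch in Python. It uses the fictional network names from above (with the third network written as ThirdLevel) and a plain breadth-first search to find the path with the fewest hops. That is a simplification for illustration: real BGP routers don't search a global graph, they compare the paths their neighbors advertise.

```python
from collections import deque


def shortest_as_path(links, start, goal):
    """Return the path with the fewest hops between two networks, or None."""
    # Build an undirected graph from (network, network) pairs.
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        # Sorting only keeps the result deterministic between runs.
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


long_way = [
    ("DecadeConnect", "Border Communications"),
    ("Border Communications", "Forms"),
    ("Forms", "NetSend"),
]
print(shortest_as_path(long_way, "DecadeConnect", "NetSend"))
# ['DecadeConnect', 'Border Communications', 'Forms', 'NetSend']

with_shortcut = long_way + [
    ("DecadeConnect", "ThirdLevel"),
    ("ThirdLevel", "NetSend"),
]
print(shortest_as_path(with_shortcut, "DecadeConnect", "NetSend"))
# ['DecadeConnect', 'ThirdLevel', 'NetSend']
```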

OK, so BGP is like those maps detailing all the fastest ways you can get to a website?

Correct! Unfortunately, it can get even more complicated, because the shortest route does not always equal the best one. There are many reasons why a routing algorithm would choose one path over another – cost can also be a factor, with some networks charging others if they want to include them in their routes.
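Here is a small, hedged sketch of that idea in Python. Real BGP does let a network attach a local preference to a route and checks it before looking at path length, but the full decision process has many more steps, and the network names and numbers below are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    as_path: tuple         # networks the traffic would cross, in order
    local_pref: int = 100  # arbitrary default; higher means "we prefer this"


def best_route(routes):
    """Pick the route with the highest preference, then the shortest path."""
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path)))


# Hypothetical policy: the short path goes through a network that charges us,
# so we mark it as less preferred than the longer, free path.
short_but_pricey = Route(as_path=("TransitCo", "NetSend"), local_pref=50)
long_but_free = Route(as_path=("PeerOne", "PeerTwo", "NetSend"), local_pref=200)

chosen = best_route([short_but_pricey, long_but_free])
print(chosen.as_path)  # ('PeerOne', 'PeerTwo', 'NetSend')
```

When the preferences are equal, the same function falls back to picking the shorter path, which matches the simple "fewest hops" picture.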

Mapping unchanging roads is difficult; imagine mapping the Internet

Plus, maps are super tricky! I discovered this recently while trying to plan a trip, where roads were present on one map and not on another, or varied between maps. One road even had three different names across three maps. If it’s that hard to map a “town” that has all of five roads, imagine what it’s like trying to tie the whole Internet together. Actual roads don’t change often, but websites can move from country to country, or service providers can change, be added, or disappear, and the Internet just has to deal with it.

I remember something like this from my Algorithms and Data Structures class – trying to build an algo to find the shortest path.

I’ll take your word for it. I noped out when I heard about graphs.

But Facebook didn’t! In fact, it has built its own BGP system, which lets it perform “rapidly incremental updates”, according to a paper presented earlier this year. That said, the system the company describes is for communication inside its data centers. At this point, it’s hard to say what caused Facebook’s problems on Monday, and it will take someone smarter than me to say whether Facebook’s data center communications could cause an issue like this. Cybersecurity reporter Brian Krebs claims that the outage was caused by “regular BGP updates”.

Facebook’s engineering update states that the issue was caused by “a configuration change on the backbone routers that coordinate network traffic between our data centers.” This was followed by a “cascading effect on the way [Facebook’s] data centers communicate, bringing [its] services to a halt.” At least to my eyes, that reads as though the problem was with communication within Facebook, not with the outside world (though this apparently could still have caused a worldwide outage, given how much of its own network Facebook controls).

What does DNS have to do with all this?

To borrow an explanation from Cloudflare: DNS tells you where you’re going, and BGP tells you how to get there. DNS is how computers know what IP address a website or other resource can be found at, but that knowledge on its own isn’t enough – if you ask your friend for their home address, you will probably still need GPS to get there.
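The split between "where" and "how" can be sketched in a few lines of Python. Everything here is hypothetical: the hostname, the address (taken from a range reserved for documentation), and the network names. A dictionary stands in for DNS and another for the routes learned over BGP.

```python
import ipaddress

# Stand-in for DNS: hostname -> IP address ("where you're going").
DNS_RECORDS = {"example.test": "203.0.113.10"}

# Stand-in for routes learned via BGP: address block -> path ("how to get there").
ROUTES = {
    ipaddress.ip_network("203.0.113.0/24"): ["YourISP", "TransitNet", "SiteISP"],
}


def resolve(hostname):
    """Look up the address for a hostname, or None if it is unknown."""
    return DNS_RECORDS.get(hostname)


def find_route(address, routes):
    """Return the path for the most specific block containing the address."""
    ip = ipaddress.ip_address(address)
    matches = [net for net in routes if ip in net]
    if not matches:
        return None
    return routes[max(matches, key=lambda net: net.prefixlen)]


address = resolve("example.test")
print(address)                      # 203.0.113.10
print(find_route(address, ROUTES))  # ['YourISP', 'TransitNet', 'SiteISP']
print(find_route(address, {}))      # None: we know the address, but not the way
```

The last line is the interesting one: knowing the address does you no good if nobody has a route to it.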

Cloudflare also has a great technical description of how BGP errors can mess up DNS requests too – the article is specifically about Monday’s Facebook incident, so if you’re looking for an explanation of what it looked like from an autonomous system’s perspective, it’s worth a read.

What can go wrong with BGP?

Many things. According to Cloudflare, two notable incidents include a Turkish ISP mistakenly telling the entire Internet to route traffic to its service in 2004 and a Pakistani ISP accidentally blocking YouTube around the world while trying to do so only for its own users. Because of BGP’s ability to spread from autonomous system to autonomous system (which, as a reminder, is one of the things that makes it so useful), one group’s mistake can cascade.

BGP is sometimes called the duct tape of the Internet.

One group getting hacked can also cause problems: in 2018, hackers were able to hijack Amazon’s DNS requests and steal thousands of dollars in Ethereum by compromising a different ISP’s BGP server. Amazon wasn’t hacked, but traffic meant for it ended up elsewhere.

Or, you can just slip up and remove your entire service from the Internet with a bad BGP update. BGP is affectionately called the duct tape of the Internet, but no adhesive is perfect.

It looks like Facebook’s servers, for some reason, told everyone to remove them from their maps. Facebook has released a preliminary report, but it is light on details – it is possible that Facebook plans to release a more in-depth explanation later, saying why the changes were made, but this may also be the last we hear about it (at least officially).

However, Cloudflare’s CTO reports that the service saw a ton of BGP updates from Facebook (most of which were route withdrawals, or the erasing of lines on the map leading to Facebook) just before it went dark. A tech lead from Fastly tweeted that Facebook stopped providing routes to Fastly when it went offline, and KrebsOnSecurity supports the idea that it was some update to Facebook’s BGP that knocked down its services.
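Erasing lines from the map can be pictured with a tiny Python sketch. Real BGP update messages can both announce routes and withdraw them, but the table layout, address blocks (documentation ranges), and network names here are made up for illustration, and this is not a reconstruction of what Facebook's routers actually sent.

```python
def apply_update(table, update):
    """Return a new routing table after applying one update message."""
    new_table = dict(table)
    # Withdrawals erase lines from the map.
    for prefix in update.get("withdraw", []):
        new_table.pop(prefix, None)
    # Announcements draw new lines, or redraw existing ones.
    for prefix, path in update.get("announce", {}).items():
        new_table[prefix] = path
    return new_table


table = {
    "203.0.113.0/24": ["TransitNet", "BigSocial"],
    "198.51.100.0/24": ["TransitNet", "BigSocial"],
    "192.0.2.0/24": ["TransitNet", "OtherSite"],
}

# A hypothetical update that withdraws every route leading to BigSocial.
after = apply_update(table, {"withdraw": ["203.0.113.0/24", "198.51.100.0/24"]})
print(after)  # {'192.0.2.0/24': ['TransitNet', 'OtherSite']}
```

Once the routes are gone, the rest of the Internet has no path to those addresses, even though the servers behind them may be running fine.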

If you want fine-grained technical details, I recommend Cloudflare’s explanation.

If BGP was the problem, how does Facebook fix it?

Given that the outage lasted for hours, the answer appears to be “not easily.” Facebook needed to make sure it was advertising the right records and that those records were picked up by the Internet at large. In other words, it needed to make sure its maps were accurate and that everyone could see them.

However, that is easier said than done. There were reports of Facebook employees being locked out by badge-protected doors and of employees struggling to communicate. In situations like this, you need to figure out not only who has the knowledge to solve the problem and who has permission to solve it, but also how to connect those people. And when your entire company is down, that’s no easy task – The Verge received reports of engineers being physically sent to a Facebook data center in California to try to fix the problem.

Will Web3 solve this problem?

Stop it. I will cry.

But to quickly answer the question, probably not – even if Facebook had gotten on the decentralized train, there would still have to be some protocol in place that tells you where to find its resources. We’ve seen before that it’s possible to misconfigure or mess up blockchain contracts, so I’d be a little skeptical of anyone who said that a contract- and blockchain-based internet would be immune to this sort of issue.

Sure was suspicious timing on that outage, what with all the bad Facebook news, huh?

Okay, so admittedly, with all of this happening just as a whistleblower was going on TV to air Facebook’s dirty laundry, it’s really easy to come up with alternative explanations. But it’s just as possible that this was an innocent mistake made by someone (very, very unlucky) on Facebook’s IT staff.

For what it’s worth, that’s Facebook’s explanation. It blames the outage on a “faulty configuration change”, not on any devious hack.

Update October 4th, 10:44PM ET: Updated with info from Facebook’s official engineering post.