
Hole punching in the wild


Formal Metadata

Title
Hole punching in the wild
Subtitle
Learnings from running libp2p hole punching in production, measured from vantage points across the globe.
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
At FOSDEM 2022 I presented libp2p's hole punching mechanism, which overcomes NATs and firewalls with no dependency on central infrastructure. One year has passed since. We rolled it out to live networks. We launched a large measurement campaign with many volunteers deploying vantage points in their home networks, punching holes across the globe. In this talk I will give an overview of the largest hack of the internet (aka hole punching), dive into learnings from running it on IPFS (~50,000 nodes) and finally present the data collected through our measurement campaign. If you have always wondered how hole punching works, how much more successful UDP is than TCP, whether IPv4 or IPv6 makes a difference, which ISP is friendliest to p2p, and how to overcome symmetric NATs, join us for the talk!
Transcript: English (auto-generated)
Hello, everyone. Thanks for joining today. Welcome to our talk on hole punching in the wild. Sometimes I would say we're going to talk about the biggest hack of the Internet, which I would refer to as hole punching.
We want to talk a bit about learnings from doing hole punching on larger networks. Some might remember me from last year at FOSDEM, where I introduced our way of doing hole punching, and today we're coming here with a bunch of data.
So who are we? Dennis, do you want to introduce yourself? Yeah, okay. My name is Dennis. I'm working at Protocol Labs as a research engineer at a team called ProbeLab, and I'm mainly focusing on network measurements and protocol optimizations that come out of these measurements. And yeah, I was working with Max on this hole punching campaign.
Very cool. And Max, again, software engineer. Yeah, you can find us anywhere there online if you want. Yeah, happy to communicate online further after the talk, and we're also around at the venue.
Wonderful. Okay, what we're doing today. I want to do a very quick intro to libp2p, a peer-to-peer networking library, but then dive right into the problem of why firewalls and NATs are rather hard for peer-to-peer networking; the solution, which in some cases is hole punching; and then how libp2p does all that.
And then we have been running a large measurement campaign on the internet in the wild, collecting data how well hole punching works out there. And we're going to present those findings and then kind of have takeaways,
what we learned from there and where we're going from there. All right, libp2p, just a quick introduction. It's a peer-to-peer networking library. It's an open source project. There is one specification, and then there are many implementations of that specification,
in many languages: Go, JS, Rust, Nim, C++, Java — many, many out there. Cool. It provides, I would say, two levels. On the low level, it provides all kinds of different connectivity options. It takes care of the encryption and authentication, here meaning mutual authentication.
And then things like hole punching, for example. And once you have these low-level features of being able to connect to anyone out there in an encrypted and authenticated way, you can then build higher-level protocols on top of that, which libp2p also provides, like a DHT (distributed hash table) or gossiping protocols and things like that.
And my big statement always about libp2p is that it's all you need to build your peer-to-peer application. All right, so to zoom out a little bit, that's libp2p. All the things that we're talking about today are implemented in libp2p,
but that doesn't mean you can't implement them in any other networking library if you want to. Our great motivation for libp2p and in general for peer-to-peer networking is that we have full connectivity among all the nodes within the network, to the best of their capabilities, obviously.
And in this talk, we're going to focus on the problem of NATs and firewalls for peer-to-peer networking. Now, before all of you yell, I'm not saying let's get rid of firewalls. Please let's not do that. They have a very important purpose, but in some cases, we want to get around them.
Okay, cool. Yeah, I'm here in the network dev room, so I'm not going to explain what NATs (Network Address Translators) and firewalls are. But we will go a little bit into what they mean for hole punching. In general, NATs and firewalls are the big obstacles that we have to get around.
Okay, what is the problem in some fancy pictures? A wants to send a packet to B, whether that's a TCP SYN or anything, right?
And A and B are both behind their home routers. Just imagine two laptops in two different houses. And they want to communicate directly with each other. So A sends a packet to B. It crosses A's router. A's router sets a five tuple in its routing table for that packet.
And the packet makes it to B. And obviously, a very good thing is that B drops that packet because it's a packet that it has no clue where it's coming from, probably some wider internet, and it might be an attack. So it's dropping it. It doesn't have any five tuple in its routing table, right?
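The router behaviour just described can be modelled in a few lines. This is a toy sketch, nothing libp2p-specific, and the addresses are made up: an outbound packet installs a five-tuple in the router's table, and an inbound packet only passes if it matches the reverse of a tuple already in that table.

```python
class Router:
    """Minimal model of a stateful NAT/firewall table."""

    def __init__(self):
        self.table = set()  # five-tuples seen on outbound traffic

    def outbound(self, src, sport, dst, dport, proto="tcp"):
        # Punches a "hole": remember the flow so replies can come back in.
        self.table.add((proto, src, sport, dst, dport))

    def inbound(self, src, sport, dst, dport, proto="tcp"):
        # Forward only if this is the reverse direction of a known flow.
        return (proto, dst, dport, src, sport) in self.table


router_a = Router()

# A -> B: A's router records the flow on the way out...
router_a.outbound("10.0.0.2", 4001, "198.51.100.7", 4001)

# ...so B's reply passes A's router:
assert router_a.inbound("198.51.100.7", 4001, "10.0.0.2", 4001)

# But an unsolicited packet from some other host C is dropped:
assert not router_a.inbound("203.0.113.9", 4001, "10.0.0.2", 4001)
```

Without the `outbound` call, B's packet has no matching tuple and is dropped — exactly the situation described above.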
Okay, so that is the problem. And we somehow want to make A and B communicate with each other. So the solution here, in some cases, it's hole punching. Again, we want A and B to connect to each other.
Instead of only having A send a packet to B, we have both of them send a packet at the same time. I'm talking in a little bit about what at the same time means, but let's just for now say we have some magic synchronization mechanism. So A sends a packet to B. B sends a packet to A. The packet from A punches a hole in its routing table,
so adding a five tuple for it. The packet from B punches a hole in its routing table on its side. The packets cross somewhere in the internet. Obviously, they don't, but it's a nice metaphor.
And at some point, packet B arrives at router A. Router A checks its routing table. A little bit simplified here. It lets packet B pass, same on router B. And this way, we actually exchange packets.
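For UDP, the "both sides fire at once" exchange can be shown with plain sockets. On loopback there is of course no NAT in the way, so this only illustrates the packet exchange itself; through real NATs, each side's outgoing datagram is what opens the mapping that lets the other side's datagram in.

```python
import socket

# Both peers bind a UDP port and fire a datagram at each other.
a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
a.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
b.bind(("127.0.0.1", 0))
a.settimeout(2)
b.settimeout(2)
addr_a, addr_b = a.getsockname(), b.getsockname()

# The simultaneous "punch" packets, one in each direction.
a.sendto(b"punch-from-A", addr_b)
b.sendto(b"punch-from-B", addr_a)

# Both datagrams arrive; behind NATs this works because each side's
# outgoing packet punched the hole for the incoming one.
msg_at_b, _ = b.recvfrom(64)
msg_at_a, _ = a.recvfrom(64)
assert msg_at_b == b"punch-from-A" and msg_at_a == b"punch-from-B"
a.close()
b.close()
```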
Cool. So now, the big problem is how does A and B know when to send those packets? It has to happen at the same time, at least for TCP. We might go a little bit into what that means for UDP, but at least for TCP, this needs to happen at the same time for TCP simultaneous open to happen in the end.
So how do we do that? This is libp2p-specific, but it doesn't need to be libp2p; you can use any signaling protocol on top. Let's say A and B want to connect, and they need to hole punch at the same time, right? They need to send those two packets from both sides at the same time,
so one can go through the hole of the other to the other side. What do we do? We need some kind of coordination mechanism, so some kind of public server out there that is not behind a firewall or a NAT. B connects to the relay.
A learns B's address through the relay. A connects through the relay, so now the two, A and B, have a communication channel over the relay. B sends a message to A. You can just think of it as like a time synchronization protocol.
At the same time, while sending that message, it measures the time it takes for A to send a message back. At this time, we know the round-trip time. Once we know the round-trip time, B sends another message to A
and waits exactly half the round-trip time. Once A receives that Sync message — you can do the math — if both of them now start, A when it receives the packet and B after half the round-trip time, they actually do the hole punch.
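That timing argument can be checked with a tiny simulation. Assuming a symmetric one-way delay, which the protocol implicitly does, A starting on receipt of the Sync and B starting after half the measured RTT fire at the same instant (numbers here are illustrative):

```python
def punch_start_times(t0, one_way_delay):
    """When does each side start punching, if B sends its Sync at t0?"""
    rtt = 2 * one_way_delay        # what B measured over the relay
    a_starts = t0 + one_way_delay  # A punches as soon as the Sync arrives
    b_starts = t0 + rtt / 2        # B punches after waiting half the RTT
    return a_starts, b_starts


a_starts, b_starts = punch_start_times(t0=100.0, one_way_delay=0.025)
assert a_starts == b_starts  # both fire at t0 + 25 ms
```

If the delay is not symmetric (as in the VPN case discussed later), the two start times drift apart and the synchronization degrades.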
They exchange the packets. They cross somewhere in the internet. Both of them punch the hole into their routers, and ta-da, we succeeded. We have a hole punch. We have a connection established. Cool. Okay, a little bit in terms of timeline on all of this.
Hole punching is nothing new. It's definitely nothing that libp2p invented, not at all. The most obvious mention I know is in RFC 5128, but again, it predates that for sure; I think it's a nice introduction to hole punching in general.
In case you like reading IETF documents. Since then, we have been implementing it, around 2021-22, building on a lot of past knowledge around that. I presented this work at FOSDEM 2022 last year remotely,
and since then, we have rolled it out on a larger network, which is the IPFS network, in a two-phase way where all public nodes act as relay nodes, very limited relays, and then in a second phase, all the clients gained the hole punching capabilities,
and now on this large peer-to-peer network, we actually have on the one hand the public nodes relaying for the signaling, and then the clients actually being able to do the hole punching work. We have this deployed now in this large network, but it's very hard to know how it's working,
especially across the internet, across all the networks, across all the different endpoints, across all the routing hardware, and so on. That's why we launched the hole punching month, which is kind of like a measurement campaign, which Dennis now is going to introduce. Can you hear me? Yes.
All right. Thanks, Max. Yeah, as Max said, the libp2p folks conceived this new DCUtR protocol and then deployed it to the network, and now we want to know how well it actually works. And for this, we launched, as Max said, a measurement campaign during December. We'll get to this in a second. But how do we actually measure these hole punching success rates?
And the challenge here is that we don't actually know which clients are DCUtR-capable. So where are the clients that we want to hole punch? Because they are behind NATs, we cannot enumerate them. They don't register themselves in a central registry or anything.
So we conceived this three-component architecture, and the crucial component here is probably the honeypot, which is just a DHT server node that interacts with, as Max said, the IPFS network. It's a very stable node, which means it gets added to the routing tables of different peers in the network,
and this increases the chances that peers behind NATs interacting with the IPFS network come across this honeypot. The peer behind a NAT in this diagram is in the top right corner, some DCUtR-capable peer. This one, by chance, by interacting with the network, comes across the honeypot, and the honeypot then keeps track of those peers
and writes it into a database. And then this database is interfaced by a server component that serves those identified and detected peers to a fleet of clients, and the hole punch measurement campaign consisted of a deployment of those clients to a wide variety of different laptops or users that agreed to run these kinds of clients.
And this client then queries the server for a peer to hole punch. As Max said, it connects to the other peer through a relay node, then exchanges that couple of packets and tries to establish a direct connection,
and then at the end it reports back whether it worked, whether it didn't work, what went wrong, and so on, so we can probe the whole network — or, like, many, many clients and many network configurations. So we did this measurement campaign. We made some fuss about it during November internally at Protocol Labs,
and also reached out to the community, and starting from the beginning of December, we said, okay, please download these clients, run it on your machines, and let's try to gather as much data as possible during that time.
And as you can see here, so we collected around 6.25 million hole punch results. So this is quite a lot of data from 154 clients that participated, and we punched around 47,000 unique peers in this network. And on the right-hand side, you can see the deployment of our clients,
of our controlled clients. So the color here is the number of contributed results. So the U.S. was dominant here, but we have many other nodes deployed in Europe, but also Australia, New Zealand, and also South America, and also one client from the continent of Africa. And these clients interacted with these other peers
that are basically all around the world. So we could measure hole punch success rates all across the globe, and I think we have a very comprehensive data set here. And so we now gathered the data,
and at the beginning of January, I said, okay, the hole punching month is over, and I started to analyze the data a little bit. And what we can see here on the x-axis is each bar is a unique client, and on the y-axis we can see these different outcomes.
So each hole punch result, as I said, so the clients report back these results, and each result can have a different outcome. These outcomes are at the top, so it can be successful. So we actually were able to establish a direct connection through hole punching. Then connection reversed. This means I'm trying to hole punch,
so I'm connecting to the other peer through the relay, and the first thing before we do the hole punching dance is for the peer to directly connect to us, because if we are directly reachable because we have a port mapping in place in the router, we don't actually need to do the hole punching exchange. This is the connection reversed, and as we can see here, it's a little hard to see,
but some clients actually have a lot of these results, so this means they have a unique router configuration in place. Then failed is the obvious thing, so we tried, we exchanged these messages, but in the end weren't able to establish a connection. No stream is some internal error that's unique to our setup,
so probably nothing to worry about here, and no connection means we try to connect through the other peer through a relay, but the other peer was already gone. It's a permissionless peer-to-peer network, so it could be from the time that the honeypot detected the peer to the client trying to establish a connection to the peer
that the client has already churned and left the network. Actually, looking at these clients is a distorted view on the data because we allowed everyone who participated in the campaign to freely move around. I was running this client in my laptop, and I was moving from a coffee shop Wi-Fi network to a home network
to a university network and so on, and hole punching actually depends on those network configurations, not just on me running the client. The challenge here with the data analysis — I'm not done with it yet, and I'm happy to hear suggestions — was to detect these individual networks that the clients operated in.
With each hole punch result, the client reported their listening IP addresses and so on, and I grouped them together to actually find out, to identify unique networks that those clients operated in, and at the end, I arrived at 342 unique client networks, and then the graph looks like this,
probably not much different than before, but also there are some interesting unique network outcomes here that I will also get to in a bit. The most interesting graph is probably this one, so what's the success rate of this protocol? On the x-axis, we have the success rate binned by,
yeah, just a 5% binning — and on the y-axis, the number of networks that had that success rate when probing the rest of the network. The majority of networks actually had a success rate of around 70%, and I actually think that's amazing,
because from not being able to connect at all to having a 70% chance to establish a direct connection without an intermediary is actually pretty great, but then also there are some networks that have very low success rate, and these are the ones that are probably the most interesting ones.
Then also the IP and transport dependence is also quite interesting as an angle to look at the data. Here we can see the top row we used IPv4 and TCP to hole punch, so when these clients exchange these connect messages, they actually exchange the publicly reachable IP addresses
of those two peers that want to hole punch, and in our measurement campaign, we restricted this to actually only IPv4 and TCP and with some other hole punches only to IPv6 and QUIC, which is on the bottom right, and so we can take a look which combination is more successful than the other,
and here we can see that on IPv4, TCP and QUIC actually have a similar success rate if you average the numbers, but on IPv6, it's basically not working at all, and these unexpected things are actually the interesting ones for us. Either it's a measurement error
or there's some inherent property to the networking setup that prevents IPv6 from being hole punchable basically. If we actually allow both transports, so in the previous graph, we showed we were only using TCP and QUIC,
but if we allow both transports to try to hole punch simultaneously, we can see that 81% of the time we end up with a QUIC connection, and this is just because QUIC connection establishment is way faster than TCP's, so this is an expected result here, just to verify some of the data.
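Racing two dial attempts and keeping whichever completes first can be sketched like this. The handshake durations are made up; real code would dial actual TCP and QUIC endpoints:

```python
import asyncio


async def dial(transport, handshake_time):
    # Stand-in for a real dial attempt; the sleep models handshake duration.
    await asyncio.sleep(handshake_time)
    return transport


async def race():
    tasks = [
        asyncio.create_task(dial("quic", 0.01)),  # faster handshake
        asyncio.create_task(dial("tcp", 0.05)),   # slower handshake
    ]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # drop the slower connection once one has won
    return done.pop().result()


winner = asyncio.run(race())
assert winner == "quic"
```

With a faster handshake, the QUIC attempt wins nearly every race, which matches the 81% figure above.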
And now, two takeaways for us for protocol improvements. So with a private VPN — if clients are running in VPNs — we can see that the success rate actually drops significantly, from around 70% to less than 40%, and my hypothesis here is that the round-trip time that Max showed previously is measured between A and B,
but what we actually need is the round-trip time between router A and router B, and if your VPN gateway is effectively the exit node you're connected through, those can differ by dozens of milliseconds, so the round-trip time doesn't add up and the whole synchronization is a little off,
so this is potentially a protocol improvement here. And then, also interesting, so Max said they are exchanging these messages, doing the hole punch, but actually we tried this three times, so if it doesn't work the first time, we try it again, and if it doesn't work the second time, we try it yet again,
but when we look at the data, if we end up with a successful hole punch connection, it was actually successful with the first attempt in 97% or 98% of the cases, so this is also something for the next steps for us.
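Putting the two quoted figures together — roughly 70% overall success, and 97-98% of successes happening on the first of the three attempts — shows how little the identical retries contribute (illustrative arithmetic using only the numbers cited in the talk):

```python
# Figures quoted in the talk.
overall_success = 0.70       # share of hole punch attempts that succeed
first_attempt_share = 0.97   # share of successes won on the first try

# Success probability contributed by the first attempt alone...
p_first = overall_success * first_attempt_share
# ...and the extra probability that attempts two and three add together.
p_retries = overall_success * (1 - first_attempt_share)

assert round(p_first, 3) == 0.679
assert round(p_retries, 3) == 0.021  # retries buy only ~2 percentage points
```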
We should consider changing our strategy on the second and third try to increase the odds, so if we stick with the three retries, we shouldn't do the same thing over again because, as we saw from the data, it doesn't make a difference, so we should change our strategy here, and so one thing would be to reverse the client server roles
in this quick hole punching exchange. This would be something, and also the other protocol improvement for us, as I said, would be to change the measurement of the round trip time. For the future, the data analysis,
right now what I showed here is basically aggregates across all the data, and the interesting part is basically, so why is a specific client or a specific network, why has it a less or a worse success rate than others? These are these individual things to look into to increase, maybe there's a common pattern
that you can address with a protocol to increase the success rate, and then identify those causes. Also, at the end of all of this, we want to craft a follow-up publication to something that Max and some fellow friends, I would say, have published just last year.
We want to make the dataset public, and so on and so forth, for others to benefit from the data and do their own analysis. With that: get involved, talk to us here at the venue about all of this. libp2p is a great project. Have a look at all these links. Get in touch and contribute.
Join our community calls. Yeah, I think that's it. Thank you very much. Questions? This thing you implemented there — is it exactly ICE, STUN, and TURN,
or how different is it from that? We differ in some cases. It's definitely very much motivated by ICE and TURN. A couple of things: we don't do TURN itself. We have our own relay protocol, because nodes in the network
act for the public as relay nodes. The problem is you don't want to relay any traffic for anyone, but you want to make this really restricted in terms of how much traffic, how long. If you run a public node, you don't want to be the next relay node for everyone out there. And then what we built here
is very much TCP specific, but it also works well with UDP. We need the synchronization, and as far as I know, at least the WebRTC stack is very focused on UDP, where timing doesn't matter as much. So you saw the timing protocol, right? And that is very TCP specific,
where we want a TCP simultaneous connect, which allows two SYNs to actually result in a single TCP connection. Yeah. This is for your analysis — I guess a lot of this depends
on the default configurations of the firewall. Did you find out which types of firewalls or configurations stop hole punching in your research? So, yeah. Not in its entirety, but what we did is: people that signed up for the measurement campaign
gave us information about their networks, so if we find something fishy in the data, we can also reach out to them and ask what's up with the firewall setup in their specific network. We also gathered data about port mappings that are in place. What the libp2p host tries to do is establish a port mapping inside your router,
and this is also reported back. And what we also did is try to query the login page from these routers and get some information about what kind of firewall router actually was preventing you
from connecting to someone else. So these are the data points that we have to get some conclusions around this, but more than this, we don't have. But I think this is already pretty conclusive to a wide variety of analysis.
What I was just wondering about is: do you have any data on how many clients actually were behind a NAT? So all the clients that the honeypot detected were clients behind a NAT. These are all libp2p hosts, and with the default configuration of a libp2p host,
if they only announce relay addresses, this means they must not be publicly reachable, which for us is equivalent to being behind a NAT. So yeah, it should be. There's probably some error there, but yeah. So then all of the IPv6 hosts you were trying to connect to were also behind a NAT,
a kind of IPv6 NAT. Yes, yes. And this is the interesting thing. I cannot explain this yet. Maybe it's a measurement error on our side. Maybe it's, as I said, some inherent property of something. Maybe it's a protocol error. I don't know. And this is the interesting stuff in these kinds of things. Thanks. I'm very curious.
I was wondering, does it also work with multiple NATs? Can you hole punch through two NATs?
So another friend of mine whom I convinced to run these clients was actually running behind two NATs, and it was working. Microphone. But I'm not sure how many people actually ran behind two NATs. In theory — yeah, maybe Max, you can explain this one a little bit. So right now we don't really have a lot of data about double NATs, and we also don't have the data on —
what is it called, hairpinning? I don't quite know — where you're within the same network, but you don't know that you're next to each other, and you actually want to hole punch through your own NAT even though you can't connect to each other directly. So there are some challenges. Do we still have time for another question? Yeah, yeah.
Sorry. So you said that for UDP it should work similarly. Did you do any experiments with that? Because in the past we had a custom UDP hole punching thing
and the routers were pretty brain-dead; they forgot the mapping within 20 seconds or something. Yeah, so we ran this measurement campaign on TCP and QUIC, and QUIC in the end is just UDP. And what we do is something similar to STUN in the ICE suite, where we continuously try to keep our mapping up,
and then on NATs that do endpoint independent mappings, that actually helps. So as long as we keep that up for, I don't know, every 10 seconds or so, then our mapping survives even on UDP. Okay, cool. Thank you very much.