Sneaking In Network Security


Formal Metadata

Title
Sneaking In Network Security
Subtitle
Enforcing strong network segmentation, without anyone noticing
License
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Highly compartmentalized network segmentation is a long-held goal of most blue teams, but it's notoriously hard to deploy once a system has already been built. We leveraged an existing service discovery framework to deploy a large-scale TLS-based segmentation model that enforces access control while automatically learning authorization rules and staying out of the way of developers. We also did it without scheduling downtime or putting a halt to development. This talk covers how we engineered this, and shares lessons learned throughout the process.
Transcript: English (auto-generated)
Okay, our next talk is sneaking in network security.
Our speaker Max is going to tell us how to scale up defense for computer networks, and in particular how to integrate that in existing networks, okay?
Max here is a former pen tester and now a blue team member. Please welcome him with a huge round of applause. Thank you. Hi everyone, my name is Max Burkhardt.
I'm here to tell you today about sneaking in network security, how I and a small team of other security engineers managed to implement a strong network segmentation model in an already running, high-scale, large network. I'm a security engineer at Airbnb, and so the sort of practical experience of this project occurred in that network.
However, I think that the techniques we'll go over here today will apply to many other networks, and so I'll spend some time talking about the technical theory behind this approach, as well as what happened when we rolled it out in an attempt to give you some good evidence
and experience to run this in your own environment. So, let's talk about network security in 2018. Segmentation continues to be a really good idea, because we all know that compromises are going to happen. Those boxes are going to be popped, whether it's a zero day or something less fancy,
like, you know, somebody forgot to patch a server. And network segmentation gives you the controls to be able to keep those compromises contained, to make sure that low security systems can't pivot into higher security zones, and help your incident response teams keep incidents localized.
However, if you've ever been involved in network pentesting, you'll know that a well-segmented network is a rare thing to see, and I think we know why this happens. As networks grow quickly, small security teams, especially ones that something like a startup like Airbnb was,
find themselves having to prioritize their work where it is the most impactful, and that usually ends up being the perimeter, the internet facing hosts. And so, as a network grows quickly, you end up with a large network that has this sort of hard shell, soft center architecture, where the external perimeter may be hardened,
but once an attacker is able to compromise that, they may have relatively free rein inside the rest of it. And this obviously isn't something that we want; ask any blue team member and they'll tell you that this is a bad place to be. But change is hard, and especially with a pretty large network.
So to give you an idea of the scale that this project dealt with, earlier this year when we were implementing this, Airbnb's production network had about 2,500 services, about 20,000 nodes, and I define a node to be something that's sort of like a host, whether it's an instance running in EC2 or a Kubernetes pod,
and over 1,000 engineers who are doing hundreds of production deploys per day. So things are moving really fast, and it's hard to go in and build in these large architectural changes, like adding segmentation. Furthermore, because of this sort of highly serviceified architecture,
there was a lot of complex interconnectivity between these things. So determining where the zones should be was difficult in itself. Finally, developer productivity is a really big concern for us, and especially to my managers and their managers. If you have over 1,000 engineers writing code every day,
if you slow them all down by 5% or 10%, that's actually a really expensive thing to do, and it's not something that's going to fly. So the question became, how do we go from a soft center network to something that has good segmentation and has the security properties we want? And we're not allowed to stop development. We can't start over.
We've got to be able to build a security in as the network is running. So we hear a lot, especially in the pentesting, offensive community, about trying to be like a ninja, right? Get into the network, do stuff without anyone noticing. I'll argue that it's also just as important in defensive security. We need to be defensive security ninjas
and be able to sneak in, put in the defenses, and have nobody know we were there. So what's the theory that we're going to be applying in this approach? We need to stop thinking about security as this layer around development, as another step in the waterfall model.
This is maybe what we were thinking about 30 years ago, so that you'd build an application and then you'd do security testing and then you'd ship it to production. But it just hasn't really held up anymore. So there's been a lot of smart people talking about the new way to do things. Agile security, dev sec ops, sec dev ops, people can't decide. This whole concept of really unifying security operations
and software engineering so that you're building a secure thing all the way through. And this certainly isn't something that we invented. Many people have been working on it. But I've found that most of the time people think about this concept when in the terms of application development. And I think it's time that we integrate this with network security as well.
I think the important thing here is scale. We need to build a security solution that scales with development. There's this saying that it's good to hire lazy engineers and developers because they're going to build things that sort of scale up
and don't require a lot of manual work. That's even more important for security engineers. You're never going to outwork the attackers. And so you need to build something that's going to scale along with your engineering group. So we're good project managers, so we're going to lay out the requirements for this solution before we jump into how it actually works.
Whatever we build needs to stay out of the way of engineers. It may be something they're aware of, but the farther we can keep it out of their scope, the better. So they can just keep writing applications that make the company money or accomplish your organization's goals, and their stuff ends up being secure.
Security by default is, of course, something that we have been chasing for a long time, but I think that we can also go further than that and say, beyond being secure by default, it should actually be hard to have an insecure configuration with this system. So we'll try and design things in that manner. And finally, we want to build something that,
as much as possible, is flexible to whatever sort of network or protocols you are using. You don't really ever know what's going to be coming six months down the road. When this was being worked on, Airbnb was mostly a Linux on Amazon shop, but I don't know what's going to happen in the next six months. We might acquire a Haskell on Azure company
and try and integrate that, or we'll start going to on-prem data centers. I have no idea what's going to be in the future. So we want to build a solution that's going to be as agnostic as possible to those sorts of decisions. So my next slide is basically the whole solution. I tried to condense it into two sentences.
We're going to use mutual TLS built into the service discovery system for authentication and confidentiality across all service communications, and we're going to discover those access lists totally automatically for security with zero to almost zero configuration. This is a lot of jargon on a single slide,
so I don't expect you to kind of visualize it yet. We'll dive into each of these parts, and I'll show you how they fuse together to build a system that is invisible and secure. So to start off, I've sort of isolated three pillars of this approach. The first is TLS in service discovery.
So we love TLS. It's one of the really powerful protocols that the security industry has managed to build, and it gives us great security properties if we can use it everywhere. So the first pillar is: get everything to be using TLS, and by building it into service discovery,
make sure that it runs everywhere without a lot of per-app configuration. Pillar two is binding identity to nodes. So in a more traditional network segmentation model, you might define subnets or restrict things by IP address. We're going to be able to be a little more flexible with how we refer to individual nodes in this network
because we're using TLS as an authenticator, and therefore can sort of define our own concepts of identity, and I'll get into that soon. Finally, we're going to generate an authorization map. So by automatically determining what services need to talk to what and figuring out how data flows through this network,
we can attempt to update ACLs automatically to stay out of engineer's way while still ensuring that the connections between services are trusted and can be verified. So this is a diagram that we'll be diving into individual pieces of,
but basically this is a very simplified view of a network. We've got three nodes. Those nodes each have a certificate sort of defining who they are, and they can use those certificates to communicate with each other through TLS tunnels. They have authorization logic that runs on them that is fed by this sort of centralized map of what nodes in the network should talk to each other.
Let's jump into the first pillar here, which is the implementation of TLS. So specifically here, we're looking at these tunnels. Before I start, though, it's important to cover some basic concepts here
just to get everyone on the same page. We're using mutual TLS here. You've heard of traditional TLS: that's what your web browser uses all the time, where you have a client that is verifying the identity of a server. Normally it will get the cert, make sure that the subject alternative name or the CN matches the domain name,
and if so, tell you it's verified. But TLS is really awesome, and it actually out of the box supports verification in both directions. So you can have the client also present a certificate in that initial handshake, and the server can check who is talking to it using an equally strong authenticator.
This is pretty hard to deploy on the public web because users can't really manage certificates, but in your own production network, this works really well because you can distribute certs to everyone. So this is really great because this means we can have two-way strong authentication with key material that security engineers understand.
We know how to deal with these sorts of systems. So not only can we make sure that clients of services know they're talking to a legitimate service, but the service can look at who's talking to it and make sure that the caller seems appropriate. So that's mutual TLS.
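To make the mutual TLS idea concrete, here is a minimal Python sketch of both sides using the standard ssl module; the certificate paths and the internal CA file are hypothetical stand-ins, not the actual setup described in the talk.

```python
import ssl

# Client side: verify the server against our internal CA *and* present our own cert.
client_ctx = ssl.create_default_context(cafile="/etc/certs/internal-ca.pem")
client_ctx.load_cert_chain("/etc/certs/node.pem", "/etc/certs/node.key")

# Server side: present our cert *and* require one from the caller.
server_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
server_ctx.load_cert_chain("/etc/certs/node.pem", "/etc/certs/node.key")
server_ctx.load_verify_locations("/etc/certs/internal-ca.pem")
server_ctx.verify_mode = ssl.CERT_REQUIRED

# After the handshake, either side can ask who it is talking to, for example:
#   peer = tls_socket.getpeercert()
#   sans = [value for kind, value in peer.get("subjectAltName", ()) if kind == "DNS"]
```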
Next up: service discovery. This was a term that I hadn't heard a lot about before I started working at companies that used a lot of cloud environments and SOA. But at its core, service discovery is the concept that you have some node in a network, and it needs to find other nodes to provide services to it. If you think about it, DNS is a very old, basic service discovery system.
You want to perform a Google search, so you go to www.google.com, and DNS finds you a server that can provide Google services to you. So these have gotten a lot more complex and varied as people move to these environments where hosts are very flexible
and stuff moves around a lot. And they're pretty ubiquitous in modern service-oriented architectures. And service discovery can actually be kind of problematic for security if you do it wrong, because fundamentally it's trying to be a map of the network and be really helpful about, like, oh, hey, find this service here, find this service here.
But I'll argue that we can actually use this to great effect in achieving security. So Airbnb uses a framework called Smart Stack and so that was what was there when we started this project, and we built this security extension on top of the Smart Stack framework.
So that's what I'll sort of be talking about, but I believe that these concepts can be applied to most service discovery systems. As a brief aside on how SmartStack works: this is a system that Airbnb created and open-sourced a few years ago. The basic idea is that it uses two other publicly available projects,
ZooKeeper and HAProxy, in order to make it easy for services to talk to each other. If you look at this example above, Node 2 is hosting a service, service B, and so service B is going to report into a ZooKeeper cluster: "Hello, I'm a service B instance, and you can find me at Node 2."
Node 1 wants to talk to service B, and so it will load the relevant addresses for service B from ZooKeeper, and it will put them into its local HAProxy instance. HAProxy is a reverse proxy that just kind of forwards traffic along. Service A, if it wants to make a call to service B, simply then just sends a request to localhost
and leaves it to HAProxy to find a suitable host to fulfill that request. So an important thing to note here is that this system was not designed for security. Anything can write into ZooKeeper. It is, like, the most prone to impersonation thing possible because you just ask for a list of nodes and you get them, and it's not really authenticated.
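To make that registration flow concrete, here is a rough Python sketch using the kazoo ZooKeeper client; the ensemble address, znode paths, and payload format are invented, and in a real SmartStack deployment the nerve and synapse components handle this.

```python
import json
import socket

from kazoo.client import KazooClient

zk = KazooClient(hosts="zookeeper.internal:2181")   # hypothetical ensemble
zk.start()

# Node 2: a service B instance announces itself. The ephemeral znode disappears
# if the instance dies, so the registry stays roughly current.
payload = json.dumps({"host": socket.getfqdn(), "port": 8080}).encode()
zk.create("/services/service_b/nodes/instance-", payload,
          ephemeral=True, sequence=True, makepath=True)

# Node 1: discover every service B backend; SmartStack's synapse would render
# this list into the local HAProxy configuration.
backends = []
for child in zk.get_children("/services/service_b/nodes"):
    data, _stat = zk.get("/services/service_b/nodes/" + child)
    backends.append(json.loads(data))

# Note: nothing in this flow stops any other client from registering itself the
# same way, which is exactly the lack of authentication described above.
```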
But I'll show you how in the next few slides we can build security into this system. So the old way that we connected to services before any security upgrades is that service A wants to talk to service B. It sends a request to its local outbound proxy,
and that sends it along. So it's going to make an HTTP request to localhost. That gets sent through the reverse proxy, goes across the Internet to service B. Not a lot of security going on here. What we added is a secure shim. So we added a new reverse proxy that runs on the receiving node in front of service B,
and we reconfigured the proxies to communicate to each other with mutual TLS. So now all of the traffic that's going over the Internet is in a TLS tunnel. But crucially, service A and service B did not change at all. Service A is still sending HTTP traffic. Service B is still receiving HTTP traffic.
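A toy version of that inbound shim might look like the following Python sketch; it is purely illustrative (the deployment described here used Envoy for this role), and the certificate paths and port numbers are made up.

```python
import asyncio
import ssl

# Terminate mutual TLS on behalf of service B and forward plain bytes to the
# real service on localhost. Callers without a valid client cert are rejected.
tls = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
tls.load_cert_chain("/etc/certs/node.pem", "/etc/certs/node.key")
tls.load_verify_locations("/etc/certs/internal-ca.pem")
tls.verify_mode = ssl.CERT_REQUIRED

async def handle(client_reader, client_writer):
    # Service B still speaks plain HTTP; it just listens on localhost.
    up_reader, up_writer = await asyncio.open_connection("127.0.0.1", 8080)

    async def pump(src, dst):
        while data := await src.read(65536):
            dst.write(data)
            await dst.drain()
        dst.close()

    await asyncio.gather(pump(client_reader, up_writer),
                         pump(up_reader, client_writer))

async def main():
    server = await asyncio.start_server(handle, "0.0.0.0", 8443, ssl=tls)
    async with server:
        await server.serve_forever()

asyncio.run(main())
```

Neither service has to know this shim exists: the outbound proxy makes the TLS connection to this listener, and both services keep speaking plain HTTP to localhost.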
So we were able to pretty radically change the security model of this cross-host communication without touching a single line of an engineer's code. So this is where we're getting our invisibility from. There's some other really big benefits to this. Because there are these two service discovery proxies
that are doing the TLS setup, and they are the things that can do authentication and verification of this TLS tunnel. Security was able to build these controls once and distribute them basically across the entire fleet. The same proxies can run no matter what language the service is written in,
what sort of protocol that service uses. And so instead of having to verify authentication, authorization code in dozens of different frameworks and languages, we were able to do it just about once. The other thing that ended up being really helpful is that having these proxies on either side of your service communications
is actually really helpful for non-security reasons. So things like consistent metrics, better tracing, better ability to do load testing. We got all of those for free by adding in more proxies, and thus we got to really get the support of other infrastructure teams at the company
who maybe didn't have direct security goals, but they wanted to help us do this. So basically, what we've done with this whole proxy thing is sort of the opposite of what the NSA wants. You may remember this slide from a leaked NSA presentation where they were discovering with glee that inside Google's cloud network at the time,
there was a lot of plain text HTTP going on, and SSL was added and removed. We are just adding SSL and keeping it there. All of the arrows on the right need to be TLS in the modern age. One important caveat about this particular approach is this concept of proxy exclusivity,
which is that basically we are relying on this inbound proxy to provide the security benefits of TLS, confidentiality and authenticity, and thus it is crucial that going through the inbound proxy is the only way to talk to a given service. If that service is reachable by going around the inbound proxy, you would still be able to talk plain text HTTP to it
and possibly evade authentication mechanisms, and so it's important that this is impossible. I'll talk a little bit about how we solved this particular issue. It's just something that's important to be thinking about if you're going to implement this approach. So that's TLS. By implementing a new proxy into an existing service discovery framework,
we can switch all the traffic to be going over TLS without radically changing the code of services running. Next up, though, is that really what we wanted out of all this is segmentation, right? We want to make sure that only legitimate things can connect to a given service, and so we need to build a sense of identity
that can be used to do this verification. So in this next pillar, I'm going to be talking about how we put these certificates there and, more importantly, how we decide what that certificate is going to say. So segmentation. You know, we're trying to make sure that a node in the network
can only talk to the things that it should be allowed to talk to. You know, if a node needs to talk to the payments backend service, it's going to do that for business reasons, but we can maybe make sure that only nodes that have to talk to a given service can. But a lot of previous thought about segmentation tends to happen on this subnet level.
You make a zone of hosts, and things in that zone can talk to each other, and then maybe they can get out to other zones via certain predefined channels. But in a microservice network or something that has a lot of, like, dynamic communication going on, it may make more sense to think about this on a service level as opposed to a host level.
So we'll say things like, we want the payments config page service to be able to talk to the payments backend service. That seems like a reasonable thing to do. But in our network, we've also got a Slackbot running that makes memes for engineers, and that thing should definitely not be able to talk to the payments backend service.
So we can start representing these sorts of decisions instead of these sort of static tiers of hosts. We have a bunch of these services, and each service sort of keeps a list of identities that it's going to allow to connect to it. And we just did all this work to build up these proxies
on either side of a service communication that understand TLS and are using TLS. And TLS is fantastic at verifying identities. So we can now start to build the segmentation by saying, for a given service listener, here are the following identities which are allowed to connect to it. And thus, you can end up in a state where
only the right things can talk to a given service based on business need. We do have to identify all the nodes in our network, though. And this is something that's going to vary a bit depending on how your network is set up. So you need to sort of find a concept of an identity
that fills a few key attributes. So this identity that you decide for a node needs to be pretty varied. If you have one identity for everything, you're back at soft center network again because you won't be able to do any distinguishing.
You need an identity that a node can't change about itself. Otherwise, an attacker would be able to compromise a particular host, change its identity, and then move into zones the network it shouldn't be allowed to. It should really be able to be something that you can detect automatically so that you can sort of automate the distribution
of these certificates. If you end up having to go through an Excel spreadsheet, figure out what each host is, and then mint those certs yourself, it's not really going to work. And finally, we do need to represent this concept of an identity in a TLS certificate. So in our case, we wanted something that could fit
into a subject alternative name. So most modern networks have some concept of a role that works pretty well for this. When you have a config management system or a cloud permission system, you almost always are giving things identities based on their function, and this tends to work pretty well for this. So in our network, we used Amazon IAM roles,
which is a sort of designation given to an instance that gives it some level of permissions in AWS. This worked really well because most different services had them, they can't be changed unless you have very high-level administrative permissions in AWS, and the role can be represented as a string,
so it fits well in a certificate.
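As an illustration of what "detect the identity automatically and put it in a certificate" can look like, here is a hypothetical Python snippet that reads the node's IAM role from the EC2 instance metadata service and places it in a CSR's subject alternative name; the talk does not specify the actual issuance pipeline, so treat the paths, libraries, and role naming as assumptions.

```python
import requests
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID

# Ask the instance metadata service which IAM role this node runs under
# (IMDSv1 shown for brevity; hardened setups would use IMDSv2 tokens).
role = requests.get(
    "http://169.254.169.254/latest/meta-data/iam/security-credentials/",
    timeout=2,
).text.strip()

# Build a CSR whose subject alternative name carries the role, assuming the
# role name is something DNS-safe like "payments-backend-role".
key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
csr = (
    x509.CertificateSigningRequestBuilder()
    .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, role)]))
    .add_extension(x509.SubjectAlternativeName([x509.DNSName(role)]), critical=False)
    .sign(key, hashes.SHA256())
)
with open("node.csr", "wb") as f:
    f.write(csr.public_bytes(serialization.Encoding.PEM))
```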
So to kind of look at what we're going to do here now: we need to give everything an identity, and we need to make certificates that allow nodes to prove their identity in these TLS communications. We can then build this map of what identities should be allowed to access what services. This is what is going to give us our segmentation, because we're going to be able to distribute that map, saying, for the payments backend service, you allow the following identities and no others. And thus, you can get to this place where only a very select set of nodes in your network
can access the sensitive stuff. But how do we make that map? That's pillar three, which is the final segment of this diagram. So how do we figure out what needs to talk to what and distribute that? So a big question here is really all about trust.
How do you figure out what needs to talk to what and do it with a minimum of human-involved computation? A lot of what I was talking about earlier in the very beginning of this presentation was about the sort of human cost of segmentation.
If you have people who are spending all day trying to make firewall configurations, that's going to be rather expensive, difficult to keep safe, et cetera. We want to try and get away from the configurable list style of security engineering, where you hire a ton of security engineers to try and figure out what is supposed to talk to what.
So we wondered, could we just infer this from existing code? Can we look at how the network currently works, at how our configurations are defined, and use that to build this sense of how communication should happen? So this is getting to an interesting point
because the decisions you make here really depend on how you think about threats at your organization. So we decided that if you are somebody who can merge peer-reviewed, CI-passed code into our config management system, that means you're reasonably authorized to make changes.
And this is something that may vary based on your organization's setup, and we'll kind of dig more into those questions in a bit. But for our case, we realized we have this Chef repo. The Chef repo is a, you know, Chef is a config management system that can distribute information
to all of the nodes running in our network. And it already, in a nice machine-parsable way, was saying what the dependencies of every service were. So in this hypothetical example, we have a service one. Service one has dependencies on the production database, a cache, and a monitoring service. And this is already set up in a repository
that is rather heavily controlled. You have to be, you know, an engineer that gets peer review, et cetera, to commit to this. So what we can do is we can take this, determine that service one is an authorized caller of these services, and then sort of build that into this map. Say, for production DB, service one is authorized.
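Stripped of all the Chef and Kubernetes parsing, that inference step is essentially an inversion of a dependency map. A toy sketch, with made-up service names, might look like this:

```python
# Hypothetical dependency declarations, standing in for what the real system
# extracts from the Chef repo and Kubernetes manifests.
dependencies = {
    "service_one": ["production_db", "cache", "monitoring"],
    "meme_slackbot": ["slack_gateway"],
}

def build_authorization_map(dependencies):
    """Invert 'which services do I call?' into 'which identities may call me?'."""
    allowed_callers = {}
    for caller, deps in dependencies.items():
        for dep in deps:
            allowed_callers.setdefault(dep, set()).add(caller)
    return allowed_callers

def connection_allowed(service, caller_identity, auth_map):
    # This is the check each inbound proxy applies to the caller's certificate identity.
    return caller_identity in auth_map.get(service, set())

auth_map = build_authorization_map(dependencies)
assert connection_allowed("production_db", "service_one", auth_map)
assert not connection_allowed("production_db", "meme_slackbot", auth_map)
```

The real system has to deal with many more config formats than this, but the shape of the output, a per-service set of allowed caller identities, is the same.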
To do this, we built this service called Arachne. Arachne, named after a Greek spider goddess, is kind of computing the web of services and nodes at the Airbnb network. And so basically, it's continuously pulling our Chef repository and deployed Kubernetes artifacts
to figure out what connections have been defined by trusted people and building a sort of reverse map of, for a given service, what identity should be allowed to connect. It can then push these into S3. I'll talk about why we did that in a little bit. And then those can be sent to all of the nodes
that are actually doing this allowance. So the barriers that you're going to be putting into place about how this map is generated really depends on how you think about insider threats at your company. So in our case, we've made the conscious decision
to trust our engineers and rely on things like CI checks and peer review in order to make sure that legitimate things are committed. But depending on how you approach this, you may want to have more controls in place. And this system is rather flexible to do that. All you need is something that can automatically discover as much as it can,
and then, under some conditions, publish a new authorization map to some location. So you could certainly imagine, if you wanted more controls than this, making it so that when a new connection is discovered, it prompts the security team for a quick manual review and an acknowledgment before that actually gets distributed. So this does give security a single point of control,
where they can do any sort of monitoring or additional approvals if they wish, while still taking away a lot of that boilerplate work of trying to figure out what actually connects to what. We can actually go further with authorization. Instead of just telling all of these service discovery proxies to allow these identities and ban these others,
because we're just using vanilla TLS, we can rely on the broad support for these sorts of protocols in many tools. So the reverse proxy that we use as the inbound proxy has the feature to inject information about the client certificate used into HTTP streams
that went through it. Most of our APIs are HTTP-based ones, so this applies to most things, and it means that whenever a service gets a call over TLS, it can just parse this very simple header and know exactly what sort of identity is calling it, making it trivial to implement various permission levels depending on your service caller.
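As a rough sketch of the application side, assuming the inbound proxy forwards caller-certificate details in a header such as Envoy's x-forwarded-client-cert (the exact header name and value format depend on how the proxy is configured), a permission check can be as simple as:

```python
# Hypothetical role names; the header value below is only an approximation of
# what a proxy like Envoy emits.
WRITE_ROLES = {"payments-backend-role"}

def caller_identity(headers):
    xfcc = headers.get("x-forwarded-client-cert", "")
    # e.g. "Hash=abc123;URI=spiffe://internal/node;DNS=payments-backend-role"
    for field in xfcc.split(";"):
        if field.startswith("DNS="):
            return field[len("DNS="):]
    return None

def permission_level(headers):
    return "read-write" if caller_identity(headers) in WRITE_ROLES else "read-only"
```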
This sort of authorization control would have been really tough to implement before this system, because you'd have to set up maybe your own TLS system or maybe a system of tokens or keys or passwords. But this lets us leave all of the tricky crypto stuff
to the security-owned components and let app developers just parse a very simple header and make decisions based on that. So those are the three pillars of this solution. We set up TLS in between everything to give us the security properties and communication that we need. We give everything an identity in order to make sure that they can authenticate to each other
and enforce segmentation by having specific allow lists for every service, and then we automatically discover this map by parsing configurations already there. But I'm not here just to sell you on this solution because I like it. There are some downsides, and to be perfectly honest,
I want you to know about them before you consider implementing something like this. And so these are just some of the things that we thought about and decided to accept. First, you are going to need to constantly synchronize out this map of allow lists
or some subset of those allow lists. Instead of having centralized allowance of various network communications like you might have if you have a central firewall, you're sort of doing it in a distributed way. Every node is determining whether or not a connection is allowed, and so that means that you have a reasonably strong need for a lot of bandwidth to synchronize this out.
You can use caching, which will make some things a lot easier, and I'll talk about why we did that, but that is going to cost you some in terms of update latency. If the web changes and you need to allow a new identity for a service, that may be slower if you're using a cache.
Second, if TLS has a problem, like Heartbleed, you have way more problems than you used to, because you're now relying on one of the core security elements of your system. So this is something that we know, but the reasoning here is basically: if Heartbleed happens again, if we find some sort of major core issue in TLS,
already security is going to be working nights until we can get that patched on our front-end web servers. And so if we're going to be massively deploying new OpenSSL versions as quickly as we possibly can, that's going to end up patching up all these as well. So basically, we are relying on the fact that major SSL issues are going to get a quick community response
and be something that we can move quickly on. Adding more reverse proxies in your traffic flow turns out to be kind of complicated. This introduced a lot of interesting behavior in some services, and I'll talk a little bit more about the specific things you ran into,
but it's just something to note that the actual addition of TLS to things broke very little, but the additional hop in the network had surprising effects. You do need to be able to run software wherever you're receiving traffic through the system because you need to install that secure listener and something that can download the allow lists.
If you manage all of your own infrastructure, this is relatively easy, but if you have things like vendor devices or hosted services where you cannot install arbitrary software, that gets a little harder. When we have some services that are in this state, we basically put proxy boxes in front of them and use those to handle the authentication.
Finally, you are going to want some sort of certificate revocation because if a node does get compromised, you'll need to kick out its permissions, and this, I say, is usually tricky. There are certainly ways to do it, but this is something to be thinking about and scoping as you're considering doing a deployment like this.
So, rolling it out. My hope is that, you know, I've described this solution, but it's not just theoretical. This is something we did, and so I hope that I can share as much as I can about what we learned throughout this whole process.
So, to start with sort of the technical details, we built this mostly out of components that are available and open source. So, for the inbound proxy, we used Envoy, which is a project open sourced out of Lyft that is really growing in popularity in the service mesh world
and for good reason. It's really designed for this sort of thing. It's modern, it's fast, it has great support for TLS, has a ton of metrics which are really useful, and generally served us very well. The one thing we ran into with Envoy is that it is quite the stickler about the HTTP/1.1 standard,
and that led to some funny behavior in certain other applications that were not so strict about it. But overall, Envoy was a great choice, and we're actually migrating to use that on our outbound side as well. As I sort of alluded to earlier, we gave every node an identity based on its AWS IAM role,
and this was just sort of a natural choice for us because this is already how we were thinking about permissions for nodes. Now, nodes got their permissions by their IAM role, and that also kind of controlled what services they were allowed to talk to. The Arachne service I mentioned is basically
just a continually running Ruby script that loads a Chef repo and some Kubernetes artifacts and parses them. These authorization maps, the quote-unquote web files, are uploaded to and downloaded from S3. So we're using S3 as the source where this all gets actually pulled from.
All in all, it's about four minutes to fully compute the web of services and generate one of these web files, meaning that it's about a four-minute delay in between a change in topology, that is, a service adding or removing a dependency, and when that gets reflected in allow lists. In our experience, this is far shorter than the time it takes to actually deploy
such a change to production. So we haven't really run into race conditions where a new dependency gets added before it's allowed. We had some pretty specific availability considerations. Mainly on caching the output of Arachne, these web files.
So we wanted to make sure that if Arachne went down and we stopped being able to generate these authorization maps, that all the traffic kept working. We didn't want to be owners of a service that, if it went down, would ban all traffic. And really, if you think about it, by decentralizing all of the authentication of service calls,
you want to be able to rely on decentralization benefits. So by putting everything in S3 and letting nodes download it from there, we can make sure that if Arachne has some sort of critical problem, if it stops running, the worst thing that happens is that new topology stops being reflected.
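A simplified version of that download-with-a-local-fallback behavior, with invented bucket, key, and cache paths, might look like this on each node:

```python
import json

import boto3
from botocore.exceptions import BotoCoreError, ClientError

# Hypothetical names; the real "web files" live wherever Arachne uploads them.
BUCKET, KEY, CACHE = "acl-maps", "authorization-map.json", "/var/cache/web-file.json"

def load_authorization_map():
    try:
        body = boto3.client("s3").get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
        with open(CACHE, "wb") as f:        # refresh the on-disk cache
            f.write(body)
    except (BotoCoreError, ClientError):
        with open(CACHE, "rb") as f:        # S3 or Arachne is down: keep serving stale data
            body = f.read()
    return json.loads(body)
```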
So this means that traffic keeps flowing, even if S3 goes down, as it famously did last year, I think. That was a fun day. Things basically still work. Nodes won't be able to download new topology changes, but they'll still have locally cached ones on disk,
and all the traffic will keep flowing as normal. So this was a choice we made early on and has served us very well because when there are new and interesting things that happen with Arachne, no one really notices. Generally, security is able to fix it before someone changes the topology. So the plan for a rollout was basically these six steps.
We started by computing this authorization map. Since this all kind of works on offline data, we were able to spend some time writing the software to do this and getting it to work nicely before we had to actually touch any production services. So we could build that map and verify its correctness.
Next, we wanted to give everything an identifying certificate. So the idea of doing this first is that this is a pretty small change and something that we could pretty safely roll out. We're simply dropping a certificate on a bunch of nodes, and it's also relatively easy to verify that this worked before moving on to the next step. We can check for the existence of these files
in a large-scale way and make sure they look good. Third, we installed this receiving proxy everywhere and started listening and setting up this traffic routing. At this point, no traffic is actually flowing through these TLS tunnels; we're simply putting the path in place. This also lets us configure
or verify the step before moving on. Next, we can start actually doing the testing and building the confidence in this system. So we can start routing some traffic through these new secure listeners, and we configured our sort of configuration in a way that we could turn it on
or turn it off per service. So we picked a bunch of services that seemed representative, high-QPS ones, low-QPS ones, ones that use HTTP, ones that were plain TCP, just a great variety of things that seemed like they would sort of stress test the system, and we turned these on one by one and built confidence that this is going to work.
Step five is sort of the radical one, and that is the switching everything over at once. This is not always how you want to run operations, but we chose this for a very good reason, which is that there were two people working on this project, and there were, you know, 1,098 other engineers
building services as fast as they could, and we were reasonably confident that if we tried to go one by one, we would never catch up. We had to be able to build a system that we could switch on all at once and confidently move into a post-plaintext future. Our final step is rebinding services to localhost
so that these security guarantees were enforced. So we did this last, and this was sort of painful from a security perspective because you really have to wait till step six is complete before you really get the security benefits here, but we had to give ourselves the ability to roll back if things turned out to have problems.
We wanted to make sure that if switching to TLS for a service caused some unintended effect, we could roll back, fix that, and then roll forward again once that was dealt with. So to sort of visualize this, we start with the nodes. We've always had those. We built the authorization map
and made sure that was available first, moved on to adding the certificates to everything. We installed the reverse proxies with their authorization logic. We turned on TLS for some things to make sure that it worked, and then on plaintext deprecation day,
everything went to TLS. So we did this in April of this year, and there's a lot of things that went well. We went from about 15% internal TLS usage to 70% in one evening, which was really awesome and something that I don't think would have been possible with any other scheme.
We made sure that there were a lot of non-security benefits to this system as well, and this let us get wider organizational support for such a change. These sorts of massive sweeping infrastructure changes, because they affect everything, can make other engineers nervous,
especially people who are primarily concerned about uptime. And so we wanted to make sure there was plenty of stuff in there for them, too. Some of the chief benefits we provided included much easier configuration because we were automatically assigning identities to everything and preconfiguring certificates. Engineers no longer had to think about
setting up a custom mTLS connection if they needed security benefits. Performance, we'll talk about that in a sec, but the numbers are good. And then there were a ton more metrics available, so people could have greater observability in their services and realize what was going on, and that was operationally very helpful.
The other thing that we did that was a really good choice was making sure that we had the right configuration. We could disable TLS routing for individual services on a one-off basis so that if we determined that a certain service was having a problem, we didn't have to roll the whole thing back.
We could keep the wins we'd gotten and roll certain services back in order to fix them before moving forward again. Of course, I'm here to be honest with you. There are some things that were hiccups during the whole process. As I mentioned earlier, running everything through an inbound proxy
sounds good on paper, but leads to some weird stuff in practice. So of the 2,500 services, most of them took this fine. There was just a small percentage that did weird things. There are some things that change if you're using a reverse proxy, like all of your traffic is suddenly coming from localhost. Even small things like changing the case of HTTP headers,
which is fully allowed by the spec, can lead to weird behavior in some applications. Reverse proxies can also mess with stateful, long-lived things like WebSockets. We didn't think about the WebSockets case and did not have support for that on day zero. That was a quick day one patch
to teach our reverse proxy that WebSocket connections are special and need to be handled specially. So all of these things are generally surmountable, but you are going to run into some weird behavior. The thing that I thought was funny about all of this is that really the biggest problems we had had nothing to do with the security properties. Even if we'd had a plain HTTP reverse proxy,
we would have had the same problems. Our testing process, because of how we turned this on, was very good at testing the case where suddenly all of your traffic starts coming in over this TLS channel. So you enable TLS for service B, and suddenly all the service B nodes get all the traffic over TLS. We tested that well. What we didn't have great testing coverage on
was what happens if all of the services that your box depends on suddenly start requiring TLS. And so we ran into some interesting issues with this. Most particularly, HAProxy, which we were using for the outbound proxy, was a bit of an older version. It handled TLS certificates very poorly.
And so for certain roles that had thousands of dependencies, it would load the same certificate into memory over and over again for every connection it was making, and that led to some pretty crazy memory issues. So that was something that we could have tested a little better.
The final thing to mention is that binding these services to localhost, that last step, took longer than expected. We expected to be able to use easy service config templates that were built into our configuration management to say, okay, everything that used to be binding to 0.0.0.0, you're now binding to localhost. This ended up taking a few weeks longer than we expected
because there was more drift in how we did configuration than we expected. This is just one of those things that I wish we could have allocated a little more time to in the beginning. I mentioned I'd talk about performance because this always comes up whenever you introduce a TLS project. Someone is like, but what if it's really slow?
And fortunately, I can sort of confirm the security industry's assertion about it, which is that things often actually got faster, which, honestly, I didn't expect. Whenever somebody said this, I had this sort of disbelief, like, eh, did it really? For a number of our services,
we improved 95th percentile latency by as much as 80%. What was happening here is that we had a bunch of these services that had sort of hand-implemented mutual TLS for security reasons. Particularly high-sensitivity things, like password services, did implement mTLS because they wanted to be secure.
But they were implementing entirely at an app layer, and so application to application was communicating with mutual TLS. And these applications tended to restart reasonably frequently whenever there were deploys, new boxes spun up, et cetera. And so they were unable to take particularly good advantage of TLS session caching and session resumption,
meaning they had to do the full TLS handshake all the time, making them quite slow. Service discovery proxies restart very, very infrequently. They kind of come up when a box comes up and often last for weeks or months, and thus their TLS session caches are very well warmed.
And thus, we were able to keep a session resumption rate of near 100%, meaning that we were basically just paying the AES encryption cost, which was just happening in hardware and added very little. So that was a really great benefit, and we were able to pretty much squash the concern of this will be too slow for our network.
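To make the session resumption point concrete, here is a minimal Python sketch, not our actual proxy stack, of how a long-lived client process can reuse a TLS session so repeat connections skip the full handshake. The host name is made up, and the server has to support resumption for the reused flag to come back true.

```python
import socket
import ssl

# Hypothetical internal endpoint; a long-lived proxy keeps its TLS session
# cache warm, so repeated connections use the abbreviated handshake.
HOST, PORT = "service-b.internal.example", 443

context = ssl.create_default_context()

def connect(session=None):
    raw = socket.create_connection((HOST, PORT))
    conn = context.wrap_socket(raw, server_hostname=HOST, session=session)
    print("session reused:", conn.session_reused)
    return conn

first = connect()                # full handshake: certificate exchange, key agreement
# Note: with TLS 1.3 the session ticket may only arrive after some data is exchanged.
first.close()

second = connect(first.session)  # abbreviated handshake via session resumption
second.close()
```

The proxies in our deployment effectively play the role of that long-lived client, which is why the resumption rate stayed near 100%.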
So doing this in your infrastructure, I imagine some of you may be involved with networks that are not as segmented as you like. And I think this provides a good approach to implementing segmentation on a large scale in a way that is actually shippable. There are some questions you should ask yourself
when thinking about this that might help you assess whether or not this is a good solution for you. First, how effective is it for you to be able to distribute these proxies in your service communications? So we had a lot of benefits in that we had a configuration management system that could deploy software and configuration,
and we already had these outbound proxies that were in place because of the service discovery system we used. So this is something that came pretty naturally for us, but it's something to think about in your own environment. How would you assign identities? This is really important because an identity is a zone. It's a segment in our network,
and so if you have a highly specific way to refer to things that you can turn into TLS certificates, this may work really well for you. If you don't have this, you may need to do some work to get there. In our case, IAM role is what we went with, but at the beginning of the project, not every instance had an IAM role. We had to do a little legwork in the beginning to get that enforced
for our entire infrastructure. Will you need to manually configure these access control lists, or will you be able to automatically generate them? If you can automatically generate, that's where you're going to get these huge efficiency wins, so that's something that you really want to push for if you can.
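To give a concrete flavor of that automatic generation, here is a toy Python sketch. The service names and the hard-coded dependency dict are made up; in practice the declarations would come out of whatever your services already register with service discovery. The core move is just inverting "A declares it calls B" into "B allows A".

```python
import json

# Hypothetical dependency declarations, as services might register them:
# "this service declares that it calls these upstreams".
dependencies = {
    "search-frontend": ["search-backend", "user-profile"],
    "meme-bot": ["image-cache"],
}

# Invert the declarations into an allowlist keyed by the server identity:
# for each service, the client identities that may connect to it.
allowed_clients = {}
for client, upstreams in dependencies.items():
    for upstream in upstreams:
        allowed_clients.setdefault(upstream, set()).add(client)

authorization_map = {srv: sorted(clients) for srv, clients in allowed_clients.items()}
print(json.dumps(authorization_map, indent=2))
```

If you can derive the input to something like this automatically, the access control lists stay current without anyone hand-editing them.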
The other good news is that there are some available options on the market right now that can help do this for you. We hand-implemented the whole thing, but Istio and Consul, which are both solutions being pushed as the new way to do service mesh,
especially in Kubernetes, implement this sort of security system already. So to be clear, this is not something that we totally invented. This is an idea that's been going around for a while, and Istio and Consul implement it for you in an easily packageable way. They do less on the automatic generation side,
but you could easily kind of build this sort of system using these tools. But if you don't want to make such a huge leap and switch to a whole new service mesh system, you can certainly implement the security benefits here with your existing service discovery stack, as we did. So to kind of sum up here,
I'm here to tell you that you can switch to a deeply authenticated network, and the reason you can do that is because you can make the changes here invisible, and you can make the system fast. Because of these generated authorization maps and the automatic TLS, an engineer who is working on a microservice
before the system and after the system has basically the exact same experience. They still use the same HTTP calls they always did. They still add a new dependency, get that changed, approved, and merged to master, and then their service talks to it no problem. Their flow remains the same as it always has been.
But now, when an attacker compromises that Slack meme bot with some sort of, you know, meme injection or whatever it is, they find themselves in a network zone where they can talk to basically nothing. And all of the services that were wide open to them beforehand simply reject their connections out of hand whenever they try to go past a layer 4 connection.
So this is something that I believe is possible. We've done it now. And I think it's a great strategy as you try and build into the security what you weren't able to do when your network first started. So, thank you very much for listening. If you want to stay connected or ask me more questions about the details of this,
something that's not as easy to do in the Q&A section, you can hit me up at maxb on Twitter, max.berkhart@airbnb.com, or if you just want to see what we're up to at Airbnb Engineering, airbnb.io. Thank you very much.
Thank you, Max. Thank you, Max. If you do have a question, please line up on the microphones, try to limit your question to a single sentence. If you'd like to leave at this point, please do that as quietly as possible.
Signal Angel, your first question from the internet, please. Hello. Why OpenSSL and not LibreSSL? So, I guess I said OpenSSL,
just sort of a random example. You can use whatever SSL stack works best for you. I believe that the way our packages were being built used OpenSSL, but switching to something like BoringSSL or LibreSSL would probably be a good idea for further hardening.
Thank you. Microphone number two, your question. Hi, great talk. What are you currently doing to mitigate the increased risk of localhost-bound SSRF? So, yeah. SSRF is something that is deeply troubling to me, as somebody working on AppSec
at a company that works almost exclusively with HTTP API calls. Our approach, honestly, is very dedicated static analysis. We are watching engineer-written code very vigilantly for anything that might make outbound HTTP calls and trying to ensure that it doesn't hit internal stuff.
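Just to sketch the flavor of that kind of check, here is a toy Python example. It is not our actual tooling; the HTTP method names and internal host markers are illustrative. It flags outbound HTTP calls whose URL is dynamic, and therefore possibly attacker-influenced, or that points at internal-looking hosts.

```python
import ast
import sys

HTTP_FUNCS = {"get", "post", "put", "delete", "request"}
INTERNAL_MARKERS = ("169.254.169.254", "localhost", "127.0.0.1", ".internal")

def check(path):
    tree = ast.parse(open(path).read(), filename=path)
    for node in ast.walk(tree):
        # Match attribute-style calls such as requests.get(...) or session.post(...)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if node.func.attr in HTTP_FUNCS and node.args:
                url = node.args[0]
                if not isinstance(url, ast.Constant):
                    print(f"{path}:{node.lineno}: dynamic URL in outbound HTTP call")
                elif any(m in str(url.value) for m in INTERNAL_MARKERS):
                    print(f"{path}:{node.lineno}: HTTP call to internal-looking host")

if __name__ == "__main__":
    for source_file in sys.argv[1:]:
        check(source_file)
```

Real SSRF analysis has to deal with taint flow, redirects, and URL parsing quirks, so treat this strictly as an illustration of the idea.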
That's an area that my team is trying to do a lot of work to improve, and perhaps that could be a future talk. Cool. Thanks. Microphone number one, please. Very interesting idea. Are you going to include workstations too? Our workstations? It's an interesting thought.
At the moment, we don't have our workstations plugged into the same service discovery, and so they don't have the sort of core proxies that could work for this. But I think that if that's an architecture you wanted to go to, this would actually lend itself pretty well, because if you're managing your workstations, you could hand them identities just as well.
You probably need to think of a slightly different approach to identities because you can't give a physical machine an AWS IAM role, at least for us. But if that's something that your network has, then I think it's a very reasonable way to go. Thank you. Signal Angel, your next question.
Did they audit the proxy code before placing it in front of whole services? Yes, we took a close look at it. Microphone number two. What are the cost implications for implementation and on your operations in general?
Costs are pretty low because the reverse proxy is pretty efficient. It doesn't use a ton of extra compute, so we didn't have to scale up anything in order to support this verification. Again, being able to just do AES is pretty cheap. The generation of the map is very cheap.
It's running on a single Kubernetes pod, just running a Ruby script, so that's fine. Probably the greatest cost is simply an S3 transfer of that authorization map, and that's something that we think we're going to be able to continue to reduce by sort of being a little smarter about how often we check.
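As a rough sketch of what being smarter about the sync could look like, the generator can skip the S3 upload whenever the serialized map has not changed. The bucket and key names here are made up, and the ETag-equals-MD5 shortcut only holds for simple, unencrypted, non-multipart uploads.

```python
import hashlib
import json

import boto3
from botocore.exceptions import ClientError

BUCKET, KEY = "example-authz-maps", "authorization-map.json"  # hypothetical names
s3 = boto3.client("s3")

def sync_if_changed(authorization_map):
    body = json.dumps(authorization_map, sort_keys=True).encode()
    local_etag = hashlib.md5(body).hexdigest()
    try:
        # For simple PUTs the S3 ETag is the MD5 of the object body.
        remote_etag = s3.head_object(Bucket=BUCKET, Key=KEY)["ETag"].strip('"')
    except ClientError:
        remote_etag = None  # object does not exist yet
    if local_etag != remote_etag:
        s3.put_object(Bucket=BUCKET, Key=KEY, Body=body)
        return True
    return False
```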
In certain areas of the network that evolve very infrequently and don't have a lot of topology changes, we'd be able to sync that a lot less frequently. So that's something I think we can improve on, but overall, the cost is pretty low. Signal Angel, do you have more questions from the net? Okay, then microphone number two, please. In terms of certification authority,
how are you managing the lifetime of the certificates, and what kind of considerations did you make on that side, like certificate expiration, renewal, and also OCSP, whether it's already implemented or not? Yeah, so we want to get to a point where certificates expire a lot faster.
There are some companies that have done a really great job with certs that only last about a month or even two weeks, something like that, and unfortunately, I think our infrastructure isn't at a point where we can reliably reduce it that much. Our current approach is that we can currently ban things by basically introducing them into a deny list
in the allow list generation stage, and that will result in something being banned within about four minutes. So that's how we can deal with active compromises, but there's just sort of a longer running effort to be able to increase our infrastructure refresh rate so that we can have really short-lived certificates
to deal with those sorts of stolen cert attacks. Thanks. Your question? Since you do all of this on a flat layer 3 network and you already mentioned payment information, what does this mean for your PCI DSS scope and how does it affect certification if you handle payment data
and the systems are connected to other systems in your network and not separated by firewalls or something? Our PCI network is a little interesting. It actually is a totally separate thing from most of the Airbnb production network, so that specific certification didn't affect us,
but I think we've also been pretty effective at convincing auditors that this is an effective way to do access control, even though it's happening at a layer that is not the traditional one. So for PCI DSS specifically,
our cardholder environment actually is just a web page that syncs to Braintree, so we don't have to deal with that one specifically, but it's something that has been received pretty favorably by our compliance folks. Thank you. Signal Angel. Could you elaborate on how you got the management and application engineers
to buy in to the changes described in your talk? What objections did they raise and how did you address them? Thank you for asking. This is something that I like talking about. I think that a lot of security is actually being a good salesman for your solutions.
Whenever you are presenting something like this that has such a wide scope, it's crucial to make sure that there's something in it for the stakeholders beyond just security's goals. And so a lot of those things for us were around developer ease and productivity, reducing the pain that engineers were feeling
in trying to set up their own TLS implementations or their own authentication stacks, better performance benefits like I discussed. These were all things that other infrastructure and product teams heard about and wanted, and so they were very open to our original request.
And then from there on, it was all about being a good steward of an operation, having really good operational plans, showing that we'd done our homework in terms of testing, and really thinking like an infrastructure engineer or an SRE, instead of just a security engineer. Security is our ultimate goal,
but we need to make sure that we are not burning our credibility with the rest of the organization when going for that. So there was a lot of time spent thinking like, forgetting about all the security benefits right now, how am I going to make sure this isn't going to take everything down?
Thank you. Microphone 1, your question. Do all nodes have the whole allowlist files, and what technology stack do you use to apply them to Envoy? So, yeah, everything gets the whole allowlist file. The technology stack is JSON, so basically there's a very small shim that downloads this file from S3
and then puts the relevant list of allowed identities into an Envoy configuration file, and then Envoy uses its automatically updating SDS configuration to load that every few seconds.
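A minimal sketch of what such a shim can look like is below. The bucket, key, role, and output path are made up, and the real thing feeds Envoy's dynamically reloaded configuration rather than a bare JSON file, but the shape is the same: fetch the map, pull out the identities allowed to talk to the local role, and write them somewhere the proxy watches.

```python
import json
import os
import tempfile

import boto3

BUCKET, KEY = "example-authz-maps", "authorization-map.json"   # hypothetical names
LOCAL_ROLE = os.environ.get("SERVICE_ROLE", "search-backend")  # hypothetical identity
OUT_PATH = "/etc/envoy/allowed-principals.json"                # hypothetical path

def refresh():
    body = boto3.client("s3").get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
    authorization_map = json.loads(body)
    allowed = authorization_map.get(LOCAL_ROLE, [])
    # Write atomically so the proxy never reads a half-written file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(OUT_PATH))
    with os.fdopen(fd, "w") as f:
        json.dump({"allowed_client_identities": sorted(allowed)}, f)
    os.replace(tmp, OUT_PATH)
```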
So that's how that synchronization works. Okay, thanks. Microphone 2, last question. Have you considered using a PubSub push of just the relevant metadata based on the X.509 identity of the clients, so that you're not also giving them all the information about the entire map for the entire network?
Yeah, so you can rather easily segment what information you're providing, and it's really just a matter of engineering time. At the moment, we have pretty wide availability of that through other service discovery mechanisms, so it wasn't a priority for us, but it would be relatively
easy to have customized allowlist file availability. In particular, since everything is an IAM role and everything has its own IAM role, you can simply make IAM role-specific allowlist files in S3 and set up the permissions to allow just those to access it. So that actually wouldn't be that hard to implement. Thank you.
Thank you for answering all the questions. Thank you. Please give us some applause for his patience.