Sneaking In Network Security

Video in TIB AV-Portal: Sneaking In Network Security

Formal Metadata

Title
Sneaking In Network Security
Subtitle
Enforcing strong network segmentation, without anyone noticing
Title of Series
Author
License
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
2018
Language
English

Content Metadata

Subject Area
Abstract
Highly compartmentalized network segmentation is a long-held goal of most blue teams, but it's notoriously hard to deploy once a system has already been built. We leveraged an existing service discovery framework to deploy a large-scale TLS-based segmentation model that enforces access control while automatically learning authorization rules and staying out of the way of developers. We also did it without scheduling downtime or putting a halt to development. This talk covers how we engineered this, and shares lessons learned throughout the process.
Keywords
Security

[Music] okay, our next talk is Sneaking In Network Security. Our speaker Max is going to tell us how to scale up the defense of computer networks, and in particular how to integrate that into existing networks. Max here is a former pentester and now a blue team member; please welcome him with a warm applause [Music]
Thank you. Hi everyone, my name is Max Burkhart. I'm here to tell you today about sneaking in network security: how I and a small team of other security engineers managed to implement a strong network segmentation model in an already running, high-scale network. I'm a security engineer at Airbnb, and so the practical experience of this project occurred in that network; however, I think that the techniques we'll go over here today will apply to many other networks, and so I'll spend some time talking about the technical theory behind this approach as well as what happened when we rolled it out, in an attempt to give you some good evidence and experience to run this in your own environment. So let's talk about network security in 2018. Segmentation continues to be a really good idea, because we all know that compromises are
going to happen: boxes are going to be popped, whether it's a zero-day or something less fancy, like somebody forgot to patch a server. Network segmentation gives you the controls to keep those compromises contained, to make sure that low-security systems can't pivot into higher-security zones, and to help your incident response teams keep incidents localized. However, if you've ever been involved in network pentesting, you'll know that a well-segmented network is a rare thing to see, and I think we know why this happens. As networks grow quickly, small security teams, especially ones at something like a startup, like Airbnb was, find themselves having to prioritize their work where it is the most impactful, and that usually ends up being the perimeter, the internet-facing hosts. And so as a network grows quickly, you end up with a large network that has this sort of hard-shell, soft-center architecture, where the external perimeter may be hardened, but once an attacker is able to compromise that, they may have relatively free rein inside the rest of it. Ask any blue team member and they'll know that this is a bad place to be. But change is hard, especially
with a pretty large network. So, to give you an idea of the scale that this project dealt with: earlier this year, when we were implementing this, Airbnb's production network had about 2,500 services and about 20,000 nodes, and I define a node to be something that's sort of like a host, whether it's an instance running in EC2 or a Kubernetes pod. Over a thousand engineers were doing hundreds of production deploys per day, so things are moving really fast, and it's hard to go in and build in large architectural changes like adding segmentation. Furthermore, because of this sort of highly service-ified architecture, there was a lot of complex interconnectivity between these things, so determining where the zones should be was difficult in itself. Finally, developer productivity is a really big concern for us, and especially to my managers and their managers: if you have over a thousand engineers writing code every day and you slow them all down by 5% or 10%, that's actually a really expensive thing to do, and it's not something that's gonna fly. So the question became: how do we go from a soft-center network to something that has
good segmentation and the security properties you want, when we're not allowed to stop development? We can't start over; we've got to be able to build the security in as the network is running. We hear a lot, especially in the pentesting and offensive community, about trying to be like a ninja, right? Get into the network, do stuff without anyone noticing. I'll argue that it's just as important in defensive security: we need to be defensive security ninjas, able to sneak in, put in the defenses, and have nobody know we were there. So what's the theory that we're
going to be applying in this approach? We need to stop thinking about security as this layer around development, as another step in the waterfall model. This is maybe how we were thinking about it thirty years ago: you'd build an application, then you'd do security
testing, and then you'd ship it to production, but that just hasn't really held up anymore. There have been a lot of smart people talking about sort of the new way to do things: agile security, DevSecOps, SecDevOps, the field can't decide; this whole concept of really unifying security operations and software engineering so that you're building a secure thing all the way through. This certainly isn't something that we invented, many people have been working on it, but I've found that most of the time people think about this concept in terms of application development, and I think it's time that we integrate it with network security as well. I think the important thing here is scale, right? We need to build a security solution that scales with development. There's this saying that it's good to hire lazy engineers and developers, because they're going to build things that scale up and don't require a lot of manual work; that's even more important for security engineers. You're never going to outwork the attackers, and so you need to build something that's going to scale along with your engineering group. So, like good project managers, we're going to lay out the requirements for this solution before we jump into how it actually works. Whatever we build needs to stay out of the way of engineers; it may be something they're aware of, but the farther we can keep it out of their scope the better, so they can just keep writing applications that make the company money or accomplish your organization's goals, and their stuff ends up being secure. Security by default is of course something that we have been chasing for a long time, but I think we can go further than that: beyond being secure by default, it should actually be hard to have an insecure configuration with this system, so we'll try to design things in that manner. And finally, we want to build something that, as much as possible, is flexible to whatever sort of network or protocols you
were using. You don't really ever know what's going to be coming six months down the road. When this was being worked on, Airbnb was mostly a Linux-on-Amazon shop, but I don't know what's going to happen in the next six months: we might acquire a company with a completely different stack and try to integrate that, or we'll start going to on-prem datacenters. I have no idea what's going to be in the future, so we want to build a solution that's going to be as agnostic as possible to those sorts of decisions. My next slide is basically the whole solution condensed into two sentences: we're going to use mutual TLS, built into the service
discovery system, for authentication and confidentiality across all service communications, and we're going to discover those access lists totally automatically, for security with zero to almost zero configuration. This is a lot of jargon on a single slide, so I don't expect you to visualize it yet; we'll dive into each of these parts and I'll show you how they fuse together to build a system that is invisible and secure. So to start off, I've sort of
isolated three pillars of this approach. The first is TLS in service discovery. We love TLS; it's one of the really powerful protocols that the security industry has managed to build, and it gives us great security properties if we can use it everywhere. So the first pillar is: get everything to be using TLS, and by building it into service discovery, make sure that it runs everywhere without a lot of per-app configuration. Pillar two is binding identity to nodes. In a more traditional network segmentation model you might define subnets or restrict things by IP address; we're going to be able to be a little more flexible with how we refer to individual nodes in this network, because we're using TLS as an authenticator and therefore can sort of define our own concepts of identity, and I'll get into that.
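As a rough sketch of how these first two pillars might look in code (purely illustrative: Python's `ssl` module stands in for the service discovery proxies, and the `role/...` identity naming and the hard-coded allowlist are made-up placeholders, not the actual implementation described in the talk): the receiving proxy requires a client certificate, and authorization means checking the caller identity in that certificate's subject alternative name against the service's allowlist.

```python
import ssl

# Pillar 1: the service discovery proxies speak mutual TLS to each other.
# The receiving proxy demands a client certificate signed by the internal CA.
server_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
server_ctx.verify_mode = ssl.CERT_REQUIRED       # verification in BOTH directions
# server_ctx.load_cert_chain("node.pem", "node.key")    # placeholder paths
# server_ctx.load_verify_locations("internal-ca.pem")   # trust only our own CA

client_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)    # verifies server by default
# client_ctx.load_cert_chain("node.pem", "node.key")    # ...and presents a cert too

# Pillar 2: each node's certificate carries an identity in its subject
# alternative name; the inbound proxy checks it against a per-service allowlist.
ALLOWED = {"payments-backend": {"role/payments-frontend", "role/billing-worker"}}

def peer_identity(peer_cert):
    # peer_cert is shaped like the dict returned by ssl.SSLSocket.getpeercert()
    for kind, value in peer_cert.get("subjectAltName", ()):
        if kind == "URI":                        # hypothetical SAN convention
            return value
    return None

def authorize(service, peer_cert):
    # The inbound proxy's decision: is this caller on the service's allowlist?
    return peer_identity(peer_cert) in ALLOWED.get(service, set())

frontend_cert = {"subjectAltName": (("URI", "role/payments-frontend"),)}
meme_bot_cert = {"subjectAltName": (("URI", "role/meme-bot"),)}
print(authorize("payments-backend", frontend_cert))  # True
print(authorize("payments-backend", meme_bot_cert))  # False
```

In the approach the talk describes, the allowlist would of course be generated automatically (pillar three) rather than hard-coded, and the contexts would load real key material.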
Finally, we're going to generate an authorization map: by automatically determining what services need to talk to what, and figuring out how data flows through this network, we can attempt to update ACLs automatically, to stay out of engineers' way while still ensuring that the connections between services are trusted and can be verified. So this
is a diagram that we'll be diving into the individual pieces of, but basically this is a very simplified view of a network. We've got three nodes; those nodes each have a certificate sort of defining who they are, and they can use those certificates to communicate with each other through TLS tunnels. They have authorization logic that runs on them, which is fed by this sort of centralized map of which nodes in the network should talk to each other. Let's jump into the first pillar here,
which is the implementation of TLS. So
specifically, here we're looking at these tunnels. Before I start, though, it's important to cover some basic concepts
here, just to get everyone on the same page. We're using mutual TLS here. You've certainly heard of traditional TLS; that's what your web browser uses all the time, where you have a client that is verifying the identity of a server: normally it will get the cert, make sure that the subject alternative name or the CN matches the domain name, and if so, tell you it's verified. But TLS is really awesome, and it actually supports verification in both directions out of the box: you can have the client also present a certificate in that initial handshake, and the server can check who is talking to it using an equally strong authenticator. This is pretty hard to deploy on the public web, because users can't really manage certificates, but in your own production network this works really well, because you can distribute certs to everyone. This is really great because it means we can have two-way strong authentication with key material that security engineers understand, right? We know how to deal with these sorts of systems. So not only can we make sure that clients of services know they're talking to a legitimate service, but that service can look at who's talking to it and make sure that that's a caller that seems appropriate. That's mutual TLS. Service discovery was a term that I hadn't heard about a lot before I started working at companies that used a lot of cloud environments and SOA, but at its core, service discovery is this concept that you have some node in a network and it needs to find other nodes to provide services to it. If you think about it, DNS is a very old, sort of basic service discovery system: you want to perform a Google search, so you go to google.com, and DNS finds you a server that can provide Google services to you. These systems have gotten a lot more complex and varied as people move to environments where hosts are very flexible and stuff moves around a lot, and they're pretty ubiquitous in modern service-oriented
architectures. Service discovery can actually be kind of problematic for security if you do it wrong, because fundamentally it's trying to be a map of the network and be really helpful: oh hey, find this service here, find that service there. But I'll argue that we can actually use this to great effect in achieving security. Airbnb uses a framework called SmartStack; that was what was there when we started this project, and we built this security extension on top of the SmartStack framework, so that's what I'll be talking about, but I believe that these concepts can be applied to most service discovery systems. As a brief aside on how SmartStack works: this is an open-source system that Airbnb created and open-sourced a few years ago. The basic idea is that it uses two other publicly available projects, ZooKeeper and HAProxy, in order to make it easy for services to talk to each other. If you look at this example above, node two is hosting a service, service B, and so service B is going to report into a ZooKeeper cluster: hello, I'm service B and you can find me at node two. Node one wants to talk to service B, so it will load the relevant addresses for service B from ZooKeeper and put them into its local HAProxy instance. HAProxy is a reverse proxy that just forwards traffic along; service A, if it wants to make a call to service B, simply sends a request to localhost and leaves it to HAProxy to find a suitable host to fulfill that request. An important thing to note here is that this system was not designed for security: anything can write into ZooKeeper. It is like the most prone-to-impersonation thing possible, because you just ask for a list of nodes and you get them, and it's not really authenticated. But I'll show you how, in the next few slides, we can build security into this system. So, the old way
that we connected to services, before any security upgrades: service A wants to talk to service B; it sends a request to its local outbound proxy, and that sends it along. So it's gonna make an HTTP request to localhost, that gets sent through the reverse proxy, and it goes across the network to service B. Not a lot of security going on here. What we added is a secure shim: we added a new reverse proxy that runs on the receiving node, in front of service B, and we reconfigured the proxies to communicate with each other with mutual TLS. So now all the traffic that's going over the network is in a TLS tunnel, but crucially, service A and service B do not change at all: service A is still sending HTTP traffic, service B is still receiving HTTP traffic. So we were able to pretty radically change the security model of this cross-host communication without touching a single line of an engineer's code, and this is where we're
getting our invisibility from. There are some other really big benefits to this. Because there are these two service discovery proxies doing the TLS setup, and they're the things that can do authentication and verification of this TLS tunnel, security was able to build these controls once and distribute them basically across the entire fleet. The same proxies can run no matter what language the service is written in or what sort of protocol that service uses, and so instead of having to verify authentication and authorization code in dozens of different frameworks and languages, we're able to do it just about once. The other thing that ended up being really helpful is that having these proxies on either side of your service communications is actually really useful for non-security reasons: things like consistent metrics, better tracing, better ability to do load testing. We got all of those for free by adding in more proxies, and thus we really got the support of other infrastructure teams at the company who maybe didn't have direct security goals but wanted to help us do this. So basically, what we've done with this whole proxy thing is sort of the opposite of what the NSA wants. You may remember this slide from a leaked NSA presentation, where they were
discovering with glee that inside Google's cloud network at the time, there was a lot of plaintext HTTP going on, and SSL was added and removed. We are just adding SSL and keeping it there; all of the arrows on the right need to be TLS in the modern age. One important caveat about this particular approach is this
concept of proxy exclusivity, which is that basically we are relying on this inbound proxy to provide the security benefits of TLS, confidentiality and authenticity, and thus it is crucial that going through the inbound proxy is the only way to talk to a given service. If that service is reachable by going around the inbound proxy, you would still be able to talk plaintext HTTP to it and possibly evade authentication mechanisms, and so it's important that this is impossible. I'll talk a little about how we solved this particular issue; it's just something that's important to be thinking about if you're going to implement this approach. So that's TLS: by implementing a new proxy into an existing service discovery framework, we can switch all the traffic to going over TLS without radically changing the code of the services running. Next up,
though, is that what we really wanted out of all this is segmentation, right? We want to make sure that only legitimate things can connect to a given service, and so we need to build a sense of identity that can be used to do this verification. So in this next pillar,
we're talking about how we put these certificates there, and more importantly, how we decide what that certificate is going to say. So, segmentation:
we're trying to make sure that a node in the network can only talk to the things that it should be allowed to talk to. If a node needs to talk to the payments back-end service, it's going to do that for business reasons, but we can maybe make sure that only nodes that have to talk to a given service can. A lot of previous thought about segmentation tends to happen at this subnet level: you make a zone of hosts, things in that zone can talk to each other, and then maybe they can get out to other zones via certain predefined channels. But in a microservice network, or something that has a lot of dynamic communication going on, it may make more sense to think about this on a service level as opposed to a host level. So we'll say things like: we want the payment config page service to be able to talk to the payment back-end service; that seems like a reasonable thing to do. But in our network we've also got a Slack bot running that makes memes for engineers, and that thing should definitely not be able to talk to the payment back-end service. So we can start representing these sorts of decisions as,
instead of these sort of static tiers of hosts, we have a bunch of services, and each service keeps a list of identities that it's going to allow to connect to it. And we just did
all this work to build up these proxies on either side of a service communication that understand and use TLS, and TLS is fantastic at verifying identities. So we can now start to build the segmentation by saying: for a given service listener, here are the identities which are allowed to connect to it. Thus you can end up in a state where only the right things can talk to a given service, based on business need. We do have to identify all the nodes in our network, though, and this is something that's going to vary a bit depending on how your network is set up. You need to find a concept of an identity that fulfills a few key attributes. The identity you decide on for a node needs to be pretty varied: if you have one identity for everything, you're back at a soft-center network again, because you won't be able to do any distinguishing. You need an identity that a node can't change about itself; otherwise an attacker would be able to compromise a particular host, change its identity, and then move into zones of the network it shouldn't be allowed into. It should be something that you can detect automatically, so that you can automate the distribution of these certificates; if you have to go through an Excel spreadsheet, figure out what each host is, and then mint those certs yourself, it's not really going to work. And finally, we do need to represent this concept of an identity in a TLS certificate, so in our case we wanted something that could fit into a subject alternative name. Most modern networks have some concept of a role that works pretty well for this: when you have a config management system or a cloud permission system, you are almost always giving things identities based on their function. In our network we used Amazon IAM roles, which are a designation given to an instance that grants it some level of permissions in AWS, and this worked really
well, because most different services had them, they can't be changed unless you have very high-level administrative permissions in AWS, and they can be represented as a string, so they fit well into a certificate. So, to look at what we're going to do here: we need to give everything an identity, and we need to make certificates that allow nodes to prove their identity in these TLS communications. We can then build this map of which identities should be allowed to access which services. This is what is going to give us our segmentation, because we're going to be able to distribute that map, saying: for the payments backend service, you allow the following identities and no others. Thus you can get to a place where only a very select set of nodes in your network can access the sensitive stuff. But how do we make that map? That's
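To make the idea of that map concrete, here is a minimal sketch in Python. All service and role names are hypothetical examples, not the actual Airbnb map:

```python
# Sketch of the per-service identity allowlist described in the talk.
# Keys are services; values are the identities (e.g. IAM role names
# taken from a TLS certificate SAN) allowed to connect to them.
AUTHORIZATION_MAP = {
    "payments-backend": {"payments-config-page", "payments-worker"},
    "meme-bot-api": {"slack-gateway"},
}

def is_allowed(service: str, caller_identity: str) -> bool:
    """Return True if the caller may connect to the service.
    Unknown services allow nothing (default deny)."""
    return caller_identity in AUTHORIZATION_MAP.get(service, set())
```

The default-deny behavior for unknown services is the property that makes the meme bot unable to reach the payments backend at all.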
pillar three, which is the final segment of this diagram. So, how do we figure out what needs to talk to what, and distribute that? A big question here is really all about trust: how do you figure out what needs to talk
to what, and do it with a minimum of human involvement? A lot of what I was talking about at the very beginning of this presentation was about the human cost of segmentation. If you have people who are spending all day trying to make firewall configurations, that's going to be rather expensive, difficult to keep safe, etc. We want to get away from the hand-curated-list style of security engineering, where you hire a ton of security engineers to try and figure out what is supposed to talk to what. So can we just infer this from existing code? Can we look at how the network currently works, at how our configurations are defined, and use that to build a sense of how communication should happen? This is getting to an interesting point, because the decisions you make here really depend on how you think about threats at your organization. We decided that if you are somebody who can merge peer-reviewed, CI-passed code into our config management system, that means you're reasonably authorized to make changes. This is something that may vary based on your organization's setup, and I'll dig more into those questions in a bit. But in our case we realized we have
this Chef repo. Chef is a config management system that can distribute information to all of the nodes running in our network, and it was already saying, in a nice machine-parsable way, what the dependencies of every service were. In this hypothetical example we have service one; service one has dependencies on the production database, a cache, and a monitoring service. And this is already set up in a repository that is rather heavily controlled: you have to be an engineer, get peer review, etc. to commit to it. So what we can do is take this, determine that service one is an authorized caller of these services, and then build that into this map, saying: for the production DB, service one is authorized to connect. We built a service called Arachne.
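The reverse-map computation described here can be sketched in a few lines. The real system parses Chef and Kubernetes artifacts; this simplified version takes a dictionary of declared dependencies (made-up data) and inverts it:

```python
# Invert declared dependencies into the "web": for each service,
# the set of services authorized to call it. The input shape mimics
# dependency lists declared in config management; it is illustrative,
# not the actual Chef data model.
declared_deps = {
    "service-one": ["production-db", "cache", "monitoring"],
    "service-two": ["cache"],
}

def build_web(deps: dict) -> dict:
    web = {}
    for caller, dependencies in deps.items():
        for dep in dependencies:
            web.setdefault(dep, set()).add(caller)
    return web
```

Because the input already lives in a peer-reviewed repository, the inversion itself needs no human judgment; that is where the efficiency win comes from.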
Arachne, named after the Greek spider goddess, is continuously computing the web of services and nodes in the Airbnb network. It's continuously pulling our Chef repository and deployed Kubernetes artifacts to figure out what connections have been defined by trusted people, and building a sort of reverse map: for a given service, which identities should be allowed to connect. It can then push these into S3 (I'll talk about why we did that in a little bit), and then those can be sent to all of the nodes that are actually doing this connection allowance. So the barriers that you're
going to be putting into place around how this map is generated really depend on how you think about insider threats at your company. In our case we made the conscious decision to trust our engineers, and rely on things like CI checks and peer review to make sure that only legitimate things are committed. Depending on how you approach this, you may want more controls in place, and this system is rather flexible: all you need is something that can automatically discover as much as possible and then, under some conditions, publish a new authorization map to some location. You could certainly imagine, if you wanted more controls than this, making it so that when a new connection is discovered, it prompts the security team for a quick manual review and an acknowledgement before it actually gets distributed. This gives security a single point of control where they can do any sort of monitoring or additional approvals if they wish, while still taking away a lot of that boilerplate work of figuring out what actually connects to what. We can actually go further with authorization, instead of just telling all of these discovery proxies to allow these identities and ban those others, because
we're just using vanilla TLS, and we can rely on the heavy support for these sorts of protocols in many things. The reverse proxy that we use as the inbound proxy has a feature to inject information about the client certificate into the HTTP streams that went through it. Most of our APIs are HTTP-based, so this applies to most things, and it means that whenever a service gets a call over TLS, it can just parse this very simple header and know exactly what identity is calling it, making it trivial to implement various permission levels depending on your service's caller. This sort of authorization control would have been really tough to implement before this system, because you'd have had to set up maybe your own TLS system, or a system of tokens or keys or passwords. But this lets us leave all of the tricky crypto stuff to the security components and let app developers just parse a very simple header and make decisions based on it. So those are the three pillars of this solution: we set up TLS between everything to give us the security properties in communication that we need; we give everything an identity in order to make sure that things can authenticate to each other, and enforce segmentation by having specific allowlists for every service; and then we automatically discover this map by parsing configurations that are already there. But I'm not here just to sell you on this solution because I like it; there
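For illustration, Envoy can inject an `x-forwarded-client-cert` header, which is a semicolon-separated list of `Key=Value` pairs; exactly which fields appear depends on the proxy's configuration, so treat the format and the SAN scheme below as assumptions, not the exact Airbnb setup:

```python
from typing import Optional

def caller_identity(xfcc_header: str) -> Optional[str]:
    """Extract a caller identity from a client-cert header injected
    by the inbound proxy. Here we assume the identity is carried in
    a URI-type SAN field; a real deployment might use Subject or DNS
    fields instead, depending on proxy configuration."""
    for part in xfcc_header.split(";"):
        key, _, value = part.partition("=")
        if key.strip() == "URI":
            return value.strip().strip('"')
    return None
```

The point of the talk stands either way: the application sees a plain header and never touches the TLS machinery itself.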
are some downsides, and to be perfectly honest, I want you to know about them before you consider implementing something like this. These are just some of the things that we thought about and decided to accept. First, you are going to need to constantly synchronize out this map of allowlists, or some subset of those allowlists. Instead of having centralized allowance of network communications, like you might have with a central firewall, you're doing it in a distributed way: every node is determining whether or not a connection is allowed, and that means you have a reasonably strong need for bandwidth to synchronize this out. You can use caching, which will make some things a lot easier (and I'll talk about why we did that), but that is going to cost you some update latency: if the web changes, and you need to allow a new identity for a service, that may be slower if you're using a cache. Second, if TLS has a problem, a Heartbleed, you have way more problems than you used to, because you're now relying on it as one of the core security elements of your system. This is something that we knew, but the reasoning here is basically: if Heartbleed happens again, if we find some sort of major core issue in TLS, security is already going to be working nights until we can get that patched on our front-end web servers, and if we're going to be massively deploying new OpenSSL versions as quickly as we possibly can anyway, that's going to end up patching all of these as well. So basically we are relying on the fact that major SSL issues are going to get a quick community response and be something that we can move quickly on. Third, adding more reverse proxies into your traffic flow turns out to be kind of complicated; this introduced a lot of interesting behavior in some services, and I'll talk a little more about the specific things we ran into. But it's just something to know: the actual addition of TLS broke little, but the
additional hop in the network had surprising effects. Fourth, you do need to be able to run software wherever you're receiving traffic through this system, because you need to install that secure listener and something that can download the allowlists. If you manage all of your own infrastructure, this is relatively easy, but if you have things like vendor devices or hosted services where you cannot install arbitrary software, that gets a little harder. We have some services that are in this state; we basically put proxy boxes in front of them and use those to handle the authentication. Finally, you are going to want some sort of certificate revocation, because if a node does get compromised, you'll need to kick out its permissions, and this, I'd say, is usually tricky. There are certainly ways to do it, but it's something to be thinking about and scoping as you're considering a deployment like this. So, rolling it out:
my hope is that, you know, I've described the solution, but it's not just theoretical; this is something we did, and so I hope I can share as much as I can about what we learned throughout the whole process. To start with the technical details, we built this
mostly out of components that are available and open source. For the inbound proxy we used Envoy, a project open-sourced out of Lyft that is really growing in popularity in the service mesh world, and for good reason: it's really designed for this sort of thing, it's modern, it's fast, it has great support for TLS, and it has a ton of metrics, which are really useful; it generally served us very well. The one thing we ran into is that Envoy is quite the stickler about the HTTP/1.1 standard, and that led to some funny behavior with certain other applications that were not so strict about it. But overall Envoy was a great choice, and we're actually migrating to use it on our outbound side as well. As I alluded to earlier, we gave every node an identity based on its AWS IAM role, and this was a natural choice for us because this is already how we were thinking about permissions for nodes: nodes got their permissions via their IAM role, and now that also controls what services they are allowed to talk to. The Arachne service I mentioned is basically just a continually running Ruby script that loads the Chef repo and some Kubernetes artifacts and parses them into these authorization maps. The quote-unquote web files are uploaded to and downloaded from S3, so we're using S3 as the source that everything actually gets pulled from. All in all, it takes about four minutes to fully compute the web of services and generate one of these web files, meaning there's about a four-minute delay between a change in topology (that is, a service adding or removing a dependency) and when that gets reflected in allowlists. In our experience this is far shorter than the time it takes to actually deploy such a change to production, so we haven't really run into race conditions where a new dependency gets added before it's allowed. We had some pretty specific availability considerations, mainly around caching the output of Arachne, these web files. We wanted to make sure
that if Arachne went down and we stopped being able to generate these authorization maps, all the traffic kept working. We didn't want to be owners of a service that, if it went down, would ban all traffic. And really, if you think about it, by decentralizing all of the authentication of service calls, you want to be able to rely on the benefits of decentralization. By putting everything in S3 and letting nodes download it from there, we can make sure that if Arachne has some sort of critical problem, if it stops running, the worst thing that happens is that new topology stops being reflected. This means that traffic keeps flowing even if S3 goes down, as it famously did last year (I think that was a fun day): things basically still work; nodes won't be able to download new topology changes, but they'll still have locally cached ones on disk, and all the traffic will keep flowing as normal. This was a choice we made early on, and it has served us very well, because when new and interesting things happen with Arachne, no one really notices; generally security is able to fix it before someone changes the topology. So, the plan for the rollout was basically these six steps. We started by computing this authorization map; since this all works on offline data, we were able to spend some time writing the software to do this and getting it to work nicely before we had to actually touch any production services, so we could build that map and verify its correctness. Next, we wanted to give everything an identifying certificate. The idea of doing this first is that it's a pretty small change and something that we could pretty safely roll out: we're simply dropping a certificate on a bunch of nodes, and it's also relatively easy to verify that it worked before moving on to the next step; we can check for the existence of these files in a large-scale way and make sure they look good. Third, we installed this receiving proxy everywhere and started listening and setting up this
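The availability pattern described here (try to fetch a fresh web file, but fall back to the last good copy on disk if anything fails) can be sketched roughly as follows. The fetch function, cache path, and data shape are placeholders, not the real implementation:

```python
import json
import os
import tempfile

# Placeholder location for the last-known-good web file.
CACHE_PATH = os.path.join(tempfile.gettempdir(), "web_file_cache.json")

def load_authorization_map(fetch_fresh) -> dict:
    """Fetch a fresh web file (in practice, an S3 download); on any
    failure, fall back to the locally cached copy so traffic keeps
    flowing. The worst case is stale topology, never a total outage."""
    try:
        fresh = fetch_fresh()
        with open(CACHE_PATH, "w") as f:
            json.dump(fresh, f)       # refresh the on-disk cache
        return fresh
    except Exception:
        with open(CACHE_PATH) as f:   # fails only if we never had a good copy
            return json.load(f)
```

This is the "fail to stale data, not fail closed" choice the talk credits for Arachne outages going unnoticed.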
traffic routing. At this point no traffic is actually flowing through these TLS tunnels; we're simply giving it the path, and this also lets us verify the step before moving on. Next, we can start actually doing the testing and building confidence in this system. We can start routing some traffic through these new secure listeners, and we've set up our configuration in a way that we could turn it on or off per service. So we picked a bunch of services that seemed representative: high-QPS ones, low-QPS ones, ones that use HTTP, ones that were plain TCP; a great variety of things that seemed like they would stress-test the system. We turned these on one by one and built confidence that this was going to work. Step five is the radical one, and that is switching everything over at once. This is not always how you want to run operations, but we chose it for a very good reason, which is that there were two people working on this project and there were, you know, a thousand-odd other engineers building services as fast as they could. We were reasonably confident that if we tried to go one by one, we would never catch up; we had to build a system that we could switch on all at once and confidently move into a post-plaintext future. Our final step was rebinding services to localhost so that these security guarantees were enforced. We did this last, and it was sort of painful from a security perspective, because you really have to wait until step six is complete before you get the security benefits. But we had to give ourselves the ability to roll back if things turned out to have problems: we wanted to make sure that if switching a service to TLS caused some unintended effect, we could roll back, fix it, and then roll forward again once it was dealt with. So, to
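The per-service kill switch relied on during rollout can be modeled as a tiny routing decision. Service names and port numbers here are invented for illustration:

```python
# Sketch of a per-service TLS rollout flag: each service can be
# flipped between the plaintext and TLS listener independently, so
# one misbehaving service can be rolled back without losing the
# migration progress everywhere else. Ports are hypothetical.
PLAINTEXT_PORT = 8080
TLS_PORT = 8443

tls_enabled = {
    "payments-backend": True,   # migrated
    "legacy-reporting": False,  # rolled back pending a fix
}

def upstream_port(service: str) -> int:
    # Default to plaintext until a service is explicitly opted in.
    return TLS_PORT if tls_enabled.get(service, False) else PLAINTEXT_PORT
```

During the big switchover the default would flip to TLS, with the map holding only the exceptions.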
visualize this: we start with the nodes; we've always had those. We built the authorization map and made sure that was available first, moved on to adding the certificates to everything, installed the reverse proxies with their authorization logic, turned on TLS for some things to make sure that it worked, and then, on plaintext deprecation day, everything went to TLS. We did this in April of this year, and there were a lot of
things that went well. We went from about 15 percent internal TLS usage to 70 percent in one evening, which was really awesome and something that I don't think would have been possible with any other scheme. We made sure that there were a lot of non-security benefits to this system, and this let us get wider organizational support for such a change. These sorts of massive, sweeping infrastructure changes, because they affect everything, can make other engineers nervous, especially people who are primarily concerned about uptime, so we wanted to make sure there was plenty of stuff in there for them. Some of the chief benefits we provided included much easier configuration: because we were automatically assigning identities to everything and preconfiguring certificates, engineers no longer had to think about setting up a custom mTLS connection if they needed security benefits. Performance: I'll talk about that in a sec, but the numbers are good. And there were a ton more metrics available, so people could have greater observability into their services and see what was going on, which was operationally very helpful. The other thing we did that was a really good choice was making sure we had the right configuration controls: we could disable TLS routing for individual services on a one-off basis, so that if we determined a certain service was having a problem, we didn't have to roll the whole thing back; we could keep the wins we'd gotten and roll certain services back in order to fix them before moving forward. Of course, I'm here to be honest with you: there were some hiccups during the whole process. As I mentioned earlier, running everything through an inbound proxy sounds good on paper but leads to some weird stuff in practice. Of the 2,500 services, most took this fine; there's just a small percentage that did weird things. There are some things that change if you're using a reverse proxy, like all your traffic
suddenly coming from localhost. Even small things, like changing the case of HTTP headers, which is fully allowed by the spec, can lead to weird behavior in some applications. Reverse proxies can also mess with particularly stateful things like WebSockets; we didn't think about the WebSockets case and did not have support for it on day zero. That was a quick day-one patch, to teach our reverse proxy that WebSocket connections are special and need to be handled specially. All of these things are generally surmountable, but you are going to run into some weird behavior. The thing that I thought was funny about all this is that the biggest problems we had had nothing to do with the security properties; even if we'd had a plain HTTP reverse proxy, we would have had the same problems. Our testing process, because of how we turned this on, was very good at testing the case where suddenly all your traffic starts coming in over this TLS channel: you enable TLS for service B, and suddenly all the service B nodes get all their traffic over TLS. We tested that well. What we didn't have great testing coverage on was what happens when all the services that your box depends on suddenly start requiring TLS, and so we ran into some interesting issues with this. Most notably, HAProxy, which we were using for the outbound proxy, was a bit of an older version that handled TLS certificates very poorly, and for certain roles that had thousands of dependencies it would load the same certificate into memory over and over again for every connection it was making, and that caused some pretty crazy memory issues. So that was something we could have tested a little better. The final thing to mention is that binding these services to localhost took longer than expected. We expected to be able to use the service config templates that were built into our configuration management to say, okay, everything that used to bind to 0.0.0.0, you're now binding to localhost. This ended
up taking a few weeks longer than we expected, because there was more drift in how we did configuration than we anticipated. This is just one of those things that I wish we could have allocated a little more time to in the beginning. I mentioned I'd talk about performance, because this always comes up: whenever you introduce a TLS project, someone says, but what if it's really slow? Fortunately, I can confirm the security industry's assertion, which is that things often actually got faster, which I didn't expect. Whenever somebody says this, you have this sort of disbelief, like, did it really? And yeah: for a number of our services we improved 95th-percentile latency by as much as 80 percent. What was happening here is that we had a bunch of services that had hand-implemented mutual TLS for security reasons. Particularly high-sensitivity things, like password services, did implement mTLS because they wanted to be secure, but they were implementing it entirely at the app layer, so application was talking to application with mutual TLS. These applications tend to restart reasonably frequently, whenever there are deploys, new boxes spun up, etc., and so they were unable to take particularly good advantage of TLS session caching and session resumption, meaning they had to do the full TLS handshake all the time, making them quite slow. Service discovery proxies restart very infrequently: they come up when a box comes up and often last for weeks or months, and thus their TLS session caches are very well warmed. We were able to keep a session resumption rate of near a hundred percent, meaning that we were basically just paying the AES encryption cost, which was happening in hardware and added very little. So that was a really great benefit, and we were able to pretty much squash the concern that this would be too slow for our network. So, doing this in your infrastructure: I
imagine some of you may be involved with networks that are not as segmented as you'd like, and I think this provides a good approach to implementing segmentation on a large scale in a way that is actually shippable. There are some questions you should ask yourself when thinking about this that might help you assess whether or not this is a good solution for you.
First, how easy is it for you to distribute these proxies into your service communications? We had a lot of benefits in that we had a configuration management system that could deploy software and configuration, and we already had these outbound proxies in place because of the service discovery system we used, so this came pretty naturally for us; but it's something to think about in your own environment. How would you assign identities? This is really important, because an identity is a zone, a segment, in this kind of network; if you have a highly specific way to refer to things that you can turn into TLS certificates, this may work really well for you. If you don't, you may need to do some work to get there. In our case, IAM roles are what we went with, but at the beginning of the project not every instance had an IAM role; we had to do a little legwork to get that enforced across our entire infrastructure. Will you need to manually configure these access control lists, or will you be able to automatically generate them? If you can automatically generate them, that's where you're going to get these huge efficiency wins, so that's something you really want to push for if you can. The other good news is that there are some available options on the market right now that can help do this for you. We hand-implemented the whole thing, but Istio and Consul, which are both solutions being pushed as sort of the new way to do service mesh, especially in Kubernetes, implement this sort of security system already. So, to be clear, this is not something that we totally invented; it's an idea that's been going around for a while, and Istio and Consul implement it for you in an easily packageable way. They do less on the automatic generation side, but you could easily build this sort of system using those tools. If you don't want to make such a huge leap and switch to a whole new service mesh system,
you can certainly implement the security benefits here with your existing service discovery stack, as we did. So, to sum up: I'm here to tell you that you can switch to a deeply authenticated network, and the reason you can do that is because you can make the changes here invisible, and you can make the system fast, because of these generated authorization maps and the automatic TLS. An engineer who was working on a microservice before this system and after it has basically the exact same experience: they still use the same API calls they always did; they still add a new dependency, get that change approved and merged to master, and then their service talks to it, no problem. Their flow remains the same as it always has been. But now, when an attacker compromises that Slack meme bot with some sort of, you know, meme injection or whatever it is, they find themselves in a network zone where they can talk to basically nothing, and all of the services that were wide open to them beforehand simply reject their connections out of hand whenever they try to go past a layer-4 connection. So this is something that I believe is possible; we've done it, and I think it's a great strategy as you try to build in the security that you weren't able to when your network first started. Thank you very much for listening. If you want to
ask more questions about the details of this, something that's not as easy to do in the Q&A section, you can hit me up on Twitter or by email, or if you just want to see what we're up to, check out the Airbnb engineering blog. Thank you very much.
[Music] Thank you, Max. If you do have a question, please line up at the microphones and try to limit your question to a single sentence. If you'd like to leave at this point, please do so as quietly as possible. Our signal angel has the first question from the internet, please.
[Music] So, I guess I said OpenSSL just as a random example; you can use whatever SSL stack works best for you. I believe the way our packages were being built used OpenSSL, but switching to something like BoringSSL or LibreSSL would probably be a good idea for further hardening. Thank you. Microphone
number two, your question. Hi, great talk. What are you currently doing to mitigate the increased risk of localhost-bound SSRF? So, yeah, SSRF is something that is deeply troubling to me as somebody working on appsec at a company that works almost exclusively with HTTP API calls. Our approach, honestly, is very dedicated static analysis: we are watching engineer-written code very vigilantly for anything that might make outbound HTTP calls and trying to ensure that it doesn't hit internal stuff. That's an area my team is trying to do a lot of work to improve, and perhaps that could be a future talk. Cool, thanks.
Microphone number one, please. A very interesting idea: are you going to include workstations? Workstations: it's an interesting thought. At the moment we don't have our workstations plugged into the same service discovery, so they don't have the sort of core proxies this would require. But I think that if that's an architecture you wanted to go to, it would actually lend itself pretty well, because if you're managing your workstations, you can give them identities just as well. You'd need a slightly different approach to identities, because you can't give a physical machine an AWS IAM role, at least for us, but if that's something your network has, then I think it's a very reasonable way to go. Signal angel,
your next question. Yes, we took a close look at it. All right, microphone number two: what are the cost implications of the implementation, and on your
operations in general? Costs are pretty low, because the reverse proxy is pretty efficient; it doesn't use a ton of extra compute, so we didn't have to scale anything up in order to support this verification. Again, being able to just do AES is pretty cheap. The generation of the map is very cheap; it's running on a single Kubernetes pod just running a Ruby script, and that's fine. Probably the greatest cost is simply the S3 transfer of that authorization map, and that's something we think we're going to be able to continue to reduce by being a little smarter about how often we check: certain areas of the network evolve very infrequently and don't have a lot of topology changes, so we'd be able to sync those a lot less frequently. That's something I think we can improve, but overall the cost is pretty low. Signal angel, any more
questions from the net? Okay, then microphone number two, please. In terms of
certificate authority: how are you managing the lifetime of the certificates, and what kind of considerations did you make on that side, like certificate expiration, renewal, also OCSP, whether it's already implemented or not? Yeah, so we want to get to a point where our certificates expire a lot faster. There are some companies that have done a really great job with certs that only last about a month, or even two weeks, something like that, and unfortunately I think our infrastructure isn't at a point where we can reliably reduce it that much. Our current approach is that we can ban things by introducing them into a denylist in the allowlist generation stage, and that will result in something being banned within about four minutes; that's how we deal with active compromises. But there's a longer-running effort to increase our infrastructure refresh rate so that we can have really short-lived certificates to deal with those sorts of, like, stolen-cert attacks. Thanks. Your question:
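The revocation-via-denylist approach described in that answer could look roughly like this at web-file generation time; all identities are hypothetical:

```python
# Apply a denylist while generating the authorization map, so a
# compromised identity is stripped from every service's allowlist
# the next time the map is published (within about four minutes in
# the system described). Names are made up for illustration.
def apply_denylist(web: dict, denied: set) -> dict:
    return {service: allowed - denied for service, allowed in web.items()}
```

Because every node pulls the regenerated map, this acts as a coarse but fast revocation mechanism without needing short-lived certificates.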
Since you do all of this on a flat layer 3 network, and you already mentioned payment information: what does this mean for your PCI DSS scope, and how does it affect certification, if you handle payment data and the systems are connected to other systems in your network rather than being separated by firewalls or something?

So, our PCI network is a little interesting: it's actually a totally separate thing from most of the Airbnb production network, so that specific certification didn't affect us. But I think we've also had a pretty effective time convincing auditors that this is an effective way to do access control, even though it's happening at a layer that is not the traditional standard. For PCI DSS specifically, our cardholder environment is actually just a webpage that syncs to Braintree, so we don't have to handle that one ourselves, but this approach is something that has been received pretty favorably by our compliance folks. Thank you. Signal Angel?
How did you get management's and the engineers' buy-in for the changes described in your talk? What objections did they raise, and how did you address them?

Thank you, that question is something I like talking about. I think a lot of security work is actually about being a good salesman for your solutions. Whenever you're presenting something with such a wide scope, it's crucial to make sure there's something in it for the stakeholders beyond just security's goals. A lot of those things for us were around developer ease and productivity: reducing the pain engineers were feeling in trying to set up their own TLS implementations or their own authentication stacks, plus the performance benefits I discussed. These were all things that other infrastructure and product teams heard about and wanted, so they were very open to our original request. From there on, it was all about being a good steward of the operation: having really good operational plans, showing that we'd done our homework in terms of testing, and really thinking like an infrastructure engineer or an SRE instead of just a security engineer. Security is our ultimate goal, but we need to make sure we're not burning our credibility with the rest of the organization while going for it. So there was a lot of time spent thinking: forgetting about all the security benefits for now, how am I going to make sure this isn't going to take everything down? Thank you very much.
Microphone one, your question. Do all nodes have the whole allowlist files, and what technology stack do you use to apply them to Envoy?
Yes, everything gets an allowlist file. The technology stack is JSON: basically, there's a very small shim that downloads this file from S3 and puts the relevant list of allowed identities into an Envoy configuration file, and then Envoy uses its automatically updating SDS configuration to load that every few seconds. That's how the synchronization works. Okay, thanks.
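The core of such a shim could look like the following sketch. The function name, JSON schema, and output field are all assumptions for illustration, not the real format: given the downloaded allowlist JSON, it extracts the identities permitted to reach this node's service and renders a small JSON fragment for the proxy to pick up on its periodic reload.

```ruby
require 'json'

# Illustrative sketch of the sync shim (field names are hypothetical):
# parse the allowlist downloaded from S3, look up this node's service,
# and emit a minimal JSON fragment listing the allowed client
# identities for the proxy configuration.
def render_allowed_identities(allowlist_json, service)
  allowed = JSON.parse(allowlist_json).fetch(service, [])
  JSON.generate('allowed_identities' => allowed.sort)
end
```

Writing the fragment to disk and letting the proxy's own periodic-reload mechanism pick it up keeps the shim stateless: a crashed or restarted shim just re-downloads and re-renders on its next run.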
We'll come to the last question. Have you considered using pub/sub to distribute just the relevant metadata, based on the x.509 identity of the clients, so that you're not giving every node all the information about the entire map for the entire network?
Yes, so you can rather easily segment what information you're providing; it's really just a matter of engineering time. At the moment we have pretty wide availability of that data through other service discovery mechanisms anyway, so it wasn't a priority for us, but it would be relatively easy to customize that availability. In particular, since everything is ephemeral and everything has its own IAM role, you can simply make IAM-role-specific allowlist files in S3 and set up the permissions to allow just those roles to access them. So that actually wouldn't be that hard to implement. All right, thank you.
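The permission setup described there could be expressed as an S3 bucket policy along these lines. This is a hedged sketch with made-up account ID, bucket, role, and key names: each IAM role is granted read access only to its own per-role allowlist object, so a node can never download another service's view of the map.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowOwnAllowlistOnly",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:role/search-service" },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-authz-maps/per-role/search-service.json"
    }
  ]
}
```

One such statement per role (or a generated policy) pairs naturally with the map generator emitting one file per role instead of a single global file.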
Thank you for answering all the questions. Please give him a round of applause for his patience!