
Operationalizing Unknown Cloud Deployments (In a Repeatable Fashion)


Formal Metadata

Title
Operationalizing Unknown Cloud Deployments (In a Repeatable Fashion)
Number of Parts
45
License
CC Attribution - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor, and the work or content is shared, also in adapted form, only under the conditions of this license.

Content Metadata

Abstract
Cascadeo will demonstrate how they use Chef to deploy and manage operational infrastructure in multi-cloud environments for their managed services customers. Chef-driven automation deploys, configures, populates inventory, and validates the telemetry application stack in a distant customer-owned cloud account. Our engineers will demonstrate visualization and reporting based on this data: tickets, device performance graphs, etc. as well as connectivity to services like Slack and PagerDuty for notification and escalation. Companies struggling with operational, monitoring, performance and analytics challenges will find this presentation particularly engaging, as will individuals interested in self-healing distributed systems at scale.
Transcript: English (auto-generated)
All right, well, I'm really flattered and amazed by the turnout here. We've got standing room only, so thank you all for coming to this session. This is my first time presenting at ChefConf, although Michael Norring, our CEO, was here last year, I believe. The topic for today is operationalizing other people's poorly deployed cloud deployments, or, as I like to put it, migrate in haste, repent at leisure. This is born out of a lot of experience we've had doing enterprise cloud migrations and systems architecture projects over the years. But why should you believe me, or us, about what we have to say?
First thing is, we don't sell things. We're not here to resell or convince you to buy a product of any sort; we only do professional and managed services, and that's the only way we make money. We're vendor neutral, so there are companies like Chef that we like, that we work with, and that we support, but we are not beholden to them, and when we think they're doing something wrong, we are quick to point it out to them and to our clients. We've been around for 11 years. We started out doing P2V virtualized data center migrations and data center build-outs early on, pre-Chef and pre-Puppet: Bash scripting and deployment automation back before there was such a thing.
Most of us came out of the ISP and backbone enterprise software world. One of the things I think is kind of fun about us: we're not a huge company, but we're big enough that we've made premier status with two of the three major cloud providers, Google and Amazon Web Services, and I believe we are the only boutique consulting and professional services firm in the premier consulting category with them. All of the others are mega-companies like Accenture and Capgemini and whatever else. Okay, so, me: I started Cascadeo about 11 years ago, after lots of time in the ISP, hosting, and backbone community. I built one of the first dial-up and DSL ISPs in Seattle, did a bunch of ISP mergers and acquisitions, and learned a lot about other people's deployments through the process of consolidating the eight or nine different ISPs we acquired, with all their legacy systems that had been built by hand. And a lot of the people who had built them, of course, were gone.
I also worked on Silicon Valley; some of you have probably seen that show. It's kind of a fun show. If you remember, in season two there was a whole bit with a data center in the garage and big Bitcoin-mining rigs for servers; that was my idea and contribution to the project. I was also heavily involved with the Seattle Internet Exchange, a public Ethernet switch in Seattle that is the world's largest free exchange point; we move almost a terabit per second of backbone traffic across the SIX now. I was one of its founding board members and have been involved with it for quite some time. I also built the distributed telemetry system for one of the cloud platforms behind the scenes, so we do work not just for clients but also for some of the providers, and of course it's Chef-based, so it's definitely appropriate here. I also went to UW for undergrad and grad school, and I used to fly airplanes; then I had kids and decided that was not a good idea anymore. I still like to travel and ski, so the airplane days are done, but I still try to ski as much as I can. The picture at the top, by the way, is my youngest, Oliver, and on the bottom is me eating a gigantic maggot in the Amazon many years ago whilst traveling. It was revolting: the head was crunchy, the body was squishy, and it was one of the worst things I've ever done.
But I did it, and I'm here to tell the story. Okay, so, like I said, we've been recognized by a bunch of these companies as not boneheads. We're also big fans of whale sharks; you might see some of those popping up from time to time. That's an inside joke some of you will probably find out about someday. But by no means do we only do these things; these are just companies we end up working with a lot, because they tend to be big players in the industry, and ultimately we're demand-driven: we follow where customers want us to go, because we're consultants and they pay us to do stuff. Okay, enough marketing. I'm not going to babble on about Cascadeo or about what we do, and at the end of this I'll give you a demo of our product, which has never been publicly seen before.
What I want to talk to you about today instead is the hell that is other people's stuff. When you inherit somebody else's environment, either at your job, at a new job, or because you're a hosting provider or an MSP, how do you rationalize it, take control of it, and have some sense of predictability around it, when half the time it wasn't even documented, it was built by hand years ago, and the people who built it are long gone? It was really easy when we started out doing this, and that was about 10 years ago. We used to always have this conversation with clients: what are you trying to do, and what are your parameters? Because we can do it really fast if we throw a lot of bodies at it, and if we pick the right bodies we can do it fast and get it right, but that gets expensive. If you want to hire a bunch of random bodies to fill seats, that might be cheaper on a per-unit basis, but it might take longer, and the correctness factor might go down. So pick any two: which of those do you want? And that used to work; people used to be willing to accept it as a reasonable compromise, because everybody else told them the same thing. That was just the reality of the market. Today I think it's gotten better, in part because of good tech like Chef, and in part just because expectations from consumers have changed.
I think durability has gotten really good, and by durability I mean we're not going to lose people's data. That almost never happens unless you're really, really bad at your job, or you have some horrible inside-sabotage scenario, I guess. Moderately high availability: I think we've gotten to the point where even legacy vertical ERP apps can be built and operated in a way that keeps them fairly stable and reliable, although, as was mentioned in the keynote this morning, there's this culture of fear and paralysis around change management and operations, because the consequences of screwing up these big monolithic systems are so bad. If you knock the one mainframe at United Airlines offline, you've just stranded a bazillion people and cost the company a hundred million dollars; no one wants to be that guy, or the person at Amazon who took down S3, for that matter. And sorry, I don't mean to pick on them. I haven't taken down S3, but I've done some dumb, dumb things, which you'll see soon, actually.
And then scalability: we've gotten pretty good at building scalable systems. But usually you get two of the three; you don't really get all three. The problem is that customers, and I'm not talking about the people who pay me, I'm talking about the people who pay our customers, have changed their expectations dramatically. I'm going to use Facebook as a good example, but it's by no means the only one. Customers view this as first and foremost theirs. It's no longer your data; it's no longer your company's data. In their minds, those cute pictures of their kids on a tricycle belong to them, even though licensing-wise maybe they belong to Facebook. They always want to be able to get those pictures, no matter where they are; they want to show their mom, their grandma, whenever, wherever. They don't want the system to be slow. They don't want it to go down; they want it to work all the time. They don't want you to leak their personal photo library; they don't want you to lose it. They expect all of this stuff to just work, automatically, and when it doesn't, they get really, really upset. If you're trying to share some cute kid pic with grandma and it doesn't work, or you're trying to FaceTime with grandma and it doesn't work, it's an incredibly upsetting and frustrating experience, especially for non-technical users. And the reality is that the majority of consumers in the world are not very technical. If you explain to them, "Well, there was this, that, and the other, and I lost your data," they don't hear this, that, and the other; they hear, "What do you mean, you lost it?" So I think we're in a really challenging spot, where the implicit assumption customers carry into a transaction with any kind of vendor these days is that they're going to get all of this stuff right, all the time. So what do companies do in practice? This is my oldest, Isaac.
Isaac, getting into the pool for the first time, decided to just jump right in and see how he liked it, and I would say this pretty much sums up lift and shift in my mind. Lift and shift, for those unfamiliar, means taking some VM or server that has been running, somehow moving it over into some other person's or company's environment, hitting go, maybe playing with some knobs and levers to get it to work; then it works, and that's considered done. I posit that this leads to that scenario, even though it's not quite as instantaneous. He realized instantly that jumping in the pool was probably not going to work out the way he thought it would. Most people who lift and shift and get it to work run in that state for some period of time
before the real misery begins. Why is this a bad idea? Okay, you're all Chef nerds, so you probably know a lot of this already, but I'm going to kick through it quickly just to make sure we've covered it. First and foremost, if you have machines that humans build and maintain by hand, over time they get crufty: stuff piles up, versions get layered on top of versions, and your ability to accurately describe the state of that system, or to recreate it at any given time, is massively impaired. Couple that with all of the recent revelations, courtesy of Snowden and others, about the extent to which not just the quote-unquote bad guys but also the good guys are mass-deploying malware and rootkits, even at the hardware level. When you have VMs with an unknown operational history that have been running for months or years, I think the odds of their being compromised in a way you're never going to find, barring network traffic analysis, go way up. So short-lived VMs trump long-lived ones, if only because there's less cruft, and less time elapses between when a compromise occurs and when it's hopefully backed out. But bigger than that, I think, is that a lack of repeatability means that when you need it most, you may not be able to recover. I think we've all been through enough operational experience to know that in the real world,
DR/BC plans exist on paper; they get exercised infrequently and imperfectly, if at all. Sometimes those exercises are not a full failover and failback; they're just a dry run or dress rehearsal. And a lot of companies will say things like, "It's okay, because we have a snapshot, and if things break we can just roll back to the snapshot." Which is great, right up until that doesn't fix the problem. So many times people say, "We'll just roll back," and then the bug resurfaces, and they say, "Well, it wasn't here before." That doesn't matter. It doesn't matter if it was there before; just having a snapshot is not enough to recreate the system state to a known, quantifiable objective. And then the last one: band-aids become fossils. If you have humans in there applying bandages under the gun at 2 in the morning on New Year's, no one's going to go back and fix those things and clean them up, and sooner or later someone's going to forget, and then something's going to break, and you're going to go through that whole cycle again.
You're going to take a second incident of a human trying to figure out, fundamentally, why it's broken, rather than letting it heal itself. And it's even worse than this, because, and again, we're friends with a lot of the cloud providers, but we do not represent them or their interests: they want you to do lift and shift, by and large, because it accelerates revenue for them. If you move VMs into their cloud environments and hit go, the meter starts spinning, revenue numbers go up, Wall Street smiles, and everybody wins except, I would argue, the customer. There are corner cases where lift and shift is not evil, but those are carefully considered and, I think, thoughtful decisions, as opposed to using lift and shift by default,
which is what we see a lot of companies in the field doing. A lot of them have the mindset of "let's just get the stuff into the cloud, and then we'll come back and automate it later." Having been through a lot of these now, at pretty much all stages of the public cloud era and the managed hosting that came before it, I would argue that even though it sounds easier to run a VM through a converter and maybe make some tweaks to it, you often end up spending more time and energy dealing with driver issues, or volume replication issues, or whatever it may be, and it ends up being the case that manually redeploying would probably have been more efficient
than just trying to suck the thing into the cloud of your choice and hitting go. And then the last thing, which I think we all know: there's no such thing as doing it once in the cloud. You're going to get the subnetting wrong, you're going to get the addressing schema wrong, you're going to get a new Amazon account, you're going to hit something that necessitates redeploying even when you think it's done. It's never done. If you're going to do it twice, you're going to do it thrice, and if you're going to do it twice or thrice, you might as well take the time to write some automation around it. And yet, as a managed hosting provider, which we are, we're stuck with it, because customers come to us and say: help us, help us. Here's a common scenario.
Help us, help us: the guy who moved us into Amazon is gone, there's no documentation, and the system is running in production. The airplane is at 30,000 feet, it's on autopilot, and it's working, but we don't know how much gas it's got, and we don't know where we're at with maintenance. And so we collectively, and my little company, are stuck with this. The cloud providers want it; they are spending buckets of money driving it, because, again, they view it as a land grab: get the customers in and we'll sort it all out later, because whichever cloud they land on, they're going to stick. And most of these companies have already done it. A lot of the companies we interface with
are no longer in the thinking-about-cloud stage. They're in the "we've already got stuff running in production, even though we don't really have it set up correctly; what do we do?" stage. This was the case, on that last point, when I was an ISP and web hosting person, and it's certainly the case today: trying to be the ops staff for a thousand deployments, done a thousand different ways by a thousand different humans, is almost impossible. These are all one-offs, with their own little tweaks, their own little vertical apps, their own little scripts and cron jobs and whatever else, and trying to operationalize that is really, really hard.
All right, another painful reality: a hundred percent of everything, including humans like myself, fails in practice. If you have anything that runs for a long enough period of time, something horrible will happen. There'll be a fire, there'll be an earthquake, there'll be a nuclear incident, or, in this case, if any of you have been to the Philippines, there's a drink called Tanduay. Chris knows about this. Tanduay: they call it rum, but I assure you it's not; it's something far, far more dangerous. Another thing you've got to think about here is scale, so I like to pick on Airbus a little bit. The A380 is obviously a miracle of engineering.
It's so big, in fact, that they had to make special airplanes just to move the parts to build the A380. And then when they made the A380, they had to widen the taxiways, put in new jetways, and stand up a whole new slew of facilities to accommodate something that had scaled beyond what the rest of the industry had decided the natural vertical scaling limit for aircraft was. It's cool, but boy, that's a really extreme corner case. There aren't that many A380s, there aren't that many airports that can even accommodate one, and it takes a heck of a lot to fill one of these things up with 700 passengers. Now, if you look at Tesla's approach to the Gigafactory and contrast it with the A380,
I think it's kind of an interesting comparison. The A380 is this giant, massive, vertically scaled system that, if it blows up, kills 700 people and costs a billion dollars, so you'd better get it right every single time; you cannot afford to have these things fall off the back of the bus. And they all get built by hand, with customizations: different airlines want different seating configurations, and it's all built to spec. Tesla, if you've been watching, there's fascinating stuff on the Gigafactory design and how they're doing it. The key thing is that they collectively view the Gigafactory itself as the product. The product is going to be made more than once: they're not just going to make one of these things, they're going to make lots of them, automate the crap out of all of them, and iterate on them. They're going to make lots and lots of batteries and put some of those batteries into cars and power packs; some of them are going to fail, there are going to be fancy ones and cheap ones, but at the end of the day it's an automation-and-scale project, instead of a "let's build the perfect thing that never falls out of the sky and can scale to impossible limits because we can" kind of project. Both are cool, but I would argue the latter is probably more forward-looking from an overall engineering and architecture standpoint. So how do we fix the mess? There are all these deployments out there, customers have already spent money, and they're not going to tear them down. "My whole company runs on this legacy vertical line-of-business ERP app,
and I've moved it into, let's pick on Google today, so it's running on Google Cloud, even though I don't really know how it runs, I don't really know what's going on, and things just got worse." I think a lot of this has to do with the same fundamental design flaw that led to this; this is Fukushima, by the way, but it applies to any incident of this kind, where systems are built on a fundamentally flawed assumption: that humans will be able to maintain positive control over the system at all times. Fukushima had triply redundant cooling systems, all active: they had grid power, they had diesel generators, and they made their own power with a nuclear reactor. It turns out that triple redundancy doesn't mean anything if a tsunami comes and wipes it all out. The big problem was that they thought they could maintain operational control, and they baked that assumption into their engineering plan. This is what companies in the non-cloud world do with their IT: they implicitly assume that they can maintain these things and fix them, and that if something faults, they can run fast enough to press the button before the thing melts down. And sooner or later, 100% of everything sucks, given a long enough runtime. You probably already know some or all of this, but I'll quickly recap it, because I think these points are important if you're building apps for the cloud or refactoring for the cloud.
There are always compromises to be made, in no small part because you don't have good visibility into, or control over, the underlying network and systems infrastructure. On the first point, Google, among others, has taught us that good enough really is easier than you think. Think about Gmail: Gmail sometimes says, "Whoops, sorry, I can't load your mail right now." The UX is still there; customers don't feel like they're getting a 500 or dead air, but maybe your mailbox is offline for a moment. I think availability is overestimated in most people's minds. People assume they need five nines, when in reality getting there is astronomically difficult and expensive, and if you can preserve the endpoint and the illusion of uptime and availability, even when there are back-end failures, people are much more forgiving and tolerant. Another thing people often don't fully contemplate and understand is data loss. They'll say things like, "Well, if it breaks, we have backups." Okay.
So how often do you run your backups? "Well, we run them every 12 hours." And do you test them every 12 hours? "No." So what happens if the backup is bad? "Oh, well, it won't be." Okay, let's hope so. I think if you are willing to write down, set in stone, what your tolerance for data loss is, and how quickly you have to be able to recover, you can make a lot of good engineering decisions and trade-offs that other, non-technical people can understand, about what that means in practice. I'll give you an example. If the boss says zero data loss, absolutely, unequivocally, not one transaction can be lost, this is Visa and MasterCard and you cannot lose one credit card transaction: that's fine, but that means synchronous data replication, which is very, very expensive to do at scale. If you say instead, "Look, I'm willing to lose no more than n minutes of data, I understand the cost of that, and I can prove at all times that that is the worst-case outcome no matter what," that's a different scenario, and oftentimes it's much, much less expensive than trying to build a system that never fails.
Then scalability: obviously, running big, monolithic, vertically scaled apps is not fun. Running lots of parallel production deployments, with some kind of traffic-management or load-balancing tier in front of them, is better. Not all legacy apps can be retrofitted easily to this; for the ones that can't, good luck, you just have to live with them until they're gone, I guess. But many of them you can actually wrap orchestration frameworks around, run multiple copies of, and shard the workload across, and we've done a lot of this work, extending the life of vertical apps that need to keep running at increasing scale without trying to vertically scale forever.
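As a toy illustration of that sharding idea, here's a minimal sketch, with made-up backend names, of a thin routing tier that pins each tenant to one of several identical copies of a legacy app by a stable hash:

```python
import hashlib

# Hypothetical pool of identical copies of the legacy app.
BACKENDS = ["app-1.internal:8080", "app-2.internal:8080", "app-3.internal:8080"]

def pick_backend(shard_key: str) -> str:
    """Stable hash routing: the same tenant always lands on the same copy,
    so each copy only ever sees its own slice of the workload."""
    digest = hashlib.sha256(shard_key.encode()).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]

print(pick_backend("tenant-42"))  # always routes tenant-42 to the same copy
```

A real tier would also handle health checks and rebalancing when a copy dies, but even this crude split lets a vertical app scale out instead of up.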
And then security: I'm very much a pessimist. We take the assumption-of-breach philosophy, which is to say that everything is compromised, soup to nuts, including the hardware you're using right now. We don't even know how many different parties have compromised it; we just know there's at least one, and probably more than one. And so our focus shifts instead to blast radius: when a breach occurs, how do we limit the destruction that's inevitable? You don't want your personal iMessage chat history leaked to the internet; that would be absolutely devastating to Apple, so they've gone to great lengths to limit the blast radius if and when iMessage gets compromised, to make sure that there's hopefully
never, you know, a "Fappening" moment with a bazillion people's iMessage accounts getting dumped to the internet. The real solution to all of this, at the end of the day, is microservices. We define that internally as having three components. First, interaction-layer stuff: this talks to the world, and could be API endpoints, web endpoints, or mobile apps. Second, coordination: some system that deals with asynchronous event handling, be that polling, pub/sub, or just API events coming in. And third, a bunch of back-end services; this is usually where we try to relegate
legacy applications. I'll pick on Alaska Airlines, by the way the best airline in America at this point, if you haven't flown them. Alaska Airlines did a magical thing: they took all their legacy airline ERP stuff, which probably runs on mainframes in the back, and did a great job wrapping their web interface, their mobile user experience, and their check-in experience with really good interaction- and presentation-layer pieces, plus the ability to deal with transient failures of various forms. So even though I'm guessing the back end is some ancient system, it doesn't feel like using Sabre on American; it feels like using a modern, properly designed web app.
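As a toy sketch of that three-part split (all names here are hypothetical), the idea is that the interaction layer accepts a request immediately, the coordination layer carries it as an asynchronous event, and a back-end worker, which could wrap a legacy system, consumes it, so a slow back end never stalls the user-facing side:

```python
import queue
import threading

events = queue.Queue()  # coordination layer: a stand-in for real pub/sub

def interaction_layer(request: dict) -> str:
    """Talks to the world (API/web/mobile): accepts the request at once
    instead of making the user wait on the legacy back end."""
    events.put(request)
    return "accepted"

def process_with_legacy_system(req: dict) -> None:
    """Placeholder for the wrapped legacy application."""
    print("handled", req)

def backend_worker() -> None:
    """Back-end service: in a real system, failures here would be retried
    or dead-lettered without the user ever seeing a 500."""
    while True:
        req = events.get()
        process_with_legacy_system(req)
        events.task_done()

threading.Thread(target=backend_worker, daemon=True).start()
print(interaction_layer({"check_in": "flight 42"}))
events.join()  # let the async work finish before the demo exits
```

That decoupling is what makes an Alaska-style experience possible on top of an ancient back end: the interaction layer stays responsive no matter what is happening behind it.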
Another thing we see companies get wrong all the time, almost every single one of them, and I don't know why: I think there's an implicit assumption that if you move stuff to the cloud, the network is now somebody else's problem. A lot of these companies also made really bad design assumptions in their systems architecture, like: here's the app server, here's the database server, and there's a wire connecting them, or a switch between them. Then all of a sudden they take the app server and move it into the cloud but leave the database server behind, and now that traffic is going across a wide area network with higher latency, and they can't quite figure out why the app is slow.
So here's the thing: the network is more important than ever before, and there's no way to make it entirely somebody else's problem. Look at things like Meraki, if you haven't seen it yet; this is a company Cisco acquired, fascinating technology, and I'm absolutely convinced it's the future of network engineering. I used to be the BGP guy who got to sit there and feel important and special because I knew it and nobody else did. Meraki is absolutely the right way to go, not because of them as a company, but because of the approach, and the general approach is this: take all of the control-plane, configuration-management, and analytics stuff and make that a SaaS. They sell you the box, but the box is basically lobotomized and just contacts the cloud mothership to get its configuration data. This lets you do all sorts of things, like API integrations and templates that get uniformly deployed. You've heard this story before: it basically lets you apply good practices in configuration management and deployment automation to your network, which most companies absolutely neglect to do.
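In that spirit, here's a minimal sketch of what a lobotomized device's phone-home loop might look like; the controller URL, device ID, and payload shape are invented for illustration and are not Meraki's actual API:

```python
import time
import requests  # third-party HTTP library: pip install requests

CONTROL_PLANE = "https://controller.example.com/api/v1"  # hypothetical mothership
DEVICE_ID = "switch-0042"

def fetch_desired_config() -> dict:
    """The device keeps no local source of truth; it pulls its rendered
    template from the cloud control plane."""
    resp = requests.get(f"{CONTROL_PLANE}/devices/{DEVICE_ID}/config", timeout=10)
    resp.raise_for_status()
    return resp.json()

applied_version = None
while True:
    config = fetch_desired_config()
    if config.get("version") != applied_version:
        # A real device would now program ports, VLANs, firmware, and so on.
        applied_version = config.get("version")
        print("applied config version", applied_version)
    time.sleep(60)  # poll the control plane every minute
```

The design choice worth noticing is that the desired state lives centrally and is versioned, so a fleet of boxes converges on the same template instead of drifting apart one hand-edit at a time.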
I can't tell you how many times I find firmware that's five, seven, eight years old running on production switches, and when I ask why, the answer is, "Well, it's stable, it works." And yeah, I guess that's technically accurate, it is; but what happens when it doesn't? When was the last time anybody exercised any of their operational process around network failures? It's amazing to me how little attention this gets in practice. All right, I don't mean to sound all doom and gloom, but a lot of this stuff isn't going to get fixed. For one thing, companies have bazillions of dollars wrapped up in investments they're not eager to write off; a lot of them have spent big money on data centers and
whatever else, and I feel like all of these problems are just getting worse over time. So how do we solve this? Our attempt, and like I said, this is the first time we're showing it in public, is to build a platform for making all of this less error-prone and less human-driven. The idea is to make it as valuable as it can be to everybody; we want to deliver these automations. First and foremost, we're not interested in pretty UIs and flashy marketing demos. We want to make it so that customers can fire us; we want to leave value behind. We will give you something for free, and you can fire us and keep it. What we're really trying to do is take all of the painful lessons we've learned over the years doing this on a consulting
basis and on a managed-services basis and turn them into something like a SaaS service. So here's the idea, and it's super, super simple. We have a bunch of tooling, of which Chef is proudly featured, that will colonize a customer's existing infrastructure deployment; this could be on-prem or in the cloud. By doing this, we automatically set up an entire analytics pipeline, alerting pipeline, visualization, and so on, and we do it using open-source software, so that the customer can fire us, keep running the stuff, and not have to pay us if they don't think we're bringing enough value to the table. So here's an example of telemetry data being output by the system into Kibana, which many of you have probably used: obvious system metrics, nothing too exciting here, except that all of this can now be done without a human touching anything, in an environment we've never contacted before. We're working really hard on
smart auto-discovery: how do we get a beachhead in and then figure out as much as we possibly can about everything, including, and this is a key thing, any existing monitoring or application performance monitoring? We also want to support more cloud providers; right now this is really AWS-centric, because that's where the market is going, but we want to make it agnostic.
But the real goal here, and I'll get to it in a second, is the analytics angle. What we really want to do is teach people something that I think 99% of companies have no idea how to actually do in practice, and that is machine learning on system telemetry. I learned this because I did it for one of the cloud providers, and it took years to really understand: if you can record every single data point about every single thing and feed that into machine learning, you can get way, way better at operations. In fact, you can often see things before they go wrong and prevent outages, and I'm not just talking about the disk filling up; I'm talking about differences in the way applications behave and interact with each other. So our real goal, at the end of the day, is to bring this down-market and make sure every company can benefit from it to the extent they want to. So, with that said, I'm going to attempt to play a video here. It's a time warp, because it takes a while to actually deploy these things, so I'll walk you through it.
Ignore the clumsy UI here; this is just the prototype front-end stuff. Here you've got security groups and key pairs and instance types and subnets, very exciting, I know. You can see the form getting filled in; unfortunately it's getting cropped off the screen there, and I don't know why that's the case.
You'll see here we've got our subnet, and we're going to go ahead and auto-discover. All we've given it at this point is an Amazon account and a subnet we're interested in, and with that, a provisioning process goes in behind the scenes and launches an instance; you can see it's called "auto discovery" there. Chef is used on the instance to deploy, in this case, Zenoss Core, an open-source monitoring package we've done a lot of work with over the years. The idea is simple: we have a basic configuration that tries to auto-discover as much as it can about the environment and then pushes that back up into our SaaS service.
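Heavily simplified, and with every identifier made up, that provisioning step might look something like the sketch below: launch a short-lived discovery instance into the customer's subnet, let its bootstrap (a Chef run, in our case) do the discovery and report back, then terminate it:

```python
import boto3  # AWS SDK for Python

ec2 = boto3.client("ec2", region_name="us-west-2")  # customer account credentials

def launch_discovery_instance(subnet_id: str) -> str:
    """Launch a short-lived instance in the customer's subnet; its bootstrap
    performs auto-discovery and pushes results up before termination."""
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # hypothetical discovery image
        InstanceType="t3.micro",
        SubnetId=subnet_id,
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": "auto-discovery"}],
        }],
    )
    return resp["Instances"][0]["InstanceId"]

def terminate(instance_id: str) -> None:
    """The instance is ephemeral by design: once discovery is done, kill it."""
    ec2.terminate_instances(InstanceIds=[instance_id])
```

Because the instance is disposable, nothing about the discovery pass lingers in the customer's environment except the inventory it reported.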
Again, think of how Meraki took the control plane for networking and moved it into a SaaS service: we are trying to take the control plane for monitoring, telemetry, alerting, analytics, all of that, and move it up into the mothership, so that companies don't have to run it unless they want to, and I'll get to that part in a minute. You can see here that Chef is running and Zenoss is running, trying to find things, and now, populated on the left, we've discovered a bunch of assets. There you can see that the auto-discovery instance was automatically terminated. We're building a big back-end API, using Swagger, that will give users, customers, and our own tooling access; again, think microservices. We're not trying to lock anybody into using our tooling, our scripting, or our front end. We want to expose as much of this as possible to our customers, so they can repackage and reuse all of these components in the ways that make sense for their business. We're not trying to lock people into a specific monitoring tool or visualization tool or orchestration framework; we're just trying to give them the option, but not the obligation, of having these things deployed automatically for them and operated for them under contract if they wish; not required or expected.
So now we've moved on to actually deploying the real thing. Here's Chef running once again; this time we're deploying a persistent instance that will run the Zenoss Core application and do data collection, alerting, thresholds, monitoring, visualization, what have you. And if at the end of this Chef run the client said, "You guys are stupid, you're fired," that's fine. They can still keep all of the work that's been done to date, because it's in their account and their environment, they have their credentials, and we have no further interest in it.
We expect, by the way, to include far more than just Zenoss in the suite of things that we will gladly deploy, wire up, connect, and SaaS-ify for you. But we also keep in mind the idea that at any point we should be expendable; we should not be mandatory for you to keep using any of this infrastructure, because, after all, it's open-source software and it's your environment.
So now we have a bunch of devices here that are actively being monitored, and you can see they're already emitting tickets into Zendesk; this literally takes just minutes to wire up. It also wires into PagerDuty and into Slack, so you can get alerts and notifications. In this case it's a managed-services prototype customer, so you can see the organization ID there. You can log into the Zenoss Core web app and see that it's already been pre-configured with all the devices you auto-discovered, and whatever else. And again, these are the events generating the tickets coming in right out of the gate. So, without any human interaction, or SNMP community strings, or WinRM
credentials, or whatever else, we're able to tell you some things. And if you wait 20 minutes, which we're going to fast-forward through here, you'll also find that all of the telemetry we collect from Zenoss and other sources, think CloudWatch, think your existing monitoring tools, think APM tools, gets fed into Elasticsearch, Logstash, and Kibana for visualization, analysis, and, you guessed it, predictive analytics and machine learning functions.
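For a flavor of what the simplest version of machine learning on telemetry can look like, here's a small sketch over made-up data of the kind of rolling-baseline anomaly check you might run on a metric series pulled from Elasticsearch before graduating to real models:

```python
from statistics import mean, stdev

def anomalies(samples, window=30, threshold=3.0):
    """Flag points sitting more than `threshold` standard deviations from the
    rolling baseline of the previous `window` samples: a crude stand-in for
    'seeing things before they go wrong'."""
    flagged = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma and abs(samples[i] - mu) > threshold * sigma:
            flagged.append((i, samples[i]))
    return flagged

# Made-up latency series: steady around 20-22 ms, then a spike at the end.
series = [20.0 + (i % 3) for i in range(60)] + [95.0]
print(anomalies(series))  # -> [(60, 95.0)]
```

Real systems replace the z-score with learned models of how services behave and interact, but the pipeline shape is the same: telemetry in, baseline learned, deviations surfaced before they become outages.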
This is the general idea behind what we're building, and we think it's the right approach, given that there's only so much you can do to forcibly colonize someone else's deployment, especially if the people who built it aren't there to help you. And with that, if anybody's looking for me, here's how to get hold of me. I'm happy to take any feedback, or rotten vegetables, and I hope you all have a great ChefConf.