
Planning for the worst


Formal Metadata

Title
Planning for the worst
Part
91
Number of Parts
169
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, adapt, copy, distribute and make the work or content publicly available for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner specified and that you pass on the work or content, including in adapted form, only under the terms of this license.

Content Metadata

Abstract
Alexys Jacob / Guillaume Gelin - Planning for the worst. Sharing our worst production experiences and the tricks, good practices and code we developed to address them. ----- This talk is about sharing our experience of how we handled production problems at all levels of our applications. We'll begin with common problems, errors and failures and dig into more obscure ones, while sharing concrete tips, good practices and code to address them! This talk will make you feel the warmth of not being alone facing a problem :)
Transcript: English (automatically generated)
So I guess I'll have to ask someone I know to pass around the mic, because as you may know or remember, this is an interactive talk. We prepared some stuff to follow along, but we are really hoping to have a discussion, some thoughts and some sharing of experience.
I guess it should almost be time to start. I don't know.
I guess you'll agree that we'll now proceed. Okay. So the topic of this talk, as I just said earlier — this is an interactive talk, so we're really hoping to share
your experience. Why did we want to make this talk? This is ramnes, I'm ultrabug, and we work at Numberly. That's where you can find us if you want to discuss things with us later. But to get back to the title of this talk, it's about
what happens when shit happens. The main thing in our daily job is that we run some pretty heavy-throughput web services that gather data for our
customers, and we can never be down. Downtime is not acceptable, and losing data, which is another story, is not acceptable either. So over the years we've developed some kind of practical reactions, and we have learned to develop and
design our infrastructure a bit differently and we're still learning. That's why it's an interactive talk because we don't claim we have the answer for every use case. So we wanted to start with the basic stuff
which will lead maybe, I hope, to the conversation we'll have. Let's take a simple example which Guillaume will introduce you to. As Alex said, this is like a very basic application like what you could have when you start a company or
anything. So you have Nginx, which serves HTTP requests. Behind that you have a Flask application, which handles all the logic stuff, and you put all your data in a MongoDB database, for example — but really, that could be any database. So the first example is what happens when your database
is down. So in this case, we have multiple solutions. So, yeah, for example, the server could be burning. But if you have that, you can just have a replica set of databases: if one burns, well, there are
still two or three other databases that can take the lead, and, okay, you continue to serve requests. Something else that could happen is that you run out of some resources. For example, you don't have RAM anymore. If you don't have RAM anymore, well, you
could trigger some automatic kills; uWSGI, for example, can do that. You can just say in uWSGI: okay, if that process takes more than, I don't know, one gigabyte, kill it.
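As a rough illustration, a worker memory limit in uWSGI can look like the following minimal uwsgi.ini sketch; the option names come from uWSGI, while the module name and values are made up:

    # uwsgi.ini -- sketch: recycle workers that grow too big
    [uwsgi]
    module = myapp:app          # hypothetical Flask application callable
    processes = 4
    # reload a worker once its resident memory exceeds ~1 GB (value in MB)
    reload-on-rss = 1024
    # hard-kill a worker stuck on a single request for more than 60 seconds
    harakiri = 60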
You could also use cgroups, like with Docker or anything, just to say: okay, this process can only have that amount of memory. Then, if you don't have any disk anymore — like if a disk burns out, if you have big failures — what can
help is RAID 1, RAID 10, anything. Basically, never run a web application in production on something that doesn't have RAID behind it. Another good thing you could have is a distributed file system, like NFS or anything. There are a lot
of things you could use. It's a good idea for some use cases; in other cases it can add other risks, but that's a choice to make. If you have a server overload — like the database
can't handle any more requests because it's already using all of its resources — well, there's not much you can do except monitoring it, so you know when it happens, and scaling horizontally, so you just add more servers and can handle more requests.
So if you have like some other ideas or some remarks about that, don't hesitate to tell us about it. Like Alex said, it's really an interactive thing.
And while you get the microphone, I'd like you to raise your hand if a backend database server has ever crashed your web service or any web application of yours. All right. So I guess you all have experience in this field. Like I said, we prepared basic stuff
like this, and we'll get deeper and deeper as the talk goes on. So, yeah, please. Hi. So, yeah, I think we all share that experience. My question would be: why don't you use, or didn't you use, any of the standard tools or solutions for
those kinds of problems? For example, for the list here, Mesos seems to be a good solution. I can answer that. For complexity's sake — who
doesn't know Mesos? Okay. So we already lost a lot of the audience. So just to get back to what it is — and correct me if I'm saying it wrong — it's a clustering, service-oriented
solution with resource management, so that it can spawn a resource somewhere, and spawn it somewhere else if the given server where it was running accidentally dies. Right. But
setting up Mesos and managing it is an overhead that you may or may not want to have. Kubernetes is also the same kind of thing by Google. Google platform runs on Kubernetes and it's also maybe a good solution. It depends on
the architecture. Here, yeah, we took a basic example with no automation whatsoever, also because we believe that sometimes simplicity, and the built-in features of the technology we use, are a better response than building a bigger infrastructure and again adding
complexity. Maybe you can save complexity by using the right technologies, or technologies that handle failure in the right way. Also, we won't talk about Mesos or Kubernetes in this talk, but this is really just the first example; in the next examples we'll go
to bigger architectures. Yeah, so it is my experience that I've heard a lot of similar responses from different teams, and the thing is, sooner or later they end up with a lot of moving parts. And sometimes, okay,
really sometimes, it's perhaps cheaper to just use something and invest like a week or two, instead of having to answer the phone at 3 a.m. Yeah, yeah. Like I said, it really depends on
your team and the size of your team or your company, yeah. But I really agree with you. I just wanted to say: please don't use plain NFS as your distributed file system. Yeah. Use something like Gluster, or, yeah, it will burn you.
Yeah, you're right. When we wrote distributed file system, we had more in mind HDFS, which we use intensively. Yeah, I misspoke. Okay. In my experience, it's not very hard to
avoid hardware failures. We have replication, we have master-slave, we can clone and back up our data. But it's very hard to recover from logical failures, when we logically corrupt our database — or corrupt our MongoDB
database — so how could we avoid that? Yeah, yeah. We'll maybe cover, deeper in, an example that relates to the problem you're talking about. And I agree with
you. This is only pure hardware failure here. Any other hardware failure experience? Hi. So, I forgot to mention that mostly, with this kind of homebrew solutions, I've noticed
that they end up with a much more complicated architecture. For example, if you now wanted to somehow make a fail-safe architecture out of this technology stack — in my experience, teams have ended up
with multi-master, highly complex MariaDB clusters and whatever. And, you know, the solution is simply: use Celery, use whatever framework, just do it. Yeah, we'll get to some of those afterwards. You're right. But let's continue on.
I don't know if I'll be contributing much, but just an anecdote about hardware failures. On this one project that I was on, only briefly, we had this big data center in Verizon or
Amazon or something, but it was in one place in the world, and a tsunami hit. We'll talk about this later. No, no, no, keep on, keep on. Well, you know, then we thought, yeah, we have to have another one on the other coast of the US and stuff like that. Sure, sure, sure. Of course we'll get to
that as well. You just want to see me walking. Yeah. That's because you said you were tired earlier. No, actually, yeah, I was. Oh no — or were you? No, actually, I was a little bit late because I was stuck in the EPS meeting, sorry.
These are all server things, but if you only have servers and they're not reachable, the network is missing — and the network is a big problem as well. Thank you. Also, hardware can fail there. Yeah. Very badly. Yeah. That's another possibility.
Unreachable backends, indeed. That's maybe what occurs more often than a server burning — a burning server, actually. The first thing that comes to my mind with
unreachable backends is a sysadmin guy who tripped over the cables. True story. I'm sorry — not me, but anyway. The first thing is you have to make him remember. That's human behavior. So maybe find a forfeit for it and
allocate a keyboard for one week, whatever you want, but you have to make him remember. On the hardware side, you can also handle switching and switch failures. The easy answer to this on Linux, for instance — but it also works on Windows —
is to use network bonding. Nowadays, when you buy a server, it has at least one network card with two ports. Use those two ports and plug them into two different switches. It's really easy to do. When you have real network people,
you can do LACP, which is a higher-level but more resilient and more robust way to do the same thing: aggregating two ports and adding up their bandwidth, while adding fault tolerance to your networking. That's the principle.
Do you have any knowledge to share about switches or unreachable things? Yeah. Yeah, hi. Is anybody using hardware anymore? Isn't everyone running in the cloud or using virtual machines? And you're running it yourself?
Yep. Okay, just asking. So, yeah, we do it ourselves. We buy everything, we host everything ourselves, and so we have to take care of these kinds of problems. And we use Gentoo in production.
Yeah, we use Gentoo Linux in production, which maybe a lot of you haven't heard about. We are some kind of crazy people. When I say we're used to shit handling, maybe it's partly true. Anything else to share about network resiliency?
Well, okay. Now let's get a bit deeper into the stack. Having a fail-proof stack can also help when it's not only about the hardware part. In NGINX, there are two things I like to use mostly.
The first is that in NGINX you can handle backend HTTP errors. Your upstream gets back to you with a 500 error: what do you do? Do you pass this 500 error back to your client, or do you try to handle it nicely?
I'll show an example of this. If you don't know about it, it's called a named location in NGINX. We use this a lot. When something happens — you can see the error page handling at the bottom, whatever it is — we will change the error code to 200
to mask it for the user, while still serving some kind of pixel, because this is a pixel service. We can even handle the case where there was a redirect in the URL: we can still redirect the user to the correct page even if our backend
didn't, or did something terrible. So that's a kind of little trick. Named locations and error page handling can really save your clients from facing 500 errors.
We use it quite a lot. You can also serve from cache. So NGINX has caching capabilities. You can say, okay, if I get an error code from my backend, I will just serve a stale cache response. It's pretty handy as well.
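For reference, a minimal sketch of the named-location and stale-cache tricks described here; the directives are real nginx directives, while the upstream, cache zone and location names are made up:

    # sketch: mask backend 5xx errors and fall back to a pixel / stale cache
    location /pixel {
        proxy_pass             http://flask_backend;   # hypothetical upstream
        proxy_intercept_errors on;
        # jump to the named location below instead of passing the 500 to the client
        error_page 500 502 503 504 = @fallback;

        # serve a stale cached response when the backend errors out or times out
        # (assumes a "pixel_cache" zone declared with proxy_cache_path elsewhere)
        proxy_cache            pixel_cache;
        proxy_cache_use_stale  error timeout http_500 http_502 http_503 http_504;
    }

    location @fallback {
        # answer 200 with a 1x1 transparent GIF (ngx_http_empty_gif_module)
        empty_gif;
    }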
On your Flask application, usually you can also use stale caching which can be handy if your database is down as well. You can have some answers in cache and serve from stale cache. It's better to answer something than an error code.
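In Python terms, the stale-cache fallback can be sketched roughly like this; the route, cache and database call are placeholders, not the talk's actual code:

    # sketch: fall back to a stale cached value when the database is unreachable
    from flask import Flask, jsonify

    app = Flask(__name__)
    cache = {}  # stand-in for memcache/redis; holds the last good response per key


    def fetch_profile_from_db(user_id):
        """Placeholder for the real database query (may raise during an outage)."""
        raise ConnectionError("database is down")


    @app.route("/profile/<user_id>")
    def profile(user_id):
        try:
            data = fetch_profile_from_db(user_id)
            cache[user_id] = data            # refresh the "stale" copy on success
        except Exception:
            data = cache.get(user_id)        # database down: serve the stale copy
            if data is None:
                data = {}                    # still better to answer something
        return jsonify(data)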
And then you have multiple techniques to not lose data — this part is more focused on not losing data. Spooling and task deferral, in the basic form, are for the case
where you get some data from your HTTP call and this data is very important to you: you don't want to have to ask your client to send this data twice. Even more so when, in our case, it's navigation data — a user browsing a website with their browser —
and we can't get this data back. Spooling it means that whenever we have it, we are not forced to immediately insert it into the database. We can take this data, write it somewhere on disk, and have another process fed with this data that inserts it
in a safe way. So if your backend is down, that process can just retry inserting this data over and over, while you responded to the client a long time ago. There are also message queueing technologies — maybe you've heard
about them already here — such as ZeroMQ or RabbitMQ, which are more resilient and can help you get the data and turn it into a task. That's also the
important thing to me and to us: don't send error codes back to your clients unless you really have to — it depends on what you're doing —
but you can handle them even at higher levels of your infrastructure. And don't lose data: don't ask your clients to send that data again. You have ways and means to handle these kinds of failures as well, and to not ask for it.
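A very stripped-down sketch of the spooling idea — a file-based spool with a hypothetical insert function; a production setup would rather use uWSGI's spooler or a message queue:

    # sketch: accept data immediately, persist it to disk, insert later with retries
    import json
    import os
    import time
    import uuid

    SPOOL_DIR = "/var/spool/myapp"  # hypothetical spool directory
    os.makedirs(SPOOL_DIR, exist_ok=True)


    def spool(payload):
        """Called from the HTTP handler: write the data to disk and return at once."""
        path = os.path.join(SPOOL_DIR, f"{uuid.uuid4()}.json")
        with open(path, "w") as f:
            json.dump(payload, f)


    def insert_into_db(payload):
        """Placeholder for the real (possibly failing) database insert."""
        ...


    def consume_forever():
        """Separate process: retry pending insertions until the backend is back."""
        while True:
            for name in os.listdir(SPOOL_DIR):
                path = os.path.join(SPOOL_DIR, name)
                with open(path) as f:
                    payload = json.load(f)
                try:
                    insert_into_db(payload)
                except Exception:
                    continue        # backend still down: keep the file, retry later
                os.remove(path)     # inserted successfully: drop the spooled file
            time.sleep(1)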
Do any of you use any of those techniques? Two? Three? Four? What techniques do you use? I used to work for a WordPress hosting company
and a lot of what we did was basically rely on the reverse HTTP cache, because a lot of the content being served is actually just static content in a way. Think of a lot of people running websites — those glorified blogs are basically just static content after a while.
And then the backends could fail all the time and customers would never notice if you served from cache. Everyone's happy. The front page is up. The main articles are up. A lot of things are available especially when your website is basically a content publishing platform because that content doesn't actually change that much. It's not very dynamic. It works very well.
You don't have to wake up every five minutes in the middle of the night during an outage. You can sleep through it and everything's fine. No one will notice except the people trying to publish an article. If it's something really urgent then yes, they will complain. Any other users who want to share their experience of what they are using it for?
Yes, to complete this: even on websites like e-commerce, you can use similar techniques, even if you actually need the database to insert the orders or stuff like that, because like 95% of the content is static. So you can have something like Varnish serve the static content,
then use some tiny bit of JavaScript to just get the little parts specific to the user, like the user name, the basket, etc. And I've seen it used to lighten the load on the backends a lot, and it's really effective: even if you have
one or two minutes of downtime for your backend, your user can still navigate the website, see all the products, and maybe by the time they add to cart, the backend will be back up and you won't lose any money.
Anyone else? I guess the conclusion here is it's better to run even a degraded version of your website or whatever services it is you run than having it
fully down. It depends on the use cases. Yeah, it can be argued. You want to argue it? Come on, we are here for it. Yeah, I want to hear the counterpoint. For example, if you charge money from a client, it's better to say
"I cannot" than to take the money several hours later. I guess even that can be argued. So the next thing you can do is, of course, clustering your application. So
if one of your backends is down, or one of your databases is down, well, it's still working. The bad thing is that even with a load balancer, there's still a single point of failure. So you can always get more redundancy:
even if you have two load balancers, then the whole data center can go down, and you have to get another data center. So it's kind of an endless loop. But, yeah, redundancy is cool. Okay. So now
we can get to your point, where your data center burns. Yeah, this photo looks pretty bad. I don't know if it was photoshopped or if it's an actual photo, but I was like: oh my God. I wouldn't want to be the sysops coming back to this after the fire alarm in the data center room.
On the upside, actually it's pretty simple. Have multiple data centers if you run them yourself. If you use the cloud, like it's been suggested, in Amazon you have this notion
of availability zones that you should use. Make sure you do remote backups, whatever you do, and test them. In France we had a recent story where a big company lost its customers' data;
they thought they had backups, because they were doing backups and remote backups, but when they tried to restore them — yeah, let's say it failed. There again, I don't want to be the sysops over there, and you don't want to either, I guess.
On the IP routing and connectivity side, you have BGP anycast for having a single IP address accessible all over the world. Something I also appreciate is DNS health checking.
For this we use Route 53 on AWS. Who knows about Route 53? Okay, not so many. It's a DNS service from AWS where basically you can have geodistribution-based
DNS responses, and add health checking to those DNS records. So if your data center — or whatever it is — is down, if one of the IPs to your
web service is down, it will not be returned in DNS answers anymore. It's pretty handy, and cheap as well. On the application design side, you have to think about geodistributed applications. Who runs at least one
geodistributed service here? Okay. So I'm not talking to too many people, but still, it's a very interesting thing to do. As a developer it's a real challenge, even when you want this service — or this
kind of, when I say service, it can be a database service — to be available all around the world. It's also a nice thing to try and achieve. Has anyone had this kind of problem already, where they were relying on everything being in one place?
Yeah? What happened to you? To the whole data center? Yeah, so obviously I'm not a network administrator of some kind, but I, you know, saw it all.
The main service was located in one data center, and it lost power. It ended up in four hours of outage. Complete — nothing worked. Crucial infrastructure was located there.
So we just dialed up our clients and said we're sorry. And afterwards we apparently distributed it. Yeah. How much time did it take to distribute the whole thing? I would have to ask my administrators, but I know
that certain steps were carried out. Just to add a little bit about how easily terrible stuff can happen to a data center, especially if it's not a big company — like a small data center, or a service provider with a small hosting area — because I used to work in
kind of the same environment, and basically so many things can go wrong. We had a story — I won't name the company — but basically it happened overnight: the night shift who was monitoring the place all suddenly fell asleep and missed all the alarms, and basically when the morning shift came, the
temperature in the server room, where a lot of our customers were hosting their services, was like 70 degrees, and we opened the windows and just started trying to get some air in there. Basically, a lot of things can go horribly wrong, so choose your data centers carefully, and try to really get more than one of them if possible.
Contracts with your providers — one more. Yeah, yeah, yeah. I'm just relating to that: the contracts with your providers are usually not enough. Providers say 99.99%,
but not 100%. Yeah, yeah. Luckily this was a data center that was only used for development, but we had air conditioning that was running really hard, and it leaked water into
the power outlet that was behind the UPS — so, no more uninterruptible power supply. It proved that it was interruptible. It was down for two days. Yeah, it was a major, major problem.
Yeah, you have to call your clients in the end so I guess this must be very hard to explain. I don't want to be in the sales department at this time. The problem with geo-separated
distributed locations is not when it goes down but when things come up again. I've had a few times where services came back up and we had both of them active because they couldn't see each other but the rest of the world could either see one or the other. And then people start using it and when they see each other
again, then one of them has to decide to become the slave again, and weird things happen. Yeah, that's called the split-brain situation, where your cluster doesn't know anymore, because you usually had two peers. That's why in clustering
in general — in everything you do — you should always use uneven numbers, and you already know about the voting strategy: okay, if I am in a disconnected situation, who is down — me,
or my peer? If you have only two peers, you have no way to know. You have to have at least three peers to be able to know: if you can't reach either of the two other peers, you are the one who is down. That's pretty solid. It's not always solid,
but it's pretty solid. At least always think in uneven numbers, always, whatever you do. Yeah. Okay. So theory is great, but sometimes real-world problems are a bit more complicated, and it's not always DevOps stuff.
It can really come from the code. That's what we are going to see. So one day I was working, like normal, doing my stuff, and one of our marketing guys came and told me: hey ramnes, the client says they can't
authenticate on the website, something is wrong. I was like: okay, I am going to check the logs. This had happened maybe ten times in those days, so okay, let's see, maybe something is wrong. So I went on the machine, I looked at the logs,
and everything was okay. So the client is wrong. Yeah, the client must be wrong. So he goes away and I am happy. Something like one hour later,
I am still working, and the guy comes back and tells me that it's still not working for the client. So, exhausted: all right, I will check the code, maybe something is wrong there. Then I look at my application and — does anyone see something wrong? So
after 30 seconds I noticed that the send-email function could fail, and if the email function fails, well, it returns okay anyway, it "works". So
yeah, my conclusion to that story is that you have to know your code. Infrastructure is great, but code can fail too. Even if you don't like the guy who wrote the code, even if you don't understand the code: if you are the maintainer
of something, you have to understand what you are doing, and you have to refactor when needed. Errors should never pass silently — that's from the Zen of Python. And yeah, don't always blame the ops guy. Sometimes it's easy to say: okay, that's not my fault, it must be another server thing.
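The bug described above, reduced to a hypothetical Python sketch — this is not the actual code from the story, just the same pattern and one way to fix it:

    # sketch: the "errors should never pass silently" anti-pattern and a fix
    import logging

    log = logging.getLogger(__name__)


    def smtp_send(address, body):
        """Placeholder for the real mail-sending code."""
        raise RuntimeError("SMTP server unreachable")


    def send_email_bad(address, body):
        try:
            smtp_send(address, body)
        except Exception:
            pass                        # failure is swallowed...
        return "ok"                     # ...and the caller is told everything worked


    def send_email_better(address, body):
        try:
            smtp_send(address, body)
        except Exception:
            log.exception("sending email to %s failed", address)
            raise                       # let the caller (and the logs) know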
So that's why the devops thing is great: you can really understand what's happening on your server even if you're originally just a developer. And the other way around is true too. So, do any of you have similar situations? What kind of really weird
things happened? Okay, now developers have to be brave to raise their hands here, I know. I had a silly situation, a similar thing, where someone was saying: oh, this isn't working for the client, she's trying to do all these things — she had a really odd workflow —
and so I was thinking: hang on, this is all working, all the tests are passing. I go onto the website and I'm looking, thinking this is all working fine. And I ran all the tests and it's working fine. What I didn't really realize — it took me like a week to realize, while she kept coming back — the point was that I use NoScript, so I'm happy using the HTML backend
and everything, so it's fine for me. What I didn't realize was that if you enable JavaScript, the JavaScript uses a different API, and that's the thing that was causing the problem. So make sure to eat your own dog food and use your own API. Yeah, it looked like it was working, but I didn't write that code, so it's fine.
Yeah, but in the end you were responsible. Yeah, that's right. In Python we get used to the libraries we use raising exceptions;
a really common one that doesn't is memcache. Pretty much every memcache library will return a falsy value instead of raising an exception, so you need to wrap it or do something like that. There are four or five places I can think of, in different projects we've been working on, where we traced something back from "why isn't anything working?" to the fact that
we thought memcache was working when it wasn't. The memcache Python library, because of this, can sometimes be a nightmare, so you always have to check. It's like in Go: you have to check the error returned by your operations.
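One way to deal with a cache client that fails silently is a thin wrapper that turns a failed write into an exception. A minimal sketch, assuming a client whose set() returns a falsy value on failure (as python-memcached does):

    # sketch: make silent cache failures loud
    class CacheError(Exception):
        pass


    class LoudCache:
        def __init__(self, client):
            self.client = client            # e.g. a memcache.Client instance

        def set(self, key, value, ttl=300):
            # python-memcached returns a falsy value when the write did not happen
            if not self.client.set(key, value, time=ttl):
                raise CacheError(f"failed to set {key!r}")

        def get(self, key):
            return self.client.get(key)     # None may mean "miss" or "server down"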
Any other brave developer want to share about this? Oh yes, we have one here. Pass it through.
Yes, so my example is not related to Python really but to PHP. Yeah, I know. Who thinks it happens more in PHP? That's brave developers.
No, but in my defense, I am not the one who wrote the code. There is a very nasty thing: when you try to autoload some class file and it has a syntax error, then if you don't handle this properly, PHP dies and the web server returns
a blank page with 200 OK. And there's no other way to debug the issue. That's a nice one. At the WordPress
hosting company where I worked, we ended up writing some code in the reverse proxy that would detect these sorts of situations — the white pages — and alert us, just because it's such a stupid default. Why would you return a 200 when something's wrong? Yeah, it's horrible to monitor for that.
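Detecting that failure mode from the outside is cheap; a small monitoring sketch along those lines (hypothetical URL and threshold, using the requests library):

    # sketch: alert on "blank page with 200 OK" responses
    import requests


    def check(url="https://example.com/"):
        resp = requests.get(url, timeout=5)
        body = resp.text.strip()
        if resp.status_code == 200 and len(body) < 100:
            # a 200 with an (almost) empty body is as bad as a 500 here
            raise RuntimeError(f"{url} answered 200 but the body looks empty")
        resp.raise_for_status()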
I have to mention that Python breaks this rule itself. For example, hasattr swallows exceptions, and sometimes that can do very strange things.
True. Switch to Python 3. One thing that's not related to Python or any programming language really: I was running
a server with a pretty large disk in it, and there were very, very many files on it. Then suddenly a developer called in and said: hey, I think the disk is full. So I go look, do a df, and no,
only 10% used. And he says: well, I can't write any files anymore. OK, touch a file — disk full. It had run out of inodes. That's something. Yeah, that's a nasty one, one we often overlook. Absolutely right.
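Checking for this is cheap; a small sketch of an inode check (the path and threshold are arbitrary):

    # sketch: "disk full" can mean "out of inodes" even when df shows free space
    import os


    def inode_usage(path="/"):
        st = os.statvfs(path)
        used = st.f_files - st.f_ffree      # total inodes minus free inodes
        return used / st.f_files


    if inode_usage("/") > 0.95:
        print("warning: almost out of inodes, new files will fail to be created")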
And there are file systems that don't rely on inodes. So when you know that your application might spawn a lot of files, think about them. Indeed. I have another story that I forgot to put in the presentation, so I'm going to tell it right now. Basically, at my
old company — it was a small startup, before I worked at Numberly — we were trying to do things fast, so basically our web server was running inside a tmux. Sometimes when we looked at logs we were just scrolling in tmux. And one day
we were like: oh my god, the web server is not running anymore. Actually it was just tmux: when you scroll, it pauses the output, so you effectively pause your application. The application was down just because we were trying to read the logs with
tmux. Don't run tmux in production. And about the devops philosophy — I don't know what kind of adjective we can add — who works as a devops, or in a devops-minded
company? I see... I don't know if you're waving to say hello or... Okay, just wait, I'll come over with a microphone,
because we can't hear you in the back. Yeah, we should probably have thought about that a little bit earlier. The devops question is hard, because your managers
and everybody talk a lot about devops, but then they hire a guy as a devops, in a devops position, and it gets, you know, tricky. So you get back to the silos: we are the developers, and there's the devops. So yeah, back to developers and sysadmins. I was actually
a developer who had to run back to our admins to check why the fuck Docker was not working again. Oh, the Elasticsearch containers clustered with each other — oh, interesting. So in that sense I was a devops, because I needed to worry about the code and about
the infrastructure messing up. So it's a tricky word. Yeah, it has a different depth depending on where you stand. Which leads me to a question: who runs Docker in production, and can you share
some experience with it? I'm interested — and more interested in when it fails, obviously. Just one thing: we were actually doing this new project, so it was more of a proof of concept, but we had already started to
get it out to customers, and we were working together with this consultancy who told us how to do Docker and Cloud Foundry — if someone knows Cloud Foundry. So for our whole infrastructure,
we would provide services for Cloud Foundry based on Docker — you know, like spawning Redises, Elasticsearches, stuff like that. But the Docker cluster was actually one machine, with all the containers for all apps, for all services. So don't do that.
Oh yeah, okay. Thank you. So I can share a funny story where the Docker daemon crashed on the CI server. You can imagine having like 15 super highly
paid developers who are just fixing the CI. And it was also fun to debug, because, you know, who would have thought of it? Yeah, but to get to the previous point: how much effort would it take to implement
something like supervisor, or whatever process, that would monitor the daemon? Probably not much. Yeah, you're right, you're right. Sometimes we are our own barriers. Yeah, I really agree with you. Just about the DevOps thing:
yeah, it's kind of a buzzword, especially for recruiters. But the way we see it, it's really not a single person being the DevOps, but really a team where people make DevOps the thing.
Just working together and understanding what the other is doing is very important. Also giving time to a developer to learn, and helping them understand what they're not used to doing. And here's
another real-world problem. Go ahead. So, we have our statistics in Grafana — it's a very nice board. One day I was looking at Grafana and I saw these statistics. It wasn't really important, because it was
really just the maximum processing time of one of our services; the average processing time was still very low, so we didn't really investigate. But it stayed there for, I don't know, maybe
two or three weeks, or even more, I don't remember. We never really understood what was happening, or why the maximum processing time was so high. And then one day: boom. Like, what? So
my first idea when I saw the graph going so low was: oh my god, the service is down. Actually no, it was still running. But what was happening — so I searched for like one hour to understand what had happened, and I
ended up talking with one of the ops guys on the team, who told me: hmm, that's strange, because at that moment I deployed an Ansible playbook on one of those servers. So we looked at the playbook. What was the difference? And
the only difference was in the /etc/hosts file. Basically, our DNS server was queried all the time, at each database insertion, so sometimes it was just overloaded, and simply putting the IPs
of our database in the /etc/hosts file of each machine did the trick. So yeah, that was pretty weird.
I've sometimes had to solve the same kind of stuff. And it's not just resolving the database server: you'd be surprised to see how much
code does reverse DNS lookups, so you have to handle not just forward but also reverse resolution — it happens quite a lot. Yeah. That looks like the same problem.
To be honest, we felt pretty stupid with this one as well. And this one is pretty interesting, because two days ago you gave a presentation about using Consul.
Sounds very contradictory. And the question is: have you tried to put a local cache — I don't know, a BIND 9 with a TTL of around 30 seconds or a minute, something like that? Would that do the trick? Yes, absolutely.
What's embarrassing with this is that we also lacked consistency in what we do, you know? On another type of infrastructure we have a local cache, a local DNS cache, but there we didn't have it. And when Guillaume says
that people were working on the Ansible playbook, it's also to start normalizing all of this. So yeah. Maybe we think that we have something in production, and it's been running for so long that
nothing can happen to it, and then we sometimes forget about its resiliency or performance, or about applying the latest of our knowledge, just because it's running — I don't care, I don't need to bother so much, unless something
weird happens. In this case it was good news, you know? We were satisfied with this shitty processing time. But for another type of application, it might not be so. I think that one good trick is to always
profile your applications at least once. I recently used vmprof, from the PyPy guys, and it actually only slowed my service down by around 5%. So it's actually viable to just switch it on on one instance and check what your code is actually doing.
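For reference, a rough sketch of wrapping some code with vmprof — the exact API may differ between versions, and run_my_service is a placeholder for whatever you want to profile:

    # sketch: sample a service with vmprof and dump the profile to a file
    import vmprof


    def run_my_service():
        """Placeholder for the code you actually want to profile."""
        sum(i * i for i in range(10_000_000))


    with open("myservice.prof", "w+b") as outfile:
        vmprof.enable(outfile.fileno())   # start sampling this process
        try:
            run_my_service()
        finally:
            vmprof.disable()              # stop sampling and flush the profile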
Actually, in that situation, I had profiled the code and I didn't get the same result as this graph analysis. That's why I was like: what the fuck, it's not working as it should. And that's also why it wasn't really important, so like I said, we just let it go, and, well, it was a good surprise when it got
fixed. We came up only with embarrassing examples, so you should feel more comfortable sharing yours. I have a comment regarding performance monitoring tools.
You really have to configure them properly. We had a situation where the average response time was between 30 and 60 seconds, and it was caused by file uploads: for example, 5 GB files were uploaded by a few people
and it increased the average response time. The same kind of problem: sometimes our metrics server goes down, and then we think, oh my god, my application is down — but no, it's still running. Yeah. Who is using —
who is not using a metrics system? Who doesn't do metrics on their applications? Who doesn't have this kind of graph? Nobody? You all have it? Okay, I see you again — waving people. Question. Yeah, yeah.
Question — I was precisely going to ask an extension of this kind of question. Have you managed, in the end, to put some kind of percentile graphing in Grafana, so you can know the 90th or 99th percentile
of the response time, so you will know the kind of problems you have? We're having a hard time with this. We are deploying Prometheus plus Grafana, and using that with Elasticsearch we are having a lot of problems to really calculate the
99th or the 95th percentile. Have you managed to do this? Basically, what we use in general is a comparison between the current day and the same day seven days ago, so it gives a good idea of: is this normal, or is it weird?
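If the raw response times are available somewhere, the percentiles themselves are cheap to compute on the side; a tiny sketch with the standard library (Python 3.8+), with made-up sample values:

    # sketch: p95 / p99 from raw response times (in ms)
    import statistics

    response_times = [12, 15, 13, 250, 14, 16, 13, 900, 15, 14]   # made-up samples

    cuts = statistics.quantiles(response_times, n=100)            # 99 cut points
    p95, p99 = cuts[94], cuts[98]
    print(f"p95={p95:.1f} ms  p99={p99:.1f} ms")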
And using Carbon and Grafana for the visualization, you also have the annotation feature, which is good: you can plug it into your deployment or continuous delivery pipeline,
so you can have a bar on your graph saying, okay, from this point on this is version 2.1, and then you can also do metrics comparisons related to code deployments. It's pretty good. In disaster recovery it's also a good thing to have,
so you know when you broke something. And you can do the same with server provisioning and deployment: at this time I added a new server, maybe it has some weird side effects. For the percentiles you mention, if you've already
got Elasticsearch, and you don't have aggregates but have the actual requests logged there, you can just use Kibana, because it has a really nice visualization and basically gives you the percentiles as well. You should get that for free from Kibana.
The answer — for the remote audience — was that the problem is combining it with Grafana. Does anyone else have a question? I guess we have four minutes left. The discussion is open now.
I'd like to know if anybody has some experience with deploying a new version of your backend to only, let's say, five percent of your users, trying it out
and seeing how it behaves, and then going for one hundred percent. Especially with Nginx — if you do it with Nginx, that's really good. Progressive deployment. Anyone?
Thank you. For the same company I was talking about a bit earlier, the e-commerce one, we were always ramping up the traffic, but it was with HAProxy in front of — I think it was
HAProxy, then Varnish, then Nginx, in that order — and the new servers were ramped up to like ten percent of the traffic on the new version. Then we had the software error monitoring and all the metrics on these servers, and we were checking that the response time was not doubling, etc.,
and after a few hours a few more servers were joining, and at the end of the day all the traffic had been rolled over. I don't know if I answered the question. Depending on your stack, we do a lower-level version of this: we run our Python using uWSGI, and in uWSGI you have this feature
called touch chain reload, where your workers are reloaded one by one, and uWSGI will make sure that the one that was reloaded came back up correctly before reloading the others. So it's a good fail-safe, low-level deployment trick.
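The uWSGI side of that is just configuration; a minimal sketch — touch-chain-reload is a real uWSGI option, while the module and path are made up:

    # uwsgi.ini -- sketch: rolling, worker-by-worker reload
    [uwsgi]
    module = myapp:app
    processes = 8
    # touching this file reloads workers one at a time; each new worker must
    # come back up correctly before the next one is restarted
    touch-chain-reload = /tmp/myapp.reload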
And on a side note, if you are really, really committed to trying canary releases — which are usually called canary releases because you put the bird in the mine and see if it's poisonous — try Kubernetes, which solves this problem in a very reliable way. It's obviously —
it may be too complicated for your case, but it has exactly this kind of procedure, where you say: I have a rolling deploy strategy, I want to keep a certain number of pods — which is your application deployed — and then it begins to increase the number.
This feature in Kubernetes is very nice, but Kubernetes still doesn't have health checks, right? Health checks?
So yes, it has. So that was one of the really bad things that I wanted to bring up. Readiness checks. That's right. Thank you, Paul.
Yes, I especially want to thank you, because this is an interactive format and it was a little experiment — instead of having only one-sided talks — and I want to thank you for taking a leap of faith in the first year we tried this. So please give these guys an extra hand.
Thank you very much. And now, lightning talks. Thank you. Thank you.