Data Corruption: Stop the Evil Tribbles
Formal Metadata
Title | Data Corruption: Stop the Evil Tribbles
Series Title | RailsConf 2017
Part | 86
Number of Parts | 86
Author | Haibel, Betsy
License | CC Attribution - ShareAlike 3.0 Unported: You may use, adapt, and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as you attribute the work in the manner specified by the author or rights holder and pass the work or content on, including in adapted form, only under the terms of this license.
Identifiers | 10.5446/31238 (DOI)
Release Year | 2017
Language | English
Transcript: English (automatically generated)
00:14
I'm Betsy Haibel, and welcome to Data Integrity in Living Systems. Ordinarily I'd like to launch right into the talk content here. We've got a
00:24
lot to cover, and that's what y'all are here to see. But Marco's keynote just now really hit home for me, and I wanted to follow up on that. Obviously I'm a white woman and he's a black man, and these are really different experiences. I don't want to collapse the subtleties of that, but a
00:43
lot of the ways he frames survival are the ways I do in my head. I often paint my experience getting into tech from a non-traditional background as this super simple happy-path theater-tech manic-pixie-nerd-girl thing, and I do that because it's always safe to paint yourself as an eccentric genius.
01:04
Let's get real for a second. A decade ago, when I was getting into tech, I was just some chick with no college degree who had just washed out of an arts career because of a sudden-onset chronic pain disorder, and boot camps were not even a thing yet, so I couldn't even get a certification that way. I was
01:20
doing this all on my own, and that was terrifying, and I learned a lot of bad survival lessons then. This is relevant because there are some places in this talk where I talk about times I used to be kind of a self-righteous jerk. I want y'all to remember, when you're hearing about those parts, that I was only able to properly grow out of that after I wasn't
01:44
the only woman in the room anymore. After I wasn't so alone. The only way I could move from surviving by being a jerk into nuance and kindness, and the actual success they bring, was to not have to carry the torch of
02:04
being the only representative of my gender there. Anyway, back to data integrity. Now, this talk used to have another name. It used to be called "Data Corruption: Stop the Evil Tribbles." I didn't change it because that was kind
02:25
of a hokey name for a talk. It is, but that reason would imply caring about my personal dignity, and let's look at this example of fine photo-manip art, why don't we? Don't worry, there are plenty more humorous Star Trek
02:41
images ahead. You know they must be humorous, because that's what I'm calling them right now. Anyway, I changed the name of the talk because it imposed this frame of bad data being an evil invading force from outside, and literally everything else in my proposal was about how counterproductive that frame is. When we visualize bad data as this mess that clogs up our
03:07
otherwise pristine database, we do two things. First off, we pretend that there is such a thing as a pristine database, or a pristine anything, when it comes to computers. Let's be real: most of us work on code bases, architectures, and so on
03:22
that look a lot more like this. And the second problem is that it trains us to think of the situation adversarially. When we think of this as the tribbles versus the crew, we're making a world of heroes and villains.
03:41
We are making a world where heroic developers are fighting bad data. It is real easy from there to think of ourselves as the heroic developers fighting the sources of that bad data. That is benign when it is computers we are fighting, but this attitude quickly turns into thinking of our users
04:00
as the enemy, or even worse, thinking of our teammates as that enemy. It is super easy to get self-righteous about data integrity issues, but once we start doing that we lose our chance to solve the actual problem. When I think about the actual data integrity issues I've dealt with, product changes are
04:23
actually the usual cause, or maybe miscommunication between teams. There are well-known data integrity patterns like transactions or Rails validations, and I'll talk more about these later, but I'd just like to dive in right now and say: no, this is not your responsibility in the way you think it is. If we're going
04:44
blameless when looking at root causes, we note that when people forget, or even consciously skip, these data integrity patterns, it's because the code base they're working in is architected in a way that actively discourages their use. Sometimes, yeah, you're going to encounter weird computer nonsense, like
05:06
a bug in Postgres that emerges at 3 a.m. on the fifth full moon of a leap year, and somehow your data model is peculiarly vulnerable to that. But, you know, in the interest of scope I'm just not going to cover that. I'm going to focus on the 90% case, which is that your team culture
05:23
and code structure are creating a situation in which bad data is likely. Designing our systems to be resilient against common issues, like the product changing, or an engineer making a mistake because they were toughing out a mild flu,
05:42
also lets us proactively detect and correct an awful lot of data integrity bugs that stem from the harder stuff, like concurrency under load. So that we all have a concrete example to hang our understanding off of, I am going to tell a story. It's loosely based on a job I had a few
06:01
years ago at an e-commerce company. The details are not going to be exact; I would pretend that this is me changing names to protect the innocent, but it was a few years ago and, let's be real, I forgot. This company ran a Rails monolith that was old enough, big enough, and entangled enough to deserve that name, and my team was working on a module inside it that took
06:24
returned goods and shipped them back to their vendors for credit. It was not a tremendously complicated system in and of itself, even though the larger monolith was pretty complex. In fact, while neither the UI nor the underlying data model looked exactly like this, you can form a pretty good mental model by looking at it: you get a long way by assuming that the
06:45
only three things that return-to-vendor (RTV) cared about were the product, its vendor, and whether it had been shipped back or not.
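(A minimal sketch of the sort of data model being described here; the class, association, and column names are illustrative guesses, not the actual system's code.)

```ruby
# Roughly the three things the return-to-vendor module cared about.
class ReturnUnit < ApplicationRecord
  belongs_to :product
  belongs_to :vendor, optional: true  # vendor info sometimes never arrives upstream

  # shipped_back: boolean column tracking whether the unit has gone
  # back to its vendor for credit yet
end
```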
07:02
Of course, no matter how few things your system cares about, the world will come up with a way to give you only partial data. In RTV's case, sometimes our incoming data stream would not contain vendor information. Spoiler: it is super hard to ship things back to their vendor if you do not know who that vendor is. In other words, the return-to-vendor module had unearthed a new product requirement: now that we were returning things to their vendors, the upstream system needed to guarantee
07:20
that vendors were recorded. We had an issue, but the issue was created by the RTV module's existence, not by any inherent problem with the upstream code. Because of a product change, data that had served the system's needs perfectly fine the day before was suddenly invalid. This is pretty common: data integrity issues that are caused at the product level are often best
07:42
solved at the product level too. We could have decided to tough it out; we could have done machine learning or something to identify all of these missing vendors. But what we actually did for the feature was a small requirements change and a new piece of UI. The small requirements change: just don't display units that are marked for RTV but don't have vendors in the main RTV display. The new piece of UI: if we don't know the
08:05
vendor, or something else about the unit, a disambiguation interface, kind of the virtual equivalent of the desk that warehouse workers chuck things onto so their supervisors can deal with them. That is a lot more efficient than a more complicated, more technical solution. The big takeaways
08:24
we can get from this story: first off, scope data by whether it can move forward or not. Closely related, validating models in an absolute way is not going to be flexible enough for complex business operations, so we need to be super careful about how we define our product at any given stage, so that we can actually figure out what's relevant.
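(A minimal sketch of "scope data by whether it can move forward," using the same illustrative ReturnUnit model as above; the scope names are mine, not the talk's.)

```ruby
class ReturnUnit < ApplicationRecord
  # Units that can actually move forward through return-to-vendor.
  scope :vendor_known, -> { where.not(vendor_id: nil) }

  # Units that can't move forward yet; these feed the disambiguation UI.
  scope :needs_triage, -> { where(vendor_id: nil) }
end

ReturnUnit.vendor_known  # shown in the main RTV display
ReturnUnit.needs_triage  # shown in the "desk to chuck things on" interface
```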
08:43
One last thing: don't overthink data correction. We do not usually need to do some complicated magic to derive the information we suddenly need from the information we have. We can just go to people and say, okay, tell me this new thing. In a living software
09:05
system, users are as important as code, and often humans are much better at solving these kinds of issues than computers are. When we approach data integrity problems in a spirit of collaboration with our product and our user base, intractable problems become tractable. That's how you solve product
09:24
changes. Now, earlier I was saying that eventually we found the solution, which was us scoping down data and creating a disambiguation interface. There was
09:43
perhaps some miscommunication on this project before we got to that "eventually." Again, it's been a few years and I forget the exact course of events, but in general my team assumed that upstream would be able to provide us with vendors. We assumed that when units didn't have their vendors marked,
10:02
this was actually a bug, and furthermore a bug that we were suddenly obligated to fix. This was actually pretty arrogant. Sometimes upstream just doesn't have the data. The data for return-to-vendor was mostly sourced from another
10:22
module, the returns receiving module. Returns receiving is a fancy warehouse logistics term for the folks who log and sort through boxes of returned goods, and the folks in receiving often just didn't know who the vendor was when they tagged a return unit for processing. Let's visualize here,
10:43
right: you're a warehouse worker, things are coming in off a truck, there's a big pile, and you can hope the packages are going to be nice packages that people actually put return labels on properly. 90% of the time that's true. Most people are not jerks. Some people are jerks.
11:05
Sometimes people don't label things and you need to figure out what the reference point is. Sometimes the box has a brick in it, or nothing in it. There are all these permutations of ways that things actually get kind of
11:22
weird out there, even when people are doing their jobs. And we're talking about warehouse workers here: their job is not to do in-depth research on exactly what brand of brick just got returned. Their job is to log everything they can figure out quickly and move on to the
11:42
next thing. They are measured pretty aggressively on how fast they can do this, and because these are blue-collar workers in America, where the employment picture is not so great, there is a super strong correlation between their job security and how fast they can do this. So if we force
12:01
a process on them where suddenly they need to think really hard and slow down, we are the ones being the jerks. Also, if you have a brick in a box that doesn't have a vendor, sometimes your business process's desire for data is about as realistic as my ten-year-old desire for a pony. Anyway, I didn't
12:27
know any of this at the time. All my team was really aware of was the very local needs of the RTV module, and because those very local needs were all we were aware of and all we were paying attention to, we ran some
12:41
migrations that might have screwed things up, maybe a lot, and it made it to production. We could and should have not done that. We could and should have stuck some new validations in the model rather than running this intense, destructive set of migrations that unmarked a lot of
13:00
things for return to vendor. And we did not do this because, when we just added these validations on our own, CI failed. We assumed that the CI failure was us uncovering yet another new bug in this process. Again, we were kind of arrogant. Instead of assuming, as we should have, that CI failing was a sign
13:23
that we had just broken things. And the root problem here isn't even that we were arrogant and ignored what CI was telling us; it is that, in the process of ignoring what CI was telling us, we went off and did our own new thing
13:42
instead of talking to returns receiving. All of this story that I just told you, about how things actually worked in the warehouse, we discovered at a cross-team retro a few weeks later. If we had slowed down a bit, walked across the office, and said, yo John, what's up with this, then we would
14:06
never have been tempted to run the destructive migration that created a production error. The snarky way of putting this is that all we needed to do was talk to each other, or, as the returns receiving lead may
14:21
have quoted at that cross-team retro later: individuals and interactions over processes and tools. This is true as far as it goes, but "just" is and remains the most evil word in software development. Whenever it pops up,
14:40
you can be sure that lurking underneath, someone is radically discounting the effort something takes, or even worse, pretending to do so as a weapon. Here's another quote; it's a bit condensed from the full quote. Camille Fournier, the former CTO of Rent the Runway, said: the amount of overhead
15:02
that goes into managing coordination of people cannot be overstated. It is so great that the industry invented microservices, because we'd rather invest engineering headcount and dollars into software orchestration than force disparate engineering teams to work together. And this is RailsConf, so there is
15:26
some dogma-joke place I could take this about microservices, am I right? But let's get past that. Let's listen to what she's saying underneath. What she's saying, in a lot of ways, is that if we elide the
15:43
cross-team communication problem into "just" talking to each other, then we are setting ourselves up for failure. The point of agile is not that processes and tools have no value; it is that focusing on individuals and interactions is likely
16:00
to produce more value than declaring a specific set of processes and tools to be correct. Similarly, observing the way the software actually behaves is always going to be more accurate than reading its documentation. What is going on there, anyway? It is from the combination of these two facts that we build our
16:22
mechanisms for making sure communication happens. I mentioned before that when my team added those validations, CI started screaming. If we'd run git blame on the failing tests, some of them would have popped up with recent commits from folks on returns receiving. We could have gone, yo John, what's up? And
16:42
this idea didn't occur to us; we decided to be heroes. But let's go a little bit deeper. Part of fixing interactions so we don't need heavyweight process isn't just building in these lightweight cues, like validations; it is being able to listen to these lightweight cues.
17:05
This organization had a culture of pretty isolated teams, where everyone had goals they needed to pursue really aggressively, because a certain person in upper management was breathing down our necks all the time,
17:21
and that created a situation in which we simply could not be receptive to folks from other teams initiating these kinds of conversations. When we build this accidental cultural norm that your own work is the most important thing, we are also saying that communicating cross-team is a waste of
17:40
time. It is never a waste of time. So we've created this rubric for solving each individual special-snowflake communication problem we encounter; there will be a lot of them. And so next up, let's move on to not using well-known data integrity patterns. Now, this photo of course is
18:01
from one of my favorite Star Trek episodes, the little-known 1983 Christmas special. Anyway, people forgetting, or just not using, established data integrity patterns is one of the biggest opportunities we have to trap ourselves
18:21
into moralizing when we talk about data integrity, because it's an obvious example of human error. But when we actually want to limit the impact that human error has on our systems, blame and moralizing lead us in the exact wrong direction. We need to look at the ways the system actually encourages this human error, and build in ways to recover from
18:43
it. Note that I'm stressing recovery, not prevention. Prevention is "let's add more steps of horrible waterfall paralysis"; recovery is "let's acknowledge things and move on." You're not intended to read all the code on this slide; you are merely intended to note that, oh boy, there's a lot of it.
19:04
This is a symbolic representation of the way that a lot of older Rails projects acquire validation sections that are maybe five times that big, or callback sections that are maybe 35 times that big, and all of this is a way
19:22
to try to build in more modeling to escape the data integrity issues that pop up in older Rails code bases. I could say that all the monoliths that don't have that enormous list have a bunch of nil checks running around instead, but that would be a total lie, because we all know those monoliths also have nil checks running around all over the place. We can
19:45
make a lot of jokes about this, but let's not; that leads us to blame culture. Instead, let's look at how and why the situation is actually terrible. This is paralyzing to developers, and it encourages developers
20:04
to get past that paralysis by skipping things. I have spent a lot of two-day streaks going, oh hell, what magic incantation will get FactoryGirl to work this time? I literally wrote a fixture-concatenation gem
20:24
to get past this problem, because it was so bad at one particular place I was working. Sometimes skipping validations is a totally rational choice, which is scary but true, and we need to get past this not by blaming people for having skipped validations, but by reworking our systems so that they don't
20:44
encourage people to do it anymore. We also need to remember that this is just as paralyzing to users. Think about our poor warehouse workers from earlier; now imagine what it's like when people aren't even paid to work with your software, and how your conversion rate might suffer a little... a lot. Requiring a
21:07
lot of things up front encourages people to either give up or just enter some fake data. This actually happened very recently, and I work on banking software; you'd think that people wouldn't just enter fake data. Sorry, rant over. But people do a lot of things to get past that
21:24
annoying red validation box, and you don't want to give them that chance; you don't want to lead them into that trap. When we make folks deal with the full complexity of our systems (oh hell, how am I halfway through already?) all at once, we are making our systems unusable for both users and developers. So how
21:44
do we make this a bit more real-world usable? Let's start by thinking about the actual business processes involved and figuring out what's needed when. Here, we aren't doing everything at once; we're using a conditional validation to validate based on state. Validations shouldn't fire unless a given fact
22:02
about the model is true. We can also use Rails custom validation contexts to do something similar; here we're specifying which context to use when we save. We can even go a little further from the designated Rails happy path and use ActiveModel-based service objects.
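(A minimal sketch of what conditional validations and custom validation contexts look like; the attribute names, the assumed marked_for_rtv? predicate, and the :ship context are illustrative, not the slide's actual code.)

```ruby
class ReturnUnit < ApplicationRecord
  # Conditional validation: only insist on a vendor once the unit is
  # actually marked to go back, not while receiving is still triaging it.
  validates :vendor, presence: true, if: :marked_for_rtv?

  # Custom validation context: this check only fires when we save or
  # validate with the :ship context.
  validates :tracking_number, presence: true, on: :ship
end

unit.save                  # :ship-context validations don't run
unit.save(context: :ship)  # now they do
unit.valid?(:ship)         # or check them without saving
```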
22:26
But there's a trade-off of flexibility versus cognitive load, because when we go further from the Rails happy path, we are also increasing the amount of cognitive load we need to safely invoke stuff. When you're dealing with an app where you can just sling everything into the model, it is super simple to save the
22:40
model. Custom validation contexts mean remembering more things; service objects mean remembering even more things. Now, this seems subtle. You'd think that people will just remember to look in the correct service object directory and do the right thing. They will not. They remember the norms of default Rails; they will look in app/models, and they will screw up while using your pretty non-Railsy
23:02
service object design. They will do this for entirely normal programmer reasons, like "this is a Rails app, things are in app/models." I am a big fan of non-Railsy service object designs, but they do increase the cognitive load it takes to work with the system, and we need to be real
23:22
with ourselves about that and actually address the data integrity bugs that can result. You need to build in a culture of pairing, thorough code review, and so on to backfill this discoverability problem. Another thing we can lose when
23:41
we move away from Rails defaults is the way those defaults invisibly help us use database transactions correctly. Database transactions, for those who are less familiar, are a way of grouping database queries: one, two, three, four, all of these will stand or fall at once. If query three fails, query four
24:00
doesn't fire and the entire thing goes away as if it never happened. Rails callbacks give us this for free, and I have as much callback hate as the next person, but Rails callbacks do give us this for free. The bottom example doesn't, and that is super dangerous. If a unit's save fails, sorry, if unit three of five doesn't save, we are going to have
24:26
some internally inconsistent data: a shipment marked as shipped, but not all of its units will be. We can fix this by wrapping things in a transaction, but remembering to do that is much harder than you'd expect, and so we
24:41
need to make sure we also have a base class that wraps things in a transaction for you. We need to just do this, instead of chanting self-discipline, self-discipline all the time, because no one has self-discipline when they have the flu. Anyway, the reason Rails callbacks can give us this for free is that I don't want to bother with this when I am working on a feature; I want to work on the feature.
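(A minimal sketch of the kind of base class being described, one that wraps the work in a transaction so individual services can't forget; the class and method names are placeholders, not the original codebase's.)

```ruby
# Every write-path service inherits from this, so the transaction comes
# for free instead of relying on self-discipline.
class ApplicationService
  def self.call(*args)
    new(*args).tap do |service|
      ActiveRecord::Base.transaction { service.call }
    end
  end
end

class ShipShipment < ApplicationService
  def initialize(shipment)
    @shipment = shipment
  end

  def call
    # If unit three of five fails to save, everything rolls back:
    # no shipment marked "shipped" while some of its units aren't.
    @shipment.units.each { |unit| unit.update!(shipped: true) }
    @shipment.update!(status: "shipped")
  end
end

ShipShipment.call(shipment)
```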
25:03
That is what computers are for. These things also get a bit rougher when third-party services get involved, or when asynchronous jobs come in, and really, third-party services and asynchronous jobs are in many ways the same thing. Amy Unger is going to go a bit more into this,
25:21
probably, in her eventual consistency talk right after this one. In each case, there's code whose failures we cannot easily recover from with transactions. If a worker I queue up inside a transaction fails, that worker cannot magically reach back in time into the already-completed transaction and go, nope, sorry, lol. And similarly, if I make an external service call and my
25:45
code comes across an error later on in that transaction, tough luck. There are ways to build distributed transaction systems. I advise you not to do this; it is very hard. So if you, like, don't take this advice, you
26:02
will probably still want to implement the suggestions about catching mistakes that I'm about to give you. Brief digression: this is a data integrity talk, therefore I'm contractually required to talk about the CAP theorem. The CAP theorem, also known as Brewer's conjecture until Nancy Lynch proved it in
26:23
2002, talks about what can and can't happen with data in systems involving more than one computer. CAP stands for consistency, availability, and partition tolerance. Consistency asks: are updates applied in the order they were received? You'll also hear this property referred to by other names. Availability means exactly what you think it does, and partition tolerance
26:46
sounds fancy, but it really just means that the system's behavior is predictable if a server crashes or a network connection cuts out, you know, like computers do. The CAP theorem says that you can pick two but you can't have all three, and really, what the CAP theorem says underneath is:
27:04
consistency, availability, pick one. To quote Coda Hale in a really great blog post, you cannot sacrifice partition tolerance. Computers fail, everyone. If the server goes down, then you can either pick consistency by
27:22
refusing connections, or pick availability by accepting potentially inconsistent data. You do not get any other magic third option. As my coworker Michaela puts it, all systems exist on a continuum between safety and liveness. Guaranteeing data safety will reduce the liveness of your system,
27:40
and similarly, you might want to prioritize liveness sometimes; you need to be able to accept that your data is not going to be perfect if you do. This in many ways conceptually parallels the idea that sometimes we can abandon complicated validation and callback systems to make our code bases more livable, but it does come with an additional "let's
28:04
think about this harder": maybe we'll make mistakes straight off, and that is okay, as long as you're building systems that offset those negative consequences. In terms of those offsets, I have had a lot of success running lightweight audit processes: every five minutes, every 90 minutes, maybe, if
28:21
it's something complicated, every week or even every quarter, you should do some basic consistency checks. Maybe they sum up all the entries in your accounting database and make sure that, yeah, everything adds up correctly, or maybe they make sure that every order that is marked as shipped actually has an associated shipment. Then you want to escalate the issues that crop up to real humans.
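(A minimal sketch of such a lightweight audit as a scheduled job; the models and the error-tracker call are placeholders for whatever your system and team actually use.)

```ruby
class ShippedOrderAuditJob < ApplicationJob
  queue_as :default

  def perform
    # Orders that claim to be shipped but have no associated shipment.
    inconsistent_ids = Order.where(status: "shipped")
                            .left_outer_joins(:shipment)
                            .where(shipments: { id: nil })
                            .pluck(:id)

    return if inconsistent_ids.empty?

    # Escalate to real humans via whatever channel the team actually watches.
    Bugsnag.notify(RuntimeError.new(
      "Audit: #{inconsistent_ids.size} shipped orders have no shipment: #{inconsistent_ids.inspect}"
    ))
  end
end
```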
28:43
If discrepancies are found, and discrepancies will be found, it will be really noisy when you first turn it on. I have worked for a company, I'm not going to name it, where we turned this system off because it was finding a lot of discrepancies. You're going to be worried,
29:05
even, that the audit system is the thing that's buggy and not your data. Your data is the thing that's buggy. I am so sorry. But work through the list; it'll get better, I promise, slowly. Another thing that you can do, if all this seems a bit heavyweight, that's kind of spiritually similar: stop
29:23
nil-checking, and let your error tracker be this audit system. This sounds like a joke, but it is not a joke, and the reason it is not a joke is that the reason we write our nil checks is that we have data integrity bugs. We go, Bugsnag says undefined method whatever for nil:
29:43
NilClass, and we go, oh damn it, shut up Bugsnag, and we slam a nil-check through to deal with the proximate issue instead of investigating the root cause. So I'm suggesting maybe we should stop doing this, or, even
30:00
if we do need to shut Bugsnag up, we start logging exactly how many times this particular piece of data is nil, so that we can determine how severe the bug is and prioritize doing something about it. Don't ignore system feedback.
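(A minimal sketch of logging the nil instead of silently nil-checking it; the method, model, and metric names are illustrative.)

```ruby
# Instead of quietly papering over the nil, record how often it happens
# so the underlying data integrity bug can be sized and prioritized.
def vendor_name(unit)
  if unit.vendor.nil?
    Rails.logger.warn("data-integrity: ReturnUnit #{unit.id} has no vendor")
    StatsD.increment("return_units.missing_vendor") if defined?(StatsD)
    return "Unknown vendor"
  end

  unit.vendor.name
end
```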
30:23
The reason these systems work is that they're fundamentally porting a DevOps practice, namely monitoring real-world performance and adjusting our systems accordingly, and applying it to data integrity. The other thing that's really powerful about these systems is that, again, we're re-involving humans. If we're escalating issues to support personnel, then support personnel can deal with these artisanal one-off data issues in a way that's much more efficient than having a programmer trace
31:02
through and figure out exactly where Postgres went wrong on the fifth full moon of the month. Again, in order to get data integrity efficiently, you need to recognize that software is a living system composed of code and people.
31:20
That's how you get past situations where you can't, or don't, use well-known data integrity patterns like validations and transactions. So next up is bizarre computer nonsense. This one goes by real fast, but no, seriously:
31:44
the things I've talked about so far in this talk, having a strong product understanding, making sure you have good intra-team and cross-team communication patterns, building active practices around data integrity checking, and making sure to get humans involved when you have data integrity issues, these are going to save you when things go really weird. Because,
32:01
fundamentally, when there's an issue, whether it's a simple issue like a value nilled out because a worker failed, or because, I don't even know, computers can be weird, the important part is knowing that things are wrong,
32:28
and then fixing the thing that is wrong as quickly as possible, preferably before the customer notices, or the FEC notices in the case of the fundraising company I used to work for. You can fix things if you know they are wrong, so it is not so bad. You need to be aware of your product, its
32:41
needs, and how those needs are likely to evolve, but that's called being a business-savvy application developer. You need to be aware of communication patterns on your team and between other teams, but that's called being
32:41
collaborative. You need to be aware of how cognitive load, and the things you don't know, can lead you to make well-intentioned mistakes, which is called being a cautious engineer. And finally, you need to be aware and flexible when computers inevitably computer. Anyway, I've got nothing for this one other than my paycheck is real nice. This is all stuff you need to be cautious about.
33:06
When I put it that way it seems easy, but it is not. There is never such a thing as easy; there is never such a thing as "just." The big overarching theme of this talk is that this stuff is hard and you are going to mess up. These are
33:21
complicated problems. They require attention to detail, and that is hard to maintain over a long time of working fast. But if you build in mechanisms that let you recover from issues quickly, those issues might as well never have happened. You don't need to fall into analysis paralysis. You don't need to fall into "here are three million and five, plus an
33:44
extra two, waterfall steps," which will actually be counterproductive and send everything to a bad place anyway. You don't need to resort to unhealthy finger-pointing. You don't need to make all of these struggling, doomed attempts to prevent the issue next time. You can just fix stuff and move on.
34:09
So, here's me. I'm Betsy Haibel. I go by betsythemuffin on Twitter; there is very little programming, there's a lot of cats, and also some feminism. This talk's slides and a rough transcript will be posted at betsyhaibel.com/talks
34:24
at some point within the next week or so. I work for a company called Roostify, based in San Francisco. We are not currently hiring, but we'll be reopening hiring for senior developers soon, and if you want to help mentor
34:40
some of the best junior and mid-level developers that you could be privileged to work with, then I highly encourage you to apply. I co-organize a group called Learn Ruby in DC, which is a casual office-hours type thing for newer programmers in the DC area, and if you'd like to do something similar in your town, talk to me about it. It's not hard; you just need to show up. And I've got
35:06
a few minutes for questions. The question is: are there any tools that can help us check the data in our databases? This is going to sound like a facile answer, but Sidekiq. Seriously, whatever asynchronous or scheduled job
35:42
runner you're already using anyway. You can build these audit systems I'm talking about by just having a thing that emails you, or sends a Bugsnag, if it finds a problem. Run it every five minutes and it'll be fine, or if it's a really expensive calculation, maybe less often than that. Don't overthink it; the
36:02
important part is that you build something quick and sustainable, and you can start incorporating more things over time. Oh yeah, so the question was: when we have these transaction-script objects that are external to Rails, to help us remember to wrap things in transactions, we run the risk of forgetting to use that object. Which is
36:21
real, and there isn't much that can help you with that other than a strong culture of code review. I've seen some people make more complex things; there's a talk Paul Gross gave a few years ago at RubyConf about monkey-patching Rails to get very angry at you if you step past things.
36:41
Maybe that's more complicated than you want to do, but even having the cultural norm of "every single object in this folder must inherit from this object" does do a lot to help, because then you've made an obvious point where things will jump out at you if they are wrong, so people are more
37:00
likely to catch things in code review. Yeah, and so the question was: a lot of the time, when we're trying to deal with particularly bad queries that Active Record might generate by default, we wind up resorting to raw SQL, which of course bypasses Rails's callback mechanisms,
37:24
validation mechanisms, etc., and therefore also bypasses the business rules that these are maintaining, so how do you deal with this? Again, there is no perfect answer here, unfortunately. One thing that you can consider doing is, and this is RailsConf so I'm going to get in trouble for recommending that we not be database agnostic, but don't be database
37:42
agnostic, folks. It is okay to use Postgres's built-in check constraints, for example, and push more of these rules into the database.
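(A minimal sketch of pushing one of these rules into Postgres with a check constraint from a Rails migration; the table and constraint names are examples, not the talk's.)

```ruby
class AddShippedUnitsNeedVendorCheck < ActiveRecord::Migration[5.0]
  def up
    # Enforced by Postgres itself, so even raw SQL can't mark a unit as
    # shipped back without a vendor on record.
    execute <<-SQL
      ALTER TABLE return_units
        ADD CONSTRAINT shipped_units_need_vendor
        CHECK (NOT shipped_back OR vendor_id IS NOT NULL)
    SQL
  end

  def down
    execute "ALTER TABLE return_units DROP CONSTRAINT shipped_units_need_vendor"
  end
end
```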
37:42
This is going to lead to some flexibility issues: check constraints only work within a single table, and for anything outside of one table you'll need to start thinking about triggers. Having worked on a system that actually did need to use this to maintain data integrity, because we were pushing so many things into SQL for scale, you can check out my former colleague Brawley o Karina's talk for a little more about the other scaling issues
38:02
that it actually faced. Plug, yay: it's later today. You need to be really careful: Rails doesn't have great feedback mechanisms for reporting the results of these kinds of triggers failing, for example, back into Rails in
38:25
a way that will make this easy to debug, and so this is a really hardcore choice that you should make sure you're making only if you really need it. A lot of the time I would definitely advise that you think really hard about whether you are best served by doing everything in SQL, or
38:47
extracting things to a service object that imposes these business rules, or whether you really do need to push things into the database. Sometimes you do, but it makes it a lot harder to debug. And it sounds, from that beeping, like I'm out of time. Thank you all.