Easy rewrites with ruby and science! - TIB AV-Portal

Formal Metadata

Title

Easy rewrites with ruby and science!

Series Title

Ruby Conference 2014

Number of Parts

65

License

CC Attribution - ShareAlike 3.0 Unported:
You may use and modify the work or its content for any legal and non-commercial purpose, and reproduce, distribute, and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they specify and pass on the work or content, including in modified form, only under the terms of this license.

Identifiers

10.5446/37600 (DOI)

Content Metadata

Abstract

Ruby makes it easy to prototype a new data model or codepath in your application and get it into production quickly to test it out. At GitHub, we've built on top of this concept with our open source dat-science gem, which helps measure and validate two codepaths at runtime. This talk will cover how we used this gem and its companion analysis gem to undertake (and complete!) a large-scale rewrite of a critical piece of our Rails app -- the permissions model -- live, side-by-side, and in production.

Transcript: English (automatically generated)

00:18

I'm Jesse Toth. I'm known as jesseplusplus on the internet,

00:23

and I'm here to talk to you about easy rewrites with Ruby and science. So a little bit about me before I get started. Like I said, my name is Jesse Toth, jesseplusplus anywhere on the internet that matters. I do a lot of backend Ruby stuff for GitHub,

00:40

related to our giant Ruby on Rails app. I like to do things that cross over with some database stuff, some Git stuff, some permission stuff, all sorts of fun stuff back there. So, onto the rewrites. I have to admit that I said easy rewrites, and maybe I lied, because you may know

01:04

that rewrites are never easy. I don't think I've seen any rewrite that's easy. In fact, a legitimate reaction to someone saying they want to do a rewrite is, "A rewrite? What? Why do you want to do that? That sounds like a terrible idea," because a lot of them fail. It's really hard to do a rewrite, first of all.

01:22

They take a long time. They take a lot longer than you expect. Most of them that I have seen, they don't ever finish. You keep rewriting and rewriting and rewriting, and it's never good enough, or it's never the same as the old system, and then maybe you throw it away, and you start a new rewrite. You rewrite the rewrite, and you just keep doing this. So a rewrite is a pretty scary thing to start.

01:43

And the rewrite that we did at GitHub was extremely scary, because it was way bigger in scope than anything that I've ever done in the past, but it was pretty successful, because of the tools we used. So this was what the rewrite was. We wanted to rewrite our permission system,

02:02

and we wanted to create a more flexible system to grant and revoke access to repositories, forks, issues, pull requests, teams, organizations, basically anything that's controlled by permissions on GitHub.com. And that's scary, right? That's pretty far-reaching. That affects just about every single page load

02:21

of GitHub.com and every single API request. So touching a lot of stuff. But to understand why this was necessary, I want to give you a little bit of history of what the system was before we decided to rewrite it and why a rewrite really seemed necessary. So first, there was collaboration.

02:40

When GitHub started, we let people collaborate with one another, and there was a feature where you could add someone as a collaborator to your repository and give them access to your code. And you could use pull requests or use other things to collaborate back and forth and work on the code together. And as GitHub grew, the original collaboration was not enough. It was basically just two or maybe three people

03:02

working together, but people started to have teams, or they started to put their companies on GitHub, and they needed more effective ways to organize this sort of access and permissions. So then we added organizations. These were ways to group your teams together and give them access to different repositories that your organization controlled.

03:21

But there were problems with this kind of from the start. One of the biggest problems was that these two systems, they came in at different times, and so they ended up having different ways of granting permissions to things. The old collaborators, they granted permissions one way. In fact, the table looked kind of like this. It was a super simple join table. It said, this user has access to this repository,

03:42

and that's it. But then when we did organizations, they implemented it a slightly different way, which was team members. It was this, if you look at this closely, you might be scratching your head, and you might notice that this is a three-way join table, which is pretty terrible. It's joining on a team, a user, and a repository.
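As a rough illustration of the two table shapes being contrasted here, the sketch below models them as plain Ruby structs. The struct names, the column names, and the idea that a team row leaves one of its two foreign keys empty are assumptions made for the example, not GitHub's actual schema.

```ruby
# Hypothetical row shapes; names and columns are illustrative assumptions.
Collaboration = Struct.new(:user_id, :repository_id)

# One row type links a team to EITHER a user OR a repository,
# so one of the two remaining columns is nil in every row.
TeamMember = Struct.new(:team_id, :user_id, :repository_id)

COLLABORATIONS = [Collaboration.new(1, 10)].freeze

TEAM_MEMBERS = [
  TeamMember.new(100, 2, nil),  # user 2 is on team 100
  TeamMember.new(100, nil, 20)  # team 100 has access to repository 20
].freeze

# Answering "can this user reach this repository?" needs both tables,
# plus a self-join through the team rows.
def legacy_access?(user_id, repo_id)
  direct = COLLABORATIONS.any? do |c|
    c.user_id == user_id && c.repository_id == repo_id
  end
  return true if direct

  team_ids = TEAM_MEMBERS.select { |m| m.user_id == user_id }.map(&:team_id)
  TEAM_MEMBERS.any? do |m|
    m.repository_id == repo_id && team_ids.include?(m.team_id)
  end
end
```

Even in this toy form, a single permission check has to consult two differently shaped sources, which hints at why each list-building call site ended up with its own query.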

04:01

So here, users could be a member of a team. That's how you said you're on a team. But a repository could also be a member on a team. That's how you said this repository gives access to these team members. It wasn't the best schema, and it caused us a lot of problems. Places where it started to cause us problems

04:21

were places where we needed to get lists of things. So there were a lot of places that needed lists of particular repositories, lists of pull requests you had access to, or scoped to this thing, lists of teams even. And they all needed to access the permissions in slightly different ways, depending on the kind of data that they wanted. So we had things like the repositories an organization controls.

04:41

You need to get those in one way, versus if you're going from a user's perspective, what pull requests can they access? Well, those might be different based on whether they have access to individual repositories as a collaborator versus through organizations. And as time went on, we found that there were a lot of bugs

05:01

around different edge cases and transitional states. We actually have a lot of transitional states in GitHub. You can be added and removed from a team. You can have your access removed. You as a user can transform yourself into an organization. You can transfer your repository to another user. You can do all these crazy things, and there were lots of bugs and lots of craziness.

05:21

So we started to see issues like this. People could see things on this dashboard that they didn't actually have access to. When they got these lists of pull requests, it said, you can access this one, but you would click on it and go somewhere, and it would 404 and say, oh, sorry. When we got to this page, we finally discovered that you no longer have access. And we kept seeing these sorts of problems.

05:42

In addition, we started to see a degradation of performance. As GitHub got more popular, there were more repositories to be pulling in. There were more issues. There were more pull requests. There was a lot of stuff to be grabbing. And each of these places, they all grabbed these lists in different ways, and they all started to have

06:00

performance problems at different times. So different people would come in and say, oh, there's a performance problem here. Let me optimize it. And each person did this a little bit differently in each place, but they all ended up kind of like this, with these giant hunks of optimized SQL. This isn't very pretty, right? And each of them was slightly different because a slightly different person had come in

06:21

and optimized this one, and it's grabbing slightly different data. So all of these things kind of compounded together and already gave a good reason to rewrite. But we had one more thing to add. defunkt, Chris Wanstrath, our CEO, said, you know what? Organizations aren't even good enough yet. We want to make them better.

06:42

But when looking at the permission system we have, we can't possibly add anything to it. It's already so complicated. If we wanted to, say, change our permissions, we can't do that. I'd like to do that. Let's find a way to do that. And so we said, okay, we have to do a rewrite to do that. If you want that, you must let us rewrite.

07:00

All this history that I've given you actually happened before I even joined the company. So I'm telling you the story of these two heroes right now. This is John Barnette and Rick Bradley. They started off at the beginning of the rewrite to replace the system and see if they could make something better. So they started off with some pretty simple goals.

07:21

They wanted something much simpler than what we had, and much more flexible so it could be extended to different things that in the future we may want to grant permissions for, just any general permission. We want it to be fast. We want it to be super fast. Some of these things had already had performance problems. GitHub was continuing to grow and seeing bigger and more complex use cases.

07:41

So it needed to be fast now so that it would continue to be fast in the future. And we also wanted to make it pretty easy to operate with the things that we already had. GitHub's pretty conservative about our operation. We don't like to add new databases or new technology. We tend to stick to what we have. So we said the old thing was in MySQL. Let's write this new thing as a table in MySQL

08:01

instead of maybe going for a graph database or something like that. So they started off with an initial spike. And they wanted to be able to spike something out quickly to test how it would perform with load data as quickly as possible so they didn't get too far into writing something and then realize that it wasn't gonna work. It turns out this was a pretty legitimate concern.

08:24

So they wrote something that they called capabilities. And this was gonna be the system that we're gonna use. And John started off writing this and saying, this is how it's gonna be. You ask the capability, can this user do this thing? Can this object do this thing?

08:41

But in order to test this, he was doing the rewrite and we also needed to do a refactor, because in order to test it out with production data, we needed a way to kind of shim it into the areas that were already reading permissions and just maybe dark ship it a little bit, run it, see what it was doing versus the old code. So while John was writing the new system,

09:02

Rick was trying to refactor just a few key touch points so that we could put this in and see what it would look like if we were to switch over to it. But he ran into a problem while he was doing this refactoring. He was finding places where there were some tests, and maybe there weren't as many tests as we wanted, but the problem with the tests was that they also weren't modeling production data.

09:21

We'd been seeing these scenarios that were really complicated and maybe they were from people that had been users since the beginning of GitHub and they had accumulated all this data and no matter how we tried, we couldn't get these test cases into our tests to show the same sort of things. So what he decided to do was to run a little bit of an experiment.

09:43

He wanted to conditionally execute a path that he had tried to refactor and see: did this refactoring return the same thing that the original did? That is usually what your tests do, but with such complex data, we said we actually need to test this in production to see if it really is doing the same thing.

10:03

So that's what he did. He basically dark-shipped this little refactor and he used the instrumentation library we have to just throw off some events anytime it was run. So he started off running it very few times, like 1% of the time maybe, comparing the results at the end and returning the original code's result,

10:21

but doing some timing and some measurement around what happened with the refactored path. This turned out to be a really useful pattern. He was able to see, oh, I didn't quite refactor this correctly. I forgot this little case, and he was able to fix that. He kept doing this and it turned out it was really useful. So we pulled this into a library

10:42

and called it Science and we made it available to everyone at GitHub because we said this is really useful actually. You should try sciencing everything. You should run experiments on all of your code and see if this works for you. So let me run through an example of what science looks like. We have a repository class in our models

11:03

and here's the question that we ask repositories a lot. Are you pullable by a user? Can they pull your code? And to put a science experiment in there, what you do is you say, I wanna make an experiment, give it a little name, a string, and then you take the old code that used to be in pullable_by?

11:21

and you put it into a new method. We came up with this convention of just taking the same method name and throwing legacy on the end. So pullable_by_legacy: you take that code in there and you say, okay, for the science experiment, I want you to use this legacy code but I also want you to try something new. Try out this new code that I did.
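The use/try pattern described above can be sketched in plain Ruby. The `Experiment` class below is a small stand-in written for this example, not the dat-science gem itself, and the `Repository` class, its method names, and the deliberately incomplete candidate are all hypothetical. The key property it tries to show is that the legacy (control) value is always what the caller gets back, while the new (candidate) path is only compared.

```ruby
# Minimal stand-in for an experiment helper (NOT the real gem).
class Experiment
  attr_reader :name, :mismatches

  def initialize(name, sample_rate: 1.0)
    @name = name
    @sample_rate = sample_rate # fraction of calls that also run the candidate
    @mismatches = 0
  end

  # Registers the existing code path, whose result is always returned.
  def use(&block)
    @control = block
  end

  # Registers the new code path, whose result is only compared.
  def try(&block)
    @candidate = block
  end

  def run
    control_value = @control.call
    if @candidate && rand < @sample_rate
      candidate_value = begin
        @candidate.call
      rescue StandardError => error
        error # a raising candidate counts as a mismatch, never a failure
      end
      @mismatches += 1 unless candidate_value == control_value
    end
    control_value
  end
end

# Hypothetical model; the candidate below is intentionally wrong.
class Repository
  attr_reader :last_experiment

  def initialize(collaborator_ids)
    @collaborator_ids = collaborator_ids
  end

  def pullable_by?(user_id)
    experiment = Experiment.new("repository.pullable-by")
    experiment.use { pullable_by_legacy?(user_id) }
    experiment.try { pullable_by_new?(user_id) }
    @last_experiment = experiment
    experiment.run
  end

  private

  def pullable_by_legacy?(user_id)
    @collaborator_ids.include?(user_id)
  end

  # "Refactored" path that forgets everyone after the second collaborator.
  def pullable_by_new?(user_id)
    @collaborator_ids.first(2).include?(user_id)
  end
end
```

Passing something like `sample_rate: 0.01` would correspond to running the candidate roughly 1% of the time, as in the early dark-shipped refactor.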

11:42

This is just an experiment. We wanna see if it's gonna work. And you can add some useful things to it like context. We said, here's the repo we're trying, here's the user we're trying. So if these things don't match, then we can use that context to go back and investigate and see what was happening. So each experiment would publish this

12:01

and we pass that to our instrumentation library and we were able to gather some really neat things here. So we could see the total time or the total amount that we were running this. So how many times did this get called? We grabbed some timing data around it. So how long did the old code path take? How long did the new code path take?
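A sketch of the kind of bookkeeping described here: a run counter, timings for each path, and a payload that carries the context. The `Metrics` class is an in-memory placeholder for an instrumentation backend, and the metric key names are assumptions chosen for the example.

```ruby
# In-memory placeholder for an instrumentation backend (hypothetical).
class Metrics
  attr_reader :counters, :timings

  def initialize
    @counters = Hash.new(0)
    @timings = Hash.new { |hash, key| hash[key] = [] }
  end

  def increment(key)
    @counters[key] += 1
  end

  def timing(key, seconds)
    @timings[key] << seconds
  end
end

# Runs the block and returns [value, elapsed_seconds] using a monotonic clock.
def timed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  value = yield
  [value, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
end

# Runs both paths, records how often and how long, and returns the
# control value together with the payload that would be published.
def run_experiment(name, metrics, context, control:, candidate:)
  control_value, control_time = timed(&control)
  candidate_value, candidate_time = timed(&candidate)

  metrics.increment("science.#{name}.total")
  metrics.timing("science.#{name}.control", control_time)
  metrics.timing("science.#{name}.candidate", candidate_time)

  payload = {
    experiment: name,
    context: context,
    control: control_value,
    candidate: candidate_value
  }
  [control_value, payload]
end
```

Keeping the context (here, whatever identifies the repository and user) inside the payload is what makes a later mismatch investigable.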

12:20

We also threw a custom event whenever things mismatched. So when things did not match, increase the total of how many went wrong so you can keep track of how often you're mismatching. And then we just did something super simple where we just dumped the payload into Redis so we could go and look at it later and said, okay, if we have a mismatch,

12:40

we wanna go and investigate that data and see what went wrong and use that to go back and change the code. So they used this process over and over again and they were able to get a decent spike out. Now it didn't quite work. It had some performance problems so they stopped and they threw it away but they stepped back and took the lessons

13:00

that they learned from that to build a new system. So this is the part of the story where I come in. I had just joined GitHub at this time. In fact, Nathan and myself both joined GitHub at this time and we were asked if we'd like to join this project. And we sat down with Rick and John and we talked through some of the lessons

13:21

that they learned from capabilities and said, okay, we wanna build a new system. We'll give it a new name too so we don't confuse ourselves. We're calling it abilities. And so we're gonna take everything that we did wrong there and we're gonna build this new system and we really want to actually put it in now. I think we're ready. So what we came up with was something super simple like this

13:41

and it's basically you ask a question to the system. It says, can this user read this repository? And we generalized it quite a bit so you have a general actor. It could be a user, it could be a team, it could be an organization and you have a subject. That's a repository in most cases but sometimes it's a team or something else. So you can ask questions about it.

14:02

You can grant things. So a subject will grant an actor a specific action and you can revoke those things. So this was super simple. This was basically all we came up with. We had to add a little bit more. We had situations where we have users and teams and repositories. So if you grant a team access to a repository

14:21

and a user access to a team, we wanted that to cascade so that the user got the access to a repository. But beyond that, that was the whole system. That was it. And we thought it was super simple and it would maybe break down some of those huge queries that we were seeing. So we went through this and we wrote the core of abilities

14:40

the actual rewrite in maybe a few months. There was a bit of iteration on it and figuring out what we wanted to do. We went off into the weeds a little bit. We tried to make it too general and then we came back and said, no, we really need this for this specific case. Let's not go too crazy. But that actual rewrite didn't take very long. What did take long was the next part which was modeling that to our legacy data.
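The abilities model described above (an actor, an action, a subject, with grant, revoke, and a cascade through teams) can be sketched in a few lines. This is an in-memory toy under stated assumptions: the class name, the use of symbols for actors and subjects, and the single-level cascade rule are illustrative choices, not the real implementation.

```ruby
require "set"

# Toy model: a subject grants an actor an action.
class Abilities
  def initialize
    @grants = Set.new # [actor, action, subject] triples
  end

  def grant(actor, action, subject)
    @grants << [actor, action, subject]
  end

  def revoke(actor, action, subject)
    @grants.delete([actor, action, subject])
  end

  # True for a direct grant, or when the actor holds any grant on an
  # intermediate (such as a team) that itself holds the requested grant.
  def can?(actor, action, subject)
    return true if @grants.include?([actor, action, subject])

    @grants.any? do |(grantee, _action, intermediate)|
      grantee == actor && @grants.include?([intermediate, action, subject])
    end
  end
end
```

For example, granting a team `:read` on a repository and a user `:read` on that team lets the user read the repository, and revoking the team's grant removes the user's access in the same step.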

15:02

So once it was written, we said, okay, we need a way to see if the data generated by this system is the same as the old system. So we just wrote some little migration scripts like in the beginning it was the GitHub org. Run through the GitHub org and for every user and team and repository on there, try to generate the type of permissions

15:22

in the new ability system that it has based on the data that we have in the old system. And when we ran through that, we saw a few problems, we fixed them up. But then we opened that up and started running the migrators for everybody on github.com. So let's generate the data for everyone and see, does this match our old system?

15:41

And after generating the data, of course, we started off with generating the data and that was good, but data sometimes changed between the time we generated it and the time we were measuring. So then we added places where we were dark-shipping writes to abilities. So anytime you touched the old system, we said, write this also to abilities. Write a new record or if you're removing something,

16:01

delete the record. Just do both of them at the same time. And we kept this dark ship scaled down a lot. We would do it maybe 10% of the time or something like that. We always wrote but we didn't always read. We didn't want to put too much load on the system. But once we had that in, we wanted to science everything.
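The "always write, sometimes read" approach described above might look roughly like this. The wrapper class, its method names, and the use of sets of `[user, repo]` pairs as the two stores are assumptions for illustration only.

```ruby
require "set"

# Hypothetical wrapper: every write goes to both stores, reads are always
# answered by the legacy store, and only a sampled fraction of reads is
# also checked against the new store.
class DualWritePermissions
  attr_reader :mismatches

  def initialize(legacy, abilities, read_sample_rate: 0.1)
    @legacy = legacy       # Set of [user, repo] pairs (old system)
    @abilities = abilities # Set of [user, repo] pairs (new system)
    @read_sample_rate = read_sample_rate
    @mismatches = []
  end

  def add(user, repo)
    @legacy << [user, repo]
    @abilities << [user, repo] # dark-shipped write, always performed
  end

  def remove(user, repo)
    @legacy.delete([user, repo])
    @abilities.delete([user, repo])
  end

  def access?(user, repo)
    legacy_answer = @legacy.include?([user, repo])
    if rand < @read_sample_rate
      new_answer = @abilities.include?([user, repo])
      @mismatches << [user, repo] unless new_answer == legacy_answer
    end
    legacy_answer # callers never see the new system's answer
  end
end
```

Writing on every change keeps the new store from drifting, while sampling the reads limits the extra load the comparison adds.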

16:22

So we wanted to add, oops. Did that go? And we wanted to add science to everything. So any place that we read data out of the permission system, we added a science experiment. And we said, okay, keep reading the old system but now start reading the new system and tell us what the differences are.

16:40

And that's the point at which we could start looking at all this data that we had generated and seeing what we had. So it looked kind of like this. We built this dashboard to show the Graphite data that we had. Our instrumentation all goes into Graphite. So we had graphs and stats on how many mismatches we had,

17:04

how much is this running, how many total things. And we could, at a glance, see how all our experiments were doing. So this was a health check. Every morning I would get up and I'd say, okay, how's abilities doing this morning? And I would look at one specific experiment. I'd drill down in and I'd say, okay, how's pullable by doing this morning?

17:22

Well, we're running quite a few of them. You can see the top graph is how many total times the experiment has been run. And then it has a little line on the bottom for wrong. But because there's such a huge scale between wrong and total, you can't really see it. So we made a more zoomed in graph below that, which is how many mismatches you have.

17:43

And then we also cared a lot about performance, especially because we had this dark ship. We can't have this being very slow. So we also had graphs for what is the performance of the new thing that you were trying out versus the old code. And we just kept looking at this. And once we saw the mismatches,

18:01

then we wanted to actually analyze what we had. So we said, okay, there's about 20 mismatches per hour. What are they? What's happening? What's going wrong? So at this point you can, we did this super simple. Like you can just jump into the console and pull these things out of Redis and say, okay, how many times has pullable by mismatched? Oh, we've got about 3000 results

18:21

waiting for analysis there. And you can just pop each result off. And what it looked like was a super simple hash. It said, this is the experiment we're running. And then it had some things for the control, which was the legacy code, and the candidate, which was the new thing that you were trying. So it would show you how long did it take?

18:42

Did it raise an exception? And what was the value returned? In the case of pullable by, this was just a Boolean method. So it was true or false. But with the context that we added to this, so we had the repository and the user, then I could go into the Rails console and I could start investigating. I say, okay, what happened to this user?
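To make the shape of such a result concrete, here is a sketch that builds a mismatch payload with a duration, an exception, and a value for each side, and queues it for later inspection. The `MISMATCHES` array stands in for the Redis list, and the field names are guesses based on the description, not the actual payload format.

```ruby
# Array standing in for the Redis list of stored mismatches (assumption).
MISMATCHES = []

# Captures how long the block took, what it raised, and what it returned.
def observe
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  value = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  { duration: elapsed, exception: nil, value: value }
rescue StandardError => error
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  { duration: elapsed, exception: error, value: nil }
end

# Runs both paths, stores a payload when they disagree, and behaves
# exactly like the control from the caller's point of view.
def compare(name, context, control:, candidate:)
  control_result = observe(&control)
  candidate_result = observe(&candidate)

  disagree = control_result[:value] != candidate_result[:value] ||
             control_result[:exception].class != candidate_result[:exception].class
  if disagree
    MISMATCHES.push({
      experiment: name,
      context: context,
      control: control_result,
      candidate: candidate_result
    })
  end

  raise control_result[:exception] if control_result[:exception]
  control_result[:value]
end
```

Popping an entry off the list then yields one hash holding everything needed to replay the case by hand: the experiment name, the repository and user from the context, and both observations.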

19:00

I could walk through the legacy code of pullable by line by line and say, okay, well it matched this condition, this condition, this condition, and this is where it went to mismatch. What went wrong there? So what we found were a few things. There were definitely bugs in abilities at first. Like I said, we hadn't completely modeled

19:21

the old system correctly to begin with because we didn't even have the whole thing in our head to begin with. So the first time that happened, we would fix a bug and then we would say, okay, we've run our migrators, we had all this data in the database, let's just truncate the whole table and rerun the migrators and fill it up again because at that point it was a bug

19:42

in the system itself, not anything else to fix. But once we got past that, that didn't take too long, then we had something more, which was problems with our data. In fact, we ran into a lot of data quality problems. I can show you a, this is gonna go, sorry.

20:05

So here's a sampling of the data quality issues that I saw. There were a lot of them. And this is where we spent probably the bulk of our time. We found quite a few problems in the data that we had in the database

20:20

for the old stuff. People had something wrong. Maybe they fixed the bug, but they didn't know they had generated a whole bunch of old data that was really bad and they didn't clean it up. And that data just kept going. It interacted with other data and got uglier and dirtier and more terrible. And so we had to track it down, each and every case, and find out why it got that way,

20:41

how it got that way, and fix all of them. When we first saw these data quality problems, we thought, maybe it's one or two things. We can just ignore it, switch over to the new system. It'll be fine. No. No, it was definitely not fine. There were a huge number of data quality problems. So we said, we need to fix this in the old data in order for it to generate new data correctly.

21:01

And we need to be sure that they're both matching and true and correct for the right reasons. And because we did this so often, we ended up writing a lot of tools to help with data quality. We wrote a library for running data transitions because we did it so frequently. It used to be we'd just run Rails migrations, but that was too brittle for us.

21:21

We needed to be able to run them in parallel. We needed to be able to run a lot of them at the same time. We needed to be able to throttle them. There were times when we were deleting millions of rows and we didn't want to be attacking our database just because we're doing this maintenance to clean it up. So we built throttlers to run this slowly over time and get rid of it without anybody noticing.
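As a rough illustration of the throttling idea, here is a minimal sketch in plain Ruby. The `ThrottledTransition` class, its parameters, and the in-memory "table" are all hypothetical stand-ins, not the library described in the talk; it only shows the general shape of working through a large delete in small batches with a pause between them.

```ruby
# Hypothetical throttled batch runner (an assumption for illustration,
# not the data transition library from the talk).
class ThrottledTransition
  # batch_size: how many ids to hand to the block at once
  # pause:      seconds to wait between batches
  # sleeper:    injectable so the pause can be replaced in tests
  def initialize(batch_size: 1000, pause: 0.5, sleeper: ->(seconds) { sleep(seconds) })
    @batch_size = batch_size
    @pause = pause
    @sleeper = sleeper
  end

  # Yields successive batches of ids, pausing before every batch except
  # the first. Returns the total number of ids processed.
  def run(ids)
    processed = 0
    ids.each_slice(@batch_size).with_index do |batch, index|
      @sleeper.call(@pause) if index > 0
      yield batch
      processed += batch.size
    end
    processed
  end
end

# Usage with an in-memory array standing in for rows to delete.
rows_to_delete = (1..10).to_a
deleted = []
transition = ThrottledTransition.new(batch_size: 4, pause: 0.01)
total = transition.run(rows_to_delete) { |batch| deleted.concat(batch) }
puts "deleted #{total} rows in throttled batches"
```

In a real application the block would issue a bounded delete against the database instead of appending to an array, and the pause would be tuned against observed database load.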

21:42

We also had problems with just the legacy system itself. There were some things that just weren't thought out. There were features that were added together that didn't work together properly. In particular, I ran into a problem with networks of repositories and forks. When they had different visibilities, some were public, some were private,

22:00

the permissions were just totally messed up. And so I had to stop working on abilities and I had to go fix that. I had to fix the code. I had to fix the data. I had to contact a bunch of users and tell them that I was gonna do some crazy things with the permissions on their forks. It was long and it was involved, but we had to do it to move forward. Otherwise, the system, it just wouldn't have worked. We wouldn't have been able to generate the right thing.

22:23

So we got through all that and then we ran into some other stuff. So like I said, this was dark-shipped and we were watching the performance the whole time. In general, the performance was good, but we started to see some interesting cases. The graph up there in the top

22:41

showed something that started to happen to us every day at 5 p.m. We had this one particular customer that was doing something very interesting with our API and they were mucking with the permissions every day at 5 p.m. And they had teams and repository sizes that were much larger than we had dealt with before. And they were making a lot of changes.

23:01

They were basically trying to delete all their permissions and put them all back, one right after another, which sounds terrible, but we ought to be able to handle it. And we weren't. Abilities wasn't handling it. The old system seemed to do it just fine, but abilities wasn't handling it quite well enough. So we went back and we reworked abilities.

23:21

We were blowing out a lot of stuff into the database and we found there's a way that we could infer certain things. We didn't need to write rows. And what that ended up doing was we were able to delete 72 million rows of these things. And you can see the graph that Nathan shows there. After he finished this the next day, there was no blip at 5 p.m.

23:40

So having that dark ship was super helpful for us. We could work on the performance over time and then we could see things as soon as they happened. This wasn't, we ship it and then we walk away and we say, we don't care anymore. And then months down the line, something crazy happens. We could see this slowly as we were ramping up. We could even turn abilities off completely

24:00

so that if we really didn't want this to happen, we could just turn it off and not have performance problems at all. So that was really helpful for us in developing that. So at this point, where are we? We've done a rewrite and refactor and a bunch of science and a huge amount of data quality repair and some performance.

24:20

And it's taken a long time. This is probably close to two years in. And this is where our progress is right now. None of these things are using abilities. But we just kept going. We kept doing that loop over and over again. Have the data in there, read it out, find the mismatch,

24:42

just find the data quality problem, fix it. It's over and over again. This was my life for months and months and months. But finally, finally, we were able to start flipping things over. And we did them piece by piece. We started with organizations. So check, we got those. Then we did teams. That wasn't too bad, check. But the big thing that was left was repositories.

25:02

And this was the biggest thing that got us into this problem in the first place. And it was the place where we had the most data quality issues. So it was gonna be the hardest. It was gonna take the longest. We were expecting this. But we kept working through it. And eventually, it got to the point where the science was mostly green.

25:20

There was one last data quality issue. And the problem was that when users were being deleted, they weren't being removed from teams, so that wasn't being cleaned up. It was just some legacy data sitting there, and we weren't gonna allow that in the new system. So I said, okay, I'm gonna write a data transition like I have before to clean this up. And I'm gonna run it.

25:41

And I ran it. And I queried something afterwards. And I said, shit. That doesn't look right. I ran another query. Oh shit, oh shit, shit, shit. I just deleted every single repository from every single team. I mean, this is the reason we wanted to refactor this. This three-way join table was so bad

26:02

that it was really hard to write the correct query for it. And it bit me. One last time before I could get out, it bit me in a really hard way. And so you can see, I mean, we put up a status. We said, some of you may not be able to access your repositories.

26:21

But we did have abilities. And I said, you know what? We have backups for our database and our database administrator is fantastic. Like he was right on it. He's like, okay, we'll get this. We'll get all the data back in. And I said, while you're doing that, I'm gonna turn abilities on. I'm gonna switch it over to the new system because it's pretty much ready and right now it's more correct than the old system.

26:41

So I did that. We got to have a little bit of a fire drill with abilities. And so they got all the backups there and got all the data back. And I said, okay, I'm gonna revert that because I'm just not confident that we should leave it on. I wanna be absolutely sure we should leave that on. So once I cleaned up my mess

27:01

and got all that ready, I went back and I looked at the experiments. I said, well, we're all green now. That was the one last data quality thing I had to fix. But I was a little gun shy, especially after what had just happened. So I said, all right, I wanna be really sure. I wanna be scientific about this. I wanna do something that runs through

27:20

every single user and repository and calls this, to be sure that there's no bad data sitting there that we just haven't hit because nobody's brought it up in production. So I made this transition. I said, I'm just gonna iterate through each of them. I was expecting data quality problems to come up. I was like, this isn't gonna be the end. I'm gonna find something else, I'm sure.
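The exhaustive check described here can be sketched roughly as follows. Everything in this example is an assumption for illustration: the `find_mismatches` helper, the lambdas standing in for the legacy and abilities permission checks, and the tiny in-memory data set. The point is only the shape of the sweep: ask both systems the same question for every user and repository pair and record every disagreement.

```ruby
# Hypothetical full sweep: compare the old and new permission answers for
# every user/repository pair and collect the disagreements.
def find_mismatches(users, repositories, legacy:, abilities:)
  mismatches = []
  users.each do |user|
    repositories.each do |repo|
      old_answer = legacy.call(user, repo)
      new_answer = abilities.call(user, repo)
      next if old_answer == new_answer

      mismatches << { user: user, repo: repo, legacy: old_answer, abilities: new_answer }
    end
  end
  mismatches
end

# Made-up data: the two systems disagree about bob's access to "api".
legacy_grants  = { "alice" => ["web"], "bob" => ["web", "api"] }
ability_grants = { "alice" => ["web"], "bob" => ["web"] }

legacy    = ->(user, repo) { legacy_grants.fetch(user, []).include?(repo) }
abilities = ->(user, repo) { ability_grants.fetch(user, []).include?(repo) }

mismatches = find_mismatches(%w[alice bob], %w[web api], legacy: legacy, abilities: abilities)
mismatches.each { |m| puts "mismatch: #{m[:user]} on #{m[:repo]}" }
```

An empty result from a sweep like this over the full data set is the condition the speaker was looking for before switching over.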

27:42

But I didn't. I ran through all of it. And there was not a single mismatch while I ran through it. And so I said, all right, guys, it's time. Let's switch it over. And the way we did this with science is just to be sure we would actually switch the use and the try blocks. So we'd keep everything the same

28:02

but switch what the control and candidate were doing just to be sure that we could still measure performance and see that there were no mismatches happening after the switch. And then once we were extremely confident, then we would remove the science. So at this point, we've done it. Everything, organizations, teams, repositories, it's all using abilities.
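To make the switch concrete, here is a minimal sketch of an experiment whose control and candidate can be swapped. The `MiniExperiment` class is hypothetical: it borrows the `use` and `try` vocabulary from the talk, but it is not the API of the dat-science or scientist gems, and the two lambdas are placeholders for the real permission checks.

```ruby
# Hypothetical, minimal experiment runner. It borrows the use/try names
# from the talk but is NOT the dat-science or scientist API.
class MiniExperiment
  attr_reader :mismatches

  def initialize(name)
    @name = name
    @mismatches = []
  end

  # Control: its result is what callers actually receive.
  def use(&block)
    @control = block
  end

  # Candidate: runs alongside the control; its result is only compared.
  def try(&block)
    @candidate = block
  end

  def run
    control_value = @control.call
    begin
      candidate_value = @candidate.call
      unless candidate_value == control_value
        @mismatches << { control: control_value, candidate: candidate_value }
      end
    rescue StandardError => error
      # A failing candidate must never affect what the caller sees.
      @mismatches << { control: control_value, error: error.class.name }
    end
    control_value
  end
end

legacy_check    = -> { true } # placeholder for the old permission check
abilities_check = -> { true } # placeholder for the new permission check

# Before the switch: the legacy code is trusted, abilities is only observed.
before_switch = MiniExperiment.new("repository-access")
before_switch.use { legacy_check.call }
before_switch.try { abilities_check.call }

# After the switch: same experiment, blocks swapped, so abilities answers
# for real while the legacy code keeps running as a cross-check.
after_switch = MiniExperiment.new("repository-access")
after_switch.use { abilities_check.call }
after_switch.try { legacy_check.call }

puts before_switch.run
puts after_switch.run
```

Swapping the blocks rather than deleting the experiment keeps the comparison and timing data flowing after the cutover, which matches the reasoning given in the talk for removing the science only once confidence was very high.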

28:21

This happened mid-August. And besides that little blip where I removed your access to repositories via teams, you shouldn't have noticed. We open sourced this library because this was tremendously useful. I cannot even imagine doing this sort of rewrite

28:40

without having this tool. I mean, it helped us to write the original library, helped us to find massive amounts of data quality problems. It helps with problems in just the logic itself. I don't know how else we would have done it without this. So if you're doing a big rewrite like this, I would recommend that you use something like scientist or use some sort of data and graphs

29:02

and be very, very sure. And I hope that you use it. Thank you.
