Predicting PR Comments
Formal Metadata

Title: Predicting PR Comments
Number of Parts: 561
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/44490 (DOI)
Transcript: English(auto-generated)
00:06
Okay, so next up I'm very happy to introduce two of my friends that are going to be talking about how they try to use machine learning and maybe it worked, maybe. So, a round of applause for Holden and Chris.
00:24
Hi everyone. So, I'm Chris, this is Holden. We did some work on trying to predict PR comments with GitHub. And we're going to be talking about it today. So if you want to, we put this up earlier, but this is for the stream for folks at home. If you want to go to github.com/frank-the-unicorn/fosdem, it's a repository
00:46
and you can open up a pull request there with any old open source code you want. And we'll actually iterate on that and process it live today and we can actually talk about the machine learning we're doing behind the scenes. So if you want to play along at home, feel free to open up a PR there. So a little bit about me. I contribute to a lot of open source, primarily Kubernetes and a lot of stuff in the Go community.
01:07
I've written a book and I just engineer a lot of things, I guess, I don't know. Anyway, I helped Holden here with the Kubernetes bit of things as we started to go down the machine learning path. And I work for VMware.
01:22
Cool. I'm Holden. I contribute to other projects, mostly in Scala. I'm an author but of different books and I work on different projects and I work for a different company. Thank you glorious employer for paying my salary. Despite all of this being mountains, I don't climb mountains and also that's not my Twitter handle at the bottom.
01:43
This is Nova's slide template. I climb a lot of mountains and this is my slide template. We're going to look at mountains too today. I hope that's okay. So this all started four months ago, three months ago. We got together and we were brainstorming ideas for machine learning and Holden had the idea
02:02
that we might be able to scrape data from GitHub and try to predict the lines of a pull request where people might make a GitHub comment that says, hey, maybe this isn't the best idea or something like that. And the more we thought about it, the more we realized that GitHub stores a lot of information about developer habits. These are things that people commonly do and then people who are reviewing pull requests all the time, like Holden with Spark,
02:24
there might be some patterns behind the cognitive behavior there and we were wondering if we could train a model to represent that. So we thought maybe we could learn about what normally gets attention from humans in GitHub. And maybe we could use our skills in Kubernetes and Kubeflow and machine learning to build a pipeline
02:41
to predict where these pull request comments are most likely to show up. So our very high level, super high quality design document, as you can see here, is step one, extract the data. Step two, build the model. And step three, serve the model.
03:03
And then step four, just in case step three resulted in garbage, is to collect explicit and implicit feedback. And so the idea is that you can thumbs up and thumbs down pull request comments. And so the bot will make comments and we can record the interactions. And also we can do sentiment analysis and if people start swearing at Frank the unicorn a bunch,
03:24
maybe Frank the unicorn is not the greatest unicorn. It would be very sad. Also we don't use Kubeflow because I couldn't get my PR to add Spark to Kubeflow integrated in time. But I really wanted to. I have a PR and it will happen next week.
03:44
Cool. Okay, so we started to look at building this thing out. So the first thing I was tasked with was building some sort of data extraction from BigQuery. And I wrote a program in Go that we can look at later that can run as a single cron job to export a year's worth of data,
04:03
but we also have it running in Kubernetes as a Kubernetes operator that will go and audit BigQuery and see if anything's changed and automatically update the record behind the scenes if it has. So we have this sort of resilient ongoing update of data coming out of GitHub as well as the ability to backfill data back to 2014.
04:21
Although 2014 wasn't like the best year for GitHub data. So 2015 moving forward is really as far as we got. And then the BigQuery side of things, this is where Holden really took over. Yeah, so I utilized my wonderful skills of writing SQL which I pretend not to have most of the time to avoid writing SQL queries.
04:42
But I needed some data unfortunately. And so yeah, the SQL query wasn't so bad. The main thing is that GitHub changed their format a bunch of times. And the BigQuery table that represents the GitHub data represents it at the format of the time the event occurred rather than the current format.
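The same schema-drift problem shows up on the Go side of the pipeline too: older and newer GitHub event payloads name things differently, so the extractor has to normalize both shapes. Here's a minimal sketch in Go; the field names (`repo.name` vs. `repository.full_name`) are illustrative of the kind of drift involved, not a spec of the exact BigQuery columns the talk queried.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// extractRepoName pulls the repository name out of a GitHub event,
// tolerating two payload shapes. The field names here are illustrative;
// the real schema drift lives in the BigQuery GitHub tables.
func extractRepoName(raw []byte) (string, error) {
	var ev struct {
		Repo struct {
			Name string `json:"name"`
		} `json:"repo"`
		Repository struct {
			FullName string `json:"full_name"`
		} `json:"repository"`
	}
	if err := json.Unmarshal(raw, &ev); err != nil {
		return "", err
	}
	if ev.Repo.Name != "" {
		return ev.Repo.Name, nil // newer event shape
	}
	return ev.Repository.FullName, nil // older event shape
}

func main() {
	for _, raw := range []string{
		`{"repo":{"name":"frank-the-unicorn/fosdem"}}`,
		`{"repository":{"full_name":"frank-the-unicorn/fosdem"}}`,
	} {
		name, _ := extractRepoName([]byte(raw))
		fmt.Println(name)
	}
}
```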
05:02
So it's an exciting opportunity to write a bunch of if-else conditions in your SQL expression, which is just a really great way to spend your Sunday afternoon. So after that exciting joy of writing a very imperative style SQL somehow, I started to build the brain for Frank the Unicorn.
05:23
And the brain is built in Scala with Spark. This was a mediocre choice but informed by the fact that I work on Spark for my day job. And I could probably get away calling that work if anyone asked too closely. And this is, you know, it had some good benefits.
05:41
It allowed us to train our models in parallel and do kind of cool things there. And it also let us do a bunch of sort of data filtering things there too. Once we built the brain, unfortunately, I remembered why I don't build machine learning models in Spark very often. And it's because serving them really sucks.
06:00
So sadly the serving layer is written in Scala and Spark. And we spent the last, I don't know, six hours getting Scala, Spark, and gRPC to all not fight with each other too much. And that was just a really great way to spend a Saturday night.
06:21
So after we had the Scala and Spark side of things set up, we needed a way to plug that into GitHub. And that's where the Go came out again. And I wrote another Go program, the second one. And we interface with Holden's Scala over gRPC. Her Scala is the server and the Go is the client.
06:41
The Go also serves as an HTTPS server that the GitHub event API sends a webhook POST request to whenever there's a pull request event. So if you were to open up a PR, the Go program that's listening on HTTPS will get a big blob of data about your PR pushed to it. And then that'll go and open up a connection to Holden's Scala side of the fence,
07:04
and start to iterate and do the backend processing. If all goes well, that's going to come back to the Go program, and then that Go program is going to go leave a comment on your PR. Yeah, so that's sort of the high-level way of things we're working out. We do have a diagram.
07:20
So this is how it all kind of fits together. And I'll walk folks through this. It looks really complicated, but it's really not that bad. So the BigQuery data comes into the extract operator that runs continually. All of that gets pushed up to Google Cloud Storage as CSV files, and that's an atomic transaction whenever that goes in.
07:40
So whatever's reading from it can guarantee that it's a complete dataset. Then we go into the Spark Scala training bit, and then there's the Scala gRPC server. And then we use the domain fabulous.f using Kubernetes ingress to serve a public HTTPS endpoint that the GitHub Events API pushes to. And all of that sort of kick-starts down here by installing Frank the Unicorn on a GitHub repo.
08:05
And then if you open up a pull request, it kicks off the other chain of events that will eventually circle back around back to the pull request at the beginning. Okay, so I'm going to talk a little bit about the components and what we learned as we were building them.
08:20
So as I was building the data extractor, we really wanted it to run in Kubernetes, and we wanted to make sure it could go and check new records. So I think the big takeaway there was we didn't really gain much value from running it as an operator in Kubernetes. It sounded like a good idea at the time because we thought we would be getting meaningful data on a day-to-day basis, but it looks like we probably got a lot more
08:41
just by backporting over the last couple of years. Furthermore, as we were going through the GitHub data, we realized that it really wasn't clean at all. So the majority of the Go program just turned into data sanitization and just checking values and making sure that it was all going to fit nicely together. And then the atomic part of things was important as well, and Go was pretty okay at making this happen
09:02
because we could use a mutex to make sure that if we were writing a file to CSV that it wouldn't undo itself before the file was done and written. Let's see what's next. Oh, so the suggestor, this one was fun. So I got to write an in-memory concurrent queue because we had two parts of the program happening at the same time.
09:23
The first one was the HTTPS server, and then concurrently we were running the GRPC client that was talking to Scala and Spark. So the way that this worked is the HTTPS server would get a post request to it and stick it in a queue, and then concurrently in the same process the GRPC client would go and pop something off the top of that queue
09:42
and then go process it in the background. So we're sort of seeing this really interesting concurrency double-server client pattern that Go allowed us to do that was exciting. Furthermore, we had the patch view for comments. This was what we learned about the GitHub API. If you want to leave a comment on a PR, it's not as simple as saying leave a comment on line 12. You actually have to go through and calculate based on the patch view
10:03
how many lines down from the previous change you want to leave a comment. So there's a little bit of math that we had to do for every one of these. Furthermore, we used Contour for ingress because we had to serve this whole thing publicly or GitHub wasn't going to send us any events. So that was exciting to get to use Kubernetes ingress to solve that problem.
10:24
Cool. And so to make this all work, we needed to train a model. Otherwise, it wasn't going to be very useful. So we trained this with Spark on Kubernetes. There's a whole bunch of different kinds of classification models built into Spark. I tried a bunch of them under the principle of, eh, why not?
10:43
And for the most part, they all kind of performed similarly poorly. Gradient boosting trees performed a little better than most of them. And so that's the one that we just went with. And the first iteration of this, we performed a little better than guessing but not a lot.
11:00
That was really depressing. We added some more features and we got more better but not much more better. And so this was a kind of sad start to the project from my point of view where I was like, oh, I have a model but it's garbage. Okay. Right. And if anyone here has run Spark, you know that part of running Spark is collecting out of memory exceptions.
11:27
They're like Pokemon except they're like the really crappy Pokemon because they just show up everywhere. Our favorite oom that we collected was the container oom kill. And this is because it happened often and there were no logs.
11:42
So unless you were watching the pod status, you'd just be like, why are my executors disappearing? There's no log messages. And that took me several days of trying and then eventually asking Nova. Yeah. And then I got on the phone and fixed it in like half an hour. It took you? Yeah. Okay. Fine.
12:00
And that was good. And then the rest of them were sort of the standard OOMs that we get with Spark. And I'm well aware of, if maybe not happy with, JVM worker heap space out of memory exceptions. But that's just my life. That's my jam. The driver out of memoryed a bunch of times with some of the models during training.
12:20
And this is because we essentially used the driver as a parameter server during training. And for some of the models, that was just too much memory because it wasn't cleaning up very well. And that was kind of sad because we weren't training on huge stuff. So it was a little depressing. But it's okay. Okay. So we used a bunch of different features.
12:43
Word2vec was the one I started with because of the source{d} tech blog post about id2vec embeddings. I was like, that sounds like a cool encoding. Maybe I can do this. And then I did. And then the features. I mean, I'm sure they're good features, but they weren't good for this. But it was our starting point. I also tried TF-IDF, which is a standard document retrieval thing.
13:04
Not super useful. And then it started to become things that look suspiciously similar to what a linter might be looking at, just fed in as features to a decision tree. Oops. So lines that are all spaces, the percentage of spaces, what language it's written in.
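A few of those per-line features (blank lines, fraction of spaces, raw length) are simple enough to sketch. This is an illustrative stand-in in Go, not the project's actual Scala feature pipeline:

```go
package main

import (
	"fmt"
	"strings"
)

// lineFeatures computes a few of the linter-ish per-line features the
// talk mentions: whether the line is blank, the fraction of whitespace
// characters, and the raw length (long lines predict comments too).
func lineFeatures(line string) (isBlank bool, spaceFrac float64, length int) {
	length = len(line)
	isBlank = strings.TrimSpace(line) == ""
	if length == 0 {
		return true, 0, 0
	}
	spaces := 0
	for _, r := range line {
		if r == ' ' || r == '\t' {
			spaces++
		}
	}
	return isBlank, float64(spaces) / float64(length), length
}

func main() {
	for _, l := range []string{"", "    ", "x := compute(y)"} {
		blank, frac, n := lineFeatures(l)
		fmt.Printf("blank=%v spaceFrac=%.2f len=%d\n", blank, frac, n)
	}
}
```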
13:24
And the last one is the one that I think is interesting. We also looked at the GitHub issues associated with the project. And this one actually turned out to give us one of our bigger performance boosts. If a stack trace ended up touching the lines that were being changed,
13:43
there's a higher likelihood that that is sort of complex or confusing code that people will want to ask questions about. And so that ended up being a pretty good feature. And there's some ways and ideas on how we can extend it from there. Yeah, so we did some hyperparameter tuning. And for the most part, more trees approximately better.
14:03
And then we got diminishing returns pretty quickly, around like 20. The other things we did didn't really make all that much difference. And before anyone gets worried, and you should not be worried that our performance numbers are juiced based on how incredibly bad they are. But I did save a test set before I started doing my hyperparameter tuning.
14:24
So it is valid and legit. OK, so yeah, this is the slide of sadness. So like a good score here would be two orders of magnitude higher. So that was not great.
14:41
But for the data set that we have, it's like super imbalanced. It turns out that people really, as a percentage basis, do not leave a lot of comments on PRs. And so the random guessing score would be like 0.003. So 0.09 is three times better than just flipping a coin and being like, cool.
15:01
Looks like a great place to leave a PR. And I think we can do better with more data, and we'll talk about that and how you can help us make this model less janky. Yeah, and so this is the list of things in the training area specifically that we wanted to try and improve. One of them is the classes are super imbalanced.
15:22
And so that was really rough. And so we really want some more data of people interacting with Frank to say whether or not Frank is making good predictions. Lexers and tokenization, we could do much smarter things than what we were doing there. We don't have a lot of context. We look at things on a line-by-line basis.
15:41
And context matters, so we could do smarter things there. And also I think bringing in more logs. If a piece of code is really confusing on Stack Overflow and frequently referenced, that's probably another good sign that would be useful. And we can explore some different models. I don't know, we could use deep learning if we wanted to, but I don't know. We have a link at the bottom.
16:01
If this is a thing that you care about, those in the back cannot see it. Sorry. But the slides will be posted later. We'll tweet them, and you can fill out that link. And you can submit PRs, and Frank will ask you for feedback. Cool. So a little bit about building the GitHub app that we're hoping gets approved,
16:23
but we'll see how it goes. We had to go through and effectively demonstrate that we did have a working public endpoint that was TLS encrypted, and that it actually did something behind the scenes. So that was exciting and kind of fun to go and get to play with GitHub API and actually make it so that we had an interactive demo for today.
16:43
And then I think Holden here wanted to shout out. If anyone is watching that works at GitHub, please approve Frank the Unicorn. Frank-the-Unicorn. Yes, good point. It is totally not going to steal people's credentials, I promise.
17:00
We are good people. Cool. So if you want to find the source code for any of the programs we wrote, the Scala, the Spark, the Go, the gRPC, that is all in github.com/frank-the-unicorn/predict-pr-comments. And all of the Kubernetes side of things is there as well. So if you want to see examples of how to run all of the pipeline in Kubernetes, that's all there.
17:24
And you can try Frank out here, and then we can show you the whole system up and running. We have it working right now live. And let folks see what they want. Or if they have questions or whatever, we can do a demo. Demo, demo. Okay, cool.
17:43
I think first, do we want to look at the answers, or do we just want to show folks behind the scenes first? Okay, cool. I'm going to put this down for a second. So Nova is going to bring up the logs for the different components,
18:00
and there is some debugging information that we output about the PRs that are being sent to Frank, and also the features and what Frank's predictions were for those features. And I guess Nova really likes aliases, because typing is too much work.
18:22
So yeah, we have three pods. There's the model server, the PR suggester, and there's Ubuntu, which was just used to debug it because they weren't talking to each other this morning. You can see we were working on this eight hours ago. I have not slept a lot.
18:41
Okay, yeah. So the model server is the one which has the more interesting pieces so far. There is the file name, so signal process. Someone submitted this. And these are the lines that Frank was curious about. I fucked up this log.
19:00
Oh, damn, whatever. I made a slight mistake in this logging message, so we just print out that it's a list, which is not useful to anyone. But it's a very nice decorative list. I think it really brings the log messages together, so I'm not taking it out. Do you want to show people the features and stuff, or do you want to talk about this?
19:21
Yeah, go for it. I can zoom up. Oh yeah, cool. Okay. So, whoa, okay, cool. I can zoom out, maybe? Whatever, yeah. So, you can see some of the feature vectors. They're all mostly chopped off, just because it's in show mode and the feature vectors are pretty long.
19:40
Right here. Cool. Yeah, and so, I think the last one is prediction. So, this one here is set to one zero, and it looks like it really didn't have much going on on that line. And so, that's a thing.
20:01
I don't know. Whatever. It's some log messages, and we could go look at that PR, maybe, and that'll be more informative. We should probably open one up and watch it go live. Or someone in the audience could open one up and watch it go live. Yeah? Yeah. Yeah, yeah. Thank you. Thank you.
20:21
Okay, so the other side of things that we're going to split the window here, as Francesc opens up a PR for us, is we can do our alias again, because I'm lazy. Does the alias actually save you any time if you type it in at the start of each message?
20:41
Yeah, but now I know that this is, like, always, and I don't think about it. So, tailing the logs for the Go program here, as soon as Francesc opens up a PR, we should see the Go program.
21:03
Yeah, you want to count down and hit Go? So, yeah, there it goes. So, yeah, what happened was GitHub sent us an HTTPS request. You see that over on the right. It says we received, oh, thanks, a PR called hellofriends.go, and then you can see the Scala and Spark in the back end here,
21:21
processing the request, and we're already done. So, Frank's already left some comments on your PR here, so let's go look at those. Oh, 28 pull requests. Wow, we have a lot. Okay. Man, we have friends. Yeah, so this is always the exciting part, is to see what Frank decided to comment on.
21:41
And this is actually a lot of fun. So, it looks like Frank, for whatever reason, thought package main did not look very good. Empty line on line 2, line 3, and line 6. So, I think this is a representation of,
22:01
if we have a small sample size, it's going to find more in there. Yeah, so one of the hacks that I did to make it slightly less garbage is that it more or less pulls the top K worst lines. And so, when you submit four lines, it's like,
22:21
well, of my top five worst lines, these four lines are it. And that's not exactly what it's doing, but it's pretty close. And so, for small PRs, that optimization was a bad idea. But for some of the bigger PRs, it's a little bit better, because otherwise Frank was kind of unpredictable with how many comments Frank was leaving.
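That top-K hack is essentially a sort-and-truncate over per-line scores. A minimal sketch (the type and function names are made up for illustration; the real logic is in the Scala serving side):

```go
package main

import (
	"fmt"
	"sort"
)

// scoredLine pairs a diff line with the model's predicted probability
// that a reviewer would comment on it.
type scoredLine struct {
	Line  int
	Score float64
}

// topKWorst returns up to k lines with the highest comment probability,
// mirroring the talk's hack of always surfacing the "worst" lines rather
// than thresholding the raw scores (which made small PRs over-commented).
func topKWorst(lines []scoredLine, k int) []scoredLine {
	sorted := append([]scoredLine(nil), lines...) // copy; don't reorder caller's slice
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Score > sorted[j].Score })
	if k > len(sorted) {
		k = len(sorted)
	}
	return sorted[:k]
}

func main() {
	lines := []scoredLine{{1, 0.02}, {2, 0.31}, {3, 0.07}, {4, 0.54}}
	for _, l := range topKWorst(lines, 2) {
		fmt.Printf("line %d (score %.2f)\n", l.Line, l.Score)
	}
}
```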
22:44
Do you want to pull up a bigger one? Yeah, let's go look at a bigger one and hope it doesn't prove me wrong. Well, this one's got signal processing module. That sounds like it's probably not five lines, unless signal processing has changed substantially. OK, cool.
23:06
OK, so Frank really doesn't like blank lines. It's not the silliest thing for Frank to be upset about, because I think there's a lot of situations where,
23:23
especially in Python code, people have extra blank lines. You leave comments and you ask them to take them out. And the problem is that we don't have a per-language model. And so Frank is just like, here's a blank line. And it doesn't have the context of the previous line, so it doesn't know there have been X blank lines before this. All it knows is that blank lines are a little suspicious
23:43
and end up having comments on them. So that was a depressing discovery. OK, so Frank doesn't like how this comment was closed. That's probably a style matter. But this means that on average, probably across GitHub, people don't like that style.
24:02
They prefer closing their C style comments differently. And so that's why Frank is upset. Stars close together. Yeah. OK. Oh, right. And here's an else, I guess. No curly brace. Yeah, Frank gets upset with elses without curly braces. To be fair, this happens a lot in the Spark project.
24:24
We make people use curly braces. And I imagine there are a lot of other projects which also make people use curly braces. And so Frank has just learned this de facto style from the aggregate of GitHub. OK, Frank doesn't like incrementing size.
24:40
Well, to be fair, a lot of people, a lot of folks, especially in Kubernetes, like to stay away from the plus-plus convention. And they actually like to do the literal arithmetic. So there might be some learning there as well. Yeah. So essentially what this shows is that Frank learns a lot of things and most of them might not apply to you. And so I think one of the challenges is
25:03
perhaps training it across all of GitHub didn't give the greatest results. And if there's individual projects that are large enough to have sufficient training data, doing a per-project model could also be kind of cool. Yeah, OK.
25:24
Anything in Frank? OK, cool. Do you want to pick another PR? Sure. Does anyone have questions? Does anybody want to see anything else?
25:43
OK, cool. Oh, there's some great Scala code. Oh, yeah, let's look at Scala. Where's that at? Oh, yeah, a great piece of Scala code. OK, so Frank came in and left some comments. That's a good sign. Large diffs are not rendered by default. Nice. I already did, sorry.
26:08
It's loading. Conference Wi-Fi. OK. Interesting. Class B. I'm always fascinated by what he chooses to comment on.
26:22
What do you think, Frank? We have another one here. I don't think this would be described as idiomatic Scala.
26:40
Yeah. This looks more like C. Honestly, in this one, we probably should have taken the limit off of Frank and just allowed Frank to comment on every line and be like, what are you doing? Oh, interesting. So we're getting the same comments on the same line.
27:01
This file really messed with Frank. This is a good file. Questions?
27:22
So the question is, when is this better than static code analysis? I think that the set of things that Frank has learned
27:43
are not all that better than static code analysis. It's interesting, and I think you could take the same thing and apply it to a specific project or a group of projects, and then it could perhaps learn.
28:01
We've seen that it's able to extract elements of common style and ask for that, and those things might not be easily captured by the static analysis tools. I know that, for example, the Scala static analysis tools are special, and so it probably depends on your language and what the other tooling is that's available there.
28:20
Also, I think the ability to plug in other features, like the issue data or data from mailing lists or data outside of the repository and pipe that back into the repository would offer a little bit more as well. Oh, yeah, right. That's a really good point. Unfortunately, the demo isn't able to show the one feature that Frank did really well at because we don't have the issues
28:40
associated with the pull request that people are making. But in its train, test, validate, it did much better for finding situations where the code was kind of sketchy based on users' interactions with it and their reported problems, for sure. Question?
29:28
Yay! Oh, my God. So if there's one where it's working, like... Yeah.
29:46
So while it isn't one model per language, there is a feature which represents what language it's written in, and since it's GBTs, that's often probably close to a root node. So to some degree, we have per-language models,
30:03
but not really. And I think if we train directly per-language models, we would indeed get better results. Oh, God. But what was your pull request number? Or... Pull request 16. Yay!
30:20
Let's look at the one where it worked. Maybe. Sort of. Is this the right PR? Oh. Do you know where it is? We can just search for Frank. Okay.
30:50
Okay. It was this one? Okay, cool.
31:04
That's exciting. That's cool. I'm glad it worked once, maybe. Three times better than guessing, but guessing is not very good.
31:35
How much does it cost monthly?
31:41
So we're running everything in Kubernetes. We both work for cloud providers, so we're both kind of spoiled. Yeah, everything is free. But I mean, realistically, I think the most expensive component here would be getting the Spark side of things up and running. Yeah, so the model training part is expensive, relatively speaking. If you wanted to train it for just your project,
32:02
you could probably do that really cheaply. Or if there's a family of projects which are similar to the things you care about, for example, if you trained it on all ASF Java projects, you could probably train that very inexpensively if cost is a concern. And serving it would be like, I don't know,
32:21
the cost of one node or maybe two nodes. Yeah. As far as the data processing and the GitHub endpoint, that's almost nothing. It's a very lightweight server with a few hundred lines of code. Question.
32:54
Sure. Yeah, so the question is how we represent the code and if it's all represented as a single vector.
33:00
And yeah, so we use the Word2vec embeddings. We do cap the length of the input that we consider on a given line. Some lines are huge, and if they are, that is in and of itself another feature. Like line length is another thing which was a strong predictor for PR comments
33:21
anyways when things start to get out of the scope where doing that is reasonable. We end up just commenting anyways. And so we have a one vector representation. It's not great. I was hoping it would perform a bit better. I think we probably need different lexers in front of it.
33:41
I think we need per language lexers and then we could probably get better representations but for now it's okay. No, that means Frank. Oh, so the question is if I didn't get any comments, does that mean Frank is satisfied
34:00
with my pull request? And the answer is it probably means Frank crashed. Which PR? Oh, okay. Oh. Oh, I think there's probably Trixie. Trixie.
34:21
So no. So this did not drop the database. But it actually I think it might ignore gitignore files. I should take a look but I think it ignores pure dot files right now. So I think
34:40
I'd have to double check. Okay, cool. Are we out of time? Awesome. So thank you all for listening. If you do want to give us feedback by thumbs up or thumbs down on the pull request comments, that would be greatly appreciated and we'll use it to train
35:01
a less shitty model.