Analysing GitHub commits with R
Formal Metadata
Number of Parts | 133
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers | 10.5446/48779 (DOI)
NDC London 2016, 40 / 133
Transcript: English (auto-generated)
00:07
So, at the beginning, I will remind you that there are cards with different colors outside this room. So if you want to grade my session, please do it. It will help the organizers and, of course, me to improve my talk.
00:22
So thank you very much for joining me this afternoon. I know there are other interesting talks at the same time, so I really appreciate that you are here with me. I'm Barbara Fusinska. I'm originally from Poland, now I'm based in the UK, and today I'll be talking to you
00:40
about what I've learned about R while I was analyzing GitHub commits. Although "commits" is a bit of an exaggeration: I was analyzing GitHub events more than commits themselves. I, of course, forgot my slide changer, so I need to use my laptop.
01:02
Oh, thank you.
01:26
Let's try it. Yes, and it works. Thank you very much. So as I said, I'm Barbara, and I've been a math enthusiast since I was a little kid. I'm not kidding: I was about six when I came up with my own theory, which,
01:46
as I found out when I was about 20, an actual mathematician had already turned into actual maths. I was a kid trying to understand powers in maths and how they work.
02:02
So I am a mathematician at heart, but in my daily job I'm a programmer. I'm a C# programmer, not an R programmer. So what am I doing here talking to you about R? I actually used R for my pet project. I wanted to learn R, and I wanted to learn a little bit of data science, because of math, of course.
02:27
And I think you're supposed to say maths, because in the UK we invented it. I'm saying "we" because I live here now, so I'm British now. And I have a big sweet tooth, so you'll see some candies during this presentation.
02:43
And my favorite color, as you wouldn't guess from my outfit today, is pink, of course. All of the code I will be showing you is in this GitHub repository, because, you know, inception: I will be analyzing GitHub and putting the result on GitHub.
03:04
So what are the goals of this talk? I will take some GitHub events, like pull requests, create events and so on, and I will try to make sense of them. This is actually a common thing I'm asked to do.
03:22
When people hear that I'm doing a little bit of data science, they say, here's my data, make sense of it. But actually making sense of it, and thinking about what your output is going to be, is the hardest part. So what will I be talking about?
03:41
I'll be talking about data analysis itself. Then I will show you how to capture some GitHub data, GitHub events. Then we'll go into R and how I actually analyzed those events using R. If we have time, I will talk about Azure ML, Azure Machine Learning Studio,
04:02
on the same topic, GitHub. So what does the data analysis process look like? If some of you have done a little bit of data science, for real, you know that about 90% of your time goes into capturing data,
04:21
gathering it and cleaning it up, and only at the end can you do the actual science: visualization, making sense of the data, some predictions. This whole first part is what I've heard called, and now call myself,
04:40
shaving the yak. Not fun, but very, very satisfying once you do it right and once you actually know that the data you've gathered is the data you wanted. But it's not that easy, and it never works the first time. First you gather the data, then you try to explore it,
05:01
find out what's under the covers, and then after two weeks you wake up and it's like, no, this is not the data I was supposed to analyze. I should capture something else. So this part, capturing data and finding the proper source for it, is very, very important in the whole process.
05:24
So what is this talk about, what am I trying to find out? I found this tweet with language statistics. A guy from RedMonk put together some statistics on languages
05:42
and GitHub repositories. You can see different languages, like JavaScript and Ruby, across the years. It's very, very nice. And there's this peak and then a drop for the Perl language. Does anyone know why?
06:00
So the thing is, Perl is not a very popular language nowadays. And it actually never was; it's just that at some point they decided to put all the Perl repositories on GitHub. So we have this peak and now nothing. But this is not real data science.
06:22
This is a guy who took the repositories and counted the languages, right? If we're talking in database terms, it's a count. But it was a very, very viral tweet. And then we have projects like GitHut, and this was even more viral.
06:45
This was on Twitter for weeks. And people were like, wow, that's great. It is great, isn't it? It's very impressive. So you have this list of languages, and you have different statistics. Again, they're not statistics, they're just counts.
07:01
But it's very nicely presented. This is a very good visualization. You can go through different periods and see how things change. And it goes much further with those visualization techniques. But again, it's just taking the languages and counting them.
07:21
So this is what I'll be talking to you about today: how I counted the languages based on GitHub Archive. So how did I capture my data? There's something called GitHub Archive, and the GitHut project that I've shown you
07:41
is actually one of the winners of the GitHub Archive contest. They offer you an API or hourly archives. You can get a JSON file about public repositories for the hour you pick.
08:00
You can also pick the data for a whole day or a whole month. It's a lot of data, seriously, a lot of data. All those JSON files are not actually JSON files, but I will talk about that in a second. They contain a list of events; the different events share the same structure,
08:22
but what is not common is the payload, because a create event differs from a pull request event, and that differs again from a fork event or an add-contributor event. So each has some specific content related to the specific event type.
08:43
As you can see, GitHut is one of the winners of the contest. So if you want to win a prize, you can always go to GitHub Archive, try to make sense of the data, visualize it nicely, and you can win and be here on this page.
09:03
The page I'm showing you is from 2015. It has probably changed for 2016, because the contest, I think, was at the end of the year. I'm not really sure. This is what the file looks like.
09:20
So I tried to find different ways to present this data. And if you look closely, this is the beginning of the file. It's not actually a table, an array of JSON objects. It's not a JSON array. It's just a list of JSON objects, without even a separator. You can't see it here, but there is no comma at the end of each one.
09:42
So we have raw JSON objects. If we use a library in any language to read the whole file as JSON, it won't work. We need to read it line by line, and then we can make sense of it. And this is timeline data. So if you take an hour of data, you only know so much,
10:05
only what the events within this time slot tell you. If something happened before this time, you won't know it. Of course, if something happens after, you won't know it either. You only know about things that happened during this time.
10:21
Even if something, like the language of a repository, is already known somewhere, if that information isn't in this time slot, you won't be able to use it. So how should you usually work with data like this? You should collect the data periodically, put it in some kind of storage,
10:42
and enable people to query it. And this is what Google does. Google has this BigQuery service, which loads the GitHub Archive data, and you can query it using statements like SELECT queries.
11:00
They're a little bit different from, but very, very similar to, regular SQL Server or Oracle queries. The catch is that only the first terabyte of data per month is free. If you go beyond that, you need to pay. Not much, it's not great sums, but still.
11:21
And you can run out of one terabyte very, very quickly if you write a bad query. So you need to be very conscious about what you put in your SELECT statement, that is, which data you are taking, and about your WHERE clauses,
11:42
because the one terabyte is counted over the data the query goes through. We can also use the GitHub API. Every repository has its own API endpoint, so you can get information like languages from it.
12:01
So you have the repo name and the language, and then you can get information about languages. But if you have thousands upon thousands of GitHub repositories, you don't want to do it one by one. And of course, GitHub doesn't allow you to, because you can only make so many requests per hour.
12:23
You can register to get more, but it's still too small a number. You can also use search. Search allows you to query multiple repositories. So you can get languages, for example,
12:41
for repositories created since some point in time. So why R? R is considered to be the number one language, and recently another buzzword, in the data science world, and data science is itself another buzzword. It was created by two gentlemen,
13:01
Ross Ihaka and Robert Gentleman. One theory says that the name R came from their names. Another says that there was this S language, invented at Bell Labs, and it was a commercial thing.
13:22
If you want open source, it's R. And R is great for many purposes, but it is a language written by statisticians, not by programmers. And you will see, if you haven't seen it already,
13:42
what I'm talking about. So, the development environment. You can write R in a console, and it's basically an interpreter. The console will be running anyway. But you can also use files. If you want to use files, I recommend RStudio.
14:04
And RStudio is not the best IDE I've seen. I've seen a few, and this one is one of the worst. But it's the only one. It's the only alternative. And it's free and cross-platform, seriously cross-platform:
14:24
you can download it for any platform you want, and it will work consistently. So if something breaks on Windows, it will break on Mac. This is what it looks like. We have an editor, where you can write your files, nothing special here.
14:43
This is RStudio on Windows. It looks exactly the same on Mac. Then you have the console. If you want to run anything, you can run it line by line, or you can run the whole file. But it will all be translated into lines here.
15:02
So you can see this is what actually runs the code. What else do we have? We have environment variables. This is a very, very useful part, because whenever you work on data and you don't know the structure,
15:20
you have some kind of IntelliSense here, but it's not very helpful. Much more helpful are those environment variables. You can go here, open one, see the structure, sometimes even see the data. When you click on a data set, it brings you back to the editor
15:41
but shows you the data and the structure in the form of a table, as in relational databases. And this pane is all the rest. Whenever something doesn't fit in those three windows, you have it in this fourth one. So you have plots, you have file structures,
16:01
you have packages, you have help here, everything. And it's a very, very small window, because you usually work here and here. So you push the right side back, and it can get really messy. For example, I once gave this presentation
16:21
with a live demo, and that's why I'm not doing it anymore. The resolution of the projector was too low for me to draw a plot. But of course, R doesn't have very nice error messages. It said there was something it could not do, but I didn't know what. And when I unplugged the laptop from the projector,
16:41
everything worked fine. Plugged it in, then googled, and it was the resolution. Could not plot. It happens. Okay, R basics. This is the beauty of R. I told you, it's not a very pleasant language. You will see it even more in a second. So we have simple types like strings, like doubles.
17:02
And see this 1 here? Who knows what it is? Why is there a 1? Yeah? Okay, I will show you something more. We have three things here.
17:21
Why is this 1 here? I didn't know for a long time. It's just that everything in R is a vector, or some other structure; there is no such thing as a simple value. This one is an actual vector because I created it as a vector; c stands for combine.
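The point about single values being vectors can be tried directly in the console. The snippet below is a minimal sketch with made-up values (`x` and `langs` are illustrative names, not taken from the slides); the printed `[1]` is the index of the first element shown on that output line.

```r
x <- 42        # a "single" number is still a vector of length one
print(x)       # prints: [1] 42
length(x)      # 1
is.vector(x)   # TRUE

# c() combines values into one vector
langs <- c("R", "C#", "JavaScript")

# Indexing is 1-based; the result is again a one-element vector,
# so it is printed with a leading [1] as well
langs[2]       # prints: [1] "C#"
```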
17:41
But it would be silly to write "combine" every time I want to create a vector. And even though I wanted just one thing, R created a vector for me anyway, a one-element vector. I still don't see much sense in this, but if you want to work with R,
18:01
get used to it. There is nothing you can do to get rid of it. If you try to get a value out of any small R vector, there will be a 1 there. Not 2, not 3, 1. And you can see here, I'm accessing the second element of my vector
18:21
and I still have a 1, right? If anyone can explain to me why... Yeah? Great. Lists. Lists are a big thing in R
18:40
and horrible to look at. You see those double brackets and the ones? If you want to access an element of the list, like the second element, and get the element's value, you need to use double brackets. It looks horrible the first time you see it. Once you get used to it,
19:01
it becomes very helpful. Another thing is that you can name the elements of a vector or of a list, and that makes it much more helpful. Now it looks nicer. You have named elements in your vector. It prints completely differently from an unnamed vector.
19:21
And if you access an element, you get it with something like a header. If we work with lists and we have those named elements, we can access them using names or using the dollar sign. And the dollar is kind of like the dot in C# or Java.
19:45
So if you want to access an element directly, you use the dollar. Here you can see what the list looks like when we have named elements, and what you get when accessing it with single brackets, with double brackets and with a dollar.
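The three ways of accessing a named list can be compared side by side. This is a small sketch with a hypothetical `repo` list (the field names and values are invented for illustration, not copied from the slides).

```r
# A named list with mixed element types
repo <- list(name = "my-repo", language = "Ruby", stars = 3)

repo["language"]     # single brackets: still a list, of length one
repo[["language"]]   # double brackets: the value itself
repo$language        # dollar: same value as [["language"]]
repo[[2]]            # double brackets also work by position

class(repo["language"])    # "list"
class(repo[["language"]])  # "character"
```

Single brackets keep the container, which is why the output shows the element name as a header; double brackets and the dollar unwrap the element.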
20:00
So with a dollar or double brackets, it's basically the same result. It's just nicer to use the dollar. Believe me, it's nicer than brackets and names. And with the dollar, you get IntelliSense in RStudio. The thing that data scientists in R work with the most,
20:22
the data structure, is the data frame. First, I created a list. This list has the two columns that will end up in my data frame. So you can see, this is the first column, this is the second column. And summary is the first function you will need if you want to make sense of your data.
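Building a data frame from a list of columns and summarising it might look like the sketch below. The column names and numbers are invented for illustration; `stringsAsFactors = TRUE` is set explicitly so that the text column becomes a factor regardless of the R version.

```r
# A list whose elements become the columns of the data frame
cols <- list(
  language = c("Ruby", "Java", "R", "Java"),
  repos    = c(12, 30, 5, 18)
)

df <- as.data.frame(cols, stringsAsFactors = TRUE)

str(df)      # structure: a factor column and a numeric column
summary(df)  # counts per level for the factor; min, quartiles, mean, max for the numbers
```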
20:40
And depending on what kind of data it is, whether it's numbers, strings or factors, you will get different things here. For strings, you won't get a mean value, right? You won't get quantiles or a minimum value. You will get some other statistics.
21:01
You will get the range of the values. You will get the levels for factors. So if you want to get a quick sense of your data, you just use the summary function. How does filtering work? I'm showing this to you so you understand the code that comes next.
21:20
It's kind of simple. If you have a vector or a list, you just put your condition into the brackets. This is the number of characters. So I only want those elements of the list that have two elements. And we can see this.
21:40
Another way to do it is to use the filter function. One nice thing, not the only one, but the first nice thing about R is that if you don't know what a function is called, you should probably try the first English word that comes to mind. So for filter, you use filter. For sort, you use sort, and so on.
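Filtering with a condition inside the brackets, and the same filter written with a function, can be sketched as follows. The vector and the two-character condition are made up for illustration. The talk does not say which filter function was on the slide; base R's `Filter()` is used here as an assumption (the dplyr package offers a `filter()` for data frames as well).

```r
langs <- c("R", "Go", "C#", "Java", "Ruby")

# A logical condition inside the brackets keeps the elements where it is TRUE
short <- langs[nchar(langs) == 2]

# Base R's Filter() does the same with a predicate function
short2 <- Filter(function(s) nchar(s) == 2, langs)

print(short)
identical(short, short2)
```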
22:00
For filtering data frames you can also use brackets, but with a comma that separates rows from columns. If you want to filter by rows, you put your condition before the comma. If you want to sort or select data by columns,
22:21
you put the condition after the comma. And you can do weird things: here I just picked the second column, which comes back as a vector, even though I created it as a list. You can also change the order of the columns.
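The rows-before-the-comma, columns-after-the-comma rule can be sketched on a small data frame. The data below is hypothetical and only meant to show the indexing forms mentioned in the talk.

```r
df <- data.frame(
  language = c("Ruby", "Java", "R"),
  repos    = c(12, 30, 5),
  stringsAsFactors = FALSE
)

busy      <- df[df$repos > 10, ]           # condition before the comma: filter rows
second    <- df[, 2]                       # after the comma: one column comes back as a vector
swapped   <- df[, c(2, 1)]                 # column positions in a new order
busy_lang <- df[df$repos > 10, "language"] # mix and match rows and columns

print(busy)
print(swapped)
```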
22:41
You can pick the numbers of the columns you want, or you can mix and match. So let's do something. Let's analyze something. I wanted to do a language distribution, something like what GitHut did, like the active repositories per language
23:03
or the total pushes per language. I stopped at active repositories because it was a lot of work. I wanted something like this. I know it's like a 90s Excel graph, but it does the job. So this is my output,
23:21
and now I'll show you how I got there. So what is an active repository? Remember, our data is events. We have different events from GitHub Archive. We need to pick a time slot. We need to pick some conditions that tell us a repository is active.
23:40
So what is an active repository? Yeah? Yeah, yeah. So it all depends. Data science again. We need to make a decision.
24:01
What is an active repository? For GitHut, I went to their code. They took the repositories that were created, that had pull requests opened against them, and that had pushes. But someone else could say, well, if a repository was forked, it means it's active, right? Or if we assigned some person to the repository,
24:22
maybe it means it's active. So it all depends. Again, it's the horrible consultant answer, but it actually does. If your definition is different, just put your data out there. Just be clear about what your definition of an active repository is. So what I did: I wanted to basically copy GitHut,
24:44
or at least their way of thinking because they wrote everything in JavaScript. So I'm not copying per se. And I went for create event. So I've put it into the JSON visualization, and I tried to find out the information about language.
25:02
So we have an actor. We have something called repo. And there is a ID name, URL. This is like a public API. And then we have payload. So I was expecting language will be in payload or even some somewhere up there, but no language information.
25:23
This is not surprising because when we create a repository, there probably is no knowledge about the language. GitHub goes periodically and assign languages to the repositories. So maybe like a push event, right? I went through a push event and I've seen the structure.
25:44
So there's again actor, there was again repo, and then I went to payload. So I went to payload. No, no language definition. There are commits. I wasn't expecting to get any language definition from commits, but you can see how it looks like.
26:04
You have message, you have again URL. So I went through the files like, you know, control F and found out the language in pull request events and only in pull request events.
26:21
So I dug in and I didn't find anything here. But finally, it's deep, deep down there. If you go there, there is a language URL, so you can query the API.
26:41
Nothing here in the head, but there is something in base. That is huge again. I went to repo, again language URL. Yes, Ruby. This repository has language assigned as Ruby.
27:02
So I went online and just asked people where I can get this information. And it's not like I'm stupid or unlucky and just found this time slot and couldn't find actual languages. There is no language information in the other events.
27:22
So I wanted to do something. Do I go look for some other data source or do I, like for five minutes, I assume pull requests are enough for what I want to do. So this is like, for me, this is the hardest decision.
27:40
I cannot even look at this picture because I get very hungry for sweets. So you need to choose your integrity or you want to learn something. And again, it all depends on what you decide and you can say I've done GitHub language distribution
28:02
based on pull requests, as long as you're frank about this. If you say I only analyze those events, people might say, well, this work is worthless, but at least you've done some work, you've learned something and people might not look at your findings,
28:24
but some people can say those tendencies can be spread across and maybe you can get some missing information from somewhere. It's a starting point. It was a starting point for me. So what did I do? I read the file line by line because you know you need to do it now
28:41
because those are JSONs, it's not a table of JSONs. I picked up an ID and the language from all those events and then I counted, right? And I took only an hour slot. So 3 p.m. 1st of January last year.
29:02
So like almost a year ago. Why did I take ID and language? Why not just language? Yeah, right. So if within this hour there are more events
29:21
about the same repository, I don't want to count it twice or three times or 10 times. But someone could say they want to, because if there are so many events applying to the same repository, this means this language is very popular. And maybe someone makes different decisions than I did.
29:44
Then it would be a very, very simple query in Google BigQuery, and I will show you that at the end. So as I told you, I couldn't do just parsing JSON because this was my structure
30:01
without commas at the end and without the table thing. So what I did, this is the library, JSON. There are different libraries for JSONs. There are different libraries for everything in R. And I use readLines. This is a built-in function in R, very nice.
30:21
You just read lines and it puts them into a list. So you have a line, you have lines which are JSON structures, and then I apply to those lines a function called fromJSON. What I get at the end, I have my JSON objects
30:41
that I've read from my file. And what I need is I need to filter them by some events. So we already know I will just be taking pull requests. So I use this function here for pull requests. And I took those events and I only took ID and language.
31:04
And you can see now how the dollar operator comes in handy, because you've seen this nested structure of the object I've shown you, of the payload. So I just wanted ID and the language. And then I will show you the result.
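The steps described here can be sketched roughly as follows. This is a minimal sketch, assuming the jsonlite package (the talk only says "the library, JSON") and three made-up events standing in for a real GitHub Archive file; the nesting under `payload` follows the pull request structure described earlier, and all variable names are illustrative.

```r
library(jsonlite)  # assumption: any JSON package with a fromJSON() would do

# Three made-up events, one JSON object per line, as in a GitHub Archive file.
# For a real file you would use: lines <- readLines("path/to/file.json")
lines <- c(
  '{"type":"PushEvent","repo":{"id":1},"payload":{"commits":[]}}',
  '{"type":"PullRequestEvent","repo":{"id":2},"payload":{"pull_request":{"base":{"repo":{"id":2,"language":"Ruby"}}}}}',
  '{"type":"PullRequestEvent","repo":{"id":3},"payload":{"pull_request":{"base":{"repo":{"id":3,"language":"R"}}}}}'
)

# Parse every line separately, because the file as a whole is not valid JSON.
events <- lapply(lines, fromJSON)

# Keep only pull request events.
is_pr <- vapply(events, function(e) e$type == "PullRequestEvent", logical(1))
pull_requests <- events[is_pr]

# The $ operator walks down the nested structure to the base repository.
repos <- data.frame(
  id = vapply(pull_requests,
              function(e) e$payload$pull_request$base$repo$id, numeric(1)),
  language = vapply(pull_requests,
                    function(e) e$payload$pull_request$base$repo$language,
                    character(1)),
  stringsAsFactors = FALSE
)
print(repos)
```

In real data the language field can be missing, in which case the `character(1)` extraction would fail and would need a guard.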
31:21
I've read those pull requests and I did summary. And I have a summary here. So I have like 12 occurrences and eight occurrences and five occurrences of the same repository. And this is exactly why I took ID with me
31:42
because there are several events for the same repositories. And what I need to do is just make them unique. Hence, I have a unique function. So I applied unique function. So I only have one repository in my results.
32:04
But when I did summary again, after doing this unique, I still have like two occurrences for two repositories. Why? Some repositories have multiple languages assigned.
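A small illustration of the deduplication step, using made-up IDs and languages and `table` in place of the summary shown on the slide. Removing duplicate rows leaves one row per ID and language pair, so a repository that appears with two different languages still shows up twice.

```r
# Made-up events: repository 10 appears three times, repository 12 has two languages.
pull_requests <- data.frame(
  id = c(10, 10, 10, 11, 12, 12),
  language = c("Ruby", "Ruby", "Ruby", "R", "C#", "JavaScript"),
  stringsAsFactors = FALSE
)

# Occurrences per repository before deduplication.
print(table(pull_requests$id))

# unique() on a data frame drops rows that are identical in every column.
repos <- unique(pull_requests)

# Repository 12 still occurs twice because its two rows differ in language.
print(table(repos$id))
```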
32:22
So I have my language information and I can now count them. Unfortunately, count is not a function. I mean, it is a function, but it's not doing what I want, which is counting occurrences of languages. There is a table function.
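As a short illustration of `table` used as a counter, here is a sketch on a made-up vector of languages. `table` returns one count per distinct value, and `names` gives just the values.

```r
# Made-up language column, one entry per repository.
languages <- c("Ruby", "Python", "Ruby", "C#", "Python", "Ruby")

# Count how often each language occurs.
counts <- table(languages)
print(counts)

# Just the language names, without the counts.
print(names(counts))
```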
32:41
Table is a weird function for an object-oriented developer, a C# developer, a Java developer. It's not very weird for statisticians and mathematicians, because a table in that area actually means something that counts. So when I put the pull request language,
33:01
I said count me the occurrences of the same languages and structure them in a way. Language, number of occurrences. This is what this table function is doing. And what I wanted to see is just the names. Which languages do I have? So I first printed the languages,
33:21
all the languages I got and their occurrences and then just the names of the languages. And this is my output. So you can see here language names and you can see language and number of occurrences. And this was weird because I can see Python,
33:41
I can see Ruby and I cannot see R. There's even C#, but there are also those weird names of languages. Anyone heard about a language named 20825841? No? So I knew something was wrong. So maybe I did something wrong
34:01
when I was reading those data. It wasn't such an amount of data that I couldn't go through it by hand. So I actually checked. Maybe I read from a weird source, and then I had a flash of genius. Maybe if they cannot figure out the language information, they put the ID there.
34:22
So I checked it. I had some incomplete data. So I just wanted to see which ID belongs to a language named like this. So if you see, the ID of the language named like this
34:42
is exactly the same. And I'm not really sure if it was GitHub or R that replaced that information, or at which point it happened. But I think it's GitHub because R has null values. It would just bring up the null value if it's not there.
35:06
So what I did then was I just filtered only those pull requests where the language isn't the same as the ID, because this is what I was interested in. If we go back, you can see I'm losing like 40% of my data.
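The filter described here can be sketched on made-up data as follows. The IDs are stored as text so that they can be compared directly with the language column; the column names are illustrative.

```r
# Made-up repositories; in the last row the language field holds the ID instead.
repos <- data.frame(
  id = c("101", "102", "20825841"),
  language = c("Ruby", "Python", "20825841"),
  stringsAsFactors = FALSE
)

# Keep only the rows where the language is not just a copy of the ID.
clean <- repos[repos$language != repos$id, ]
print(clean)

# Share of rows lost by this decision, worth reporting alongside the result.
lost <- 1 - nrow(clean) / nrow(repos)
print(lost)
```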
35:23
Maybe not my data, because it's like one occurrence per repository, but a lot. And again, choice. It depends on that kind of explanation. But it actually depends on what you want to do now. Either you go and try to get that missing information
35:43
or you just carry on. Just write down. The stuff you caved, right? So funny thing, when I actually did the table and see the head of languages after I picked them out,
36:00
the ones I didn't want, I still have them in my table. No one could tell me why. Now I have zero as occurrences, but it's really weird that it still remembers that there was something there. Don't know.
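A likely explanation, assuming the language column was stored as a factor (which was R's default for text columns in data frames at the time): a factor remembers all of its levels even after the matching rows are filtered out, and `table` reports every level, including the empty ones. The sketch below reproduces this on made-up data and uses `droplevels` to forget the unused levels.

```r
# Made-up language column stored as a factor.
languages <- factor(c("Ruby", "Python", "20825841", "Ruby"))

# Filter out the unwanted value; the level itself is still remembered.
kept <- languages[languages != "20825841"]
print(table(kept))              # 20825841 is listed with a count of 0

# droplevels() removes levels that no longer occur in the data.
print(table(droplevels(kept)))  # only Python and Ruby remain
```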
36:21
So this is how I figured this out. I only took the languages that had more than five occurrences, because I'm not interested in lower figures, and I plotted them with barplot. Very simple function, barplot. That's why I used it.
36:41
And I wanted to have a nice graph, so sort, another handy function: just sort it, put it right there, and I have a graph that shows me everything. Okay, but let's get real. This was just pull requests, and we have some missing data.
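The thresholding, sorting and plotting can be sketched as below, on made-up counts. The threshold of five comes from the talk; the `las` and `ylab` arguments are only cosmetic choices made for this sketch.

```r
# Made-up language column: 12 JavaScript, 9 Python, 6 Ruby and 2 R repositories.
languages <- rep(c("JavaScript", "Python", "Ruby", "R"), times = c(12, 9, 6, 2))
counts <- table(languages)

# Keep languages with more than five occurrences and sort them, largest first.
popular <- sort(counts[counts > 5], decreasing = TRUE)
print(popular)

# One bar per language; las = 2 turns the labels so long names stay readable.
barplot(popular, las = 2, ylab = "Repositories")
```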
37:00
We have some missing data for create events and for push events. How do we get there? Where can we get those data? So I decided, as GitHub decided, that I will take into consideration create, push, and pull request events, all three of them, and I need to get the language information,
37:23
either from BigQuery or GitHub API. I went for BigQuery, because GitHub API is inconvenient. You need to just ask and ask and ask and ask, and it just takes time. If you have a lot of data, it's just not practical.
37:40
Okay, so Google BigQuery. This is my query to Google. I picked up repository name and URL and repository language, for push events, create events, and pull request events for this exact hour that I was considering before.
38:02
And the problem I had is, from the GitHub archive, I was pulling out the ID and the language, which was there only in rare cases, and I could get a URL too, but I had it like this, as an API URL. For Google BigQuery, I didn't have an ID,
38:20
and I didn't have a URL in this form. So if I want to combine those two data sources, I need to find a key, a relation. So the key was the repository name, which I had in BigQuery, but I didn't have at GitHub archive. So I decided to go for URL.
38:42
The problem was, not the problem, it's just not the same data. I could only extract the name from URLs. So what I did, I again read the file line by line, I parsed it to JSON, I extracted ID and URL information, I read the Google BigQuery file,
39:02
because when you select it from Google, you can save it as a CSV file or as any other file of your choice, and then you need to combine them. So you can see, oh, the beauty of the R language.
39:21
If you now got used to R, you're a very, very lucky person. It took me weeks until I could look at this code again, because it's just a mess, and it's not that I'm a bad R programmer. Believe me, I'm quite a good programmer. It's just R doesn't have any patterns,
39:41
any good practices, and the rule is, if it works, leave it. Sounds silly for normal programmers, but for data science it actually doesn't. You usually use those languages for the code
40:01
you don't have to maintain in the future. You want for this code to do the job, and this code will have a few lines or maybe one page. You practically can get back to this code and look at it again if you need to tweak it.
40:21
So the beauty of R is to transform data, to do some data exploration, but if it goes to like counting elements, and this is actually something that cuts off this URL from the place
40:43
from this, and I don't even remember, seriously. This is the number of characters in this at plus one. It's like in C when you need to replace,
41:02
if you need to substring from some string. So those times are way behind me. Again, I needed to use unique function because now I know what I'm looking for, and then I printed the head.
41:20
So I have an ID and the URL, just the name of my repository. I did similar thing for Google data. So I applied a function that also gets a substring, which will be my name of the repository. I read the Google data and I printed the head. So I have like repository name, URL language, and this URL.
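The URL trimming can be sketched like this. The two prefixes and the repository names are assumptions made for illustration (the slide is not visible in the transcript); the point is the "number of characters plus one" idea with `substring`.

```r
# Assumed URL shapes: API-style URLs from the archive, web URLs from BigQuery.
api_prefix <- "https://api.github.com/repos/"
web_prefix <- "https://github.com/"

archive_urls <- c("https://api.github.com/repos/owner-a/project-x",
                  "https://api.github.com/repos/owner-b/project-y")
google_urls  <- c("https://github.com/owner-a/project-x")

# Start right after the prefix: its length in characters, plus one.
archive_names <- substring(archive_urls, nchar(api_prefix) + 1)
google_names  <- substring(google_urls, nchar(web_prefix) + 1)

print(archive_names)
print(google_names)
```

Both sides now carry the same owner/name key, which is what makes the later join possible.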
41:43
So this will be the column, and this will be the column that I will combine those two datasets. So how can we combine them? This is what I did for combining.
42:00
So I applied a function that gets from the Google data those rows whose URL is the same as the second column in this repository data, yeah. So I applied to repositories the function
42:20
that gets the missing language information from Google data, and it matches the Google information by the second column, which is the name of my repository, to my Google data URL. You could notice this as.character. And when I was new to R, I kept leaving it on this slide.
42:41
It gave me like a weird error if I didn't put as.character. as.character is basically casting. So you cast one thing and you cast the other thing. The code would be so much nicer without this casting. Why did I need it? Because the language wasn't read as a string
43:02
from my data source. It was read as a factor, which is basically like an enum, a category, something like this. And if you don't know that this is the default setting in R, it just cannot match two factors. So what you need to do is cast it to string
43:22
or apply a setting so that it reads from files as strings, not as factors. I know the factor thing could be a novelty for some people, but again, beauty of R, and a factor is actually something that you would be working with very often in R.
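The factor problem can be reproduced in a few lines. This sketch uses a made-up two-row CSV and forces `stringsAsFactors = TRUE` to get the behaviour described in the talk; since R 4.0.0 the default is `FALSE`, so current versions read text columns as strings unless told otherwise.

```r
# Made-up stand-in for the CSV file exported from BigQuery.
csv <- "name,language\nowner-a/project-x,Ruby\nowner-b/project-y,R"

# Read text columns as factors, as older R versions did by default.
google <- read.csv(text = csv, stringsAsFactors = TRUE)
repo_name <- factor("owner-a/project-x")

# Comparing two factors with different level sets raises an error;
# tryCatch turns that error into its message so the script keeps going.
result <- tryCatch(google$name == repo_name,
                   error = function(e) conditionMessage(e))
print(result)

# Casting both sides to character makes the comparison work.
hits <- as.character(google$name) == as.character(repo_name)
print(hits)

# Alternatively, read the strings as strings in the first place.
google_chr <- read.csv(text = csv, stringsAsFactors = FALSE)
print(class(google_chr$language))
```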
43:42
So I left it on purpose to show that sometimes you don't know some specifics of the language and it can really, really mess up your code. Because if you delete this and this, the code becomes much, much nicer.
44:00
And I only took the first language because Google data could return you several languages. So if you want to do it the proper way, you should iterate through what Google returns to you and create new rows. But I didn't. So what do we have now?
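One way to sketch the combining step is with `match`, which is a swapped-in alternative to the apply-based code on the slide rather than a copy of it. Like the approach described, it takes only the first language BigQuery returns for a repository. The data and column names are made up.

```r
# Made-up repositories from the archive (ID and name) ...
repos <- data.frame(
  id = c(1, 2, 3),
  name = c("owner-a/project-x", "owner-b/project-y", "owner-c/project-z"),
  stringsAsFactors = FALSE
)
# ... and made-up BigQuery rows; project-x comes back with two languages.
google <- data.frame(
  name = c("owner-b/project-y", "owner-a/project-x", "owner-a/project-x"),
  language = c("R", "Ruby", "JavaScript"),
  stringsAsFactors = FALSE
)

# match() returns the position of the FIRST matching name, or NA if none exists.
repos$language <- google$language[match(repos$name, google$name)]
print(repos)

# Count languages and keep the unmatched repositories visible.
print(table(repos$language, useNA = "ifany"))
```

Repositories without a match end up as `NA`, which is the same kind of gap that shows up as a missing-language bar in the plot.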
44:21
We have now a structure of ID, URL, and language. I actually could leave out the ID, I could just forget about it, because my name is a unique identifier anyway at this point. So what I did now, I used table, I sorted it, and this time I plotted every language with barplot.
44:41
And I found out something here. Have you seen a language like this? It's called null, if someone in the back doesn't see it. So a lot of my data actually weren't merged, or Google didn't know about those languages
45:03
for those repositories. And it happens all the time. And again, decisions. Do you put it on your findings and output of your research or do you just forget it and do this?
45:21
Your choice. This is for presentation purposes so I can show you both findings. But this is like the story of any data scientist or statistician. You have a lot of data that are missing and you're not going to get them.
45:43
You can of course use GitHub API and I will just show you quickly how you can do it. So you just use the provided API for this repository. You can just use my rtalk. And you just use the API thing.
46:01
There is a method GET. There is of course a package you need in order to use GET. You can see the content but it's just signs. Signs. But there is a method called content so you can easily look what's inside. And the content is returned to you as a list.
46:21
So I have a list of two languages. My repository has R and Rebol languages, which I found out when I ran this code. I didn't know my repository has the Rebol language. Maybe it's the same as R. You can also use GitHub search.
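A minimal sketch of the API call described here, assuming the httr package supplies GET and content (the talk does not name the package). The helper names `languages_url` and `repo_languages` are hypothetical. The network call is wrapped in a function and not executed here, so the snippet runs without a connection.

```r
# Hypothetical helper: URL of the GitHub REST endpoint that lists a repository's languages.
languages_url <- function(owner, repo) {
  sprintf("https://api.github.com/repos/%s/%s/languages", owner, repo)
}

# Hypothetical helper: fetch the language names for one repository.
# Requires the httr package and network access when it is called.
repo_languages <- function(owner, repo) {
  resp <- httr::GET(languages_url(owner, repo))
  httr::stop_for_status(resp)   # fail loudly on rate limits or a missing repository
  names(httr::content(resp))    # content() parses the JSON body into a named list
}

print(languages_url("some-owner", "some-repo"))
# repo_languages("some-owner", "some-repo")  # uncomment to query the API
```

Unauthenticated requests to the GitHub API are rate limited, which is one reason this route gets slow for many repositories.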
46:40
And GitHub search gives you the opportunity to search across multiple repositories in one request. You can just adjust your q parameter, which stands for query. So I used q as created at that day
47:02
and I wanted page two. So I knew already that I just need to use the content, and in my content you can just query by language, search URL or even stargazers count. This is the select statement from GitHub
47:20
from Google BigQuery. If you just want to count the languages like from the time being like from all the data you have on GitHub, you just use this query. Just selecting repository languages, count the repository languages as pushes
47:40
and it takes GitHub timeline, which is the dedicated table where you can just get events from GitHub archive. I took create event, but you can just put any event you want or not even put this where clause. But if we just want create event
48:00
from the beginning of time, it will be all the repositories and there will be only one repository, right? One record per repository. We won't have multiple occurrences as we had when we were just counting events, any events.
48:22
So if you want to do it from the beginning of time and just use Google, this is the easy way, and all those very impressive visualization projects are actually doing this. They're just querying Google, putting the data into their own databases
48:41
and then do a nice visualization. So they have already counted stuff. Okay, we have some time, yeah, quite a lot for Azure ML. So as you know, I only took an hour of data from 1st of January last year.
49:02
What if I wanted to analyze the whole day? It's much more data and you need to go through several files. What if you want to analyze the whole year? So I tried to download all those JSON files from GitHub archive for one month.
49:21
It goes in like tens of gigabytes and those are just public repositories, just public events and I was like, well, my poor laptop won't make it. So it's not a revelation for any of you that if you have a lot of data, you should go to the cloud. You should not force your desktop computer
49:42
to do all this job, not to even store those data. So you need to go to the cloud. So what I did is I went to Azure ML. One digression, I forgot about this slide. Big data and R because that's the question that always pops in when someone asks about R
50:04
is how does R do with big data? Well, it doesn't. It's a language that runs in memory. So if you can put your data into the memory, then it will. When it's more, it won't.
50:21
No language will, except database languages, but they don't run in memory, right? They just go through your storage. They do some stuff. Someone says Hadoop, but Hadoop is MapReduce and it goes through nodes and on this particular node, it runs in memory.
50:41
So it's not like R has some cons in terms of big data. It's just you need to know how to use R when you're handling big data. So either you process some stuff somewhere else, you download whatever you need and then you use R, or you use MapReduce and maybe then you use R
51:01
on the node, or you use streaming or some other techniques that big data processing is connected to. Azure. So Azure is amazing, but the first time you go there, Azure Machine Learning Studio,
51:21
which is different than normal Azure. And there is a free tier if you just want to look it up and check it out. You don't even have to be logged in to try it out. But if you're not logged in, your stuff won't be there.
51:40
It will just vanish but you can play around. If you have a Microsoft account, you can just sign in and have stuff for free. Not all of the stuff, not unlimited stuff but some stuff to learn, of course. So my first impression was like wow, this is a wizard.
52:01
This is a wizard, online wizard. I'm a programmer, I want to program. I don't want to drag and drop. But that's what you do there. And again, there is a resistance but once you go past it, it becomes quite nice.
52:21
So the basic workflow, how you do stuff at Azure ML, is you put some data source there. You can either upload a file there or there are several files, data sets already there for you to play with, like irises or predicting house prices and stuff.
52:43
There are like tons of them. Or you can query a database. You can stream data from somewhere else. There are several ways to introduce data into your workflow and then you do something with those data. Whenever you go through some 101 tutorial online
53:01
on Azure ML Studio, it rarely talks about R. It will talk about how you can implement a machine learning algorithm into your solution very easily, and it is very easy, and how you can tweak some attributes and parameters. So basically you just take stuff from here
53:22
and you drag them here and they even tell you where because it actually fits. You take something here, put it there and it fits. With R, you will have only two nodes. I mean, you can have more of course but I will just show you having data and processing them
53:43
because that's what we were doing. I had GitHub archive data and I was processing it and then I was visualizing it. So what does it look like? You have this data source here. I will put my JSON files there, and you will have R code
54:01
and you have built-in R language modules. There, just drag and drop. R code looks like this and if you drag this thing here, there will be like a text form here to put this code, this size.
54:21
But when you click, it just expands, but I really don't recommend using this as an IDE. It's a very small window. You don't have any of the capabilities to test it, so you just go back to your RStudio, do your code there and put it here. This is the code that is generated as a default.
54:44
When you start your project and when you put your R module there, this is what you get. So you have like an input data set, output data set and if you want to choose your files, this is the syntax you need to use.
55:01
Data sets, I just uploaded. I first uploaded my 3 p.m. first of January, 2015 and then I uploaded a zip file. You need to upload a zip file, there is no other way, at least for now. Then I uploaded a zip file with the files for the whole day, for first of January
55:21
because, you know, it's a cloud. I want to do some cloud stuff and a lot of data. I won't be playing with the same that I just did in my desktop. So this is how it looks like. When you go to experiments, I've created some of them. This is my zip file and this is my R script.
55:42
In my R script, this is how my experiment looks. And when I click here, execute R script, you will see something like this. This is basically the same code, a little bit tweaked because I loop through 24 files
56:01
for every hour of the whole day and I read the events and I do the same code as I did for just one hour. Of course, before this, there is this function there but I don't need to repeat it in front of you. And the only tweak I needed to do
56:22
to move this code from my desktop to Azure ML Studio was this. I just needed to point it to SRC folder because this is how Azure ML knows that it needs to find a data set in my data set.
56:41
And that's it. So what about graphs and visualization? I didn't know how this would work, but there is a nice thing, visualize.
57:02
And then I get this graph. This is for the whole day. This is the language distribution. And it's very squeezed so if you just zoom in, you will see the actual languages that are there. And much more information. And I couldn't figure out how it works.
57:23
I just put barplot and it knows it needs to visualize, but as it appears when I go back here, sorry, it just tells me. It's just genius, it just knows. And you can have all the plots you want
57:43
and all the other information. When you do some statistics, when you do some machine learning, it gives you out of the box much more information, information you don't even know you need. It just does it for you to make your life easier.
58:03
Okay, to summarize, yeah, I made it on time. We've talked a little about data exploration in R. I've shown you some code in R and how I did it for GitHub archive and which problems I had with GitHub data
58:21
and how to capture them. And the last part was dedicated to Azure ML. So basically, what's next? If you go to those two pages, statmethods or R-bloggers, and you have any question, the answer will probably be there. Or if you Google it, it will point you to one of those pages
58:43
or two of those pages. Sometimes they overlap. Sometimes there is the same thread in both of those pages. And if you go to this page, you will see what I mean when I told you that R has very, very weird errors
59:02
and this is the page dedicated. Or if you just don't want to copy the link, Google weird R language errors and it will point you to the page and you're not gonna believe what the errors can be and what they can mean. I will write some more about R on my blog.
59:21
I do the data exploration in R workshop, which will appear on Katacoda, which is an online learning platform. It will be a free training, like a free tutorial about data exploration, and more on the data science topic also there. So thank you very much.
59:41
If you want to contact me, don't hesitate. And if you have any questions, I think we should do it during the break. Thank you very much.