Spice up your website with machine learning!
Formal Metadata
Title: Spice up your website with machine learning!
Number of Parts: 133
Author: Evelina Gabasova
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/48836 (DOI)
Transcript: English (auto-generated)
00:07
All right, I believe it's three o'clock, so we should start. So welcome, I'm Evelina Gabasova. And during the day, I am a researcher and I work at the University of Cambridge and I work in cancer research.
00:20
But during the night, I have fun with machine learning and I try to apply it to all sorts of different and interesting problems. So this talk will be about how to spice up your website with machine learning specifically. And because we are in the functional track here, I'll be talking a lot about F-sharp. And I will start with a use case
00:42
that I used to apply machine learning. And this is a website called F-sharp Snippets. Who has seen F-sharp Snippets before? About half of the people. So F-sharp Snippets, it's something like gist on GitHub, only on steroids.
01:01
It's much better and it's targeted to F-sharp specifically. So it was started roughly five years ago by Tomas over there. So if you have any questions, you can ask him. He will be speaking later during the conference. And people are posting pieces of code there that they can share with others.
01:23
And I have a little example here. So this is an example snippet from the website. This is actually the very first one on asynchronous sequences. And you can see that it contains the code. It contains some title and some description.
01:43
And you can copy a link to the code. You can embed the code. You can look at just the raw code, et cetera. You can also open it in tryfsharp.org. And if you have Silverlight installed, you can actually run the code within here.
02:01
And you can also add tags to the code. I will come to that a bit later. So you can see that this is quite useful for everything F-sharp related. And one of the main features of this kind of website is that you can actually look at the types of things
02:23
within the website. So for example, this is an asynchronous sequence. And you can see the type. And this is a function. And you can, again, look at it. And you can see that it's a function that takes a string and an integer and returns an async of some other object.
02:41
So this is very useful for the type of language that F-sharp is. That means statically typed with automatic type inference. And this is really a very nice website. You can have a look at it. It contains quite a lot of snippets already. It's quite active. You can like snippets.
03:00
You can share them. The main problem comes when you decide to actually search for snippets, because sometimes you are looking for some specific piece of code. And as I mentioned, there are tags that you can attach to the code. So this is how it looks if you look at the tags.
03:23
So these are the tags that people attach to their snippets. And there are some tags like, for example, Q module UNCC. That's not a very informative tag, is it? Or simulation JavaScript HTML5 on F-sharp snippets.
03:41
It's not like a simple, nice tag that you could search for if you are looking for something. So my first goal was to make this a bit more manageable, because this is clearly not sustainable. And when I actually looked at all the snippets and the tags, there are 1,600 snippets at the moment
04:03
and 1,100 different unique tags. That's almost one new tag for every new snippet. That doesn't make any sense. It doesn't serve any purpose, actually. And it's not really useful for searching. So what do you do if you want to search for something?
04:23
Well, I go to Google, right? So I went to Google and tried searching for async on the fssnip website. And what Google gives me is, basically, just a link to the list of the tags. So that's not very useful either.
04:41
Because for Google, this is very informative, but not for me as a person who's actually interested in the code. So why is this a difficult problem? Well, this is where we are with Google as well. And for example, this is just a little snippet of FSharp code.
05:02
And when I look at it on the fssnip website, I can actually see that, for example, this is a function that takes an asynchronous workflow and some kind of function and an integer and returns an asynchronous workflow. And I'm not interested in what the user called the function.
05:22
For example, this is the amazing function. I'm not interested in that. That doesn't tell me anything about the actual contents of the code. So in this case, this will probably be an asynchronous piece of code. So I would like it to have the async tag
05:41
or be searchable under async. Although async, if you just look at the text, appears just once here. And the amazing function is there two times. The result OK function is there three times. So based on standard search engine behavior,
06:01
this would be much more important. And Google doesn't see all the types that are underneath. So this looks something like this. Haskell, for example, has something like this: with Hoogle, you can search for functions by their types. For example, here I am looking for a function that
06:22
has the type signature that takes a function and a list and returns another list. And the first result is the map function. And you can do this for Haskell. Can we do something like this for F sharp? Well, this looks like a great opportunity to create some custom-made machine learning system, right?
06:43
Who here has worked with anything machine learning? Yeah, quite a few people. But yeah, so always beware this kind of situation. For example, there is this other company that does a lot of machine learning. It's called Google.
07:00
And this happened last week. Turns out that if you are trying to translate something from Russian to Ukrainian, then it would translate Russia as Mordor. That's not the sort of mistake you want your system to make.
07:22
And what I really like here is the "automated error", in quotation marks, because this is an error created purely by machine learning. And this same error had different aspects as well. For example, this is the Russian foreign minister,
07:40
Sergei Lavrov. And if you tried to translate his name, or his surname, from Russian to Ukrainian, it turned into a sad little horse. So if you are Google, you really don't want this sort of mistake. And the problem here was that they were learning from data.
08:01
And they have machine learning systems that learn constantly and then update based on what they see on the internet. And on Ukrainian online forums, people started calling Russia Mordor. And Sergei Lavrov was a sad little horse. So Google picked it up. And the system learned.
08:21
It learned correctly, but you don't want this sort of behavior. And on a bit more serious note, deep neural networks have a really big boom right now. And they are basically state of the art for recognizing pictures. If you have, for example, a photo of a horse,
08:42
it will tell you, oh, this is a horse. And on standard benchmarks, they can achieve 98% accuracy in distinguishing animals or objects. But if you actually take these very well-trained neural networks and try to fool them by basically changing
09:02
the inputs in a specific way, you can get something like this. The examples on the top are labeled by the neural networks with very high certainty as Robin or Cheetah or Peacock. Although to us, they look nothing like that. And the neural network is very certain that that is a Peacock.
09:23
And similarly, in the bottom, there is electric guitar and starfish and baseball. And it doesn't look anything like that. And still, it learns from data. So anytime if you feel like you need some kind of machine learning system, you have to be aware of these kinds
09:42
of aspects or these kinds of behaviors. Because machine learning systems depend on data. And especially if the data are generated by users, they can put all sorts of things in there. They can start translating Russia into Mordor, and it will mess everything up. So machine learning can sometimes really be a black box.
10:02
And you have to be prepared for this sort of behavior. So this was just a bit of a warning, because I see there is quite a lot of hype around machine learning. But sometimes it's just easier to create something manually. But I am a machine learning researcher, so enough with the warnings. And let's dive into some proper machine learning.
10:24
So with the F Sharp Snippets website, the first thing I wanted to do was to actually find related snippets. Because that's a fairly simple task. If you look at a specific snippet, you want to see the snippets that are similar to it in some way.
10:42
It's not the same thing as what Hoogle does for Haskell, looking for the same types. It's just trying to compare code in a different way. So the simplest way of doing this is called information retrieval,
11:01
where you basically just process your code or descriptions or documents, turn them into tokens, and then look at, oh, does this document use the same words as this one? And this can be misleading sometimes. For example, I skipped over a bit.
11:23
If we look at some terms that are appearing in the code, some of them can be quite important. Like for example, this is comparing two snippets. One doesn't have any mention of async, and the other one has three mentions of async. And although, for example, X is in both of them quite often,
11:42
that doesn't mean that these two pieces of code are similar. Because X is something that everyone uses all the time to denote variables that you are not really interested in.
12:00
Because we can look at, oh, do these snippets use the same types of words? But it doesn't give us the most relevant ones. So if you are doing anything related to information retrieval, the first thing you would do is to just turn your documents into bags of words.
12:21
This is like a simplification, and it literally means that you take your document and put all the words into a bag, basically. What you get out of this is a list of words and the number of times they are in the bag. And this gets rid of a lot of information.
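As a rough sketch of this step (the talk's own code isn't shown at this point, so the name here is illustrative), turning a snippet's tokens into a bag of words is just counting:

```fsharp
// Minimal sketch of the bag-of-words step: collapse a sequence of tokens
// into term counts, deliberately throwing word order away.
let bagOfWords (tokens: seq<string>) : Map<string, int> =
    tokens
    |> Seq.countBy id      // (term, count) pairs
    |> Map.ofSeq

// For example: bagOfWords ["let"; "async"; "let"]
// gives map [("async", 1); ("let", 2)]
```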
12:41
If you have a piece of code, it's quite important whether something appears at the beginning or at the end. And this is often done for normal textual documents. And for them, text can be quite unstructured. So it doesn't have a very fixed structure
13:02
that would be the same for every piece of text. But for code, it's much more formalized. So this is actually a very big simplification, but it works reasonably well most of the time. And I could, of course, use some already available system for information retrieval, but I actually want
13:24
to separate the textual information in the snippets, which is there in the form of, let's look at some snippet, for example, this one. We have quite a lot of textual information in terms of the title and description.
13:43
And also, there are comments that sometimes explain what's actually happening in the snippet, or not in this one. So there is quite a lot of different information
14:00
in the actual snippets. And I want to treat the code and the text differently. Because in text, you have to deal with things like different conjugations in verbs and plurals and singulars, et cetera. But for code, this doesn't matter, because code is very formalized. So every keyword will always have the same kind
14:23
of structure or the same form. So I decided to go for my own custom-made system. And the first thing you obviously do is you turn your snippets into some sort of collection of tokens and their frequencies.
14:42
So I was showing this before. With these two snippets, they are exactly the same except for the async, but the async is actually the important bit here. So we will have to deal with this in some way as well. So it wouldn't be a talk about machine learning
15:00
if I didn't use at least some equations. So this is a very simplistic one. And in information retrieval, there is something called the inverse document frequency that we can use to actually estimate how important different terms are in documents. Because if some word appears in every document,
15:22
that means it's not very relevant. And in normal text, this is usually called a stop word. For example, "is" or "a" or "the" are typical stop words. They don't bring any new information into the machine learning system.
15:41
But I have no idea what the stop words are in code. For example, in F-sharp code, that will probably be the let keyword, because it appears everywhere. But it might be relevant in some way in some inner parts of my machine learning engine. So I don't want to create a fixed list of words
16:01
or terms that are not relevant in the code. So I went this way and tried to estimate the relevance of different terms through the inverse document frequency. And it's relatively simple. It's just the logarithm of the number of snippets altogether divided by the number of snippets
16:23
with a specific term. So this will be the IDF of the term. And if the term appears in many snippets, then this value will be low. If it appears in few snippets, then this value will be high. And I can use this to weigh the frequencies of occurrences.
16:42
And this is called, in information retrieval, the TF-IDF representation. The information retrieval people really like these sorts of acronyms. So this is TF-IDF, meaning term frequency-inverse document frequency. And this gives me a very handy representation for snippets
17:04
because now for every term in every snippet, I can get weighted importance of that specific term. And through this, for each snippet, I will get a vector of values.
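Written out (a reconstruction of the standard formulas from the description above, with $N$ the total number of snippets and $N_t$ the number of snippets containing term $t$):

$$\mathrm{idf}(t) = \log\frac{N}{N_t}, \qquad \text{tf-idf}(t, s) = \mathrm{tf}(t, s)\cdot\mathrm{idf}(t)$$

where $\mathrm{tf}(t, s)$ is the number of times term $t$ occurs in snippet $s$.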
17:20
And these values will be actually roughly comparable between different snippets. And then I can use these to compare the different snippets, how similar they are or how far from each other they are. And because I have vectors, I can either compare how far they are from each other, or I can also look at how different they are
17:44
based on the angle they have between themselves. And this measure is really nice because it doesn't depend on the length of each vector. So if you have one snippet that's quite short, then the vector will be quite short as well. If the snippet is quite long,
18:00
then it will be much more complex. So the best way to do it is probably to compare the angle that the vectors for each snippet have with each other. If the angle is small, that means they are heading in the same direction and they are probably describing the same thing. And if they have a large angle,
18:22
that means they are very different. And if they are perpendicular to each other, that means they don't have anything in common. So this is quite a nice way to look at this sort of representation. And just to give you some idea of what I'm talking about, this happens in roughly 10,000 dimensional space
18:44
for the F-sharp snippets. But it's reasonably fast to compute and I will show you. It's demo time. I actually have all the snippets here in my memory right now.
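A minimal sketch of the angle-based comparison (cosine similarity), assuming each snippet's TF-IDF vector is represented as a map from term to weight — an illustrative representation, not necessarily the talk's actual one:

```fsharp
// Cosine similarity between two sparse TF-IDF vectors.
// 1.0 = same direction (very similar), 0.0 = perpendicular (nothing in common).
let cosineSimilarity (a: Map<string, float>) (b: Map<string, float>) =
    let dot =
        a |> Map.fold (fun acc term wa ->
                match b.TryFind term with
                | Some wb -> acc + wa * wb
                | None -> acc) 0.0
    let norm v = sqrt (v |> Map.fold (fun acc _ w -> acc + w * w) 0.0)
    let na, nb = norm a, norm b
    if na = 0.0 || nb = 0.0 then 0.0 else dot / (na * nb)
```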
19:03
And the first thing I will do is to create the bag of words representation. This is probably one of the more computationally demanding parts. And you can see that right now, if you can see the code, for every snippet I have the ID,
19:21
the number of likes the snippet received, and then tokens, both for code and for text. And I will go a bit more into this function because that's actually where the magic happens. This is where I, as a data scientist or a machine learning person, spend most of my time.
19:41
Because here I have to tweak what information I actually want to use to represent the vectors for each snippet. So here, for example, I decided that I will extract a description of each snippet which is composed of the title, description, the tags, subheadings,
20:03
and then also the comments in the code. And I decided to treat those as extended text, which means I ran a tokenizer on that and tried to get just stems of each word. It means that if there is, for example, the word parse,
20:22
I will get just P-A-R-S without the ending. So this will basically map all different forms of parse, for example parsing and parser, onto the same stem. And I can do this for the text, but it wouldn't be right to do something like this
20:42
for the code. So I don't do this on the code and I get two sets of different tokens. And then I can compare different snippets based either on the code or just on their description. So this is quite a long function
21:01
that basically just parses the snippets and tries to extract the information that's actually important. And in F-sharp I can also put a timer into my F-sharp Interactive session just by saying #time. And now it will time everything I do.
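The #time directive is built into F# Interactive; toggling it looks like this:

```fsharp
#time "on"    // FSI now reports real time, CPU time and GC collections per evaluation
// ... evaluate the term-frequency computation here ...
#time "off"   // turn the timer back off
```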
21:23
And now I just computed term frequency for all 1,600 snippets and it took 48 microseconds. And I can also just extract all the possible terms
21:41
and all the possible code items that I have there. And it's really, really messy actually. For example, in the text there is a guitar, there is file name not resolved, there is an accident, there is an accessible account.
22:00
It doesn't sound like normal text, but I don't really care, because I have quite a lot of data and this will all get smoothed out basically. And in the code I have a lot of things as well. I have an account, add cancellation, add days, all sorts of different function names.
22:21
And one other thing I didn't mention: when I was constructing the bag-of-words representation for each snippet, I actually decided to throw away all names of functions that the user defined within the snippet. I decided to keep all the ones that are global, that exist in the environment.
22:41
For example, Array.map is a function name. But if you create some kind of alias for this and call it something else, then it doesn't get included in this sort of representation.
23:01
And now let's finally compute the inverse document frequency, the IDF. Now this was a bit slower; this took two seconds for the 1,600 snippets and 10,000 possible code and text tokens. And as I said, the inverse document frequency
23:24
gives me the relative importance of elements in the text or in the code. So now I can see which elements are not important in the text. For example, here I just extract the 10 least important textual words.
23:41
And they are a, the, to, of, and, I, us, for, in, and with. This kind of makes sense. These are the words that don't add much information to a text. And how does it look in the code? Well, in F-sharp code, printfn is not very informative.
24:06
String, Seq, map, int, list, f, sum, None, and array. These are the 10 least informative pieces of code. That kind of makes sense, but still I don't want to exclude them
24:21
from the search engine completely, because, for example, Seq is quite an important subpart of the language that I want to be able to search for. And now I can just calculate the TF-IDF for each snippet
24:40
and use this to find related snippets to any other snippet. So for example, let's take this one snippet. I will just show you which one I was searching for. This is the minimal XML DSL.
25:02
This is probably one of the most popular snippets. It had 355 likes. And it basically just describes a DSL for XML. And you can see that there is not much information in the text because it has just a very short title and just one line of description.
25:22
So let's look at which snippet is the closest to this one. And it's the 5J. And this is an open XML Word document from string array. So it's related.
25:40
It is about XML and it uses similar code structures and expressions. So this looks fairly okay. Let's look at another snippet. For example, I had 3K here, which is quite nice.
26:03
3K. This is a web crawler. Again, there's quite a lot of code, not that much text. And the closest one is number 65
26:23
and it's web crawler extensions. And actually the first line in the description says that this snippet extends the web crawler from snippet 3K, which is the one I was searching for. So this is really nice. It means that these two snippets
26:40
are showing me something very similar. And just to give you some more idea of how it works, actually, the fourth most similar one is 8V, and that's a caching agent, which doesn't have that much in common with a web crawler.
27:02
But the thing is that this uses the same sort of code expressions. It uses async, it uses MailboxProcessor, so it behaves in a similar way, although it doesn't do the same thing. It's not a web crawler, but it is similar in the code. So this is something I'm interested in.
27:22
If I go to FSharp Snippets and I'm looking for something, I find something that looks related to what I'm trying to find, and then I can look at how people use similar structures in their code. So this is quite good. And also, what I can do with my custom machine learning
27:42
is that I can actually define my own similarity functions. So I can, for example, penalize snippets that don't have anything in common in the code or that don't have anything in common in the text.
28:03
And I can also, for example, here I have another constant. Right now it's set to 0.5. I can change the relative importance of code versus text. Right now it's half and half. They both have equal contribution.
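A minimal sketch of what such a tunable combination might look like — the 0.5 weight and the zero-overlap penalty are the ideas just described, but the function and parameter names here are hypothetical:

```fsharp
// Combine code-based and text-based similarity with a tunable weight w in [0, 1].
// Snippets with nothing in common on either side are pushed to zero, as a penalty.
let combinedSimilarity (w: float) (codeSim: float) (textSim: float) =
    if codeSim = 0.0 || textSim = 0.0 then 0.0
    else w * codeSim + (1.0 - w) * textSim

// With w = 0.5, code and text contribute equally; lowering w favors the text side.
```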
28:22
But I can also say, oh, the text is more important because that's actually saying what the user wants there. So I can increase the importance of this part in the machine learning, in the information retrieval engine. Otherwise, this is not very important. It's just a function that computes similarity
28:42
based on the measures that I define. And there is quite a lot of space for experimentation. And quite a lot of work in machine learning is trying to get the right values for constants. So I could, for example, test my predictions
29:01
on some labeled data and see what weights give me better results. And let's see how it actually works on the website. I have a development version of FSharp Snippets running here. I believe it will be released soon, right, Tomas?
29:24
So now I can, for example, look at, let's see, a strategy pattern in F-sharp. And it gives me the snippet. It's quite short. It doesn't have much information in it.
29:40
And here I get similar snippets. And it's the observer pattern and FSharp strategy pattern and factory pattern, et cetera. So it's giving me quite reasonable results. So hopefully in the future, you'll be able to search for FSharp code much more easily. Let's look at something else.
30:03
Creating objects with events. And yeah, I would hope these are relevant to that. It's harder to see the relevance if it's not something exactly the same.
30:21
But for example, this is a very short snippet that parses Unix timestamps. And the related ones are friendly date formatting, the Twitter stream API, because Twitter gives you times as Unix timestamps, et cetera, and parsing Excel formulas.
30:42
So this already gives me quite a lot of information on the snippets, and it looks useful. But let's get back to my presentation. So this is already quite a valuable contribution
31:00
to the FSharp Snippets website. But just looking for similar snippets is not enough. And actually, one plan I have for the future is to extend this into a proper search engine. Because right now, it's just comparing snippets with each other. But if you just search for a specific term,
31:22
you can treat this as a snippet as well, and then compare it to all the existing snippet vectors and identify the ones that are similar. So this can be extended into a normal, proper search engine. But part of the problem with the FSharp snippets was that when the website was originally created,
31:43
people could just put in tags however they wanted. This is just the text box that appears on the website, and it just says: please give a comma-separated list of tags. Which means that, for example, for tags containing async,
32:03
it was more something like: looks like you are writing about async, add an async tag, or add some completely unrelated tag. So I basically just extracted all the tags that had async in them. So it's async, hashtag async, async mailbox processor,
32:24
async parallel, also typos, async sequences, async seq, asynchronous sequences, and asynchronous sequence. So these are all basically the same, but people were putting in different tags
32:40
because they didn't have any incentive to use tags that were already used before. So normally, well, properly, you should have someone who would go through it and curate the snippets and curate all the tags. Well, but where's the fun in that? So I decided to do it at least partially automatically.
33:03
It didn't work that well. I tried to look at edit distance between tags. This should map the tags that differ in one or two letters onto the same tag, which works quite fine for some. For example, regex or regexp.
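The talk doesn't show its edit-distance code; as a rough sketch, the classic Levenshtein distance over two tag strings could look like this:

```fsharp
// Levenshtein edit distance: the minimum number of single-character
// insertions, deletions and substitutions turning string a into string b.
let editDistance (a: string) (b: string) =
    let d = Array2D.create (a.Length + 1) (b.Length + 1) 0
    for i in 0 .. a.Length do d.[i, 0] <- i
    for j in 0 .. b.Length do d.[0, j] <- j
    for i in 1 .. a.Length do
        for j in 1 .. b.Length do
            let cost = if a.[i - 1] = b.[j - 1] then 0 else 1
            d.[i, j] <- min (min (d.[i - 1, j] + 1)       // deletion
                                 (d.[i, j - 1] + 1))      // insertion
                            (d.[i - 1, j - 1] + cost)     // substitution
    d.[a.Length, b.Length]

// editDistance "regex" "regexp" = 1, so these merge nicely; but
// "pi" vs "api" is also 1, which is why manual review was still needed.
```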
33:21
That's basically the same tag, just people write it differently. But there were things like sport versus sports and pi versus API. So these are not the same things, but they are very similar. So I had to basically manually go through them
33:41
and identify those that map to each other, even though they shouldn't. And I don't think I've identified all of them, so this is still a work in progress. Now back to machine learning. So right now what I had was basically a set of 1,600 snippets and their tags.
34:03
And most of the snippets actually have tags. I think there are just about 300 snippets that didn't have any tags. So this is quite a lot of information. So we should be able to train some proper machine learning system to actually, given a snippet, predict a tag
34:21
that would be relevant to the snippet. And I decided to use a very simple, naive Bayesian approach because I am a Bayesian statistician. And well, this is Reverend Bayes asking us why we are calling him naive.
34:41
And the naive word in that comes from a very naive assumption. And naive Bayesian methods work with the assumption that all the words in the document are unrelated to each other or they are independent. So for example, sequence and exception,
35:01
they are unrelated. They can occur together or they can occur separately, probably with equal probability. They are not related to each other in any way. But on the other hand, a string and a parser, they are very much related because you are usually parsing a string. So the model should account for this.
35:22
And the same for async and MailboxProcessor. They very frequently occur together. But we are simplifying our lives, because if we were trying to include these sorts of relations, that would make the model very, very difficult to estimate or to learn. So we are simplifying everything
35:41
and we are just using the naive assumption that these are independent. And it's a similar thing I was doing before with the bag-of-words representation. I was also just assuming that the order of the words doesn't matter, even though this is clearly not true. So this is the naive thing in naive Bayes.
36:03
And the key idea behind this is that if we look at some piece of code, we can actually see that, for example, this piece of code has async in it many times. And probably the snippets that have async in them will have the async tag.
36:20
So we can use this to actually estimate the probability. And because it's a probabilistic method, I will show some more equations. Who remembers the Bayes theorem from school? Ah, not that many people. Yeah, I apologize for equations,
36:41
but this is a machine learning talk and there should be equations. So this is a simplified version of the Bayes theorem, where I'm interested in the probability of a tag assigned to a snippet. And this is actually proportional to the prior probability of a tag.
37:01
For example, the prior probability that something has the tag async times a product of probabilities that each of the terms from the snippet appears given that the snippet has a tag. I will go through the terms individually.
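In symbols, reconstructing the formula she describes:

$$P(\text{tag} \mid \text{snippet}) \;\propto\; P(\text{tag}) \prod_{t \,\in\, \text{snippet}} P(t \mid \text{tag})$$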
37:21
So first there are the prior probabilities. And that's basically just the probability of a tag before we see any data from our training data set. And we can estimate this from data. For example, the prior probability of the tag async will be the number of snippets with the async tag divided by the number of snippets overall.
37:43
By the way, in my data, this is roughly 5%. So this is the most common tag and it appears in 5% of the snippets. So this is just the probability. If we don't see any data, this is the probability that the snippet will have the tag async.
38:01
And the more complicated one is the likelihood. And this basically just says how frequent a term is among snippets that have a specific tag. So for example, the term MailboxProcessor under the tag async will be much more likely
38:21
than the term let's say parallel or maybe not parallel but maybe parse. So we can estimate this from data as well. You can just look at the number of snippets with the term and the tag together and divide this by the number of snippets with the tag.
38:40
This will give me the proportion of snippets that have the term under the tag. So we can estimate everything from the data. That's simple then, right? We have the whole equation. And if you know the standard Bayes theorem, it also has a denominator, but I don't care about that.
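Both pieces are estimated by simple counting over the training data, as just described:

$$P(\text{tag}) \approx \frac{N_{\text{tag}}}{N}, \qquad P(t \mid \text{tag}) \approx \frac{N_{t,\text{tag}}}{N_{\text{tag}}}$$

where $N$ is the total number of snippets, $N_{\text{tag}}$ the number of snippets carrying the tag, and $N_{t,\text{tag}}$ the number of snippets containing term $t$ and carrying the tag.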
39:02
That's really hard to estimate. In this case, it would be the probability of a snippet. And I have no idea what that is, and I don't have to estimate this at all, because all I need to do is to get the probability of a tag given a snippet and divide this by the probability that the snippet doesn't have the tag.
39:20
And because these two things have the same denominator in the Bayes theorem, this cancels out and I'm okay to go. I don't have to care about the denominator at all. So this is very useful when you are working with probabilities, because quite often the denominator is the hardest thing to estimate,
39:42
because you would have to come up with a probabilistic model for your data and that's hard. So in this case, when I estimate these probabilities, if this ratio is larger than one, then the tag should be assigned to the snippet and if it's smaller than one, then it shouldn't be.
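Because the shared denominator $P(\text{snippet})$ cancels in the ratio, the decision rule is just:

$$\frac{P(\text{tag} \mid s)}{P(\neg\text{tag} \mid s)} = \frac{P(\text{tag}) \prod_{t \in s} P(t \mid \text{tag})}{P(\neg\text{tag}) \prod_{t \in s} P(t \mid \neg\text{tag})} \;>\; 1 \;\Rightarrow\; \text{assign the tag.}$$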
40:02
That's fairly easy, isn't it? Looks very straightforward. The problem is that the theory is always nicer than the practice. And there are things like snippets that are tagged with a specific tag
40:20
but don't contain some term. For example, imagine that there is no snippet with the tag async that would contain list. But clearly these two are not related. This shouldn't matter. But if there is no snippet like this, then the probability will be equal to zero. And then in the Bayes theorem,
40:41
this whole equation would be equal to zero, even though it's not related. So we have to do quite a lot of tweaking and introduce pseudo-counts, basically pseudo-observations so that every tag is associated
41:00
to every term in some way. And it's time to show you how it actually looks in practice. I won't really go through the code itself. This is a function to compute the log likelihoods.
41:22
I'm actually working in log space because whenever you are working with probabilities like this, it's important to note that this is always smaller than one because it's a probability. That means that if you multiply a lot of these terms, you will get very, very small values very quickly.
41:43
So the normal thing is to work in log space, which transforms the product into a sum. And that's much more manageable suddenly. So this is not very interesting F-sharp code, because it just deals with a lot of vectors and summing different vectors and matrices.
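A minimal sketch (not the talk's actual implementation) of the log-space scoring with pseudo-counts, where all the names and count inputs are hypothetical:

```fsharp
// Log-posterior score of one tag for one snippet, computed in log space.
// n          = total number of snippets
// nTag       = snippets carrying the tag
// vocabSize  = number of distinct terms overall
// alpha      = pseudo-count smoothing, so unseen terms don't zero everything out
// termCounts = for each term, how many tagged snippets contain it
let logPosterior (n: int) (nTag: int) (vocabSize: int) (alpha: float)
                 (termCounts: Map<string, int>) (terms: seq<string>) =
    let logPrior = log (float nTag / float n)
    let logLik t =
        let c = defaultArg (termCounts.TryFind t) 0
        log ((float c + alpha) / (float nTag + alpha * float vocabSize))
    // The product of probabilities becomes a sum of logs.
    logPrior + (terms |> Seq.sumBy logLik)
```

Computing this score for the tag and for its complement and comparing the two gives the ratio test above.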
42:05
And unfortunately right now it's not very quick, but there is a lot of space for improvement. But I will show you how it actually looks on the website. So I have it running here.
42:22
And if I want to insert a snippet, I come here. If the snippet is hidden, that means it's private, it doesn't get onto the main website, then I can't put in basically anything apart from the source code. But if it's public, I can certainly put in a description
42:41
and I can add tags. So I will just use this, the first snippet on the FSharp Snippets website. Copy the source, put it here, and let's see what it does. Now what this is doing behind the scenes
43:01
is that it's actually parsing the code and checking it. So here it gave me a check mark that the snippet parses and type-checks with no errors. And this filled in some guesses on the tags that should be attached to the code. And it's about asynchronous sequences.
43:21
So the first tag is async, that's really nice. And the other thing is that this code, basically what it does is that it compares two files based on blocks and it reads them asynchronously. That means it's dealing with a lot of strings. And the second tag is string.
43:42
And although this code doesn't use the MailboxProcessor, it's still predicting it should have this tag, because a lot of asynchronous snippets use the MailboxProcessor. So I would probably cross this one out, and then I can add other tags, for example, sequences.
44:07
So now this is another improvement on the website. Right now this is a nice JavaScript text window where you actually get a selection of the tags that already exist on the website.
44:21
So hopefully people won't be adding their own tags all the time. So hopefully it will help. Okay, so that was the little demo. And just to give you a little summary, what's really important when you are dealing
44:42
with anything machine learning is to come up with a good representation of the domain you are trying to represent. So for example, for me, it was important to take into account the types in the snippets, not just the text and what appears on the screen, but the types of the functions.
45:01
So I had to include this in the model. And it's also important to find out which features are actually important in the predictor. That's another thing I would still like to do, because in the Bayesian model I can actually see which terms contribute most to the classifier.
45:24
So for example, I can look at something called mutual information. So what's the mutual information between terms and the tags, and identify the ones that are most important. And when you have probabilities, you can do basically anything.
45:41
And I hope you have seen that machine learning is quite fun. So this is just a small open source project for the community for sharing F-sharp snippets. And for me, this was the first time working with anything really information retrieval related.
46:02
And it was really interesting to see how it behaves differently from the sort of data I work with normally. And if you have seen anything that you would like to improve on F-sharp Snippets, you can contribute. This is still an ongoing effort and the final release hasn't happened yet.
46:21
And if you want to learn more, well, you can obviously go to F-sharp Snippets and put in your own snippets. If you are interested in F-sharp, you can go to the F# Software Foundation website. And because I was doing a lot of machine learning here, I was using the FsLab package, which is like a nice F-sharp packaging
46:41
of all machine learning and data science related packages that you might need. And if you are interested in information retrieval, there is a very nice book online on informationretrieval.org. And it's published by the Stanford NLP Group, which is very good.
47:01
They have a lot of code in Java for doing natural language processing. And you will find the code on my GitHub and this is my website and my Twitter. If you have any questions, just ask. Thank you.