The F#orce Awakens
Formal Metadata
Number of Parts | 96
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers | 10.5446/51849 (DOI)
Transcript: English (auto-generated)
00:05
OK, I think it's time to start, so welcome to The Force Awakens with F-Sharp. I don't even know how to pronounce this correctly. So I'm Evelina Gabasova and I work as a postdoc researcher
00:20
at Cambridge University in cancer research. So I deal a lot with DNA mutations, things like that, and that's an incredibly complex type of data. And when you look at it, it's really hard to understand what you are doing and even if you fit some kind of statistical machine learning model, it's really hard to see what's actually happening there.
00:41
So, this is an example of metabolic networks and you probably can't see anything. I can't see anything because I'm not a biologist. So, in my free time, I like to play with other data.
01:29
I suspect they won't be using this title for the actual episode 8. So, let's talk about Star Wars. Who likes Star Wars? Yeah, thanks for coming, you are in the right talk right now.
01:42
So, some time ago I decided, well, let's analyze Star Wars. I actually started doing this right before the premiere of episode 7, so there was a lot of hype around Star Wars. So I was thinking, so what kind of data can I actually analyze? And, well, there are obviously the movies, but it's kind of hard to analyze movies, right?
02:02
So, the next dataset that's available about Star Wars is actually the screenplays or scripts. And there are multiple different repositories online where you can go and download the script of almost any movie you are interested in. And they publish it very quickly. So, for example, the script for episode 7 was there I think in January.
02:23
I'm not really sure what's the legal status of that, but if you want to look at it, you can. So, I looked at the scripts. And this is an example of a script from episode 4, the original Star Wars. And the great thing about this is that they have a very standard type of format.
02:42
So you can probably see that there is some title of a scene and they always start with int or ext, meaning interior or exterior. Then there is some description of a scene and there is a name of people speaking and what they are saying. And because we are programmers, we know that this is actually quite easy to parse
03:03
or if it has this type of structure, we can parse it. So, I decided, well, let's look at interactions in Star Wars and maybe I can even extract social network by looking at who speaks with whom in the scenes. So, I went on and downloaded all the script files of all the 7 movies right now.
03:24
I don't know if I will continue with this in the future, but I have 7 of them now. And I looked at them and because they are published online, they are usually in HTML and this is how it looks. And that's also pretty nice, right? There is like a pre-formatted HTML block with the scene name in bold
03:45
and then there is some description and again, name in bold and in capitals. Oh, that's easy to parse. Well, usually. So, when I first started doing this, I ended up with a bunch of regexes
04:00
because I wanted to match all the different ways a name can appear in a script. But, because I'm working in F-sharp, I would like to show you another way to parse these things. I know if you have been to Scott Wlaschin's talk before me, he was talking about parsers, I'm not doing anything that fancy. But, when I looked at the structure of the script,
04:25
it's actually quite simple. This is just an example of the standard format of the script, how it looks, but I want to really write something that I can read. I don't want to deal with how it looks underneath.
04:41
So, I used something that's called Active Patterns in F-sharp and there you can write something like this. You just parse a scene, you split the script into elements and then you can just match each element here with a scene title or a name, and that's very readable
05:02
and you don't even have to know what's happening underneath. I'll show you what's happening underneath. This is something called the Active Pattern in F-sharp. So, when you want to match something against some regex, you don't have to put the regex into a function that's actually doing the matching. You can just look at, you can define something like this.
05:24
These are called the banana clip operators. So, you can say that a piece of text is either a scene title or a name or something that I'm not interested in. And you can hide all the ugly regexes in here.
05:40
And by the way, the regex for names is very complex because in Star Wars, characters are named anything. You don't get people named C-3PO in real life. So, this had to be quite complex because it had to match all sorts of dashes, slashes, everything. And I don't want to care about this.
06:02
For me, regexes are something that's write-only. I don't want to look at them again. So, I can hide them in this kind of thing. So, what this does is it takes the string as an input and then it matches it against each of the regexes and returns this kind of thing, like scene title and text
06:22
or the name text or a word. And then when I do pattern matching over it, I don't have to care about what it does underneath. And this is readable. When I went back to my original implementation with all the regexes all over the place, I had no idea what's happening there. So, for me, this is the way to write parsers in F-sharp.
06:46
Or, at least for very simple cases. And if you want to play with F-sharp, definitely look at active patterns because they allow you to hide the implementation details and then you don't have to care about it in the more high-level functions.
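The idea of hiding the regexes behind named cases can be sketched in Python rather than the F# used in the talk. This is only an illustration: the two patterns, the `classify` and `scenes` helpers and the sample lines are all made up here, and the real screenplay regexes are not shown in the transcript.

```python
import re

# Hypothetical patterns; the real ones from the talk are more complex.
SCENE_TITLE = re.compile(r"^\s*(INT\.|EXT\.)\s*(.+)$")
NAME = re.compile(r"^\s*([A-Z][A-Z0-9'\-/ ]*)\s*$")


def classify(text):
    """Return a (tag, value) pair, playing the role of the active pattern cases."""
    m = SCENE_TITLE.match(text)
    if m:
        return ("SceneTitle", m.group(2).strip())
    m = NAME.match(text)
    if m:
        return ("Name", m.group(1).strip())
    return ("Word", text.strip())


def scenes(lines):
    """Collect the speaking characters under each scene title."""
    result = []
    for line in lines:
        tag, value = classify(line)
        if tag == "SceneTitle":
            result.append((value, []))
        elif tag == "Name" and result:
            result[-1][1].append(value)
    return result


# Made-up sample in the screenplay layout described above (not real script text).
sample = [
    "INT. SHIP - CORRIDOR",
    "Some description of the scene.",
    "C-3PO",
    "Some dialogue.",
    "EXT. DESERT",
    "LUKE",
    "More dialogue.",
    "C-3PO",
    "A reply.",
]
parsed = scenes(sample)
```

The code that walks the script only sees the tags `SceneTitle`, `Name` and `Word`; the regexes stay in one place, which is the point made about active patterns.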
07:03
So, I thought, now I'm set. But, it wasn't the case. So, I tried to run it on all my seven different screenplays and, yeah, it was a trap. Because some of the files have completely different structure.
07:22
For example, the names are not in bold, are not centered, they are at the beginning of a line, followed by a colon, etc. So, I had to write another parser for this type of screenplays and it was much more work, as usual. And that's a general case in anything data-related.
07:43
You spend 90% of your time just cleaning up the data. And now, this also still has some kind of structure because all the names are in capitals, at least. Well, you would think so. For example, this one. By the way, it's this scary guy.
08:03
But, if you look at the second letter, it's not an I. It's a lowercase L. I don't know if it came from some OCR system and it just decided that I is a lowercase L. And for things like these,
08:21
I actually had to put in explicit modifications because I didn't find a way how to deal with this systematically. And that's every time you have to deal with data. So, I went through this and I thought, OK, now I'm all set. I have all the characters and all the scenes.
08:41
It's all nice and easy from now on. Well... No. Because some characters don't even speak in Star Wars screenplays. They are mentioned there, but they don't speak at all. For example, R2D2. If you actually go into the script, they say things like these.
09:00
Like, R2 frantically beeps something. Oh, thanks. Or, Chewbacca doesn't speak at all. So, this is Chewbacca's comment and Han replies, Boy, you said it, Chewie. No, you didn't! So, actually Peter Mayhew, who plays Chewbacca,
09:20
I think he did a very good job playing him, because with this type of instruction in the script, what can you do? So, when I extracted all the names in all the scenes, I decided, well, I actually have to include these characters because they can't be left out, right? You can't just have Star Wars social network or something
09:42
without R2D2 or Chewbacca. So, I went on and actually extracted also all the mentions of all the characters in screenplays and tried to estimate how many times would they have spoken if they had spoken there. So, I came up with very, very scientific equations.
10:03
Well, you don't have to care about this really. It's basically just scaling the number of times a character is mentioned by the times they actually speak. So, on average, actually every character is mentioned about two times as much as they speak explicitly in script.
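One possible reading of this scaling, sketched in Python. The exact equation from the slide is not in the transcript, so the pooled ratio below is an assumption, and all counts and the `estimate_speaking` name are invented for illustration.

```python
def estimate_speaking(mentions, speaking, silent):
    """Estimate spoken lines for silent characters from their mention counts.

    Assumption: the ratio of spoken lines to mentions is pooled over all
    characters that do speak, then applied to the silent ones.
    """
    total_spoken = sum(speaking.values())
    total_mentioned = sum(mentions[name] for name in speaking)
    ratio = total_spoken / total_mentioned
    return {name: mentions[name] * ratio for name in silent}


# Invented counts: speakers are mentioned twice as often as they speak.
mentions = {"HAN": 40, "LUKE": 60, "CHEWBACCA": 30}
speaking = {"HAN": 20, "LUKE": 30}
estimate = estimate_speaking(mentions, speaking, ["CHEWBACCA"])
```

With these numbers a character mentioned 30 times is credited with 15 spoken lines.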
10:22
So, I just counted how many times, let's say, Chewbacca is mentioned and weighted it by how many times Han is mentioned and tried to compute how many times Chewbacca would have spoken in the script. So, everything is good now, right? Well, sorry I didn't include some of the characters,
10:41
because there are actually so many characters that don't say a word that Ewoks didn't make it, for example. So, the only ones that I included manually in this way were R2D2, Chewbacca and BB-8 from Force Awakens. And then, when I look at the characters
11:01
that my algorithm extracted from all the screenplays, I found out that actually some of the other characters they appear there, but they don't say a word in some of the episodes. So, they messed up my analysis again. And if you have seen Episode 7, you probably know who I'm talking about.
11:21
So, now I decided, well, I should somehow check if all the characters, because there was such a mess, actually appear in the screenplays and if they are actual characters, if it's not just some artifact of my algorithm. And I proposed a theorem that there is an API for everything.
11:43
And there is an API for Star Wars. You can go to SWAPI, the Star Wars API, at swapi.co. So, I looked at the website and it's quite nice, quite well documented.
12:01
You have all sorts of example requests and what it returns and you can get information on characters, on starships, on planets, on vehicles, I think, as well. Everything. And they have a lot of wrapper libraries. So, when I looked at them, there were C sharp, Python, R,
12:21
anything, Java, Go, Ruby. There was no F sharp, but well, maybe I can just use one of the other ones. But then, no, no, no, well, I'm doing this in F sharp, let's do it properly. And then I looked at, for example, this is the C sharp one
12:41
and you can see that the code is not very difficult. These are all just get and set methods and things like that. So, there is not actually much going on. And this is just an example of one of the wrappers around, I think, people
13:01
and you can see it's 150 lines and most of them don't actually do anything. So, I wanted to show you how we can do something like this in F sharp. And if you have seen any of the F sharp talks, some of them mentioned type providers. So, this talk is partly me saying I love type providers.
13:23
If you are doing anything with data, type providers are amazing. So, I wanted to show you how I can use type providers to write something like all the wrappers that were there in all the different languages in a very small amount of code. So, I'll just load the FSharp.Data library
13:41
which contains some of the type providers. And now this is actually just an example request to the Star Wars API. And what I get from this is some document that basically just answers the request.
14:03
And what I will do is I will just... I don't have to read the documentation, I don't have to do anything. I can just take this and create a type provider. And it will be a type provider for a Star Wars person and it will be a JSON provider.
14:21
And I will give it the URL. I will evaluate this and I have basically everything I need right now. So, now I can write a function to get me information about any person in Star Wars. So, get person and I will take an ID as a parameter
14:44
and now I will just call Person.Load and I will give it the request that I want. And I will give it the ID.
15:00
Oh, I need normal brackets here. And that's it basically. Now I can have a look at the first person. I need to change this into string.
15:25
And now I get the information on the first person. And again, it returns a JSON which is not very readable. So, I can just type p. and get all the information here in my IntelliSense including the types of all the elements.
15:41
So, for example, I can look at who the first person actually is. So, I will look at name, I know that it's a string and it's Luke Skywalker. Do you know what's the... Well, let's say his mass, how much he weighs. Did you ever want to know something like this?
16:00
Well, he weighs 77 kilos. A lot of very important information. But what I wanted to show you particularly is that basically in these three lines I have everything that was there in the full big wrapper. And I don't have to care about it at all. I don't have to read any documentation, write any methods.
16:22
This is all I need. So, I actually went on and wrote a full wrapper around the API. And you can see it's not very imaginative code either. It's just defining the type providers and defining some of the functions. And that's it. That's everything.
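A rough Python analogue of getting field access straight from a sample response. Unlike the F# type provider, nothing here is checked at compile time; attributes are simply created at run time from whatever the JSON contains. The sample document is a hand-written assumption containing only the three values mentioned in the talk, and `parse_person` is a hypothetical helper, not part of any wrapper library.

```python
import json
from types import SimpleNamespace

# Assumed shape of a response; only these three values come from the talk.
sample_response = '{"name": "Luke Skywalker", "height": "172", "mass": "77"}'


def parse_person(text):
    """Turn a JSON object into something with attribute access (p.name, p.mass)."""
    return json.loads(text, object_hook=lambda d: SimpleNamespace(**d))


p = parse_person(sample_response)
mass_kg = int(p.mass)      # numbers arrive as strings in this sample
height_cm = int(p.height)
```

The design trade-off is the one the talk is making: the sample response, not hand-written getters and setters, defines which fields exist.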
16:42
And on the same day that I discovered the Star Wars API I actually just went back and sent a pull request and now there is an F sharp wrapper. Thank you. And in case you were wondering other random things about Star Wars
17:05
I can show you some more information. For example, do you know what's the most common eye color in Star Wars characters? You always wanted to know this, right? So, the first most common eye color is brown.
17:22
So, for example, Yoda has brown eyes. Who knew? The second most common color is blue and the third most common color is yellow. So, Darth Vader has yellow eyes. All the important information.
17:40
And you can also see, for example, in Episode 4, the original Star Wars Luke Skywalker once walks into a prison cell and Princess Leia says, aren't you a bit short for a stormtrooper? So, we can see. So, is actually Luke Skywalker a bit shorter than average Star Wars character or not?
18:00
Wow. It's very easy. I just get his height, which is 172 centimeters. And I also downloaded information on all the other characters. And it's 174 centimeters on average. So, yes, Luke Skywalker is shorter than average. So, she was right.
18:22
So, all the important information that you always wondered about, right? So, well, let's move on. So, if you want to look at it, it's on my GitHub and you can get to it from the SWAPI website as well. And, yeah, play with it. It's a lot of fun information. So, now, what I had right now,
18:41
when I ran all the code that I was describing, I had characters, I had the scenes, and I knew that all the characters are actually appearing in the films. And then, Star Wars was bought by Disney and then I came across this analysis and they actually analyzed 2,000 screenplays
19:03
and they were looking at how much female characters speak and how much male characters speak. And they were just comparing scenes and how much screen time they get. And they specifically looked at Disney and it was sort of depressing because in almost all the Disney movies
19:21
men speak much more than women, et cetera, even if it's about princesses and things like that. And because I had almost the exact same data, I thought, well, maybe I can replicate this using my Star Wars dataset. And, again, type providers. Because what they did in this analysis is that they went onto IMDB,
19:43
extracted the list of actors playing in the individual films and then they looked at whether they are male or female. And I can do exactly the same thing with my type providers because there is not only a type provider for JSON, there is a type provider for HTML as well.
20:03
So, an IMDb HTML type provider now. So, let's just open everything again. And, as you can see, I'm also giving it just the URL of some page describing a film.
20:22
And this is actually from episode 7, so I will load it now. And I don't even have to look at the website. Now I can do episode 7 dot and I know that there are some lists and some tables. So, I will look at the tables and the tables are these.
20:42
I don't have to go through the website at all. I see everything here in F-sharp and this is all the code I really need to access it. So, I can look at the cast and they have the cast in credits order verified as complete.
21:00
And here I have a nice printing function to actually look at it properly. So, these are all the people that play something in episode 7. And, for example, did you know that Daniel Craig plays a stormtrooper? He's somewhere here. Yeah, Daniel Craig plays a stormtrooper and he's uncredited.
21:24
So, this gives me quite a lot of information. And again, it's in my repo. I don't actually have to look at the website and go through it. So, with a bit more code, I put together this graph.
21:40
So, this is comparing episode 1 to 7 based on what's the percentage of dialogues with men and what's the percentage of dialogues with women. And actually, sorry, I put women and robots together because they are like other genders.
22:01
Because otherwise it would be even worse. So, episode 7 indeed has more women speaking but still almost 70% of the dialogue is men. Still better. The worst one is actually episode 4. But enough of this.
22:22
I had right now, as I said, all the characters and their relationships and where they speak. So, I decided to put together a social network by putting together characters that speak in the same scene. If they do, they are connected by a link.
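The linking rule just described can be sketched in a few lines of Python. The scene lists are invented, and `build_links` is a hypothetical helper; counting how often a pair shares a scene is an extra that the talk does not mention.

```python
from collections import Counter
from itertools import combinations


def build_links(scenes):
    """Link every pair of characters that speak in the same scene.

    Each link is stored as an alphabetically ordered pair, with a count of
    how many scenes the two characters share.
    """
    links = Counter()
    for speakers in scenes:
        # set() removes repeats: a character usually speaks several times per scene.
        for a, b in combinations(sorted(set(speakers)), 2):
            links[(a, b)] += 1
    return links


# Invented scenes: each inner list holds the speakers in one scene.
scenes = [
    ["LUKE", "LEIA", "HAN"],
    ["HAN", "CHEWBACCA"],
    ["LUKE", "HAN", "LUKE"],
]
links = build_links(scenes)
```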
22:42
And I can visualize this. Because, well, what do you do if you want to visualize something nice and put it on the web? You use JavaScript. So, I went for D3.js which is an amazingly powerful library in JavaScript for visualization.
23:01
You can write anything there. The downside is that you can write anything there because it's so powerful that you can't really learn it. You have to go online and look at various examples of what people did with it before so that you can copy it and put your own data into it. Anyway, I don't really want to use JavaScript
23:21
and, as I said, I want to do everything in F-sharp here. So, there is this new project called Fable. So, it's using Fable. This is the current logo. And that allows you to write code in F-sharp and translate it into JavaScript. I think it's called transpiling. Is that the correct term?
23:42
So, it's a transpiler and it's actually really neat. It's a very new project. I think it started about six months ago. And here on the right-hand side you can see the code in JavaScript from an example on D3.js that I just downloaded from the web.
24:02
And on the left-hand side is the code in F-sharp that's actually calling the D3.js library. And you can see that it's very similar, actually. Right? Well, instead of var you have let. But otherwise, all the function calls are very similar.
24:23
And sometimes you have to deal a bit with types but it's not very painful. And it allows you to basically just translate your JavaScript code into F-sharp and you can call any functions from F-sharp and have all the type safety and everything. And I just wanted to show you also the translated code.
24:44
So this is how it translates the code into JavaScript. And you can see it, for example, this is calling something called the force layout of networks. And it's actually very readable. There are no thousands of underscores everywhere.
25:04
And you can go there and see what's happening. And that makes any debugging so much easier. So I actually went on and did all the things in Fable. And if you want to play with it, there are many examples online
25:21
such as games written in F-sharp and translated into JavaScript. There are also some examples with Node.js. And it's really nice, really. So let's go to the actual social networks that I promised. So this is the social network of all the Star Wars movies put together.
25:45
And because I promised JavaScript, here it is. So this is an interactive visualization of the Star Wars movies. Let's make it a bit bigger. So you can see that the big black node is actually Darth Vader.
26:03
And I guess you can already see some patterns emerging. Whenever you are working with data, try to visualize it because that gives you so much information. So I guess right now you can probably guess that on the left-hand side are the new prequel episodes.
26:20
And you can see that the social network is a mess. You can see that there are so many nodes and it's very dense. Then in the center are the original episodes and on the right-hand side is the episode 7. And we can even look at the episodes individually. So this is the first episodes of the prequels.
26:42
And you can see it's quite densely connected. It has several main characters. I tried to color them. So this is, for example, Qui-Gon. This is Anakin. This is Jar Jar. And when we compare it with the episode 4, the original Star Wars,
27:06
you can probably immediately see some difference. The network is much sparser, it has only a few major characters and they are connected to each other. And if we compare it with Force Awakens,
27:22
the social network is bigger but there are not that many characters as in the prequels. So that already tells us something about the structure of the story and how understandable it is probably. And if we go even further, we can quantitatively compare the different episodes.
27:42
So the first thing that you can think of is how large networks are. So this is just a graph showing the number of characters. And these are only the characters that speak at least in two scenes. And only the characters that are explicitly named.
28:00
So if someone is just a stormtrooper, I didn't include them there. And episode 1 has the most characters. And then the original episodes have fewer, about 20 main characters. And then in episode 7 it jumps up again up to 27.
28:21
So let's hope they don't continue with this trend, otherwise they will get to the episode 1 territory and we probably don't want to go there. And then there are various scientific methods to compare networks. And the first one is called density,
28:40
which sounds like a density of a network. But what it really does, it just tries to compare the number of connections that are in the network with the number of connections that could be potentially there. So if you want an equation, this is it. Just divide the number of existing connections in the network
29:00
by the number of total connections that could be there. And when we look at this, actually the episode 1 and 2 have the lowest density. Because they have a lot of characters that are only vaguely connected and they have all these debates in the Galactic Senate that are not very interesting.
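As a worked version of that division, here is a small Python sketch for an undirected network, where n characters can have at most n(n-1)/2 links. The four-node network is made up, and `density` is just an illustrative name.

```python
def density(nodes, edges):
    """Existing links divided by the links that could exist (undirected graph)."""
    n = len(nodes)
    if n < 2:
        return 0.0
    possible = n * (n - 1) / 2
    return len(edges) / possible


# Made-up network: four characters in a chain, 3 links out of 6 possible.
nodes = {"A", "B", "C", "D"}
edges = {frozenset(pair) for pair in [("A", "B"), ("B", "C"), ("C", "D")]}
d = density(nodes, edges)
```

Storing each link as a `frozenset` means A-B and B-A are counted once.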
29:20
And actually interestingly, episode 6 has quite a high density and that's maybe because I didn't include the Ewoks in there. And episode 7 actually has about the same density as the original episodes and so does episode 3. And actually if you look at IMDb, you will see that episode 3 has the highest rating out of the prequels.
29:40
So maybe density has something to do with the quality of the story as well. And you can look at other measures. One of them is the clustering coefficient, which tells us how locally connected the network is and how much the characters actually speak to each other.
30:01
So if you look at this green guy, he has three neighbors. And you can see if the three neighbors are all connected to each other as well or not. So for this one, there are two of them connected. For example, for this one, all his friends are connected.
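The neighbor check in this example can be written down as a local clustering coefficient: the fraction of pairs of a node's neighbors that are themselves linked. The Python sketch below uses a made-up graph in which "G" stands in for the green node; whether the talk averages this local value or uses a global variant is not stated in the transcript.

```python
from itertools import combinations


def neighbours(edges):
    """Build an adjacency map from a list of undirected links."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def local_clustering(adj, node):
    """Linked pairs among a node's neighbours, divided by all possible pairs."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    linked = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return linked / (k * (k - 1) / 2)


# "G" has three neighbours, and only A and B know each other.
edges = [("G", "A"), ("G", "B"), ("G", "C"), ("A", "B")]
adj = neighbours(edges)
cc_green = local_clustering(adj, "G")   # one linked pair out of three
cc_a = local_clustering(adj, "A")       # A's two neighbours are linked
```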
30:21
So that tells you how connected your networks are. And in terms of the story, this basically means if the story is following one character that just interacts with other people and the other people don't talk to each other, then that will have a very small clustering coefficient. And if the story follows a group of people that talk together
30:41
and interact with each other a lot, then it will have a large clustering coefficient. And this is the equation. We don't actually have to care about it very much. I don't want to go into details. But if I plot it for all the different episodes, again episode 1 has the lowest clustering coefficient
31:02
because there are all these weird characters that don't really talk very much. And what's nice is that episode 7 has about the same clustering coefficient as the original episodes. And episode 3 here has quite a small clustering coefficient as well,
31:21
which is interesting. It tells us something about the story. And I'm not claiming that small clustering coefficient means a bad story. I think it makes sense in something. For example, if the main hero is going through some obstacles and meeting other characters that are helping him on the way but don't really interact with each other, that can be a good story.
31:41
But I think in Star Wars, it actually tells us something about maybe how the network is structured. Actually, I think in this case it roughly correlates with the quality as well. And then we can look at local characteristics in the network.
32:03
So the first one, the most basic one, is degree. And that's just by looking at how many connections a character has in the network. So for example, this guy, he's very important because he has six connections in this network, quite small one. But this guy is less important because he has just three connections.
32:22
And that tells us something about centrality of the characters in the social network. So this just represents how many characters each one of them speaks to. It's just basically number of links that are outgoing or incoming into a node.
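Degree is the simplest of these measures to compute: count the links touching each node. A minimal Python sketch on a made-up network, assuming each undirected link appears once in the list:

```python
from collections import Counter


def degrees(edges):
    """Number of links attached to each node of an undirected network."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


# Made-up network; "HUB" talks to three other characters.
edges = [("HUB", "A"), ("HUB", "B"), ("HUB", "C"), ("A", "B")]
deg = degrees(edges)
most_connected = max(deg, key=deg.get)
```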
32:42
And I want to compare it with another measure of centrality in a network which is called betweenness because there are many measures of who's the most central in a network. And the degree just tells us if someone talks to a lot of other characters.
33:00
But betweenness tells us how important a node is for communication in the network. So I will explain again. So if you look at this guy and these two other guys, the only way they can communicate with each other is through the green guy because they don't know each other directly.
33:21
And these guys, on the other hand, if they want to talk to each other, only one half of the communication would go through the green guy because they know each other through some other person as well. So I can for example ask, so if Princess Leia wanted to talk to Jar Jar Binks,
33:41
who would she have to go through to pass a message to him? So that tells me how important a character is within the story. Because some characters may be important just in one part of the network but if a character speaks to a lot of different characters across the whole episode
34:02
then that means he's probably more important to the story. So again this is the equation. And now it's a bit more complicated because for each pair of nodes we look at the number of shortest paths between them and then look at how many of these shortest paths go through a specific node.
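The slide with the equation is not part of the transcript. The description matches the standard definition of betweenness centrality, which can be written as follows, where $\sigma_{st}$ is the number of shortest paths between nodes $s$ and $t$, and $\sigma_{st}(v)$ is the number of those paths that pass through node $v$ (the symbol names are the conventional ones, not necessarily those on the slide):

```latex
g(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}
```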
34:22
And we sum it over all different pairs of nodes in the network. And it's already quite hard to compute, right? Because you have to compute all the shortest paths between all the nodes in the network and then look at how many of them pass through the node that you are interested in.
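To make the computation concrete, here is a small brute-force sketch for an unweighted, undirected network. It is not the implementation used in the talk (which calls igraph from R) and it is far slower than library algorithms; it only follows the description above: count shortest paths between every pair and the share of them passing through each node. The node names are placeholders.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path_counts(graph, source):
    """Breadth-first search: distance and number of shortest paths from source."""
    dist = {source: 0}
    paths = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                paths[nxt] = 0
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:
                paths[nxt] += paths[node]
    return dist, paths


def betweenness(graph):
    """Sum, over unordered pairs (s, t), the share of shortest paths through v."""
    info = {s: shortest_path_counts(graph, s) for s in graph}
    score = {v: 0.0 for v in graph}
    nodes = sorted(graph)
    for i, s in enumerate(nodes):
        dist_s, paths_s = info[s]
        for t in nodes[i + 1:]:
            if t not in dist_s:
                continue  # s and t are not connected at all
            for v in nodes:
                if v in (s, t) or v not in dist_s:
                    continue
                dist_v, paths_v = info[v]
                # v lies on a shortest s-t path when the distances add up.
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += paths_s[v] * paths_v[t] / paths_s[t]
    return score


# "left" and "right" can only reach each other through "green".
chain = build_graph([("left", "green"), ("green", "right")])
# Same pair, but now they also know each other through "other".
square = build_graph([("left", "green"), ("green", "right"),
                      ("left", "other"), ("other", "right")])

print(betweenness(chain)["green"])   # 1.0
print(betweenness(square)["green"])  # 0.5
```

The second network reproduces the "only one half of the communication" case: with two equally short routes between the pair, each intermediary gets half the credit.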
34:40
And if you are doing anything with data there is always a package for it in R. So there is a library in R called igraph and if you are interested in betweenness there is a function called betweenness. If you are, as I was, working in F-sharp
35:03
you can do something like this. Just open RProvider.igraph and call the betweenness function from R. And again you get IntelliSense and everything. And this makes anything like this very easy because you do all the heavy pre-processing
35:21
in a language that you are more comfortable with or that's a bit safer than R, and call the algorithms that are already implemented there. And I actually do this quite a lot because R is a nice programming language but just for data science. If you want to do any general programming
35:41
then it can get really, really painful. And in F-sharp you get all the safety of type inference so it really helped me when I was writing my parser, for example. And I just do all the heavy data science in R and I can pass data to it directly from F-sharp. So let's look at the centrality.
36:01
So who is the most central in episode 7, let's say? So actually the person who has the most connections is Poe. He's one of the resistance pilots and that's because he talks to a lot of people across the whole network and he talks to all the other resistance pilots as well
36:20
and there are quite a lot of them. And the second most central is Finn because he also talks to a lot of other people and then Han, Chewbacca and BB-8. And for BB-8 this is just an estimated number because I didn't find him in the screenplays as I was explaining. So does this tell you something about who's actually important in the story?
36:45
Well, sort of, but you might be missing some of the main characters in this. So this is actually betweenness. And now you can see that suddenly Kylo Ren and Rey jumped up a lot. And that's because they are more important by
37:07
connecting different communities within the network. So for example, Kylo Ren is very important because he is one of the few people that talk to Han and Chewbacca and Rey but he also talks to Snoke
37:21
which puts him into the center of the network a bit more. But also Poe talked to him which already gave him quite a boost in betweenness as well. So this tells you a bit more about who's important in the actual network. And now you might be asking who's the most important across the whole Star Wars, right?
37:43
So, before the new episode this is just Poe celebrating that he's the most central. So, who is the most central overall across the whole Star Wars universe? Before episode 7 it was this guy
38:03
because he was talking to quite a lot of people in the prequels as well as in the original trilogy a little, although he physically appeared in only the first one of them. And also what I did in this analysis, when I got this guy, it was because I just looked at the names in the screenplays
38:25
and actually they are making a very big distinction between Anakin and Darth Vader because they never appear in the same scene so they appear like two completely different characters. And when I added the new episode and also merged Anakin and Darth Vader
38:42
well, still Darth Vader is ruling the galaxy he is the most important one. Everyone else is just lying there dead. So, also I made a quick sample with Neo4j
39:01
so if you want to learn something about Neo4j I have a link at the end you can go there and play with it because it's a very nice dataset and you can just extract connections between people. And what's important here is that you actually understand what's going on because all the people have names that you recognize
39:21
and that makes any learning so much easier than if you have just some anonymous customers or something like that. You can understand what's happening. So this is for example just looking at characters that play in the same movie and how they are connected to each other and computing the degree.
39:43
Yeah, degree just with a single movie. And you can see that Neo4j is actually quite readable because here I'm just looking at names of characters that appear in... Yeah, I'm looking at Episode 4, A New Hope, and I'm looking at characters that talk to other characters
40:03
that also appear in A New Hope and looking at the number of scenes they spoke together in. So this is how you would do it in Neo4j and you can play with it. And it's not just completely a toy example. I did something similar by analyzing Twitter
40:21
and I actually analyzed the social network around F-Sharp, the F-Sharp Software Foundation which is the home of open source F-Sharp and it has the Twitter handle fsharporg and I looked at all the other Twitter handles that are connected to it.
40:40
And then I was looking for the most central people there. And actually this is the order based on degree and you can see that the first one is Don Syme who is actually the author of F-Sharp. So if you are an alien and come to Twitter and you say, oh, so who is the most important in C-Sharp, you can do something similar.
41:01
I don't know who you get. But it tells you something about the actual network and some meaning about it. And the second one, I think it's the official Microsoft site of F-Sharp. The fifth one is the official site of the community of F-Sharp. And the fourth one is Tomas who is speaking tomorrow
41:22
in the morning, if I'm correct. So if you want to see some important people in F-Sharp, Twitter says you should go see him. And this information is from November 2014, so it might have changed. But the problem with analyzing data from Twitter
41:40
is that it takes an awfully long time to download because of all the rate limiting. So I didn't replicate it afterwards. But I might do that because it tells you something about how people communicate with each other and what's happening. So if there are any changes, I can see it probably. And also another example,
42:02
this is quite a famous example in network science, and it was a company in Hungary where they had two factories and some headquarters. And they were having problems because there were always these rumors spreading in the factories and they had no idea why or what was happening
42:22
because the headquarters, they were just issuing orders and issuing messages. But there were always rumors and they were not true. So, in those olden days, they actually called social scientists who talked to people and asked, so who do you talk to
42:41
if you want to get more information about anything? And they named a few people. And this is the social network that they constructed from it. And what's interesting is that you can probably see that I think the pink one is the headquarters,
43:00
the other ones are the factories. And the most important node is here; it's not in the headquarters. And that was because there was this guy who was doing health and safety. And he was actually traveling around the factories and talking to everyone. So he was the most important person in the company for communication.
43:23
So now they knew that if they wanted to spread some kind of information they had to talk to him because he would spread the correct information to the others. And what you would do now, you wouldn't get a social scientist to talk to people, you would just explore the emails that are sent within your company
43:42
or Slack messages, like who replies to whom, things like that. And because you have access to all this information, you can actually do that. And then you can look at these simple measures like betweenness and degree centrality, things like that. And you immediately get some information that might be actually useful for you.
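As a rough sketch of the "who replies to whom" idea, the snippet below turns a log of (sender, recipient) pairs into an undirected network and ranks people by how many distinct colleagues they exchange messages with. The log format, the names and the data are all invented for illustration; real email or Slack exports would need their own parsing.

```python
from collections import Counter

# Hypothetical message log: (sender, recipient) pairs.
messages = [
    ("hq_manager", "safety_officer"),
    ("safety_officer", "hq_manager"),
    ("safety_officer", "factory_a_worker"),
    ("factory_a_worker", "safety_officer"),
    ("safety_officer", "factory_b_worker"),
    ("factory_a_worker", "factory_a_lead"),
    ("factory_b_worker", "factory_b_lead"),
]


def reply_network(log):
    """Return (adjacency map, message count per unordered pair)."""
    graph = {}
    weights = Counter()
    for sender, recipient in log:
        if sender == recipient:
            continue  # ignore notes to self
        graph.setdefault(sender, set()).add(recipient)
        graph.setdefault(recipient, set()).add(sender)
        weights[tuple(sorted((sender, recipient)))] += 1
    return graph, weights


graph, weights = reply_network(messages)
# Rank by number of distinct contacts, breaking ties alphabetically.
by_degree = sorted(graph, key=lambda person: (-len(graph[person]), person))
print(by_degree[0])
```

The per-pair message counts are kept so that weak links (a single message) can be filtered out before computing centrality, if that turns out to be useful.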
44:01
So, as I mentioned, you can do social network analysis and it's actually a lot of fun. And you can look at how you communicate on Slack, who sends emails to whom, and you can also analyze supply grids. If there is a blackout in some part of your network, what parts will get affected as well.
44:23
So network science is actually very important. And at the end here I have biological networks and that's actually the area where I work because I look at how genes interact with each other, like what protein binds to what region of DNA, and then I can see what's happening,
44:42
how they are interacting with each other. And this is really not just a toy example because I was reading a paper the other day and they were actually looking at betweenness. And I was like, ah, I know what betweenness is. I did it in Star Wars. And here they were claiming that genes that are important in cancer,
45:02
that are tumor suppressors or oncogenes, that they have higher betweenness in the biological networks. And because I was analyzing these Star Wars networks before, I knew almost exactly what it means in terms of the actual network because I got the feeling for it by looking at a very fun dataset.
45:21
So I want to encourage you to actually play with data because you get more out of it than just some fun facts about Star Wars. And right now I went through quite a lot of things. I went through script parsing in a functional way. You saw the active patterns. I was calling R and JavaScript from F-sharp.
45:43
I think that's actually the future of data science, just call everything from everything because there are different tools in different languages. I was showing type providers. I showed you the HTML type provider and JSON type provider, and there are many more. There is one for SQL, there is one for...
46:02
I can't even think of all of them. I use the CSV one because CSV is the format for data science. And I would really like to encourage you, if you are interested in data science, if you have seen some of the talks on R, for example, then you might think, well, maybe I should go into data science
46:20
or start with fun datasets because by analyzing them, you know what's happening there and you get insights when you actually see some of these algorithms applied in the real world. And this is, I think, a great way to learn data science. If you want to know more about F-sharp, I've put some links there.
46:41
I have the slides online already. And these are some of the Star Wars resources that I put together. So, yeah, you can read all the scripts, even if you are not interested in any quantitative analysis. You can play with the Star Wars API. You get all sorts of very important information. And all the information that I was showing you
47:03
is actually on my GitHub, and you can even play with the social networks. And I saw some people doing actual social network data science, playing with it and trying to, for example, overlay the different episodes against each other to see who corresponds to whom in the social network in episode seven, et cetera.
47:21
And I have some blog posts about this, and as I mentioned, I have a Neo4j demo. And... Play with data. This is actually the website where I put the slides. So go to evelinag.com slash starwarsetalk,
47:41
and they are all there. And thank you. Are there any questions? So, any questions? I can't actually see you properly.
48:06
I can't see anything. Well, if there are no questions, then it's time for lunch, I guess.