Real-time scalable graph analytics
Formal metadata
Title: Real-time scalable graph analytics
Series title: FOSDEM 2016
Part: 33 | Number of parts: 110
License: CC Attribution 2.0 Belgium: You may use, modify, and reproduce, distribute, and make the work or its content publicly available, in unchanged or modified form, for any legal purpose, provided you credit the author/rights holder in the manner they specify.
Identifiers: 10.5446/30944 (DOI)
Transcript: English (automatically generated)
07:10
So, let's start with a motivating problem. Hopefully it motivates people; maybe it doesn't, in which case:
07:24
You are a person who likes to look at social signals. There's a stream of data coming at you. I'm going to pretend that it's tweets, right, because everyone likes tweets; there's got to be money there somewhere. I'm sorry, I'm historically from around Silicon Valley, so a lot of things I'm going to say involve this fascination, right?
07:42
I'll have a big dump truck. But I think it is a social good otherwise. When people tweet, they talk about things, right? There might be topics, hashtags that they use, and
08:01
mentions of people: when you tweet you might talk about other people you're communicating with. And this is neat: it gives the data a little bit of structure, potentially some graph-like structure. So here are some questions you might ask, and I'm going to claim that some of these we already know how to do.
08:24
If someone wants to know what's the most popular hashtag, the topic out there that most people are talking about, well, we can rest assured that we know how to do that. We can make it a bit more exotic though, right? We could say: I don't want to know about the most popular topic across the entire corpus. That's boring; it's going to be Justin Bieber.
08:41
I want to know instead, by some sort of emergent structure of the communication that's going on: if I look at the graph defined by who's talking to whom, and I'm going to pick connected components, though we can argue about whether it's a good example or not, what are the most popular topics within each of these connected components? Some people start talking about something, we don't know what, but let's try to find out by looking in there and seeing what's popular.
09:05
Everyone here is a graph person, or interested in graphs. So let's imagine that someone plopped a few gigabytes of Twitter data in front of you: you could pick that up, turn it into a graph, and go and figure this out.
09:22
The place we're going to go in this talk is: what if your boss said, no, I need this now, actually. I want you to take this stream, attach to it, and keep all of this analysis current, as fast as possible. If it turns out to be one-second latencies, that's fine, but we're really going to be trying to look at how we can avoid recomputing things over and over from scratch.
09:44
How can we react more or less immediately to changes in our input data collection? That's what this talk is going to be about. What I'm going to show you for most of the talk is this data-parallel programming framework. It's a cool framework, and it's going to be really simple.
10:01
Unlike some of the earlier talks, it's actually a little bit more like SQL, just without the capital letters. It's going to be general enough to express graph computations relatively naturally; it's not going to be 200 pages of anything. And it has a really cool property:
10:21
everything automatically updates when you change the inputs. You write your programs in a particular way, they go and they run, they're not particularly slow, and they all have the property that if you change the inputs, the outputs change automatically to the new correct answer, with absolutely no ambiguity and no weird race conditions.
10:43
It's a project; you can go and grab it. The main thing that I would warn you about is that it's all in Rust. That's different from what most people use, but it's awesome. Let's walk through an example,
11:02
one that we heard about earlier in the last talk: finding the connected components in a graph. The algorithm is this label-propagation style, stated verbally first: every node in the graph gets a label, typically its own name, and then repeatedly, until we find some sort of fixed point,
11:21
each of the nodes proposes its label to each of its neighbors, saying, look at this neat label, and then collects all the proposals that have come in and asks: what's the best label? We're going to take the minimum, the smallest label. If you do this often enough, repeatedly having all the nodes propose to their neighbors and all the neighbors collect proposals, you eventually converge to the connected-component structure, because the smallest label in any component reaches everyone in it.
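To make that verbal description concrete, here is a minimal, non-incremental sketch of the label-propagation loop in plain Rust (an illustration only, not the differential dataflow program discussed next):

```rust
use std::collections::HashMap;

/// Label propagation for connected components: every node starts with its own
/// id as its label, then repeatedly adopts the smallest label among itself and
/// its neighbors, until no label changes. (Batch version, not incremental.)
fn connected_components(edges: &[(u32, u32)]) -> HashMap<u32, u32> {
    // Initial labels: each node names itself.
    let mut labels: HashMap<u32, u32> = HashMap::new();
    for &(src, dst) in edges {
        labels.entry(src).or_insert(src);
        labels.entry(dst).or_insert(dst);
    }
    // Iterate to a fixed point.
    loop {
        let mut changed = false;
        for &(src, dst) in edges {
            let (ls, ld) = (labels[&src], labels[&dst]);
            let min = ls.min(ld);
            if ls != min { labels.insert(src, min); changed = true; }
            if ld != min { labels.insert(dst, min); changed = true; }
        }
        if !changed { return labels; }
    }
}
```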
11:50
So I've written that down in this particular language we're looking at, differential dataflow, and it's basically these sort of SQL-ish things: joins and concats and group-bys, maps and filters, and a pretty cool
12:07
operator that's called iterate, which is not too complicated, I think. It's an operator that starts from a collection, the collection of labels we want to start with, and it takes a function:
12:22
a function from some set of labels to a new set of labels. Taking the set of labels, we do a join to propose, for each node, the label that it has to each of its neighbors. We throw back in the initial set of labels, because you should be allowed to keep your old label if you want,
12:45
and then we take a minimum to figure out, for each node in parallel, what label it would like to keep. So this is a program you can write, and you could imagine running it in your favorite big data system; in that setting it wouldn't be too exciting.
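For reference, the program just described looks roughly like this in the style of the differential dataflow crate (a sketch: operator names such as iterate, join_map, concat, and reduce follow that library, and exact paths and signatures may differ between versions):

```rust
// Sketch: connected components as label propagation, differential-dataflow style.
// `nodes` holds (node, label) pairs, initially (n, n); `edges` holds (src, dst).
use timely::dataflow::Scope;
use differential_dataflow::Collection;
use differential_dataflow::lattice::Lattice;
use differential_dataflow::operators::{Iterate, Join, Reduce};

fn connected_components<G: Scope>(
    nodes: &Collection<G, (u32, u32)>,
    edges: &Collection<G, (u32, u32)>,
) -> Collection<G, (u32, u32)>
where
    G::Timestamp: Lattice + Ord,
{
    nodes.iterate(|labels| {
        // Bring the outer collections into the iterative scope.
        let edges = edges.enter(&labels.scope());
        let nodes = nodes.enter(&labels.scope());
        labels
            // Propose each node's current label to each of its neighbors.
            .join_map(&edges, |_src, label, dst| (*dst, *label))
            // Everyone may also keep their initial label.
            .concat(&nodes)
            // For each node, keep only the smallest label proposed to it.
            .reduce(|_node, input, output| output.push((*input[0].0, 1)))
    })
}
```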
13:02
It'd be delightful because it's so simple and concise and everyone understands exactly what's going on, but let's talk about this a little more. Here's a different way that I could try to describe the computation: I've basically taken each of the operations used here and turned them into little boxes. I'm going to build a dataflow graph.
13:20
I'm just going to start putting edges in each time we use one of the bits of the program. For example, we use the edge collection both up at the top, to get our initial labels, and in this join operation, when we communicate labels to other nodes in the graph. If you sort of follow through, I think I've got each of them
13:40
right: the map goes into this labels variable that we use both in the join and in the concatenation, there's the minimization, and then that feeds back around the loop. Often we like putting together dataflow graphs because then we can deploy the dataflow graph to a cluster and do a big data thing, and that's great. But there's another really cool thing going on here, which is that the dataflow graph tells us
14:00
who's actually using each of these bits of data. So if one of them changes, say a change happens to the edges, we can track down what actually needs to get fixed or corrected in the computation. An edge change? We can say, ah well, we'll need to fix this join for sure, and the map too, and we can start pushing updates through this dataflow: not rerunning from scratch, but just fixing the things that are broken.
14:24
That's pretty informal, and I'll try to be a bit more precise. But imagine this magical world, we're going to get there in the next 40 minutes or so, with the magical property that for each of these operators we're able to maintain the invariant that the output of the operator is always equal to the operator's logic applied to its input,
14:41
even as you start changing the inputs, so that the operators self-correct, or self-repair, automatically. Then we get a pretty cool property. I've drawn a lot of loops and weird edges and stuff here, but if that were true, then we could have this nice dataflow description, declarative I like to call it, of connected components, where we're allowed to change the edges arbitrarily and the
15:04
labels that come out the bottom get automatically repaired as we go. Sounds a little magical maybe, but we'll build up to it; some of these things people know how to do already, some of them are new and exciting. Let me tell a little bit more of a story.
15:21
One of the things I want to try to convince people of is that it's actually valuable, potentially valuable, to think about your query using this type of language, these very simple operators that are automatically updated. I'm going to give a few examples, and we're going to do something really cool: basically, we're going to get back to that motivating example from the introduction, just using a few more boxes up here.
15:43
Imagine, if you will, we start with some tweets. Each tweet has at least three things: let's say a user, some topic, and possibly a mention of someone else. So we can pretty quickly pull out the users and mentions from each of these tweets, feed them into our connected components computation from before, and out the bottom we'll get a whole bunch of user and label pairs.
16:04
Both of these are users, but for each user there's some label indicating which component it's in. That's great: as tweets come and go, we add tweets and maybe tweets get retired because some window has expired, and we'll see changes coming out the bottom of the connected components computation telling us how these labels have changed.
16:21
What do you do with that? Well, one thing you can do is take the tweets again, off to the side, and ask, for each user, what topics have they mentioned. This is again just a map operation where for each tweet that flies by we pull out the user and the topic, and then we can join these two and say: oh well, we have, for
16:43
these users, the topics, and we have users and labels; joining them, we get a whole bunch of label and topic pairs, at which point we can start to do things like aggregation. We can go and ask more complicated types of things, such as: for each label,
17:02
what are the most popular topics with that label? So this is not just by user anymore, but across the entire connected component: what are the most popular, let's say the top five, things that people are talking about? Great. I mentioned that this works, and we'll get there, but this would be a lot of fun, right: as tweets fly by you get to see, as they change, the most popular topics for each of
17:24
the connected components out there. But let's make things a bit more exciting, and we're going to sort of retrofit some of the capabilities we've been describing to cause a weird new thing, or a cool new thing, to happen. In addition to this computation hung off of all these tweets,
17:40
I'm going to make a new data set, a new collection, called queries. We're of course going to imagine that these are things people fire at the system, but think of it just as a collection for the moment, containing the Twitter handles that we're actually interested in:
18:00
what's the most popular topic in this person's connected component? We can do some of the same things we've done before: we can join with the components computation to figure out the label of the component that the person is in, and we can join that again with the label-topics from the top,
18:20
and out come the five most popular topics for the component. If we believe all of the magic that I'm trying to tell you about, then as the queries collection changes, things happen automatically. All you need to do is throw in another query, just add one more user name in there, and it will trickle through the dataflow graph.
18:40
We'll figure out what we should join it against to get the label, what we should join the resulting label against to get the top topics, and it all happens surprisingly quickly. We're not rerunning big horrible queries from scratch; we're essentially just doing a few lookups in key-value stores that are hidden behind some of these operators. Moreover, lest you actually go and try to use a key-value store yourself, we have some delightful consistency properties here.
19:04
I've done two lookups into the value stores, first to find my label, then to find the topics, and you'd have a horrible nightmare of a mess if you actually kept these in two separate key-value stores and tried to get consistent reads out of them, making sure not to accidentally screw up and show someone something they shouldn't have seen.
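As a plain batch sketch of the pipeline just described (ordinary Rust with a hypothetical Tweet type; the real system keeps these joins incrementally up to date, but the shape of the computation is the same): two maps, a join, a per-label aggregation, and then the two lookups per query.

```rust
use std::collections::HashMap;

// Hypothetical record type for the stream described in the talk.
struct Tweet { user: u32, topic: String, mention: u32 }

/// For a set of query handles, report the top topics in each handle's
/// connected component. Batch sketch of the dataflow described above.
fn top_topics_for_queries(
    tweets: &[Tweet],
    queries: &[u32],
    top_k: usize,
) -> HashMap<u32, Vec<String>> {
    // (user, mention) edges -> connected component label per user.
    let edges: Vec<(u32, u32)> = tweets.iter().map(|t| (t.user, t.mention)).collect();
    let labels = connected_components(&edges); // plain-Rust sketch from earlier

    // (user, topic) joined with (user, label) -> counts per (label, topic).
    let mut counts: HashMap<(u32, String), usize> = HashMap::new();
    for t in tweets {
        if let Some(&label) = labels.get(&t.user) {
            *counts.entry((label, t.topic.clone())).or_insert(0) += 1;
        }
    }

    // For each queried handle: look up its label, then the top-k topics there.
    let mut result = HashMap::new();
    for &q in queries {
        if let Some(&label) = labels.get(&q) {
            let mut topics: Vec<(&String, usize)> = counts.iter()
                .filter(|((l, _), _)| *l == label)
                .map(|((_, topic), &c)| (topic, c))
                .collect();
            topics.sort_by(|a, b| b.1.cmp(&a.1));
            result.insert(q, topics.into_iter().take(top_k).map(|(t, _)| t.clone()).collect());
        }
    }
    result
}
```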
19:21
But this will all just work out; it's nice. So that's what I'm promising; at the end of the talk there'll be a quick little demo. Before that, though, I thought I'd try to walk through some of the explanation for how this works out, what makes it possible. I'm going to build up to this differential dataflow thing starting out with incremental dataflow, which is fairly well understood,
19:46
or at least has been around for a while. The general idea is that you have a dataflow graph, it's big and there's lots of it, but we're going to look for the moment at just one operator. The way the operator works is that some data shows up,
20:01
and the data in our case is going to be some actual payload, like a record, and a weight which says something like: yeah, there's one of these. Lots of these show up, so this isn't meant to be a single datum; lots of them show up, and the operator then has some logic. Well, let me think:
20:22
it's going to produce an output, called Y here, which is also some data and some counts, a set of these record-and-weight pairs, which can be sent downstream to the next operator.
20:42
This would be great if we were in some batch setting, but we're not going to do that; we do something a bit more exciting, where now changes happen. A new bit of data arrives,
21:02
and our model is going to be that it's a difference, a dX, a change to the original collection we had. It's still going to be a bunch of records and weights, though the weights can now be positive or negative, which is pretty flexible. Any old change shows up, and we're now meant to think of the input as the sum of these two things.
21:23
So I've added my dX, and I want to figure out what sort of dY I should produce on the output to bring this into balance. I'm just going to do some math by pictures here, right? We're trying to produce some output that has the property that when you add it up again, because that's what the downstream person expects, when you add it up
21:41
it should be equal to the operator's logic applied to the accumulated input collection. This is the cunning bit of preschool mathematics I'm using to describe some complicated dataflow stuff: basically we add these things all together, we apply our logic to it, and then we subtract off anything we've sent already, because we've already sent it and we don't need to amend anything
22:01
that's already right. We solve for the question mark, and we have a dY that we can send downstream. This repeats: I get some more differences, we do the same sort of logic again, we accumulate everything up, subtract away the first two things, and get a new dY.
22:22
And this repeats as much as is necessary. Each time data shows up, some corresponding update goes out. Sometimes the update might be empty; we might get some differences coming in that don't actually require us to change our output at all. That would be nice; it's not usually the case.
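Written out, the rule being described for a single operator f is just the picture-math above: the t-th output difference is whatever makes the accumulated output equal f applied to the accumulated input.

```latex
% Incremental update rule for one operator f, receiving input differences dx_1, dx_2, ...
\[
  \delta y_t \;=\; f\!\Bigl(\sum_{s \le t} \delta x_s\Bigr) \;-\; \sum_{s < t} \delta y_s
\]
```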
22:40
Cool. So if you're writing a stream processor, this is the sort of thing that you would do as you get more data: you write your little operator to respond appropriately, and then all's well and good, and it actually works, if the only thing you have to do is accept new differences arriving at the input of the operator. We're going to have to do more than that.
23:01
Let me just say a little bit about this connected components example, because this is a good bit. People have also exploited this particular approach to designing computations to deal with iterative computations, iterative fixed-point computations, where as the computation runs it's essentially redoing the same logic on slightly different inputs. So with our connected components computation,
23:21
we had labels for each node, and as the computation proceeded around these iterations, the labels for the nodes at some point stop changing. You can rig this basically in the following way: there might be some sort of dot-dot-dot more dataflow nodes out there, but roughly speaking, the differences that come out from the operator make their way around the graph and
23:44
come back to be the next round of differences for the input. So, for example, if this is the aggregation operator, the one that's picking the minimum from all of the candidate labels, then as it produces new interesting changes to its output, those changes will flow around the graph, go through the join, and get proposed now to new people.
24:03
If I change my label, I'll tell all my friends: I changed my label. They'll pick up all of those changes and say, that's exciting, maybe that causes me to change my label, maybe it doesn't, but the process essentially continues. It keeps running until people stop changing, basically until the diffs evaporate and go away.
24:20
So this is a fairly standard approach to how you would implement iterative incremental computations on big dataflow graphs. There's sort of a problem, though. This is great, we've co-opted the incremental updating machinery to do this iterative computation, but what happens now if someone changes, or wants to change, the place we started from?
24:44
Someone says: I changed my mind, we don't have that edge set, or you don't have these initial labels, you've got some other ones. Unfortunately, it's not as easy as just taking the difference and throwing it on the end of the list.
25:00
If, for example, we circulated for a long time and everyone got the label zero, and then vertex zero says, oh, I changed my mind, my label is actually five, my mistake, then it can take on label five, that's fine, but all of its neighbors are going to tell it: hey, I've got a zero for you, and we're just going to go back to where we were. It'll be a fixed point, but it won't be the correct answer that we wanted starting from the amended input.
25:24
So there's a nice thing that you can do, which as far as I'm aware no one else has done before, and it's basically this: there's no reason that we had to put all these differences in one line. There are other places that we can put them:
25:43
a two-dimensional picture of sorts. If we want to change the input to the computation, we can think about doing that by just offsetting it a little bit. So yes, this is indeed a change to X, but we're going to put it below X instead of at the end of some long list, and the intuition now is that we have one dimension that's going to correspond to iterations, perhaps of connected components, and we can have another dimension
26:06
corresponding to rounds of input data. A change can have a variation there; we can vary with both of them. We can talk about, for example, the second round of input at the zeroth, first, second, third, fourth iteration as well. We can start to populate all of these parts of the picture.
26:24
So, what would we do if someone showed us this type of difference? The main thing that we need to do is figure out what we produce as output. We'll do what I think is sort of the natural thing, which is to say: well, this difference here
26:41
changed the first iteration, so let's think about how it changed it. We can apply our operator to this sort of thing, subtract Y off, and we now have a sort of space to put it, right, because we've agreed we can use this extra dimension to put some notes about how things have changed. Obviously the map that we do is sort of natural, I think: subtract the right things, apply the function.
27:03
All right. And the process continues, you know, wherever this might show up, and now we take a slightly different rectangle. It's not a totally surprising rectangle, but it's different. These rectangles that I'm drawing now have really relied on the fact that I've kept all these differences separate; they haven't been lumped up into one big pile.
27:22
I've kept them apart so that I can take summations over different subsets of differences, and that turns out to be really powerful, right? If we wanted to investigate the first and second iterations of this new round of input data, and someone had already gone and collapsed everything in the top row together, then we're out of luck, right?
27:41
I mean, you'd have to figure out how to pick things apart; you'd probably end up subtracting off all the work that you've done. But by keeping them differentiated like this, indexed by the round of input and the round of iteration, we can pick and choose what things we want to add together and get fairly concise differences, and so everything continues.
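The same rule as before, now written with these two-dimensional times t = (round, iteration) under the product partial order (this is, up to notation, the update rule from the differential dataflow paper):

```latex
% Differential update rule over partially ordered times t = (round, iteration),
% where s <= t means s.round <= t.round and s.iteration <= t.iteration.
\[
  \delta y_t \;=\; f\!\Bigl(\sum_{s \le t} \delta x_s\Bigr) \;-\; \sum_{\substack{s \le t \\ s \ne t}} \delta y_s
\]
```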
28:10
Good. So what I've shown, visually at least, and there's as much math as you can swallow to try to make this more technically compelling, is that we can take an iterative computation, this sort of top row here, and do these fun
28:25
incremental, sorry, updating strategies, but we can also then incrementally update the incremental updates. It's sort of like a second derivative, if you want to think of it that way. Actually, you probably don't want to think of it that way; never mind. But it all works out, the math supports it, and it's a lot of fun.
28:41
And what it allows us to do, and this is sort of nice, is that if not very much changed here in X, in the first step, then maybe there aren't really a lot of changes in the second step either, given that we're reusing the changes that were observed the first time we went through. So these changes can be really sparse, which is really helpful if you want to go really fast:
29:03
it means we don't have very much work to do. Let me draw some pictures about that, about how these things get implemented. Sorry, the transitions jump a little too fast. Just to bake everything in as much as possible:
29:20
the bits of data that we're shuffling around used to be records and weights, with the implicit understanding that as soon as you saw some records and weights they were good to go, you should act on them right away. We're not going to do that. We're going to do records and weights with a very explicit time associated with them. This is not wall-clock time, but some notion of logical time through the execution of the program, and in the lattice we just saw it would be pairs of iteration
29:41
and round of input. By keeping the time explicit there, as opposed to implicit in the arrival order, we can take these weird summations over different subsets; this is the really important thing. So, now some pictures.
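A sketch of what these explicitly timestamped differences might look like as data (plain Rust; the type and function names here are illustrative, not the library's actual ones):

```rust
/// Logical time: which round of input and which loop iteration a difference
/// belongs to. Times are only partially ordered (the product order).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Time { round: u64, iteration: u64 }

impl Time {
    /// `self` is "less or equal" to `other` only if both coordinates are.
    /// Times like (round 1, iter 3) and (round 2, iter 0) are incomparable.
    fn less_equal(&self, other: &Time) -> bool {
        self.round <= other.round && self.iteration <= other.iteration
    }
}

/// A difference: a record, the logical time it occurred at, and a signed
/// weight (+1 insert, -1 delete, or any multiplicity change).
struct Update<D> { data: D, time: Time, diff: isize }

/// Accumulate the updates whose time is <= `t` into a single multiset:
/// exactly the "sum over a subset of differences" from the talk.
fn accumulate_at<D: Clone + Eq + std::hash::Hash>(
    updates: &[Update<D>],
    t: &Time,
) -> std::collections::HashMap<D, isize> {
    let mut acc = std::collections::HashMap::new();
    for u in updates.iter().filter(|u| u.time.less_equal(t)) {
        *acc.entry(u.data.clone()).or_insert(0) += u.diff;
    }
    acc.retain(|_, w| *w != 0);
    acc
}
```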
30:01
So, data-parallel operators. These are at the heart of all of the large-scale graph computations that we end up doing, and they say: I've got a million nodes in my graph, and I don't really need to execute logic for them one by one by one; in principle they can go in parallel. That's great fun, because we get lots of computers and things go fast and everyone feels powerful.
30:22
But there's actually a better property that they have with respect to incremental computation. We're used to seeing the picture where I go and put each of these on different machines, and I tell you how much I spent on Amazon, and it's a disaster. Here's a different property, which is sort of cool: as new diffs arrive,
30:41
each of these keys can operate independently. That's good, right? If the diffs arrive and there isn't really that much work, well, that's great, nothing changed, so we don't have to do any of the work there. So if someone explicitly shows us the diffs, and there are only five diffs across all the million keys, it's only five units of work.
31:01
It didn't have to be true. We could have designed non-data-parallel operators that were a total disaster; you'd have to re-evaluate the entire thing if someone had asked, for example, for the median across all the nodes in the computation. Median is not a friendly operator. But we didn't do that; we asked for much simpler things, per-node logic or these joins on keys,
31:23
which never touch more differences than they need to. Great. So let me say just a little bit about what's actually going on: what do we need to do in terms of maintaining state, and what information do we really need to keep our paws on?
31:41
We've got all these diffs for one particular operator, one particular key of this data-parallel operator; think of it as one node in the graph computation. That one node got some diffs coming in, and maybe in some rounds it didn't get any diffs, that's fine. We're going to stash all this stuff in some per-node state, per-key state more generally, and
32:01
it's really simple, actually: we just mash up inputs and outputs and annotate them with the time, this logical time, maybe a pair of iteration and round of input, at which they were received. Some new thing arrives and says: I'm here at time d, what do you all think about that?
32:21
All we really need to do when this happens is go through and ask, for each of the existing times, how does it relate to time d? Is the time less than time d, meaning, when we think about producing new output at time d, should we include this X and Y in the summation and the differencing? And maybe not: this second one might correspond to the first round of input
32:44
but the seventeenth iteration, and we're only doing maybe the third round of input, fourth iteration, so it's not relevant. We hold back those differences; we don't pull them into our logic, we leave them here, because they're going to be important in the future. So in this case we would add up X
33:01
and this dX down here, apply the operator, subtract off Y and the relevant dYs down here, and produce a new output at time d. Then we go ship it, and it's all great. We need to stash the data that we produce, of course, so that in the future, when some later time shows up,
33:21
we'll have the output from time d around to do the right thing; we need to remember, essentially, what differences we've sent downstream. So, that's it for the visual bits of mathematics. I thought I'd show you now some actual numbers from when we turn this on on the Twitter firehose.
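Before the numbers, a sketch of the per-key bookkeeping just described (plain Rust, reusing the illustrative Time, Update, and accumulate_at from the earlier sketch; the real implementation keeps compacted, indexed state rather than flat vectors, but the logic is the same):

```rust
use std::collections::HashMap;

/// Per-key state for one data-parallel operator: every input and output
/// difference ever seen for this key, annotated with its logical time.
struct KeyState<D> {
    input_diffs: Vec<Update<D>>,
    output_diffs: Vec<Update<D>>,
}

impl<D: Clone + Eq + std::hash::Hash> KeyState<D> {
    /// A new batch of input differences arrives at time `d`: compute the
    /// output differences needed to make the output correct at `d`.
    fn on_input(
        &mut self,
        time: Time,
        batch: Vec<Update<D>>,
        logic: impl Fn(&HashMap<D, isize>) -> HashMap<D, isize>,
    ) -> Vec<Update<D>> {
        self.input_diffs.extend(batch);

        // Everything the operator should currently output at time `d` ...
        let input_now = accumulate_at(&self.input_diffs, &time);
        let mut wanted = logic(&input_now);

        // ... minus everything we have already sent at times <= `d`.
        for (data, sent) in accumulate_at(&self.output_diffs, &time) {
            *wanted.entry(data).or_insert(0) -= sent;
        }
        wanted.retain(|_, w| *w != 0);

        // Remember what we send, so future times can subtract it too.
        let out: Vec<Update<D>> = wanted.into_iter()
            .map(|(data, diff)| Update { data, time, diff })
            .collect();
        self.output_diffs.extend(out.iter().map(|u| Update {
            data: u.data.clone(), time: u.time, diff: u.diff,
        }));
        out
    }
}
```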
33:52
This is just some data that we collected; it's actually from a while ago, 2013, so that's a long time ago now.
34:01
When we made this plot for the paper originally, we took a Twitter trace that we had, I was at Microsoft Research at the time, and we plotted a few things. This is the connected components computation, and we're just trying to understand how large these differences are: how many changes are there at different points in the computation?
34:22
At the very top we have this stateless implementation. The stateless one is the one that doesn't try to do anything incremental at all: each time around the loop it starts over from scratch, picks up all the data, and thinks about things. And it stays flat as we go through the rounds of iteration: it doesn't get any better, it doesn't get any worse.
34:43
If we incrementalize this, we get this red curve, and you notice it ticks up above the stateless one at first. This is because of the way we're counting things: if your label changes, you have both your new label with a little plus one, indicating it's coming into existence, and your old label with a negative one attached to it. You both disavow your previous value and have some new data.
35:04
So we count that as two, because it actually corresponds to two units of data that we're looking at. But you'll notice that after the initial upwards hiccup it goes down exponentially; this is a log scale on the y-axis. And that's really exciting: it means that at the very end, at least, we're going quite fast, in the final
35:21
fifteen iterations basically. There's a cute trick, the green line, a really cute trick that everyone should eventually learn about, but maybe not right now: you run this label propagation algorithm not just blindly having everyone chat, but letting the people who have small values talk first. Which sort of makes sense, right? The small labels are going to win, so maybe put whoever's holding on to zero in charge of talking first.
35:43
They run for a while, and only after quite a while, if you have label seven trillion or whatever, do you get to talk. So that makes things quite a bit better. But the really exciting thing is that we took this 24-hour window over the Twitter trace and shifted it by one second,
36:02
so about a thousand tweets drop out and get added in. And the amount of differences that happen in that whole process is down here, this blue, we'll call it a line, except you can see it's not even connected, which is awesome: in the third iteration literally nothing happened, which is great, we didn't even do any work there, and everything else is just in the tens of differences at most.
36:24
So we've sort of nailed a good way to really thin down and isolate what is the actual work we need to do to correct our computation, given a small, thousand-tweet-or-so shift.
36:40
These correspond to really nice execution times. If we ask how long each of these things takes, it's some large number of seconds to do the computation in the various different ways, but when we actually get down to doing this differential shift, actually processing this data differentially, it ends up being, I can't remember exactly, but on the order of 27
37:01
milliseconds to respond to one second of new Twitter data. So that's great: fully updating this connected components computation, not just sort of nudging it towards whatever, the correct answer is guaranteed to come out the other end, and we now know if something fabulous has happened. So this is meant to get you sort of excited.
37:24
You may care or not care, that's fine, but something different is going on here. Now, that was only the connected components computation. I don't have the real Twitter data anymore, because of laws and stuff like that, but what I did instead was make up a random generator for this purpose, and I'll show you some numbers first.
37:46
But now let me walk you through a bit of this. It's exactly that dataflow example from the talk, where someone is firing in a whole bunch of users, mentions, and topics. In this particular picture, each time they do that they're only adding, so they load up 10 million tweets initially,
38:05
and then we're going to repeatedly do plus one, minus one, plus one, minus one: adding a new random tweet, subtracting the oldest one, just because that gives the smallest numbers. And that was fun, because if you do this, and you count the number of zeros properly, I don't even have units on here, it's a few hundred microseconds to do each of these updates.
38:24
So if someone hands you a new tweet and says, hey, figure out the corrections to the connected component structure and all of the implications that has for the particular people we have in our query set at the moment, in terms of their most popular topics: it's a few hundred microseconds. So you can do what you like with this; you can
38:42
crank up the number of records that you're putting in by a factor of a thousand if you like. Basically it's a very responsive computation, and one that is also fairly rich: we didn't have to special-purpose-build any particular data structures.
39:04
The efficiency of this depends a lot on the computation that you write; if you write median, well, please just don't do that. So what I thought I'd do next is show a bit of the code that produces this. The code is actually brand new; I was bored in Zurich Airport yesterday, so I wrote it.
39:24
It's not all that horrible. Well, yeah, we'll talk afterwards about whether it's horrible. It's Rust, right? It's about 170 lines in total, some of which is a bunch of lines like this, importing stuff and doing Rust nonsense,
39:44
and reading some things from the input about how much data we want; this includes the data generator as well. If I just highlight it, this part here is the left half of that dataflow graph from the beginning: we take a stream of tweets coming in,
40:01
and we pick out, for example, the users and the mentions, and feed them into the connected components computation.
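A sketch of what that front half might look like (hypothetical; it assumes the timely and differential-dataflow crates, where names like new_collection, insert, and advance_to follow those libraries but exact signatures may differ by version, and connected_components is the differential-dataflow sketch from earlier):

```rust
// Sketch of the "left half" of the demo's dataflow: tweets flow in,
// (user, mention) edges are extracted and fed to connected components.
extern crate timely;
extern crate differential_dataflow;

use differential_dataflow::input::Input;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let mut input = worker.dataflow::<u64, _, _>(|scope| {
            // A tweet is (user, topic, mention); this returns an input handle.
            let (handle, tweets) = scope.new_collection::<(u32, String, u32), isize>();

            // Users who mention each other form the edges of the graph.
            let edges = tweets.map(|(user, _topic, mention)| (user, mention));
            let nodes = edges.map(|(user, _)| (user, user));

            // `connected_components` is the differential-dataflow sketch above.
            let _labels = connected_components(&nodes, &edges);

            handle
        });

        // Feed a tweet in and let the updates trickle through.
        input.insert((7, "fosdem".to_string(), 13));
        input.advance_to(1);
        input.flush();
    }).unwrap();
}
```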