
FlinkML: Large-scale Machine Learning for Apache Flink


Formal Metadata

Title
FlinkML: Large-scale Machine Learning for Apache Flink
Part
25 of 110
License
CC Attribution 2.0 Belgium: You may use and modify the work or its content for any legal purpose, and reproduce, distribute, and make it publicly available in unchanged or changed form, provided you credit the author/rights holder in the manner they specify.

Transcript: English (automatically generated)
Good morning. Thank you for coming. This is the HPC, Big Data and Data Science devroom. We're very happy to organize this devroom this year. So I will briefly introduce the organizers:
Kenneth here from the University of Ghent, Roman from Vipondapp and the Apache Software Foundation, and, well, Ewan is not here. And myself, I'm from KTH in Stockholm. So today we're going to have a mix of talks from Big Data,
HPC, and Data Science. We had really a lot of submissions, 48 of them, and we did our best to accept as many as we could. So we're going to have 20 talks: 15 20-minute talks and five lightning talks.
So we'll start right away. We have a lot of things to present. And I'm very happy to introduce our first speaker, who is going to talk about machine learning. So hello, everybody. Thanks for coming so early.
So I'm a researcher at the Swedish Institute of Computer Science, and for the past year I've been working on large-scale machine learning. So to start, I would like to focus a bit on what we mean by large-scale machine learning
and why we need it, and maybe to clarify it. People talk about large-scale machine learning, or big learning, and the definition we hear very commonly is this: we have a big learning problem when the data is so big that it cannot fit into the memory of one machine, or when a single model is so big that it cannot fit into the memory of one machine. But they never say which machine. So for me, it's very important to clarify that the size of your computer is not an intrinsic property of the data, so that's not a good definition. If we want to be precise, we shouldn't be using our current computing capabilities to define our learning task; rather, we should be talking about the budget constraints of the specific learning task. That way we can differentiate between small-scale and large-scale learning. We have a small-scale learning task when the active budget constraint is the number of examples, in cases where the data is limited and obtaining it is costly. On the other hand, we have a large-scale learning task when our active budget constraint is actually the time. You can imagine that in any kind of big data scenario we have so much data that we don't really need anything more in order to do our learning; what limits us is the time. As an example, you can think of a company like Spotify: let's say they want to provide recommendations for all their users every day. If that computation takes one and a half days, it is useless. So what they have is an actual time constraint. They need to be able to produce their answers within
the time that they have, despite the millions or billions of data points they might have available. And this is the type of learning that we are going to focus on today. So before diving into Flink and its large-scale machine learning library, I would like to introduce Apache Flink very briefly for those not familiar with the platform.
And for those who want to learn more, there is a talk coming up at 4 p.m. here by Till Rohrmann that is going to go into more detail about the engine. So Apache Flink is a distributed data processing platform with a streaming dataflow engine at its core. It provides powerful and easy-to-use
APIs for streaming and batch processing. And it is a technically advanced engine that has a number of features that make it very well suited for large-scale machine learning tasks. So here we can see the Flink stack. It works well within the Hadoop ecosystem, it can use many data sources, and it provides high
availability through ZooKeeper. And on top of the batch and stream processing APIs, it provides a number of libraries, like Gelly for graph processing, the Table API for SQL-like queries, and FlinkML, which is the machine learning library. So what makes Flink a good platform to use for machine learning?
So we'll take a brief look at some of the features that make it a good candidate. First, there is the Flink API, which provides functional-style programming, mixed with some SQL-like operations, like groupBy, for example. This allows for the quick development and prototyping of machine learning algorithms, and
programmers get a familiar and intuitive way to write their programs. At the core of Flink lies its streaming dataflow engine. Using this engine, we set up a set of operators once, when the job is deployed, and then continuously stream data through them, without having
explicit processing stages. This type of engine makes real-time stream processing possible in Flink. It also provides us with native iterations. Native iterations allow us to write iterative programs, like most machine learning programs, without tearing down and re-scheduling the dataflow at every step.
With a batch engine like Apache Spark, for example, a new stage has to be submitted at the end of each iteration, and that creates additional scheduling overhead. What we are able to do with Flink instead is maintain a partial solution; you can think of the model of your machine learning algorithm as that partial solution,
and we're able to iteratively update it within the same dataflow. In addition to bulk iterations, Flink also provides us with delta iterations. What delta iterations do is allow us to shrink the amount of work per iteration as we near the solution. This is something that is very useful in cases like graph processing, and the same principle could be applied to machine learning programs as well.
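To make the idea concrete, here is a toy sketch of a native bulk iteration in Flink's Scala DataSet API, keeping a single number as the partial solution and updating it inside one dataflow. The update rule is a stand-in for illustration, not a real learning algorithm.

    import org.apache.flink.api.scala._

    val env = ExecutionEnvironment.getExecutionEnvironment

    // The partial solution: here just a single parameter, initialized to 0.0.
    val initialModel: DataSet[Double] = env.fromElements(0.0)

    // Run 100 iterations inside the same dataflow; no new job is scheduled per step.
    val trainedModel = initialModel.iterate(100) { model =>
      model.map(w => w + 0.1) // stand-in for a gradient-style update
    }

    println(trainedModel.collect())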
So with that in mind, let's take a look at FlinkML. The development of FlinkML started last spring, and the first version was released with version 0.9 of Flink; I should mention that it is written completely in Scala.
The library was designed with three goals in mind. The first one is to be truly scalable: we want computationally efficient and communication-efficient algorithms that allow us to actually scale our processing to web-scale or societal-scale data. The second one is the minimization of
glue code. When we talk about glue code, we mean all the code in a machine learning program that is not machine learning: the code that ties everything together to make it work. According to a recent publication from Google, a mature machine learning system might end up being 95% glue code, with only 5% actual learning logic. What we try to do with FlinkML
is to minimize the amount of glue code developers need to write. The third goal is obvious: we focus on having familiar and easy-to-use APIs. We want to provide good documentation, examples, and support for the users, so they're able to jump right into writing code. So we'll take a look at some of the
algorithms that are currently available in FlinkML. Starting with supervised learning, we provide a generalized convex optimization framework; it currently has an implementation of stochastic gradient descent that you can use for many learning tasks. We provide support vector machines
and multiple linear regression as well. We also have a highly scalable ALS implementation that scales to the data sizes that companies the size of Spotify might have, and a growing set of common pre-processing algorithms for machine learning tasks.
One feature that we took care to include from the beginning was support for machine learning pipelines. These allow us to develop complex sequences of machine learning tasks with minimal glue code. With that, we can take a brief look at the API and see how easy it is to write a small machine learning program with FlinkML. Using the standard Flink data processing primitives, we can just read a training and a test set, and then we can quickly create and set properties for our learner. You can see that the API is deliberately kept close to that of scikit-learn, the Python machine learning library, in order to give our users something familiar. We call fit to train the model, and once we have done that, we can just start using the model to make predictions. So in less than 10 lines of code, you can have a fully working, simple learning system that will scale with your data.
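A rough sketch of what such a program might look like in FlinkML's Scala API follows. The file paths and parameter values are placeholders; MLUtils.readLibSVM and MultipleLinearRegression exist in FlinkML, but the exact settings here are only illustrative.

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.MLUtils
    import org.apache.flink.ml.regression.MultipleLinearRegression

    val env = ExecutionEnvironment.getExecutionEnvironment

    // Read a labeled training set and an unlabeled test set (paths are placeholders).
    val trainingData = MLUtils.readLibSVM(env, "/path/to/train.libsvm")
    val testData = MLUtils.readLibSVM(env, "/path/to/test.libsvm").map(_.vector)

    // Create the learner and set its properties, scikit-learn style.
    val mlr = MultipleLinearRegression()
      .setIterations(10)
      .setStepsize(0.5)

    // Train the model, then use it to make predictions.
    mlr.fit(trainingData)
    val predictions = mlr.predict(testData)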
I would also like to demonstrate how easy it is to create a machine learning pipeline. What we will do here is train a regression model after passing our data through a standard scaling algorithm and then augmenting our features with a third-degree polynomial. Our pipeline is simply created by chaining together as many operations as we need: we start with our scaler, we chain the polynomial features transformer, and then finally we chain our predictor. Once we have that pipeline, we can treat it like any other learner. We can just call fit on the pipeline itself, and it will pre-process all the data with all the transformers we have chained and then do the training. The same applies for prediction: when we pass a test data set to the pipeline, it will perform all the transformations we have put in the pipeline, in order, and then make the predictions.
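A minimal sketch of that pipeline, reusing the trainingData and testData sets from the previous example, could look as follows. The exact package paths and the degree-3 setting are assumptions based on the FlinkML documentation of the time.

    import org.apache.flink.ml.preprocessing.{PolynomialFeatures, StandardScaler}
    import org.apache.flink.ml.regression.MultipleLinearRegression

    // Transformers: scale the features, then add polynomial features.
    val scaler = StandardScaler()
    val polyFeatures = PolynomialFeatures().setDegree(3)

    // The final predictor in the pipeline.
    val mlr = MultipleLinearRegression()

    // Chain the transformers and the predictor into a single pipeline.
    val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)

    // fit() runs the transformers over the data and then trains the predictor.
    pipeline.fit(trainingData)

    // predict() applies the same transformations, in order, before predicting.
    val predictions = pipeline.predict(testData)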
Next, I would like to look at some of the cutting-edge machine learning algorithms that we have implemented in Flink to make it a truly scalable learning platform. First, I would like to talk about a relatively new, communication-efficient optimization algorithm. Stochastic gradient descent, or SGD, is perhaps the most widely used first-order optimization algorithm out there, and it is very popular in areas like
deep learning, which have seen a lot of research interest lately. SGD's strength comes from the fact that it is very simple to implement, and most distributed implementations use a very simple synchronization scheme as well: at the end of each iteration we sync all the workers, and then move on to the next one. That has a high communication cost, and it also
means the model only gets updated slowly, at the end of each iteration. So we can do better than that. CoCoA, or communication-efficient distributed dual coordinate ascent, is an algorithm by Jaggi and others at Berkeley, and it aims to reduce the communication cost of an optimization problem and achieve faster convergence. It uses a number of techniques
to achieve that, and we will see a couple of them. The first one is that it moves from the primal optimization problem to the dual one. The framework of dual optimization is conceptually simple: we transform our original primal minimization problem into a dual problem, which is a maximization problem.
Once we do that, the dual provides a bound on the primal problem, and by having this bound it is very easy to define a stopping criterion and know when to stop iterating. Here is an example with the regularized regression problem: we take the primal minimization problem and transform it into a maximization problem using a property called Lagrange duality. For such problems, coordinate descent methods have been used at large scale, and they give stronger convergence guarantees at the same iteration cost as SGD. They also require no step size, which is a major parameter that we need to tune for stochastic gradient descent. And thanks to the duality bound, we get a very well-defined stopping criterion for our iterations.
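For reference, the general shape of that primal/dual pair, in the notation used by the CoCoA paper (a reconstruction of the standard setup, not the speaker's slides), is

    \min_{w \in \mathbb{R}^d} \; P(w) = \frac{\lambda}{2}\lVert w\rVert^2 + \frac{1}{n}\sum_{i=1}^{n} \ell_i(w^\top x_i)

    \max_{\alpha \in \mathbb{R}^n} \; D(\alpha) = -\frac{\lambda}{2}\lVert A\alpha\rVert^2 - \frac{1}{n}\sum_{i=1}^{n} \ell_i^*(-\alpha_i),
    \qquad A = \frac{1}{\lambda n}\,[x_1, \dots, x_n]

where \ell_i^* is the convex conjugate of the loss \ell_i. Weak duality gives D(\alpha) \le P(w) for every w and \alpha, so the duality gap G(\alpha) = P(w(\alpha)) - D(\alpha), with w(\alpha) = A\alpha, is always non-negative, and stopping once it falls below a tolerance is exactly the well-defined stopping criterion mentioned above.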
Another issue with stochastic gradient descent is that we only update our model at the end of each iteration, or even only at the end of the run, by syncing all the workers. That means a lot of the workers might end up working with a stale version of the model for a very long time. In CoCoA, the updates happen locally within the workers, without any communication, meaning that the workers always have a fresh model to iterate on. They only sync, as little as possible, at the end of each super step, let's say, and that also reduces the communication cost. This is basically how CoCoA works: each worker solves a small local dual optimization problem, and at the end of its local iterations it communicates only the update vector it needs to share to the other workers, and we start the next super step. This reduced communication during the optimization can have a great effect on the convergence rate of the algorithm. In this plot, CoCoA is the red line, and note the logarithmic scale: it beats distributed stochastic gradient descent and coordinate descent, sometimes by orders of magnitude, in terms of performance. This is what we use to power the support vector machine implementation in Flink. So CoCoA definitely has good performance, but it has one important disadvantage, and that is the barrier synchronization that we
mentioned. If you have a straggler in your cluster, which is a machine that is working much slower than the rest, that means all the faster workers need to wait for that straggler before moving on to the next iteration. There is a way to deal with that as well, and that is recent work from last year coming out of CMU, from Professor Eric Xing's group. It deals efficiently with stragglers, and that actually gives great performance. So we can take a look at the different synchronization schemes that are available to us. The first one is bulk synchronous parallel. In this model, as we mentioned, at the end of each iteration we wait for everybody to finish and then we move on to the next one. And of course this can have an adverse effect on
the convergence speed of the algorithm. Another option is to do things asynchronously: we let every worker work on its own, we never sync, and we just obtain the solution by combining them all at the end. This is possibly fast, but it can also lead to diverging solutions, because the workers can have
very different versions of the model, and the result can be that we simply diverge and never arrive at any solution. As a compromise between the two, we can use stale synchronous iterations. With stale synchronous iterations, what we do is
allow the fastest workers to be only up to k iterations ahead of the slowest worker. If they are about to move further ahead, we stop them and tell them to wait until the slowest worker catches up. This allows the faster workers to keep working and updating the model, so we achieve faster convergence, but it ensures that the slow workers do not fall too far behind, so that the algorithm will still converge. Taking a quick look at an example: here worker 1 is the fastest one, and if it manages to reach iteration 7 while we have a staleness threshold of 3, it will have to stop there and wait for worker 2 to catch up before moving on to the next iteration.
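As a purely hypothetical illustration (not the Flink implementation), the SSP rule itself is small: a worker may start its next iteration only while its counter is within the staleness bound of the slowest worker. The class and method names below are made up for the sketch.

    // Sketch of the stale synchronous parallel (SSP) rule.
    class SspClock(numWorkers: Int, staleness: Int) {
      private val clocks = Array.fill(numWorkers)(0)

      // Worker `id` finished an iteration: advance its clock and wake any waiters.
      def advance(id: Int): Unit = synchronized {
        clocks(id) += 1
        notifyAll()
      }

      // Worker `id` may start its next iteration only if it is at most
      // `staleness` iterations ahead of the slowest worker; otherwise it waits.
      def awaitTurn(id: Int): Unit = synchronized {
        while (clocks(id) - clocks.min > staleness) wait()
      }
    }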
Here we can take a look at the performance of stale synchronous iterations in Flink. The blue line is bulk synchronous, and what has happened here is that in the cluster, at any point in time, one of the nodes has an artificially generated load of 100% for 12 seconds. That means we basically have a straggler in the system that we created ourselves. And we can see that by using the stale synchronous parallel iterations, we are able to converge much faster than with bulk synchronous parallel, and also reach a better optimization result in the end. This will soon be merged into Flink and FlinkML, and when that happens it will be, to my knowledge at least, the first general-purpose data processing platform that supports stale synchronous iterations. Finally, I would like to talk about some more work that is currently being done in FlinkML and what we have planned for the future. Currently we have all the tooling in place for the
performance evaluation of models, as well as cross-validation, and we are working on making it easier to persist your models and export them, in the PMML format for example. All of these things are pending pull requests that we hope to get merged very soon. Our long-term plans include
two main areas. The first one is streaming, and the second is efficiency. For streaming, Flink already has SAMOA bindings, and Apache SAMOA is perhaps the most popular online analysis platform out there currently. Our plan is to kick-start a streaming machine learning library that will include a number of
machine learning algorithms. The other area of focus for us is computationally efficient algorithms. The question is: how can we scale learning to societal-scale data sets while using only modest computing resources? Because not everybody is Google; not everybody has data centers with 10,000 machines.
This is a major challenge for us, and we plan to use a lot of the knowledge coming from the HPC field in terms of efficient computation, using GPUs and the like, to make sure that we are actually using our hardware to its maximum capacity. And this is it for me. Please try it out: you can go to flink.apache.org to find out more about Flink.
And this is the link for our documentation. Thank you.
So that was one of the shortcomings of the original CoCoA algorithm: it did not scale well as you increased the number of machines. The CoCoA+ variant uses adding instead of averaging, I think, in order to solve that scaling problem, and it is planned for us to actually use that, because currently, beyond about 20 machines, CoCoA does not perform that well.
Thank you.