BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees
Formal Metadata
Title | BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees
Series Title | SIGMOD 2019 (part 80 of 155)
Number of Parts | 155
License | CC Attribution 3.0 Germany: You may use, adapt, copy, distribute, and make the work or its content publicly available, in unchanged or adapted form, for any legal purpose, provided you credit the author/rights holder in the manner they specify.
Identifiers | 10.5446/42953 (DOI)
Language | English
Transcript: English (automatically generated)
00:03
So, my name is Youngju, and today I'll present BlinkML, the system we developed to accelerate your machine learning workloads at the cost of a bounded accuracy loss. Machine learning workloads can be slow and costly. Typically, when you have more data,
00:20
training becomes slower. This chart shows the relationship between the number of training examples and the time required to train a logistic regression classifier. As you can see in the figure, the training time increases almost linearly with the number of examples; for the largest training set in the figure, training took one hour and 35 minutes.
00:44
Also, you often want to train multiple models, especially when you have new training data or want to try many different feature sets until you find a sufficiently accurate model. This latter process is typically called feature engineering. So, we ask a simple but important question.
01:03
Using sampling, can we accelerate these often slow and costly machine learning workloads in a quality-guaranteed way? This approach has been used extensively for accelerating SQL-based analytics. Well-known statistical results, such as the central limit theorem or Hoeffding's inequality, tell us
01:20
how to compute unbiased answers with tight error bounds. It looks like we can apply the same idea to accelerate machine learning training as well. Many machine learning models are trained using iterative gradient computation, and by using sampling, we can compute those gradients approximately and more quickly.
01:41
But when we pursued this idea more deeply, we found that there are some fundamental differences that make this approach much more challenging. The first challenge is guaranteeing the accuracy, or the quality, of the approximately trained models. Typically, these models are trained by finding the parameter values
02:01
at which the gradient becomes zero, which means there are no closed-form solutions. As a result, well-known results such as the central limit theorem are not directly applicable. The second type of challenge is generalization. Different machine learning models are typically trained using different approaches, which means it is hard to find
02:21
a common approach that is applicable to many different machine learning models. And the third challenge is efficiency. Even if we can somehow guarantee the quality of an approximate model, we must not train too many different approximate models, or we will not achieve enough speed-up. If we have to train too many different approximate models,
02:42
it may take more time than simply training a single regular, fully trained model. So our core contribution in this paper is a system that can quickly train machine learning models in a quality-guaranteed way. Our system supports essentially all convex models
03:01
that can be trained via maximum likelihood estimation. This covers many commonly used machine learning models, such as linear regression, logistic regression, max-entropy classifiers, probabilistic PCA, and generalized linear models. To achieve this, deep down, we apply Fisher's theoretical work, but in a novel and practical way
03:22
in order to build an actual, efficient system. In a broader context, we show that uniform random sampling can be a really effective solution for quickly training many different machine learning models. Uniform random sampling is very simple, and a nice property is that it does not require any a priori knowledge.
03:41
And still, we can achieve significant speed-ups compared to training full, regular models. A second contribution is that we generalize approximate query processing, or AQP, to multivariate statistical models enabled by machine learning techniques. Also, our work is orthogonal to AutoML,
04:01
so we can actually use this work to accelerate AutoML processes, and I will show you some empirical results shortly. Let me give you an overview of our system. The user interface of our system, BlinkML, looks almost identical to other, regular machine learning libraries. The user specifies the type of model to train and the training set.
04:23
But additionally, the user specifies an accuracy requirement. Our system, BlinkML, then performs some work and returns an approximate model that satisfies this accuracy requirement, instead of returning the fully trained regular model, as illustrated by the sketch below.
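As a rough illustration of that call pattern (a minimal, self-contained stand-in built on scikit-learn, not BlinkML's actual API; the function name, the argument names, and the fixed sample size are assumptions of mine):

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_approx(X, y, epsilon=0.01, delta=0.05, sample_size=10_000, seed=0):
    """Illustrative stand-in for the interface described in the talk: the same
    inputs as a regular training call, plus an accuracy requirement (epsilon,
    delta). Here we simply train on a uniform random sample of fixed size;
    BlinkML itself would instead pick the sample size so that the
    (epsilon, delta) requirement is provably satisfied."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    return LogisticRegression(max_iter=1000).fit(X[idx], y[idx])

# Toy usage with synthetic data standing in for a large training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=100_000) > 0).astype(int)
approx_model = train_approx(X, y, epsilon=0.01, delta=0.05)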
04:40
First of all, what does accuracy mean in this particular context? We formally define it as follows: the expected prediction difference between the full (regular) model and our approximate model is bounded by a small constant, epsilon, with high probability. For example, if the accuracy requirement is 99%, that is, epsilon is 0.01, then the prediction difference between the full model and the approximate model is bounded by 0.01.
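In symbols (my notation, not lifted from the slides: theta_N denotes the parameters of the full model trained on all N examples, theta_n the parameters of the approximate model trained on a sample of size n, m(x; theta) the model's prediction for example x, and delta the allowed failure probability), the requirement can be written as:

\[
\operatorname{diff}(\theta_n,\theta_N) \;=\; \mathbb{E}_{x}\bigl[\,\lvert m(x;\theta_n) - m(x;\theta_N)\rvert\,\bigr],
\qquad
\Pr\bigl[\operatorname{diff}(\theta_n,\theta_N) \le \varepsilon\bigr] \;\ge\; 1-\delta .
\]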
05:02
The remaining question, then, is what kinds of operations BlinkML performs in order to produce this quality-guaranteed approximate model. The internal workflow of BlinkML is pretty straightforward. First, it profiles the complexity of the model and the data,
05:23
and using this information, it estimates the minimal sample size that is still large enough to satisfy the accuracy requirement. The crucial component in this task is our quality-guarantee mechanism, the accuracy estimator. This component estimates the accuracy
05:41
of an approximate model without training the full model; I'll give more details about this component shortly. Once the minimal sample size is identified, we train the model on a sample of that size and return the resulting approximate model to the user. This is the architecture of our BlinkML system.
06:01
Something nice about our architecture is that we implemented BlinkML on top of commonly used open-source machine learning libraries. In particular, the convex optimization and gradient computations are handled by those commonly used libraries. This makes it really easy to implement our system on top of them, and it is also compatible with the existing ecosystem,
06:21
and we can also easily support distributed or parallel gradient computation. Now let me present our quality-guarantee mechanism. The goal of our quality guarantee is to bound the prediction difference between the approximate model and the full model.
06:40
Let this diff function denote the expected prediction difference between the approximate model and the full model. Our goal is then to bound the value of this diff function by a small constant epsilon. If we can somehow compute the value of this diff function, we can easily ensure that it is bounded by epsilon.
07:02
But the core question is: how can we actually compute or estimate the value of this diff function? Our core idea is this: if we can somehow figure out the difference between the parameters of the full model and those of the approximate model, then we can use this knowledge to bound the difference in predictions as well.
07:23
This approach is very general because all the models we are interested in are basically functions of their parameters. For example, a logistic regression classifier predicts a positive label if the value of the sigmoid function is over a certain threshold; otherwise it outputs a negative label. So we can model
07:41
this prediction function as a function that involves both the test example and the parameters. One way to visualize this is that if a test example falls on one side of the decision boundary, the prediction is a positive label, and otherwise it is a negative label; and this decision boundary is determined by the parameter values, as written out below.
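Written out (again in my own notation), the logistic regression prediction rule just described is a function of the test example x and the parameters theta:

\[
m(x;\theta) \;=\; \mathbb{1}\!\left[\sigma(\theta^{\top}x) \ge 0.5\right] \;=\; \mathbb{1}\!\left[\theta^{\top}x \ge 0\right],
\qquad \sigma(z) = \frac{1}{1 + e^{-z}} ,
\]

so the decision boundary is the hyperplane theta^T x = 0, determined entirely by the parameter values.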
08:01
So our idea is that if we know theta_N, the parameter of the full model, and theta_n, the parameter of the approximate model, then by using this knowledge we can infer the prediction difference between the full model and the approximate model as well. The way to intuitively understand this computation is that we want to compute the size
08:21
of this green area, which is basically the region where the two decision boundaries disagree. But computing the size of this green area is challenging for two reasons. First, we do not know the parameter values of the full model, because we never train the full model itself. Second, the prediction function is typically a non-linear function
08:42
of the parameters, so computing the size of this green area can be computationally expensive. To address these difficulties, we use probabilistic inference: we estimate the diff function using samples obtained from the posterior distribution of the full model's parameters,
09:02
given the parameters of the approximate model. I'm going to talk about how to obtain those samples on the next slide, but for now, let's assume that we can somehow obtain them from the posterior distribution. Using these samples, we can compute the diff function probabilistically. Say, in this example, we have five samples
09:21
that we obtained from the posterior distribution. Using these samples of the full model's parameters, we compute five different instances of the diff function. In this case, in four out of five cases the diff function is bounded by 0.01, so we can say that the diff value is not larger than 0.01 with 80% probability.
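A minimal sketch of this Monte Carlo check (the function and argument names are mine; predict stands for whatever prediction function the model uses, and the posterior samples are assumed to have been drawn already):

import numpy as np

def prob_diff_within(theta_n, posterior_samples, X_holdout, predict, epsilon=0.01):
    """Estimate Pr[diff <= epsilon] by Monte Carlo.

    theta_n           : parameters of the approximate model
    posterior_samples : draws of the full model's parameters given theta_n
    X_holdout         : examples used to measure prediction differences
    predict           : function (X, theta) -> predictions in {0, 1}
    Returns the fraction of posterior draws for which the average prediction
    difference between the two models is within epsilon."""
    preds_n = predict(X_holdout, theta_n)
    within = []
    for theta_N in posterior_samples:
        diff = np.mean(preds_n != predict(X_holdout, theta_N))
        within.append(diff <= epsilon)
    return np.mean(within)

With the five draws from the slide's example, four of them satisfying the bound yields the 80% figure mentioned above.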
09:42
Our system uses thousands of such samples, in a conservative way, for accurate estimation. The remaining question, which I just deferred, is how to obtain those samples from the posterior distribution of the full model's parameters, given the approximate model's parameters. We obtain them by combining
10:00
Fisher's theory and our optimization technique. Based on Fisher's theory, we can show that the parameter of the full model, treated as a random variable given the parameter of the approximate model, follows a multivariate normal distribution with a special covariance matrix structure. The covariance matrix itself may look complex, but its meaning is quite intuitive.
10:23
The inverse of the matrix H basically captures the complexity of the model, and the matrix J captures the variance of the data. This indicates that the gap between the (unknown) parameter of the full model and the parameter of the approximate model grows as the complexity of the model increases or as the variance of the data increases, as sketched below.
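In symbols, the distribution has roughly the following shape (this is my reconstruction from the description of H and J; the exact scaling factor is an assumption rather than a quote from the talk, with H the Hessian of the training objective at the optimum and J the covariance of the per-example gradients):

\[
\theta_N \mid \theta_n \;\sim\; \mathcal{N}\!\left(\theta_n,\; \Bigl(\tfrac{1}{n} - \tfrac{1}{N}\Bigr) H^{-1} J\, H^{-1}\right),
\]

so the gap grows with the model complexity (through H^{-1}) and the data variance (through J), and shrinks as the sample size n approaches N.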
10:44
One challenge in using this expression for obtaining samples is that the covariance matrix can be very large, because its size is quadratic in the number of features. When the number of features is large, this covariance matrix blows up, which can make the sampling process extremely slow.
11:03
To address this computational challenge, we use a mathematical trick, an interesting property of the multivariate normal distribution: if we can somehow factorize the covariance matrix in a certain way, we can obtain those samples
11:22
very efficiently by a simple matrix multiplication, as in the sketch below. Our paper also describes how to obtain such a factor efficiently.
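A minimal sketch of that sampling step, assuming a factor L with L @ L.T equal to the covariance matrix has already been computed (the efficient construction of L is the part contributed by the paper; the names below are illustrative):

import numpy as np

def sample_full_model_params(theta_n, L, num_samples=1000, seed=0):
    """Draw samples of the full model's parameters given the approximate
    model's parameters theta_n, using a factor L with L @ L.T equal to the
    covariance matrix of the multivariate normal described above."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((num_samples, theta_n.shape[0]))  # standard normal draws
    return theta_n + z @ L.T  # each row is one sample of the full model's parameters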
11:43
Let me recap our quality-guarantee mechanism. To estimate the quality of an approximate model trained on a sample of size n, we first obtain the parameters of that approximate model and the factor L. Using this information, we draw samples of the full model's parameters from their posterior distribution. We then compute many different instances of the diff function using those samples, and we can thus ensure that the diff function
12:00
is bounded by a small constant. One problem at this stage, as you may notice, is that we had to train the approximate model itself in order to obtain its parameter values. If we want to try many different sample sizes n, this will require
12:20
a lot of computational resources. To optimize this, our paper describes how to minimize the number of approximate models we have to train; in fact, we can perform this task by training at most two approximate models, roughly as sketched below.
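Roughly, the procedure can be sketched as follows (the helper functions passed in are hypothetical placeholders for BlinkML's own training and accuracy-estimation components, so this is only the control flow, not the paper's actual algorithm):

def train_with_guarantee(X, y, train_on_sample, estimate_min_sample_size,
                         epsilon=0.01, delta=0.05, initial_n=10_000):
    """Control-flow sketch of training at most two approximate models:
    1. train a pilot model on a small initial sample;
    2. use the quality-guarantee mechanism to estimate the minimal sample
       size n_star that satisfies the (epsilon, delta) requirement;
    3. if needed, train one more approximate model on n_star examples."""
    pilot = train_on_sample(X, y, initial_n)
    n_star = estimate_min_sample_size(pilot, X, y, epsilon, delta)
    if n_star <= initial_n:
        return pilot                      # the pilot model already suffices
    return train_on_sample(X, y, n_star)  # second (and last) approximate model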
12:42
Now let me show you some of the empirical benefits our system BlinkML can provide. In these experiments, we used four different models and six different data sets; these are all gigabyte-scale machine learning data sets that include millions of training examples. For an industry-scale experiment, we also included a data set whose feature dimension is up to one million.
13:04
First of all, our system BlinkML offers large speed-ups compared to training the same type of model on the entire data set. The chart on the left shows the speed-up on the y-axis and the requested accuracy on the x-axis.
13:20
You can see that for lower requested accuracies, we deliver huge speed-ups of over 1,000 times. But even for very accurate models, with a requested accuracy over 95%, we still deliver a decent speed-up. Something very interesting about this result is that the speed-up automatically adjusts
13:42
to the requested accuracy. This is because BlinkML automatically figures out the minimum sample size that is still large enough to satisfy the requested accuracy guarantee. The speed-up also automatically adjusts to the model and data complexity.
14:01
On the right side, I show the speed-ups for training a more complex model, a max-entropy classifier, on an NLP data set. We still deliver speed-ups, but you can see that the speed-up for this complex model is smaller than for the simpler models.
14:22
Second, the approximate models trained by BlinkML satisfy the requested accuracy. In these figures, the red lines show the observed accuracy of the approximate models, and the black lines show the accuracy requested by the users. The fact that the red lines are above the black lines
14:41
indicates that in most cases, with high probability, the approximate models satisfy the requested accuracy. You can also see that there are some small gaps, which indicates that the approximate models trained by our system were actually conservative. To show another type of benefit
15:01
that you can achieve with BlinkML, we also conducted a hyperparameter search task. In this chart, each dot represents a model; each red dot represents a proxy (approximate) model. The y-axis shows the accuracy of the model and the x-axis shows the time at which that model finished training.
15:20
Each blue dot represents a regular model. In summary, BlinkML could train 300 times more models within a half-hour time limit. During this process, the sample sizes we used varied between 10K and nine million, and those sample sizes automatically changed depending on the regularization coefficient
15:42
or the feature set chosen. In the end, BlinkML found the best model among the randomly tried models at iteration 91, within six minutes, whereas the regular approach could not find the best model even within an hour. In summary, we showed that BlinkML
16:02
extends the sampling-based analytics approach to commonly used machine learning models, and that our approach offers probabilistic quality guarantees for those models. The core idea in providing this quality guarantee is that if we can understand the uncertainty in the model parameters, then we can use that knowledge
16:21
to bound the uncertainty in the predictions as well. Our empirical study shows that BlinkML can deliver huge speed-ups, and that the internally used minimal sample size automatically adapts to the complexity of the data and the model and to the given accuracy requirements.
16:40
So what's next? The first question we ask is: how can we extend this approach to other types of machine learning models? Although we already cover many commonly used machine learning models, we would also like to tackle other powerful model types such as SVMs, ensemble models such as XGBoost, or deep neural nets.
17:00
The second question is: can we run BlinkML directly on top of SQL engines? This has two types of benefits. The first is that you don't have to migrate or move your data out of the SQL engine. The second is that we can put the compute engine very close to the storage engine, which is a common theme in the database community.
17:21
The third topic is figuring out how to properly propagate the errors to downstream applications. This topic is important not only for approximate query processing but also for other types of approximate computing happening across the entire computing stack. That's all, thank you.