Query-Driven Learning for Next Generation Predictive Modeling and Analytics
Formal Metadata

Number of Parts: 155
License: CC Attribution 3.0 Germany: You may use, adapt, copy, distribute, and make the work or content publicly available, in unchanged or adapted form, for any legal purpose, provided you credit the author/rights holder in the manner specified by them.
Identifier: 10.5446/42919 (DOI)
SIGMOD 2019, Part 72 / 155
Transcript: English (automatically generated)
00:04
Today, I'll be presenting my research on query-driven learning for next-generation analytics. So first of all, let's start with an observation. An exponential increase in data translates to an exponential increase in costs.
00:22
So we have both monetary costs and computational costs. We have computational costs because queries are executed over large data sets, with their answers being returned within minutes, hours, or days, depending on how big your data set is.
00:42
There is research establishing the existence of an interactivity constraint, set somewhere around 500 milliseconds to two seconds: any answer returned to the user beyond that limit can negatively affect the productivity of data analysts.
01:01
To examine the monetary costs, we've investigated Google's BigQuery pricing, and we've seen that as data set sizes increase from one terabyte to 50 terabytes, organizations executing 10 to 100 queries daily
01:21
have to spend many thousands of dollars to keep leveraging these kinds of technologies. So we need to find solutions that are able to address both the computational costs and the monetary costs.
01:41
One way to address the computational costs is using approximate query processing, or AQP for short. AQP provides approximate answers at a fraction of the time needed to compute the exact answer. The way this works is that AQP engines
02:03
construct multiple samples against which the queries are executed, instead of running the queries over the actual data. So the answers being returned are approximate answers, and any query that cannot be answered by the sampling-based AQP engine
02:21
is forwarded to the actual data store, and the exact answer is retrieved from there. However, these kinds of solutions rely on samples, so they still require large amounts of storage. For one terabyte of data, and with a sampling ratio of just 1%,
02:40
we still need to store 10 gigabytes of data. There is also an inherent trade-off between accuracy and sampling ratio: if we want more and more accurate answers, we have to sample more and more of the data set. This can end up breaking the interactivity constraint, as our queries have to be executed over larger and larger samples.
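As an illustrative sketch (not any particular engine from the talk), here is how a sampling-based AQP engine answers an aggregate query: it pre-builds a uniform sample and scales the sample aggregate by the inverse of the sampling ratio. The table contents and the 1% ratio are made up for the example.

```python
import random

# Hypothetical table: 1,000,000 order amounts; we want SUM(amount).
random.seed(42)
data = [random.uniform(10, 500) for _ in range(1_000_000)]

# A sampling-based AQP engine pre-builds a small uniform sample
# (here a 1% ratio) instead of scanning the full table.
ratio = 0.01
sample = random.sample(data, int(len(data) * ratio))

# Approximate SUM: scale the sample aggregate by 1/ratio.
approx_sum = sum(sample) / ratio
exact_sum = sum(data)

relative_error = abs(approx_sum - exact_sum) / exact_sum
print(f"approx: {approx_sum:.0f}, exact: {exact_sum:.0f}, "
      f"relative error: {relative_error:.2%}")
```

Note how the whole accuracy/storage trade-off is visible here: shrinking `ratio` shrinks the stored sample but widens the estimator's variance.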
03:03
They also make use of the same infrastructure as the data store engine, so the samples have to be stored in the same engine as the actual data. The way we came up with to address these shortcomings is to leverage past queries
03:23
and train machine learning models to predict the answers of any new queries instead of creating samples. So the queries are now executed against functions, and the answers are predicted using those machine learning models, those functions.
03:40
Using this approach, we make no use of data as we require no samples. We can get accurate estimations of the answers with just an average relative error of 3%. Our answers can be returned in an efficient manner, as the average response time is just 0.1 milliseconds,
04:03
and our solution is extremely lightweight, as all that is stored in this case are the machine learning models and none of the data. Our solution is also data-store agnostic, meaning that we do not care what store the queries are executed against,
04:22
whether a NoSQL database or a SQL database, as all that is used are the queries and their answers. So one way to implement this methodology is the following. Imagine an analytic query expressed
04:42
with an SQL statement. This statement is first parsed, and its filtering parameters are used to construct a feature vector. So in the end, we have a number of feature vectors with their corresponding answers,
05:01
which translates to queries and their corresponding answers, retrieved by actually executing those queries. We then partition those feature vectors using a clustering algorithm. This clustering algorithm produces subgroups such that
05:21
the queries within each subgroup are more similar to each other than to queries in other subgroups. We then train a regression model over each of those subgroups, which is thus better suited to answering that kind of query. Then when a new query comes in,
05:41
its feature vector is extracted, and it gets mapped to the closest cluster. We find the most appropriate model to use and estimate its answer efficiently. So a way to view the complete analytics stack now is the following.
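A minimal sketch of the pipeline just described, with made-up data and hypothetical range-COUNT queries: past queries are reduced to (lo, hi) feature vectors, partitioned with a tiny k-means, and one linear regression model is fitted per cluster; a new query is then answered by its closest cluster's model, without touching the data. (The talk does not prescribe these specific models; they stand in for whatever clustering and regressors the real system uses.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical table: 100,000 uniform values in [0, 1000). Each past query
#   SELECT COUNT(*) FROM t WHERE x BETWEEN lo AND hi
# is parsed into the feature vector (lo, hi).
table = rng.uniform(0, 1000, size=100_000)

def execute(lo, hi):
    """Run the query against the actual data (done only for past queries)."""
    return np.sum((table >= lo) & (table <= hi))

# Past queries and the answers retrieved by actually executing them.
queries = rng.uniform(0, 1000, size=(500, 2))
queries.sort(axis=1)                      # ensure lo <= hi
answers = np.array([execute(lo, hi) for lo, hi in queries])

# Step 1: partition the feature vectors with a tiny k-means.
k = 4
centroids = queries[rng.choice(len(queries), k, replace=False)]
for _ in range(20):
    labels = np.argmin(((queries[:, None] - centroids) ** 2).sum(-1), axis=1)
    for j in range(k):
        if np.any(labels == j):
            centroids[j] = queries[labels == j].mean(axis=0)

# Step 2: fit one linear regression model per cluster
# (least squares on [1, lo, hi]).
models = []
for j in range(k):
    X = np.column_stack([np.ones((labels == j).sum()), queries[labels == j]])
    coef, *_ = np.linalg.lstsq(X, answers[labels == j], rcond=None)
    models.append(coef)

# Step 3: answer a new query by mapping its feature vector to the
# closest cluster and predicting with that cluster's model --
# no data access needed.
def predict(lo, hi):
    j = np.argmin(((np.array([lo, hi]) - centroids) ** 2).sum(-1))
    return models[j] @ np.array([1.0, lo, hi])

est, true = predict(200, 700), execute(200, 700)
print(f"estimated COUNT: {est:.0f}, true COUNT: {true}")
```

Because the answer of a range COUNT over uniform data is nearly linear in (lo, hi), per-cluster linear regressors recover it closely; richer query templates would need richer features and models.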
06:00
Imagine that we have query-driven learning, the sampling-based approaches, and data stores. We can use all of them side by side, and the user gets to decide which one she wants to execute the query against. If she wants more accuracy, she can move from the left to the right
06:20
to get more and more accurate results, and if she needs more speed, she can move from right to left. A useful analogy is the following: you can think of this as the cache, RAM, and disk of a normal PC. Again, if you want more and more accurate results,
06:42
you move from the left to the right, and if you want more speed, you move from the disk to the cache memory. We've already addressed the computational cost. However, we have the monetary cost to address as well. So normally, we would use BigQuery,
07:02
or let's say Redshift, to perform all of these queries. The on-demand pricing for those engines is $5 per terabyte. Instead, using our technology, we can deploy this as what's now called function as a service.
07:20
Function-as-a-service offerings are a much cheaper alternative: we only get charged 40 cents per one million function calls, that is, method invocations. This translates to one million potential queries that we could execute using our query-driven learning mechanism rather than going to BigQuery or Redshift.
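A back-of-the-envelope comparison using the figures above; the per-query scan size and the workload volume are assumptions made purely for illustration.

```python
# Figures from the talk: $5 per TB scanned (BigQuery/Redshift on-demand)
# versus $0.40 per one million function invocations (FaaS).
tb_scanned_per_query = 1.0        # assumed: each query scans 1 TB
queries_per_day = 100             # assumed workload
days = 30

scan_cost = 5.0 * tb_scanned_per_query * queries_per_day * days
faas_cost = 0.40 * (queries_per_day * days) / 1_000_000

print(f"on-demand scanning:     ${scan_cost:,.2f} per month")
print(f"function-as-a-service:  ${faas_cost:.4f} per month")
```

Under these assumptions the gap is several orders of magnitude, which is why routing even a portion of the workload to the model-backed functions saves money.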
07:41
So you can see that we can save a lot of money by executing a portion of the queries initially intended for the BigQuery clusters or Redshift clusters using our solution. So in short, we have managed to address both the computational cost,
08:00
as our solution requires no extra storage and can provide efficient estimations to queries in just 0.1 milliseconds, and we have also addressed the monetary cost through a function-as-a-service deployment
08:20
provided by most cloud providers. That would be all for my presentation. Thank you very much.