Apache MADlib

Video in TIB AV-Portal: Apache MADlib

Formal Metadata

Title
Apache MADlib
Subtitle
Distributed In-Database Machine Learning for Fun and Profit
Title of Series
Part Number
27
Number of Parts
110
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
2016
Language
English

Content Metadata

Subject Area
Logical constant Multiplication sign Source code Database Mereology Number Power (physics) Medical imaging Nichtlineares Gleichungssystem Physical system Form (programming) Area Machine learning Enterprise architecture File format Relational database Projective plane Database Virtual machine Data management Word Computer animation Logic System programming Computer science
Machine learning Process (computing) Multiplication sign Database Quicksort Grass (card game) God
Point (geometry) Machine learning Connectivity (graph theory) Virtual machine Database Insertion loss Database Demoscene Neuroinformatik Revision control Digital photography Category of being Macro (computer science) IRIS-T Set (mathematics)
Machine learning Frame problem Projective plane Execution unit File format Expert system Parallel port Database Expandierender Graph Revision control Integrated development environment Function (mathematics) Energy level Electronic visual display Resultant
Application service provider Group action Building User interface Linear regression State of matter Multiplication sign Direction (geometry) Source code Set (mathematics) Database Insertion loss Mereology Likelihood function Mathematics Semiconductor memory Matrix (mathematics) Endliche Modelltheorie Area Predictability Machine learning Source code Collaborationism Algorithm Linear regression Software developer Open source Shared memory Mean free path Parallel port Mass Unsupervised learning Variable (mathematics) Process (computing) Sample (statistics) Hill differential equation Summierbarkeit Right angle Modul <Datentyp> Resultant Slide rule Functional (mathematics) Table (information) Open source Sequel Variety (linguistics) Discrete element method Rule of inference Scalability Field (computer science) Product (business) Power (physics) Wave packet Software Utility software Traffic reporting Computing platform Linear map Computer architecture Dot product Focus (optics) Scaling (geometry) Weight Projective plane Database Cartesian coordinate system Scalability System call Wave packet Word Voting Software Network topology Spherical cap Einbettung <Mathematik> Statement (computer science) Video game Identity management Library (computing)
Linear regression Multiplication sign Virtual machine Database Insertion loss Parallel port Mereology Discrete element method Scalability Variable (mathematics) Product (business) Architecture Natural number Different (Kate Ryan album) Single-precision floating-point format Operator (mathematics) Core dump Square number Circle Implementation Algebra Skalarproduktraum Linear map Computer architecture Machine learning Algorithm Theory of relativity Linear regression Cellular automaton Sampling (statistics) Bit Line (geometry) Scalability Product (business) Type theory Process (computing) Universe (mathematics) Right angle Quicksort Square number
Functional (mathematics) Sequel Link (knot theory) Database Inverse element Discrete element method Vector potential Theory Usability Term (mathematics) Operator (mathematics) Utility software Endliche Modelltheorie Game theory Form (programming) God Area Machine learning Support vector machine Scaling (geometry) Projective plane Sampling (statistics) Database Bit Kernel (computing) Nonlinear system Spherical cap Interface (computing) Website Resultant
Computer animation Googol Core dump
[unintelligible] A little about me to start: this is what I do for a living, and I have been doing it for quite a while. The people involved come from various backgrounds, such as distributed systems and computer science more broadly. I suppose many of you are looking for interesting projects, and over the years we have found that people like to work in this area. But you still have to make a living, so point number two is that very large commercial enterprises use relational databases, with data arranged in a structured form. Put those two together and you get the equation [unintelligible]. The talk has a few parts: an introduction to MADlib as an open source project, then a look at in-database machine learning and the architecture, and then some discussion of scalability. So let's start
with the introduction. I'd like to start with some history, so this is
sort of the history of Postgres. [unintelligible] There are a couple of dates on this timeline that I think are interesting. [unintelligible] In the 2000s, [unintelligible] thought about all of that data inside of Postgres and asked whether it was possible to make a distributed solution, that is, massively parallel processing.
They built [unintelligible] a version with this massively parallel processing engine on top. The interesting part came a few years later, when [unintelligible] realized that they now had this whole parallel computing capability
within the database, and the idea was to add a machine learning component to it. The point is that you don't move the data out of the database to operate on it in some
external tool; you want to do everything in place. That was the advent of MADlib, which was launched in 2011. Shortly after that, [unintelligible] said: why don't we take this massively parallel processing engine and, rather than local storage, put it on
distributed storage, so that the capabilities of this engine can be brought to the Hadoop ecosystem? That became HAWQ, which is now Apache HAWQ. [unintelligible] These are all projects that you can work on. [unintelligible]
MADlib is the result of an interesting collaboration between industry and academia. The project was originally [unintelligible] at the University of California, Berkeley [unintelligible], and there has also been work with Stanford University [unintelligible]. So why Apache? It is really a great place for developers to come together and work on software in a collaborative way, with transparency [unintelligible]. So if you have a research project that you want to share, this is a place to do it [unintelligible]. It is an important community for us. [unintelligible] The products, including the database and Apache HAWQ, were all open sourced in the last year. So that was a little bit of history.
Let me talk about the features. MADlib is scalable: it runs in-database, in Postgres [unintelligible] as well as Apache HAWQ, and it is a matter of power and scalability. [unintelligible] the physical memory of the nodes [unintelligible] a lot of other solutions. [unintelligible] The other thing is performance: if you have a large dataset, you want things to run fast rather than waiting [unintelligible], so that you can work on large datasets and get your results. These are the functions that exist in MADlib. [unintelligible] There are about 35 to 40 principal functions, built up over the last five years or so. You can see supervised learning and unsupervised learning [unintelligible]. The focus of recent development has been [unintelligible] in the area of feature extraction [unintelligible]. [unintelligible] In the last six months or so we have added matrix operations [unintelligible]. As for the key features: better parallelism, and the key thing is that it is SQL based and designed to take advantage of the massively parallel processing architecture [unintelligible]. Then there is scalability: the algorithms are designed to scale with the dataset, so you don't change your software as the data
gets bigger. If you have a small dataset that you just want to test with, for example, you can then run the same thing on the full dataset without having to change [unintelligible]. [unintelligible] These are the supported platforms [unintelligible]. This chart shows scaling by data size: [unintelligible] variables on the x axis, and on the y axis [unintelligible]. This is a linear regression, and [unintelligible] segments [unintelligible]. This just shows our scalability for linear regression with respect to data size; it shows linear regression scalability on 10 million [unintelligible]. This is what the SQL looks like. This is how you call, for example, linear regression, here predicting the price of houses given some historical data on houses [unintelligible]. You train the model, and then if you want a prediction from the results, you again call it with a SQL statement, and the prediction is based on the results of the training. It's very
easy to use. I'd like to talk a little bit about the architecture. Many machine learning problems are [unintelligible] in nature [unintelligible]. We have [unintelligible] layers, with [unintelligible] the actual core [unintelligible]. Let's look at one of the simplest possible examples to see how scalability works, and we will go over this with linear regression. Each of the algorithms has been crafted to be distributed [unintelligible]; you can't just take an algorithm written for a single node [unintelligible], so we need to think about how to [unintelligible] distribute it. As an example, say we have points that lie roughly on a straight line, and we want to find, essentially, the line that minimizes the sum of the squared errors; this is ordinary least squares. We set up the matrix [unintelligible] and think about how to distribute it [unintelligible] the transpose [unintelligible]. If you look at the algebra, you can see that it is actually decomposable: [unintelligible] the pieces can be separated out, so you can do the operations on one node and the operations on another node and then just add them together. It turns out you can do that using something called [unintelligible] product. The idea is that you can do all the operations locally on each node [unintelligible]. This is the kind of thinking that goes into composing machine learning algorithms for distributed execution.
the lecture this time not
every data science sample and
[unintelligible] how to actually solve this in a distributed way [unintelligible]. The way it works is that [unintelligible] the SQL executes in the database and then returns the results [unintelligible]. So, just to finish up, here is what's coming
up next. Some of the areas that we are focusing on are support vector machines with nonlinear kernels, more utilities and matrix operations, [unintelligible] functions, [unintelligible] more predictive models, and usability [unintelligible]. You are more than welcome to participate in the project; the links are on the web site [unintelligible], so check it out. Any questions? [unintelligible]
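The ordinary least squares example in the architecture part of the talk rests on a standard algebraic fact: the normal-equation terms are sums over rows, X^T X = Σ x_i x_i^T and X^T y = Σ x_i y_i, so each node can accumulate its own partial sums and the partial sums can simply be added together. The following is a minimal Python sketch of that idea, not MADlib's actual implementation. Plain lists stand in for database segments, and the function names (`partial_sums`, `merge`, `solve`, `fit`) and the toy data are hypothetical, chosen only for illustration.

```python
# Sketch only: plain Python lists stand in for database segments.
# All function names are hypothetical; none of them are MADlib APIs.

def partial_sums(rows):
    """Per-segment step: accumulate X^T X and X^T y over the local rows."""
    d = len(rows[0][0])
    xtx = [[0.0] * d for _ in range(d)]
    xty = [0.0] * d
    for x, y in rows:
        for i in range(d):
            xty[i] += x[i] * y
            for j in range(d):
                xtx[i][j] += x[i] * x[j]  # one entry of the outer product x x^T
    return xtx, xty


def merge(a, b):
    """Combine step: element-wise sum of two partial results."""
    (xtx_a, xty_a), (xtx_b, xty_b) = a, b
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(xtx_a, xtx_b)]
    xty = [p + q for p, q in zip(xty_a, xty_b)]
    return xtx, xty


def solve(a, b):
    """Final step: solve (X^T X) beta = X^T y by Gaussian elimination
    with partial pivoting. Assumes the system is non-singular."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta


def fit(segments):
    """Run the per-segment step on every segment, merge, then solve once."""
    parts = [partial_sums(rows) for rows in segments if rows]
    total = parts[0]
    for part in parts[1:]:
        total = merge(total, part)
    return solve(*total)


# Toy data generated from y = 2 + 3x, with features [1, x], split over 3 "segments".
segments = [
    [([1.0, 0.0], 2.0), ([1.0, 1.0], 5.0)],
    [([1.0, 2.0], 8.0)],
    [([1.0, 3.0], 11.0), ([1.0, 4.0], 14.0)],
]
beta = fit(segments)
```

With this toy data the fitted coefficients come out as an intercept of 2 and a slope of 3, and the result does not depend on how the rows are split across segments. Only the small d-by-d matrix and d-vector travel between nodes, never the rows themselves, which is what makes the per-node work independent.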