Apache MADlib
Formal Metadata
Title 
Apache MADlib

Subtitle 
Distributed In-Database Machine Learning for Fun and Profit

Title of Series  
Part Number 
27

Number of Parts 
110

Author 

License 
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. 
Identifiers 

Publisher 

Release Date 
2016

Language 
English

00:05
Thanks, everyone. First, a quick word about what I do for a living: I work on distributed systems and databases, together with people from various backgrounds in distributed systems and computer science, and over the years we have looked for interesting projects in this area. The premise of this talk is simple. On one side, very large commercial enterprises keep their data in relational databases, arranged in tabular form. On the other side, machine learning is what everyone wants to do with that data. Put those two together and you get the equation in the subtitle: in-database machine learning equals fun plus profit. So that's the material. The talk has three parts: first, MADlib the open-source project; then in-database machine learning, that is, the architecture of how it actually works; and then some results on scalability and performance. So let's start
03:16
with the first part, and I'd like to start with the history of the project. So this is
03:26
roughly the history of Postgres. It started at Berkeley in the mid-1980s, and over the years there were multiple releases; there are a couple of dates on this timeline that I think are interesting. Then, in the 2000s, the people at Greenplum looked at all of that data sitting inside Postgres and asked whether it was possible to make a distributed solution out of it, a massively parallel processing database, and so
04:22
they forked Postgres. That's the point where Greenplum took the version of the day and built a massively parallel processing engine on top of the database. The really interesting step came a couple of years later, when people realized that you now have all this powerful computing capability sitting right next to
04:52
the data in the database, and it makes sense to add a machine learning component to it. The idea is that you don't move the data out of the database to operate on it; you run the computation where the data already lives, rather than exporting it to some
05:07
external tool. That, in fact, is the advent of MADlib, which was launched in 2011. Shortly after that, the Greenplum folks said: why don't we take this massively parallel processing engine, which runs on local storage, and
05:36
put it on top of distributed storage in the Hadoop ecosystem. That became HAWQ, which is now Apache HAWQ, and MADlib runs there as well. So these started out as research projects, and they have turned into systems you can actually use. So there we have the result
06:23
of an interesting collaboration between industry and academia. The project actually grew out of research published with the University of California, Berkeley, and there was collaboration with Stanford University as well. Why does it live as open source? Because open source is really a great place for developers to come together and work in a collaborative way on software, with transparency about the code and the data. If you have a research project, you can share it, communities form around it, and other groups can build on it; that kind of community is important for a project like this. And it's not just MADlib: Greenplum Database itself was open-sourced, HAWQ went to the Apache incubator, and so all of these products became open source over the course of the last year or so. So that was a little bit of
08:53
history. Now let me talk about what MADlib actually is. It's a scalable, in-database machine learning library that runs in Greenplum and PostgreSQL, as well as in HAWQ on top of Hadoop. One key point is power and scalability: if your dataset exceeds the physical memory of the nodes, lots of other solutions fall over, whereas MADlib is designed to keep working on datasets larger than memory. The other point is performance: if you have a large dataset, you want it to run fast, so that you can work on large datasets and still get your results quickly. These are the functions that exist in MADlib today; the library has on the order of 35 to 40 functions, built up over about five years. You can see the things you would expect: supervised learning such as regression and classification, unsupervised learning such as clustering, descriptive statistics, feature extraction, and so on. A focus of recent development, particularly in the last six months or so, has been matrix operations and related linear-algebra functionality. So the key features are: better parallelism, because it is SQL-based and designed to take advantage of the massively parallel, shared-nothing processing architecture; and better scalability, because the algorithms are designed for scale, so your datasets can grow and you don't change your software.
Even as the data gets bigger, the same code runs: if you have a small dataset, say you just want to test something, you can run it on a single node, and the same calls then run at scale without changing anything. Another key thing is better predictive accuracy: the idea is that instead of sampling your data down to fit it into some external tool, you operate on the full dataset, including high-cardinality features, right where it lives. These are the supported platforms I talked about. Then we have a couple of charts on scalability. This one shows scaling by dataset size: on the x-axis is the size of the data and the number of independent variables, and on the y-axis is the run time. This is linear regression; going toward the top right you have the larger configurations, and the red dots are the measured runs. What this shows is essentially linear scalability of linear regression with respect to dataset size, on the order of ten million rows. And this is what it looks like on the SQL side; this is how you call, for example, linear regression. Here I'm predicting the price of houses, given some historical data about houses and features such as their size. I call the training function on that table and it produces a model; then, if I want a prediction, I call the predict function, again as a SQL statement, and it computes the prediction based on the results of the training. It's very
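The train-then-predict workflow described here can be sketched in plain NumPy. This is not MADlib's SQL interface, just an in-memory illustration of what the training and prediction steps compute; the house features and prices below are made-up numbers.

```python
import numpy as np

# Historical house data: each row is (size_sqft, num_baths); target is price.
# All values are invented for illustration.
X_raw = np.array([
    [1000.0, 1.0],
    [1500.0, 2.0],
    [2000.0, 2.0],
    [2500.0, 3.0],
])
price = np.array([200_000.0, 290_000.0, 360_000.0, 450_000.0])

# "Train": add an intercept column and solve ordinary least squares.
X = np.column_stack([np.ones(len(X_raw)), X_raw])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)

# "Predict": apply the learned coefficients to a new house.
new_house = np.array([1.0, 1800.0, 2.0])  # intercept, size, baths
predicted_price = new_house @ coef
print(round(predicted_price))  # → 332000
```

The two steps mirror the SQL workflow in the talk: training reads the historical table and writes out a model (here, the coefficient vector), and prediction is a cheap dot product of the model with a new row.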
16:15
easy to use. Now I'd like to talk a little bit about the architecture. Many machine learning problems are iterative in nature, which is a bad fit for a naive implementation, so MADlib has a layered design: a SQL interface on top, a driver layer underneath it, and below that the actual core implementations of the methods. What I want to do is walk through the simplest possible example of how we think about scalability, and that example is linear regression. Each row of the data may live on a different segment of the distributed database, so you can't just take an algorithm written for a single node and run it unchanged; we had to think about how to solve this in a distributed way. So here is the example: we have some data points and we want to fit a straight line through them; essentially we seek the coefficients that minimize the sum of squared errors. Ordinary least squares gives a closed-form solution: you set up the matrix equations, and the coefficients come out as the inverse of X transpose X times X transpose y. At first glance that looks like one global computation over all the data. But if you look at the algebra, you can see that it is actually decomposable: the products X transpose X and X transpose y can be separated out, so you can do part of the operation on one node and part on another, and then just combine them. It turns out you can do that using the outer product: the contribution of each row to X transpose X is the outer product of that row with itself, so every node computes the sum of the outer products for the rows it holds, and those partial sums are simply added together at the end. That way of thinking about decomposing learning algorithms is one of the things the university collaboration brought to the project.
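The decomposition described above is easy to verify numerically: the Gram matrix X^T X equals the sum of the per-row outer products x_i x_i^T, so each segment can accumulate the outer products of its own rows and the partials just add up. A minimal NumPy check, on arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))  # 100 rows, 3 features

# Direct computation of the Gram matrix on "one node".
gram_direct = X.T @ X

# Decomposed computation: each row contributes the outer product
# of itself with itself; summing all contributions gives X^T X.
gram_decomposed = sum(np.outer(row, row) for row in X)

print(np.allclose(gram_direct, gram_decomposed))  # → True
```

Because addition is associative and commutative, it makes no difference which node accumulates which rows, or in what order the partial sums are combined.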
Of course, not
20:03
every data scientist wants to
20:07
think about the mathematics of distribution; what you're really keen on is solving your problem, and MADlib does the distributed part for you behind the scenes. You write a regular SQL call, the function executes inside the database, the small combined matrix is inverted on the master, and the results come back to you; all of the data stays in the database, and you just get the model and its statistics out. So, just to finish up, here's what's coming
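That division of labor — segments compute partial sums over their own rows, the master combines them and performs the small inversion — can be sketched like this, with three "segments" simulated as array chunks (the shapes and the true coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((90, 4))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.standard_normal(90) * 0.01

# Phase 1: each "segment" holds a slice of the rows and computes only
# its partial X^T X and X^T y -- a small k x k matrix and a k-vector.
partials = []
for chunk_X, chunk_y in zip(np.array_split(X, 3), np.array_split(y, 3)):
    partials.append((chunk_X.T @ chunk_X, chunk_X.T @ chunk_y))

# Phase 2: the master sums the partials and solves the tiny k x k system.
XtX = sum(p[0] for p in partials)
Xty = sum(p[1] for p in partials)
coef_distributed = np.linalg.solve(XtX, Xty)

# Same answer as a single-node least-squares solve over all the rows.
coef_single, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(coef_distributed, coef_single))  # → True
```

Note how little data moves: each segment ships only a k-by-k matrix and a k-vector to the master, regardless of how many rows it holds, which is why the approach scales with dataset size.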
21:01
next. In the most recent releases, the areas we focused on include support vector machines, with support for nonlinear kernels, and we've added more matrix operations and utilities, things like the cost functions you need. More of that functionality is coming in the future, along with additional predictive models, and we're also thinking about usability improvements. You're more than welcome to participate in the project; the links are on the website, and there's a mailing list as well, so check it out. And with that, I'll take questions.

[Inaudible audience questions and answers.]