We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Apache MADlib

00:00

Formal Metadata

Title
Apache MADlib
Subtitle
Distributed in Database Machine Learning for Fun and Profit
Title of Series
Part Number
27
Number of Parts
110
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
19
20
Thumbnail
44:46
23
30
Thumbnail
25:53
69
Thumbnail
25:58
76
78
79
96
97
Machine learningDatabaseSystem programmingVirtual machineIRIS-TDemosceneDigital photographyMacro (computer science)Set (mathematics)File formatFrame problemExpandierender GraphLevel (video gaming)Function (mathematics)Electronic visual displayDiscrete element methodApplication service providerSoftwareSpherical capDatabaseProjective planeNichtlineares GleichungssystemShared memoryVirtual machineAreaProduct (business)MereologyQuicksortRevision controlComputer architectureBitScalabilityOpen setOrder (biology)Multiplication signProcess (computing)Goodness of fitAnalytic continuationSoftwarePhysical systemOpen sourceNumberConnectivity (graph theory)Parallel portScaling (geometry)Computer fileLocal ringData storage deviceComplex (psychology)Interface (computing)NeuroinformatikINTEGRALResultantCollaborationismForm (programming)Software developerFile formatEnterprise architectureRelational databaseComputer scienceMachine learningParallel computingIntegrated development environmentCategory of beingSlide ruleExpert systemGroup actionLogicPower (physics)Data managementExecution unitSource codeVideo gameGodGrass (card game)Insertion lossMedical imagingPoint (geometry)WordLogical constantXMLComputer animation
Open sourceMassUser interfaceMachine learningDatabaseDiscrete element methodSpherical capSource codeMean free pathSample (statistics)Linear regressionLinear mapScalabilityIdentity managementTable (information)Open sourceMachine learningNeuroinformatikFunctional (mathematics)Linear regressionAlgorithmResultantLaptopScaling (geometry)Set (mathematics)Software developerUnsupervised learningRun time (program lifecycle phase)Matrix (mathematics)QuicksortSoftwareNumberDimensional analysisFigurate numberVirtual machineProjective planeIntrusion detection systemPrincipal idealGoodness of fitGroup actionLibrary (computing)Order (biology)Row (database)Module (mathematics)Parallel portProduct (business)Computer architectureBitScalabilityCartesian coordinate systemoutputFocus (optics)Multiplication signPivot elementAreaSystem callSupervised learningSampling (statistics)Variety (linguistics)Process (computing)Green's functionStatement (computer science)Pattern languageKey (cryptography)SummierbarkeitBuildingState of matterVotingDot productRight angleWordTraffic reportingLikelihood functionDatabaseSemiconductor memorySequelMereologyDirection (geometry)Variable (mathematics)Power (physics)Endliche ModelltheorieRule of inferenceVideo gameInsertion lossMathematicsWeightUtility softwareEinbettung <Mathematik>Computing platformSource codeField (computer science)Lecture/Conference
Modul <Datentyp>Wave packetTable (information)Hill differential equationArchitectureMachine learningDatabaseDiscrete element methodScalabilityImplementationLinear regressionLinear mapVariable (mathematics)Square numberParallel portProduct (business)Spherical capInterface (computing)Game theoryVector potentialLine (geometry)Link (knot theory)Theory of relativityMereologySingle-precision floating-point formatProjective planeMultiplication signOperator (mathematics)AlgorithmDatabaseType theoryProcess (computing)UsabilitySampling (statistics)Inverse elementSequelTheoryGodUniverse (mathematics)Scaling (geometry)ResultantInsertion lossBitQuicksortForm (programming)Cellular automatonStatement (computer science)CircleNonlinear systemComputer architecturePredictabilityWebsiteDifferent (Kate Ryan album)Kernel (computing)Wave packetRight angleSystem callAreaVirtual machineSquare numberAlgebraUtility softwareSkalarproduktraumSupport vector machineProduct (business)Endliche ModelltheorieFunctional (mathematics)Natural numberTerm (mathematics)Core dumpNetwork topologyLinear regressionScalabilityLevel (video gaming)NumberAnalytic setIterationGame controllerInterface (computing)Integrated development environmentSet (mathematics)Distribution (mathematics)Front and back endsClient (computing)Connectivity (graph theory)EmailElectronic mailing listNeuroinformatikSoftwareOrder (biology)Medical imagingMatrix (mathematics)MultiplicationFitness functionNichtlineares GleichungssystemLecture/Conference
Core dumpGoogolComputer animation
Transcript: English(auto-generated)
OK, let's continue. I have a brief announcement first. So the talk about this one, which was about building open source with open source, it's actually cancelled. So the city has an accident, and it will be here next.
But we have a replacement already. So Olaf will talk about Apache BigTok. It is also building stuff with open source. So it's a very, very nice thing. So that's very nice. So now I will turn to the next speaker. This is Michael Gilliam, who will be with us.
And Alex, OK, it's another talk about machine learning from this time on the database. So please, thank you. Thank you very much. Well, good morning, everyone. My name is Frank Lisquelland, and it's a good pleasure to be here this morning with you. Thank you for getting up early on a rainy Sunday
morning to join us here. I'd like to speak to you about Apache Madlib, which is an incubating project that they have now. And the topic is distributed in database machine learning for open funds. So I'd like to start with a couple of facts.
The first fact is that machine learning and distributed systems can be fun, even for folks who are not necessarily attracted by complexity. People come from various backgrounds who work on distributed systems and machine learning, not just PhDs in computer science.
And I don't have a PhD in computer science, for example. So folks out there who are looking for interesting projects to work in, I would encourage you to look in this area, either this project or other projects that you learn about here at scale. But you still have to earn a living.
So fact number two is that if you look at every large commercial enterprise out there, they're using relational databases. They're using data which is arranged in relational form in some way, shape, or form. So for those of you who don't know,
that is the Wall Street bull. So if you put those two together, this is the equation that you end up with, which is that fun plus money equals MATLAB. So that's the area that I'd like to talk to you about today.
So in particular, the topics are, I'd like to talk about the journey to Apache for MATLAB because it's not a new open source project. Then look at database machine learning in a little bit more detail, talk about the architecture of MATLAB,
and then talk about R, making R scalable. So MATLAB also has R interface as well for folks who are interested in using R. Okay, so let's start with the journey to the Apache Software Foundation for Pivotal.
And I'd like to start with a little bit of a history lesson. So this is sort of the history of Postgres as it started in the mid-'80s by a fellow by the name of Michael Stonebraker who developed Postgres out of U.S. Berkeley. So this is a timeline sort of through to the present.
I'll just pick out a couple dates that I think are interesting on this timeline. One that I thought was interesting is that Postgres added support for SQL only about ten years later in 1995, which I thought was interesting since SQL had been around since the 80s.
So in mid-2000, around 2005, there was a bunch of folks at a company called Greenplum who thought about all of that data inside of Postgres and is it possible to make a distributed version of that, to make a massively parallel processing database which is based on Postgres.
So they formed Postgres in the 8.2 version around 2005 and built this massively parallel processing or MPP engine on top of Postgres. That was a very interesting and profitable venture for them.
About six or seven years later in 2011, they realized that they had now all of this parallel computing capability within the database, within this MPP database, but they wanted to add a machine learning component to it. So the idea is that
you don't want to move the data out of the database, operate on it in some kind of external format and move it back in. You want to do everything in database. So that was the advent of MATLAB which was launched in 2011. So shortly after that, with the advent of Hadoop,
the folks at Greenplum and later Pivotal said, why don't we take this massively parallel processing engine swap out local storage and add a distributed Hadoop file system. So take the capabilities of this MPP engine,
bring it to the Hadoop ecosystem, and the advent of something called Hawk which is now Apache Hawk which you'll hear about later today. So if we fast forward to 2015, all of these products, Greenplum, Hawk,
and MATLAB have continued to develop both with contributions from the academic community as well as commercial. And these are all open source projects now. In particular, Hawk and MATLAB are now Apache integrating. So as I said, MATLAB is the result of a very interesting
collaboration between industry and industry as well as academia. So the project was actually one of the key people involved was Joe Hellerstein from UC Berkeley.
He was one of the first guys who actually thought about it in database machine learning, came up with an architecture to realize that. Along the way, we've had a lot of contributions from Stanford, from University of Wisconsin-Madison, as well as University of Florida.
So why Apache? Why do we think that Apache Software Foundation is a good place to go? Well, it's because ASF is really a great place to be. It's a place in which developers come together, work in a kind of collaborative way on software and open and productive ways.
I think transparency is a really important part of ASF. So if you're going to have an open source project in Apache, you better share what your roadmap is, you better communicate with the community actively, share your technical discussions on the user groups. And if you don't have that kind of transparency,
you're certainly doomed as an open source project at Apache. So the other interesting thing is that these technologies are getting so complex right now that it's important for the community to collaborate in order to build them. It's fine for a few folks to be sitting quietly in a room to figure stuff out,
but as these projects get more complicated, you want to scale them out and add capability to them. You really need a community, right? So it takes a village to raise a child, it takes a good open source community to build this software product. Pivotal has kind of taken that at heart,
and effectively all of the commercial products that were developed through Greenplum, including Greenplum database, the Apache, the Cipro on Hadoop product, these have all been open sourced
in the last year or so. So all of the commercial data products are going to be open sourced, including Madlib. So that's a little bit of a history to Madlib and where Madlib comes from. I'd like to go into a bit more detail on Madlib
and talk about capabilities in Madlib, as well as the architecture. Okay, so Madlib is about scalable in-database machine learning. It runs in-database, as I said, in Postgres, in Greenplum database,
as well as Apache Hawk. And the two kind of key things about Madlib to remember are scalability and performance. So if you have data that fits in memory on a single node, then you don't need Madlib.
There's lots of other solutions out there, right? You can use R. You can use Scikit-learn. You can use a wide variety of other open source projects. So if you want to do things that scale in very, very large data sets, that's sort of what Madlib is designed for. The other thing is that
you want them to be performant, in the sense that if you have a large data set and you have a large cluster to run it on, you don't want it to run forever, right? So it needs to be implemented in such a way that even though you're working on large data sets, you get your results back increasingly.
So these are the functions that exist currently within Madlib. As I said, it's a fairly mature library. There's something in the order of 35 to 40 principal functions, or principal algorithms within Madlib.
It's been developed now over the last five or so years. So you see sort of the expected algorithms in the era of supervised learning and unsupervised learning. There's also a number of other support modules.
In fact, this has been a focus of the recent development work because if you talk to any data scientist, they will tell you that a lot of the time that they spend is preparing the data, doing feature extraction or what have you,
in order to input into machine learning algorithms. So although really the focus in people talking about machine learning, they're talking about the algorithms themselves. However, and I think the gentleman earlier mentioned it as well, a lot of work is kind of getting it ready through code and such.
So we've started doing more in the area of support modules. In particular, in the last six months or so, we've added a whole host of matrix operations as well as done things like path functions. And path functions are interesting for finding patterns within data sets.
So the features of Madlib are better parallelism. I think that's a key thing. And Madlib is it's SQL based and it's designed to take advantage of this massively parallel processing
architecture and green fund database, as well as distributed computing capability. Better scalability and by this I mean the algorithms scale as well as your data set scale. You don't want to change your software as your data sets get bigger.
So if you have a small data set, you can do development on your laptop on Postgres, for example. And then you can run that as well on, say, a hot cluster with 100 nodes. The software hasn't changed. It's the exact same software that ran
on your laptop that's running on a 100 node cluster. Another key thing is predictive accuracy. If you speak with data scientists, they don't like to sample data. Even if it's very large data set, especially if you
have dimensions with high cardinality, like user IDs and things like that. If you want to group by those and machine learning by user ID, then even if you have a large data set, if you sample it, you can end up with sparse data. So the idea is to operate on all the data.
So these are the supported platforms, as I mentioned. Both Hawk, and from the database, Postgres are obviously all open source. So scalability. I have a couple of charts on scalability here.
The way to read this one, this is sort of scaling by cluster size, if you will. On the x-axis, you have number of independent variables, and then on the y-axis is the execution time. This is for linear regression on 10 million rows. You can see in the top right-hand side there,
if you have six segments, segments of worker node, it takes about 200 seconds. If you double the number of segments, so you come down to the red dot on the right, you can see that execution time is approximately half about 100 seconds. If you double the number of worker nodes
again to 24 worker nodes, then again your execution time is halved as well. So this is just to show you the linear scalability of that cluster size. With respect to regression, with respect to scaling
by data size, this shows linear regression scalability for 10 million, for 1 million, 10 million, and 100 million records. You can see that it's linear. This is logistic regression.
This is logistic regression So this is what the example usage looks like. So it's SQL based, as I said. So it's a declarative call within a SQL statement. So this is how you would train a model, for example for linear regression.
Here I'm predicting the price of houses given some historical data on houses including how much is paid in tax, number of bathrooms and size for example. So I train that model. Then if I want to do a prediction, I take the result from that
and again I call and again in my SQL statement I do a linear, I do a prediction based on the result of the training. So it's fairly easy to use. I'd like to talk to you a little bit about the architecture.
So many machine learning problems are iterative in nature. the architecture with the way it's set up with Madeleine is that it's a combination of Python and C++.
So we have this high level iteration controller layer which is written in Python. The actual core algorithms themselves are implemented in C++. And there's a C++ extraction layer to the database.
So to look at how scalability would work for linear regression. each algorithm needs to be kind of crafted for a distributed environment. As you guys probably all know that you can't just
take an algorithm which is written for a single node and automatically have it distributed. So when you're developing a new algorithm in Madeleine, you need to think about how to take that algorithm and distribute it across to components that still need to be distributed across multiple nodes. So we take kind of a
passive linear regression example and think of it as a straight line fit through a set of data and we want to find essentially C we want to find our slopes if you will.
So the way you can solve that is ordinary squares the equation looks like this so when you set up your matrix X you want to kind of distribute your data so you think, well why don't I go ahead and distribute it like this I'll put A and B in worker node one C and D in worker node two
if I do the transpose then A and B in worker node one C and D in worker node two and then I start doing my multiplication what I see is that I'm actually operating across segments here because A and C are in different segments so that's going to increase network traffic
so I don't want to do that I want to reduce the amount of network traffic I have between worker nodes so I need to kind of rethink this probably so if you kind of look at the algebra you can see it's actually decomposable you can see A squared B squared and AB
you could separate those out then you could do everything, every operation on those in one node all the operations on C and D in another node and then just collect them after so it turns out you can do that using something called an outer product as opposed to an inner product
and the idea then is that you can see all of the operations on the left hand side of the second line could be done in one node all the operations on the right hand side could be in another, done in another so this is kind of the way that we think about decomposing machine learning algorithms for distribution in an MQP environment
same kind of idea for the second term here you can take you can decompose that to do operations in segment one, segment two or any number of segments bring them together onto the master okay, I'm going to skip ahead a little bit through this
not every data science speaks SQL so many data scientists use R so what we've done is built an R interface for Madeline so what it allows you to do is use familiar R syntax on your client machine
but have it actually call this distributed MQP database in the back end in order to scale up on so the way that works is that you write regular R in the client that converts it to SQL it executes in the database that's where the data lives
and then it returns the result set back to the client right, so the big data, if you will, is in the database the result set comes back to the client so just to finish up
so what's coming up we have our 1.9 release which is coming up within the next month or so the areas that we focused on here is we've completely refactored the support vector machines including support for non-linear kernels we've added more in the area of utilities as I mentioned matrix operations, path functions
stemming and some other text analytics kind of functionality and seeing the future there are potential areas that we're very interested in working on a lot more in the area of utilities a bunch more predictive
models that can go as listed here and in terms of usability thinking about Python MPI as an example so thank you very much for your attention you're more than welcome to participate in the project
there's some links for the website where the software is on GitHub and the mailing lists and check it out and if you have any questions send them to the mailing list and we'll be happy to respond thank you for your attention
ok, so in one of the last images you said computation is carried out in the database happens within
the database and the data is stored in the database so that it is actually in the same customer anyone wants a t-shirt?