MADlib
Formal Metadata
Title: MADlib
Number of parts: 20
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You may use, modify, and copy, distribute, and make the work or its content publicly accessible for any legal, non-commercial purpose, in unchanged or modified form, provided you credit the author/rights holder in the manner they specify and pass on the work or content, including in modified form, only under the terms of this license.
Identifiers: 10.5446/19025 (DOI)
PGCon 2012, part 6 of 20
Transcript: English (auto-generated)
00:05
Well, I think it's time, so I want to start my presentation about MADlib. I'm Hitoshi Harada. Thanks for coming today. First of all, I want to talk about myself.
00:23
Let me introduce myself. I initially started to work on Postgres as a hacker. I wrote the window function patch in 8.4 and extended it in 9.0. And in the 9.1 development cycle,
00:42
I helped with the writable CTE feature with David Fetter here. And I'm not sure if Marko is here or not, but it's a great feature. And I'm now working on PLV8. Thanks, Peter, I really appreciate the advertisement
01:01
earlier. It's a cool feature, but actually I'm not talking about PLV8 today. I'm also working on other modules like twitter_fdw and tinyint, which is a one-byte integer extension.
01:20
And I joined Greenplum last year. I've really enjoyed the development life at Greenplum. So I just want to ask you: how many people here have ever heard of Greenplum?
01:40
Oh, awesome, great. So a lot of people may know about Greenplum, but let me just talk about the architecture of Greenplum. Greenplum is a company that develops the Greenplum Database, which is forked from Postgres 8.2, and it's a distributed database system.
02:03
This is the typical cluster setup of Greenplum. Here is a master server, and the cluster has a whole bunch of segment servers. A query is dispatched from the master to the segments, and the data is distributed across the segments.
02:22
So the query processing is parallelized across the segment servers. A lot of customers have, say, terabytes or petabytes of data, and they process that huge amount of data in Greenplum.
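For illustration, a minimal sketch of what that looks like from SQL (the sales table and its columns are hypothetical, not from the talk): a Greenplum table declares a distribution key, and ordinary queries are parallelized transparently.

    -- Greenplum-specific DDL: rows are hash-distributed across segments by customer_id.
    CREATE TABLE sales (
        customer_id bigint,
        region      text,
        amount      numeric,
        sold_at     timestamp
    ) DISTRIBUTED BY (customer_id);

    -- This aggregate runs on every segment in parallel; the master merges the results.
    SELECT customer_id, sum(amount)
    FROM sales
    GROUP BY customer_id;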
02:47
So let me explain what's going on at Greenplum. I think you guys are a little tired of hearing about big data.
03:00
A lot of people are talking about big data, not only us but also the media, like CNN or blogs, people around here. And we think the true big data era is here.
03:23
So here's a simple example about a typical customer of ours. It's just an example, but it explains what we are doing. In the legacy system, the enterprise system,
03:41
the customer couldn't do any reporting because the data was too huge and the legacy system was too slow. They tried to run the reporting queries, but they took days or sometimes weeks.
04:01
Greenplum is a massively parallel processing system. So they bought a Greenplum database, and the system could run the reporting queries in a few seconds
04:20
over a big chunk of data. So they started to understand the facts, what's going on behind the data. And after that, the customer started to predict the future, or they are
04:42
trying to optimize their profit based on the data. So after understanding the simple facts behind the data, they are trying to leverage the data to optimize their profit.
05:01
For example, this customer is trying to do user-based recommendation on top of the data. So they need to aggregate and leverage their data. And the problem we are facing is this.
05:22
This is a very simplified traditional BI analytics workload. On the left-hand side there is a database, which may be Postgres or Greenplum. And the analysts use analytics tools,
05:40
like SAS or R, which you may have heard of, or maybe some kind of BI tools. They needed to extract the data out of the database, put it into the analytics tool, run the analysis, and then load the results back into the database.
06:05
Now, today, databases are getting bigger. And companies are collecting all kinds of data, not only from the enterprise system,
06:22
but also from Facebook, Twitter, or office applications. So the problem here is that that software, the analytics tools, is not designed for this kind of big data.
06:42
Basically, those kinds of tools are in-memory systems, and it is hard to parallelize them. So performing the analysis is a bit of a challenge.
07:04
And still they needed to extract the data out of the database, but it is impossible to run the analysis on the entire data set. So they needed to sample: they needed to extract a small subset of the data
07:22
and put it into the analytics tool. That doesn't solve the problem. So we are trying to push the analytics calculation into the database. This is exactly what we want to do.
07:41
The main concept here is three words: magnetic, agile, deep. Magnetic means the database is now like a magnet: it collects all kinds of data, not only structured data but also unstructured data, from sources like social networks. And agile:
08:06
I think the nature of analytics is iteration. You need trial and error to get some insights
08:21
and feed them back to the line of business, and then try hypothetical analyses based on the insights you got. And deep: I think standard SQL defines
08:43
some analytics features, like simple aggregate functions and window functions; there are also grouping sets to drill down in your analysis (a quick sketch of those below). But it's not enough.
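For reference, a minimal sketch of those plain-SQL features, reusing the hypothetical sales table from before (grouping sets are in the SQL standard and in Greenplum, though not in the Postgres of that era):

    -- Drill-down with grouping sets: totals per (region, customer), per region, and overall.
    SELECT region, customer_id, sum(amount)
    FROM sales
    GROUP BY GROUPING SETS ((region, customer_id), (region), ());

    -- A window function: running total per region, ordered by sale time.
    SELECT region, sold_at,
           sum(amount) OVER (PARTITION BY region ORDER BY sold_at) AS running_total
    FROM sales;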
09:03
We do need more accurate, deeper statistical methods, like linear regression or more complicated methods. So here comes MADlib. We are developing MADlib. First of all, we introduced this new usage
09:24
of a database, the Enterprise Data Warehouse. In 2009, at VLDB, colleagues from Greenplum together with Joe Hellerstein from the University of California, Berkeley, put this idea together into one paper
09:40
and introduced "MAD Skills: New Analysis Practices for Big Data", which describes the things I'm explaining now. And we started the MADlib project as a software development effort. And we reported on its status this year,
10:04
just a few months ago: how the MADlib project stands now. So, MADlib, the definition, why we call it MADlib: MAD stands for magnetic, agile, deep, which I explained just now,
10:25
and Lib stands for, of course, library. MADlib is an add-on to Postgres or Greenplum. It is just a library: you can install it into your database, and you can run the analytics methods.
10:44
MADlib has advanced methods: mathematical, statistical, and machine learning modules. And it is designed to be powerful and scalable, because the Greenplum database is a parallel database, and you can scale out for big data.
11:05
And all the interfaces for the analytics methods are defined as database functions, so it's all just SQL functions. And the mission is to foster widespread development
11:21
of scalable analytics skills. We want to harness efforts from commercial practice, from us, as well as academic research from the universities. And we want to do this as an open source project, because we want to have more and more contributions
11:44
from the academic side. And yes, this is BSD licensed. And of course, you can hack on this project, and you can send a pull request.
12:03
So this is a collaborative project between Greenplum and the universities. But it takes a very neutral position, so it is not necessarily a Greenplum project. Everybody can contribute to the source code.
12:23
Currently, we have contributions from the University of California, Berkeley, Wisconsin-Madison, and Florida. And because the Greenplum code base shares a lot of Postgres interfaces,
12:41
MADlib currently supports both Postgres and Greenplum. The supported Postgres versions are 8.4 to 9.1, and Greenplum 4.0 to 4.2. And it is designed for data scientists, to provide more scalable, robust analytics capabilities.
13:05
You can find the information on madlib.net, and the source code is hosted on GitHub, so you can just clone it to your desktop. And if you have any questions about MADlib or its usage,
13:22
just feel free to post to the Google Group. So, MADlib is the sane answer. We think MADlib is the sane answer to big data, because MADlib is designed for better performance
13:40
and scalability. MADlib runs inside your database, so you don't need to extract any data from the database; it just runs inside. And on Greenplum it leverages the parallelism. And it's very easy to use, because it's just
14:02
SQL functions, so you don't need any additional tools: SQL is your friend. And also, this is open source, so you can just hack the source code. I mean, this kind of analytics is by nature very complicated, and sometimes you
14:24
want to customize the modules. A predefined package may not be enough for you, but you can just read the source code, change some parameters or the module, and replace some parts of the modules.
14:45
And of course, it's free. I mean, yes? Is there some process whereby we could make changes? Yeah, so this is the typical GitHub process,
15:00
so you can just send a pull request, yes. You have a question? Yes? On Netezza? I'm not sure about the Netezza interface, but I think some ideas are still shared
15:20
between Netezza and Postgres. Yeah, it's what Postgres supports. Right, right. I'm not sure about the current status of Netezza's internals, but yeah, we'd appreciate the contribution if you are familiar with it. Actually, some database vendors like us
15:41
are providing similar predictive analytics modules, but they are proprietary software, and you cannot look inside. And typically they're very expensive. But MADlib is free.
16:04
So, the current status of the MADlib roadmap: inside Greenplum development, we are targeting a release every quarter. We've just released version 0.3.
16:21
The project itself is still a little in the startup phase, but you can already use a whole bunch of modules like these: linear regression, logistic regression, k-means clustering, decision trees. And we are going to release a new version,
16:41
0.4, at the end of this quarter, with distribution functions and random forest. Okay, so let's go inside the MADlib architecture.
17:02
MADlib has a lot of parts. On the top, we have Python UDFs, so it depends on PL/Python, and you need to install PL/Python first. PL/Python controls the looping algorithms
17:22
and external libraries. Some algorithms, like clustering, need to do a convergence calculation, and it's not easy to run that type of algorithm in plain SQL, so we use Python there. Then, for the simplest methods,
17:43
like linear regression, we use just SQL functions, UDFs or UDAs, user-defined aggregates.
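As a rough sketch of what a user-defined aggregate means here (the table and column names are hypothetical, and the exact signature differs across MADlib versions), a linear regression fit is a single aggregate over the rows:

    -- madlib.linregr is a user-defined aggregate: one pass over the table
    -- accumulates the sufficient statistics, and the final function solves
    -- for the coefficients.
    SELECT (madlib.linregr(y, x)).coef
    FROM some_table;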
18:00
Below that, we have a C++ abstraction layer, because the main concept of MADlib is not only Postgres and Greenplum. We have the abstraction layer in C++, and if someone is familiar with another database, like Netezza, they can just write a connector against this abstraction layer. And the DBMS backend has the core modules.
18:26
We have a simple compressed vector representation as a user-defined type, and also connectors like the aggregate function interface.
18:42
The whole contents look like this. MADlib consists of data modeling, descriptive statistics, and support modules. For the machine learning methods, we have supervised learning, like linear regression, logistic regression,
19:02
naive Bayes classification, decision trees, and SVM. And for unsupervised learning, we have association rules, k-means clustering, and SVD. We also have descriptive statistics, and the support modules, like the array extensions.
19:22
And we have the sparse vector on the right-hand side. The sparse vector is a user-defined type that compresses a typical sparse vector, so you have a very efficient sparse vector in your database. And we also extend the Postgres array type, so you can do things like summation over an array.
19:44
So we just defined additional array functions in MADlib; a small sketch of the sparse vector idea is below.
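(The literal format below follows MADlib's run-length-encoded svec type as documented; treat the exact function name as an assumption.)

    -- Run-length encoding, {counts}:{values}: this svec stands for a
    -- 102-element vector containing a 5, then a hundred 0s, then another 5.
    SELECT '{1,100,1}:{5,0,5}'::madlib.svec;

    -- Operations work directly on the compressed form, e.g. element-wise
    -- addition with a 102-element vector of ones.
    SELECT madlib.svec_plus('{1,100,1}:{5,0,5}'::madlib.svec,
                            '{102}:{1}'::madlib.svec);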
20:00
A good part of MADlib is that we have a good amount of documentation online. You can just go to doc.madlib.net, and the documentation is good because it's not only about how to use it,
20:27
So you can just go around here and understand the kind of idea and use it. Okay, so Madlib is,
20:42
so I'm going to Madlib use cases. So what can you do, actually? I'm going to talk about two types of machine learning here, unsupervised learning,
21:00
unsupervised learning, and unsupervised learning. So I think people around here may know about the differences between two, but anyway, I explained the two types of learning. So unsupervised is learning from raw data,
21:23
which means you don't need any label on data, it's just you categorize the data that exists in your database. And in contrast to unsupervised, unsupervised learning is if you have any historical data,
21:43
which are rabid, then you just, you can just build predictive model from the historical data, and with that model you just build,
22:02
you predict the new observation, you can just, you can level the new data with that model. So type code example is unsupervised learning
22:20
is a consumer market segmentation study, which I'm going to talk about. The method is k-means crossing, and for supervised learning, cross by the email as a spam or non-spam. So type code application is a spam filter, and there are a lot of implementation for the spam,
22:43
but we use logistic regression on my base. If the label is not a yes or no, then we can use a decision tree for the multi-label problem.
23:01
Okay, so market segmentation. This is very easy to understand, I think. So customer segmentation study, if you have some kind of customer database, then you can just run the k-means cluster, and then the data is automatically categorized like this.
23:25
So you can see the red group as the height brand lawyer group, the type code customer is age 34 and above.
23:41
They tend to shop at Nordstrom, and in contrast to that, the green group is a group for the customers who are a price sensitive group, and the budget conscious, they tend to shop at Costco, not Nordstrom.
24:04
Okay, so how to do that? So we have a k-means clustering module in MALLE, and after you install MALLE in your database, prepare your data. So here we have the input points, which consists the customer data,
24:21
and the customer ID, and some kind of attribute. Here we use just a simplified example. So k-means function needs to represent the attributes as a vector, not a column.
24:41
So first of all, you need to transform your attribute column to an array here. So just select array xy as a float array, and then save it. Then the table will look like this.
25:03
And the k-means, so k-means is a traditional way to cluster your data. So there are a bunch of initialization approaches, which may impact your analytics quality.
25:23
So you need to choose one of the initialization approach. So k-means is just like a convergence calculation, and the k-means do some loop algorithm. So just calculate the distance from the center
25:46
of the cluster from each data point. Then just adjust the cluster position until the result is stable.
26:00
So actually the initialization point is very important stuff for the good quality result. This is a kind of the trial and error stuff, and you can just customize the method in the k-means cluster.
26:22
Some of it is very simple to choose one of the data as an initialization point, which is k-means random. And k-means purpose is a little improved method to do that, but the cost to initialize is a little higher than random.
26:42
And you can just keep the centralized set as a data set in an arbitrary set. And also, depending on your program, you can choose one of the distance metrics.
27:00
So for this application, you can just L2-norm, aka the Euclidean distance. So for the spatial data, I think the L2-norm is enough. But for the thousand of this dimension of vectors, or the documentation clustering program,
27:23
you need to choose cosine, or Tanimoto, and Jacquard index. And for the upcoming release, we make it pluggable, so you can just write your customized distance metrics function and put this
27:42
into a k-means method. And then now you can run the k-means function. So select star from madlib.k-means++. Actually, it is a little hard to read,
28:02
but this is just a function. And the input is like a table name, the column name of the ID or attribute, and then which distance metrics you choose. And which, so maximum number of iteration,
28:24
the convergence parameter, above us or not, then do it. And then you've got the results. So for this example, the other points is,
28:42
so these two columns are the input column, and the third column, CID, is the result of which categorized point it belongs to as a result. And then each central is represented in another table,
29:02
so you can just look up the output here. That's the unsupervised learning for k-means cluster. Now is how to supervise learning, and the use case is heart attack risk analytics.
29:27
So classification analysis is, as I explained earlier, the classification is identified which category a new observation belongs to, with known observations. So typically the classification process is spread
29:42
to two parts, one of which is training, one of which is the classification. Training is to build your model based on your level data. And the classification process is, with this model you build, classify the new observation.
30:02
And the example method, logistic regression, for the multi-classification process, we have decision trees, also you can use naive Bayes. So here I'm explaining about logistic regression
30:21
for heart attack prediction. Calculate the potential risk of heart attack based on the historical patient data with a number of attributes. So for example, the patient data may have the age,
30:41
cholesterol, volume, height, weight, anything else? And then again, you need the input data, create table, coronary, with age, blood pressure, cholesterol, height, weight.
31:01
And the last column, heart attack, is level, so it's yes or no. So the historical data which this record had heart attack. This record didn't have heart attack. Then transform, again, transform into an array
31:22
as a vectorize. Just select array, column, column, column. And train the data, build the model. Select star from modeling, log, regular function. This is much simpler than K-means.
31:42
Again, you need to specify which table is input, which column to specify as a feature, as an attribute. Then the build model is like this. So the result is just a record type,
32:02
so this is just expand as a row. So the feature, like age, has coefficient, negative nine dot 105. So standard error is here, this is just an example.
32:22
So I think the value of standard error is very bogus here, but. This example shows that the blood pressure has a big impact on the heart attack, based on the historical data.
32:44
And then the classification. So if you have new data, without blue label, and then you have now the build model, run the logistic function with the dot product,
33:00
with your model, then the result is logistic seven dot 88, E minus zero log, zero six. Again, this result is, I think it's not bad, but you, so basically the logistic regression
33:22
returns from minus one to one, and if the result value is positive, then the risk is positive, and if the result value is negative, less than zero, then the risk is negative. So the good part of logistic regression
33:42
is not just yes or no, it returns the possibility based on the historical data. So how to deploy, how to install the MALDI?
34:02
Now, I just uploaded the module to PgXN. Right, I mean, so the PgXN client, actually, as of today, PgXN client cannot install MALDI,
34:21
because PgXN client needs a patch that I send to as a pull request data also, because the MALDI project needs some configure and make. But PgXN client, as of today, assumes that there is only a make file. But I hope that in a few days,
34:42
those are merged in the pull request, and you can just say PgXN install MALDI, then your Postgres is now ready to do this kind of predictive analytics.
35:01
Just for mentioning, we have similar utility tools for Greenblum. So actually, the Greenblum is providing a Greenblum commutation, which you can use without any money. So you just download the Greenblum binary package
35:22
from Greenblum's site, and then you just install it in your Linux, and there's no limitation. So you can run the Greenblum database processor in your, say, Amazon EC2, in a distributed way, and run your analytics with this command.
35:46
Two already, but yeah. Yes, so MALDI is an open source project, and we want you to contribute if you have any insights or new port,
36:01
or for a new method for the machine learning, or any good parts, then yeah, we are welcome. That's it. Oh, sorry, yes.
36:24
Yeah, so we carefully designed the compatibility between the Greenblum database and the Postgres. So I think there is no gap feature. Just the modules are polarized in Greenblum database,
36:40
but the feature is not different from each other. Okay, Jeff. Yeah, good question. So yeah, PLR is good for the statistics method.
37:05
And actually, inside Greenblum, we use PLR heavily, but PLR is still not as polarized way. So it's just running locally, and it runs in memory. Okay, right? And the MALDI runs like, so writing the data
37:23
to the database and get back to the calculation. So if the data gets bigger, then PLR may run out of memory. The query for Postgres, how much would that improve with MALDI?
37:44
Should people, is this yet another reason to do parallel query? In Postgres? Yeah. Yeah. Yeah, actually yesterday, in the meeting, we are talking about how to parallelize
38:02
the Postgres query as a single node. But I think the problem here is a little different from the multi-node parallel query and the Postgres parallel query. So I'm not sure what kind of parallelization, because parallelization is a lot of,
38:21
types of problem. So for example, it's only, I mean, you can just parallelize the building sort, or if you want to parallelize the aggregate function, then it's a different problem. So I'm not sure what kind of parallelization
38:40
is going to get into the Postgres. But yeah, if Postgres, you have some parallel processing, then MALDI should, to consider how to parallelize, how to scale. Yes? Yeah, are you thinking about adding random forest
39:02
to this sub-engineering algorithm? Sorry? Random forest? Random, yes. Yes? For the nebs of B zero four, we will have random forest based on the decision tree. Right. Yes?
39:20
These functions cause a block, I'm assuming while the thing is running. So it's called function that generates two tables out there. It blocks until it does that. Yeah, so the inside function, it create table, it make create a temporary table, and there'll be some table, but the output is like two tables or more. Actually, it's software currency,
39:41
like having static table names. If you call it two functions, you do different sessions, you don't have the same global overload. Okay, yeah. So, yes, so we designed this kind of problem. We careful to not damage the unexpected result, but yeah, you need to careful.
40:03
So, for example, decision tree modules creates a few tables, but the name is based on the input table. So it just adds some suffix. Do you generate graphics in your function calls, charts? No, so visualization,
40:21
oh, okay, so visualization is our next step. So, basically, Modlib doesn't have any visualization tool, but decision tree has some text-based decision tree representation, like explain analyze.
40:44
Okay. Yes, language?
41:07
So parallelization here is based on Greenbaum parallelization. So as a language, we don't have any idea to do parallelization in language level.
41:33
I see, I see. I think that's a good idea to, for example,
41:42
say using pure proxy inside the modeling function, then parallelize the unmodified prosperous, and yeah, we may parallelize the query. Okay, no other questions?
42:07
So I think that's all. Thank you.