Merken

# Apache MADlib

#### Automatisierte Medienanalyse

## Diese automatischen Videoanalysen setzt das TIB|AV-Portal ein:

**Szenenerkennung**—

**Shot Boundary Detection**segmentiert das Video anhand von Bildmerkmalen. Ein daraus erzeugtes visuelles Inhaltsverzeichnis gibt einen schnellen Überblick über den Inhalt des Videos und bietet einen zielgenauen Zugriff.

**Texterkennung**–

**Intelligent Character Recognition**erfasst, indexiert und macht geschriebene Sprache (zum Beispiel Text auf Folien) durchsuchbar.

**Spracherkennung**–

**Speech to Text**notiert die gesprochene Sprache im Video in Form eines Transkripts, das durchsuchbar ist.

**Bilderkennung**–

**Visual Concept Detection**indexiert das Bewegtbild mit fachspezifischen und fächerübergreifenden visuellen Konzepten (zum Beispiel Landschaft, Fassadendetail, technische Zeichnung, Computeranimation oder Vorlesung).

**Verschlagwortung**–

**Named Entity Recognition**beschreibt die einzelnen Videosegmente mit semantisch verknüpften Sachbegriffen. Synonyme oder Unterbegriffe von eingegebenen Suchbegriffen können dadurch automatisch mitgesucht werden, was die Treffermenge erweitert.

Erkannte Entitäten

Sprachtranskript

00:05

changing all the time and you can read and then at the end of the world and I think of the of the constants the in the end of the year and the use of a the title of the book and some of you did so and so on and on and and the and the Senate and the length of of the they were much if you don't like everyone was was a colony and that the only thing that we want to know how the tool that you know In the 1st half of the root of the word in a key tool for my good to start the times of that is that the 2 images you this is what I do for a living so they have quite a while the people from various backgrounds were distributed systems and a lot of the computer science and and I suppose that there were looking for interesting projects over the years and found that we heard you liked look in this area of logic of but you still have living so you that number 2 is that the very large commercial enterprise here using relational database using data which is arranged in light of form of which is what and so you could but for those of you who don't know that is also the goal so you put those 2 together is the equation that you know which is that of 1 plus money it was not so that's the material I like you so in particular the part that's part of to the talk about the during to part you were in format that is so a new source project and then look at the database machinery in which you know the the architecture of the management and then talk about some of the skills so the power in the of provokes a in the to get so let's start

03:16

with the and the touring to use of the that and I like the start of the history of the so this is

03:26

sort of the history of all post grasses it's hard in the and I felt a little celebrated multiple processes or so this is a time of the presence of God couple days that I think are interesting and this time I 1 wonder what is that was added the workers in all the years later the other thing that was the of the and so in the 2000 the by the there a budget of culturing from and who thought about all of that data inside of grass and a possible to make a distributed solution may be massively parallel processing and so

04:22

they both work of those that's the point to the version that doesn't have the feel this massively-parallel losses in the engine on top of now about the very of interesting and the problem that you were found about 2 here's years where the a lab and they realize that you have now hold this terrible computing capability

04:52

with the database and it is going to add machine learning component to it right so the idea is that you don't move the data out of the database operated on property owners in some

05:07

experts from that in fact if you want to do everything in it and so that is the advent of the environment which was launched in 2011 so shortly after that we have to do if it follows that I. Greenwald and later and I said why don't we take this matter with processing parallel processing engines on so local storage and

05:36

distributed I also the capabilities of this unit you're going to the ecosystem can be added to the whole heart which is a Apache hot later the this in the because of that you the knowledge about the heart and the continued I told you about in the useful version of these are all research projects that you can work hard and that there are no such thing you if so so there's we have the result

06:23

of the interesting part of collaboration with industry and then some the history as well as academia so the project was actually where he published University in Berkeley University 1st guy act indeed there is a lowering of the architecture and you realize that only because I couldn't stand with Stanford University School of Medicine so you were so why is the life that she might like to think that just of these the of wealth is because it is that this really a great place to be used by the developers come together to work in a collaborative way on software and clearance of the transparency of our dataset so if you have a research project that you can a share of this happens and you that here you know this is groups who what if you don't have it here the linking the of that the other using that as the knowledge of the so called back around that it's important community for a of this slide post-listening using while in the snow and projects and if you want to know about that should be the some sum of the village really the source of the building itself to appear rule has taken out of heart the with all the motion products would well through the the database and the Apache the the the when you have these are all the open source in the last of last year's go all of which the sum of the that was a little bit of

08:53

history is that there were about like a tale of that and the talk of because of the world we show that the use of scalable in From the runs in today's as in the rest of you agree in principle the base as well as touching the heart and to you in a matter of power and scalability for so if you have the physical memory of a the nodes of the utility of the lots of other solutions to the right and on the the of the variety the what do you what do you think that you know and every word in the sense that it's aware that is the other thing is that but if you want so the performance of the sense that if you have a large dataset work faster rather the of where the weights and the of any such that you know you're working on large datasets and you get your results but so these are the functions that exist in Annapolis they said severely injured libraries in the world of 35 40 possible functions of the problems of not less so now we lost 5 over 5 years you see the the expected of nearest provides thing while the unsupervised learning of involved in over all of the work that and in fact this is the focus of a recent development in the state of the art of the time that they stayed in the area of the feature extraction and what have you the words you using the likelihood of focus and the application and talking about the other solves owls however I think in general if you want to know a lot of work and getting ready to go to such we started doing more irritable models in particularly in the last 6 months or so that all of the matrix operations as well as the inside of that function the function of interest in the matter of creating it M so the features of the future power better parallelism and the key thing she said that it is a sequel based that is designed to take advantage of this method would parallel processing architecture in the as well as of distributed and so on the scalability and by designing the algorithm Scalability datasets that don't change your software as you it's a bigger so if you have a universal dataset you know your just want to test for example and then you can run that is well on its way in the direction of the to change the facts in some really wonder what it's running of the another key thing but the idea is that if you do look at the dates I just kind of like this company right in the area of data sets that that you mentioned that high cardinality by writing things that you want to find a way to write the embedding Portuguese example in 1st so that the idea of all the data so these are the supported platforms that matches the people talk about it was so all then we have a field of the culture of jobs lost in the elite elitist this is the scaling by what the size you will say so the exactitude variables and then on the Y axis is the use right so this is a linear regression and going over the top right hand side you have a guest segments the vote so the to 2nd the double respects so you come down to of the red dots dot on the right thank you that you might have and if you don't have a lot of reports and then to move or 1 of them again in your life is to have this just to shows a year of our scalability losses With respect to regression with respect to scaling by the size this shows linear regression scalability or 10 million of what we have in this this means that it will not be so the part of this is just the direction around the world so this is where the the the right of the sequel that this is the area of the book but this is how you came all for example the linear regression here and predicting the price of houses of given some historical data houses in out of the impact of that through size the train that then if I want you prediction the results from that was and again i call and again unless the statement by the the idea of prediction of based on the results of the tree it's very

16:15

easy to I like to talk a little bit architecture be so many machine learning problems are in in nature and the picture truth is that a bad so that was accommodation type the process that we have this and relations all layer of which I the the actual core of the the of the of the of the of the what's 1 of the simplest possible but if you don't later the don't look at how scalability and we were over this in this and regression so each of them we have crafted for distributed part the guys from will although that you can't just take it out of which for a single node time to the so we developed the need to think about how the cell to the loss of this and the so we have a few so that example of they don't have sort of a straight line at the heart of the universe of there we wanna find essentially seek to modify the some of the way and so ordinary squares of reason was like this so we set up the matrix of that we monitor and distributed to think well I like going to be like that nobody in the mind the Indian work to not just by the the transpose savings circle of the the so you from the research in future but I see that actually operating costs here because they can see the difference right so that's an increase of the only you that was used to work out what the would the kind of everything that's the problem with the if you have look at the algebra you can see that actually decomposable right you can see a square the square in the had you could be separated those out you could do everything every operation was 1 not all the operations on the other hand and then just like them it turns out you can do that using something called like of product the inner product the idea here is that you can see all the operations on the the inside of the 2nd line the node of the right side of the but this is kind of like to think about posing and learning algorithms for this the that kind of idea for a 2nd and they did that offers the 1 thing to the University of great together because the lecture this time not

20:03

every data science sample and

20:07

so many women in science is more so than the heart of the matter will allow you that your and that's what you're quite keen on how to actually solve this distributed and keep did back in the scale where we work is that the the right regular or light from the inverse of the sequel to execute in database has been there and then returns the results in fact all of the data you will the variance from of the results that we have to on so just to finish up so what's coming

21:01

out of the edges of them next most of the areas that we focused on here is the very fact that support vector machines and approval nonlinear kernels and we've added more material of utilities that you need the operations cost functions that means the text of a little of functionality in the future as a learning theory is a theory of the whole lot for nearly all these of actual predictive of models the year and terms of usability and thinking about the size of this set of so they were prepared and you're more than welcome to participate in the project in the links on the web site where the sample size is a bit of a list of and the check it out in the questions that you you the fact that if I'm going to think of it is way the it is and they have the same of the 1st and the last day of the land is in the form of a theory of the by I have to use have 1 of the of the of the thing you think of the the of the the and in some of the things that have that God I think that we do with him in the the an what users go in the

00:00

Zahlenbereich

Gleichungssystem

Maschinelles Lernen

Mathematische Logik

Computeranimation

Datenhaltung

Bildschirmmaske

Datenmanagement

Auflösbare Gruppe

Wurzel <Mathematik>

Informatik

Bildgebendes Verfahren

Leistung <Physik>

Relationale Datenbank

Transinformation

Datenhaltung

Quellcode

Physikalisches System

Konstante

Flächeninhalt

Mereologie

Dateiformat

Projektive Ebene

Wort <Informatik>

Computerarchitektur

Unternehmensarchitektur

03:14

Software

Prozess <Physik>

Grundsätze ordnungsmäßiger Datenverarbeitung

Maschinelles Lernen

GRASS <Programm>

Quick-Sort

Datenhaltung

04:17

Einfügungsdämpfung

Punkt

Kategorie <Mathematik>

Datenhaltung

Versionsverwaltung

Virtuelle Maschine

IRIS-T

Maschinelles Lernen

Computerunterstütztes Verfahren

Systemzusammenbruch

Menge

Digitale Photographie

Datenhaltung

Hypermedia

Virtuelle Maschine

ATM

Makrobefehl

Demoszene <Programmierung>

Zusammenhängender Graph

05:05

Resultante

Expertensystem

Datentyp

Elektronischer Programmführer

Versionsverwaltung

Bildauflösung

Maschinelles Lernen

Übergang

Dateiformat

Datensichtgerät

Datenhaltung

Rahmenproblem

Einheit <Mathematik>

Funktion <Mathematik>

Standardabweichung

Projektive Ebene

Parallele Schnittstelle

Programmierumgebung

Gammafunktion

06:18

Resultante

Abstimmung <Frequenz>

Einfügungsdämpfung

Prozess <Physik>

Gewichtete Summe

Gemeinsamer Speicher

Skalierbarkeit

Gruppenkeim

Kartesische Koordinaten

Fortsetzung <Mathematik>

Richtung

Netzwerktopologie

Metropolitan area network

Prognoseverfahren

Algorithmus

Skalierbarkeit

Prozess <Informatik>

Lineare Regression

Parallele Schnittstelle

Lineares Funktional

Zentrische Streckung

Befehl <Informatik>

Topologische Einbettung

Datenhaltung

Gebäude <Mathematik>

Systemaufruf

Quellcode

Biprodukt

Rechenschieber

Kollaboration <Informatik>

Datenfeld

Menge

Rechter Winkel

Festspeicher

Unüberwachtes Lernen

Projektive Ebene

Varietät <Mathematik>

Aggregatzustand

Lineare Abbildung

Wellenpaket

Gewicht <Mathematik>

Mathematisierung

Maschinelles Lernen

Systemplattform

Datenhaltung

Open Source

Knotenmenge

Informationsmodellierung

Variable

Software

Programmbibliothek

Softwareentwickler

Gammafunktion

Leistung <Physik>

Videospiel

Lineare Regression

Matrizenring

Open Source

Likelihood-Funktion

Softwarewerkzeug

Schlussregel

Fokalpunkt

Skalarprodukt

Flächeninhalt

Mereologie

Wort <Informatik>

Computerarchitektur

Verkehrsinformation

16:14

Algebraisches Modell

Lineare Abbildung

Einfügungsdämpfung

Bit

Subtraktion

Prozess <Physik>

Natürliche Zahl

Skalierbarkeit

Zellularer Automat

Maschinelles Lernen

Systemzusammenbruch

Datenhaltung

Virtuelle Maschine

Knotenmenge

Skalierbarkeit

Algorithmus

Lineare Regression

Stichprobenumfang

Datentyp

Grundraum

Parallele Schnittstelle

Gerade

Implementierung

Nichtlinearer Operator

Lineare Regression

Kreisfläche

Relativitätstheorie

Einfache Genauigkeit

Biprodukt

Skalarproduktraum

Quick-Sort

Quadratzahl

Rechter Winkel

Mereologie

Speicherabzug

Quadratzahl

Computerarchitektur

20:07

Resultante

Schnittstelle

Bit

Web Site

Maschinelles Lernen

Fortsetzung <Mathematik>

Term

Physikalische Theorie

Kernel <Informatik>

Datenhaltung

Informationsmodellierung

Bildschirmmaske

Stichprobenumfang

Nichtlineares System

Lineares Funktional

Zentrische Streckung

Nichtlinearer Operator

Benutzerfreundlichkeit

Datenhaltung

Inverse

Softwarewerkzeug

Support-Vektor-Maschine

Binder <Informatik>

Digitale Photographie

Flächeninhalt

Grundsätze ordnungsmäßiger Datenverarbeitung

Projektive Ebene

23:40

Speicherabzug

Computeranimation

### Metadaten

#### Formale Metadaten

Titel | Apache MADlib |

Untertitel | Distributed in Database Machine Learning for Fun and Profit |

Serientitel | FOSDEM 2016 |

Teil | 27 |

Anzahl der Teile | 110 |

Autor | McQuillan, Frank |

Lizenz |
CC-Namensnennung 2.0 Belgien: Sie dürfen das Werk bzw. den Inhalt zu jedem legalen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen. |

DOI | 10.5446/30933 |

Herausgeber | FOSDEM VZW |

Erscheinungsjahr | 2016 |

Sprache | Englisch |

#### Inhaltliche Metadaten

Fachgebiet | Informatik |