Using Hadoop as a SQL Data Warehouse
Formal metadata
Title: Using Hadoop as a SQL Data Warehouse
Series: FOSDEM 2016
Part: 40 of 110
License: CC Attribution 2.0 Belgium: You may use, modify, copy, distribute, and make the work or its contents publicly available, in changed or unchanged form, for any legal purpose, provided you credit the author/rights holder in the manner they specify.
Identifier: 10.5446/30952 (DOI)
Language: English
Transcript: English (automatically generated)
00:05
All right, so continuing on the line of thought that Kyle has put forward, that SQL lives, I would like to introduce my colleague, Lei Cheng, who will be talking about the latest addition to the Apache Software Foundation Incubator, a project by the name of
00:22
Apache HAWQ, which also happens to be a project that could potentially trace its lineage to one of the oldest and best known open source projects. So with that, let's take it away. Thank you very much. So today I'd like to talk a little bit about Apache HAWQ.
00:43
So this is the agenda of today's talk. First I will give an introduction to Apache HAWQ, including the background, the main features, the current status, and the latest
01:02
development. And I also want to talk a little bit about the HAWQ tutorial, which will be ready in about this quarter, around early April. Finally I will give a brief introduction about how to contribute to the community. So HAWQ is a native SQL-on-Hadoop engine.
01:24
So in order to give you a basic sense of what it is, I will first give you a demo. And you can see here, actually I installed two nodes here.
01:43
So one is the master segment and the other is actually the slave segment. So this is the master segment and this is the slave segment. The segment is just like a single-node database. So HAWQ is like a multi-node Postgres on top of HDFS.
02:04
So now we have two nodes. Then actually you can use HAWQ just like Postgres. You can use psql to connect to HAWQ. Then you can show the tables in the database. You can also use backslash-l to show all the databases in HAWQ.
02:26
And for example, you can create tables, insert into them, and also execute some queries.
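For readers following along, here is a minimal psql-style sketch of the kind of session shown in the demo; the table t and its columns are invented for illustration, since HAWQ speaks PostgreSQL-flavoured SQL:

-- After connecting with psql, as in the demo:
\l                                   -- list all databases
\d                                   -- list tables in the current database

-- A made-up example table.
CREATE TABLE t (id int, name text);

INSERT INTO t VALUES (1, 'fosdem'), (2, 'hawq');

SELECT count(*) FROM t;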
02:52
And here I also preloaded a TPC-H table, lineitem. You can see, we can execute some complex queries here.
03:14
This is TPC-H Q1. You can see that there are a lot of operators here: there is a group by, there is sorting.
03:23
There are some sums, averages, and a lot of other stuff. This is a fairly complex query, and you can see that HAWQ can support it very easily. So this is a very short demo showing how to use HAWQ.
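For reference, this is approximately the standard TPC-H Q1 pricing summary query being described (with the usual 90-day parameter); it exercises grouping, sorting, and several aggregates:

SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity)                                       AS sum_qty,
    sum(l_extendedprice)                                  AS sum_base_price,
    sum(l_extendedprice * (1 - l_discount))               AS sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    avg(l_quantity)                                       AS avg_qty,
    avg(l_extendedprice)                                  AS avg_price,
    avg(l_discount)                                       AS avg_disc,
    count(*)                                              AS count_order
FROM lineitem
WHERE l_shipdate <= date '1998-12-01' - interval '90 days'
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;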
03:41
And so this slide shows the main features of HAWQ. You can see from here: it's SQL standard compliant. Actually, it supports SQL-92 and SQL-2003.
04:03
And it also supports OLAP extensions. So you can use JDBC and ODBC to connect to HAWQ, and then you can actually connect almost all of your applications. And internally, it has a cost-based optimizer. For the next talk, actually, we have our colleague, Addison,
04:21
who will introduce the cost-based optimizer in HAWQ. And internally, we have dynamic pipelining. Actually, this is the interconnect; the interconnect is the network protocol that connects all of the nodes together.
04:41
And it, of course, supports the Hadoop-native file formats. And it supports different kinds of compression methods, for example Snappy and a lot of other compression methods. And it also supports multi-level partitioning: list-based partitioning, range-based partitioning, and other partition mechanisms.
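As a rough illustration of the partitioning and compression features, a hedged sketch of what a range-partitioned, compressed append-only table might look like in HAWQ's PostgreSQL-flavoured DDL; the table, columns, and option values are invented, and the exact storage options should be checked against the HAWQ documentation:

-- Invented example: an append-only table with zlib compression,
-- range-partitioned by month on the sale date.
CREATE TABLE sales (
    sale_id   bigint,
    sale_date date,
    amount    numeric
)
WITH (appendonly = true, compresstype = zlib)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
(
    START (date '2016-01-01') INCLUSIVE
    END   (date '2017-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);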
05:01
And it is also transaction compliant; it supports full ACID. Actually, a lot of Hadoop solutions do not support transactions. HAWQ, from the beginning, supported transactions. Sometimes you may think that for analytics workloads, transactions are not necessary.
05:21
But actually, when you do some updates or load data into some tables, you will see that transactions are very, very important. For example, when you do a load and it fails, sometimes you will get a lot of garbage data. If we don't support transactions, you will get inconsistent data.
05:41
And finally, you would need to remove the garbage data by yourself, or manage it with your own code. It's a nightmare, actually, for developers. And HAWQ supports petabytes of data; it is petabyte scale. Internally, we have an elastic runtime
06:01
to support such a large scale. And on the fault tolerance side, we support multi-level fault tolerance. If some segments die, then HAWQ can continue querying the data by removing the failed segments. And it also supports master mirroring.
06:20
If your master dies, then the standby can take over. We also support online expansion. So if you want to add a node, it takes seconds; you can add a node in seconds, and you do not need any data redistribution. This is very important for management.
06:43
And we also support multi-tenancy. HAWQ is kind of multi-tenant; it is supported by hierarchical resource queues. Actually, in 2.0 we introduced this feature, and different business units
07:00
can create different resource queues, managing the CPUs, the memory, and the I/O. And it also supports fine-grained resource allocation mechanisms. And HAWQ is also extensible. So now, actually, HAWQ already supports a lot of formats. Natively, it supports the AO (append-only) format and the Parquet format.
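To make the resource queues and native storage formats just mentioned a bit more concrete, here is a hedged sketch in HAWQ-style DDL; the queue name, limits, table, and option spellings are invented or from memory and should be checked against the HAWQ documentation:

-- Hypothetical hierarchical resource queue for one business unit,
-- capping its share of cluster memory and cores.
CREATE RESOURCE QUEUE marketing_queue WITH (
    PARENT               = 'pg_root',
    ACTIVE_STATEMENTS    = 10,
    MEMORY_LIMIT_CLUSTER = 20%,
    CORE_LIMIT_CLUSTER   = 20%
);

-- Hypothetical table stored in the Parquet format with Snappy compression.
CREATE TABLE clicks (
    ts      timestamp,
    user_id bigint,
    url     text
)
WITH (appendonly = true, orientation = parquet, compresstype = snappy);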
07:22
And it also supports a lot of other formats, sequence files, Avro files, JSON, through PXF, that is, the Pivotal Extension Framework. And if you want to support a new format, you can write a plug-in for it. And it also supports multi-language user-defined functions.
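A hedged sketch of how external data is typically reached through PXF, via an external table; the host, port, path, and profile name follow the general PXF pattern but are placeholders here and should be verified against the PXF documentation:

-- Hypothetical external table reading delimited text from HDFS via PXF.
CREATE EXTERNAL TABLE ext_events (
    event_time timestamp,
    user_id    bigint,
    payload    text
)
LOCATION ('pxf://namenode:51200/data/events?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

-- Query it like any other table.
SELECT count(*) FROM ext_events;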
07:42
It supports Python, Perl, Java, C, and other languages for user-defined functions. And actually, it also ships a built-in data science library, MADlib. So these are the main features of HAWQ. And this is how we position HAWQ compared to a lot of other
08:02
SQL-on-Hadoop solutions. You can see that there are two dimensions. The first dimension is about performance and SQL compliance. You can see a lot of traditional data warehouses, for example from IBM or Vertica. They have SQL-compliant
08:23
engines, and they are kind of high performance. And of course there are some newer SQL-on-Hadoop solutions, for example Impala, Hive, or Spark SQL. They have limited performance. And from the SQL standard side, typically, if you want to run
08:41
TPC-H or TPC-DS queries, you actually need to rewrite the queries; they are not standard compliant. And the other dimension is about Hadoop-nativeness and openness. Actually, HAWQ now is an Apache project. It's completely open. And it is Hadoop-native.
09:00
It's integrated with HDFS, YARN, and Ambari natively. And it is highly scalable. So from the two dimensions, you can see that HAWQ is the only one in the top-right quadrant. This is how we position HAWQ among the SQL-on-Hadoop solutions.
09:21
And this is how we typically deploy HAWQ. There are several master nodes. Typically, HAWQ has several master nodes: there is the HAWQ master and the HAWQ standby master. And there are a lot of slave nodes, that is, compute nodes. On each compute node, you can see that
09:41
there are one or two physical segments. And when a query comes, HAWQ starts a lot of query executors, that is, QEs, to execute the query. And on each slave node, you can also see there is a data node, an HDFS data node. Actually, typically we colocate the data node and
10:01
the HAWQ segment together to get better data locality. And on a slave node, we can also see there is a node manager, a YARN node manager, which manages the resources of the cluster. And on the master side, there is a name node and there is also a catalog service. Currently, the catalog service is
10:21
colocated with the HAWQ master. Actually, we abstract the catalog services into a separate service internally. And there is also a YARN resource manager, which you can configure in a high-availability way. So this is a typical deployment of HAWQ.
10:40
And this slide shows the HAWQ high-level architecture. You can see that on the top side are the components in the master nodes. And the client, it can be psql, it can be JDBC or ODBC, or any other application. And when a query comes,
11:01
it first goes to the parser and analyzer. After the parser and analyzer, we get a query tree; the query tree will then be passed to the optimizer. And the optimizer will produce an optimal plan for the query. After the query plan is calculated,
11:21
then it goes to the dispatcher. On the dispatcher side, it first gets the resources available in the system. It runs together with YARN: it will go through the resource manager
11:40
and through the resource broker and get the resources from the YARN side. Obtaining resources and returning resources are completely dynamic, that is, according to the query cost. And after we get the resources, we know which resources are available and
12:01
where we should execute the query. Then we dispatch the query to the segments to execute. You can see that on the bottom side are the segment nodes. And on each segment host, there is a physical segment. You can look at this physical segment just as a single-node database.
12:21
And actually, during the execution, we start multiple virtual segments on each physical segment. For example, suppose we have a bigger query, and we have 1,000 nodes, but we want the degree of parallelism to be 10,000.
12:41
So what can we do? We have only 1,000 nodes, right? Here, we start 10 virtual segments on each node. So you can achieve a 10K degree of parallelism and execute the query in a massively parallel way.
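Put as simple arithmetic, that example works out to:

1,000 physical nodes x 10 virtual segments per node = 10,000 query executors running in parallel.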
13:01
And so, between the physical segments, there is an interconnect. For the interconnect, we actually have two implementations: one is TCP-based and the other is UDP-based. You may wonder why we need a UDP interconnect.
13:23
Actually, in the early versions, we had a TCP-based interconnect. Eventually we found that TCP has a lot of limitations. You can think of it this way: if we have 1,000 nodes and each query executor on a node communicates with the executors on every
13:41
other node, then on each node there can be on the order of 1,000,000 connections. If one node has 1,000,000 connections, the node will effectively stop; you cannot even log into it. So TCP has a kind of scalability bottleneck.
14:01
So we developed a UDP-based interconnect to solve this issue. The interconnect is one of the main things enabling parallelism in the database. And here, for external systems, we use PXF,
14:21
that is, the Pivotal Extension Framework, to connect to external systems and access data. For example, it can be HBase, it can be Oracle, it can be any other database or any other system. So this is the overall architecture of HAWQ. And this is the basic query execution flow.
14:43
So after a query comes, it first goes to the parser. After the parser, we get a parse tree. And the parse tree is given to another component, the analyzer. The analyzer will do some semantic analysis,
15:02
for example, checking whether the columns accessed in the query exist, and also some other semantic checks. After the analysis, the plan will be generated by the planner. And the plan will be given to the dispatcher,
15:21
and the dispatcher will work with the resource manager to get resources. Then the dispatcher will start the processes on different nodes. We also slice the plan: a plan can be sliced into several slices, and each slice will be
15:41
dispatched to the query executors for execution. This is the overall execution flow. All right, so I have just introduced the overall architecture and how a query works. And here is the HAWQ status,
16:01
including the history and current status. In 2011, we did a prototype, and at that time Hadoop had just become very popular, and a lot of enterprises began using Hadoop as their centralized storage system. So at that time,
16:21
Pivotal had the Greenplum database. As an MPP database, it is much faster than the Hadoop solutions, for example Hive and Pig and other solutions. But it has a shared-nothing storage architecture. So we thought, how can we combine
16:42
Hadoop and the MPP database. This is why we started the prototype. At that time, we called the prototype Greenplum Database on HDFS. After the prototype, we developed the alpha version. The alpha version was tried by a lot of customers.
17:00
After the customers tried it, they did a lot of benchmarks and compared the performance between HAWQ and Hive. At that time, HAWQ was actually four times faster than Hive. So that promoted HAWQ to a product.
17:20
So later we released HAWQ 1.0. Actually, we changed the architecture a lot: since now we have a kind of centralized storage on HDFS, a lot of things in the traditional MPP database are not necessary. So we changed the processing mechanism.
17:43
We changed the fault tolerance mechanism to make it more like a Hadoop system. And in 2013 to 2014, we had a lot of minor releases. We added Kerberos. We added Parquet support. We added a lot of other features.
18:02
And in 2015, it went to Apache. Currently, you can see on the Apache side, it's HAWQ 2.0 beta. And in 2016, it's very close to HAWQ 2.0 GA. And these are the features to be released in HAWQ 2.0 GA.
18:21
It's more than 10 features. Mostly, the key feature is the elastic execution runtime: that is, multiple virtual segments can be started on each physical segment. That is,
18:41
we can execute a query according to its cost, starting as many virtual segments as needed according to the query cost. Another is about resource management: now we have three-level resource management. We support dynamic expansion. We support a new dispatcher for the elastic runtime. We support a performance model. And we also
19:01
consolidated the storage model. And we also support block-level storage. And we also have an HDFS catalog cache to accelerate HDFS accesses. And we also support HCatalog integration; that is, you can query the tables in HCatalog very easily.
19:22
All right, so finally, I will take maybe three minutes to talk about how we can contribute to Apache HAWQ. To contribute to HAWQ, you can, for example, add some documentation.
19:41
You can edit the wikis. You can report some bugs, and you can provide fixes for the bugs. And you can also pick up some features. Here are the website, the wiki, the repos, and JIRA. And for the mailing lists, if you have any questions or ideas, you can send them to our dev mailing list or user mailing list.
20:02
So this is the code contribution process. It's very easy. You just need to create a JIRA, fork the GitHub repo, and clone your repo locally. Then you create a feature branch. After you finish your feature, you just open a pull request. The committers will work with you to pull your code in.
20:23
Details can be found on this wiki page. And there are a lot of interesting areas that you may consider contributing to. Currently, we are also working on these areas. Indexes are very important for point queries; now we are working on indexes.
20:42
And for the native tables, we also want to support update and delete. Hadoop is kind of an append-only file system; you cannot do update and delete on the files in the system. So supporting update and delete on the append-only tables is kind of challenging. Currently, we are working on this too.
21:01
And snapshots, geo-replication, and integration with the rest of the ecosystem, for example Spark, Flink, and other projects, are also interesting areas. And enhancing compiling and building on even more platforms is also an interesting area. So these are the basic steps to build and set up.
21:23
Due to the time limit, I will skip this. So these are the references for this talk. You can find a lot of materials and documents on our website. And you can also take a look at our publications at some research conferences,
21:40
for example, the SIGMOD conference and the VLDB conference. All right. That's it. Thank you. Any questions?
22:03
So, yeah, just a question. When would you suggest someone uses something like Greenplum versus when would you suggest they use HAWQ? So actually, typically, if customers are using Hadoop, we suggest you should use HAWQ, since HAWQ is kind of a native SQL engine on Hadoop.
22:22
If you just need a data warehousing solution, it's better to use Greenplum instead. Is it a native layer that can be deployed over an already running Hadoop file system, meaning can it just work
22:44
by connecting to a running Hadoop system? Actually, yes, you can deploy it to your existing Hadoop cluster. So currently, it's kind of,
23:02
you know, it's very independent. So you can deploy it on your existing HDFS or YARN; you can use it with Hadoop. Yep. Right here. In what way is it different from Presto? So Presto actually is also a SQL-on-Hadoop solution.
23:21
If you compare HAWQ with Presto, there are a lot of benefits already. For example, SQL compliance: with Presto, if you run TPC-DS or TPC-H, you need to rewrite a lot of, almost all, the queries. And from the performance side, HAWQ is much faster.
23:42
And from the other side, for example, integration with the existing Hadoop ecosystem, HAWQ is also better. For example, with YARN and Ambari integration and various other pieces, it's much easier for HAWQ to work with the existing system.
24:01
So the storage format is different from Parquet? Huh? Do you have a specific format? So on the storage side, actually, we have some native storage formats. They are also open source. You can, for example, use Spark to access the storage format.
24:20
One is a kind of append-only format, and the other is Parquet. You see, these are open source formats. It's not proprietary? No, it's not proprietary. The metastore? Oh, the metastore. Well, the metastore currently is on our master, on its local disk. It's not using the Hive one? You can use the Hive one.
24:40
Actually, we support HCatalog integration. You can use HCatalog to access the Hive catalog. And our own metadata, actually, is currently on local disk; it's just like the Postgres catalog. Any more questions? What about accessing the data?
25:03
Security-wise, is it all done through HAWQ? Can you say that again? So for accessing the data, can we just manage it through HAWQ, or should we manage the folders in HDFS?
25:22
Actually, the data is on HDFS. So if another system wants to access the data written by HAWQ, it can access it directly on HDFS. So it's a shared storage, actually not exclusively managed by HAWQ. Only the catalog is managed by HAWQ.
25:43
Security, we support security. For security, we support Kerberos and, you know, permissions and such. Are there any more questions?
26:07
What about ORC file format support? Do you want to support ORC or just forget it? Yes, it's kind of in our plan, actually. For ORC files, you can use PXF.
26:20
Actually, you can have external tables for accessing it now. And we also plan to support it just like native storage; you can see it on the roadmap. All right, thank you so much.