Using BigBench to compare Hive and Spark

FOSDEM VZW

Poggi, Nicolas; Montero, Alejandro

Formal Metadata

Title

Subtitle

BigBench in Hive and Spark

Alternative Title

Using BigBench to compare Hive and Spark versions and features

Series Title

FOSDEM 2017

Number of Parts

611

Author

Poggi, Nicolas

Montero, Alejandro

License

CC Attribution 2.0 Belgium:
You may use, modify, and reproduce, distribute, and make publicly available the work or its content, in unchanged or modified form, for any legal purpose, provided that you credit the author/rights holder in the manner they have specified.

Identifiers

10.5446/42409 (DOI)

Publisher

FOSDEM VZW

Publication Year

2018

Language

English

Production Year

2017

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

BigBench is the brand new standard (TPCx-BB) for benchmarking and testing Big Data systems. The BigBench specification describes several application use cases combining the need for SQL queries, Map/Reduce, user code (UDF), Machine Learning, and even streaming. From the available implementation, we can test different framework combinations, such as Hadoop+Hive+Tez (with Mahout) and Spark (SparkSQL+SparkML), in their different versions and configurations, helping us to spot problems and possible optimizations of our data stacks.

This talk first introduces BigBench and the problems it can solve. It then presents both Hive and Spark benchmark results for their respective 1 and 2 versions under different configurations, including Tez, LLAP, and file formats. Experiments are run on cloud and on-premises clusters with different numbers of nodes and test data scales, taking into account interactive and batch usage. Results are further classified by use case, showing where each platform shines (or doesn't), and why, based on performance metrics and log-file analysis. The talk concludes with the main findings and the scalability and limits of each framework.