We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Using BigBench to compare Hive and Spark

Formale Metadaten

Titel
Using BigBench to compare Hive and Spark
Untertitel
BigBench in Hive and Spark
Alternativer Titel
Using BigBench to compare Hive and Spark versions and features
Serientitel
Anzahl der Teile
611
Autor
Lizenz
CC-Namensnennung 2.0 Belgien:
Sie dürfen das Werk bzw. den Inhalt zu jedem legalen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen.
Identifikatoren
Herausgeber
Erscheinungsjahr
Sprache
Produktionsjahr2017

Inhaltliche Metadaten

Fachgebiet
Genre
Abstract
BigBench is the brand new standard for benchmarking and testing Big Datasystems. This talk first introduces BigBench and how problems can it solve.Then, presents both Hive and Spark benchmark results with with theirrespective 1 and 2 versions under different configurations. Results arefurther classified by use cases, showing where each platform shines (ordoesn't), and why, based on performance metrics and log-file analysis. BigBench is the brand new standard (TPCx-BB) for benchmarking and testing BigData systems. The BigBench specification describes several application usecases combining the need for SQL queries, Map/Reduce, user code (UDF), MachineLearning, and even streaming. From the available implementation, we can testthe different framework combinations such as Hadoop+Hive+Tez (with Mahout) andSpark (SparkSQL+SparkML) in their different versions and configurations,helping us to spot problems and possible optimizations of our data stacks. This talk first introduces BigBench and how problems can it solve. Then,presents both Hive and Spark benchmark results with with their respective 1and 2 versions under different configurations including: Tez, LLAP, and fileformats. Experiments are run on Cloud and On-Prem clusters of differentnumbers of nodes and testing data scales, taking into account interactive andbatch usage. Results are further classified by use cases, showing where eachplatform shines (or doesn't), and why, based on performance metrics and log-file analysis. The talk concludes with the main findings, the scalability andlimits of each framework.