Using BigBench to compare Hive and Spark

Formal Metadata

Title
Using BigBench to compare Hive and Spark
Subtitle
BigBench in Hive and Spark
Alternative Title
Using BigBench to compare Hive and Spark versions and features
Title of Series
Number of Parts
611
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Production Year
2017

Content Metadata

Subject Area
Genre
Abstract
BigBench is the brand new standard (TPCx-BB) for benchmarking and testing Big Data systems. The BigBench specification describes several application use cases combining the need for SQL queries, Map/Reduce, user code (UDFs), machine learning, and even streaming. From the available implementation, we can test different framework combinations, such as Hadoop+Hive+Tez (with Mahout) and Spark (SparkSQL+SparkML), in their different versions and configurations, helping us to spot problems and possible optimizations in our data stacks.
This talk first introduces BigBench and the problems it can solve. It then presents benchmark results for both Hive and Spark, in versions 1 and 2 of each, under different configurations including Tez, LLAP, and file formats. Experiments are run on cloud and on-premises clusters with different numbers of nodes and test data scales, taking into account interactive and batch usage. Results are further classified by use case, showing where each platform shines (or doesn't), and why, based on performance metrics and log-file analysis. The talk concludes with the main findings and the scalability and limits of each framework.