Using BigBench to compare Hive and Spark

FOSDEM VZW

Poggi, Nicolas; Montero, Alejandro

Formal Metadata

Title

Subtitle

BigBench in Hive and Spark

Alternative Title

Using BigBench to compare Hive and Spark versions and features

Title of Series

FOSDEM 2017

Number of Parts

611

Author

Poggi, Nicolas

Montero, Alejandro

License

CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Identifiers

10.5446/42409 (DOI)

Publisher

FOSDEM VZW

Release Date

2018

Language

English

Production Year

2017

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

BigBench is the brand new standard (TPCx-BB) for benchmarking and testing Big Data systems. The BigBench specification describes several application use cases combining the need for SQL queries, Map/Reduce, user code (UDF), Machine Learning, and even streaming. From the available implementation, we can test different framework combinations such as Hadoop+Hive+Tez (with Mahout) and Spark (SparkSQL+SparkML) in their different versions and configurations, helping us to spot problems and possible optimizations of our data stacks.

This talk first introduces BigBench and the problems it can solve. It then presents Hive and Spark benchmark results for their respective 1 and 2 versions under different configurations, including Tez, LLAP, and file formats. Experiments are run on Cloud and On-Prem clusters with different numbers of nodes and test data scales, taking into account interactive and batch usage. Results are further classified by use case, showing where each platform shines (or doesn't), and why, based on performance metrics and log-file analysis. The talk concludes with the main findings and the scalability and limits of each framework.