BigBench is the brand-new standard (TPCx-BB) for benchmarking and testing Big Data systems. The BigBench specification describes several application use cases that combine SQL queries, Map/Reduce, user code (UDF), machine learning, and even streaming. Using the available implementation, we can test different framework combinations, such as Hadoop+Hive+Tez (with Mahout) and Spark (SparkSQL+SparkML), in their different versions and configurations, helping us spot problems and possible optimizations in our data stacks. This talk first introduces BigBench and the problems it can solve. It then presents benchmark results for both Hive and Spark, in versions 1 and 2 of each, under different configurations including Tez, LLAP, and file formats. Experiments are run on cloud and on-premises clusters with different numbers of nodes and test data scales, taking into account both interactive and batch usage. Results are further classified by use case, showing where each platform shines (or doesn't), and why, based on performance metrics and log-file analysis. The talk concludes with the main findings and the scalability and limits of each framework.