Data Analytics with MySQL, Apache Spark and Apache Drill

FOSDEM VZW

Smirnova, Sveta; Rubin, Alexander

Formal Metadata

Title

Data Analytics with MySQL, Apache Spark and Apache Drill

Title of Series

FOSDEM 2017

Number of Parts

611

Author

Smirnova, Sveta

Rubin, Alexander

License

CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Identifiers

10.5446/41959 (DOI)

Publisher

FOSDEM VZW

Release Date

2018

Language

English

Production Year

2017

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

Apache Spark is a cluster computing framework, similar to Apache Hadoop. There are a number of tasks where MySQL does not show great performance: for example, MySQL is not a massively parallel system, and a single query will only utilize 1 CPU core. Spark, on the other hand, is designed to be massively parallel; in addition, Spark is a clustering framework, so you can easily add more compute nodes so that Spark can utilize more resources and scale. Apache Drill is a similar project that aims to make data discovery easier. For example, it allows you to join data sources in MySQL, MongoDB, flat files, other RDBMSs, etc. In this talk I will demonstrate how to use Apache Spark together with MySQL for data analysis. I will show how Apache Spark aggregates data (Wikipedia page view statistics) and stores the result set in MySQL. I will also show how to use Apache Spark with multiple sources and join virtual tables from MySQL, flat files and even MongoDB.