
#bbuzz: Building Petabyte Scale ML Models with Python

Formal Metadata

Title
#bbuzz: Building Petabyte Scale ML Models with Python
Title of Series
Number of Parts
48
Author
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract

Although building ML models on small/toy datasets is easy, most production-grade problems involve massive datasets that current ML practices don't scale to. In this talk, we cover how you can drastically increase the amount of data your models can learn from by using distributed data/ML pipelines.

Detailed Description: It can be difficult to figure out how to work with large datasets (which do not fit in your RAM), even if you're already comfortable with the ML libraries/APIs within Python. Many questions immediately come up: Which library should I use, and why? What's the difference between "map-reduce" and a "task graph"? What's a partial_fit function, and what format does it expect the data in? Is it okay for my training data to have more features than observations? What's the appropriate machine learning model to use? And so on.

In this talk, we'll answer all those questions, and more. We'll start by walking through the current distributed analytics (out-of-core learning) landscape to understand the pain points and some solutions to this problem. A system designed to achieve this goal (building scalable ML models) needs:

1. a way to stream instances
2. a way to extract features from instances
3. an incremental algorithm

(Minimal code sketches of this recipe follow the abstract below.)

Then we'll read a large dataset into Dask, TensorFlow (tf.data) and scikit-learn's streaming utilities, and immediately apply what we learned in the previous section. We'll move on to the model-building process, including a discussion of which model is most appropriate for the task. We'll evaluate our model in a few different ways, and then examine the model for greater insight into how the data influences its predictions. Finally, we'll practice this entire workflow on a new dataset, and end with a discussion of which parts of the process are worth tuning for improved performance.

Detailed Outline:
1. Intro to out-of-core learning
2. Representing large datasets as instances
3. Transforming data (in batches) – live code [3-5]
4. Feature engineering & scaling
5. Building and evaluating a model (on entire datasets)
6. Practicing this workflow on another dataset
7. Benchmarking other libraries for out-of-core (OOC) learning
8. Questions and answers

Key takeaway: By the end of the talk, participants will know how to build petabyte-scale ML models beyond the shackles of conventional Python libraries. Participants will have benchmarks and best practices for building such ML models at scale.
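
To make the three-part recipe above concrete, here is a minimal sketch using scikit-learn's out-of-core tools: a stateless HashingVectorizer for feature extraction and SGDClassifier's partial_fit for incremental training. The file name and column names are hypothetical placeholders, not taken from the talk materials.

```python
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless feature extractor: no full pass over the data is required.
vectorizer = HashingVectorizer(n_features=2**20)
# Linear model that supports incremental training via partial_fit.
model = SGDClassifier()
classes = [0, 1]  # all target classes must be declared on the first call

# 1. Stream instances in chunks that fit comfortably in RAM.
for chunk in pd.read_csv("huge_dataset.csv", chunksize=100_000):
    # 2. Extract features from each batch of raw text.
    X = vectorizer.transform(chunk["text"])
    y = chunk["label"]
    # 3. Update the model incrementally, one batch at a time.
    model.partial_fit(X, y, classes=classes)
```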
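
The talk also reads large datasets with Dask before model building. As a hedged illustration of how Dask represents a larger-than-RAM dataset as a lazy task graph over many partitions, consider the sketch below; the file pattern and column name are assumptions for illustration only.

```python
import dask.dataframe as dd

# Build a lazy task graph over many CSV partitions instead of
# loading everything into memory at once.
df = dd.read_csv("data/part-*.csv")

# Feature engineering/scaling is expressed lazily as well.
df["scaled_amount"] = (df["amount"] - df["amount"].mean()) / df["amount"].std()

# Nothing runs until .compute() is called; only the small aggregate
# result is materialized in memory.
print(df["scaled_amount"].describe().compute())
```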
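
For the tf.data path mentioned in the abstract, a comparable sketch streams a CSV from disk in batches rather than loading it up front. The file name, batch size and label column are again placeholder assumptions.

```python
import tensorflow as tf

# Stream batches of rows from a large CSV without reading it into memory.
dataset = tf.data.experimental.make_csv_dataset(
    "huge_dataset.csv",
    batch_size=1024,
    label_name="label",
    num_epochs=1,
)

# Inspect one batch: a dict of feature tensors plus a label tensor.
for features, labels in dataset.take(1):
    print({name: t.shape for name, t in features.items()}, labels.shape)
```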
Although building ML models on small/ toy data-set is easy, most production-grade problems involve massive datasets which current ML practices don’t scale to. In this talk, we cover how you can drastically increase the amount of data that your models can learn from using distributed data/ml pipes. Detailed Description: It can be difficult to figure out how to work with large data-sets (which do not fit in your RAM), even if you’re already comfortable with ML libraries/ APIs within python. Many questions immediately come up: Which library should I use, and why? What’s the difference between a “map-reduce” and a “task-graph”? What’s a partial fit function, and what format does it expect the data in? Is it okay for my training data to have more features than observations? What’s the appropriate machine learning model to use? And so on… In this talk, we’ll answer all those questions, and more! We’ll start by walking through the current distributed analytics (out-of-core learning) landscape in order to understand the pain-points and some solutions to this problem. Here is a sketch of a system designed to achieve this goal (of building scalable ML models): 1. a way to stream instances 2. a way to extract features from instances 3. an incremental algorithm Then we’ll read a large dataset into Dask, Tensorflow (tf.data) & sklearn streaming, and immediately apply what we’ve learned about in last section. We’ll move on to the model building process, including a discussion of which model is most appropriate for the task. We’ll evaluate our model a few different ways, and then examine the model for greater insight into how the data is influencing its predictions. Finally, we’ll practice this entire workflow on a new dataset, and end with a discussion of which parts of the process are worth tuning for improved performance. Detailed Outline: 1. Intro to out-of-core learning 2. Representing large datasets as instances 3. Transforming data (in batches) – live code [3-5] 4. Feature Engineering & Scaling 5. Building and evaluating a model (on entire datasets) 6. Practicing this workflow on another dataset 7. Benchmark other libraries/ for OOC learning 8. Questions and Answers Key takeaway By the end of the talk participants would know how to build petabyte scale ML models, beyond the shackles of conventional python libraries. Participants would have a benchmarks and best case practices for building such ML models at scale.