Speed Up Your Data Processing

EuroPython

Ong, Chin Hwee

Formal Metadata

Title

Speed Up Your Data Processing

Subtitle

Parallel and Asynchronous Programming in Data Science

Series Title

EuroPython 2020

Number of Parts

130

Author

Ong, Chin Hwee

License

CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, adapt, copy, distribute, and make publicly available the work or its content, in unchanged or adapted form, for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner they specify and that you pass on the work or its content, including in adapted form, only under the terms of this license.

Identifiers

10.5446/49953 (DOI)

Publisher

EuroPython

Release Year

2020

Language

English

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

In a data science project, one of the biggest bottlenecks (in terms of time) is the constant wait for data processing code to finish executing. Slow code and connectivity issues affect every step of a typical data science workflow, whether in network I/O operations or in computation-driven workloads. In this talk, I will share common bottlenecks in data processing within a typical data science workflow, and explore the use of parallel and asynchronous programming with the concurrent.futures module in Python to speed up your data processing pipelines, so that you can focus more on getting value out of your data. Through real-life analogies, you will learn about:

1. Sequential vs parallel processing
2. Synchronous vs asynchronous execution
3. Network I/O operations vs computation-driven workloads in a data science workflow
4. When parallelism and asynchronous programming are a good idea
5. How to implement parallel and asynchronous programming using the concurrent.futures module to speed up your data processing pipelines

This talk assumes a basic understanding of data pipelines and data science workflows. While the main target audience is data scientists and engineers building data pipelines, the talk is designed so that anyone with a basic understanding of the Python language can understand the illustrated concepts and use cases.
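
To illustrate the contrast between sequential and concurrent execution that the abstract describes, here is a minimal sketch using the concurrent.futures module. It is not taken from the talk itself: fetch_record is a hypothetical stand-in that simulates a slow network call with time.sleep, and the helper names and worker count are assumptions chosen for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_record(record_id):
    """Hypothetical stand-in for a network I/O call (e.g. an API request)."""
    time.sleep(0.1)  # simulate time spent waiting on the network
    return record_id * 2


def run_sequential(record_ids):
    """Process the records one after another."""
    return [fetch_record(record_id) for record_id in record_ids]


def run_threaded(record_ids, max_workers=5):
    """Process the records concurrently in a pool of worker threads.

    executor.map returns results in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch_record, record_ids))


def timed(func, *args):
    """Return the function's result together with its elapsed wall time."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    ids = list(range(5))
    _, sequential_seconds = timed(run_sequential, ids)
    _, threaded_seconds = timed(run_threaded, ids)
    print(f"sequential: {sequential_seconds:.2f}s, threaded: {threaded_seconds:.2f}s")
```

Threads suit this case because the simulated work is mostly waiting. For computation-driven workloads, ProcessPoolExecutor from the same module exposes the same map interface and is usually the better fit in CPython, since it runs the work in separate processes.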