The Painless Route in Python to Fast and Scalable Machine Learning
Formal Metadata
Title: The Painless Route in Python to Fast and Scalable Machine Learning
Number of Parts: 130
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/49927 (DOI)
Transcript: English (auto-generated)
00:06
So we're coming to the next talk. We have Victoria Fedotova. She is a leading software engineer at Intel. She works on optimization of data analytics and machine learning in the Intel oneAPI
00:23
Data Analytics Library. I think we will hear more about that. And then we have Frank Schlimbach. He is a software architect at Intel with an HPC background. The title of the talk is "The Painless Route in Python to Fast and Scalable Machine Learning".
00:42
So let's see if this is really painless. So hello, my name is Victoria Fedotova. In this talk, my colleague Frank Schlimbach and I will tell you how to create fast machine learning solutions that can scale from multiple cores on a single machine
01:02
to large clusters of workstations. Python is a very popular language for data analytics and machine learning, as you know; its superior productivity makes it the preferred tool for prototyping. However, as an interpreted language,
01:20
it has performance limitations that prevent it from being used in fields with high demands for compute performance, like production machine learning. If an organization wants to improve the performance of a machine learning solution, they often hire engineers to rewrite existing Python code into more performant C++ or Java code,
01:43
or even to re-implement the solution in a distributed fashion using MPI or Apache Spark. Also, many machine learning workloads are compute limited, meaning that the performance bottleneck
02:01
is in the compute part of the workload, not in the data loading. And for interpreted languages like Python, it is hard to achieve bare metal performance on such workloads. As a result, a typical data scientist analyzes
02:20
only a small portion of the data, the portion they think has the most potential of bringing great insights. This means they may miss out on insights from the remaining, bigger portion of the data, insights that may be crucial for the business. As a response to all those challenges, Intel created the Intel Distribution for Python.
02:43
The Intel Distribution for Python is a Python interpreter and a set of libraries for numerical and scientific computing. All of those packages are linked with high-performance libraries,
03:01
which provide close-to-native-code performance. As you might already have noticed, one of the ingredients of better performance is optimized libraries. Libraries such as NumPy and scikit-learn spend a lot of time in native computations.
03:22
This helps to significantly increase the performance of the machine learning solutions. Another key ingredient of good performance is just-in-time compilation.
03:41
Just-in-time compilation reduces the overheads that are introduced by the Python interpreter. Proper use of both of those ingredients helps you get the performance of Python code almost as good as that of native C++ code.
04:09
Data analytics and machine learning workloads usually consist of multiple steps. We usually start with data input, when we load the data from some data source,
04:20
like a file, a database, or some stream of data. After that, we pass the data to a processing stage where we prepare it for further use by a machine learning algorithm. Here, we can do some filtering, data transformations, feature extraction, and so on. After that, we train the model,
04:40
and with the trained model we usually do prediction or inference, to get insights from new, previously unseen data. Those stages can repeat in the process of the machine learning task, but all those stages have different demands
05:03
to perform effectively on modern hardware. Another source of complexity is the hardware itself. Each year we get new machines with more threads, wider vectors, and so on. The hardware becomes more sophisticated each year,
05:22
and even modern machine learning packages cannot effectively use all those features of modern hardware. So it is quite a complex task to have one machine learning
05:41
solution that implements all the stages of these data analytics and machine learning pipelines effectively and uses the hardware effectively. In this talk, we will talk about the Scalable Dataframe Compiler, the optimized scikit-learn, and the daal4py packages,
06:02
which help to optimize those end-to-end machine learning pipelines and bring performance to Python workloads. Let me now describe how Intel makes machine learning in Python faster with the Intel Data Analytics
06:21
Acceleration Library. DAAL is an acronym for Intel Data Analytics Acceleration Library. This library helps to optimize machine learning algorithmic building blocks by providing algorithms for all stages
06:43
of machine learning workloads, from data input to decision making. As you may already know, the default conda installation of scikit-learn already comes built against the Intel Math Kernel Library, which speeds up linear algebra operations
07:02
within machine learning algorithms. But to get the best performance, it is not enough just to optimize linear algebra, because for some algorithms, like decision trees or gradient boosting, there is little use of linear algebra. That's why we created DAAL.
07:23
When data comes to DAAL, it splits it into blocks and processes those blocks in parallel using the Intel Threading Building Blocks library. And of course, we use the Math Kernel Library for the mathematical parts of the algorithms,
07:41
like linear algebra, random number generation, vector algebra, and so on. All this leads to great performance. Here you can see a comparison of stock scikit-learn performance with Intel scikit-learn performance.
08:01
Both of these are compared to native code performance, which means here that 100% is the performance of the C++ DAAL algorithms. You can see that the Intel Python performance,
08:21
which is shown as blue bars, is almost everywhere greater than 80%, whereas scikit-learn installed from pip rarely reaches 5% of native code performance.
08:41
So let me now describe what you need to do to get all those impressive speed-ups on your system. First, you need to install the daal4py package from conda or from pip. After that, you have two options available. Either you can use the -m daal4py
09:02
command-line option to enable the optimization for all the scikit-learn algorithms available in daal4py; no code changes are needed here. Another option is to enable monkey patching
09:22
for the algorithms case by case, programmatically. For example, here we see this for the k-means algorithm, and that's it.
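As a rough sketch of those two options (the install commands, import path, and file name are assumptions based on how daal4py was distributed around the time of the talk, not taken from the slides):

    # install, e.g.:  conda install -c intel daal4py   (or: pip install daal4py)
    #
    # Option 1: accelerate all supported scikit-learn algorithms without code changes,
    # by launching the script through daal4py:
    #
    #   python -m daal4py my_sklearn_script.py
    #
    # Option 2: opt in per algorithm by importing the daal4py drop-in estimator
    # instead of the scikit-learn one (hypothetical script, same estimator API):
    import pandas as pd
    from daal4py.sklearn.cluster import KMeans  # used like sklearn.cluster.KMeans

    data = pd.read_csv("kmeans_dense.csv")      # file name is just an example
    model = KMeans(n_clusters=20, init="k-means++", max_iter=5).fit(data)
    print(model.cluster_centers_, model.labels_)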
09:45
On the left, you can see the list of algorithms that are currently supported in daal4py and are equivalent to the scikit-learn algorithms. This means that those algorithms produce numerically the same results as the scikit-learn algorithms. They pass all the scikit-learn compatibility tests, so you can use them and get the same results as with scikit-learn, but with better performance.
10:01
On the right, you can see the algorithms that have an API identical to the scikit-learn API, but which do not pass the compatibility tests. This doesn't mean those algorithms are incorrect; it is just hard for some algorithms to pass the compatibility tests.
10:22
For example, for random forest it is hard to build trees that have the same depths as the trees built by scikit-learn, due to the random nature of the algorithm. But we constantly work on increasing the number of algorithms on the left,
10:41
either by adding new algorithms or by moving algorithms from the right part to the left, making them pass the compatibility tests, which is actually the trickiest part for us. Now, let's talk about how to scale a machine learning solution to multiple nodes, to a cluster.
11:05
daal4py is a convenient Python API to Intel DAAL, and it contains a variety of algorithms that have close-to-native-code performance. Part of these algorithms have the ability to work in a cluster environment;
11:24
they have distributed implementations. For distributed execution, we use communication-avoiding algorithms, which spend most of the time in computation on the nodes, not in data transfer.
11:42
We try to design the algorithms this way, and this results in good scaling. For the transport layer, we use the MPI library. Both the DAAL and daal4py packages are open source,
12:02
and they are available on GitHub; you can take a look or even contribute. So, let's see what the daal4py API looks like. Here is the single-node API, in comparison with scikit-learn.
12:22
For example, to implement the k-means clustering algorithm in scikit-learn, you need to do some imports: import KMeans, and import pandas, which we will use here for data loading. We read the data from a comma-separated file. After that, we need to create a KMeans algorithm object
12:42
with 20 clusters. We use k-means++ as the initialization procedure and we set the maximum number of iterations to five. After that, we do the actual clustering, and the results, the cluster centers and labels, will be available
13:02
as attributes of the result object. Here is the daal4py API that does exactly the same and produces the same results. Here we also need to do some imports and we also use pandas in the same way. The difference is that for k-means in daal4py,
13:24
k-means is split into two algorithms. First, we need to run the k-means init algorithm to compute the initial centers of the clusters. We also use the k-means++ method here.
13:41
After that, we create a k-means algorithm with 20 clusters and five iterations and set the assign flag to true. This means that we also compute labels, or assignments, for our input samples. Then we run the compute method
14:02
with our input data and the initial cluster centers from the initialization algorithm. We get the same results as with scikit-learn, in the attributes of the result object.
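As a rough sketch (not verbatim from the slides; the parameter and attribute names are my reading of the public scikit-learn and daal4py APIs, so treat the details as assumptions), the two versions look roughly like this:

    # scikit-learn version
    import pandas as pd
    from sklearn.cluster import KMeans

    data = pd.read_csv("kmeans_dense.csv")
    kmeans = KMeans(n_clusters=20, init="k-means++", max_iter=5).fit(data)
    centers, labels = kmeans.cluster_centers_, kmeans.labels_

    # daal4py version: k-means is split into an init algorithm and the clustering itself
    import daal4py as d4p

    init = d4p.kmeans_init(nClusters=20, method="plusPlusDense").compute(data)
    result = d4p.kmeans(nClusters=20, maxIterations=5, assignFlag=True).compute(
        data, init.centroids)
    centers, labels = result.centroids, result.assignments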
14:23
You can see that the daal4py API is almost as simple as scikit-learn's, maybe a little more verbose, but not overly verbose. Now let's see how this code changes if you need to do multi-node computations, what will change in this case.
14:43
Here is the comparison. I highlighted in yellow the differences between the single-node and the multi-node code. You need to do some additional imports, which you will need during multi-node computation. First, you need to initialize
15:02
the distributed environment. After that, in this particular example, we use different file names on different nodes: on the first node we have the kmeans dense 1 CSV file, on the second node kmeans dense 2, et cetera.
15:23
So we use the node ID to get the name of the file on each node. And finally, we need to provide an additional parameter to all the algorithms, which says distributed equals true, and that's it.
15:42
These are all the changes you need to make to implement the distributed machine learning algorithm. The command line to run this script changes as well: we use the mpirun utility here to run it on four cluster nodes.
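A minimal sketch of the multi-node version (helper names such as daalinit, my_procid and daalfini, and the file-name pattern, are assumptions on my part about daal4py's SPMD helpers; the per-node options are trimmed to the clustering itself):

    import pandas as pd
    import daal4py as d4p

    d4p.daalinit()  # initialize the distributed (MPI-based) environment

    # each node reads its own part, e.g. kmeans_dense_1.csv, kmeans_dense_2.csv, ...
    data = pd.read_csv("kmeans_dense_{}.csv".format(d4p.my_procid() + 1))

    # same algorithms as before, with distributed=True added
    init = d4p.kmeans_init(nClusters=20, method="plusPlusDense",
                           distributed=True).compute(data)
    result = d4p.kmeans(nClusters=20, maxIterations=5,
                        distributed=True).compute(data, init.centroids)

    d4p.daalfini()  # shut down the distributed environment

    # launched with the mpirun utility, e.g.:
    #   mpirun -n 4 python kmeans_distributed.py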
16:01
After that, we just run our script and it will run the computation on multiple nodes in the cluster. Now, let's dive a bit deeper into the daal4py implementation.
16:23
In the background, you see C++ code that runs a distributed principal component analysis algorithm using the DAAL C++ API and MPI as the transport layer. This is one of the shortest examples.
16:45
The k-means example is about twice as long. As you can see, we couldn't just port this API to Python as is; it is not very usable or user friendly.
17:03
So we use a semi-automatic process to generate the daal4py API for distributed and for single-node algorithms. First, we parse the DAAL C++ headers with our in-house developed parser to locate the different classes, functions,
17:23
enumerations, and other objects, and we generate Cython code from this C++ API. After that, we use Jinja2 to generate Python classes for all of them, which are very different from the C++ classes.
17:45
This process has two main advantages. First, it simplifies the API, so we get a simple, not very verbose Python API. The second advantage is that when we add a new algorithm
18:03
to DAAL, or some new feature, we get the API for that feature for free, just by having this API generation process. It is a cool thing. And... (Moderator: you are running out of time.)
18:24
Another thing that helps daal4py achieve great performance is effective data transfer across language boundaries. Python data types store data differently. For example, for a NumPy ndarray we use what we call a homogeneous numeric table to represent it.
18:42
In a pandas DataFrame, columns can have different data types, and usually each column is represented as a contiguous array in memory; this representation is called structure of arrays.
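To illustrate the two layouts she mentions (a toy sketch, not DAAL code):

    import numpy as np
    import pandas as pd

    # NumPy ndarray: one contiguous, single-dtype block of memory;
    # this maps naturally onto DAAL's homogeneous numeric table
    X = np.random.rand(1000, 3)

    # pandas DataFrame: each column is its own contiguous array and may have
    # its own dtype -- a "structure of arrays" layout
    df = pd.DataFrame({
        "a": np.random.rand(1000),                     # float64
        "b": np.random.randint(0, 10, size=1000),      # int64
        "c": np.random.rand(1000).astype(np.float32),  # float32
    })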
19:01
daal4py supports various basic data types and a number of representation formats to avoid data copies and perform optimally with various data layouts. And finally, to the performance. Here you can see the scaling of the k-means clustering algorithm from one up to 32 cluster nodes.
19:21
Hard, or strong, scaling means that the size of the task stays fixed as the number of nodes increases. The orange bars show hard scaling for the k-means algorithm on a 35-gigabyte dataset.
19:41
You can see that on two nodes the computation runs twice as fast as on one node, on four nodes four times as fast, and so on. The time to process the data decreases as the number of nodes increases, which shows good hard scaling.
20:01
Weak scaling measures the performance of a workload when the task size is fixed per node. It means that on one node we process 35 gigabytes of data here, on two nodes 70 gigabytes, and on 32 nodes more than one terabyte of data. daal4py shows close-to-ideal weak scaling
20:21
with this algorithm, as you can see in the yellow bars. The runtime stays the same as the size of the data increases, and this is due to the communication-avoiding design of our algorithms. That's it about the compute part
20:40
of machine learning workloads. I hand over to Frank. — Yeah, thanks, Victoria. So, the actual machine learning algorithm is often only a small part of the entire data science application.
21:03
In many cases the pre-processing, that is, data input, reading the data and doing data cleaning, data manipulation, transformation, et cetera, to make it ready for the actual algorithm, takes most of the time. That can easily happen.
21:20
So it would be good to also improve the performance of these pre-processing steps. In most cases, people use pandas. Pandas has a really very powerful and nice API. The problem is that pandas is not necessarily optimized for speed; for example, it uses only a single core.
21:42
It leaves the many cores and vector units that basically all computers today have unused. So there's a lot of hardware that goes unused if you're running pandas, even though it would be entirely available. The other observation is that the pre-processing code
22:01
is usually not just one or two lines of code; it's actually a larger chunk of code. And if you have a larger chunk of code, there's usually great potential for optimization, if you can do optimizations across different instructions, for example loop fusion, things like that.
22:21
And because pandas is actually inherently data parallel, that is something we want to utilize. That's why we decided to write a compiler, and it's actually a just-in-time compiler. What we do is simply extend Numba. Numba is an existing just-in-time compiler provided by Anaconda. It's a domain-specific compiler.
22:42
It focuses on NumPy and numeric code around that. What we do is extend Numba so that it also understands pandas and code around pandas, and by that we get very good performance. The beauty of Numba and its approach
23:00
is actually that you don't need to leave Python. It's a pure Python package, and all you need to do is annotate your function with a function decorator, the njit decorator. You don't need to leave Python, learn a different language, or compile ahead of time; everything stays within Python.
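As a minimal Numba sketch (my own example, not from the slides):

    import numpy as np
    from numba import njit

    @njit  # compiled to native code on the first call, reused afterwards
    def running_sum(a):
        out = np.empty_like(a)
        s = 0.0
        for i in range(a.size):   # a plain Python loop, compiled to a fast native loop
            s += a[i]
            out[i] = s
        return out

    running_sum(np.ones(1_000_000))  # first call compiles, later calls run the cached binary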
23:22
If you have a function that you want to compile, you annotate it with the decorator. When the decorator is applied, its argument is of course the function. So the jit machinery that is provided by Numba
23:40
can now decide whether it wants to call that function directly, or whether it enters the compilation pipeline, creates a native binary and then calls that instead. And of course, hopefully, if you compile it to native, that will be much faster. Here's a very simplified view of this compilation pipeline.
24:02
It's really, really simplified, but just to give you an idea of what needs to be done: there are two things that are very important to get from Python performance to native performance. The first one is type analysis. Python is a super flexible language; the typing is mostly not even explicit, and it's dynamic.
24:23
But if you want to compile something to native, of course you need to map that to native types in order to, for example, use vector registers and all that. So we need to do this type analysis, and as you can imagine, that can be pretty tedious and pretty challenging. Here is where the domain-specific approach
24:42
actually comes into play, because Numba natively understands NumPy: it knows how to treat all the NumPy data types. Now we are adding pandas support, so Numba also understands what pandas does and all the data types in there. It's a domain-specific optimization, and the same is true for the parallelism analysis.
25:03
You know that pandas operations are usually data parallel. You could write the parallelism yourself, but because we know from the domain how things work, we can do that for you; we can automatically extract the parallelism. That is part of our pipeline: we add the parallelism within the code
25:24
that we generate in the efficient binary, so you don't need to do that yourself. This is not claiming that the parallelization can be done in a generic way; it's really domain specific for NumPy and pandas. As an example, I just want to show
25:43
two optimizations that we can do in the compiler that cannot be done easily with other packages, or if you have just a library. Here's an example that is really simple. You see this function that is annotated with the njit decorator. All it does is read a file.
26:00
In that file you have data about employees: the employee's first name, for example, and the bonus, and maybe there are 10 other attributes. Once that's read in, we compute the mean of the bonuses, sort the first names and return the two series. It's really a toy example, so we don't need to talk much about it.
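A sketch of such a function might look like this (the file, the column names, the import name, and the exact subset of the pandas API that SDC supports are assumptions on my part; the idea is simply that an ordinary @njit-decorated function can use pandas once SDC is installed):

    import pandas as pd
    from numba import njit
    import sdc  # noqa: F401 -- assumption: importing SDC registers its pandas support in Numba

    @njit
    def analyze_employees():
        # SDC reads the file in parallel chunks and can infer that only these
        # two columns are actually used (no explicit usecols needed)
        df = pd.read_csv("employees.csv")
        bonus_mean = pd.Series(df["Bonus"]).mean()
        sorted_names = pd.Series(df["First Name"]).sort_values()
        return bonus_mean, sorted_names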
26:21
The two things I want to stress here is, first, the read CSV. So if you look at the read CSV, and what pandas, the default implementation does is, it simply takes the entire file, reads it from start to end in a serial manner. So there's no parallel in the minute. While our implementation that is generated by SDC,
26:43
divides the entire file into multiple chunks, and then each thread does the reading of its chunk. So we have a fully parallel read even for read_csv, and that of course brings quite a performance boost. And of course, this parallelism
27:01
is not only applied to read_csv, but also to things like mean and sort_values. All these operations on data frames and series are parallelized by SDC. The second interesting thing here is a memory optimization. If you look at the comments, there is one argument
27:21
that you could pass to read_csv, usecols, which means that the listed columns are the only ones you're interested in. If you provide that to read_csv, you will read in only these two columns. With SDC, you don't need to do that, because SDC can do the analysis of your function body
27:41
and see that this function actually only uses the bonus and the first name. So it can do that for you automatically; you don't have to. It basically takes away some of the optimization burden from you, because it can do the analysis, and because it knows the domain, it knows what to do. These are just two examples; of course, there are many more optimizations
28:01
that are applied, like loop fusion and things like that. But this just gives you an impression of what is possible if you use a compiler compared to just using a library or a package that is optimized under the hood. Of course, there are also caveats or limitations to an approach like this. The most fundamental one here
28:21
is that we're doing a static compilation: even if it happens at runtime, at some point you have to fix the code in some state and then compile it. So it's a static compilation in that sense, and the code needs to be type stable. Type stable means that from
28:42
the set of input types, which are deduced from the input arguments, all the other types in your function, like the variables and other functions, et cetera, need to be statically determinable. That is what type stable means.
29:02
Basically, Python allows you to write expressions where that's not the case, and you cannot use those. Those expressions usually look like the example below, where you have a condition that is not a constant expression, and then the if and the else branches
29:22
result in different types. That applies to functions as well as variables, but in our domain with data frames it also applies to the data frame schema: the column names need to be statically determinable as well. That is the core limitation.
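A small illustration of code that is not type stable (my own example, not from the slides):

    from numba import njit

    @njit
    def not_type_stable(flag):
        # the type of x depends on a value known only at runtime, so it cannot
        # be deduced statically -- Numba/SDC will reject this with a typing error
        if flag:
            x = 1        # int
        else:
            x = "one"    # str
        return x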
29:41
It's conceivable to do it differently, but if you require this to be static, we can do certain optimizations which we couldn't do otherwise. In most cases, this is nothing that data science applications actually do, but in some cases they do. In some of those cases you can work around it by just changing the code, but of course that's not always possible.
30:01
So this is really a fundamental limitation, but hopefully in most cases you can work around it. The other thing is, of course, that compilation takes time; it doesn't come for free. And you can't always assume that the compilation takes less time than what you gain by the compilation.
30:22
The first time Numba kicks in, it takes the code, compiles it, and then runs it, so there is some overhead for the compilation. But it caches the result of the compilation, so the native function will be cached. The next time you call it, it will not compile it again; it will use the cached binary and run that.
30:42
So before you annotate your function with the jit decorator, you should think a little about whether it actually makes sense. If you have a very small function with no data parallelism, no computation in it, and it's called only once, it doesn't make sense to jit it. But if you have a function that has parallelism in it,
31:00
has a lot of computation, and is called, for example, in a loop, that's your target. That's your candidate, and you will get very nice performance just out of the box. The other thing to think about, which is not that obvious, is that currently the compile time of SDC depends on the number of columns.
31:20
If you have very wide data frames, SDC might take a very long time to compile. We have one example where SDC actually takes 10 minutes to do the compilation. In that particular case, we were still able to reduce the total runtime from hours to tens of minutes, but that is of course nothing you can assume in general.
31:41
So think before you apply the jit decorator, but if it's a good candidate, it basically gives you the performance for free. Let's look very quickly at some performance data just to prove that it actually works. Here is read_csv; we're just reading a CSV file,
32:02
and on the x-axis you see different numbers of threads, from one to 56, and on the y-axis you see the speed-up of the SDC-compiled pandas over stock pandas, the non-compiled pandas. You see that with one thread it's actually slower,
32:20
and that's what I just mentioned: we have compile time, so that makes it slower. But as soon as you have more than one thread, it's faster. With four threads it's already two and a half times faster, and you get up to more than 10x. You cannot expect something like a linear speed-up here, of course, because we're doing disk I/O and so we are limited by the disk speed.
32:42
So this is already a very, very good number. This slide provides two charts comparing operations on entire data frames, that is, a data frame with multiple series, multiple columns, in it. Again, we're showing the speed-up of our SDC-compiled operations
33:01
compared to stock pandas on multiple threads. On the left-hand side, we're doing that on the entire data frame: we're counting the number of rows, we're dropping rows, and then we are computing the max for each column. You see the speed-up is quite impressive.
33:22
For the max, for example, which has more computation than the other two, we get more than 100x. On the right-hand side, we do a similar thing: we compute the quantiles, the mean and the standard deviation, but this time not over the entire columns but over rolling windows.
33:42
This is a problem that is much harder to parallelize, but you see the performance is still very nice and it scales, in particular if you have operations that are a little more compute intensive, like the standard deviation. This is the last performance slide here,
34:01
and we do similar comparisons, but on a single series, not the entire data frame. Again, on the left-hand side we do that for the entire series; on the right-hand side we do it with rolling windows. Again, we see nice speed-ups and in some cases good scalability.
34:20
Let me just highlight the apply case, the green one on the left-hand side. We get more than 400x speed-up, and that is because when you do an apply with a lambda, a compiler can compile not only the apply into native code, but also the lambda. A library or package, even if it's very efficient
34:41
in its implementation, would always need to go back to Python, call that lambda, and come back. We can eliminate all of that Python overhead, and that's why we get this fantastic speed-up of more than 400x with a simple apply. Here's just a quick summary of the functionality that is supported right now in SDC.
35:01
It's an evolving project, of course not finished, and the pandas API is super wide, so there are many, many functions in it. Of course, in our initial version we can't support everything. We already support more than 170 functions, for example the statistical operations that we have seen in the charts,
35:20
but we also do relational operations like filter, group-by, join, things like that. We saw the rolling windows. And let me just add something on the data side of things: we not only support the data containers, like data frames and series, but we also added support for dates and strings, which is something that Numba did not support before.
35:42
That is a major source of performance optimization. If you have code that uses strings or dates or times and you compile it with SDC, you get a super speed-up. We had some cases with 1,000x or more, just because we are optimizing how strings and dates are handled.
36:03
This is a slide about the current status of SDC. It's an open-source project, so it's not closed, it's free; you don't have to pay anything for it. You can get it from GitHub and compile it yourself, or we also provide binary packages.
36:21
We provide conda packages and pip packages, pip wheels, whatever you like more. SDC will be part of our new oneAPI product that will be released at the end of this year. This will not change anything: it will still be free, it will still be open. It just means that, if you want,
36:42
you can get support for it. Until then it will be in beta; in product terms, that's beta quality. Last but not least, you see this second link here, which is the documentation for SDC. The documentation not only lists all the things that SDC can do,
37:02
but, probably more importantly, for the pandas things that it supports it also lists what it does not support. So if there's some argument that it does not support yet, you will see that. What's next? We are working on a few things, of course adding more pandas features,
37:21
but more interestingly, we want to extend the parallelism to not only scale up within a node, but also scale out across nodes, so scale it out to a cluster. Again, because we know what you're working on, it's domain specific, we can do all this automatically for you. We can scale that, do all the communication,
37:42
all of that internally. We know how to do this; the user doesn't need to do anything, just run mpirun, and the rest of the code is identical. The other thing we are currently actively working on is running things on a GPU, so that the jit machinery can generate code not only for the CPU
38:00
but also for the GPU. If you have done something like that in Python already, you might know what you see on the left: that's what you usually do with CUDA, for example, this CUDA-style kernel launch programming that we think is too low level for a Pythonista. We are aiming for much more: we want to apply the same style
38:22
of jit programming that we have right now for the CPU also for the GPU. So there is no change in the function that you want to run on the GPU; only when you execute it, when you call it, you call it in a so-called device context. That, in a nutshell, is what we are trying to do. We are quite far along with that,
38:41
but it's not yet ready. Certain things belong to that, like two more packages: one is a NumPy equivalent for the GPU and the other does this device-context handling. That's all in preparation, so maybe next year we'll be back talking about that. That basically concludes our presentation. We hope it was interesting to you, and if it is, here are some links, check them out.
39:02
And one last word: I want to thank the organizers for the really incredibly good preparation of this conference. It's fantastic. Thank you so much.
39:21
Thank you very much, Victoria and Frank. You convinced me it's really painless. It's really nice work, and I like it when things go faster. We do have some questions; the first two can be combined. The question is:
39:40
do you need any specific Intel hardware to use the Intel Data Analytics Acceleration Library? And the second part is, good question, are there any benefits on AMD processors, or is it strictly Intel?
40:01
You can assume that everything also works on AMD. Of course it will work best on Intel; that's clear, we are Intel, so we optimize for our hardware. But you can also assume that it will perform really well on AMD. We are not supporting something like ARM, for example.
40:21
But, for example, for the SDC part: SDC has different back ends and we were mostly working on the front end. So, for example for the GPU work that we're doing, it should be fairly simple for NVIDIA or someone else to write a CUDA back end for the same thing, if that's wanted.
40:41
Yeah, and a question from Antus. — Okay, sorry, I wanted to add: not about Python, but on the C++ side, additionally to DAAL, the classical product, we currently have the oneAPI oneDAL product that can also run on GPUs,
41:00
on integrated GPUs or any Intel GPUs. And I think going further we are going to bring this to Python as well. — That's great. I fear we are running out of time, but the question from Matus: how closely does the Intel Python distribution
41:22
follow the release cycles of pandas, scikit-learn, et cetera? — We have a very close collaboration with Inria, so with scikit-learn we are usually very, very close. When they have a new release, even during their release cycle,
41:41
we are checking against our patches and our optimizations, so it should be very, very close; we are trying to be as close as possible. Pandas is usually a little further behind, but you can expect that we are no more than maybe a quarter behind. — There are more questions.
42:01
We have to take them to the Discord chat, so it's best to continue there. You will find the chat by pressing Ctrl+K, and I think it's under "scalable machine learning". I'm sure you will find it.
42:20
So let's continue in Discord. We even have three questions here, but time's up. Now it's time for the coffee break, the virtual coffee break, time to get some coffee. Thank you very much again, Victoria and Frank. It's really great work; please continue it. Thank you. Thank you.