Scaling scikit-learn: introducing new sets of computational routines
Formal metadata

Title | Scaling scikit-learn: introducing new sets of computational routines
Series | EuroPython 2022 (talk 85 of 112)
License | CC Attribution - NonCommercial - ShareAlike 4.0 International: You may use, modify, copy, distribute, and make the work or its content publicly available for any legal, non-commercial purpose, provided you credit the author/rights holder in the way they specify, and pass the work or content on, including in modified form, only under the terms of this license.
Identifiers | 10.5446/60850 (DOI)
Language | English
Transcript: English (automatically generated)
00:06
So today, I will go through some work that we are currently doing for scikit-learn. It's mainly about the introduction of new computational routines for the library.
00:21
I will rapidly go through the context of the library, and then I will describe a pattern which is used in many algorithms, which we have improved, and which we would like to port to other architectures. So before getting started, just a few words
00:43
about me and about Inria. I'm a research engineer in the SODA team, which launched the project a bit more than 10 years ago. I've been working there for a bit more than a year, and I became a scikit-learn maintainer in October 2021. My work mainly focuses on improving the library
01:03
performance and on reviewing contributions. But there are many other aspects, and we are looking for people to help, especially for reviewing pull requests. As for Inria, Inria is the French National Institute for Research in Digital Science and Technology. It's historically important for the country,
01:24
and it ranges from academic research to technological startup creation. Many other open source libraries used for scientific computing are developed there. If you are interested in knowing more about those, feel free to have a chat after the talk.
01:44
When it comes to scikit-learn, I don't know if we need to present the library once again. But in a few points: scikit-learn has become the reference for simple, efficient, and easy-to-use implementations of data analysis and machine learning algorithms.
02:02
It brought the famous estimator fit/predict API and came with nicely illustrated documentation. It's the reference for many products in industry and in academia. We also organize online and on-site sprints.
02:22
There's a sprint this afternoon; some people from the team will be participating. So if you're new to open source, or if you would just like to contribute to the library, feel free to come, and we and the other maintainers of the library will help you. These kinds of features
02:42
made it one of the most popular packages for Python. Yet there are things that can be improved, and it's mainly about improving what already exists. We don't want shiny new features in the library; we want to keep it simple and efficient. In particular, we can improve it on many points,
03:02
such as documentation, community engagement, and usability, so that it's even more user-friendly, and performance. As for performance, there is a lot of work going on; I'm only going to focus on one pattern so as to get into technical details. If we look at performance in applications,
03:24
people may be using gradient boosting methods, because those are performant. This was mainly driven by libraries such as LightGBM, XGBoost and CatBoost. scikit-learn also implements
03:40
histogram-based gradient boosting. And if we compare the performance of scikit-learn against those libraries, it's relatively similar. Here is a graph we've produced, aggregating validation scores over many sets of hyperparameters and random datasets; those results are available online and
04:01
can be reproduced if you want. But the point here is that it's not just those kinds of methods; there are many other ones, starting from really simple ones, such as the k-neighbors classifier. And if we compare our implementation to, for instance, Intel's oneAPI implementation, in scikit-learn 1.0 and previous versions,
04:24
those implementations were not the most optimal ones, so there was a performance gap that we could work on. And actually, if we look at this classifier here, most of the computations rely on two sets of methods, which are the k-neighbors search and the radius neighbors search.
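Both routines are exposed through scikit-learn's public `NearestNeighbors` API; a minimal sketch of how they are used (the toy data here is made up for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy data: 200 points in 8 dimensions.
rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 8))

nn = NearestNeighbors(n_neighbors=3, algorithm="brute").fit(Y)

# k-neighbors search: the 3 closest points in Y for each query.
dist, ind = nn.kneighbors(Y[:5])

# radius neighbors search: every point of Y within radius 2.0 of each query.
r_dist, r_ind = nn.radius_neighbors(Y[:5], radius=2.0)
```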
04:42
But it's not just the case for this algorithm; it's the case for many other ones in the library, such as these, among others. So focusing on improving the performance of those two routines would improve the general performance of the library. The focus of this talk is to go through
05:01
how we can speed up those two routines. So, just a bit of formalism for the nearest neighbors search, without going into all the details. What you need is a distance metric, which is a function taking two vectors of a given dimension, with some other properties
05:23
that I'm not going to cover. What you generally work with is a vector x, called a query, whose neighbors you would like to find in a dataset, referred to here as Y. And you can consider the distance metric
05:41
d(x, y), defined like this. So the goal, as said, is to find the closest vectors to x in Y, and there are generally two formulations: one for the k-neighbors search, where you find the k smallest values using what we call an argkmin operation
06:01
on the matrix, or you can find the closest neighbors within a ball of radius r, as shown here. There are generally two kinds of solutions to this problem. The first kind relies on binary tree data structures, such as ball trees and k-d trees. Here is the complexity,
06:22
which is theoretically provable. The problem, in fact, is that in many cases, for many reasons, including backtracking, which I'm not going to cover, it degrades towards this complexity as p grows.
06:41
So it's not even the best option, in fact, when we have big, high-dimensional datasets; it's only best for small values of p, roughly in this range. In practice, people are using much bigger datasets, so we need to consider alternatives. The alternative is just to use brute force,
07:00
which is supposedly worse in terms of complexity, but in practice it can be way more performant, because technical optimizations can be used. Here we are only going to cover this case, which will give you an overview of the work that we are doing.
07:22
When it comes to brute-force k-neighbors search, one can see the problem like this. You consider your vector x and your matrix Y here, and you can compute the distances. In an ideal world, what you can do to get the neighbors you are looking for
07:42
is to perform what is called a reduction on this matrix. In this case, a step of the reduction can be as follows: you partition the matrix around a pivot, which we take here as the kth smallest value of this array. This allows you to extract the k smallest values,
08:04
and then you can sort them, as well as the indices, so that you have all the information about the neighbors of your vector, and then you're done: you have the neighborhood of your vector x. In Python, this can be implemented easily using argpartition and argsort.
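The single-query reduction just described can be sketched in NumPy (a minimal illustration, not scikit-learn's actual code; the data and function name are made up):

```python
import numpy as np

def knn_brute_force(x, Y, k):
    """Brute-force k-nearest neighbors of a single query x in dataset Y."""
    # Squared Euclidean distances from x to every row of Y.
    dist = np.sum((Y - x) ** 2, axis=1)
    # argpartition places the indices of the k smallest distances first (unsorted).
    idx = np.argpartition(dist, k)[:k]
    # Sort those k candidates by distance to get the final neighbor order.
    order = np.argsort(dist[idx])
    return idx[order], np.sqrt(dist[idx][order])

rng = np.random.default_rng(0)
Y = rng.normal(size=(1000, 16))
x = Y[42]  # query taken from the dataset; its nearest neighbor is itself
indices, distances = knn_brute_force(x, Y, k=5)
```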
08:22
Using NumPy, you can get the result in a few lines, as here. Actually, sometimes we are not working with just one vector; we have many. So instead of a small x, I will use a capital X to refer to the set of all query vectors. And what you are computing is then not just a vector,
08:41
but an entire pairwise distance matrix, which consists of all the distances between the vectors in X and the vectors in Y. If you try to perform those operations naively, it will just crash your RAM, because these matrices don't fit in RAM: with a dataset of a few hundred thousand observations,
09:04
it just will not fit into your memory. So what you do is chunk the matrix X and compute all the operations independently for groups of vectors.
09:22
So you consider smaller distance matrices, and this can be done in parallel, for example using joblib. This was the previous implementation of the k-neighbors search, the implementation as of scikit-learn 1.0.
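A rough sketch of that chunk-on-X strategy (the real pre-1.1 implementation used joblib; a standard-library thread pool stands in here, and the chunk size is arbitrary):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def _chunk_argkmin(X_chunk, Y, k):
    # Pairwise squared Euclidean distances for one chunk of queries only,
    # so the intermediate matrix stays small enough to fit in RAM.
    d2 = ((X_chunk[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    idx = np.argpartition(d2, k, axis=1)[:, :k]
    rows = np.arange(len(X_chunk))[:, None]
    order = np.argsort(d2[rows, idx], axis=1)
    return idx[rows, order]

def chunked_argkmin(X, Y, k, chunk_size=64, n_threads=4):
    # Each chunk of X is reduced independently, so chunks can run in parallel.
    chunks = [X[i:i + chunk_size] for i in range(0, len(X), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = pool.map(lambda c: _chunk_argkmin(c, Y, k), chunks)
    return np.vstack(list(parts))

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 5))
Y = np.vstack([X[:10], rng.normal(size=(90, 5))])  # Y contains X[:10] exactly
neighbors = chunked_argkmin(X, Y, k=3)
```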
09:40
The problem with this implementation is that it does not scale properly on big machines with many threads. Here you have a simple graph with the number of threads on the x-axis and the speedup on the y-axis. What you see is that as you add threads, as you use a bigger machine, you do not get any speedup.
10:02
In the best case, for some kinds of datasets, you get up to a 2x speedup. So why is that? The problem is that with Python there are a few reasons for sub-optimality. Python was not meant for scientific computing at first; it was mainly meant as a nicer scripting language than Bash.
10:23
The first problem is that the execution of this algorithm is bound to the CPython interpreter and to the GIL. That is, you have a lot of costly abstractions for simple operations like basic arithmetic. Moreover, execution is limited to one thread at a time due to the GIL,
10:41
which might be removed someday, we don't know, and which is a mutex on the interpreter state. This comes with performance degradation. Sometimes you may also have a bigger dataset for Y, when you are training on a lot of examples and you only have a few queries; chunking on X is then not well adapted.
11:01
You may want to chunk on Y instead. Another problem is that we use high-level operations on top of arrays. This is costly, because internally you create new data structures, so you need to allocate buffers. This calls malloc and free
11:21
under the hood, and those calls block at the level of the operating system. So you are performing a lot of operations which are long and which block. And in the context of parallelism, this comes as an extra cost on top of the thread pool or process pool setup and teardown.
11:42
Moreover, Python is nice because it is a simple language, but you only have a few primitives, and you can't really express more complex, more detailed execution schemes. So we can solve these problems
12:01
with a few things. First, we can get rid of the CPython interpreter and the GIL using a language called Cython. If you do not know Cython, I encourage you to have a look at it. It is a really nice project, and you can get really good performance just by modifying your code
12:20
with a few extra type annotations and other things. You can make sure that you are not using the GIL with the nogil clause, so you are sure that you are basically writing C code that runs efficiently. You can adapt your chunking strategy so that you are not only chunking on X but also on Y, and you can choose whether you want to prioritize computation on Y or on X.
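One way to chunk on Y is to keep, per query, a heap of the current k best candidates and merge each Y-chunk into it. A plain-Python sketch of the idea for a single query (the real code does this in Cython with C buffers; names and chunk size are illustrative):

```python
import heapq
import numpy as np

def argkmin_chunked_on_y(x, Y, k, chunk_size=50):
    # A max-heap (simulated by negating distances in heapq's min-heap)
    # keeps the k best candidates seen so far across the chunks of Y.
    heap = []  # entries: (-squared_distance, index_in_Y)
    for start in range(0, len(Y), chunk_size):
        chunk = Y[start:start + chunk_size]
        d2 = ((chunk - x) ** 2).sum(axis=1)
        for j, d in enumerate(d2):
            entry = (-d, start + j)
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:  # closer than the current worst kept candidate
                heapq.heapreplace(heap, entry)
    # Sort the kept candidates by ascending distance for the final answer.
    return [i for _, i in sorted(heap, reverse=True)]

rng = np.random.default_rng(2)
Y = rng.normal(size=(200, 4))
nearest = argkmin_chunked_on_y(Y[7], Y, k=5)  # query is Y[7] itself
```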
12:44
Instead of NumPy arrays, you can use other data structures and algorithms underneath, and even call libraries for things like sorting or linear algebra. And Cython allows using OpenMP,
13:02
which gives you thread-level parallelism with basically no overhead, so it's even nicer. There are many more directives in OpenMP if you want to perform computations, but they are not all exposed in Cython.
13:20
It is possible with some tricks, but I'm not going to cover this. Generally, the users of scikit-learn don't run data centers or anything like that; they are mainly people with laptops, such as this one. This machine consists of several things. As you can see,
13:40
it's a simple machine with about 50 gigabytes of RAM. You have four physical cores at the bottom, and in between, you have these things called L3, L2, L1d and L1i. Those are small sets of memory called caches, which are used to store intermediate results.
14:01
If you want your implementation to be performant, you need to take this into account, and this is what we've done. Firstly, we make sure that the data structures we use fit into the L3 cache and that they are only allocated once, at the beginning. That's just efficient.
14:22
You can make sure that data structures are properly allocated so that they map correctly to the cores as well. You can use, as I said, low-overhead parallelism with Cython and OpenMP, and other algorithms and data structures. In our case, instead of using NumPy arrays,
14:43
what we use are max-heaps. This allows a better complexity, and technically you can just use C buffers and pointer arithmetic. And lastly, generally, when people consider this kind of algorithm,
15:00
they are mainly using the Euclidean distance case, which can be decomposed as ||x - y||^2 = ||x||^2 + ||y||^2 - 2<x, y> for the intermediate distance matrices. In this case, you have three terms that you can compute, two of them being computable once at the very beginning and then stored. And you have this term here, which is a small rectangle,
15:22
which can be computed efficiently using BLAS level 3 operations, that is, the most efficient implementations of matrix multiplication that you can have. Using those tricks, what you now get
15:41
is linear scalability, up to a plateau at the end. So now, using bigger machines, you actually get good performance. There is just this plateau here, which can be explained by several things. First, the dataset is relatively small for the machine that we used.
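The Euclidean trick described a moment ago can be sketched in NumPy, where the cross term goes through a BLAS level 3 matrix multiplication under the hood (a simplified sketch on squared distances; the function name is made up):

```python
import numpy as np

def euclidean_argkmin(X, Y, k):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 <x, y>; the cross term X @ Y.T is a
    # single matrix multiplication, dispatched by NumPy to BLAS (often OpenBLAS).
    X_sq = np.einsum("ij,ij->i", X, X)[:, None]  # computed once, then reused
    Y_sq = np.einsum("ij,ij->i", Y, Y)[None, :]  # computed once, then reused
    d2 = X_sq + Y_sq - 2.0 * (X @ Y.T)
    idx = np.argpartition(d2, k, axis=1)[:, :k]
    rows = np.arange(len(X))[:, None]
    return idx[rows, np.argsort(d2[rows, idx], axis=1)]

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 6))
nbrs = euclidean_argkmin(X, X, k=4)  # each point's nearest neighbor is itself
```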
16:01
And at some point the sequential section of the algorithm is no longer negligible compared to the parallel one. In that case, you do not get speedups as you add more threads, because you cannot add parallelism there. And there's a small cost to synchronizing the threads with OpenMP,
16:21
hence the small slowdown at the end. This is sometimes referred to as Amdahl's law; you need to remember it when you maintain these algorithms. So, to make sure: is it possible to see what happens on the hardware, and to know whether this is actually efficient
16:41
or whether we can speed it up even more? Coming back to this diagram, what we can look at are the cache hits and cache misses on the L3 cache, to see whether it is properly used. We want a high hit rate for the L3 cache.
17:06
We want to see whether there is a low overhead for OpenMP and CPython. And lastly, we want to know whether the implementation of the algorithm
17:20
mostly spends its time in vectorized instructions. You can look at all this using a tool called perf, which is a really simple tool to use and which runs on many kinds of programs.
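The kind of perf invocation being described looks roughly like this (`bench_knn.py` is a placeholder for whatever benchmark script you profile):

```shell
# Count cache references/misses and CPU cycles while running the benchmark.
perf stat -e cache-references,cache-misses,cycles python bench_knn.py

# Sample where the cycles are spent, then browse the annotated report.
perf record -e cycles python bench_knn.py
perf report
```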
17:42
You can even run it on Python and see what happens under the hood. I won't go into details, but basically you just need to set up a few things. Here we want to record the cache hits and misses and the CPU cycles. Then you get a report with everything. And if you look at where the cycles
18:01
of the machine are spent, they are mainly spent here, as you see, in libopenblas, the implementation of BLAS by a project called OpenBLAS. This is where we spend the time computing the matrix multiplication for the distance matrix.
18:23
Some time is spent pushing values onto the heaps, and the time spent in the CPython interpreter and in OpenMP is negligible. This is exactly what we wanted: a super low overhead for CPython and OpenMP.
18:41
So that's what we got. And like the execution is mainly spent into CPU. There's like the execution is like CPU one. There's no problem with like the memory. If you have a look at the time spent in the core of the computation, that is the small region here at the top in green,
19:02
what we see is that all the operations there are vectorized, namely SIMD instructions: you can perform operations on several floats at a time. And this is what you get
19:22
in terms of assembly code here. This shows that the implementation is efficient for this case, which is what we wanted. We haven't written this; it's OpenBLAS that wrote those routines. OpenBLAS is just a super nice project. If you do matrix multiplications,
19:42
chances are that you are using OpenBLAS under the hood without knowing about it. So have a look at OpenBLAS, a super nice project. And lastly, if you look at the L3 cache, with the cache hits at the top and the cache misses at the bottom,
20:01
what we see is that there are many more cache hits than cache misses. This shows that we have a high L3 cache hit rate; that is, we do not move data structures between the RAM and the L3 cache a lot. So this is nice. If you wrap up everything,
20:21
you can say that we have high confidence in quasi-optimal performance for this kind of algorithm. And if you go back to this pattern, we can actually extend it to many other algorithms and many other situations. Firstly, we can adapt the data structures at the end
20:42
and the operations that we use and operate on, as well as call other libraries, like C++ libraries, to support other reductions and algorithms. So you can implement the radius neighbors search like this, you can implement k-means like this, and other reductions;
21:02
you can even implement kernel methods; you can do many things with this. You can also work with sparse datasets: you may have one dataset which is sparse, the other one being dense, or any possible combination.
21:20
And you may have custom distance metrics, things other than the Euclidean distance, that you would like to support. So in our work we have moved to a modular class hierarchy, to support multiple algorithm backends. Here is a simple overview: we have the nearest neighbors search interface at the top,
21:43
which depends on two other interfaces. Cython has some restrictions, but it is possible, using some kind of templating, to dispatch calls to specialized implementations.
22:00
Here is a partial overview of the design: you can have all the main operations done in this abstract pairwise distance reduction class, and then have the reductions and the data structures specialized in subclasses.
22:21
You can factor some operations, for example the computation of the matrix multiplication term, into a dedicated component. So there is support for all distance metrics like this, for all combinations of sparse and dense dataset pairs, support for 32-bit and 64-bit floats, and so on.
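The dispatching design being described can be sketched roughly like this (the class and method names here are made up for illustration, not scikit-learn's actual private API):

```python
from abc import ABC, abstractmethod
import numpy as np

class PairwiseDistancesReduction(ABC):
    """Abstract parent: owns chunking and parallelization; subclasses
    specialize the reduction and its data structures."""

    @classmethod
    def is_usable_for(cls, X, Y, metric):
        # Dispatch criterion: callers fall back to another backend otherwise.
        return X.dtype == Y.dtype == np.float64 and metric == "sqeuclidean"

    @abstractmethod
    def compute(self, X, Y):
        ...

class ArgKmin(PairwiseDistancesReduction):
    """Reduction returning the indices of the k smallest distances per query."""

    def __init__(self, k):
        self.k = k

    def compute(self, X, Y):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
        idx = np.argpartition(d2, self.k, axis=1)[:, :self.k]
        rows = np.arange(len(X))[:, None]
        return idx[rows, np.argsort(d2[rows, idx], axis=1)]
```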
22:41
But it does not fit in a single slide, so I won't show all of it. Okay, so this work is just a first step towards new computational routines. We now have a partnership with Intel
23:03
to work on their technology and extend this kind of pattern to their hardware. But this is not something we want to do for potentially just one vendor; we would like to have a general design, so that people can
23:20
implement their own backends if they want. This is just the start of a design project; there are references in the slides at the end. Basically, those are ongoing discussions, and this is just one part of the performance work. There are many other things and other ways to speed up implementations.
23:45
And yeah, this is just the beginning. As for Intel, they are working on new hardware which they call XPU, which is the combination of a CPU and a GPU with unified memory.
24:01
So you can have a small shared memory space between your CPU and GPU, so that you can work efficiently on the GPU for vector operations and on the CPU for simpler operations. And that's it. So the conclusion is that improving
24:21
performance is one of the next steps for the library, among many others. In this work, we focused on a core pattern for algorithms that are reductions over pairwise distances. Python is slow for this and has some performance problems.
24:41
Yet it's possible to mitigate this using technologies like Cython, OpenMP, and C++. And the next part of the work is hardware-specific computational routines. That's it for me. Thank you for your attention, and if you have any questions, feel free to ask them.
25:06
Thank you, Julien, very impressive. I'm sure we have lots of questions. Please go ahead. Yeah, thanks for the talk. Well, I just came out of the PyArrow talk. Were you there by chance?
25:21
They showed the chunked array: it supports zero-copy, no-cost conversion to NumPy arrays, and it already does this sort of chunking that you showed. Is there any chance that you might alleviate these problems with Python by basing your backend a little bit on PyArrow
25:43
for certain of those operations that are parallelized? Yeah, that's a nice question. We have not considered PyArrow as of now. There are, in fact, many projects and technologies that exist to perform these kinds of computations.
26:00
There's a project which is called PyKeOps. Oh, do we have internet access? This computer is not connected to the internet. Yeah, I can look it up later. It's a project, basically, which, yeah.
26:24
PyKeOps. So, this project here, this one, yeah. This project is really dedicated to performing these kinds of computations on GPU.
26:42
But for CPU, we haven't found any such project. We haven't looked at PyArrow yet, but it may be possible to integrate it. It's just that we don't want too many dependencies. We may want to have a plug-in system, so that if people want specialized implementations,
27:03
they can eventually have something like pip install scikit-learn, bracket pyarrow, bracket intel, or bracket whatever, so that it specializes on the machine. But for PyArrow, we haven't looked at it yet. All right, thanks. Thanks for your suggestion, by the way.
27:23
Hey, thanks for the presentation. That was really cool, especially seeing how you improved k-nearest neighbors. I think that's one of the first things that people new to machine learning work with. And I guess scikit-learn is the one thing people go to to learn and get into machine learning.
27:42
So, really impressive. I'm wondering what other estimators are on your roadmap for improvement. Okay, thanks for your question. What I've covered here are low-level performance improvements, that is, we are mainly writing them ourselves,
28:02
but there are other ways to improve the performance of algorithms and estimators in scikit-learn. One of them is trying to adopt the standard called the Array API, which is a standard for tensor libraries. This way we won't need to rewrite
28:21
all the implementations of the algorithms; we would just have a high-level API for array operations on tensors. There is actually someone working on a proof of concept to speed up linear discriminant analysis. You can get up to a 40x speedup on, let's say, a GPU
28:44
with no change to the algorithm's implementation and no data changes. So this is one part of the performance work we are doing, which is high-level, and it's a matter of settling on a design that works for all the tensor libraries, so it just takes time.
29:02
People in the community are working on many things; if you want to join the adventure, feel free to. It's just that we need to make sure the designs are correctly set so that we don't make mistakes. In this case, some libraries use APIs that are a bit different,
29:21
so this creates some friction. But I think there's room for making sure that we get the best performance, especially on GPU, because as of now, scikit-learn has mainly been relying on NumPy. People are moving towards GPU, so we are working in this direction as well. Cool stuff, thank you.
29:41
Thanks. Thanks for your talk. I have the obvious question, namely: when is it going to come out? What is your estimate of when these changes are going to be released? Which one, exactly? The pairwise distance computation.
30:00
Okay, so, yeah, I should have been clearer, I guess. For this one? Yes. It's already in scikit-learn 1.1. Not entirely: it's the case for the main use case, that is, the k-neighbors search
30:20
using Euclidean distance on dense arrays. We are currently working on extending it to float32 and to combinations of sparse and dense datasets. It will probably land in scikit-learn 1.2, maybe 1.3. We are mostly looking for people
30:41
who are interested in Cython. The bottleneck is mainly PR reviews, debugging and benchmarking. Cython is nice, but we also need people; we are working on many things. So I would say scikit-learn 1.2 and 1.3 will have much more support
31:02
for this kind of pattern. Okay, yeah, thank you. Just one follow-up question. I had problems, especially with RAM, when I used my own distance matrices, non-Euclidean ones or something. Then, with a lot of algorithms,
31:22
I had RAM issues. Are you working on that as well? I was just wondering about that; I do understand that you have a lot to do. Okay, so what was your case exactly? You had a custom distance matrix? Yes, exactly, a custom distance matrix, and I constantly run into RAM issues when I do that.
31:43
So I'm wondering. I think this is something I'm interested in, so if you want, we can discuss it. Potentially, there are patterns in scikit-learn which can be improved, not just for this case. It's often the case for agglomerative clustering, which computes a dissimilarity matrix,
32:03
something like this, and sometimes people just have their memory crash because of it. If you have a problem with a pattern like this, let's discuss it at the end. Yeah, I'm happy to, thank you.
32:21
Okay, I guess there are no more questions. So a round of applause for Julien again.