Automating machine learning workflow with DVC

EuroPython

Lee, Hongjoo

Formal Metadata

Title

Subtitle

What data scientist / ML engineer wants to do while software engineers are busy with CI/CD.

Series Title

EuroPython 2020

Number of Parts

130

Author

Lee, Hongjoo

License

CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use and adapt the work or content, and to copy, distribute, and make it publicly available in unchanged or adapted form, for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner they specify and pass on the work or content, including in adapted form, only under the terms of this license.

Identifiers

10.5446/49923 (DOI)

Publisher

EuroPython

Year of Publication

2020

Language

English

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

Just as software engineers set up a CI/CD process as soon as they start a new project, data scientists and ML engineers define a pipeline for data as it flows through a typical workflow. Each step of the pipeline is fed data processed by the preceding step, much as a CI/CD process is triggered by code changes. The phrase "pipelining an ML project" is sometimes misleading, because it suggests a large project with a group of engineers working on large systems, and so it is considered hard for an individual and unnecessary for a small project. Regardless of project size, well-organized pipelines are essential to the success of any ML project, and they can be built easily with a proper tool. In this talk, we will go through a machine learning workflow divided into a few steps that compose an ML pipeline, from data ingestion to model deployment. Each step depends on data produced by the previous step, and these dependencies are managed by DVC. DVC is an open-source version control system for data scientists and ML engineers that helps them organize data, models, and experiments in ML projects. The presentation will not only introduce how to use the tool but also show, with some examples, how to organize an ML pipeline. The goal of this talk is to motivate data scientists and ML engineers to start building machine learning pipelines with DVC. The audience can expect a guide to using DVC for automating the pipeline. I will also explain the machine learning concepts and techniques needed to understand the pipeline. This session is designed to be accessible to everyone, including beginners. An understanding of the basic concepts of machine learning and version control systems (preferably Git) may be helpful but is not required.