
Automating machine learning workflow with DVC


Formal Metadata

Title
Automating machine learning workflow with DVC
Subtitle
What data scientists / ML engineers want to do while software engineers are busy with CI/CD.
Number of Parts
130
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Just as software engineers set up a CI/CD process as soon as they start a new project, data scientists and ML engineers should define a pipeline for data as it flows through a typical workflow: each step of the pipeline is fed data produced by its preceding step, much as a CI/CD process starts from code changes. "Pipelining an ML project" can sound misleading, as if it implied a large project with a group of engineers working on big systems, something hard for an individual and unnecessary for a small project. In fact, regardless of size, having well organized pipelines is essential for any ML project to succeed, and it can be done easily with a proper tool. In this talk, we will go through a machine learning workflow divided into a few steps composing an ML pipeline, from data ingestion to model deployment. Each step depends on data produced by the previous step, and these dependencies are controlled by DVC, an open-source version control system for data scientists and ML engineers that helps them organize data, models, and experiments in ML projects. The presentation will not only introduce how to use the tool but also show how to organize an ML pipeline with examples. The goal of this talk is to motivate data scientists and ML engineers to start building machine learning pipelines with DVC. The audience can expect a guide to using DVC for automating the pipeline, along with explanations of the machine learning concepts necessary for understanding the pipeline. This session is designed to be accessible to everyone at beginner level; an understanding of basic machine learning concepts and of a version control system (preferably Git) may be helpful but is not mandatory.
Transcript: English (auto-generated)
So, today's talk is about automating machine learning workflows with DVC. Let me introduce myself first. My name is Hongju. I live in Korea, and I work for SK Hynix as a data scientist. Some of you may not know my company, but SK Hynix is actually one of the largest memory chip makers; you would easily find some of our chips when you open up your laptop or desktop, especially if you are using a computer from Apple or Dell. My recent work interests are building knowledge graphs for supply chain management, doing that automatically, and mining software repositories. It's all machine learning work.

Today I'm going to talk about DVC, an open source tool for managing ML workflows efficiently. I will start with how software developers work well with various practices and tools, then talk about data scientists and machine learning developers, who face some challenges in aligning their work with software development practice. I think DVC can help them work more efficiently. Lastly, I will show you how DVC works with an example project.
Note that the title is "automating the ML workflow", not "automated machine learning": AutoML is an active area of its own, so be aware that this session is not about that.

Let's start with Waterfall to Agile. I don't think people ever really worked well in the waterfall way of developing software, even in the old days: designing and building software, you can practically never meet a whole set of requirements at once. I've never experienced such a case since doing my homework in a CS 101 class. Still, it's useful to remember why we work with an iterative process rather than waterfall. Since requirements always change, or are not concrete enough, we organize work into small sets of tasks, do what we can do earlier, and release features progressively until all the requirements are satisfied. And since we are not working alone, and the iterative process should run fast and efficiently, we divide our work into a few stages and try to keep moving forward without stopping. For each stage, we continuously think about how we can do the job better. People came up with methods like TDD and continuous integration / continuous deployment, and built efficient tools like Git, Maven, JUnit, and Jenkins, which help us do our jobs in an easier and more efficient way. We get so much help, even on deploying, operating, and monitoring our software, that sooner or later software development might be the easiest job in the world.
Now how about machine learning? So there are typical workflow in machine learning as well. Which are data acquisition and data pre-processing and build model and evaluation
and model selection. And lastly deployment. Although such workflows are a part of whole process of developing machine learning application, but they are relatively new and less developed. This is because data science or machine learning is different with software
development as the software development is different from the developing hardware with waterfall process. So these are the typical process of machine learning. And this is the
typical workflow in one chart. It is an iterative process starting from data acquiring and the left side. But very different from what that software developing process as it deals with data and more along with codes. Sometimes data and model takes more
important part of process with just a few lines of code. Also it is a team sport and some parts need some specialists like data acquisition and processing stage. These data engineers area
and also pre-processing and model selection is for data scientists or for machine learning engineer. And even the software developing engineers are needed for the last step, the
for this reason machine learning workflow cannot just follow software development processes. And I think there are machine learning's own three main challenges in machine learning
projects. They are burgeoning data along with code and deploy a model not a code. And lastly metric driven development. So people used to have their own burgeoning system as
you see in the screen. And later we don't know which one is the proper working version. And also data scientists should share those data but it's not easy because they usually take
so large space in storage and hard to manage. A few gigabytes or even larger data. How can we easily share them? Another problem is sometimes changes in data triggers pipelines even there's no single lines of code changes. But it's difficult to notice which part of data
has been changed. So we should keep organizing the data with its related code so that we can reproduce output at any time if the data changes. I'm sorry this line is supposed
to be the separate section as a separate challenge. I made a mistake here. Different from software development the most important and final artifact is a model but not a code. So we have to version models and keep tracking of which data and code produces the model.
Lastly machine learning is a metric driven job. A software development process starts from requirements and end with requirements. A metric is the most important milestone teaches what we
should do next for the improvements. I'll show you some example what kind of decision we can can be made for tracking the metrics at the last step. So metrics should be kept
tracking along with codes data and models. So the metric must be kept tracked and now DVC
This is where DVC comes in. DVC helps handle these challenges. There are other solutions, such as Git LFS, MLflow, and Apache Airflow, but I recommend DVC because it's easy to use: if you are familiar with Git, then using DVC alongside Git is very intuitive. It is also language independent; DVC itself is written in Python, but you can use it with C, Java, or whatever tools you want. And lastly, it is useful for anyone from an individual to a large team. Tools like MLflow and Apache Airflow need a web server to manage, but DVC is just a client-side command line tool, so you can adopt it in your project individually or share it with other members of a large team. It's easy to start with.
OK, it's time to see how DVC works, using the problem of cats-and-dogs classification. The example project trains a small VGG-style net to classify cat and dog images. You can go to the GitHub repository later; there is an instruction to build a Docker image that contains everything you need to follow the walkthrough example. Then run the image with an interactive shell and follow along; all of the following commands are meant to be run inside the Docker container.
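For orientation, the setup looks roughly like this. This is a sketch only: the repository URL and image name below are placeholders, since the real ones live in the talk's GitHub repository and are not stated in the recording.

    # placeholder URL and image name -- see the talk's repo for the real ones
    git clone https://github.com/<user>/catdog-dvc-example.git
    cd catdog-dvc-example
    docker build -t catdog-dvc .          # image with everything preinstalled
    docker run -it --rm catdog-dvc bash   # run the walkthrough inside this shell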
This is the typical directory structure I use when I'm doing a machine learning project. I put raw data and processed data under a data directory, and when I'm ready to deploy, I put the retrained, finalized model in the model directory. There is a notebook directory which I use occasionally, but mostly I just put the source code in the src directory at the bottom: I write a catdog module, and when I need to experiment with it I open a notebook, import the catdog module, and run some experiments there. There are also data-downloading scripts in a scripts directory, along with a script for deployment.
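Sketched out, the layout described above looks roughly like this (the exact directory names are my assumption, based on what is said in the talk):

    .
    ├── data
    │   ├── raw/           # downloaded images
    │   └── processed/     # output of the prepare stage
    ├── model/             # retrained, finalized model ready to deploy
    ├── notebook/          # occasional experiments importing catdog
    ├── scripts/           # download and deployment scripts
    └── src
        └── catdog/        # the actual module: preprocess, train, evaluate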
To start, we initialize a Git repository as you see on the screen, add the src directory, and make a commit. After that we run dvc init, which initializes the DVC repository inside the Git repository; you will see a .dvc directory with some files inside that organize the whole repository. We also need to add the .dvc directory to the Git repository, so that the DVC setup is versioned with Git as well, and then commit.
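In commands, that initialization is roughly (the commit messages here are mine):

    git init
    git add src
    git commit -m "add source code"
    dvc init                       # creates the .dvc/ directory
    git add .dvc
    git commit -m "initialize DVC"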
Next, there is a download shell script which downloads 25,000 images in total, half cats and half dogs, so it's pretty large. The script puts those files in a temp directory, so there is a cat directory and a dog directory with 12.5k images each.
The next step is to set up the parameters. These parameters are used for data preparation and pre-processing, and they also contain the hyperparameters for training the model. As you can see, the prep stage uses a split rate of 0.9, which splits the whole data set into training data and test data for training and evaluating the model. There is also a class size: the full data set is so large that training takes a long time, so I limited each class to 2,000 images, 4,000 animals in total stored as the training set. (If you have a GPU machine, training on the whole set would finish in a minute.) Then we have a learning rate, a batch size, the number of epochs, and a validation rate of 0.2 for the validation split.
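DVC reads such values from a params.yaml file. A sketch matching what is mentioned in the talk follows; the key names, and the hyperparameter values not stated aloud, are my assumptions:

    cat > params.yaml <<'EOF'
    prep:
      split_rate: 0.9        # 90% train / 10% test
      class_size: 2000       # cap each class at 2,000 images
    train:
      learning_rate: 0.001   # placeholder; exact value not stated in the talk
      batch_size: 32         # placeholder
      epochs: 10             # placeholder
      validation_rate: 0.2   # 20% of training data held out for validation
    EOF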
Now it's time to define the first stage of the pipeline, called prep. There is a preprocess.py file in the catdog directory which samples 4,000 of the 25k images and divides them into training data and test data; the processed data is stored under data/processed. We wire this up with the dvc run command: the -n option is the name of the stage, -p is the parameter you saw on the previous slide, -d is a dependency, and -o is an output. So the prep stage depends on preprocess.py, and its output is stored in the data/processed directory.
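Put together, the command looks roughly like this. It is a sketch using the DVC 1.x syntax current at the time (stages created with dvc run); the module invocation python -m catdog.preprocess is my assumption, since the exact command line is not shown in the transcript:

    # -n: stage name, -p: tracked parameters, -d: dependency, -o: output
    dvc run -n prep \
        -p prep.split_rate,prep.class_size \
        -d src/catdog/preprocess.py \
        -o data/processed \
        python -m catdog.preprocess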
After running dvc run with those options, we can check what has changed: three files and directories. I add them to the Git repository and commit, and now the preparation stage is being tracked.
The next step is defining the train and evaluate stages. I named this version 0.1 because it is a very simple model, just one convolutional layer and one fully connected layer; the code is in catdog/train.py. As you see in the first command, I run dvc run again with another name, train. It accepts the train parameters with the -p option, depends on data/processed (the output of the previous stage) and on the train script itself, and its output goes to the model.h5 file under data. That is the exported model file, and the stage also writes plot data to plot.json; the task itself is run by the catdog.train module.

You will see some output while the model trains. Then we define another stage named evaluate, which depends on model.h5 (the output of the previous stage) and on the evaluate script, and tracks the metric with the -m option pointing at score.json. The evaluation metric stored in score.json will be kept tracked together with the model file. I added the new files, made a commit, and tagged the version 0.1.
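A sketch of the two stage definitions just described, again with assumed paths and module names, and assuming DVC 1.x (where -p can name a whole params.yaml section and --plots declares a plots file):

    # train stage: parameters from the train section, depends on the
    # processed data and the training script, outputs the model and plot data
    dvc run -n train \
        -p train \
        -d data/processed \
        -d src/catdog/train.py \
        -o data/model.h5 \
        --plots plot.json \
        python -m catdog.train

    # evaluate stage: -m registers score.json as a DVC-tracked metrics file
    dvc run -n evaluate \
        -d data/model.h5 \
        -d src/catdog/evaluate.py \
        -m score.json \
        python -m catdog.evaluate

    git add dvc.yaml dvc.lock
    git commit -m "define train and evaluate stages"
    git tag 0.1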
Now we have defined three stages, from prep at the start to evaluate at the end. With the dvc dag command we can see the pipeline as ASCII art: train depends on the prep stage, and evaluate depends on the train stage. So when something changes in the prep stage, the whole DAG has to be reproduced; if there are only changes related to the train stage, then only train and evaluate have to run again. When nothing has changed and we try to reproduce the experiment with dvc repro, DVC sees that nothing changed in the previous stages, so nothing is done. But if I update the model, adding another convolutional layer, and run dvc repro, it detects the change in the source code and starts building the model again.
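The two commands in question:

    dvc dag      # prints the pipeline as ASCII art: prep -> train -> evaluate
    dvc repro    # re-runs only the stages whose dependencies changed;
                 # reports that the pipeline is up to date if nothing changed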
After that training finished, I tagged version 0.2, and then did the same thing once more, adding a third convolutional layer, committing, and tagging version 0.3. Now it's time to compare the metrics of each version.
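To compare the scores across the tagged versions, something like this works in the DVC release current at the time (-T reads the metrics file at every Git tag):

    dvc metrics show -T    # show score.json for each tag: 0.1, 0.2, 0.3
    dvc metrics diff 0.1   # or diff the current workspace against tag 0.1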
Looking at the accuracy, the acc score stays around 0.67 to 0.71, which says that just adding more convolutional layers is not helping the result. So I checked the training process for each experiment, and it tells us something: the training accuracy keeps going up, but the validation accuracy sometimes drops and stops increasing at epoch two or three. That is a clear sign of overfitting, so I added some regularization with dropout, ran dvc repro to do the same training job again, and, as you can see on the left part of the chart, the validation accuracy still drops occasionally but now continues to increase. I also tried data augmentation: rather than increasing the amount of data, I manipulated the existing 4,000 images, and that helped as well. Combining the two techniques, data augmentation and regularization, I could get up to around 0.78-0.80 accuracy. You can try this at home later with the walkthrough example and the slides. So, John, if there are any questions, just shoot.
Thank you very much. We are not in a room, but people are out there, so it's time for questions. Stanislav is asking: how do I recreate the data on a different machine? For code I do git clone; what does one do for data?

Oh, good question. There is actually a great feature of DVC I haven't explained in these slides: you can set up a shared cache. DVC keeps a cache inside the repository, and you can move that cache onto shared storage and share it, so that if I have trained version 0.5 and a colleague tries to train the same model, it won't even take a minute, because the contents of the shared cache just come into their DVC repository. It's amazingly fast, because it's sharing the cache.
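Setting that up takes a couple of configuration commands; the path below is an example:

    # point this project's cache at shared storage, e.g. an NFS mount
    dvc cache dir /mnt/shared/dvc-cache
    dvc config cache.shared group     # keep cache files group-writable
    dvc config cache.type symlink     # link from the cache instead of copying
    # a teammate then materializes the data straight from the shared cache
    git pull
    dvc checkout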
OK, thank you. Any other questions?
Another one: does DVC handle version control of the data itself, or must the input data always stay the same, with only the transformations versioned? DVC makes a hash of each file or directory and puts it inside the cache. DVC doesn't replace Git; rather, Git ends up managing everything through the DVC repository, because after we define a pipeline or train a new model, every output, input, and dependent file is hashed and stored in the cache, so it is all managed with Git.
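You can see that hashing mechanism with a plain dvc add (illustrative; the exact contents of the generated file vary by DVC version):

    dvc add data/raw        # hashes data/raw and moves the content into the cache
    cat data/raw.dvc        # small text file holding the md5 -- this is what Git tracks
    git add data/raw.dvc data/.gitignore
    git commit -m "track raw data with DVC"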
Would you consider DVC as an alternative to Airflow, or can those work together? (Let me repeat the question, since we need it for the recording: would you consider DVC an alternative to Airflow, or can the two work together?)

Airflow's advantage is monitoring, which DVC doesn't have, so if we want to monitor our jobs we have to use another tool; in that sense we cannot really combine Airflow with DVC. But with Jenkins, for example, we can: when it takes an hour, two hours, or a day to train a model, you can put the DVC tasks inside Jenkins jobs so that you can monitor them.

OK, perfect. Thank you very much, thank you for presenting, and thank you all for participating.