
Automating machine learning workflow with DVC


Formal Metadata

Title
Automating machine learning workflow with DVC
Subtitle
What data scientists / ML engineers want to do while software engineers are busy with CI/CD.
Number of Parts
130
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Just as software engineers set up a CI/CD process as soon as they start a new project, data scientists and ML engineers should define a pipeline for data as it flows through a typical workflow: each step of the pipeline is fed data produced by its preceding step, much as a CI/CD process starts from code changes. "Pipelining an ML project" can sound misleading, as if it implied a large project with a group of engineers working on big systems, something hard for an individual and unnecessary for a small project. In fact, regardless of size, having well organized pipelines is essential for any ML project to succeed, and it can be done easily with a proper tool. In this talk, we will go through a machine learning workflow divided into a few steps composing an ML pipeline, from data ingestion to model deployment. Each step depends on data produced by the previous step, and these dependencies are controlled by DVC, an open-source version control system for data scientists and ML engineers that helps them organize data, models, and experiments in ML projects. The presentation will not only introduce how to use the tool but also show how to organize an ML pipeline with examples. The goal of this talk is to motivate data scientists and ML engineers to start building machine learning pipelines with DVC. The audience can expect a guide to using DVC for automating the pipeline, along with explanations of the machine learning concepts necessary for understanding the pipeline. This session is designed to be accessible to everyone at beginner level; an understanding of basic machine learning concepts and of a version control system (preferably Git) may be helpful but is not mandatory.
Transcript: English (auto-generated)
So, today's talk is about automating machine learning workflows with DVC. Let me introduce myself first. My name is Hongju. I live in Korea, and I work for SK Hynix as a data scientist. Some of you may not know my company, but SK Hynix is actually one of the largest memory chip makers; you would easily find some of our chips when you open up your laptop or desktop, especially if you are using a computer from Apple or Dell. My recent work interests are building knowledge graphs for supply chain management, doing that automatically, and mining software repositories. It's all machine learning work.

Today I'm going to talk about DVC, an open source tool for managing ML workflows efficiently. I will start with how software developers work well with various practices and tools, then talk about data scientists and machine learning developers, who face some challenges in aligning their work with software development practice. I think DVC can help them work more efficiently. Lastly, I will show you how DVC works with an example project.
Note that the title is "automating the ML workflow", not "automated machine learning": AutoML is an active area of its own, so be aware that this session is not about that.

Let's start with Waterfall to Agile. I don't think people ever really worked well in the waterfall way of developing software, even in the old days: designing and building software, you can practically never meet a whole set of requirements at once. I've never experienced such a case since doing my homework in a CS 101 class. Still, it's useful to remember why we work with an iterative process rather than waterfall. Since requirements always change, or are not concrete enough, we organize work into small sets of tasks, do what we can do earlier, and release features progressively until all the requirements are satisfied. And since we are not working alone, and the iterative process should run fast and efficiently, we divide our work into a few stages and try to keep moving forward without stopping. For each stage, we continuously think about how we can do the job better. People came up with methods like TDD and continuous integration / continuous deployment, and built efficient tools like Git, Maven, JUnit, and Jenkins, which help us do our jobs in an easier and more efficient way. We get so much help, even on deploying, operating, and monitoring our software, that sooner or later software development might be the easiest job in the world.
Now how about machine learning? So there are typical workflow in machine learning as well. Which are data acquisition and data pre-processing and build model and evaluation
and model selection. And lastly deployment. Although such workflows are a part of whole process of developing machine learning application, but they are relatively new and less developed. This is because data science or machine learning is different with software
development as the software development is different from the developing hardware with waterfall process. So these are the typical process of machine learning. And this is the
typical workflow in one chart. It is an iterative process starting from data acquiring and the left side. But very different from what that software developing process as it deals with data and more along with codes. Sometimes data and model takes more
important part of process with just a few lines of code. Also it is a team sport and some parts need some specialists like data acquisition and processing stage. These data engineers area
and also pre-processing and model selection is for data scientists or for machine learning engineer. And even the software developing engineers are needed for the last step, the
for this reason machine learning workflow cannot just follow software development processes. And I think there are machine learning's own three main challenges in machine learning
projects. They are burgeoning data along with code and deploy a model not a code. And lastly metric driven development. So people used to have their own burgeoning system as
you see in the screen. And later we don't know which one is the proper working version. And also data scientists should share those data but it's not easy because they usually take
so large space in storage and hard to manage. A few gigabytes or even larger data. How can we easily share them? Another problem is sometimes changes in data triggers pipelines even there's no single lines of code changes. But it's difficult to notice which part of data
has been changed. So we should keep organizing the data with its related code so that we can reproduce output at any time if the data changes. I'm sorry this line is supposed
to be the separate section as a separate challenge. I made a mistake here. Different from software development the most important and final artifact is a model but not a code. So we have to version models and keep tracking of which data and code produces the model.
Lastly machine learning is a metric driven job. A software development process starts from requirements and end with requirements. A metric is the most important milestone teaches what we
should do next for the improvements. I'll show you some example what kind of decision we can can be made for tracking the metrics at the last step. So metrics should be kept
tracking along with codes data and models. So the metric must be kept tracked and now DVC
This is where DVC comes in. DVC helps handle these challenges. There are other solutions, such as Git LFS, MLflow, and Apache Airflow, but I recommend DVC because it's easy to use: if you are familiar with Git, then using DVC alongside Git is very intuitive. It is also language independent; DVC itself is written in Python, but you can use it with C, Java, or whatever tools you want. And lastly, it is useful for anyone from an individual to a large team. Tools like MLflow and Apache Airflow need a web server to manage, but DVC is just a client-side command line tool, so you can adopt it in your project individually or share it with other members of a large team. It's easy to start with.
OK, it's time to see how DVC works, using the problem of cats-and-dogs classification. The example project trains a small VGG-style net to classify cat and dog images. You can go to the GitHub repository later; there is an instruction to build a Docker image that contains everything you need to follow the walkthrough example. Then run the image with an interactive shell and follow along; all of the following commands are meant to be run inside the Docker container.
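For orientation, the setup looks roughly like this. This is a sketch only: the repository URL and image name below are placeholders, since the real ones live in the talk's GitHub repository and are not stated in the recording.

    # placeholder URL and image name -- see the talk's repo for the real ones
    git clone https://github.com/<user>/catdog-dvc-example.git
    cd catdog-dvc-example
    docker build -t catdog-dvc .          # image with everything preinstalled
    docker run -it --rm catdog-dvc bash   # run the walkthrough inside this shell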
This is the typical directory structure I use when I'm doing a machine learning project. I put raw data and processed data under a data directory, and when I'm ready to deploy, I put the retrained, finalized model in the model directory. There is a notebook directory which I use occasionally, but mostly I just put the source code in the src directory at the bottom: I write a catdog module, and when I need to experiment with it I open a notebook, import the catdog module, and run some experiments there. There are also data-downloading scripts in a scripts directory, along with a script for deployment.
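Sketched out, the layout described above looks roughly like this (the exact directory names are my assumption, based on what is said in the talk):

    .
    ├── data
    │   ├── raw/           # downloaded images
    │   └── processed/     # output of the prepare stage
    ├── model/             # retrained, finalized model ready to deploy
    ├── notebook/          # occasional experiments importing catdog
    ├── scripts/           # download and deployment scripts
    └── src
        └── catdog/        # the actual module: preprocess, train, evaluate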
To start, we initialize a Git repository as you see on the screen, add the src directory, and make a commit. After that we run dvc init, which initializes the DVC repository inside the Git repository; you will see a .dvc directory with some files inside that organize the whole repository. We also need to add the .dvc directory to the Git repository, so that the DVC setup is versioned with Git as well, and then commit.
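In commands, that initialization is roughly (the commit messages here are mine):

    git init
    git add src
    git commit -m "add source code"
    dvc init                       # creates the .dvc/ directory
    git add .dvc
    git commit -m "initialize DVC"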
Next, there is a download shell script which downloads 25,000 images in total, half cats and half dogs, so it's pretty large. The script puts those files in a temp directory, so there is a cat directory and a dog directory with 12.5k images each.
The next step is to set up the parameters. These parameters are used for data preparation and pre-processing, and they also contain the hyperparameters for training the model. As you can see, the prep stage uses a split rate of 0.9, which splits the whole data set into training data and test data for training and evaluating the model. There is also a class size: the full data set is so large that training takes a long time, so I limited each class to 2,000 images, 4,000 animals in total stored as the training set. (If you have a GPU machine, training on the whole set would finish in a minute.) Then we have a learning rate, a batch size, the number of epochs, and a validation rate of 0.2 for the validation split.
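DVC reads such values from a params.yaml file. A sketch matching what is mentioned in the talk follows; the key names, and the hyperparameter values not stated aloud, are my assumptions:

    cat > params.yaml <<'EOF'
    prep:
      split_rate: 0.9        # 90% train / 10% test
      class_size: 2000       # cap each class at 2,000 images
    train:
      learning_rate: 0.001   # placeholder; exact value not stated in the talk
      batch_size: 32         # placeholder
      epochs: 10             # placeholder
      validation_rate: 0.2   # 20% of training data held out for validation
    EOF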
Now it's time to define the first stage of the pipeline, called prep. There is a preprocess.py file in the catdog directory which samples 4,000 of the 25k images and divides them into training data and test data; the processed data is stored under data/processed. We wire this up with the dvc run command: the -n option is the name of the stage, -p is the parameter you saw on the previous slide, -d is a dependency, and -o is an output. So the prep stage depends on preprocess.py, and its output is stored in the data/processed directory.
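Put together, the command looks roughly like this. It is a sketch using the DVC 1.x syntax current at the time (stages created with dvc run); the module invocation python -m catdog.preprocess is my assumption, since the exact command line is not shown in the transcript:

    # -n: stage name, -p: tracked parameters, -d: dependency, -o: output
    dvc run -n prep \
        -p prep.split_rate,prep.class_size \
        -d src/catdog/preprocess.py \
        -o data/processed \
        python -m catdog.preprocess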
After running dvc run with those options, we can check what has changed: three files and directories. I add them to the Git repository and commit, and now the preparation stage is being tracked.
The next step is defining the train and evaluate stages. I named this version 0.1 because it is a very simple model, just one convolutional layer and one fully connected layer; the code is in catdog/train.py. As you see in the first command, I run dvc run again with another name, train. It accepts the train parameters with the -p option, depends on data/processed (the output of the previous stage) and on the train script itself, and its output goes to the model.h5 file under data. That is the exported model file, and the stage also writes plot data to plot.json; the task itself is run by the catdog.train module.

You will see some output while the model trains. Then we define another stage named evaluate, which depends on model.h5 (the output of the previous stage) and on the evaluate script, and tracks the metric with the -m option pointing at score.json. The evaluation metric stored in score.json will be kept tracked together with the model file. I added the new files, made a commit, and tagged the version 0.1.
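A sketch of the two stage definitions just described, again with assumed paths and module names, and assuming DVC 1.x (where -p can name a whole params.yaml section and --plots declares a plots file):

    # train stage: parameters from the train section, depends on the
    # processed data and the training script, outputs the model and plot data
    dvc run -n train \
        -p train \
        -d data/processed \
        -d src/catdog/train.py \
        -o data/model.h5 \
        --plots plot.json \
        python -m catdog.train

    # evaluate stage: -m registers score.json as a DVC-tracked metrics file
    dvc run -n evaluate \
        -d data/model.h5 \
        -d src/catdog/evaluate.py \
        -m score.json \
        python -m catdog.evaluate

    git add dvc.yaml dvc.lock
    git commit -m "define train and evaluate stages"
    git tag 0.1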
Now we have defined three stages, from prep at the start to evaluate at the end. With the dvc dag command we can see the pipeline as ASCII art: train depends on the prep stage, and evaluate depends on the train stage. So when something changes in the prep stage, the whole DAG has to be reproduced; if there are only changes related to the train stage, then only train and evaluate have to run again. When nothing has changed and we try to reproduce the experiment with dvc repro, DVC sees that nothing changed in the previous stages, so nothing is done. But if I update the model, adding another convolutional layer, and run dvc repro, it detects the change in the source code and starts building the model again.
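The two commands in question:

    dvc dag      # prints the pipeline as ASCII art: prep -> train -> evaluate
    dvc repro    # re-runs only the stages whose dependencies changed;
                 # reports that the pipeline is up to date if nothing changed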
After that training finished, I tagged version 0.2, and then did the same thing once more, adding a third convolutional layer, committing, and tagging version 0.3. Now it's time to compare the metrics of each version.
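To compare the scores across the tagged versions, something like this works in the DVC release current at the time (-T reads the metrics file at every Git tag):

    dvc metrics show -T    # show score.json for each tag: 0.1, 0.2, 0.3
    dvc metrics diff 0.1   # or diff the current workspace against tag 0.1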
Looking at the accuracy, the acc score stays around 0.67 to 0.71, which says that just adding more convolutional layers is not helping the result. So I checked the training process for each experiment, and it tells us something: the training accuracy keeps going up, but the validation accuracy sometimes drops and stops increasing at epoch two or three. That is a clear sign of overfitting, so I added some regularization with dropout, ran dvc repro to do the same training job again, and, as you can see on the left part of the chart, the validation accuracy still drops occasionally but now continues to increase. I also tried data augmentation: rather than increasing the amount of data, I manipulated the existing 4,000 images, and that helped as well. Combining the two techniques, data augmentation and regularization, I could get up to around 0.78-0.80 accuracy. You can try this at home later with the walkthrough example and the slides. So, John, if there are any questions, just shoot.
Thank you very much. We are not in a room, but people are out there, so it's time for questions. Stanislav is asking: how do I recreate the data on a different machine? For code I do git clone; what does one do for data?

Oh, good question. There is actually a great feature of DVC I haven't explained in these slides: you can set up a shared cache. DVC keeps a cache inside the repository, and you can move that cache onto shared storage and share it, so that if I have trained version 0.5 and a colleague tries to train the same model, it won't even take a minute, because the contents of the shared cache just come into their DVC repository. It's amazingly fast, because it's sharing the cache.
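Setting that up takes a couple of configuration commands; the path below is an example:

    # point this project's cache at shared storage, e.g. an NFS mount
    dvc cache dir /mnt/shared/dvc-cache
    dvc config cache.shared group     # keep cache files group-writable
    dvc config cache.type symlink     # link from the cache instead of copying
    # a teammate then materializes the data straight from the shared cache
    git pull
    dvc checkout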
OK, thank you. Any other questions?
Another one: does DVC handle version control of the data itself, or must the input data always stay the same, with only the transformations versioned? DVC makes a hash of each file or directory and puts it inside the cache. DVC doesn't replace Git; rather, Git ends up managing everything through the DVC repository, because after we define a pipeline or train a new model, every output, input, and dependent file is hashed and stored in the cache, so it is all managed with Git.
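You can see that hashing mechanism with a plain dvc add (illustrative; the exact contents of the generated file vary by DVC version):

    dvc add data/raw        # hashes data/raw and moves the content into the cache
    cat data/raw.dvc        # small text file holding the md5 -- this is what Git tracks
    git add data/raw.dvc data/.gitignore
    git commit -m "track raw data with DVC"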
Would you consider DVC as an alternative to Airflow, or can those work together? (Let me repeat the question, since we need it for the recording: would you consider DVC an alternative to Airflow, or can the two work together?)

Airflow's advantage is monitoring, which DVC doesn't have, so if we want to monitor our jobs we have to use another tool; in that sense we cannot really combine Airflow with DVC. But with Jenkins, for example, we can: when it takes an hour, two hours, or a day to train a model, you can put the DVC tasks inside Jenkins jobs so that you can monitor them.

OK, perfect. Thank you very much, thank you for presenting, and thank you all for participating.