Automating machine learning workflow with DVC
Formal Metadata

Title: Automating machine learning workflow with DVC
Series: EuroPython 2020 (Part 80 of 130)
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/49923 (DOI)
Language: English
Transcript: English (auto-generated)
00:06
So today's talk is about automating the machine learning workflow with DVC. Let me introduce myself first. My name is Hongju. I live in Korea, and I work for
00:22
SK Hynix as a data scientist. Some of you may not know my company, but SK Hynix is actually one of the largest memory chip makers, and you can easily find our chips when you open up your
00:43
laptop or desktop, especially if you are using a computer from Apple or Dell. My recent work interests are building knowledge graphs, automating supply chain management, and mining software repositories.
01:07
It is all machine learning work. Today I'm going to talk about DVC, an open-source tool for managing ML workflows efficiently. First, I will start with how software
01:31
developers work well with various practices and tools, then talk about the challenges data scientists and machine learning developers face when adapting their
01:44
work to software development practices. I think DVC can help them work more efficiently. Lastly, I will show you how to use DVC and how it works with an example project.
02:04
Note that the title is automating the ML workflow, not automated machine learning itself. AutoML is an active research area of its own, so please be aware that this session
02:23
is not about AutoML. Let's start with Waterfall to Agile. I don't think people ever really developed software in a pure Waterfall way,
02:43
even in the old days: designing and building software in one pass could never meet a full set of requirements at once. I've never experienced such a case since doing my homework in a CS 101 class. Still, it's a useful reminder that we should work with an iterative process rather than Waterfall.
03:08
Since requirements are always changing or not concrete enough, we organize a small set of tasks, do what we can do early, and release features in a progressive
03:20
way until all the requirements are satisfied. And since we are not working alone and the iterative process should run fast and efficiently, we divide our work
03:42
into a few stages and try to keep moving forward without stopping. For each stage, we continuously think about how we can do the job better.
04:02
Some people started to talk about methods like TDD and continuous integration or continuous deployment, and some developed efficient tools such as Git, Maven, JUnit, and Jenkins. Those tools help us do our job in an easier and more efficient way.
04:30
We have so much help even with deploying, operating, and monitoring our software; maybe sooner or later software development will be the easiest job in the world.
04:49
Now, how about machine learning? There is a typical workflow in machine learning as well: data acquisition, data pre-processing, model building, evaluation
05:07
and model selection, and lastly deployment. Although this workflow is part of the whole process of developing a machine learning application, its practices are relatively new and less developed. This is because data science and machine learning differ from software
05:25
development, just as software development differs from developing hardware with a waterfall process. So these are the typical steps of machine learning, and this is the
05:42
typical workflow in one chart. It is an iterative process starting from data acquisition on the left side, but it is very different from the software development process because it deals with data and models along with code. Sometimes the data and the model take up a more
06:06
important part of the process than the few lines of code involved. It is also a team sport, and some parts need specialists: the data acquisition and processing stages are the data engineers' area,
06:22
while pre-processing and model selection are for data scientists or machine learning engineers. Even software engineers are needed for the last step, deployment.
06:48
For this reason, the machine learning workflow cannot simply follow software development processes. I think there are three main challenges specific to machine learning
07:01
projects: versioning data along with code, deploying a model rather than code, and lastly metric-driven development. People tend to have their own ad-hoc versioning scheme, as
07:23
you see on the screen, and later we don't know which one is the proper working version. Data scientists also need to share that data, but it's not easy because datasets usually take
07:40
up so much storage space and are hard to manage: a few gigabytes or even more. How can we easily share them? Another problem is that changes in data sometimes trigger pipelines even when not a single line of code has changed, but it's difficult to notice which part of the data
08:05
has changed. So we should keep the data organized with its related code so that we can reproduce the output at any time if the data changes. Sorry, this line is supposed
08:24
to be a separate section, as a separate challenge; I made a mistake here. Unlike in software development, the most important and final artifact is a model, not code. So we have to version models and keep track of which data and code produced each model.
08:49
Lastly, machine learning is a metric-driven job. A software development process starts with requirements and ends with requirements; in machine learning, a metric is the most important milestone, and it teaches us what we
09:06
should do next to improve. I'll show you an example of what kind of decisions can be made by tracking metrics at the last step. So metrics should be
09:22
tracked along with code, data, and models. The metrics must be kept under tracking, and now DVC
09:40
comes in. DVC helps handle these challenges. There are other solutions such as GitFS, MLflow, and Apache Airflow, but I recommend DVC because it's easy to use. If you are familiar
10:02
with Git, then using DVC alongside Git is very intuitive. It's also language independent: even though DVC itself is written in Python, you can use it with C
10:27
or Java or whatever other tools you want. And lastly, it is useful for everyone from an individual to a large team.
10:42
Other tools like MLflow and Apache Airflow need a managed web server, but DVC is just a client-side command-line tool. You can adopt it in your project individually, or you can share it with other members of a large team. So it's easy to start with.
11:12
Okay, it's time to see how DVC works using a cats-and-dogs classification problem.
11:23
This example project trains a small VGG net to classify cat and dog images. If you go to the GitHub repository later, there are instructions for building a Docker image that contains everything you need to follow the walkthrough
11:46
example. Then start a container from that image with a bash shell and follow the commands; all the following commands should be run inside the Docker container.
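As a sketch, the setup might look something like this; the repository URL and image name here are hypothetical, so check the talk's GitHub repository for the exact instructions:

    $ git clone https://github.com/<author>/catdog-dvc-example.git   # hypothetical URL
    $ cd catdog-dvc-example
    $ docker build -t catdog-dvc .          # image with everything the walkthrough needs
    $ docker run -it --rm catdog-dvc bash   # drop into a bash shell inside the container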
12:04
This is the typical directory structure I work with when I'm doing a machine learning project. I put raw data and processed data under the data directory, and lastly,
12:23
when I'm ready to deploy the model, I put the retrained, finalized model in the notebook directory. I actually use that directory only occasionally; mostly I just
12:54
put the source code in the source directory at the bottom. I make a catdog
13:02
module there, and when I need to experiment with it, I open a notebook, import the catdog module, test it, and run some experiments. There are also data-download scripts and a deployment script in the scripts directory.
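Roughly, the layout just described looks like this (a sketch; names are approximate):

    .
    ├── data/
    │   ├── raw/            # downloaded images
    │   └── processed/      # output of the prepare stage
    ├── notebooks/          # occasional experiments, finalized models
    ├── scripts/            # data-download and deployment scripts
    └── src/
        └── catdog/         # the cat-dog Python module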
13:25
To start, we need to initialize the Git repository as you see on the screen, add the source directory, and make some commits.
13:50
After that, we do the same thing with dvc init, which initializes the DVC repository inside the Git repository. You can see a .dvc directory with some files inside it
14:08
that organizes the whole repository. We also need to add the .dvc directory to the Git index, so that we can track the DVC state with Git as well.
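Put together, the initialization steps described so far look roughly like this:

    $ git init
    $ git add src/
    $ git commit -m "Add source code"
    $ dvc init          # creates the .dvc/ directory inside the Git repo
    $ git add .dvc/     # index DVC's own files so Git tracks them too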
14:30
Lastly, we commit the DVC setup with a Git commit command. There is also a download shell script which downloads 25,000 images in total,
14:49
half cats and half dogs; it's pretty large. The script puts those files in a temp directory, so there is a cat directory and a dog directory with 12.5k images each.
15:12
The next step is to set a set of parameters. Those parameters are used for data preparation and pre-processing, and they include some hyperparameters
15:26
for training a model. As you see, in the prep stage we use a split rate of 0.9, which splits
15:42
the whole dataset into training data and test data for training a model and evaluating it. As for the class size: the full dataset is too large, so training takes a long time.
16:07
So I limited each class to 2,000 images, for 4,000 animals in total stored as the training dataset. If you have a GPU machine,
16:26
training on the whole training set will finish in a minute. Then we have a learning rate, a batch size, the number of epochs, and a validation rate of 0.2 for the validation step.
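As a sketch, the params.yaml might look like this; the split rate, class size, and validation rate come from the talk, while the learning rate, batch size, and epoch values are placeholder assumptions:

    $ cat params.yaml
    prep:
      split: 0.9              # 90% train / 10% test
      class_size: 2000        # images sampled per class
    train:
      lr: 0.001               # hypothetical value
      batch_size: 32          # hypothetical value
      epochs: 10              # hypothetical value
      validation_split: 0.2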
16:52
Now it's time to define the first stage of the pipeline, which is called prepare. There is a preprocess.py file in the catdog directory which divides the images into
17:14
training data and test data, sampling 4,000 images in total out of the 25k.
17:28
The processed data is stored in data/processed by the Python command. As you see in the options, -n is the name of the stage, -p is the
17:45
parameter you saw on the previous slide, and the -d option is a dependency. So the prep stage depends on preprocess.py, and the output
18:06
is stored in the data/processed directory. After running the dvc run command with these options, we can check which files or directories have changed.
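A sketch of the stage definition with approximate paths (the exact command is in the example repository):

    $ dvc run -n prep \
          -p prep.split,prep.class_size \
          -d src/catdog/preprocess.py -d /tmp/images \
          -o data/processed \
          python -m catdog.preprocess
    $ git status    # see which files DVC created or modified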
18:28
Three directories and files have changed, so I add them to the Git repository and commit. Now we have started tracking the preparation stage.
18:51
The next step is defining the train and evaluate stages. I named this version 0.1 because
19:02
I just used one convolutional layer and one fully connected layer, a very simple model. That code is written in catdog/train.py. As you see in the first command,
19:21
I run dvc run again with another name, train. It accepts the train parameters with the -p option and depends on data/processed, which was the output of the previous stage, as well as on the train script itself; the output goes to data as a model.h5 file.
19:47
That is the exported model file. The stage also writes the plot data to plot.json, and the task itself is run via the catdog.train module.
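A sketch of the train stage under the same assumptions (DVC 1.x flag spellings; paths approximate):

    $ dvc run -n train \
          -p train.lr,train.batch_size,train.epochs,train.validation_split \
          -d data/processed -d src/catdog/train.py \
          -o data/model.h5 \
          --plots plot.json \
          python -m catdog.train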
20:07
You will see some output while the model trains, and then we define another stage named evaluate, which depends on the model.h5 that was the output of the
20:28
previous stage and also on the evaluate script. It tracks the metrics with the -m option pointing at score.json, so the evaluation metric will be stored in score.json
20:49
and tracked together with the model; I mean, the metric information stored in score.json will be kept in sync with the model file.
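And a sketch of the evaluate stage, with -m marking score.json as a metrics file:

    $ dvc run -n evaluate \
          -d data/model.h5 -d src/catdog/evaluate.py \
          -m score.json \
          python -m catdog.evaluate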
21:10
Then I added some more files, made a commit, and tagged the version as 0.1. Now we have defined three stages,
21:21
starting with prep and ending with evaluate. With the dvc dag command we can see the pipeline as ASCII art: train depends on the prep stage, and evaluate depends on the train stage.
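The output looks roughly like this (approximate rendering):

    $ dvc dag
    +------+
    | prep |
    +------+
        *
    +-------+
    | train |
    +-------+
        *
    +----------+
    | evaluate |
    +----------+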
21:46
When something changes in the prep stage, the whole DAG has to be reproduced; if your changes only affect the train stage, then only the train and evaluate stages have to run again. When nothing has changed and we try to reproduce the experiment with the
22:06
dvc repro command, you can see that nothing changed in the previous stages, so nothing has to be done. But if I update the model by adding another convolutional layer
22:23
and run dvc repro, it detects the change in the source code and starts building the model again. After the training finished, I tagged it 0.2
22:41
as another version, then did the same thing: adding a third convolutional layer, tagging version 0.3, and committing. Now it's time to compare the metrics for each version. Regarding accuracy, as you can see, the acc score stays around 0.67 to 0.71.
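One way to make that comparison is dvc metrics show with the -T flag, which lists the tracked metric for every Git tag; the numbers below are illustrative, within the range quoted in the talk:

    $ dvc metrics show -T
    v0.1:
        score.json:
            acc: 0.67
    v0.2:
        score.json:
            acc: 0.69
    v0.3:
        score.json:
            acc: 0.71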
23:11
That tells us that just adding more convolutional layers is not improving the result,
23:22
so I checked the training process for each experiment, and it tells us something. As you see, the training accuracy keeps going up, but the validation accuracy sometimes drops and stops improving at epoch two or three.
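Since plot.json is registered as a plots file, one way to inspect those training curves across versions is dvc plots diff, which renders the curves for two revisions into an HTML file (a hedged sketch):

    $ dvc plots diff v0.1 v0.3 --targets plot.json
    file:///.../plots.html      # open in a browser to compare the curves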
23:42
That is a clear sign of overfitting, so I added some regularization with dropout
24:05
and then ran dvc repro to do the same training job again. As the left part of the chart shows, sometimes
24:26
the validation accuracy still drops, but overall it continues to increase. I also tried data augmentation:
24:43
rather than increasing the size of the dataset, I manipulated the existing 4,000 images, and that helped as well. Later, combining the data augmentation and regularization
25:04
techniques, I got the accuracy up to 0.78–0.80. Maybe later you can try this at home with the walkthrough example and the slides. So, John, if there are any questions, just shoot.
25:32
Thank you very much. We are not in a room together, but people are out there, so it's time for questions.
25:53
Stanislav is asking: how do I recreate the data on a different machine? For code I do git clone;
26:03
what does one do for data? So you can check out the code with the repo. How do I recreate the data on a different machine? Oh, good question. I haven't explained this great feature of DVC in the slides, but you can
26:26
make a shared cache. DVC keeps a cache inside the repository, but you can move that cache onto shared storage and share it, so that if I have trained version 0.5 and a teammate tries to
26:50
train the same model, it won't even take a minute, because the shared cache is linked straight into their DVC repository. It's amazingly fast because it reuses the shared cache.
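A sketch of that setup, with a hypothetical shared path (the talk only describes the idea):

    $ dvc cache dir /mnt/shared/dvc-cache   # hypothetical shared location
    $ dvc config cache.shared group         # make cache files group-writable
    $ dvc config cache.type symlink         # link from the cache instead of copying
    $ git add .dvc/config && git commit -m "Use shared DVC cache"
    # A teammate who clones the repo then materializes data straight from the cache:
    $ dvc checkout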
27:07
Okay, thank you. Any other questions?
27:21
Oh, another one: does DVC handle version control on the data, or must the input data always stay the same while we just version the transformations? DVC actually makes a hash of each file or directory and puts it inside the cache,
27:45
so DVC itself doesn't do anything with Git, but Git manages everything in the DVC repository, because the DVC repository keeps being versioned: after we define a pipeline or
28:09
train a new model, every output, input, and dependency file is hashed and stored in the cache, so it all ends up managed with Git.
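For illustration, a hashed stage entry in dvc.lock might look something like this; the exact layout varies by DVC version, and the hashes are placeholders:

    $ cat dvc.lock
    prep:
      cmd: python -m catdog.preprocess
      deps:
      - path: src/catdog/preprocess.py
        md5: 1a2b3c...          # hash of the script
      outs:
      - path: data/processed
        md5: 4d5e6f....dir      # hash of the whole output directory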
28:25
Would you consider DVC as an alternative to Airflow, or can they work together? Let me repeat the question, please, because we need it for the recording. Okay: would you consider DVC as an alternative to Airflow, or can they work together?
28:46
Airflow usually has an advantage in monitoring, which DVC doesn't have, so if we want to monitor our jobs, we have to use another tool. In that sense we cannot replace
29:03
Airflow with DVC alone. But, for example, we can use Jenkins together with DVC to monitor our jobs: when it takes one hour, two hours, or a day to train a model, you can put
29:21
those DVC tasks inside Jenkins so that you can monitor the job. Okay, perfect. Thank you very much for presenting, and thank you all for participating.