Managing complex data science experiment configurations with Hydra
Formal Metadata
Title: Managing complex data science experiment configurations with Hydra
Title of Series: EuroPython 2022
Number of Parts: 112
License: CC Attribution - NonCommercial - ShareAlike 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/60824 (DOI)
EuroPython 2022, talk 59 of 112
Transcript: English (auto-generated)
00:06
So, I'm going to talk about two tools, but primarily Hydra. Myself, I work at Intel, but Hydra was developed at Facebook Research, so it's from Meta.
00:21
And I'm also going to mention a tool called MLflow, which is now curated by the Linux Foundation. So, what is Hydra and what are we trying to solve here? So, when we're doing data science experiments, or machine learning experiments, we have...
00:49
You run into the problem that you have quite a few things to keep track of. You have your dataset, you have your model, you have your algorithms, parameters, optimizers, and so on.
01:02
So, even though the Pythonic way of writing things is to keep interfaces very simple, with data science experiments we end up having a lot of parameters to keep track of, sometimes dozens of different parameters. And this is actually science. We're doing science, meaning that we're running experiments whose outcome we don't really know.
01:24
We're making an educated guess that this set of parameters will be the best one, but we really have to test. And only after we train our model and test our model do we know the answer, whether this is the right set of parameters, or whether another set of parameters might be better.
01:40
So, because of the number of experiments we have to make, we have to keep track of all the different combinations of parameters that we tried,
02:06
in order to explain what worked best, and in order to reproduce it. So, these problems, managing complex configurations, making them traceable and reproducible, is what we're trying to solve here. And one of the solutions is this application framework called Hydra.
02:24
So, I'm going to tell you how you can use it. So, we're going to start with a simple example. So, this is like a minimal version of a Hydra-based application. In reality, it's a toy example. It doesn't do anything other than log a hello message,
02:45
and the parameters of an object we called model in the configuration here. But, I hope you can appreciate how minimal the boilerplate that Hydra adds to this example is.
03:00
Because really what Hydra is adding is just this decorator here, and it's going to inject this config parameter into the main experiment method. So, what does this code do? Well, if you run it, it's going to log the contents of a configuration file.
03:27
So, you need to have this configuration file as well. With Hydra, you use YAML files, and if you have a model and its parameters defined like this, then when you run this very simple application, what it's going to print out is your hello
03:43
and the parameters of the model from the configuration file. So far, very straightforward. What it did was parse the YAML file and pass it in as the configuration. So, that's the basics of what you would expect it to do. But immediately, out of the box, you get a couple of other features.
04:03
So, the first thing you get is that you can ask your very simple application for some help. That will inspect the configuration objects that are available and list them for you. So, you can inspect the different parameters of your application.
04:26
And it tells you here that you can override anything in the config with a dotted path and a value. And so, you can actually do it this way. You can run your script again saying that model.a equals a different value,
04:45
and now it will actually equal this other value when you run it. So, this way, you can inspect your configuration parameters, you can set your configuration parameters at the command line without having written much other than adding this Hydra decorator to your script.
05:06
But another thing that it does is that every time you run your Hydra-based script, it produces a directory of output files. By default, this directory is named after the date and time at which you started your script.
05:26
The main log of your application is stored in a file named after your application, with a .log extension. But there's also a .hydra directory which contains three YAML files.
05:41
And the config.yaml file is actually the effective final configuration, the product of all the overrides that you did, or maybe of a configuration file that you provided. hydra.yaml contains all the settings of Hydra that were used during this run,
06:02
and overrides.yaml lists the values that you overrode at the command line. And if you want to, you can actually reuse this config.yaml file, this effective configuration that was recorded during the run. To do this, you just specify that Hydra should look in a configuration directory
06:23
from your output .hydra and then use the configuration there. And when you run this, you will get your stored configuration and you're essentially able to reproduce the previous run. So thanks to this, you get traceability and reproducibility of your experiment,
06:43
just by serializing the configuration that was used into a YAML file that you can then use as input to your next experiment run. So those are nice features of Hydra. One other nice feature is that it actually comes with command line completion,
07:05
so you get tab completion for your parameters as well. You need to execute this eval in your shell, which will generate code for shell completion. And then when you're on your terminal
07:22
and you press tab twice after the model, it will suggest the different parameters of the model that you can set. So that's very useful for being able to quickly set a couple of parameters that you want for your next experiment.
07:43
Another cool feature of Hydra is multi-run, and that allows you to run your experiment with a set of different parameters. So for example here, we're telling our script
08:01
that it should launch in multi-run mode and that it should try model.a values 1 and 3 and model.b values 2 and 4. And it will make all the combinations of these values and start four different jobs, running each of them in sequence by default.
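The grid that the basic sweeper builds from these comma-separated overrides can be pictured with a plain Cartesian product. This is just an illustration of the combinatorics, not Hydra's actual implementation:

```python
from itertools import product

# Override lists as given on the command line: model.a=1,3 model.b=2,4
sweep = {"model.a": [1, 3], "model.b": [2, 4]}

# One job per combination of values (Cartesian product): four jobs here.
jobs = [dict(zip(sweep, values)) for values in product(*sweep.values())]

for job in jobs:
    print(job)
```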
08:25
And you will then get the results in the output directory of each of those runs, and you can compare them and decide which one is best. So you can sweep through a lot of different parameter values this way. So I hope that you're getting the impression
08:43
that with not a lot of additional code added to your experiment, you're getting a lot of features. So that's what Hydra does. Now I want to talk about how it works and how its internals are organized. So the components of Hydra are OmegaConf,
09:04
which I will get to in a second. It also sets up Python logging for you. There are launchers that start your jobs, and sweepers that scan the different parameter spaces that you define by providing the names, the values, or ranges of values for each parameter.
09:23
And it's also a modular architecture that can be extended with plugins. So OmegaConf is actually a separate package: the YAML configuration manager that Hydra is based on. It was created by the same main author, Omri,
09:44
who created Hydra initially. And you can of course install it from PyPI. What does OmegaConf do? What comes from OmegaConf? Well, it's the parsing of the YAML configuration files. So you can use OmegaConf to load up your YAML files.
10:03
And the object that is created when you load the configuration is quite flexible. So you can read and set your values either as if it were an object, or as if it were a nested dictionary,
10:22
or even just with a dotted path using the OmegaConf.select function. So it's quite flexible, quite easy to use the configuration after it's loaded. There's also a nice feature called variable interpolation, which means that inside of your YAML configuration you can use some values from the configuration
10:46
to define other values of your configuration. So it works a little bit like in the shell with a dollar and curly braces. You can use a value defined in the configuration as part of a value for another configuration.
11:09
So if you were to define these three, foo, bar, and baz, with baz using the previous two, you would actually get the combined value, "Hello, EuroPython", there.
11:22
Another useful feature of OmegaConf is called resolvers. Resolvers allow you to add functions that take these configuration values, so inside of the YAML file you can actually have a little bit of logic that will combine the other values, or...
11:46
Well, in this toy example here we defined addition. And it's very easy to add your own resolvers. You essentially just call the OmegaConf register_new_resolver function and define whatever logic you need.
12:01
So these configuration YAML files are quite flexible and allow you to do quite a lot just in the YAML files themselves, thanks to OmegaConf. So that's the basis of Hydra, what Hydra is built on.
12:24
But what does Hydra itself add? Well, Hydra itself adds all these other features. It's an application development framework with minimal boilerplate and it's focused on managing all these configurations.
12:43
So the first thing that Hydra adds is the ability to compose these configurations from smaller files. So the idea is that you don't want to have a huge configuration file for your huge experiment
13:03
but rather split it up into smaller sections and each section can then define a different model or a different data set or a different optimizer, whatever object that has a set of configuration values that work together
13:22
can be in a separate file, like in the source code it's defined in a separate file. So you split up big configurations into small files and then you combine them back again through composition. Hydra has interesting ways of thinking about these configurations.
13:42
For example, each directory of YAML files is called a package and it kind of works like a Python package or a module in that it defines its own namespace and it can be used interchangeably. There's also a concept of configuration search path, kind of like the Python path
14:04
so you can have these configuration files live in different directories or different modules or they can be even imported from other packages that you install together with Hydra. So all these things sound abstract, so let's go through some examples.
14:26
So the simplest way to combine configurations with Hydra is using the defaults
14:44
directive, which allows you to set some defaults from another file. So for example, if we have an experiment configuration file called my-experiment.yaml, I can define defaults coming from a second file called training-settings,
15:01
and if I have a second file called training-settings.yaml, those defaults will populate the configuration. So I can have some global defaults defined somewhere and override only the ones that I want in the experiment that I'm running right now.
15:27
So the effective configuration that you're going to get here from these two files is the combination of this and of course, as you would expect, overriding this one specific value. You can also have configuration groups
15:43
and configuration groups are subdirectories of YAML files. For example, you can have a subdirectory called dataset and inside of that subdirectory have different files for different datasets. You can have another directory called model and different model files defining different models
16:05
and you can then combine them by specifying in your defaults list that the dataset is named imagenet, and Hydra will automatically look inside the dataset subdirectory for the imagenet.yaml file,
16:25
and the output will be the content of the file taken from that specific model or dataset definition. You can also override where your configuration groups are placed.
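A hypothetical layout for the config groups just described. The `dataset/imagenet` and `model` names follow the talk's example; the other file names are made up:

```yaml
# conf/
# ├── config.yaml          (the defaults list below)
# ├── dataset/
# │   ├── imagenet.yaml
# │   └── cifar10.yaml     (hypothetical second dataset)
# └── model/
#     ├── resnet.yaml      (hypothetical)
#     └── vgg.yaml         (hypothetical)

# config.yaml: pick one file from each group
defaults:
  - dataset: imagenet
  - model: resnet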
16:48
So we're getting a little bit deeper into how you can use these Hydra settings, but you can have one dataset file and reuse it multiple times by using the @ symbol.
17:06
So it's actually quite flexible. There's even a package directive that you can set at the beginning of a YAML file that will say that everything in this particular file
17:21
actually populates the hydra.job_logging namespace. This is actually part of the configuration for a plugin called colorlog that makes the logs that Hydra outputs nice and colorful, which I also recommend.
17:42
The logging that comes with Hydra is just regular Python logging, but it's set up for you out of the box. So the biggest pain point of Python logging, configuring it, is handled for you. Another nice feature that I really like about Hydra is a utility function called instantiate,
18:04
and it works like this. If you add the special keyword _target_ into your configuration file, then you can specify a dotted path to a class in your code base,
18:21
or, in this case, the PyTorch code base. And if you then call the hydra.utils.instantiate function on this configuration, it will look up this module, import it, and instantiate this class with these parameters.
18:41
So that's quite useful for keeping all of your configuration together and making your application quite flexible. Sometimes you may have a problem that some parameters that you need are only known at runtime, like in the case of PyTorch:
19:03
if you want to define an optimizer you need to pass in the model, so you first need to define the model, and then you have the optimizer. But Hydra has another fun feature: if you add the _partial_ keyword to your configuration,
19:23
you can instantiate a partial. So you get a functools.partial object pre-populated with the parameters that you specify, but you can postpone the final creation of your optimizer in your code until you have the model, by calling the partial when you're ready.
19:51
There's also type checking possible with Hydra. Because I don't have very much time, I'm not going to get into the details of this,
20:03
but if you define data classes that have the same fields as the YAML file that you had before, you can of course annotate them with types, and then, by calling Hydra's config store,
20:20
you can actually add them as configurations, so essentially you're replacing the YAML with type-annotated data classes. You can then use that with mypy by specifying that the configuration object is this specific data class, and at runtime it can even point out that you specified the wrong data type for a value.
20:50
So there are quite a few things that you can do with Hydra beyond these basics, because of its plugin architecture. So launchers are responsible for starting jobs,
21:04
sweepers are responsible for setting different configuration parameter combinations based on the input, and there are plugins already for all of these cool workload managers,
21:20
these libraries, and these libraries for scanning parameter values. So it's actually quite a powerful way to start using these other tools as well. So I want to talk very briefly about integrating Hydra with MLflow,
21:43
but first of all, what is MLflow? It's an application that is very easy to use for keeping track of your experiments. The simplest way to use it is just to start a server locally on your machine;
22:01
for production you might want to set up a database on a bigger server, but to get started with MLflow all you have to do is just these two things, and then you will get an interface to your experiments where you can view your training progress and results. And the way you actually log metrics into MLflow is very simple,
22:29
so it fits in nicely with Hydra. All you have to do is just import MLflow and start logging the metrics; it will automatically create an experiment and a run for you.
22:45
And how can you actually integrate Hydra together with MLflow? Well, you can log artifacts into MLflow, so you can upload files into MLflow as well. So if you have the Hydra config, the effective configuration that you used for your experiment,
23:03
you can log it as an artifact into MLflow, and that way you can keep the configuration that's needed to reproduce this experiment together with the results of the experiments in MLflow. So it closes the loop: you can have all your results together
23:21
and the ability to reproduce them as well. Okay, I think that I have to finish, so my takeaways are that Hydra will make your experiments easier to configure, traceable, and reproducible. Thank you.
23:45
Thank you very much for this great talk, it was very interesting, at least for me. So are there any questions in the audience? If so, just come to the front, to the microphone.
24:03
Hi, thank you. I wanted to ask, because I'm personally tracking experiments in Neptune, but whether you use MLflow, Weights & Biases, or Neptune, it doesn't really matter, it's all very similar. But what I noticed is that during, let's say, the initial phase of experimentation, we decided we want to track these specific parameters,
24:22
and later in time, let's say we are training the model always with the initial few layers frozen, but then we decide, okay, we want to try new things and maybe unfreeze some of those. And what is problematic for me is to retroactively populate those earlier experiments with some default value.
24:42
Is that somehow possible or doable, maybe using Hydra, or is it just something that I would have to live with? Well, if you did this and put your configuration into your experiment result tracking software,
25:08
then if it has the ability to scan through your experiments, you could extract any value that you didn't previously add as a specific parameter from this log into your experiment.
25:23
Okay, so just go through all of that? Well, write a script for that. So, the first example you showed is just a decorator on a function, and then the further stuff you showed got more complex.
25:45
Do you think it's easy to keep this separate from your code and to prevent it getting its claws into your code, so that maybe later you could swap it out for a different experiment platform? So, the configuration itself probably is quite easy,
26:06
but when you start doing things like using the instantiate helper, then it kind of becomes part of your stack and it might be more difficult to replace it later. So I guess it depends on how many features you end up using.
26:25
Do you think the benefits of using something like instantiate are worth allowing it into your code? Well, the idea is that it's very helpful and it's not a lot of code,
26:41
so there's not that much of a downside, because it's not so much code. Thank you, that's good. Hi, thanks for a nice talk, seems like an interesting package. I was wondering about this multi-run functionality: you didn't mention anything about parallelizability.
27:03
Is that because it doesn't exist yet? And if so, is it something you're hoping to implement? The plugins for Hydra are actually solving this issue
27:20
By default, Hydra just launches jobs in sequence on the local machine that you have. But thanks to these plugins, which are able to take advantage of backends like Slurm, instead of running the jobs locally, it just starts the jobs
27:42
and they run remotely and then report back. So check out these plugins if you're interested, because they do the parallelization for you. Cool, thanks. Thanks for a great talk.
28:01
Do you have any experience with PyTorch Lightning? I noticed you used PyTorch for some of your examples. No, I don't. I guess that's just it then. Okay, thank you very much for the talk.
28:24
A loud applause for the end, thank you