PyAutoFit: A Classy Probabilistic Programming Language For Data Science
Formal Metadata

Title | PyAutoFit: A Classy Probabilistic Programming Language For Data Science
License | CC Attribution - NonCommercial - ShareAlike 4.0 International: You may use, change and reproduce the work or its content for any legal, non-commercial purpose, and distribute and make it publicly available in unchanged or changed form, provided you credit the author/rights holder in the manner they specify and pass the work or content on, including in changed form, only under the terms of this license.
Identifiers | 10.5446/58768 (DOI)
Transcript: English (automatically generated)
00:06
Great, so now we have with us James Nightingale, who's going to talk about PyAutoFit, a classy probabilistic programming language for data science. Welcome, James. He is an observational cosmologist and postdoctoral researcher at Durham University,
00:25
where he focuses on strong gravitational lensing, devising new ways to use it to study dark matter and the distant universe. So over to you, James. Can I just first check, is my microphone working? Yes. Yes, it's working fine. Is my screen displaying correctly?
00:41
Yep. Brilliant. It's in full screen. Great. Excellent. Okay. So good afternoon, everyone. I'm James Nightingale. Thank you for the introduction. I'm a cosmologist at Durham University, and my research typically focuses on studying galaxies and trying to understand the nature of dark matter. In the past couple of years, we've found that the statistical methods and techniques that we use to do that have far-reaching applications in the data science domain.
01:06
So we've ended up developing this open source software called PyAutoFit to try and basically allow people to use these techniques in a far more generalized setting. So for this talk, I'm going to give you a run-through of model fitting and of what PyAutoFit and probabilistic programming are, and give you a sense of what this software does.
01:23
And then I'm going to describe sort of how we ended up here, starting with our cosmological use case on strong gravitational lensing. And so I want to begin by making sure that we're all kind of in the same place in understanding what I mean when we're talking about model fitting, probabilistic programming languages, and so on.
01:41
So I've sort of got this initial slide just to say that when we're talking about model fitting, these are the types of things, you know, I've got some data, I've got these data points, and I'm going to fit them with a curve. I want to understand what model corresponding to this red line gives the best fit of the data. I've got another image here showing the results or a pictorial representation of a Markov chain Monte Carlo analysis, or MCMC, so if you're familiar with that type of stuff, this
02:04
is the domain we're talking about. And I've also got Bayes' theorem, because all of these model fitting tools, all of the things you can do with PyAutoFit, can be done in the Bayesian context, following Bayes' theorem, using Bayes' equation, and so on. And so to really drive home what we're talking about, I'm going to very quickly go through
02:23
the simplest model fitting example one could conceive of. So here I have some data: it's a 1D dataset that clearly contains a signal, a Gaussian, and the data has noise. My task as a model fitting expert is to find out what Gaussian best fits this data, what Gaussian corresponds to this signal.
02:45
And so the way I would approach the model fitting is as follows. I would first compose my model as a Gaussian, which has three parameters: a centre, an intensity, and a sigma width value. I would then draw a set of parameters via my model fitting algorithm.
03:00
So here you can see we've got a centre of 60, and so on. I would then use these parameters to create a model Gaussian; this is the Gaussian that corresponds to this set of parameters, and you can see it's centred on a value of 60. The next step in my model fitting process is to use this Gaussian to fit my dataset. I would compare my model to the dataset, and you can see here that this Gaussian
03:21
isn't really representative of my data. I would subtract the two from one another to get residuals and define some sort of figure of merit or likelihood function that quantifies how well this model Gaussian fitted my data. You can see here it's not done a very good job: the residuals and chi-squared values are very large. I would then repeat this process using some
03:44
model fitting algorithm, or what's called a non-linear search. And this non-linear search would guess lots of values of parameters, and eventually it would find a solution that gives us the highest likelihood and indeed tells us what the Gaussian parameters of this dataset correspond to.
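To make the figure-of-merit step concrete, here is a minimal sketch of the residual and chi-squared calculation just described, using NumPy; the function names and the exact likelihood normalization are illustrative, not taken from the talk:

```python
import numpy as np

def model_gaussian(xvalues, centre, intensity, sigma):
    # Model Gaussian evaluated at the data's x coordinates.
    return intensity * np.exp(-0.5 * ((xvalues - centre) / sigma) ** 2.0)

def log_likelihood(parameters, xvalues, data, noise_map):
    centre, intensity, sigma = parameters
    model_data = model_gaussian(xvalues, centre, intensity, sigma)
    residual_map = data - model_data                      # residuals
    chi_squared = np.sum((residual_map / noise_map) ** 2.0)
    # Higher likelihood means smaller residuals relative to the noise.
    return -0.5 * chi_squared
```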
04:01
And so this is all very simple, and hopefully you're following, but I wanted to really start at the beginning so we're all on the same page about what probabilistic programming languages do, and therefore what PyAutoFit does. So what is a probabilistic programming language, or PPL for short? Well, these are basically software packages, frameworks or statistical inference
04:21
libraries that make it straightforward to compose a probabilistic model, i.e. the Gaussian I just showed you, and fit it to data, i.e. perform inference automatically. So people who are familiar with this type of software will know of many PPLs, some of the most popular are PyMC3, Stan, there's many that focus on more sort of machine learning,
04:46
deep learning techniques like Pyro, and each of these probabilistic programming languages, they're all suited to different problems, they have different core features, so they have strengths and weaknesses. So there is a question here, why have we ended up developing our own PPL to do astronomy when there already exists some of the biggest open source
05:03
projects on earth? And the reason we think we've done this is because we've actually found that the type of statistical inference problems, the type of model fitting challenges, that we face in astronomy and cosmology, and, as we're learning, in a wider data science setting, were problems that existing PPLs weren't really suited to. So I've listed a couple of examples here. In astronomy we have these large homogeneous data sets, you know, images of thousands of galaxies, and all we want to do is fit those images one by one in an identical, homogeneous fashion, and we just want tools that make doing that straightforward, that make it straightforward to
05:42
process large libraries of results and then do our science, do our study. Another example: in astronomy we often have these very expensive likelihood evaluations, and our model fits can take days if not months to run, whereas with a lot of PPLs a fit typically takes, you know, a minute to run, and the challenges you face are very different. And so AutoFit has lots of tools
06:05
for customizing how the model fit is performed, as well as doing this in the context of massively parallel computing. We also have a need to fit each data set with many different models and to streamline model comparison, using Bayesian inference to determine what the best models are,
06:22
which again isn't something that's typical of a lot of PPLs. So the way we concisely describe this is that PyAutoFit is highly customizable model fitting software for big data challenges in the many-model regime, and I'll explain how PyAutoFit can be used in a second. First, just to get the
06:42
links and whatnot out of the way: PyAutoFit is obviously an open source project, we're developing it to do cosmology, and we want as many people to use this as possible. We have a GitHub; there are all the things you'd expect if you're interested in this type of stuff, so check it out, they're listed on the schedule page. And to really drive home how we want as many people using this as possible, we have written a Jupyter Notebook
07:04
lecture series, which we use to teach our students at Durham about statistics and model fitting, but these are publicly available, so anyone who wants to get into this domain and learn how to do this type of model fitting should absolutely check them out. They're on the Read the Docs, they can be done on
07:21
Binder, and, you know, we're getting very good feedback that these are a great introductory way for someone to get into Bayesian inference, statistics, model fitting and so on. Okay, so now let's look at how we would actually use PyAutoFit, using what we call our classy interface. And so to demonstrate how one would approach model fitting in PyAutoFit,
07:43
I'm going to use the same example problem I just showed, that is, fitting data that contains a Gaussian with a Gaussian. How would we get PyAutoFit to find the parameters that correspond to this red curve? In order to set up a model fit with PyAutoFit, you basically have to undertake three steps. It requires you to basically write three Python classes,
08:06
and so the reason we call this classy probabilistic programming is that it's heavily built on the Python class data structure. So first of all, we need to write our model as a class. This is an example of what we'd write; we'd call the class Gaussian because
08:20
this is the model component we're going to fit, and the crucial thing to understand is that the parameters of the Gaussian that we're going to fit, the centre, the intensity, the sigma we saw previously, are written as the input parameters of our constructor, our init function. So PyAutoFit, when we do the fitting, will read this init constructor, it will recognize these parameters, and it will compose a model
08:43
and a non-linear parameter space of these dimensions. So that's the first thing to understand about the API. The other nice thing about using Python classes is, of course, that you can extend this Gaussian class with functions and tools that do the things you need it to do. So this function will allow us to create the model Gaussian that we compare
09:04
to the data, and we're about to use this function to perform our model fit. So this is the first of the three classes we need: our model.
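As a sketch of what such a model class looks like (following the pattern in the PyAutoFit docs; the method name and default values here are illustrative):

```python
import numpy as np

class Gaussian:
    def __init__(self, centre=0.0, intensity=0.1, sigma=0.01):
        # PyAutoFit reads this constructor and treats each input
        # as a free parameter of the model.
        self.centre = centre
        self.intensity = intensity
        self.sigma = sigma

    def profile_from_xvalues(self, xvalues):
        # Extra method used to create the model Gaussian that we
        # compare to the data during the fit.
        return self.intensity * np.exp(
            -0.5 * ((xvalues - self.centre) / self.sigma) ** 2.0
        )
```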
09:24
The second thing we want to write is an analysis class. This is where our data meets our model, and we fit the data with the model in order to get our figure of merit, our likelihood. The analysis class has two inputs. It's got another init constructor, which is where you put your data, your noise map, anything you need to do the model fit, and alongside that you define your log likelihood function. This is the function that takes an instance of the model, fits it to the data and returns the likelihood. It tells AutoFit how
09:42
well that model fits the data. So the black magic, the crucial thing that AutoFit is doing, is that this instance that comes in, as we can see, is an instance of our Gaussian class, and the values of its input parameters, which we saw here, have been set by our model fitting algorithm, our non-linear search. So if we're doing a model fit that has priors,
10:05
AutoFit will take care of all of that behind the scenes, and you just focus on writing how the actual likelihood is computed.
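A sketch of such an analysis class, again following the pattern in the PyAutoFit docs (`af.Analysis` and `log_likelihood_function` are the real hooks; the chi-squared arithmetic mirrors the earlier example):

```python
import autofit as af
import numpy as np

class Analysis(af.Analysis):
    def __init__(self, data, noise_map):
        super().__init__()
        # Store everything needed to perform the model fit.
        self.data = data
        self.noise_map = noise_map

    def log_likelihood_function(self, instance):
        # `instance` is a Gaussian whose parameter values were chosen
        # by the non-linear search; priors are handled behind the scenes.
        xvalues = np.arange(self.data.shape[0])
        model_data = instance.profile_from_xvalues(xvalues=xvalues)
        residual_map = self.data - model_data
        chi_squared = np.sum((residual_map / self.noise_map) ** 2.0)
        return -0.5 * chi_squared
```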
10:20
So we have our model and we have our analysis class; the final thing we need to do is put all of these together to perform a fit. We compose our model: we say AutoFit, create a model of a Gaussian. We create our analysis and pass it the data, which we've already loaded. We now choose our non-linear search; I've chosen a Markov chain Monte Carlo (MCMC) fitting algorithm called emcee. It's very popular in cosmology, but we of course have
10:40
many of the SciPy maximum likelihood estimators, and we have Bayesian inference tools like nested sampling, if you've heard of that. And by passing the model and analysis to this emcee search, we perform the fit and get the result, which corresponds to the red curve. It gives us the Gaussian that fits the data.
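Putting it together looks roughly like this; a sketch assuming a recent PyAutoFit API, where `af.Model` composes the model and `af.Emcee` wraps the emcee sampler (older releases used `af.PriorModel`), and reusing the `Gaussian` and `Analysis` classes sketched above with `data` and `noise_map` already loaded:

```python
import autofit as af

model = af.Model(Gaussian)                # compose the model to fit
analysis = Analysis(data=data, noise_map=noise_map)

search = af.Emcee()                       # MCMC via the emcee library
result = search.fit(model=model, analysis=analysis)

# The result contains the best-fit model and the full set of samples.
print(result.max_log_likelihood_instance.centre)
samples = result.samples
```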
11:00
And just to really emphasize, this result object that you get from AutoFit has everything you would need to interpret and inspect how well the model fits the data. So it has the best-fit red curve, but it also has tools for error analysis, it has all of the parameter samples of your non-linear search, and it also has visualization tools for creating these sorts of probability distributions. So you can see here that the value of centre was 50; the input value corresponds to high probability
11:23
when we visualize it in this way. Okay, so that's nice, you know, it's straightforward to compose and fit a model in AutoFit, but at the moment it's not clear what this library is allowing one to do that you couldn't do with another PPL, or indeed just kind of write the Python code to do it yourself. It makes the process easier, but there's nothing overly
11:43
compelling about this yet. So now let's start to look at how AutoFit makes it straightforward to customize different aspects of your model fitting, and this is where the use of Python classes really starts to come into its own. This has downsides, of course: Python classes are a bit less concise an interface, and it requires a basic understanding of Python classes
12:03
and object-oriented programming, but as I said, it really allows us to build a far more customizable model fitting experience for the user. And so here's an example of how one would customize the model. In this example I don't want to fit one Gaussian to my 1D dataset,
12:21
I want to fit two Gaussians, which you can see I've instantiated here, and in order to do this I just create the Gaussians and then at the end I combine them in an AutoFit collection object. So this is the beginning of how we're going to start building models of more complexity in AutoFit by combining individual model components. But along the way I take a number
12:41
of steps to customize the model that I ultimately fit. If I'm doing Bayesian inference, I can manually set the priors on each parameter, which is shown here. I might know that the sigma value of a Gaussian is 0.5; let's pretend I knew that. I can just set that to a float, and this will then automatically have PyAutoFit reduce the dimensionality of parameter space
13:02
by one, and all of my model fits will have this value for this Gaussian. Along the same lines, I could link two parameters in the model; here I make my two Gaussians centrally aligned with one another, again reducing parameter space's dimensionality by one. And I can do other things like make assertions; we have lots of tools to customize the model fit to what you specifically need for your problem.
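In code, those customizations look something like the following sketch (the prior and collection names follow the PyAutoFit docs; treat the exact values as illustrative):

```python
import autofit as af

gaussian_0 = af.Model(Gaussian)
gaussian_1 = af.Model(Gaussian)

# Manually set a prior on a parameter.
gaussian_0.centre = af.UniformPrior(lower_limit=0.0, upper_limit=100.0)

# Fix a parameter to a value, removing it from the non-linear search.
gaussian_0.sigma = 0.5

# Link the two centres so the Gaussians are centrally aligned,
# again reducing the dimensionality of parameter space by one.
gaussian_1.centre = gaussian_0.centre

# Combine the components into a single model to fit.
model = af.Collection(gaussian_0=gaussian_0, gaussian_1=gaussian_1)
```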
13:21
The other nice thing about the model API is that, having now seen that we use Python classes to define our model components, you can imagine that if you've got a problem where there are lots of slightly different models you want to fit and compare, the API naturally allows you to do this. So here I've got two more examples
13:42
of 1D profiles: there's a Gaussian kurtosis class, where I've just added an extra parameter to the Gaussian, and you could also imagine maybe I've got a one-dimensional exponential. So if you've got these problems where the model can be broken down into these different pieces, and you often want to compose a model by building them up together like Lego, this is the sort of API that this software really tries to facilitate.
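For example, extra profile classes can be added with the same constructor pattern (the parameter names here are illustrative):

```python
class GaussianKurtosis:
    def __init__(self, centre=0.0, intensity=0.1, sigma=0.01, kurtosis=0.0):
        # A Gaussian with one extra free parameter.
        self.centre = centre
        self.intensity = intensity
        self.sigma = sigma
        self.kurtosis = kurtosis

class Exponential:
    def __init__(self, centre=0.0, intensity=0.1, rate=0.01):
        # A one-dimensional exponential profile.
        self.centre = centre
        self.intensity = intensity
        self.rate = rate
```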
14:01
We also have a lot of customization on the analysis. I'm only going to show one example here, but it's pretty cool. If in your analysis class you add a visualize function, this will allow you to have PyAutoFit output the current results of the best-fit model on the fly. So I've told this visualize function to output the
14:24
images I've been showing you, that is, via matplotlib, output a one-dimensional schematic of how well the model fits the data. When I'm fitting models for cosmology that could take months on the supercomputer, having this tell me after a couple of days whether the model is working can save me months, because I can stop the run by getting this immediate feedback.
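A sketch of such a visualize hook, assuming the `visualize(self, paths, instance, during_analysis)` signature from the PyAutoFit docs; the output path attribute is an assumption:

```python
import autofit as af
import matplotlib.pyplot as plt
import numpy as np

class Analysis(af.Analysis):
    # __init__ and log_likelihood_function as in the earlier sketch.

    def visualize(self, paths, instance, during_analysis):
        # Called during the fit with the current best-fit instance,
        # so images are output on the fly while the search runs.
        xvalues = np.arange(self.data.shape[0])
        model_data = instance.profile_from_xvalues(xvalues=xvalues)
        plt.plot(xvalues, self.data, "k.", label="data")
        plt.plot(xvalues, model_data, "r-", label="model")
        plt.legend()
        plt.savefig(f"{paths.image_path}/fit.png")  # path attribute assumed
        plt.close()
```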
14:44
And on this line of customization, there's lots of customization for the actual non-linear search itself. For all of the libraries we support, you can customize their input parameters, and we also add functionality on top of these libraries; for example, for those familiar with Markov chain Monte Carlo, we have inbuilt tools for autocorrelation analysis. We're just trying to
15:03
basically add value to these libraries if one adopts PyAutoFit to undertake their model fitting. And so I'm about to talk about the astronomy, but I just want to quickly list some of the advanced features that are a bit too technical, a bit too detailed, to discuss in a talk like this, but at least allude to the sort of things one can do with AutoFit if you
15:23
really start to go into it. And so one that I really want to highlight is our support for outputting results into a database. You can output the results to hard disk, and it will create an ordered folder structure that you can navigate with your mouse, which is all quite nice, but that only works if you've got tens of
15:45
data sets, a very small number of data sets; yes, that will work. But we are fitting thousands of galaxies with hundreds of models, and our results correspond to hundreds of thousands of entries, so we had to develop an SQLite relational database that basically streamlined the management and interpretation of these results. The basic model is as follows: as you do your model fits with AutoFit, all of the results
16:04
are automatically written to an SQLite database. You can then load this via a Jupyter notebook, query the database for the results you care about, and from there begin to investigate how your model fits went and what the results are telling you. And all of our result API, the visualizations I showed earlier, are built around this database and Jupyter notebook API.
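A sketch of that workflow, assuming the database API described in the PyAutoFit docs; the exact query and loading syntax may differ between versions:

```python
import autofit as af

# Connect to the database the searches wrote their results into.
agg = af.Aggregator.from_database("database.sqlite")

# Query for only the fits we care about (syntax is a sketch; this
# assumes a model component named `gaussian` in a Collection) ...
agg_query = agg.query(agg.model.gaussian.sigma < 3.0)

# ... and iterate over their results in a Jupyter notebook.
for samples in agg_query.values("samples"):
    print(samples.max_log_likelihood_instance)
```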
16:25
There's also lots of advanced model fitting techniques I'm not going to cover, but basically these are very bespoke statistical methods that in certain problems could be really, really useful. The one I'm going to briefly mention: we had a problem where a non-linear parameter space was so complex, so difficult to fit, that we could not do it
16:45
efficiently. So we built a grid search of non-linear searches, which basically carved up the parameter space over a couple of dimensions and then fitted those reduced parameter spaces in a massively parallel fashion. It's a bit of a weird thing, and you probably don't want to do
17:01
it too often, but if you do want to do it, it's a really powerful tool to overcome the sorts of problems you often face with these types of fits. We have other tools, and I don't have time to talk about them; they're fully described on the Read the Docs under the features section. Okay, so that's AutoFit, that's the type of model fitting that we're trying to do, and hopefully you've got a sense of what this library is about. I now want to
17:23
describe how we got to AutoFit from our initial cosmology use case, so you get a sense of the actual application that drove this; it might ring some bells with the sorts of things that you do. And so to do this, I obviously first need to explain to everyone what strong gravitational lensing is. So most astronomers, when they study galaxies,
17:46
they look at things like this. This is a galaxy in the Milky Way. The typical astronomer, you know, you get your favorite telescope, you point it at a galaxy, you get an image like this, and then you would perform model fitting on this image to study your particular scientific interest.
18:03
Strong gravitational lenses are a very unique phenomenon where, instead of observing one galaxy, you observe two galaxies perfectly down our line of sight. So this red galaxy is the foreground galaxy that's closer to us; you can see it's emitting red light, but it also
18:20
has mass, and that mass curves space-time in on itself, such that the light from the background source galaxy doesn't travel straight into our telescope, but bends around space-time and therefore becomes stretched, sheared and distorted into this ring-like appearance. This is called an Einstein ring, after Einstein. So these are examples of the sort of problem that
18:43
drove the development of AutoFit. This is a two-dimensional schematic, just to make sure we're really on the same page about what a strong gravitational lens is. We've got a foreground galaxy here; it's curving space-time such that the red light of this background source travels on this curved traversal around into our telescope, and this actually means the background source
19:02
appears multiple times, as seen here. So, if you're interested in astronomy, there are galaxies that we genuinely observe more than once due to this phenomenon; we see the same object in the universe multiple times. So I'm going to talk about how this informed PyAutoFit, but just for the machine learning aficionados in the audience, people who are into
19:22
machine learning: this phenomenon has a growing literature of machine learning studies, and you should check it out if you're interested in this sort of stuff. It's a great use case because you can generate large training data sets very cheaply, and there are these large astronomical instruments that are going to find thousands of these systems. It really is your sort
19:41
of stereotypical needle-in-a-haystack machine learning, data science, big data problem. So if you're interested in astronomy, you should definitely just do a Google search on this and have a read of what's out there. I'll obviously say that I develop an open source library that does this analysis, PyAutoLens, so Google PyAutoLens if you're interested. This is what ultimately led to us developing AutoFit, and again we have
20:04
all these Jupyter lectures, so if you want to get into this, check them out. Okay, that's enough shameless plugging of my other software; let's get back to the science. What drove the development? What was it about these strong gravitational lens systems that pushed us in the direction of making this open source statistics library? It's basically how this phenomenon
20:25
makes one think about model composition, in particular multi-level model composition, something that I haven't yet touched on in this talk. So when I look at a strong gravitational lens now, I don't look at a pretty picture and think, wow, that's an awesome thing we see in the universe; I think about how
20:43
it decomposes into distinct model components that I then want to fit. So I have a strong gravitational lens here, and my scientist brain says: well, there's a foreground galaxy, which has a set of model parameters associated with it that I want to learn, and there's a background source galaxy, which has another set of parameters associated with it that I want to learn.
21:04
But then I break them down further. I say: well, this lens galaxy doesn't just have one model component. It has a model describing its emission, or a light model, and it also has a separate model that describes its mass, and this is what defines how the light in the universe is curved. Conversely, my background source only has a light model; I don't
21:23
need to know anything about its mass. So this type of problem forces you to break the models you imagine into these distinct components that have multiple levels, and it's with AutoFit that we're now going to construct a multi-level model according to this schematic. And so for the lower levels of this model, we can use the exact same
21:44
API that I showed previously for the Gaussian: we write model components. This is an example of a light profile; these are the input parameters that would describe the light of a galaxy, and we can attach functions that we use to fit the data, which is shown here. And again, with
22:01
AutoFit, as I alluded to before, we can write other Python classes with a very similar format, a very similar API, describing the mass of galaxies and the light of galaxies. This is where we got to with the Gaussian before, but we have a slightly different problem now, because we want a multi-level model that doesn't just have light and mass profiles, but has distinct model components describing these galaxies, which may have their own parameters. We need to basically
22:24
use Python classes to construct a multi-level model; in particular, we need to use hierarchies of Python classes to build models that go up and down, and this is where I think the real compelling aspect of the PyAutoFit API comes in. So this is another Python class, where we've
22:45
now written a galaxy object, and the crucial thing to understand is that the inputs of its init constructor are themselves lists of the model components I just showed you. So PyAutoFit will see an object like this galaxy, it will understand that its init constructor contains other PyAutoFit
23:05
model objects, and it will use this hierarchy of Python classes to construct a multi-level model. You can also add additional parameters to these objects, so you again have this very nice level of customization of the model that you build.
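A sketch of that hierarchy; the class and attribute names here are illustrative, and the pattern is what matters: model components appear as constructor inputs of a higher-level class:

```python
class LightProfile:
    def __init__(self, centre=(0.0, 0.0), intensity=0.1, effective_radius=0.6):
        self.centre = centre
        self.intensity = intensity
        self.effective_radius = effective_radius

class MassProfile:
    def __init__(self, centre=(0.0, 0.0), einstein_radius=1.6):
        self.centre = centre
        self.einstein_radius = einstein_radius

class Galaxy:
    def __init__(self, redshift, light_profiles=None, mass_profiles=None):
        # The constructor inputs are themselves (lists of) model
        # components, so PyAutoFit builds a multi-level model from
        # this hierarchy of Python classes.
        self.redshift = redshift
        self.light_profiles = light_profiles
        self.mass_profiles = mass_profiles
```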
23:20
I've also got functions here that we use to fit the likelihood; I'm not going to go into the details of how they fit, because I'm trying to sell the composition of these types of models. So we're trying to construct a multi-level model like this using this galaxy class. This is the Python code we've been writing: the same tools that we saw before, but instead of just passing a Gaussian to this model, we are now passing it a galaxy, and with that galaxy
23:46
we're filling in the light and mass profiles that you would use. So if you've got a model fitting problem where you can really break the model down into these distinct components, PyAutoFit has an API that allows you to use those components to build models of arbitrary dimensionality, arbitrary complexity, and so on.
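Composing the multi-level model then reuses the same API as before; a sketch assuming the illustrative classes above, where passing `af.Model` instances (or lists of them) as constructor inputs builds the hierarchy:

```python
import autofit as af

lens = af.Model(
    Galaxy,
    redshift=0.5,
    light_profiles=[af.Model(LightProfile)],
    mass_profiles=[af.Model(MassProfile)],
)
source = af.Model(Galaxy, redshift=1.0, light_profiles=[af.Model(LightProfile)])

# Same Collection API as the two Gaussians, now with galaxies; this
# model is passed to a non-linear search exactly as before.
model = af.Collection(lens=lens, source=source)
```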
24:05
And so in this example, this model has 16 free parameters, but you could easily make this model have hundreds, just by adding more galaxies and more light profiles. This is the analysis class; there's not a lot to say here. The key point is that this instance, which previously only contained the Gaussian, now contains multiple levels: there's a galaxy here, the galaxy
24:25
has a light profile, the light profile has a parameter. So the multi-level model will be constructed by PyAutoFit in the non-linear parameter space and come into this likelihood function in the most convenient, usable way you could imagine. And just to
24:42
wrap up now, and to try and sell why this is so compelling in certain problems: we then had a data set with an object like this, which is called a galaxy cluster. This has hundreds of galaxies, it has hundreds of background source galaxies, and these galaxies can have multiple mass profiles, multiple light profiles,
25:03
but because we designed the composition of models in the way that I just described, we could compose and fit a model to this without having to write any more source code. The PyAutoFit API was extensible such that we could fit models of any nature given this new data set. So that's really what we're going for: we're trying to create this model fitting library
25:23
that compels one to design their model fitting problem in the most object-oriented way possible, so that you can then build and fit these models in a fully extensible way. So this is the summary; I've timed myself at about 25 minutes. I'll quickly mention that we have a whole other use case, to do with a study in cancer, that I've not had time to talk about today.
25:43
Yeah, absolutely check it out if you're interested. Thanks for listening; I'll stop there. Thank you, James, for that very amazing talk. We do have some time to take a couple of questions. I'll just show them right here and then we can take
26:05
them. For someone who is new to probabilistic programming, what does intensity signify, in addition to mean and standard deviation? Okay, so that's my normalization. Let's go back to the slide. In this example, intensity is a parameter of the Gaussian;
26:25
in this sense, the intensity is basically the normalization of the Gaussian. It's one of my three model parameters that, in the way I've chosen to parameterize a Gaussian, defines how high up this red curve goes. So if I doubled the value of intensity here, the model Gaussian I'd create would be twice as high
26:44
as pictured here. So it doesn't signify, it's not like, a mean or a sigma or a full width at half maximum; it doesn't really signify anything meaningful in the context of a Gaussian. It was just how I chose to parameterize this Gaussian in this particular setting, so sorry for the confusion there. All right, thank you for answering that. Let's get to the next question:
27:08
how does PyAutoFit compare to PyMC3, and when should one be used over the other? Yes, this is a great question; it's something I've thought long and hard about, and I tried to allude to it earlier. There are a lot of use cases
27:25
in which you should absolutely use PyMC3 and there'd be no point using AutoFit, but if you've got a very data-driven use case, if you're fitting large data sets and you need this high level of customization of your model fit, AutoFit can be a lot more useful. So I'm going to try and answer this a bit more concisely. My experience with PyMC3 was: if you're trying to
27:45
integrate a function but you're not really using it to fit data, that's what the API for PyMC3 was geared towards, understanding the integral of a function, doing a thermodynamic analysis or something. It was never clear with PyMC3 how one would feed data and a noise map through the analysis, and you had to use specific tools, whereas with AutoFit
28:04
the API that greets you says: this is where your data goes, this is how you fit your data. So I'd say it's an extremely hard question to answer succinctly, but the notion of having data and fitting that data with a model is something that strikes you in the face immediately with AutoFit, whereas the other PPLs often deal with statistical problems
28:25
in a slightly different context. But it's really hard to give a straightforward, simple answer about when you should use one PPL over the other. Well, thank you for that answer anyway.
28:40
Okay, the audience would love to connect with you in the breakout room, and thank you for this amazing talk. Thank you very much.