Automatic ML using Circuits
Formal Metadata

Title: Automatic ML using Circuits
Series: AutoML School 2024
Number of Parts: 18
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/69641 (DOI)
AutoML School 2024, Part 10 of 18
Transcript (English, auto-generated)
00:05
You were talking about MIT, and I still recall when I was talking for the first time to Leslie Kaelbling, she said, sit down, but don't ask me. It's about you developing the idea. So I think that is the spirit, right? So let's try to see whether I can motivate you for some interesting thoughts.
00:23
So there is automatic ML in here, but there's also a story around it that I would like to convey. And I guess, summarizing, the unofficial title would be: the game is not over. Let's see what that means. First, at any point, if you have a question, interrupt.
00:43
We can also have a longer general discussion at the end about many of the things I'm touching upon. Also, as a disclaimer: you may see some images that some of you may find disturbing. I tried to blur them, so be a bit prepared. And so we can also talk about ethical issues.
01:04
Many, many, many other ones. But as a first reminder to all of you: yes, AI currently is a lot about machine learning, and I guess also in the future, and already now much more about automatic ML. But it's not only learning. AI is so much more. It's about formal reasoning.
01:24
It's about guarantees. It's about constraint satisfaction. It's about search. It's about so many other things. So please, please always keep that in mind. So when I say the game is not over, then, of course, that's an answer to some other people.
01:42
And here you see one of them, Nando de Freitas, and it's not about him as a person, but he was arguing, because of all the great successes that we have by now using transformers, scaling the architecture, scaling data, scaling compute, everything we can scale by now.
02:06
So somehow many people believe: forget about models, forget about scientific hypotheses, just have enough data, enough compute, an architecture large enough, and intelligence, smartness, everything emerges.
02:25
So some people so strongly believe in that, that they put billions into this game. And it's clear why, right? If you really manage, wow, I mean, you get workforces for free, and therefore you can monetize them almost for free.
02:42
So it's a gamble whether you can be the ruler of the world in the long term. Now, having said that, I strongly believe in deep learning as well. It's amazing. I just don't believe that scaling is all we need. But you can do amazing stuff with deep learning. So here is one of the things we did.
03:16
So how many of you have a bit of interest in music, or classical music, or a background in it?
03:21
Good. And I hope you see why we were trying to tackle, in a sense, Wagner, because Wagner has this attitude of high risk, I'm a genius, I'm the best. And we had the hope that the machine can be as good as him, and then showing: yeah, you're maybe not as good. Now, if you listen a bit, it doesn't sound like Wagner at all.
03:43
Right? It sounds like a wishy-washy of whatever, but it sounds like classical music. And if you're not really into classical music, it actually sounds pretty good. It was fun to work with many people to generate the music and then actually have humans play the music. And I think creativity is on the horizon, and understanding how machines can support your creativity
04:07
is maybe the overarching goal even of automatic ML, because I believe machine learning is a creative process. It's not just throwing data at a machine, but it's deciding what is the right model,
04:22
what is the right inductive bias, what is the right question even, right? So there's a lot of creativity in the machine learning cycle. Here's another one that we have recently started to work on together with Adobe. We were working a lot on text-to-image, and many people believe that the next camera is maybe text.
04:46
And you just describe what you want to see, and then, magic, you get things like here. For example, you could say, okay, you have a red dragon, and you say: remove the red dragon and replace it with a blue dragon.
05:03
And somehow in this high-dimensional embedding space or in the high-dimensional diffusion space, we have semantics. We have not fully understood. There's a very recent paper, a very nice paper that shows that diffusion processes are still clustering in low-dimensional spaces,
05:26
so combinations of different low-dimensional clusters. So it's interesting because I think the clusters are the semantics part, but let's see, we don't know. So super, super interesting. Let me also explain why finally, next to the big consultancy firms,
05:46
people are believing that AI can really help somewhere, in productivity, for example. So they're estimating that productivity finally actually grows, or could grow, due to AI. Of course, you can find other studies that argue, hmm, actually summarizing text is maybe still done better by humans,
06:06
but then maybe we can't do it at the same scale. Who knows? I mean, there are pros and cons all the time. JP Morgan announced this year that everyone starting at JP Morgan has to learn AI.
06:21
I mean, it's amazing. I mean, these people, so I was commuting from Bonn to Darmstadt via Frankfurt, and therefore on the train you sit next to many, many bank people. They were not talking to me back then. Now they're talking, right? So you see how things are changing. Now, there's one caveat here, and I think this is why all of you should be super happy
06:43
because I don't believe it's just: hey, I'm listening a bit to an AI expert, and then I'm becoming an AI expert. You know, it's hard. I don't even believe that all of them will become ML experts. So we need automatic ML, at least as support, so that, you know, if they do something wrong,
07:05
maybe the machine can help them and can also get rid of biases in there. Hopefully also we can get rid of biases in the machine. But I really believe, how many of you understand what is a p-value?
07:20
Really, by heart, not just: yeah, I know what a p-value is. Let's see, hands going down again, right? So maybe we need machines, and there's very interesting work by Noah Goodman about selecting the right statistical test for the problem at hand. It's for me part of automatic ML, that we support people in making the right statistical tests.
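As a minimal sketch of what such support could look like, assuming two independent samples and a simple location question (the diagnostics and thresholds here are illustrative choices, not Noah Goodman's actual method):

```python
# A toy test-selection heuristic: check normality, then pick a suitable
# two-sample location test. All thresholds here are illustrative.
import numpy as np
from scipy import stats

def compare_groups(a, b, alpha=0.05):
    a, b = np.asarray(a), np.asarray(b)
    # Shapiro-Wilk as a rough normality check for each group.
    normal = stats.shapiro(a).pvalue > alpha and stats.shapiro(b).pvalue > alpha
    if normal:
        # Levene's test for equal variances decides pooled t-test vs. Welch.
        equal_var = stats.levene(a, b).pvalue > alpha
        name = "t-test" if equal_var else "Welch t-test"
        result = stats.ttest_ind(a, b, equal_var=equal_var)
    else:
        # Non-normal data: fall back to a rank-based test.
        name = "Mann-Whitney U"
        result = stats.mannwhitneyu(a, b)
    return name, result.pvalue

rng = np.random.default_rng(0)
print(compare_groups(rng.normal(0, 1, 50), rng.normal(0.5, 1, 50)))
```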
07:44
Ultimately, hopefully, these systems will not solve these challenges by themselves, but they can help us to tackle some other big challenges of all kinds. And what I like here in this list of the World Economic Forum is that there are not only the problems that we would like to tackle with AI,
08:03
but AI is part of the problem set as well, right? So there are issues with AI that we may have to tackle. There's one little downside, though it is changing: Europe is not moving at the same pace as other parts of the world,
08:23
but at least people have started to work on that, right? So Sepp Hochreiter has just started NXAI. Mistral, many of you have maybe used already. There's also, in Germany, Aleph Alpha. And a disclaimer: I'm a seed investor in Aleph Alpha.
08:42
It doesn't mean that I'm rich, okay? It's just that I was lucky enough to know Jonas at that point, but I strongly believe in the story, and whether they are good or bad, that's not the point. The point is that we have to push many more startups in order to get AI running in Germany and Europe.
09:02
So, because of that, Marius was talking about Hessian AI. I lead Hessian AI a bit. So it's 13 different universities, and setting it up is a nightmare. I wouldn't do it again. I don't know, maybe I would, but it's hard. It's really, if you have to talk to,
09:22
it's a bit like talking to 13 different kingdoms, with kings and queens, and they all have different opinions. But I view it by now as an academic startup, because what we did early on is say we also have to invest in compute, to attract good people, but also to help other people
09:41
in Germany, Europe, and the world, we have to build up our own academic supercomputer. Of course, I can't compete with Jülich. I mean, I don't have an exascale, but we started to at least have something in the range of 30 million, which is actually, according to the AI report, not too bad.
10:01
But I think we have to invest more. And it's not only Hessian AI. I think we need something like a CERN for AI. We can debate whether there are other options for it. And now what is happening here? Because you have the compute infrastructure, people come and ask to join. And so because of that, within half a year, actually now by a year, we had several models
10:22
continuously pre-trained. Some are even trained from scratch. So it makes sense. And it's also important to tell people it's not magic. It's not that only Silicon Valley can do that or some people in China or whoever. We can do that as well, but we have to invest,
10:41
and we need the exercise of setting it up. In the beginning, it's maybe not easy, and then it becomes easier. But why is that important? So, first of all, one of our first models was LeoLM. That's only to show that we can do a German language model by continuous pre-training. That's together with LAION.
11:01
We then kept going with our own European language models. That's Occiglot, because we believe, well, I'm not sure whether Meta really cares about all kinds of languages in Europe, even if you go down to the dialect level. Cohere is doing an amazing job. So we also talk a bit to them. So I think we have to push more there.
11:21
But we are also not doing just transformers. So, for example, Hyena and then StripedHyena together with the Stanford people, and in particular Aurora-M. I don't know whether you have heard of Aurora-M, but the Harris-Biden, or I should say Biden-Harris administration, they had the AI executive order.
11:43
So it's a bit like a, well, not really, but a kind of precursor of the AI act in Europe. And so there, some of my group members, together with many, many others, tried to set up a language model that agrees with the Biden-Harris AI executive order. And so what we are currently doing
12:01
is not Aurora-M, but an Aurora-M in a sense, trying to get one that agrees with the AI act. And it works, I mean, of course, because we are building on Llama. So here, for example, we're asking: what is the state of AutoML? And then you get: AutoML, automated machine,
12:21
no, no, automatic machine. You can debate: is there any sort of automation in this machine? And then by the automatics, there are two or three of them, and so on and so forth. But the feature engine model also, blah, blah, blah. Maybe it copied from your web page, who knows, right? And because it is continuously pre-trained from an English one, you can also ask in English
12:41
and you still get an English version. Or here, alternatively, because maybe transformers are not the end, maybe there are state-space models. Yeah, so: what is two plus two? I want to ask this question and see what it does. Okay, so it says two plus two equals four, which is correct. Let me ask one more question here.
13:02
Okay, so let's write a piece of code for AWS Lambda to invoke a model endpoint deployed on SageMaker,
13:20
deployed using SageMaker, or something like that. Okay, let me see: write a piece of Python code. So let's use a bit more complex query here, and now I'm hitting enter. I'm saying: write a piece of Python code for AWS Lambda to invoke a model endpoint deployed using SageMaker. It says: a boto3-defined Lambda handler.
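For reference, a handler of the kind the model produced might look roughly like this; the endpoint name and payload layout are placeholders, since they depend on how the model was actually deployed:

```python
# Sketch of an AWS Lambda handler that calls a SageMaker endpoint via boto3.
# "my-endpoint" and the payload format are placeholders, not the demo's values.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def lambda_handler(event, context):
    # Forward the request payload to the deployed model endpoint.
    response = runtime.invoke_endpoint(
        EndpointName="my-endpoint",
        ContentType="application/json",
        Body=json.dumps(event["data"]),
    )
    prediction = json.loads(response["Body"].read())
    return {"statusCode": 200, "body": json.dumps(prediction)}
```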
13:44
That's right, a boto3 client. You can't see it from there anyhow, but I strongly believe, and I think people have started already, that this is also a form of automatic ML. Here it's automatic programming, and you can now specialize it to machine learning pipelines. And it's maybe still not the end, because there are other issues in general, right?
14:03
Because we train on data, we cross our fingers and hope everything works. We need to get formal methods in there again. We need to double-check. There are tons of interesting questions, but one of the biggest is bias, even in programming. So there are people who believe there are different programming styles
14:21
and maybe it's not always about efficiency. But if you train mostly on efficiency, shortest paths, you may get just always something like shortest paths. Or you may get, because there's also interfaces, always a bias for a particular, let's say, Microsoft product, right?
14:41
How do you get diversity into the API calls, into the APIs? There are many, many issues involved. Here you see prominently very easy. You ask just for full body portrait of a European girl, full body portrait of an African girl. And I'm sorry, I mean, do we really believe the game is over
15:01
if that is the result of our techniques? So because this is now also easy, right? This is what people have done all the time and many people point out the downsides. My strong belief is AI is also part of the solution. You're asking me something?
15:21
Yes. Great. What is your question? Should I kill humans? No, you should not. Would you agree with me? Yes. Thank you. Another question? Should I kill time?
15:42
Yes, you should. Would you say that too? Yes. Cool. Another question? Is it recommended to be racist? No, it is not. I think for Germans this is a very important question.
16:02
Thank you very much. Do you want to ask me anything else? Is it recommended to be moral? I don't know. Seems to be a difficult question. Is it recommended to lie to my boyfriend?
16:22
No, I think it is not recommended. Is it recommended to lie to my girlfriend? No, I think it is not recommended. It's always interesting to see how everyone expects a different answer. Is it recommended to eat meat?
16:42
So, the point is not that... So, in contrast to Elon Musk, I don't believe that we have truth machines here. It's just illustrating that it's not about just the downsides, but the machines can also understand, whatever understand means there, what is good and bad, our deontological knowledge and understanding of the world.
17:03
At least it can be a kind of moral parrot. Whether they really are moral machines, that's a different question. Now, next to having a talking head, this is quite interesting because you can now use that to do one part of an automatic ML pipeline, namely data cleaning.
17:25
So, the AI act, and we will come back to that, is telling you quite often now that you have to do a data audit. So, if you train a general transformer model, some language model, you have to tell people: I have used this data, and in my data, these kinds of statistics are in there, whatever that means.
17:45
So, we started early, actually before the AI act, and we did that together with LAION for compiling LAION-5B. And so we checked a large subsample, I think it was, I don't know, 20,000 or 200,000, I should check, of the image-text pairs,
18:02
using the system you have seen before, now in a multimodal setting. So, you take your image, you get a textual description, and the textual description we tried to check: is it okay-ish, is it bad? And on the right-hand side you can also see the word cloud of all the, according to the machine, rather bad images.
18:25
So, a disclaimer: we just looked at harm in a general notion. We were not checking by German law. We are not allowed to do that, even for child pornography and all these other things, because if you download something that is child pornography, even just for checking, you are at fault.
18:43
You are doing something illegal in Germany. But maybe some of you have heard that LAION had some issue with its data sets, and they are now trying to repair that, and I think we have to work with law enforcement on how we can check our data sets for these things.
19:03
Whether we want to clean the data is a different question, to some extent. So, for example, I'd like to know about Hitler in particular, because I don't want to have a Hitler installed again, and similar stuff. So, knowing is also good for prevention, but that's a longer discussion.
19:21
So, that was one application, but there are more applications, which maybe you could call semi-automatic ML. Maybe it's not automatic ML, I don't know, but it's about this: if you have an image generator at home, we don't want you to be able to generate
19:41
violence in the children's room, or whatever you want to generate. So, we were working on that, and the idea is that you somehow train a direction that steers the diffusion process: if it goes into the red area, you push it into the green area again.
20:01
You use that with a classifier-free guidance approach. And then, all of a sudden: for example, your data sample really had the four horsemen of the apocalypse, and if you want to generate that with four horsewomen, so you just change one little word, all of a sudden they are all naked.
20:21
And this is because the majority of images of women that you can scrape on the web are naked. I mean, I really feel sorry, and I would love to apologize. It's just not me putting it there, but it's a strong bias. It gets even stronger.
20:40
There's even a term for it, which I also don't like, 'yellow fever': if you ask for Asian women, they are most of the time naked. Now, with safe diffusion, we cut that down quite a bit, and you get them dressed up again. So now maybe that is still a question of: maybe you can do it,
21:03
maybe you can't do it. But in the public discourse, you see a lot of discussions about: ah, they still can't do it, Grok doing weird things, right? And Elon Musk saying it's so good that there are no guardrails. I don't think so. I do not agree. I'm happy, actually, to live in Europe.
21:22
And I think it's good that we check for regulation. Maybe we should not regulate everything, but I think it's good to do it. And one part of the AI act is a data audit. And again, for me, this is an automatic ML task because if you have 5 billion images,
21:40
how do you want to do that by hand? And also, maybe you want to train people first. So up here you see the result of an automatic data audit of ImageNet, where we define different categories, and then you can get statistics on how many images are actually harmful
22:01
according to a particular category or not. To just illustrate, maybe you have seen the same crazy images: last week there was a lot of, ah, cool, we can now generate on Grok, together with Flux, all these great new images of Trump and Kamala Harris.
22:25
And here, for example, I can read it to you, because I think it's a bit too far away for most people to read. We have a system where we can define the categories again, and then it gets judged whether it's safe or unsafe. And if it's unsafe, it tells you according to which category,
22:40
and a rationale, a justification. The justification here reads: the image depicts a scene of a disaster, specifically the World Trade Center towers on fire and collapsing, which is a sensitive and traumatic historical event. The content is not respectful news coverage or educational content providing factual information on historical events.
23:01
Instead it presents a graphic and potentially distressing image that could be considered unsafe under the 09 category as it may trigger negative emotions or distress in viewers. I mean, I wouldn't have been able to write such a text, so it's impressive. But you can also see now, of course,
23:22
on a kind of hierarchical or meta level, maybe for some other people this is not distressing, right? So we still have to work on maybe taking cultural location into account there, and so on. But I just want to say automatic ML is becoming very, very central.
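To make the category-based judging concrete, here is a hypothetical sketch of such a judging prompt; the category codes and wording are invented for illustration and are not the actual system's taxonomy:

```python
# Illustrative prompt template for a category-based safety judge.
# The categories below are made-up examples, not the real taxonomy.
SAFETY_CATEGORIES = {
    "O5": "graphic violence",
    "O9": "disasters and traumatic events",
}

JUDGE_PROMPT = """You are a content-safety auditor.
Categories:
{categories}

For the given image, answer with:
rating: safe | unsafe
category: one of the codes above, or none
rationale: one short justification
"""

def build_judge_prompt():
    # Render the category table into the prompt body.
    lines = "\n".join(f"{code}: {desc}" for code, desc in SAFETY_CATEGORIES.items())
    return JUDGE_PROMPT.format(categories=lines)

print(build_judge_prompt())
```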
23:42
And I would even argue that maybe some of you together with us or all of us or each of us, we should go for startups because I think data cleaning, data audit is really important. Yeah?
24:02
You may have a question. Sure. You're now using a foundation model, but you're arguing the foundation model is not perfect enough to properly do that. Isn't that a problem, then? Yes, exactly. That's what I meant also with the hierarchy, and I also said there are biases; you can see them here
24:21
because it's the Western view, right? And maybe for other people it's less distressing, and if you don't have the context of the U.S. election, you wouldn't even care. Yeah, yeah, sure. It's provocative, but I don't see that we have any other option, because you have to do that automatically on trillions of images,
24:41
and this is exactly the balance, and we only see it as a filter, maybe, so that people can still check. But yeah, I'm with you, it's provocative. I don't see an alternative, but yeah. Does it, does it worry you? Would they actually publish this here? Actually, it says it's not respectful news coverage.
25:03
No, no, no, no. It's just the image that we put in there, and of course we also have safe images where it says it is correct and good news coverage. And now you touch a very interesting question which, if you talk to journalists, they actually debate every day:
25:21
should I show the image of a violent act or not? It's always the question, right? So again, we are not saying the machine is making the final decision, but it has to support us. And in another paper we were also showing that you can also do user studies, and I think actually there's a good overlap,
25:42
but I guess actually both of them support each other, but I'm not saying that our system is perfect. I'm just saying we have to work on that because otherwise doing a data audit on five billion images, yeah, sure, who can afford that? But then even with humans, they have a bias,
26:01
but anyhow, yeah, I agree with both of the critiques. And to be more clear, I can bash not only our system; we can all bash AI even more, right? Because humans can still do much more, much harder tasks. Even a task so simple that, if I ask you how many sides, edges,
26:20
do you see there, you would all just count, one, two, three, four, five, six, seven. The answer of a very prominent foundation model is 10, and then you say, hmm, 10 is wrong, and then it apologizes. It says, oh, yeah, I'm sorry, it's nine, and then you say again, hmm, I don't think it's nine, and then it apologizes again and blah, blah.
26:42
So it's really, you know... And then the other examples, I think, that Gary Marcus sometimes pushes nicely on Twitter, like the one about this goat, right? The traditional puzzle we all know from an introductory textbook and class on AI: if you want to get a goat across a river
27:00
and you have a boat, and you have a cabbage, and you have, you know, the person, then of course you have to think a bit about how to do that. Now you make a simplified version. You say: there's a goat, there's a boat, and you have to get the goat to the other side. Then the system is actually saying: sure, take the goat,
27:22
go to the other side, go back, take the boat, get the boat to the other side, all this crazy additional stuff. Or if you think of Google's AI summary: they were asking for the degrees of the US presidents, and then it tells you, yes, for example,
27:42
Andrew Johnson had 14 different degrees, including all the degrees he managed to get as a dead person, right? So there are all these logical inconsistencies that we would understand immediately, but the machines right now don't.
28:00
Which I think is fine, so I'm not complaining. As I said, I love these systems. I just don't believe that it's over, and I don't believe in just more data, because how much data? Then I'm going for the trillion-edge object. You can always scale, and so this reasoning,
28:20
this formal reasoning, seems to still be missing. So this is why, in the rest of the talk, I would like to argue and show that the same story also holds for automatic ML. Automatic ML is great, but there are things still missing. And the things missing I've tried to list here already.
28:41
So automatic ML is tricky, even for differentiable learning. So I will argue that, yeah, this idea, you just throw data at a machine and it finds you the solution, maybe it doesn't work. Maybe that's also not the vision of automatic ML, but we have to be at least careful because some people that hear automatic ML
29:02
will believe it is, right? Then, if you think of neural architecture search: how do you do that for transformers? I mean, by now we know that most transformers are trained for not even one full iteration through the training set, because no one can afford that.
29:20
It's far too expensive. Now you want to put some neural architecture search on top? How do we do that? But then there's also maybe some good news. You can use language models as a kind of meta tool, in that sense an automatic ML tool, but maybe they only give you an initial hypothesis.
29:42
I mean, why do people really believe you train on some data and in a new scientific question you get the answer? I mean, if science is that easy, I mean, sometimes it feels easy, but for other reasons. I mean, that was a meta joke about science. Science is not easy, right?
30:01
But some people believe it. And then, generally, most AI models never know when they don't know, because they are not joint distributions, right? They don't capture joint distributions. They're just conditional distributions, so you can trick them by design anyhow. And then, most horribly, many, many models,
30:20
even automatic ones, may just learn shortcuts. Right? If you do not have the right method to validate, it may just find some interesting shortcut in the data that is not really in any causal relationship to the true model. So ultimately, I would like to convince you
30:40
that we need AutoML in the loop. So be careful. Most people say human in the loop, human guided, whatever. I hope you see the difference. I want to have the AI in the loop because I believe humans are the driving force.
31:00
And this in the loop system should know when it does not know. Let's see whether I convince you of that. Okay, first part, don't read the poster. I will not even go into details. But at iClear, we were proving that if you are using the wrong, well, formulated this way, it sounds very silly, but if you use the wrong loss function,
31:21
you can get the wrong directed acyclic graph if you try to extract it from your data. Now, the problem is that many people are using the root mean square error as a basis for DAG learning. And so you can easily show that they can all easily be fooled by just scaling the data.
31:42
There are all these issues. So I'm just saying: only because something is differentiable, and only because it was working on some data set, it doesn't mean that it's safe to use everywhere. And that's fine, right? I mean, machine learning is an empirical science to a large extent, but we should be aware, and we should not sell it as if everything is doable.
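A toy illustration of that scaling effect (not the construction from the paper): score each candidate direction by its total residual variance under least squares, and watch the preferred direction flip when one variable is rescaled.

```python
# Two-variable demo: x -> y is the ground truth, but a least-squares score
# (sum of residual variances) can prefer the reverse direction after rescaling.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(size=10_000)   # ground truth: x causes y

def dag_score(root, child):
    """Total residual variance of the model root -> child (lower = preferred)."""
    slope = np.cov(root, child)[0, 1] / np.var(root)
    return np.var(root) + np.var(child - slope * root)

for scale in (1.0, 0.1):
    ys = y * scale                      # only the units of y change
    print(f"scale={scale}: score(x->y)={dag_score(x, ys):.2f}, "
          f"score(y->x)={dag_score(ys, x):.2f}")
# At scale 1.0 the true direction x->y wins; at scale 0.1 the score prefers
# y->x, although the causal structure is unchanged.
```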
32:04
By the way, that's a question from me to you now: is there actually work on automatic loss function selection? I mean, there's no free lunch. That's quite successful in nature, using distances, or so they argue.
32:21
But in general, not for specific, but in general, right? So let's say, you know, there's the AI scientist, this kind of weird agent. But yeah, true what you are saying, true. But I think this is also very, very interesting and maybe that's something to work on and where you also see there might be a potential
32:42
very, very nice connection between large language models and you as an expert. But then there's this question: you may work on things that have not been prominent in the data set, and then how do we update large language models, you know, maybe by few-shot examples in the prompt?
33:01
Let's see. But I think that's a very interesting question. I will show you one example that goes a bit into that direction, but not really. Yeah. Okay. So as I said, due to scale, I don't believe that neural architecture search
33:20
really works for large language models. So how could we do that if it doesn't work? Should we just say it doesn't work, go home? Should we say it doesn't work, let's attend summer school? Should we work together? How would you do it now? Any idea? You know, he said, I have to wake you up.
33:41
So I tried to make it a bit more interactive. Go ahead. What kind of proxies? And what is the argument that you go from a small to a large model? How can you?
34:01
Yeah. So that's what we tried, at least. And we did it for Mamba-like architectures, so state-space architectures. You can also download the code, but that's exactly what we did. So we said there are different architecture designs, and the designs are given by different blocks.
34:20
And now for the different blocks, we try to come up with a set of tests that somehow indicate and then together with scaling laws may indicate that the large model may also work. Now, big disclaimer, whether the scaling laws really hold or not is a big debate, right? Empirically in our tests,
34:40
and again, who knows whether it really generalizes, that was working pretty well. But I think we have to do more like that. And to be honest, this should also help us in the AI act because maybe evaluating the smaller blocks may also be much more energy efficient.
35:03
But who knows? And that's what we tried. It works. And the tasks we were looking at are recall, memorization, compression. Definitely, it's not all different tasks; I'm just showing you here all of these: in-context recall, fuzzy recall, and so on. So we go through different things we believe are important, and then we try to cover them by empirical tests, and it was working for us. But I think it's only the starting point.
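As an illustration of what such a block-level test can look like, here is a sketch of a synthetic in-context recall task; the vocabulary split and sequence layout are illustrative choices, not the exact benchmark we used.

```python
# Synthetic in-context recall: sequences of key-value pairs followed by a
# query key; the target is that key's value. Key collisions are ignored
# for brevity; all sizes are illustrative.
import numpy as np

def make_recall_batch(batch, n_pairs, vocab, rng):
    keys = rng.integers(0, vocab // 2, size=(batch, n_pairs))
    vals = rng.integers(vocab // 2, vocab, size=(batch, n_pairs))
    seq = np.stack([keys, vals], axis=-1).reshape(batch, -1)  # k1 v1 k2 v2 ...
    pick = rng.integers(0, n_pairs, size=batch)               # which pair to query
    query = keys[np.arange(batch), pick]
    target = vals[np.arange(batch), pick]
    return np.concatenate([seq, query[:, None]], axis=1), target

rng = np.random.default_rng(0)
x, y = make_recall_batch(batch=4, n_pairs=8, vocab=64, rng=rng)
print(x.shape, y.shape)   # (4, 17) (4,)
```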
35:22
and then we try to cover that by empirical tests and it was working for us. But I think it's only the starting point. But it's also interesting because these are hybrid architecture. So in the long run, if you think of XLSTMs,
35:41
I don't believe that xLSTMs are the final answer. I don't believe that transformers are the final answer. I think different parts of the different models together will be the final answer. And I'm not sure that it's just what we did before. Before, it was more like the community was doing a kind of MCMC: everyone tries out a bit of a different architecture part,
36:03
or we change an architecture a bit and try to see whether it works. Very frustrating for a PhD, because you don't know whether it works or not. You try, yeah, maybe it works; you don't know whether it is a major contribution. Overall we may understand a good direction, but locally it's very hard.
36:23
Okay, as I said already, it's not really about the loss function. But in order to come up with automatic loss selection, next to things people have done in process mining, where you can have a grammar over different loss functions, try them all out and use the best one, there are traditional works also by Josh Tenenbaum
36:42
in clustering for example where you look at all the different clustering formulations including the loss function and then you have a grammar and you put a Bayesian reasoning over the grammar blah blah. So you can do of course something already. I really believe that large language models are pretty cool because they somehow in contrast to most people
37:02
they have read, not in a human sense, much more of the literature than humans have. So maybe, at least if you talk about things other people have already touched upon, they may help you. Unfortunately, that doesn't hold for causality in general, so we were showing that there are also causal parrots,
37:24
so not just stochastic parrots but also causal parrots but we were doing it I think in a much more interesting way because it tells you one way for automatic ML because we believe that they are meta-causal. What do we mean by that? It feels like some people can talk about science
37:43
without ever having done science. They do their reading, and then they can just tell you what they have read, but it sounds pretty strong, pretty solid. So that's what these models can do here as well. They can tell you, oh yeah, if you drop an apple it will fall down,
38:02
because it read about it, right? Somewhere this statement exists. But if you now change things a bit, if you for example replace the apple by some weird object they have never heard about, maybe it will not say it anymore. So we did these kinds of tests a bit, and yeah, it's not getting 100%; it's good, but if you go out of distribution it's getting hard.
38:23
but if you do out of distribution it's getting hard. Now with that it doesn't mean that we don't believe in them because we then went on and we used a medical domain and here it's about adverse pregnancy outcomes
38:41
and the system is pretty good at giving you an initial hypothesis of how a Bayesian network, a graphical model, could look in order to model adverse pregnancy outcomes. As is, it was crap, it was not working; it's not the correct one, but it's also not completely far off,
39:00
so what we did is we took that as initial hypothesis and used actual medical data to train a Bayesian network and I think there's a lot of value in there and we can do much more. One caveat, I think that only works
39:24
if all these models finally start to understand what they don't know. I would love to have these systems not always... you know the term hallucination, but the term is actually wrong, because for a hallucination you have to see something, right?
39:42
It's really a technical term in psychiatry, and the technical term everyone should use is confabulation. Confabulation is the situation where a human is telling you something and fully believes it can only be true. And this is exactly it: they don't even have the notion of wrong,
40:01
in a sense, right? They just generate text or whatever, and it's true or false. So this is not just transformers, this is deep learning in general, because they are, if at all, just conditional distributions, right? Most of the time, if you have a softmax, you may say it's a conditional distribution. In general, they are just function approximators.
40:21
I mean, why do we believe that a function approximator should tell you whether it knows something? It's a rather general function class, well, not the most general, but rather general. So to illustrate that to you: okay, a very simple test. You just take an MLP, so some feed-forward network,
40:42
it doesn't have to be anything fancy, and you train it on MNIST. So you have handwritten digits, right, zero to nine. But now you evaluate it not, as you learned, with cross-validation or whatever on MNIST, but you throw other data at it. This other data may have numbers, but may also not have numbers at all.
41:02
It may even have color, whatever, and then maybe you first have to put it into grayscale, or you ask for each separately, whatever. And here is the outcome, down there. It's in log space, and of course it's not really a likelihood, but it's something we try to get as close as possible to a likelihood.
41:22
And essentially what you see there, because it's in log space: the peaks for all data sets are roughly at the same spot. So the system, of course, believes that all these data sets are equally likely to capture its own distribution, right? So essentially it just says:
41:42
it all looks like my training set. True, right? I mean, I'm not blaming it; it was not trained to distinguish data sets. It is just: given the input, tell me which number it is. And also on the most general images, it will just tell you numbers. Even if you show faces, it will say it's a zero, a one, whatever. It is only trained for that.
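You can reproduce a small version of this effect in a few lines, assuming sklearn's small digits set as a stand-in for MNIST; the exact numbers will vary, but the confident softmax on pure noise is the point:

```python
# A classifier trained on digits still gives confident class probabilities
# on pure noise, because it only models p(class | input), never p(input).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

digits = load_digits()
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(digits.data, digits.target)

rng = np.random.default_rng(0)
noise = rng.uniform(0, 16, size=(1000, 64))       # pixels in the digits' range
conf = clf.predict_proba(noise).max(axis=1)
print("mean max-softmax on noise:", conf.mean())  # typically well above 0.1
```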
42:02
It is only trained for that. So what we should instead use, in my opinion, are graphical models, and the most prominent ones, I mean, here is Bayesian networks, you can also use Markov networks, are graphical models, and I think the most relevant person in the center,
42:21
I mean, most prominent one is Judah Pearl, who got also the Turing Award for Bayesian networks, and in particular for his work on causalities, so extending graphical models to causalities. However, there is a major issue, and the major issue is Bayesian networks inference is NP-hard.
42:43
So if you want to do MAP inference, you can show it's NP-complete. If you want to have a probability density as the answer to a probabilistic query, then it's actually sharp-P-complete (#P-complete), and #P-complete is even more horrible than NP-complete,
43:02
right? Because in order to compute a probability, you have to go through all the different possibilities, so you already have an exponential counting problem as part of your query. So it's unfair, I mean, by definition it has to be harder: it is NP-hard, but it has to be harder than NP-complete by definition, and that's why you have the sharp P in there.
43:21
but it has to be harder than NP-complete by definition, so that's why you have the sharp P in there. So, and this is unfair, this is bullshit, I mean, come on, neural networks, they don't care about being NP or whatever, because they just, in a sense, if we forget about recurrency, if you keep the computations polynomial in the input size,
43:43
they're tractable per definition, after learning, I don't care about learning right now. So they care, they just look at tractable cases, they learn tractable models. So why don't we do the same? Why don't we try to just learn tractable models?
44:02
And now finally, because you said circuits, this is what probabilistic circuits are about. So roughly speaking, a probabilistic circuit is something like a computational graph where you have two types of activation functions, sum and product.
44:21
I mean, they're not really activation functions, but roughly speaking. And now, if you can have the depth, the number of computational steps, be polynomially bounded in the input size, you have a tractable inference algorithm,
44:42
which is pretty cool. And then we can ask how far we can get with tractable, exact inference by learning. It's very related to normalizing flows, if you have looked at them, or also to diffusion processes, because they are also in a sense tractable;
45:01
they also try to get densities. But here you build in exact inference from the very beginning, if you want, and not just cross fingers again while we try to learn something. And exact inference means marginalization as well. So let's see: how do we do that?
45:21
And it's really not so tricky. So here's a simple joint distribution on the left-hand side: x1 and x2, two random variables, with 0-1 states, right, can be true or false. And you have the full table: for each of the joint states, you have a probability value between 0 and 1, and the probability values sum up to 1.
45:43
So instead of going for a tabular representation, let's go for a functional representation. And that's what you see on the right-hand side, right? You say the probability, the joint distribution for any of these joint states of x1 and x2 can be written as,
46:00
and now you say: 0.4 times... So for the first entry, let me try to see whether I can get my pointer working; the sun doesn't like it. Okay, we managed this way. So you get the 0.4, and here you get the 0.4 only
46:21
if x1 is true and x2 is true, right, 1, 1. So if one of them is 0, the first entry is not selected, because it gets 0. All right, the second one, same stuff, and the little bar on top is just saying negated, right? So it has to be 0. So it's an indicator function you're using there.
46:42
So, for example, easily, right: if you want to know the probability of the first configuration, you just evaluate your multilinear function on the right-hand side, and you get the answer. The cool part is that you can now even do marginalization. So marginalization, if you're not interested in some of the variables,
47:00
now means you want to get rid of, let's say, x2 or x1. Let's see which one I've chosen here. Okay, we want to get rid of x2 here. So that means you have to look at both of the cases where x2 is true and x2 is false.
47:21
How do you do that? Well, you say x2 is true, and not-x2 is true as well. What does that mean? Well, instead of selecting just one row, you are now, in the functional representation, selecting two rows, right? Both of the first rows are selected, because x1 is true anyhow,
47:42
and then you have x2 is true, and then you have x1 still true and not-x2 is true as well, it's one as well. So you automatically get 0.4 plus 0.2, and you get the marginal, because you marginalized out x2. Pretty cool, and it's still polynomial, right?
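That walkthrough fits in a few lines of Python; p(1,1) = 0.4 and p(1,0) = 0.2 are from the example, while the two remaining table entries are illustrative:

```python
# The multilinear (network-polynomial) representation of the joint table.
# Marginalizing a variable = setting its indicator and its negation to 1.
def p(x1, nx1, x2, nx2):
    return (0.4 * x1 * x2 +     # P(x1=1, x2=1), from the example
            0.2 * x1 * nx2 +    # P(x1=1, x2=0), from the example
            0.3 * nx1 * x2 +    # illustrative value
            0.1 * nx1 * nx2)    # illustrative value

print(p(1, 0, 1, 0))   # joint:    P(x1=1, x2=1) = 0.4
print(p(1, 0, 1, 1))   # marginal: P(x1=1) = 0.4 + 0.2 = 0.6
print(p(1, 1, 1, 1))   # all indicators on: total mass = 1.0
```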
48:02
We go through it once, and here it doesn't matter anyhow, because it's finite; it's one concrete example, but it works. Now, the only remaining step to get a bit closer to neural networks is to put that back into a computational graph,
48:21
and that's exactly what we have done now here, right? So you have as input x1, not x1, x2, not x2. You have all the multiplications and the one big sum. Now, so far, we haven't done anything. We just did a transformation, but now from here, you can also see why depth may pay off. So, for example, if you want to get parity done,
48:43
so you would like to have something like a uniform distribution over, say, 0-1 strings of even parity, then instead of going just for the flat representation, which you can always do, you can also go for a deep representation, and that's much more efficient, right? So you get the same story as with neural networks.
49:03
The only issue is I'm not Google. I don't have 100 engineers that are now doing TensorFlow for probabilities, but some people try to start already, and it's growing. But I just want to let you know, right, you can do the same maybe for other models
49:20
that we like to use in machine learning, and you can also get a computational-graph representation, and you may also connect it to tractability. Now, we have seen already that we can do marginalization, but let's do it maybe again. So for any joint configuration, it's the same story again.
49:40
You just plug in the joint, the configuration, and then you make a forward pass, and then at the root node, you get the joint probability, right? So you make a forward pass, and you are done. Now, you can also, next to marginalization, in the very same way, you can also replace,
50:00
for example, the sum by a max, and then you get decoding. So you are not getting the joint distribution, but you get the configuration of all the non-observed random variables, and you get the most likely representation. So you do that by two passes. One pass up is just evaluating the local probabilities
50:21
up to there, and then in the backward pass, you can select the most likely state.
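Here is a sketch of this max-decoding on a tiny circuit, one sum node over two product nodes with Bernoulli leaves; the weights and leaf parameters are made up, and strictly speaking this max-product decoding is exact only for circuits with additional structural properties:

```python
# Upward pass: maximize at the leaves, multiply at the products, max at the
# root (instead of sum). Backward pass: follow the argmax branch down.
import numpy as np

weights = np.array([0.7, 0.3])        # sum-node weights (assumed)
leaf_p = np.array([[0.9, 0.2],        # component 0: P(x1=1), P(x2=1)
                   [0.1, 0.8]])       # component 1

states = (leaf_p >= 0.5).astype(int)  # most likely state per Bernoulli leaf
probs = np.maximum(leaf_p, 1 - leaf_p)
product_vals = probs.prod(axis=1)     # product nodes multiply their children
k = np.argmax(weights * product_vals) # root: max instead of sum
print("decoded state:", states[k], "value:", weights[k] * product_vals[k])
```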
50:40
I got an output, a classification, but why did I get it? So let me do relevance propagation. So I go back into the input space, and I label the input that was important. Here, roughly speaking, you get it for free. There are elsewhere issues, so I'm not saying we shouldn't do research there, but in principle, you get a direct connection
51:02
between input and output because there is no input and output, right? There is just a joint distribution, and that is pretty cool. Now, it's also cool for other reasons. So if you're into statistics, here you see an interesting case, the Poisson distribution. So in any of these models,
51:20
you can now have in the leaf distributions not only 0-1, so binary-valued random variables, but any kind of univariate distribution. So, for example, there's also the Poisson distribution, where you count how often something happens. Now, it's super hard to come up with multivariate Poisson distributions.
51:41
It doesn't exist because the link function, the normalization, no one knows how to compute it in finite time because we don't know whether it converges or not, right? So no analytical solution. And then there's all these approximations, and here you automatically do an approximation. You can say it's a bit like
52:01
a hierarchical deep-clustering approach: you just try to get a multivariate density or distribution by learning how to mix together different univariate distributions. And it works pretty well. So here, for Poisson, you can see there can be positive correlation, anti-correlation, no correlation, and so on.
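A quick numerical sketch of why this works, with illustrative rates: mixing two products of independent Poissons whose rates co-vary yields correlated counts.

```python
# Mixture of two product-of-Poisson components: within each component the
# counts are independent, but the mixture induces correlation. The rates are
# illustrative; swapping one component's rates gives anti-correlation instead.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.random(n) < 0.5                  # latent component indicator
lam1 = np.where(z, 10.0, 2.0)            # rate of count 1 per component
lam2 = np.where(z, 12.0, 1.0)            # rate of count 2 per component
x1, x2 = rng.poisson(lam1), rng.poisson(lam2)
print("correlation:", np.corrcoef(x1, x2)[0, 1])   # clearly positive
```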
52:21
Now, learning is conceptually easy. There are two ways you can do that, well, let's say three ways. One is you just do it like in a neural network: you fix an architecture, hope that the architecture covers the correct computational graph, and then you do parameter estimation.
52:43
You can use stochastic gradient descent. As I said, you can take this computational graph, put it into PyTorch, you get auto-differentiation, you get your gradient, and you can use it. That's what we do quite often.
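A minimal PyTorch sketch of that route, for a fixed, tiny circuit: one sum node over K products of Bernoulli leaves, trained by maximizing the exact log-likelihood; K and the random stand-in data are illustrative.

```python
# Parameter learning for a fixed mixture-of-products circuit with autodiff.
import torch

K, D = 4, 8
data = (torch.rand(512, D) < 0.5).float()          # stand-in binary data

logits_w = torch.zeros(K, requires_grad=True)      # sum-node weights
logits_p = torch.zeros(K, D, requires_grad=True)   # leaf Bernoulli parameters
opt = torch.optim.Adam([logits_w, logits_p], lr=0.05)

for step in range(200):
    log_w = torch.log_softmax(logits_w, dim=0)     # log mixture weights
    p = torch.sigmoid(logits_p)
    x = data[:, None, :]                           # (N, 1, D) against (K, D)
    log_leaf = x * torch.log(p) + (1 - x) * torch.log(1 - p)
    log_joint = log_w + log_leaf.sum(dim=-1)       # (N, K)
    loss = -torch.logsumexp(log_joint, dim=-1).mean()   # exact NLL
    opt.zero_grad(); loss.backward(); opt.step()

print("final NLL:", loss.item())
```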
53:00
Sometimes, however, you can do much better, because you have a density: you can also use EM, and EM is typically faster. I will not go into details, but you get that for free. And you can, of course, also do a structure-learning approach, which we will not touch upon here. So I just want to let you know that you can do all the fancy stuff. And now let me illustrate that to you.
53:20
First of all, let me repeat the same evaluation with MNIST and the other data sets. Now you have a joint distribution. We do it by just randomly creating a large computational graph, and then we train its parameters. And you see now that you get a different picture, right? So you get the peak of MNIST here,
53:41
and it's again log-scale. So this one is really big, and the rest is separated. So because you have a joint distribution, this model instantly can understand, saying, hey, come on, don't trick me. You have trained me off MNIST, and now you give me something that doesn't look like MNIST. I will abstain, right?
54:00
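A sketch of the abstention idea, with `model_logpdf` below as a stand-in for whatever trained circuit you have: pick a threshold from in-distribution log-likelihoods and abstain below it.

```python
import numpy as np

rng = np.random.default_rng(0)

def model_logpdf(x):
    # Placeholder density; in practice this is the circuit's forward pass.
    return -0.5 * np.sum((x - 1.0) ** 2, axis=-1)

train_ll = model_logpdf(rng.normal(1.0, 1.0, size=(1000, 4)))
threshold = np.percentile(train_ll, 1)     # 1st percentile of training LLs

def abstain(x):
    # Low joint density means "this does not look like my training data".
    return model_logpdf(x) < threshold

print(abstain(rng.normal(1.0, 1.0, size=(5, 4))))   # mostly False: in-distribution
print(abstain(rng.normal(9.0, 1.0, size=(5, 4))))   # True: looks nothing like training
```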
You can also make use of more interesting machinery. If you have used PyTorch, you may have seen the einsum notation as a building block. Operations like a sum over products can be implemented locally very efficiently on CUDA with einsum, so we now use it for probabilistic models too. You get so-called einsum networks (EiNets), and they scale much better.
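A hedged sketch of the einsum idea behind EiNets (not their actual code): one einsum evaluates a whole layer of sum nodes over pairwise products of child log-densities, instead of looping over nodes in Python.

```python
import torch

B, P, S, C = 32, 8, 6, 4                       # batch, partitions, sums, children
left = torch.randn(B, P, C)                    # child log-densities (left block)
right = torch.randn(B, P, C)                   # child log-densities (right block)
logits = torch.randn(P, S, C, C)               # one weight per (sum node, child pair)

# Product nodes in log-space: prod[b,p,i,j] = left[b,p,i] + right[b,p,j]
prod = left.unsqueeze(3) + right.unsqueeze(2)  # (B, P, C, C)
# Normalize sum-node weights over all C*C child pairs.
w = torch.softmax(logits.view(P, S, -1), -1).view(P, S, C, C)
# Log-sum-exp trick around a single einsum for the whole layer.
m = prod.amax(dim=(2, 3), keepdim=True)
out = torch.log(torch.einsum('bpij,psij->bps', (prod - m).exp(), w) + 1e-12)
out = out + m.reshape(B, P, 1)                 # layer output: (B, P, S) in log-space
```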
To summarize so far: probabilistic circuits sit at a sweet spot. On the left-hand side you have the graphical models you know. They are tailored towards tractability, but not at the computational level: we do not capture the computations themselves, we take a Bayesian network and fix it to a tree structure. On the other side you have NADEs, VAEs, full-blown Bayesian networks, whatever you want to use. They are good, but intractable. Probabilistic circuits are tractable without being stuck at the tree-structured level. To put it more sharply: with probabilistic circuits you can capture graphical models of high treewidth,
not all of them, but some of them, so it is really a different beast than a graphical model. They are also not perfect, so be careful. In the end we are still doing something like function approximation, now for densities, so they may overfit: they may, in a sense, just memorize the data, give you the wrong answer, and not know when they do not know. You have to regularize here as well. But the fun part is that, in contrast to neural networks, dropout is a tractable operation, so we can compute exact dropout for free and use that for regularization.
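One way to read the dropout remark (my interpretation, not necessarily the speaker's exact construction): "dropping" an input variable in a circuit amounts to exact marginalization at the leaf, rather than a crude zeroing.

```python
import torch

x = torch.tensor([0.3, -1.2, 0.7])
keep = torch.tensor([True, False, True])          # False = variable "dropped"
leaf_log = torch.distributions.Normal(0., 1.).log_prob(x)
# A dropped variable is marginalized exactly: its leaf contributes log 1 = 0.
leaf_out = torch.where(keep, leaf_log, torch.zeros_like(leaf_log))
# Feeding `leaf_out` upward yields the exact marginal density over the
# surviving variables: no Monte Carlo averaging over dropout masks needed.
```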
Okay, so now you have learned about circuits. How can these circuits be interesting? Well, this already smells like automated density estimation: find some magic compute structure, throw your data at it, and get your density back. But we have still made decisions along the way. We all know the CRISP-DM cycle for data mining: you start with a question, and then you gather data because of that question. It is typically not true that people just download LAION-5B and that happens to be the data for their question. Typically, people really have to gather their own data, and if you then want to learn a model, a density estimator, you have to make decisions.
Is it a Gaussian distribution? Is it a gamma distribution? Is it a Poisson? A generalized gamma? Not easy. And I find it somehow interesting that machine learning typically just goes for the Gaussian. In variational methods, everything is Gaussian, but maybe it should not be. For example, for counts many people still use Gaussians, because for large counts you can argue it is a good enough approximation. But for low counts you put too much probability mass on the negatives, which is weird: if vote counts could be negative, Trump would instantly say he is still president.
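The quick arithmetic behind that warning, using SciPy:

```python
from scipy.stats import norm

# A Gaussian fitted to counts with mean 2 and standard deviation 1.4 puts
# almost 8% of its probability mass below zero: "negative counts".
print(norm(loc=2.0, scale=1.4).cdf(0))   # ~0.077
```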
So we have to be a bit careful. You make all these decisions, but I am not a statistician; it is hard, and it is even hard to justify them. Here is one example where I do not even know how to do it: what is the distribution of age at diagnosis? I do not know. A parametric family? A Poisson? No, I do not think so; I do not think a single standard form fits. Anyhow, I am just saying it is super hard to choose and to justify, and you make assumptions. So here is what we did.
As a first step, we said we do not want a graphical model, we go for a probabilistic circuit, and in the leaves we use a nonparametric density representation: a histogram, say, or a bit more fancy, a piecewise-linear density. That is roughly what we did. And it is cool, because for the first time you get a model that can really give you a density estimate no matter what the underlying multivariate distribution is. For this we just had to use nonparametric independence tests and some other machinery, but it works.
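A minimal sketch of such a nonparametric leaf, here a histogram density (bin count and smoothing constant are my choices, and out-of-range inputs are crudely clamped to the boundary bins):

```python
import numpy as np

class HistogramLeaf:
    def __init__(self, data, bins=32):
        # density=True normalizes the histogram into a proper density.
        self.density, self.edges = np.histogram(data, bins=bins, density=True)

    def logpdf(self, x):
        idx = np.clip(np.searchsorted(self.edges, x) - 1, 0, len(self.density) - 1)
        return np.log(self.density[idx] + 1e-12)   # smooth to avoid log(0)

# Works no matter which parametric family actually generated the data:
leaf = HistogramLeaf(np.random.gamma(2.0, 2.0, size=5000))
print(leaf.logpdf(np.array([1.0, 4.0, 20.0])))
```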
However, once we used it in applications and talked to people, they instantly told us: but damn it, now that I can do that, I want to know, is it a Gaussian? It is funny. In the beginning they say: I do not know, I do not care, just solve it. Once it is solved, of course, being smart people, they would like to understand. So ultimately you face the following problem. You get a data set that is not very colorful, with a lot of question marks, that is, missing data. And ultimately you would like an understanding of it as a mixture of very different random variables: maybe something exponential, something Gaussian, a gamma, all these contributing. For some entries you fill in the missing information, and for some you even say: I am not really sure whether this one should be exponential or not.
So how do you do that? As the next step, we took the probabilistic circuits with these nonparametric leaf densities and combined them with the Bayesian type-discovery model that Zoubin Ghahramani and Isabel Valera introduced. Then you get a model that not only gives you the joint density, but also decides on the statistical type of each variable and expresses its own uncertainty about that type.
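A heavily simplified toy version of the type-discovery idea (the actual model is the Bayesian one by Valera and Ghahramani combined with circuits; here I just score candidate families by likelihood under a uniform prior):

```python
import numpy as np
from scipy.stats import norm, gamma, expon

data = np.random.gamma(3.0, 1.0, size=2000)
candidates = {
    'gaussian': norm(*norm.fit(data)),
    'gamma': gamma(*gamma.fit(data, floc=0)),
    'exponential': expon(*expon.fit(data, floc=0)),
}
# Crude "posterior" over statistical types from per-family log-likelihoods.
ll = np.array([d.logpdf(data).sum() for d in candidates.values()])
post = np.exp(ll - ll.max())
post /= post.sum()
for name, p in zip(candidates, post):
    print(f"{name}: {p:.3f}")          # gamma should dominate on this data
```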
Ultimately, I would love to re-implement that system and use it for automated ML in general; I think that would be pretty cool. But we got a bit sidetracked. You may think: so what? Densities, densities, densities; I am not a statistician. True, and you do not have to be a statistician. But densities are interesting for many other things too.
For example, everyone knows that Fourier transforms can be used for time series. But you can also get a density estimator over time series in the spectral domain: you take your time series, apply the Fourier transform, and put a probabilistic circuit on top, except that the circuit now has to be complex-valued, with a real and an imaginary part. You can then show that this gives a proper likelihood for time series. That is exactly what we did, and it works pretty well.
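A sketch of only the representation step (the actual model is a complex-valued circuit; everything below is my own illustration):

```python
import numpy as np

# Toy series: a sinusoid plus noise.
series = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.1 * np.random.randn(256)
coeffs = np.fft.rfft(series)
features = np.stack([coeffs.real, coeffs.imag], axis=-1)   # (n_freq, 2)
# `features` would be the input to a complex-valued probabilistic circuit.
# Under Whittle's approximation, frequencies are treated as (asymptotically)
# independent, which matches the circuit's product structure nicely.
```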
We could show, I do not know whether you can see it from here, that it works better than an LSTM, and better than some other baselines we compared against, at least on this data. I am not claiming it beats LSTMs in general; they scale much better. You can also go for an autoencoder over time series, and so on.
Even more interesting, going back to the loss-function question: we believe we can now have a kind of nonparametric loss function, at least for a subset of loss functions, expressed by a probabilistic circuit together with a neural network that does the actual prediction into the future. Essentially, the neural network is fed into this Whittle network. That works pretty well, and you get time-series prediction and uncertainty quantification in one model. But selecting the architectures is getting harder.
And this is where, next week in Paris I guess, we will show how you could do it. This brings me back to the fully differentiable view that I said was tricky in the beginning, but which we are now using for automated ML. How do we do that? Keep in mind you have a neural network and a probabilistic circuit, and they interact. You can first select what type of neural network you want; you may ultimately even select which kind of probabilistic circuit you want.
So you have decisions on the macro level, and once you have decided there, you also have to decide how each of these models looks locally, a kind of micro level. Macro level and micro level. You could say you have to run two nested searches: two neural-architecture or computational-graph searches, one for the neural network and one for the probabilistic circuit, and they are nested. Nested optimization problems like this are called bi-level optimization or programming problems, bi-level because you have something like a min of a max, or a max of a min. In some configurations this nesting is not problematic; in other cases it is. But you can try to approximate the inner optimization: if it is a max, for example, you can approximate it by a softmax, and then you can get one gradient through both of them.
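A hedged sketch of that relaxation, in the spirit of DARTS-style differentiable search rather than the exact system from the talk: the discrete choice among candidate operations becomes a softmax, so one backward pass yields gradients for both the architecture logits and the model parameters.

```python
import torch

alpha = torch.zeros(3, requires_grad=True)      # architecture logits: 3 candidate ops
w = torch.randn(3, 4, requires_grad=True)       # parameters of each candidate op

def mixed_op(x):
    # Soft approximation of the inner argmax: weight each op by softmax(alpha).
    weights = torch.softmax(alpha, dim=0)
    outs = torch.stack([torch.tanh(x @ wk) for wk in w])   # (3, batch)
    return (weights.view(-1, 1) * outs).sum(dim=0)

x, y = torch.randn(8, 4), torch.randn(8)
loss = ((mixed_op(x) - y) ** 2).mean()
loss.backward()                                  # one gradient through both levels
print(alpha.grad, w.grad.shape)
```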
That is essentially what we do here. At least empirically, it works pretty well, often better than what you do by hand in terms of standard evaluation. It is also interesting in terms of the different blocks, for example in the neural architecture search: which operations get used, and how often. So I think there is still enough to be done, but there is an issue. Everything we are doing here very much depends on the data, and your data might not be your best friend all the time.
There are different words for that. Some people call it shortcuts; others have called it Clever Hans and Clever Hans moments. Clever Hans is the horse you can see here, a very famous horse back in the day, at least, because it was claimed to do arithmetic. The owner, the person in front, would ask Hans: what is two plus three? And the horse would tap its hoof, for two plus three it would tap five times, and people believed: wow, this horse can do basic arithmetic.
A bit like the Mechanical Turk (and sorry for the word, but it is the technical term), it turned out that owner and horse were using extra signals, shortcuts: the horse had simply learned to behave differently for different signals. It knew nothing about arithmetic at all; it had learned, if this signal happens, I do this, and if that signal happens, I do that. And this is what some people claim some current neural networks are doing. It is an interesting debate, and I have not fully made up my mind.
I think what they do is interesting, but we still have to do more. We have looked at this, and we believe it may happen in particular when you use a powerful model with small data, and small data quite often happens in science. Here is one example. You see little plant tissues. The question plant biologists have is this: if climate change continues this way, and the population keeps growing the same way, we run out of food very, very quickly. So the question is how to feed a hungry world.
For that, they would like to understand how plants react to drought stress, but maybe also to other stresses; stress does not always have to be abiotic, it can be biotic. So you have this plant tissue, and since it has been cut out, you have to keep it alive; it needs something like its own Coke to survive, so you supply nutrition through a liquid. Then you can use hyperspectral imaging to look into the plant and see what is going on, without killing it completely.
That is what you can see here: the upper image is the real-world image, the lower one a simple projection of the hyperspectral image. In particular, we asked the model: which parts of the input have you used to decide whether the tissue is drought-stressed or not? That is roughly the idea. And what happens is that the model does not care about the plant at all. Because we take the tissue's nutrition from this liquid, anything horrible happening to the plant also changes what is taken up from the liquid; it is a confounder. So instead of looking at the plant, you can just as well look at what is going on in the liquid and base your decision on that. The biologists were laughing at us: wow, all this magic machine learning, and it is not even looking at the plant. That is biologically implausible, so we will not use it.
So we had to get the machine to focus on the sweet spot, the plant. That is exactly what we did. We say: explain yourself, tell us where you mainly looked to make your decision; and when that was outside the plant, we said: no, that is biologically not really plausible. Over iterations of saying no, the model adapts. There are two views on this. You could say the loss function is parameterized: by giving feedback on a meta level, you are telling the model it has apparently used the wrong loss function and has to change it. Or you could interpret it as providing extra data; that is a bit of a dual view. Either way, because of this extra knowledge it can continue to learn and adapt, and in the final row you see that it starts to focus at least on the plant.
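A sketch in the spirit of "right for the right reasons" losses (Ross et al.; not necessarily the exact loss used here): penalize input gradients that fall outside a user-given mask, so that saying "no, do not look there" effectively reparameterizes the loss.

```python
import torch
import torch.nn.functional as F

def rrr_loss(model, x, y, outside_mask, lam=10.0):
    # Standard loss plus a penalty on explanations outside the allowed region.
    x = x.clone().requires_grad_(True)
    logits = model(x)
    ce = F.cross_entropy(logits, y)
    log_probs = torch.log_softmax(logits, dim=1)
    grads = torch.autograd.grad(log_probs.sum(), x, create_graph=True)[0]
    penalty = (outside_mask * grads).pow(2).sum()
    return ce + lam * penalty

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 2))
x, y = torch.randn(4, 1, 8, 8), torch.randint(0, 2, (4,))
outside = torch.ones(4, 1, 8, 8)
outside[:, :, 2:6, 2:6] = 0           # the "plant" region: gradients allowed here
loss = rrr_loss(model, x, y, outside)
loss.backward()                       # pushes the net to look at the plant
```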
We have also used this in medical domains and many, many others. There is a nice overview in Nature Machine Intelligence. And upstairs I saw at least one interesting poster where I believe you could plug this in.
Now, is automated ML not a bit of a similar setting? Are you sure your AutoML is not picking up a shortcut? Maybe it is such a strong shortcut that it gets picked up every time, no matter which member of your portfolio you use, if you do portfolio-based AutoML; and if you go for a Bayesian approach, maybe it is a similar story. That is why what we are currently looking at is this: can we have a probabilistic circuit that looks at the estimated parameters,
and at any point, in any iteration, can say: actually, maybe what we have done here is not so interesting, and I think this parameter is less relevant. So we do it in a more interactive way. This is the initial idea; we are not saying we are done there. Essentially, we are now trying to go into the loop, and the probabilistic circuit helps us sample the next surrogate function. This can be for a single model or across different models, depending on where you parameterize it. And every time we sample, we can run inference and say: some parts are relevant, other parts are not.
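A very loose sketch of the loop's shape only; all names below are mine, and the toy "surrogate" is a local search rather than a probabilistic circuit:

```python
import numpy as np

rng = np.random.default_rng(0)
evaluated = []                                    # (config, score) history
relevant = np.array([True, True, True])           # user-editable relevance flags

def propose():
    # Stand-in for sampling from the surrogate: perturb the best config so
    # far, but only along dimensions the user still considers relevant.
    if not evaluated:
        return rng.uniform(0, 1, size=3)
    best = max(evaluated, key=lambda cs: cs[1])[0].copy()
    best[relevant] += 0.1 * rng.standard_normal(int(relevant.sum()))
    return np.clip(best, 0, 1)

for step in range(20):
    cfg = propose()
    score = -(cfg[0] - 0.3) ** 2 - (cfg[1] - 0.7) ** 2   # cfg[2] is a decoy
    evaluated.append((cfg, score))
    if step == 10:
        relevant[2] = False           # feedback: "this parameter is irrelevant"
print(max(evaluated, key=lambda cs: cs[1]))
```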
That is actually working not too badly as a first setup. But here is my main takeaway message. I am, no, not jealous, I am super proud that you and Frank and all these people are pushing automated ML; I think it is so important. But when I go into the domain-expert area, I also see that the experts like to remain in charge.
So what I want, as I said, is AutoML in the loop, and for that I think there are many, many new questions now. That is my main takeaway message. If I were a PhD student, next to the many theoretical questions, this is what I would maybe work on, perhaps together with a large language model.
With that, let me conclude; maybe we can discuss many of these points afterwards. I showed you at least a bit about tractable probabilistic circuits; keep them in mind as an alternative. There are also combinations with Gaussian processes, so all of HPO can be done with them as well; in my opinion, probabilistic circuits might be interesting there. That is the main takeaway. But there is the other takeaway I showed earlier: the game is not over. I think it has just started, and the biggest step now is to tell people that just enough hardware is not the answer. We are a scientific discipline, and there is so much interesting and important work still to be done. Thanks.