
OpenML, R, mlr


Formal Metadata

Title: OpenML, R, mlr
Number of Parts: 4
License: CC Attribution 3.0 Germany. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Abstract
I will first introduce an R package to interface with OpenML. We support querying and downloading, running experiments and uploading results, so that all your experiments are organized online. R itself allows many forms of machine learning methods and experiments, from completely custom code to powerful semi-automated frameworks; the OpenML package is framework-agnostic in that regard. The mlr package provides a generic, object-oriented, and extensible interface to a large number of machine learning methods in R. It enables researchers and practitioners to easily compare methods and implementations from different packages, rapidly conduct complex experiments, and implement their own meta-methods using mlr's building blocks. Classification, regression, survival analysis, and clustering are supported, as well as virtually every resampling strategy. Meta-optimization can be performed via tuning, feature filtering, and feature selection, and most modeling steps can be parallelized. Its object-oriented structure in many cases provides a close match to the OpenML structure, and it can already be connected to the OpenML R package in a simple manner. The talk will conclude with an outlook on the next steps, open challenges, and ideas to improve upon the current state of the project.
Transcript: English (auto-generated)
So the next talk is by Bernd Bischl. He's the main developer of the mlr package. He'll explain how OpenML and mlr work together nicely. Hopefully, yeah. What do I have to do with this here? Just use it?
Yeah, thanks for the intro. Give me one second. So, let's hope this works. I'm going to grab this one here. Can I borrow your presenter by any chance?
Let's see if this works automatically. If it is Linux, it will work automatically. Great, thanks a lot. I'm actually going to talk about four things today, I think,
and I hope I will be able to do this in 30 minutes, so we have some time left for discussion, because I'm going to end with a couple of suggestions or questions. Like Joaquin said, I'm going to introduce mlr, which is a package that I and many other people have been developing now for a couple of years.
I started working on this OpenML project, I think, one and a half years ago, and Luis and I were actually the two guys who wrote probably the first lines of it, in maybe the second meeting of this group here that Joaquin kindly invited me to.
Yeah, let's go through all of these guys here quickly, because without them, all of what I'm presenting here today would probably look pretty different. So, there's Michel, who's over there, Jakob Richter, Lars Kotthoff from the Cork Constraint Computation Centre in Ireland,
there's Julia Schiffner in Düsseldorf, and Eric Studerus from Basel, and they all work together with me on this mlr package. Dominik Kirchhoff, who is now working in Michel's group at TU Dortmund as a master's student and student assistant,
is putting many hours into the OpenML package now, and Michel has also started working on the OpenML package, and he's collaborating with me on probably a dozen other R-related packages. So, what's mlr? mlr is short for Machine Learning in R.
It exists on CRAN, which is the official server to distribute R libraries and packages. It also exists on GitHub, so if you just want to remember one link, remember the GitHub link and everything else is referenced from there. It's version 2.1, which is out right now, and we will release 2.2 during the next few days,
probably while I'm still here, because if I don't I will get another hate mail from Brian Ripley, I'm pretty sure. The idea behind the package is to give you a unified interface for every basic supervised or unsupervised machine learning technique in R. Actually, at least that's the goal,
and the reason why we want to have that is that machine learning experiments are pretty well structured. They have many common operations, and what we really want to do is we don't want to tackle one specific sub-aspect of machine learning. We really want to have an expressive language. We want to have an ontology of machine learning,
and stuff gets interesting if you can combine all of these things that you need to do in your experiments and model them as a whole, use them as a whole. For that you need a language, and the basic aspect of the language, the words are basically these underlying machine learning methods,
like regression and classification techniques, and then many other things can come on top of that that use this common interface. We can actually use, I sometimes call this a Lego approach, where you combine and plug in different of these steps together,
and you can end up in the end with something like this here, which I think we copied from the RapidMiner tool. We don't really like to do this type of visual programming, but you can do the same by writing small expressive R programs, and I will show you how this works hopefully in a few slides.
The idea is to provide abstractions for everything, glue code together, so many, many things in the package are not programmed by ourselves as such, so we didn't program the SVM algorithms that are in there, but we programmed many meta algorithms on top of that. I will come to that in a second.
The package has grown quite huge. I just checked, I think, a few minutes ago. It's now 14,000 lines of R code, just code, plus the associated roxygen documentation, and I think 6,000 lines of unit tests. What's in there for basic machine learning tasks?
We can currently cover normal regression, normal supervised classification, cost-sensitive classification, a pretty general definition of that, although not that many algorithms yet for it, so this is pretty experimental. Lars Kotthoff made it possible that we can now talk about clustering tasks
and try to solve them with mlr, and Michel worked quite a lot on making survival modeling possible, and I'm going to introduce what that is to you at the end of the talk, because I really want to see this in OpenML as well, and I'm going to argue why, also towards the end of the talk.
Internally, we have such tasks too. Joaquin already told you about tasks in OpenML. Tasks in mlr are pretty similar, right? It's a data frame with the input data, so your features, your outputs, and that's annotated. You know which are the target columns. You might have weights in there to weight observations.
You might have misclassification costs in there, so it's data with extra meta information, like you want to have it, and everything in mlr is an object, so it's pretty object-oriented, and you can program on that as well. Everything is tagged with meta information, so you can list everything. You can ask what is the subset of, I don't know,
learning algorithms or measures that have a certain property and so on, which in my opinion makes it really nice to combine with something like OpenML. This is how a first modeling step might look with mlr,
so you create a classification task, you put in the data, in this case the famous iris dataset, you specify the target variable, and you have a task, and here you see what mlr then constructs from it, so this is the printout, a nice summary of the task. You see how many observations there are, what types of features are in there,
so just four numeric features, no factors, no ordered features, whether it has missing values, weights, blocking. Blocking is exactly what one of you guys asked about, whether it's possible to leave one person out or so; that's what you can do with this blocking here. You can say, well, these observations belong together,
and if they belong together, then during resampling they will either all go to the training set or all go to the test set, because in some scenarios it's very common that you need something like that. Think about, I don't know, looking at images where you look at different sub-segments of the image, or different sub-segments of songs, sound bites; that's where you need to do something like this. I need to watch the time probably a bit.
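A minimal sketch of this step (the blocking factor below is invented for illustration):

```r
library(mlr)

# create a classification task from the iris data
task = makeClassifTask(data = iris, target = "Species")
print(task)

# hypothetical blocking: observations sharing a factor level always go
# together into either the training set or the test set during resampling
blocks = factor(rep(1:50, each = 3))
task.blocked = makeClassifTask(data = iris, target = "Species",
                               blocking = blocks)
```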
So, what do we have on the side of the learning algorithms? About 40 classification algorithms, a couple of clustering algorithms (these are still growing because this is very new),
23 regression techniques, seven survival methods. And by the way, this whole talk is actually written with knitr, which Luis introduced, and, well, I didn't write those numbers myself; I asked the package how many of those things it has.
We have reduction algorithms for cost-sensitive classification, so these are, again, meta techniques where you rely on the basic regression and classification techniques, usually in a weighted sense. There are functions in there, of course, to train and predict
for each of these methods. That's what the interface is made of for these learning algorithms. And each of these learners has an associated parameter set that we can also ask about. I think that's on the next slide exactly. So how would you use such a learning algorithm?
Well, you call makeLearner and put in the name here, so this says: give me a classification algorithm, I want to have rpart, that's recursive partitioning, that's just a decision tree, right? So that's CART, if you know that. And you can, again, print it and summarize it, and it tells you, well, I'm from package rpart. By the way, this will get automatically loaded.
I'm in a classification algorithm. My name is decision tree. I have a short name, which is rpart, which is useful for tables and plots and so on. It has a class. It has many different properties, so it can handle two-class classification, multi-class. It can handle missing values, numerics, factors. I mean, it's a tree, right?
It can do basically everything. That's nice about trees. And it has a certain predict type, so it will predict class labels, but you can change that to probabilities. The method knows whether it can support that as well. Here. And there's only one hyperparameter setting actually changed,
and that's don't do cross-validation internally, because we do this with mlr. And we don't want to waste time, so that's switched off by default, but you can change that again. And you can ask, well, what are all the possible hyperparameters? And it took a huge effort to manually enter all of this
for all of these associated algorithms, but now it's there. So you can see all of the different hyperparameter settings for the decision tree. You can see the type. It might have a length if it's a vector. There's a default value taken from the documentation. Constraints. So this guy goes from one to infinity, well, because it's a counting variable.
You can say, well, this parameter is actually a dependent parameter. So think about a kernel for an SVM, right? So the kernel width, which is usually called gamma, that only makes sense if you use the RBF kernel or maybe one or two of the other kernels, but definitely not if you use a linear kernel. So you can model dependency structures in parameter space.
That gets interesting if you think about tuning those guys. And you can associate transformations with this. So you can do something like automatically tuning on a log scale. You can say, here's a parameter; every time you apply it, please do two to the x before you apply it, right? Because that's what we do when we optimize things that go from zero to infinity, right? We optimize on a log scale.
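A small sketch of both ideas, the dependent parameter and the log-scale transformation; the ranges are made-up values, and the parameter names follow kernlab's ksvm:

```r
# inspect a learner and its hyperparameters
lrn = makeLearner("classif.rpart")
print(lrn)
getParamSet(lrn)
lrn = setPredictType(lrn, "prob")  # switch from labels to probabilities

# a parameter set with a dependency and a transformation:
# sigma is tuned on a log scale and only active for the RBF kernel
ps = makeParamSet(
  makeDiscreteParam("kernel", values = c("rbfdot", "vanilladot")),
  makeNumericParam("sigma", lower = -10, upper = 10,
                   trafo = function(x) 2^x,
                   requires = quote(kernel == "rbfdot"))
)
```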
Again, there are many different performance measures in there, quite a lot for classification. This huge number is due to the fact that we have all of these ROC measures, right? Like true positive rate, G-mean, F1 score, and so on.
Regression measures; only one for survival analysis, because it's pretty hard to implement these technically. Some clustering stuff, and general measures like timings that you can compute for everything, right? You can ask how long training or prediction took. And again, these classification measures have properties
so you can see whether they are available for multi-class classification or only binary and so on. You know whether they should be minimized or maximized, their best and their worst value, and so on. Now, this is all pretty basic. Here it becomes a bit more interesting.
We can do resampling now. This is what Luis already talked about. It's about basic performance estimation and doing this properly in machine learning experiments. Probably you all know this by heart. So we have cross-validation, bootstrapping, subsampling, stratification is supported,
and there's this blocking structure that is supported. Yeah, I think that's basically it. All of these esoteric variants of bootstrapping like B632+ are in there, if you want that for your really small datasets with less than 200 observations.
Here's how you call it. So you create a learning algorithm first, again, let's use a decision tree. We do a 10-fold cross-validation, so we create a description object. We create resampling description, cross-validation, 10-folds. We create some measures, so we measure the mean misclassification error and, stupidly, also the accuracy.
I mean, that's just one minus the error in this case. And you call resample: put in the learning algorithm, the task, the resampling description, and the measures. And you get back all the predictions, and all of the measurements on just the test sets.
You could also evaluate on the training sets. And in the end, you're usually mainly interested in the aggregated result over the whole cross-validation. But if you want to have the detailed information, you get that out of here. So there's one S3 object that contains all of this stuff.
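In code, the whole resampling step is roughly:

```r
lrn = makeLearner("classif.rpart")
rdesc = makeResampleDesc("CV", iters = 10)  # 10-fold cross-validation
res = resample(lrn, task, rdesc, measures = list(mmce, acc))
res$aggr  # aggregated result over the whole cross-validation
res$pred  # all individual predictions on the test sets
```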
Nowadays, we even have a shortcut for this, so you can just say crossval and do a 10-fold cross-validation or a 5-fold or whatever. Actually, for many of the things that I'm showing here today, there are nowadays shortcuts for interactive work. But the main aspect of the package
is to have a clean interface so you're able to do everything you want. I should also mention that if a certain learning algorithm is not implemented, it's well documented on the web page how you do this. So you take basically one page of text, you copy-paste it, and you fill in a few details. So how to call the training function,
how to call the test function. And the most annoying aspect is that somebody has to put in all of the parameter information once, but you are allowed to leave that out. But then you don't get some nice extra checking functionality of the package, but everything will work. You can even optimize and tune on this. So you can be very lazy.
But the most important thing is if you do something like this, if something is missing, please tell us, please inform us, go to the GitHub tracker, put up an issue and tell us, well, learner X is missing, here's my first try, and we will integrate this pretty quickly. You can also do benchmarking. You can join a couple of tasks.
In this case, we took the Iris task, we took the Sonar task, we took three learning algorithms, decision tree, random forest, and SVM, and you can call one function, benchmark, and this will run all of the tenfold cross-validations on all of the sets through all of the learners. So it's just a couple of nested loops, and the nice thing is it's not really loops. It's lapply statements, parallel lapply statements internally, so you can parallelize all of this. You can parallelize resampling, you can parallelize this benchmarking, you can parallelize feature selection, which I haven't talked about yet, and parameter tuning.
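A condensed sketch of the benchmark just described, using the example tasks that ship with mlr:

```r
lrns = list(makeLearner("classif.rpart"),
            makeLearner("classif.randomForest"),
            makeLearner("classif.ksvm"))
tasks = list(iris.task, sonar.task)
bmr = benchmark(lrns, tasks, makeResampleDesc("CV", iters = 10))
print(bmr)  # aggregated performance per task and learner
```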
And you can do this on, well, I'm not going to say every system, but virtually every system that I have worked on yet, because we use our parallelMap package, which works on local machines, on multi-core machines, in socket mode, and on MPI clusters. We also integrated another package
that we wrote for HPC computing, so high-performance computing. This is another project that Michel and I did together, BatchJobs, and it enables you to use virtually every high-performance batch computing system. So if you know what a Torque system is, or a Slurm cluster, or LSF (Condor is not supported yet, but most of those are),
this will work out of the box once you have configured it once for your site. Another really nice thing, and the reason why I actually wrote this parallelMap package, is that you can tag these operations that should maybe run in parallel with names, and you can then from the outside
select whether they should be parallelized or not. So look at this here, right? There are already different stages of the experiment, and depending on what you do, depending on whether you do a holdout, or 100-fold bootstrapping, and how many datasets you have,
you might want to parallelize on a different level, right? So what you could do is, you could say one parallel job is running one cross-validated algorithm on one dataset, right? So that's the outer level. But you could also parallelize, well, the resampling itself, right? That's the inner level. And maybe then there's feature selection or tuning
in there as well, and then there's a third level. And what's technically the best thing depends on sizes, right? It depends on the runtime behavior of the algorithm, and it depends on the sizes of datasets and settings, and we don't know. I mean, how should we know how big your dataset is, or whether you use holdout or cross-validation?
I don't know. So you just select the right level, because usually it's a pretty easy choice if you know how your experiment looks, and you just tell the system. So you call parallelStart, you tell the system the backend that you're using, maybe MPI or whatever, and you say, I want to parallelize the resampling, or I want to parallelize this benchmark operation.
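A minimal sketch with parallelMap; the level name here follows the levels that mlr registers:

```r
library(parallelMap)
# pick a backend and a named level: only the resampling iterations
# run in parallel, everything else stays sequential
parallelStart(mode = "socket", cpus = 4, level = "mlr.resample")
res = resample(makeLearner("classif.rpart"), iris.task,
               makeResampleDesc("CV", iters = 10))
parallelStop()
```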
There are many visualizations in there, so these are nowadays used for teaching, sometimes to show the students what a certain model might look like in 2D. So this is a random forest, again, on the iris dataset, and what you see here are the iris data points
in the first two features of the iris dataset. You can see the classes visualized by these dots and triangles and squares, whether there were mistakes made by the model, and the color encodes the posterior probability distribution of the random forest.
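Roughly like this; plotLearnerPrediction trains on two features and returns a ggplot2 object that you can keep modifying:

```r
library(ggplot2)
p = plotLearnerPrediction(makeLearner("classif.randomForest"), iris.task,
                          features = c("Sepal.Length", "Sepal.Width"))
p + theme_minimal()  # tweak the returned plot object
```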
You can do this for clustering or regression as well. So this is just nice for working with beginners, but I sometimes still learn something new if I look at this. And this is, by the way, again related to Luis's talk. We always use ggplot, because we like it as well, and you get back a ggplot2 object, and you can change it in the end, right,
because the plot is an object, and you can apply different operations to it if you didn't like the color coding or whatever we did. There are many, many things that I cannot talk about today, so I just summarized them here on this slide. So we have pretty many pre-processing ops.
Well, I should start differently. So this is all pretty basic that I've shown you so far. It's nice, but this thing here really opens up another level of scaling, and it gives you another order of magnitude
to do interesting stuff with the package. So what you can do is you can take a basic learning algorithm, and you can wrap certain meta algorithms around it or add functionality to it. So what you can, for example, do is say, well, I want to add a certain pre-processing method. Actually, you can do any method you want,
because you can use a pre-processing wrapper that can contain custom code. So you can say, well, before I want to apply this learning algorithm, maybe it's a tree, I want to rotate my data by doing a PCA. You can do feature filtering, right? We have about, I don't know, more than ten different techniques now implemented.
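A sketch of the wrapper idea; this assumes the caret-based preprocessing wrapper and the chi-squared filter, which needs the FSelector package:

```r
lrn = makeLearner("classif.rpart")
# rotate the data with a PCA before training the tree
lrn = makePreprocWrapperCaret(lrn, ppc.pca = TRUE)
# keep only the 50% best features according to a filter
lrn = makeFilterWrapper(lrn, fw.method = "chi.squared", fw.perc = 0.5)
# the wrapped learner behaves like any other mlr learner
res = resample(lrn, iris.task, makeResampleDesc("CV", iters = 3))
```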
There's feature selection in there, sequential forward search that Ross already talked about, if I got that correctly, and a genetic algorithm. We are now working on multi-criteria methods as well. There are different imputation methods. Actually, we have a whole object-oriented system
to do any kind of imputation you want, and Michel has already implemented many useful variants of those for you. You can do generic bagging to do ensemble construction. Stacking is in there if you want to
create ensembles of heterogeneous models. There's over- and undersampling for imbalanced classification problems; we just did a huge study on this and are trying to publish the results. There's also something like SMOTE and so on in there. And there's something that I'm very interested in, which is parameter tuning
for hyperparameter optimization. The nice thing about this is you can not only do parameter tuning for the basic algorithms; you can construct a chain of pre-processing, modeling, and post-processing operations. All of these will have parameters, and you can tune them jointly.
Because this now might get a bit bigger than just tuning two parameters of the SVM, we have smarter algorithms in there than random search or grid search. There's iterated F-racing in there, or model-based optimization. This is about algorithm configuration, a topic that Joaquin already hinted at. This is a pretty hot topic currently.
We like this model-based approach and are also developing stuff for this as well. But you can also do racing if you like that more; that's already nicely integrated into mlr. Model-based optimization is possible as well, but not officially yet. If you want to use Bayesian optimization, model-based optimization, drop me an email. This will take maybe one or two months
until this is ready. We have been using this for one or two years now to optimize our data mining models. Finally, I want to show you something pretty complex. I will show you how to do efficient model selection,
including different learning algorithms, including different hyperparameters with just a few lines of code. The idea is, we have understood how this first level of model optimization works. It's something like maximum likelihood estimation, penalized ML, or you might call this
regularized loss minimization. We have all understood how this works. Why not take the same principle to the next level? The next level usually is, I have this model and this and this and this, and then I do manual experiments and look at what works best. The SVM looks nice. Now I'm going to fiddle around with its parameters so it gets even nicer.
Why not do this algorithmically with a machine? We will just do the same thing that we did on the first level on the next level. This is sometimes called second-level estimation. I think it's a good way to look at this. This is not a term that I invented; I stole it from a paper by Isabelle Guyon.
You can do this in mlr. This term, the model multiplexer, we did invent, because we didn't know what else to call it. You take different learning machines with parameters, you plug them together in a multiplexer, and that multiplexer has one additional hyperparameter,
and this hyperparameter encodes what learner you actually selected. The idea is not to do an exhaustive search on all of these, but to do either Bayesian optimization or F-Racing to focus on the well-performing things. Do efficient optimization on this second level.
You can use every tuner for that that you want. Here is how this code looks. I don't know how many lines these are, ten or so. You construct the multiplexer (it's not really an ensemble, right?) out of the random forest and out of the SVM.
You change the parameters a bit. You say I want to do three-fold cross-validation, in this case because I don't want to wait so long. You say I want to do iterated F-Racing with 200 experiments. You create the parameter set for the model multiplexer. For the random forest, you want to optimize the node size.
For the SVM, in this case, you would optimize the kernel width. It's a fake example, right? In reality, you would put in more parameters. You optimize this on a log scale. You have constraints here, so box constraints on your parameters. Then you do a tuning wrapper.
You wrap around the tuning and put it on top of the learning algorithm. Now you can do two things at once. First of all, you optimize over this pretty large space here. It's not so large now, but just imagine a few more lines.
You do this efficiently with racing. The other thing is, if you use this wrapper approach here, you can do nested resampling. We know if we do this tuning, we cannot just report the best result in the end, because maybe we did one billion experiments on the same data with the same cross-validation. We know that this will be optimistically biased.
Our papers will get rejected, at least hopefully, if the reviewers know what they are doing. We have to nest this into another level of performance estimation. If you use this wrapper approach here, now you just say, well, cross-validate the wrapper, call resample again, and this will do everything at once, another line of code, and you do it in parallel.
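Condensed, the example looks roughly like this; the parameter ranges are illustrative, as in the talk:

```r
base = list(makeLearner("classif.randomForest"), makeLearner("classif.ksvm"))
lrn = makeModelMultiplexer(base)

# one joint parameter set; each parameter is matched to its base learner
ps = makeModelMultiplexerParamSet(lrn,
  makeIntegerParam("nodesize", lower = 1L, upper = 20L),
  makeNumericParam("sigma", lower = -10, upper = 10,
                   trafo = function(x) 2^x))
ctrl = makeTuneControlIrace(maxExperiments = 200L)
inner = makeResampleDesc("CV", iters = 3)
tuned = makeTuneWrapper(lrn, inner, par.set = ps, control = ctrl)

# nested resampling: cross-validate the whole tuning process
res = resample(tuned, sonar.task, makeResampleDesc("CV", iters = 5))
```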
Actually, that's on here. If you want to read about this a little bit more, especially for survival analysis: Michel and I wrote a paper about optimizing pre-processing operations and survival models, in this case with iterated F-Racing, but what we did in there is pretty generally possible for other supervised modeling techniques as well.
Let's get to the OpenML package now, which is, well, I might say fortunately, a little bit smaller,
so I'm not going to use up that much time again. The current API allows you to explore data sets and tasks on the server, to download data sets and tasks. You can register learning algorithms, and you can upload runs. That's what we got so far, and hopefully we'll have more at the end of this week.
It's tool-agnostic, so you don't have to combine it with mlr. The idea is you can use any R package, or you can write custom code if you want to. It might get less convenient, but you can do everything you want. mlr is already a bit more integrated, well, because I like to use this. This is how it looks if you want to explore a data set.
You can ask, well, please, server, tell me what OpenML data sets there are, and you see here, well, the name of the data set, the data set ID, and the version number. You can ask, well, what are the registered tasks? You can see the IDs of the tasks, the names of the associated data sets, and again, the version. Maybe the most useful operation is getting the data qualities.
By calling this function, you will get a table which consists of one line per data set, and which tells you the meta-features or the characteristics of this data set. So how large it is, how many features it has,
whether there are missing values, the number of classes, and so on. So what do you do if you do a study? For example, I did a study on imbalanced data sets. I called this function. I looked at everything that had two classes and was pretty imbalanced, I don't know, at least 10 to 1; a 10 to 1 ratio between the class sizes.
Usually we dislike missing values, so maybe you want to exclude data sets with those, and so on. In the end, you maybe have 30 to 50 data sets that define your benchmark study. Then you can just iterate over them and download them. Actually, it's still a technical problem that this is
available only for the data sets, so we really want to have that for the tasks. We already have an issue for this in the tracker, and hopefully we can figure out how to do this this week as well, because this is holding us back a bit. This is how you can download a data set. You can either use a name or the data set ID. Here you see downloading data set iris from the OpenML repository.
Stuff gets stored in files on disk. In the end, the files are parsed again, so it's XML files and ARFF files. Those files are transformed and read back into memory, and you get a nice S3 object that you can again summarize and now work with.
You have all of the data in there. It's basically again a data frame annotated with extra information. You can download a task, and this will download not only the data set object, but information on what you are supposed to do now with this data.
It will also download the cross-validation splits and so on. Again, you get this in a nice object that you can now query and use. You can now leave the OpenML package if you want. Even these two operations would be incredibly useful, I would say, for practical work. But you can also now use every machine learning algorithm you wish on this.
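A sketch using the function and column names of the OpenML package as later released on CRAN; the development version shown in the talk may have used different ones:

```r
library(OpenML)

dsets = listOMLDataSets()  # names, IDs, versions, and data qualities
tasks = listOMLTasks()

# e.g. select binary, clearly imbalanced data sets without missing values
sel = subset(dsets, number.of.classes == 2 &
                    majority.class.size >= 10 * minority.class.size &
                    number.of.missing.values == 0)

oml.task = getOMLTask(task.id = 59)  # downloads data, splits, and task info
```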
Here again I create an mlr learning algorithm, and there's a run-task function, which is a very convenient way to produce all of the predictions for a certain mlr learning algorithm. You see the cross-validation runs here, and we have some basic pre-processing in there,
so we are apparently dropping a couple of columns because they're completely constant, and most learning algorithms don't like that. Then you get a result, so 97% accuracy, and of course, most importantly, the predictions, and those we might want to upload to the server again. This is the only thing that I didn't switch on,
because I don't want to change the server. You authenticate at the server now because you're going to change data. You basically register the mlr learning algorithm. I think you won't have to do this anymore in a couple of weeks, because all of these learning algorithms will already be registered.
And you upload the run results, which are basically the predictions, and then the server will evaluate them for you, I think for every performance measure which the server knows. You could also evaluate yourself and upload this, and then the server is going to believe you.
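Again with the released package's names; runTaskMlr, saveOMLConfig, and uploadOMLRun are assumptions based on the CRAN version of the package:

```r
lrn = makeLearner("classif.rpart")
run = runTaskMlr(oml.task, lrn)  # produces all cross-validated predictions

saveOMLConfig(apikey = "YOUR_API_KEY")  # authenticate against the server
uploadOMLRun(run)  # the server then evaluates the uploaded predictions
```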
I now have two slides. Does everybody know what survival analysis is? Because then I'm going to skip them. I'm not going to invest much time in this because I am maybe over time already anyway. The thing I want to talk about is that I want to convince you guys
to also make this possible in OpenML, because it's a pretty important task for biostatisticians and bioinformatics guys. Survival analysis is about predicting how long a certain patient will survive.
Maybe a certain group of patients that all have a type of cancer. I don't know anything really about this, so Michel is going to interrupt me immediately when I talk nonsense. So we have a couple of patients, and we want to know when they are going to die, right?
It's a sad thing, but the good thing about this is that we want to relate this information to their gene expression profiles and then hopefully figure out which genes are responsible maybe for shortening their lifetime so drastically so we can help them, right?
So it's basically like a regression. You want to estimate how much time is left for that person, with one little twist which makes it more complicated. So we will have people in a medical study, and we have a certain event, that's the death, right?
And we can measure the amount of time that passed until the event happened. But it might be the case that some of these patients actually leave the study for whatever reason, and we don't get a measurement.
So that's the censoring. You can get a right censoring, which is when people leave the study and you can't contact them anymore. So we know they survived at least five years, but maybe seven or ten or whatever, right? They might also have entered the study and we don't know when.
So this might be a left censoring, and it could be a censoring on both sides. And this makes it more complicated, right, because we know some information but we don't have the true measurement. And there's a whole area in statistics that deals with that, which is called survival analysis.
Now, probably most importantly for OpenML, what does the data look like? So we have clinical covariates. These are just normal features: age of the patient, their weight, I don't know, whatever. And we have high-dimensional genetic gene expression data.
The clinical covariates are, I don't know, 10 to 100 features, I would say. A few, a bunch, right? Normal size for machine learning. The gene expression data will be high dimensional. So this might be, like Michel said, maybe 50,000.
And for next generation sequencing, I don't know, half a million. And the nice extra catch is that you have probably only 100 or 300 observations. So that's a really tough problem, right? And we have this timing information and whether an event happened or not.
So we can infer what type of censoring took place. To motivate you that this is not an esoteric problem looked at by a few people in Dortmund: well, it is looked at by a couple of people in Dortmund, but it's also very, very relevant to many people working in statistics. So there's this Kaplan-Meier estimator, which gives you a basic estimate of what this survival function looks like.
That's, I think... Michel googled this a couple of days ago; it's one of the most cited statistical papers. And there's one of the simpler models to predict survival time, which is the Cox proportional hazards model.
It's the second most cited statistical paper. Get this in, draw this into OpenML, and many people from statistics will be, hopefully, even more interested in OpenML. And hopefully it's not so hard, because the format looks pretty similar to what we already have.
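A minimal sketch of a survival task in mlr, using the lung data from the survival package:

```r
library(survival)
data(lung, package = "survival")
lung = na.omit(lung)
lung$status = lung$status - 1  # recode the event indicator to 0/1

# the target is a (time, event) pair; censored observations have event == 0
surv.task = makeSurvTask(data = lung, target = c("time", "status"))
mod = train(makeLearner("surv.coxph"), surv.task)
```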
And, I don't know, my personal impression is the more we open up this to other communities, the more successful this will be. And we will also learn from drawing in all of these different people. Because they will give us another, hopefully, informative perspective on what we're doing here. We cannot get this right with two or three guys designing how machine learning experiments should be done.
Completely correct. Two slides. This is at least my plan for the workshop. I want to discuss some technical stuff with Joaquin and Jan, and hopefully get this out of the way.
I want to discuss possible integration of survival analysis. We need to refactor and clean up some parts of the package, because the next step should really be to publish this to CRAN and make it available to people. So again, people can comment on this and use this, and we are not far away from that. There's a tutorial, which is in a bad state currently.
Michel is already working on a file caching mechanism, so we don't need to download everything again and again. I really like that you guys provide all of this on the OpenML server: the visualizations and so on, and the comparisons. R is a really nice tool for that as well, and I want to be able to do
all of the nice statistical testing, the ggplot visualization in R with the results, so we need the data. What's not perfectly integrated at the moment, because it's a bit hard to get right: I want to discuss with some people here how we can really support custom modeling, just a few lines of code that somebody wrote.
This must be possible, because this will again open up the system to anything. What might be the general next steps for the R package and OpenML as a whole? We support in mlr this cost-sensitive classification and clustering.
I don't know, do we support clustering already? Might be a good idea as well, because the format is very similar. We should be able to map standard tuning and feature selection experiments, this nested resampling, and store the information from these experiments on the server, because in my opinion that's very, very standard for many papers.
We must be able to map studies, scientific studies, what people do in their papers. Every relevant aspect of this must be mappable to some OpenML word or concept. This is key, because then all the scientists are going to work to use this, hopefully.
I think we already have a problem that scientists are lazy. They develop a tool and then they stick to it, because they want to, I don't know, think about mathematics and formulas and they don't want to fiddle around with computational technology, at least most of the scientists that I know of.
We must make this general, flexible, and easy to use. I'm going to skip this. We should do a large-scale study, in my opinion, and populate the database with interesting results, because actually, I mean, why do we need anybody to run a standard SVM on our data? We can do this, right, and we have clusters for this, so let's just create millions of baseline experiments
that we can then, again, data mine and learn something from. Actually, maybe the smartest thing would be to do this internally and not give it to anybody else, because I am of the opinion that if we do this, this will be, well, a real asset that we have.
I'm not sure; I think there's quite a lot of information in there that we just need to exploit, and we just need to find a way to do this. Maybe it's also a good idea to have some kind of a blog where we present how OpenML works in simple scenarios
to get people interested in this. That's it. I'm sorry that I, I don't know. Am I at ten minutes over time? Thanks, thanks.