
Estimating the Performance of Predictive Models in R


Formal Metadata

Title
Estimating the Performance of Predictive Models in R
Number of Parts
4
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
This talk will start with a very brief introduction to R and the main concepts of this data analysis environment and programming language. We will then shift focus to predictive tasks and models obtained from data to solve these tasks. Finally, the main topic of the talk will be on how to solve the critical issue of estimating the predictive performance of alternative models to solve some task. This estimation process is key to answer the question of which model is the "best" for a problem we are facing. We will describe the facilities provided by the R package performanceEstimation to address this model selection problem and provide some illustrative case studies. We wrap up with the ongoing plans of interfacing this package to OpenML.
Transcript: English (auto-generated)
Okay, thank you.
So this talk, as Joaquin was mentioning, will be essentially about a relatively recent package that I've developed for the issue of estimating the performance of models in R, which of course is related to the goals of OpenML, but unfortunately Joaquin asked
me to do a very short introduction to R because I should assume that people here were not familiar with R and I was stupid enough to accept the challenge, which basically is an impossible challenge. So anyway, I'll try to do a very, well, not really an intro to R, but kind of show off
of some things that you can do in R and at the same time, you know, giving you some illustrations of these things. Okay, so for those that are familiar with R, I'm very sorry about this boring stuff that you hear again, but anyway, that's my best shot at a kind of 20 minutes intro
to R. So basically R is a tool, a programming language and an environment for data analysis, that is currently known as one of the most used tools for data mining, at least according to recent surveys. And it has the nice feature of being free and open source, which is related to trustworthy software and also to some of the goals of OpenML in terms of reproducibility. And it has a strong impact both in academia and in industry.
And so for me, it's my favorite tool, of course. So only good things to say about R, and I hope that you somehow get some of these arguments from this very short intro. So it's obviously available for most computer platforms. The base installation already comes with an impressive set of functionality, but, you know, you can extend this through a system of extra packages, of which there are more than 5,000 currently available. So R is really very broad in terms of the application areas in which you can use it. Okay, so as I mentioned, there are these packages, and one of them will be OpenML. That will be a package that interfaces with OpenML, but these extra packages provide, you know, some extra functions and possibly data that people can use for some more specific purpose, okay. So this involves, of course, installing and then loading a package to be able to use it.
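As a minimal sketch of that install-and-load step (using the performanceEstimation package discussed later in the talk as the example):

    install.packages("performanceEstimation")   # install an extra package (done once)
    library(performanceEstimation)               # load it in the current session to use it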
Okay, so the basic interaction with R is the first eventual cultural shock that most people will face, which is the usual command line, which people, you know, nowadays get kind of upset about because they are used to all these point-and-click buttons and stuff like that, and menus. So that's the first thing that shocks people when they see R: where are my menus? But basically it comes, believe me, to your help, because, you know, it means that you need to know what you are doing, which is a good thing. And so essentially it's a kind of interactive interface where you type commands and then you get answers in return, okay. So you can, of course, also source a set of commands, or a script if you want, and do all these things sequentially in R. So, basic concepts in R: you have things like variables that can store anything from simple
things like numbers to complex models to graphs. So everything in a way is a variable that you can store, and you can store different types of information like, you know, numbers, an SVM, a graph, whatever; you can store these sorts of objects, these different types of objects, in variables. Another important concept in R is the notion of a function, because everything in R basically is a function. So R, of course, has lots of built-in functions like, for instance, creating sets of numbers, applying some function to this set of numbers, and sometimes, you know, there is also this important notion of vectorization, where you apply a function to a set of things and you get a set of results. So all these are basic notions of the R language that you get used to sooner or later, okay. So you can, of course, extend the language by creating your own functions. And that's another thing that you can assign to a variable. So you essentially assign the content of the function to a variable, and after that, you can reuse it later on as you wish. And that's, you know, the sort of thing that one does when creating a package: we create, you know, a new set of functions, for instance a function to upload your task to OpenML or to download a task or whatever.
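A minimal sketch of these basic ideas (the names are purely illustrative):

    x <- 10                       # a variable storing a number
    v <- c(2, 4, 6, 8)            # a variable storing a vector of numbers
    mean(v)                       # applying a built-in function to that set of numbers
    square <- function(y) y^2     # creating your own function and assigning it to a variable
    square(v)                     # vectorization: one call returns a set of results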
So that's basically the sort of thing that you tend to do. Okay, so data frames are probably the central object in R in terms of data mining, which is what we are essentially talking about in OpenML. So essentially they are bi-dimensional structures that store data tables, and in R they are called data frames. Okay, so for the, you know, 15 minutes that I have left, I will try to go through these data frames and then give you illustrations of different steps of a data mining project, of course very briefly. Okay, so the first thing, of course, that you need to do is to put your data in a data frame if you want to use it with R. Okay, so you can do that in many different ways. You may eventually have a text file, a CSV file, and you, of course, have functions in R to read this data into a data frame. And you may eventually have your data in an Excel spreadsheet. And, of course, you also have, you know, functions that read this data from a spreadsheet into a data frame because, as I mentioned, the central thing in R for storing data is the data frame. Okay, and you can also even have your data in, you know, a fancy database
management system. And, of course, R has lots of packages to interface, for instance, with MySQL. And then, you know, you just type your SQL queries and get your data pulled down from the database into your data frames. Okay, so that's the sort of thing that you can do in order to get your data into this object, the data frame, which is used by most R functions.
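A minimal sketch of those three routes into a data frame; the file names, the table name, and the choice of the readxl and RMySQL packages are assumptions for illustration:

    d1 <- read.csv("myData.csv")                    # from a CSV / text file

    library(readxl)                                 # one of several packages for Excel files
    d2 <- read_excel("myData.xlsx")

    library(DBI)
    library(RMySQL)                                 # interfacing with a MySQL database
    con <- dbConnect(MySQL(), dbname = "myDB")
    d3  <- dbGetQuery(con, "SELECT * FROM clients") # an SQL query pulled into a data frame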
Okay, then, of course, the next step, as soon as you get your data into R, is to try to, you know, do this kind of exploratory analysis of your data, which typically involves
querying and summarizing some properties of your data. And that can be done in, you know, many different ways in R. But one very nice way, in my opinion, is this dplyr package, which is a package kind of devoted to this issue of querying and summarizing data that may be stored in, I don't know, data frames, but even, you know, on the cloud or on database management systems. Okay, so this package has several interesting features, like implementing the most basic data manipulation operations and also being able to handle different data sources. Okay, so you can have your data already on a database, and you can actually work with dplyr on, you know, data which is stored on a database or even on the cloud. Okay, so it's very good in those aspects, in terms of isolating the data source from the data manipulation operations.
Okay, now it essentially has a very basic set of functions that, in a way, emulate what you have on any database management system. Okay, so we have functions for filtering the rows, for selecting the columns, for reordering your rows, for adding eventually new columns, for summarizing your data, and
then for creating groups or subgroups of your data. So that's the stuff that you are usually accustomed to having on database management systems, but you get that through this dplyr. Okay, then I have here, you know, a few examples. That's just the start, where you essentially tell dplyr where your data is. In this case it is in a standard data frame, but as I mentioned, it could be on a database management system. And that's the first thing that you say: well, my data source is this thing. Okay, and as soon as you identify your data source, then you can apply this summarization function. So for instance, you can filter your data by, you know, a set of logical conditions, and then in a way you are doing the same things that you do in SQL, but you are doing them directly in R. So you can, for instance, through this kind of chaining operator, apply a sequence of, you know, data management operations, like
filtering this, and then selecting these columns, and then arrange the output in descending order of something. Okay, so you do this sort of querying that you are accustomed to do on databases, but you do that directly in R. Okay, you can also create new columns in your data and then eventually
summarize these new columns. You can create, of course, subgroups and then apply summaries to these subgroups. So all this sort of, you know, data summarization and data manipulation operations get very easily done through this very small set of verbs that
this package supplies to you. And this, moreover, is very fast in terms of computational efficiency. So if you are into exploring data with R, I would strongly recommend that you at least have a look at this dplyr package.
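A minimal sketch of these verbs and the chaining operator, using the built-in mtcars data frame (the columns and summaries are just illustrative):

    library(dplyr)

    mtcars %>%                         # works directly on a data frame (or a database back end)
      filter(cyl == 6, hp > 100) %>%   # filter rows by logical conditions
      select(mpg, hp, wt) %>%          # select some columns
      arrange(desc(mpg))               # reorder rows in descending order of mpg

    mtcars %>%
      mutate(kpl = 0.425 * mpg) %>%    # add a new column
      group_by(cyl) %>%                # create subgroups
      summarise(avgKpl = mean(kpl))    # summarize each subgroup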
Okay, then of course, again, within this exploratory analysis of your data, it's usually also a good idea to try to get some visual representations of your data set. And for that, I would probably say that ggplot is at least one of the hottest packages in terms of data visualization in R. So, not by coincidence, it's developed by the same guy that developed dplyr. But anyway, it's a very good package for creating nice visualizations in R, nice statistical graphs in R. So it has some very nice theory behind it, and you may eventually want to browse through this reference. So just to give you an idea, essentially what ggplot does is that
it has this nice concept of mapping the properties of your data, which essentially are the variables that your data set describes about entities, mapping these things into properties of a statistical graph. Okay, so at the first stage, you essentially make this mapping,
these aesthetic mappings, if you want. That's the jargon that they use here. So essentially here you are saying that, well, my data has this property that is going to be mapped into this property of a graph, which is the X coordinate. And this other property of my data is going to be mapped into this other
property, which is the Y axis. Okay, and then there is another property of my data that is going to be mapped to color, which is another property of the graph. Okay, so you do this sort of mapping and then you kind of overplot some geometrical objects on this thing. In this case, I'm overplotting points using these mappings.
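A minimal sketch of that aesthetic-mapping idea with the iris data (the chosen variables are just illustrative):

    library(ggplot2)

    ggplot(iris, aes(x = Petal.Length,     # one property of the data mapped to the x coordinate
                     y = Petal.Width,      # another mapped to the y axis
                     colour = Species)) +  # and another mapped to colour
      geom_point()                         # overplot points using these mappings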
Okay, and so that's the kind of general concept of ggplot graphs. Of course, you can do much more complex graphs with ggplot. That's just a brief overview of the sort of things it does. You can even build on top of that. You can, for instance, if you have, like, spatio-temporal data, use this other package that builds on top of ggplot, ggmap, and you can build these fancy spatio-temporal representations of your data. Okay, so that's, again, my two minutes on data visualization with R, just to give you some pointers on, in my personal opinion, interesting packages for doing this kind of exploration. Okay, then of course, modeling is also covered: most existing modeling techniques are available in R. Just a very brief overview, but probably the most interesting aspect here is for you to see that the interface to these different modeling techniques is essentially very similar. So most of the time, independently of the technique that you are applying (multiple linear regression, MARS, random forests, gradient boosting machines), the calls look alike.
Typically, you have as the first argument what is known in R as a formula which essentially is a language for specifying the functional dependency of the predictive task. So you essentially tell R what is the target variable and then what are the predictors, okay? And if the predictors are all the remaining variables, you put this dot
representing everything else is to be taken as a predictor. And that's kind of very general in terms of all the, at least most of the modeling functions available in R. And then the second parameter typically is the data that you are going to use for obtaining your model.
And then eventually, some modeling techniques take some other parameters that for you to tune your model, okay? But that's the general concept. So you have specific functions for obtaining different predictive models. Some of them come on different packages that implement these models.
And then you load these packages and then you call the functions to obtain the models and assign these to a variable that will store your model. But that's the general way of obtaining a predictive model in R. So you have to find out what is the function that implements
your favorite tool. Eventually, you have to find the package that implements this function. And then you need to know how to use it. But typically, the first argument will be a formula. The second argument will be a data frame. And then eventually, any parameters that are accepted by this tool, okay?
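A minimal sketch of that common interface, assuming the e1071 and randomForest packages as providers of the two learners:

    library(e1071)          # provides svm()
    library(randomForest)   # provides randomForest()

    m1 <- svm(Species ~ ., data = iris)                          # formula + data, default settings
    m2 <- randomForest(Species ~ ., data = iris, ntree = 1000)   # plus technique-specific parameters

    predict(m1, iris[1:5, ])   # predictions with the generic predict() function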
So that's the kind of things that you do in terms of predictive modeling with R. Of course, you can also do unsupervised learning like, for instance, clustering. Again, a large number of clustering algorithms are available. Like, for instance, k-means-like stuff, fuzzy clustering, hierarchical clustering. These are just a few examples.
Most of these things are implemented already in R. So you can play with them if you want. And again, these are just some pointers to some packages that implement common algorithms in R.
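A minimal sketch of two of those, using base R functions on the iris measurements:

    cl <- kmeans(iris[, 1:4], centers = 3)   # k-means with three clusters
    table(cl$cluster, iris$Species)          # how the clusters match the known species

    hc <- hclust(dist(iris[, 1:4]))          # hierarchical (agglomerative) clustering
    plot(hc)                                 # dendrogram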
Then, of course, another important aspect of any data mining project is the part of reporting, which is a tedious task, but you need to communicate your work to other people. That's life. And this typically involves different tools, which is a problem. Because normally people use their favorite data mining tool, which I hope is R, sooner or later. But then they go into their favorite word processing or presentation software or whatever. And then they go into a kind of mad copy-paste between the two tools, with lots of manual and tedious work, trying to produce some text around your analysis to communicate to people. And that's, of course, very prone to errors. Not only is it tedious, but it's also prone to errors. And then there is the idea of these dynamic documents, which was another thing that I would like to illustrate to you, and which is not, of course, an idea existing only in R. It goes, you know, far back in time. But it's implemented through this package knitr, which allows you to create these dynamic documents that mix your solution, your analysis, with your comments, your reporting.
And so what you do is that on a single document, you write your story and your analysis and you show your results. So instead of having these two separate tools, your analysis tool and your reporting tool, and then manually go through this copy pasting thing, then you can do that on a single document.
So that's the idea of these dynamic documents. So you have this single document where you mix your story with the code that implements your analysis, and then you pass that to R and knitr, and then you get your final report that you show to your manager or supervisor or whatever.
One of the biggest advantages of that is that if the person here doesn't like your report and asks you to change anything, typically you just change some slight thing on here, and then you just go and produce with a button the report again. Just to give you a small illustration of what I'm talking about.
So this thing here, it's one example of a dynamic document. So you can see that I have here text without any particular formatting details, although there are some kinds of tags for putting something in bold,
as you are going to see. And then you see here these gray areas which are known as code chunks. So that's things that usually people will do on their data mining tool and then will paste the result here on the document. But you can do actually everything on a single document, and then you can pass this document
to R, and for instance, suppose you want as an output, let's say, an HTML file. So you just press this, so R goes on and compiles your document and produces the report. And so that's what a dynamic document looks like.
So you see here, of course, you can show only the results, as I've done here, or you can mix the code and the results, so you can produce your reports dynamically like that. But if your game is not HTML, if you prefer PDF, then you just change... Sorry, wrong button.
Just change this into PDF and then you get hopefully a PDF on your... So you get your paper or whatever directly. And if you want Word, then you can also get Word and stuff like that. And you can also produce slides, HTML slides like that.
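A minimal sketch of that one-source, many-outputs idea; the file name report.Rmd is hypothetical, and rmarkdown::render() is one common way of driving knitr:

    library(rmarkdown)

    render("report.Rmd", output_format = "html_document")  # HTML report
    render("report.Rmd", output_format = "pdf_document")   # the same document as a PDF
    render("report.Rmd", output_format = "word_document")  # or as a Word file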
The part of reporting can be tightly integrated with the tool, which is a way of avoiding many mistakes. And by the way, for research projects, if you have some scholarship student, some person that will probably leave sooner or later, then it's a good idea to ask them to build reports like that, because in the end you keep not only the results, but also the way that the results were obtained. So there are several advantages to using these sorts of tools. So, not only do we sometimes need to do reports, but sometimes we need to deploy our data mining results to our final clients or wherever. And there are also nice tools nowadays in R for deploying these data mining results, which would be the end of the chain, if you want. And one of them is this package shiny, which allows you to very easily create web apps directly from R. And so you can very easily create a graphical user interface that runs on a normal browser, so that you don't need to worry about software cross-platform compatibility issues and stuff like that, because you have a standard browser as the only requirement.
And so essentially a shiny web app is formed by two files, one that takes care of the user interface, and the other one that takes care of the R computation behind to produce the results on the web app. And that's actually very simple. So if you want, I have here a very simple example.
So if I run this thing, I get something like that, that's runnable on a browser, and then it's interactive in the sense, this of course is a very naive example, but you have, you know, widgets, graphical widgets that somehow change the outcome that you see.
And so you can use these shiny web apps to create, you know, something dynamic, which has on the background R producing your output, but then you can create, you know, fancy user interface, not actually a very good example, but you can create this sort of web apps.
And of course, then you can, you know, provide these at your own server, or you can even, you know, use some hosting facilities that RStudio has for shiny, okay? And that's very easy. Essentially, as I mentioned, you have two files: the user interface one that, you know, has some elements controlling the layout of the interface,
and then the server that actually does the R computation that provides these graphs, in this case that you have seen this kind of box plots, okay? There are then some data structures providing the communication between the two files, but essentially it's very easy to produce something runnable or showable to someone with, you know,
a very short amount of code, because shiny already provides lots of functions that create these widgets and stuff like that in an HTML file, okay?
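A minimal sketch of such an app, here written in the single-file form with shinyApp() rather than the two-file ui.R / server.R layout shown in the talk; the widget and plot are just illustrative:

    library(shiny)

    ui <- fluidPage(
      selectInput("var", "Variable:", choices = names(iris)[1:4]),  # a graphical widget
      plotOutput("boxes")                                           # placeholder for the output
    )

    server <- function(input, output) {
      output$boxes <- renderPlot(              # the R computation behind the interface
        boxplot(iris[[input$var]] ~ iris$Species, main = input$var)
      )
    }

    shinyApp(ui, server)                       # runs in a normal browser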
So that's my... well, it was fast... a short introduction to R, more an illustration of different data mining steps that you can carry out in R. So my main talk then is this issue of estimating and comparing the performance of predictive models, which is of course related to the goals of OpenML,
and I just wanted to talk a bit about a package that I've developed one and a half year ago, something like that, for doing this, okay? So just for us to be aware of the common setup, so essentially we have a predictive task,
which in a way is an unknown function that maps the values of a set of predictors into a target variable, and typically this is either a nominal or a numeric variable, that is, a classification or regression problem, and then we are given training data, and then typically we decide upon some performance evaluation criteria, and then the goal of these sorts of experiments is to obtain a reliable estimate of the values of these performance criteria, given this data, using some particular tool, okay? So that's the general goal of performance estimation in the context of predictive tasks, and then one wrong way of doing that, which sometimes people use, is to use these kinds of resubstitution estimates where, you know, given this data set we obtain our favorite model and then we apply it to the same data and collect whatever performance metrics we get using this model. That of course is kind of unreliable, because these estimates tend to be over-optimistic, because, you know, if your model is good at approximating the data that you gave it,
then of course the results will be very good, okay? Of course that depends a lot on the type of technique that you are using, and so the seriousness of this unreliability depends on the technique, but generally it is a bad idea to proceed this way, okay? So the main goal then of this performance estimation
is this issue of reliability, of estimating the expected prediction error on an unknown data distribution using only the data that we have available. And for that to be possible, typically we want to test our models on a separate set of test cases, which we usually call a test set, instead of applying them directly to the data that was used to obtain the model, okay? There are, of course, many ways of doing this, and moreover we typically tend to repeat this training and testing process several times
to increase the statistical significance of our estimates, and typically then we average these scores over these repetitions together with, you know, some standard error estimate, and that's the game that we play here on performance estimation, okay? Now there are many ways to,
many methods, if you want, or procedures, to obtain these sorts of reliable estimates. One of the simplest is this idea of holdout, so we typically randomly split the available data into two subsets, one for training the model and the other one for testing it, okay? And that of course gives us a single score, if you want, and then frequently what we do is that we repeat this random splitting several times, and then we collect, you know, a set of metrics that we average, and then calculate our standard deviation. A slightly different approach is this idea of k-fold cross-validation, where you start with your data, you randomly permute the data, and then you split this data into k folds, or equal-size folds, and then you iterate k times, leaving one of them aside as the test set; you train your model on the remaining k minus one, and then test it on the kth, on the left-out partition, okay? And then for each of these k repetitions, you get a score, and then in the end you average these and get your kind of cross-validation estimate of the error of your model, okay? So those are just two examples of estimation methodologies.
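A minimal hand-rolled sketch of the holdout idea; the 70/30 split, the rpart regression tree, and the Boston housing data are illustrative choices, not the talk's example:

    library(rpart)
    data(Boston, package = "MASS")

    set.seed(1234)
    idx <- sample(nrow(Boston), round(0.7 * nrow(Boston)))  # random split of the available data
    tr  <- Boston[idx, ]                                    # training set
    ts  <- Boston[-idx, ]                                   # separate test set

    m <- rpart(medv ~ ., data = tr)                         # obtain the model on the training data
    p <- predict(m, ts)                                     # apply it to the unseen test cases
    mean((ts$medv - p)^2)                                   # one holdout estimate of the MSE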
There are, of course, many more around, and the goal of this performanceEstimation R package is exactly to try to facilitate, you know, carrying out these sorts of experiments on different tasks, different methods, using different methodologies for obtaining the reliable estimates, okay?
So the infrastructure provided by this package can be applied to any model, any task, and any evaluation metric, okay? So that's the kind of main design goal is to be completely general and flexible, to adapt to the user's approach to any predictive task, okay?
And so the main function of the package has the same name as the package: it's called performanceEstimation, and it has essentially three arguments, okay? The first one is the set of predictive tasks that you want to use in your experiment. Sometimes it's only one, but, you know, when you are writing papers,
you tend to evaluate algorithms on many tasks. So you can provide on the first argument the set of predictive tasks, and I'm going to detail in a while what is a predictive task, okay? Then the second argument is the set of workflows, or if you prefer, solutions to these tasks that you want to compare or evaluate, okay?
So that's the second argument. And the third argument is the experimental methodology that you want to use, holdout, cross-validation, bootstrap, whatever, okay? So that's the three main pieces of information that the function takes. What are the tasks, what are the solutions that you want to compare on these tasks, and which experimental methodology you want to use
to obtain reliable estimates of their performance on these tasks, okay? Now, as I mentioned, the function implements a wide range of experimental methodologies, including holdout and cross-validation, but also others, as I will mention. Now, just a very simple example,
so suppose you want to estimate the mean squared error of an SVM on a certain regression task using tenfold cross-validation, so you could, you know, these are just, you know, initializing stuff to get the SVM and the data, but essentially that's the function call that you have. So you call the function with a particular task, and essentially a task is defining what is the target,
what are the predictors, what is the data set, okay? And then you have a workflow, which is your solution, and I'll go into detail what I mean by that, but essentially it's a workflow, so that's a solution to this task, and then you say that you want to use cross-validation with tenfold, okay? And then that's it.
So you type this thing, you hit enter, and you get your results. This object, which is the result of this performance estimation experiment, can then, you know, you can obtain textual summaries of that, like the average mean squared error of the model, the standard deviation, minimum value, maximum value, if there was any problem during these runs,
and you can also plot it. You can obtain a kind of box plot of the distribution of the scores across the ten repetitions in this case, so you can in a way explore this resulting object in the standard ways, either visually or through some statistical summaries, okay?
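A sketch of the call just described; the constructor names (PredTask, Workflow, EstimationTask, CV) follow the current version of the package and may differ slightly from the ones on the slides:

    library(performanceEstimation)
    library(e1071)                          # provides the svm() learner
    data(Boston, package = "MASS")          # a regression task: predicting medv

    res <- performanceEstimation(
      PredTask(medv ~ ., Boston),                           # the predictive task
      Workflow("standardWF", learner = "svm"),              # the workflow (standard, using svm)
      EstimationTask(metrics = "mse",                       # estimate the mean squared error
                     method = CV(nReps = 1, nFolds = 10)))  # with one run of 10-fold CV

    summary(res)   # textual summary: average mse, standard deviation, min, max, failures
    plot(res)      # box plot of the scores across the ten folds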
Now, let's go into more detail onto these three components. So the predictive tasks, essentially they are objects of one class, which is defined on the package, that, as I mentioned, should define the task as a formula, which essentially, as I mentioned,
consists of saying what is the target variable and what are the predictors, okay? And then it should also supply the source data, which typically will be a data frame in R, and then it can optionally have a kind of ID for this task, okay? So you have here two examples: you know, for the Iris dataset, try to forecast this variable using all the others, or for the Boston data, try to forecast this one
using all the others, okay? So that's the notion of prediction tasks within this package, okay? Then in terms of workflows, essentially a workflow is a solution to a predictive task, and workflows should be implemented by an R function, okay?
So you should have an R function that solves a task, okay? Now, so essentially these workflows are objects of this class that have the name of the function that implements the solution, and then any extra parameters that need to be passed to this function. It can be zero or, you know, extra parameters that need to be passed. And optionally you can have a kind of,
again, a kind of internal ID, okay? So, you know, just a few examples. So this is assuming that there is a function with this name that accepts these parameters, okay? And so I want to apply to whatever task with these parameters. Or in this case, you know, I will talk about this function soon, okay?
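A sketch of those two kinds of objects; myWF stands for the hypothetical user-defined workflow function discussed further below:

    # Predictive tasks: a formula, the source data, and optionally an ID
    irisTask   <- PredTask(Species ~ ., iris, "irisSpecies")
    bostonTask <- PredTask(medv ~ ., Boston, "bostonHousing")

    # Workflows: the name of the R function implementing the solution, plus its parameters
    w1 <- Workflow("myWF", wL = 0.6)   # a user-defined workflow function with one extra parameter
    w2 <- Workflow(learner = "svm")    # the standard workflow, discussed next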
So that's workflows. And this issue of this standard workflow, you know, results from the observation that most of the times, most users, really what they want is to apply a standard out-of-the-box algorithm like an SVM or random forest, whatever, to a task, okay?
So they essentially want to apply that algorithm to the training data, then use the resulting model to obtain predictions for the test set, and then with these predictions calculate some standard metrics on the test set with these predictions, okay? Like accuracy, error rate, whatever.
So that's, I mean, my view that this is the most common setup for users, okay? And because of that, I wanted to make this thing as easy as possible within this package, okay? And that's why I created this standard workflow function. So that's a function that allows the user to say, well, I want really to carry on this approach as a solution,
and I want to use this learner and do the rest for me. I don't want to be bothered about anything else, okay? Of course, you can also have the flexibility of writing your own workflow function because maybe you want to apply before, you know, fancy data pre-processing step, and then to the predictions you want to apply again
a very fancy, you know, a post-processing or whatever. You can do that with the package, but if you want to apply this kind of standard workflow, then you can use this function that is already provided by the package, this standard workflow, okay? So that's the idea of this function. So essentially this function accepts several parameters that allow you to control these three steps, okay?
It accepts one parameter where you specify the R function that implements the algorithm, let's say SVM, random forest, whatever, okay? Then it accepts another parameter where you provide the arguments that should be passed to this function, okay? Then there is another one where you tell which function should be used to obtain the predictions, which in R typically is a function named predict, okay? And that's the default. You usually don't need to specify it, but maybe you want to, you know, have your own very picky, you know, very special way of obtaining predictions, okay? And then, parameters to this function. Then there is another one that controls which function is to be used to calculate metrics, performance metrics, with the predictions, okay? And then by default, I have again two functions that calculate standard classification metrics and standard regression metrics. So if you don't say anything, the function will look at the type of the target variable. If it is a nominal variable, it will call the classification metrics with some standard thing like error rate, okay? And if it is a numeric variable, it will call the regression metrics and calculate the mean squared error. But you can also control that through this thing, if you want different metrics, okay? So that's the idea, to try to allow the user to control these three steps and have reasonable defaults for each of them, okay? To allow you to do the least possible work in most situations, okay?
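A sketch of a standard workflow spelling out those steps; the learner and its parameter values are just examples, and the argument names follow the current package version:

    w <- Workflow("standardWF",
                  learner = "svm",                              # the R function implementing the algorithm
                  learner.pars = list(cost = 10, gamma = 0.1),  # the arguments passed to that function
                  predictor = "predict")                        # how predictions are obtained (the default)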
Now, of course, as I mentioned, you may not be happy with that. You may have your very fancy, very sophisticated algorithm, or one that includes sophisticated pre-processing steps, whatever. So you can do that also with the package. So you can create your own user-defined workflows. So there are a few constraints for these functions. Essentially, it must be an R function that accepts a formula in the first argument,
a training set in the second, and a test set in the third. These two are data frames. Anything else, it's up to you, okay? So you can write your own function, your own workflow function. The first argument must be a formula, the second a training set,
the third a test set, and then whatever is required for your workflow, okay? And then you implement that, you know, and I'll give you just an example. And then the function must return an object of this class, which should contain at least a vector of the scores of the metrics that are being estimated. It can contain more things.
It can even output the model or output the predictions, but the minimum thing is to return the scores, okay? Now, let's suppose that I want to create a workflow that essentially combines a linear regression model with a regression tree. Let's suppose that's a good idea, okay?
And so I want to create a workflow that, given a training set, obtains a linear regression model and a regression tree, and combines the predictions, and that's the output, okay? So I could create an R function that I just call my workflow. As I mentioned, it is assumed that the first three arguments are a formula,
training set, and test set. And then this workflow also considers another argument, which let's say it's the weight that I'm going to use on the combination between the predictions of the linear regression and the regression tree, okay? So what this does is essentially obtains a linear model with the training data, a regression tree with the training data,
then obtains the predictions of both of them, the linear model and the regression tree, and then combines their predictions using this weight. By default, it gives me the same weight to the two models, okay? And then I just output the mean squared error of these predictions, okay? That's it. That's a workflow that I,
of course, it's not a very fancy workflow, but this could be as complex as you want. If you want to have fancy pre-processing steps, you can include that like this, okay? And as soon as you have this function, then you can call it in the same way. So performance estimation, this task workflow, this function, okay?
And I want to try it with these parameter values. And that's the thing. So you get the results as with any other workflow, and that's where the flexibility comes from: exactly from allowing the user not only to use these kinds of standard workflows, but also to write their own workflows if they want, okay?
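A rough sketch of such a workflow. One caveat: in the current version of the package the workflow function usually returns the true values and the predictions, and the metrics declared in the EstimationTask are computed from them, whereas the version shown in the talk had the workflow return the scores directly:

    library(rpart)

    myWF <- function(form, train, test, wL = 0.5) {   # formula, training set, test set, weight
      ml <- lm(form, train)                           # a linear regression model
      mr <- rpart(form, train)                        # a regression tree
      pl <- predict(ml, test)
      pr <- predict(mr, test)
      ps <- wL * pl + (1 - wL) * pr                   # combine the two sets of predictions
      list(trues = responseValues(form, test), preds = ps)
    }

    res <- performanceEstimation(
      PredTask(medv ~ ., Boston),
      Workflow("myWF", wL = 0.6),                     # the user-defined workflow, called like any other
      EstimationTask(metrics = "mse", method = CV()))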
Sometimes you not only want to have these workflows, but you want to try several variants of the same workflow, let's say varying the parameters of an SVM or a random forest, whatever. And for that, I have this workflowVariants function; again, what this function does is it generates a set of workflows. And each of these workflows will be variants, with different parameter values, of an existing workflow, okay? So you can, for instance, say, let's say I want to try several parameters of an SVM on a task. I could say, well, I want workflow variants of this base workflow, where the learner is this, and the learner parameters are cost, which is one of the parameters of the SVM, and I want to try these two values, not only one, but these two values; and gamma, which again is a parameter of this, and I want to try this thing. So what this does internally is generate all possible combinations, so it does a kind of cross product of these parameter values.
So in this case, it will generate four SVM variants, okay? And it will do that automatically for you, and then you get the same sort of results. And of course, you can do the same with your own workflow functions. Let's suppose that I want to try the weight of the linear regression to go through this set of values, okay?
So that's the way that you can use this function to try different parameter combinations of your favorite workflows or solutions, okay? And then, of course, then when you type to see the result, you'll get like variant one, variant two, and so on and so forth, okay?
And you get the different scores for these things. And by the way, here I was playing with the evaluator parameters to say, well, I don't want only the mean squared error; I want the mean squared error, the mean absolute error, and the mean absolute percentage error. And so these are kind of standard statistics already implemented in the package, okay? So you get these sorts of results.
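A sketch of generating and exploring such variants; the parameter values are illustrative, and the metric and function names follow the current package version:

    res <- performanceEstimation(
      PredTask(medv ~ ., Boston),
      c(workflowVariants(learner = "svm",                          # 2 x 2 = 4 SVM variants
                         learner.pars = list(cost = c(1, 10), gamma = c(0.01, 0.1))),
        workflowVariants("myWF", wL = seq(0, 1, by = 0.25))),      # variants of the user workflow
      EstimationTask(metrics = c("mse", "mae", "mape"),            # several metrics at once
                     method = CV(nReps = 1, nFolds = 10)))

    summary(res)                  # scores of svm.v1, svm.v2, ..., myWF.v1, ...
    getWorkflow("svm.v1", res)    # the parameter settings behind one particular variant
    topPerformers(res)            # ranking the workflows for each metric
    plot(res)                     # visual comparison of the score distributions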
Of course, you can also then ask, well, SVM v1 looks interesting, the result, what are the characteristics of this? And you get the information on the parameter settings of this variant of the workflow. And you can also do things like ranks of the models.
You can, of course, plot everything and see for the different metrics and for the different variants, you get a kind of visual comparison of the scores, again, using cross-validation in this case, okay? You can also rank the things for the different metrics
and, of course, all sorts of other things that I don't have time to go through here, okay? Then, as regards estimation methods, we have already seen a few examples of using cross-validation through this function, but it also implements holdout and random subsampling, basically repetitions of holdout, leave-one-out cross-validation, two variants of bootstrap, and it also implements these Monte Carlo experiments for time series or time-ordered data, okay? So that's also implemented in the package. Now, this is just an example with a classification data set, in this case using holdout, so the syntax is the same, but instead of having the CV settings, I just put holdout settings, okay? And that changes the experimental methodology. And, of course, maybe the parameters are slightly different. In this case, I need to specify the size of the holdout, and this is the number of times I want to do this random repetition, okay?
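A sketch of the same kind of call with a classification task and the holdout method (constructor and argument names from the current package version):

    data(iris)

    res <- performanceEstimation(
      PredTask(Species ~ ., iris),
      Workflow(learner = "svm"),
      EstimationTask(metrics = "err",                            # error rate
                     method = Holdout(nReps = 5, hldSz = 0.3)))  # five random 70/30 splits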
And this is a slightly larger experiment. This time, you know, instead of having a single task, I have a vector of tasks, in this case two tasks, and then I have a vector of several workflow variants, okay? So in this case, I'm having four variants of an SVM
being compared with three variants of a regression tree on these two, well, in this case, a classification tree on these two classification tasks, and using three repetitions of tenfold cross-validation, okay? And that's the code that is necessary to run this experiment. You get, of course, then different graphs
for each of the tasks and for each of the metrics. In this case, it's just the error rate that I selected, okay? And then you get the scores for each of the variants of each of the learning systems, okay? Again, you know, ranking, getting the top performers and stuff like that, information on the parameters of these things.
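A sketch of such a larger experiment. The second task (Glass), the tree learner (rpartXse from the DMwR package), and the settings are illustrative assumptions, not necessarily what was on the slides:

    library(e1071)
    library(DMwR)                           # provides rpartXse(), an rpart wrapper
    data(Glass, package = "mlbench")        # a second classification task

    res <- performanceEstimation(
      c(PredTask(Species ~ ., iris), PredTask(Type ~ ., Glass, "glass")),   # two tasks
      c(workflowVariants(learner = "svm",                                   # four SVM variants
                         learner.pars = list(cost = c(1, 10), gamma = c(0.01, 0.1))),
        workflowVariants(learner = "rpartXse",                              # three tree variants
                         learner.pars = list(se = c(0, 0.5, 1)),
                         predictor.pars = list(type = "class"))),
      EstimationTask(metrics = "err", method = CV(nReps = 3, nFolds = 10))) # 3 x 10-fold CV

    rankWorkflows(res)    # the top workflows per task and metric
    topPerformers(res)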
So you have a set of utility functions then to play around with this object that results from the experiments. Now, finally, a part that I wanted to talk about, just to wrap up this thing, would be the issue of trying to check if the observed differences
between these different workflows are statistically significant or not, okay? So that's, of course, related to statistical hypothesis testing, and essentially the null hypothesis is that there is no difference among a set of alternative workflows. And we typically use these kinds of thresholds to have some degree of confidence in the observed differences, okay? Now, most of these experimental methodologies do not ensure independence among the different observed scores. And so because of that, and, you know, all sorts of theoretical background behind these experimental comparisons, we use the nonparametric Wilcoxon signed rank test to carry out this thing, although the paired t-tests are also implemented, but, you know, I would not recommend that you use those for this type of experiment, okay? So you can carry out any experiment and then pick this object
and obtain this paired comparison. So there is this other function, pairedComparisons, that picks the object resulting from this experiment and calculates these statistical significance results. So essentially, by default, it picks the first one, gets the average score, the standard deviation, and then the other ones are compared against it, okay? So that's the difference, and then that's the p-value telling whether this difference is statistically significant or not, okay? So you get these sorts of tables for each of the data sets, in this case it's just one, and for each of the evaluation metrics, okay? And there is a parameter, baseline, where you can select the baseline workflow against which all the other ones are compared. And then there is a kind of auxiliary function that allows you to prune these results by some limit on this p-value, to get a smaller table containing only the statistically significant differences according to some threshold, okay? But that's basically it. And then, of course, that kind of makes it easy to obtain these nice tables that we all like to put in our papers, because it's very easy to, for instance, pick this thing and just put it as a LaTeX table in our paper, okay?
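A sketch of that last step; the workflow ID used as baseline is illustrative, and the function names follow the current package version:

    pc <- pairedComparisons(res, baseline = "svm.v1")   # Wilcoxon signed rank tests against a baseline
    signifDiffs(pc, p.limit = 0.05)                     # keep only the statistically significant ones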
So that's more or less it. All right, so I was very fast. I was, yeah, that's okay, no? Yeah, that's it, sorry. So I don't know if you have any questions.