
Good features beat algorithms


Formal Metadata

Title
Good features beat algorithms
Title of Series
Number of Parts
132
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
In Machine Learning and Data Science in general, understanding the data is paramount. This understanding can come from many different sources and techniques: domain expertise, exploratory analysis, SMEs, some specific Machine Learning techniques, and feature engineering. As a matter of fact, most Machine Learning and Statistical analysis strongly depends on how the data is prepared, thus making feature engineering very important for any serious Machine Learning enterprise. "Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data." In this talk we will discuss what feature engineering and feature selection are; how to select important features in a real-world dataset and how to develop a simple, but powerful ensemble to measure feature importance and perform feature selection. Familiarity with intermediate concepts of the Python programming language is required to follow the implementation steps. General knowledge of the basic concepts of Machine Learning and data cleaning will be useful, but not strictly necessary, to follow the discussion on feature selection and feature engineering.
Transcript: English (auto-generated)
Good morning, everybody. My talk is going to be about feature selection; the title is up there on the screen. I'm going to do a few things with you today. I'll introduce myself, because I'm not rude, at least not too much, and I'm going to tell you a story that I hope you will like; I think some of you might relate to it. Then I'm going to show you some code and some results, some numbers. I know you're all wondering: could I be any more vague? I could have been, but this will be enough for today. So, first things first, a little bit about myself.
There are a few things going on in my life. I am a physicist who translated into a data scientist for some reason. I work with many different types of technologies that I'm very passionate about and that wouldn't have fit on the slide, so I just added the first few. And I'm a ham radio operator; is anybody here one? One, two, three, four. Great, I'm not alone, because I usually am. All of this can be summed up in the fact that I'm a generic nerd, so some of you might relate to that as well. First, I want to tell you a couple of things about the company I work for.
The company is called Optum; it's right there in the bottom right-hand corner. Has anybody here heard of Optum? No one? One hand. Oh, of course, it's you. Sorry guys, he's a colleague of mine; he sits about two meters from me. Anyway, let's try this again, because no one has ever heard of Optum. Have you ever heard of UnitedHealth Group? No one? Okay, good start; no one ever has, not in Europe at least. Optum is part of UnitedHealth Group, a very big American corporation that works around health care and health services. It's actually quite a big company: we're talking about almost 300,000 people working for it. This slide is actually a bit old, because it's from last year; it says we ranked sixth on the Fortune 500 list. That's not true anymore, because we're fifth now. Yay, go us. It's actually a pretty interesting company, and if anybody is interested in working on hardcore data science and cutting-edge work in healthcare, we are hiring worldwide, so have a look; there are a couple of careers websites at the bottom of the slide.
Anyway, back to us. The main purpose, as I mentioned earlier, is for me to tell you a story, and like many stories, many fairy tales, this story starts with a brave white knight in shining armor. By the way, that white knight is basically all of us; it's just a metaphor. And like any good story, this story also needs a villain, because this knight in shining armor has to defeat and slay a terrible dragon. You might be wondering: okay, we're all data scientists, we don't usually fight dragons. Again, it's a metaphorical dragon, and the way it usually comes into existence is this, and I think some of you might relate to this situation: it's the evil overlords atop their ivory tower going, "We have all of this data, let's just do something with it," even if they have no idea what to do. So what happens at this point is that you get handed your own dragon, and it usually comes in the form of a very big data set you know nothing about and are supposed to do something with. And yes, it is that vague sometimes; this happened to me not too long ago.
It might look something like this. You start a new project (and this video is not running), you start a new project, and this actually happened: you get handed a big data set, you know nothing about it, you just know that somebody processed it in some way and you have to do something with it. So you start by having a look at it, poking around, seeing what's in there, and a few things like this might happen. You realize that the data set is huge, you have no idea what's in there, and it has over 800 features. What are they? So you decide to have a look. And I kid you not, this happened: there were hundreds of features like that, and I had no idea what they were. There was no data dictionary; of course, there never is. And it goes on: there were about 200 features with reasonable names, where you could just about make out what they were, and then there were about 600 of these. What was the saying? Garbage in, garbage out, something like that. I mean, it's just too much.
I had been feeling like that for weeks while working on that project. Can anybody relate to that? Show of hands. Yeah, of course, because the sad reality, or actually the happy reality if you think about it, is that of all the information, we usually need only a tiny portion to reach our goals and accomplish our task. I'm not saying that everything else is definitely useless for what we're trying to do, but it's probably just adding something on top of what we need; it's a plus, but we might not strictly need it. And of course, when you're working with such a big amount of data, such a big amount of information, there are many problems that can arise.
Like the fact that you have no idea what's in there. Or the fact that even if you do have some idea of what you're trying to do and how, training time might increase, sometimes exponentially, with the number of features, and you don't know which features are relevant and which are garbage. Hardware requirements might increase, because sometimes you just need to load a good chunk of the data into memory, and then you have memory constraints or CPU constraints; or if you need to go for deep learning, you might have big constraints regarding GPU availability, which are not always satisfiable. For example, in our company we work with very sensitive data, so we are not allowed to use AWS: if it's not in our server farm, you can't have it. That's a problem. And there are also problems like decreased performance and overfitting, because again, you don't know which features are noise and which are relevant for your problem. And there's the risk of leaks, because
some information regarding the target might actually leak into some of your features. You might be thinking, "Yeah, sure, but it won't happen to me, I'm really careful when I do my feature engineering." Sadly, yes, it can happen to you, and it happened to me not too long ago. I was working on a classification project, and I spent a couple of days sifting through the data. I could understand most of it; I had a fair idea of what most of those features were. I did some clever feature engineering tricks, a week or so of that, and then I trained my first model just to have a baseline, and I got an incredibly high accuracy: 99.9%. So, as has probably happened to you: you train the first model, sky-high accuracy, and of course you go "woohoo, great, problem solved" — for the first two seconds or so, because then you go: yeah, this doesn't sound right, this doesn't look right, there's something wrong here. It turns out there was something wrong: there was some small leakage from the target into one variable, and once I realized that and fixed it, accuracy went back to a normal value, something I would have expected. And this is a problem, because if the accuracy caused by that leak had been in the order of 85 to 89 percent, I wouldn't have been that suspicious, but the results would still have been wrong. So you should always be careful about these things. But anyway, getting back to what we're here to talk about: feature selection. We can't really work with a very big data set here.
We can't really expect to tackle a big problem right here, because this is just a talk; we don't have that much time, and I'm pretty sure you don't have that much desire to do that either. So let's restrict the scope of the problem: let's work with a simple data set that has a target, so we're working with a classification problem, and let's see what we can do about it. Full disclosure: I'm using a small, public data set that most of you have probably worked on. I'm not going to tell you which one it is until the end of the talk, but some of you might guess. Where is my cursor? I need my cursor. Ah. This is what the data set looks like: we have 500-odd rows and 30 features, and these are the feature names, x00 through x29. So we have no idea what they are, and we have to deal with it.
This is a problem, because which features do we choose, and how do we choose them? This is of critical importance, because different problems demand different types of solutions, and we have to be really mindful of how we choose metrics, because sometimes we use techniques or metrics that are just not relevant, or plain wrong, for our type of problem. Here is an example. Are you familiar with Anscombe's quartet? Are you familiar with using correlation for choosing good features? Ah, you are. A lot of times we just use correlation and check whether a feature is correlated with the target; if it is, okay, it's important, let's keep it, and if it's not, well, maybe let's drop it. But this can be a bit deceiving, and Anscombe's quartet shows why: these are four famous example data sets with the same summary statistics. The four pairs of x's and y's have the same mean, variance, standard deviation, correlation, and regression coefficient. And that's a problem, because in some of those cases the linear relationship clearly exists, and in the others, not so much. So you can't just use one technique for everything: there's no technique that fits all problems.
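You can check this yourself. Here is a minimal sketch using seaborn's bundled copy of the quartet (the talk itself only shows the plots; the code is mine):

```python
# Verify that Anscombe's quartet has near-identical summary statistics
# despite wildly different shapes. Uses seaborn's bundled dataset.
import seaborn as sns

anscombe = sns.load_dataset("anscombe")  # columns: dataset, x, y
for name, g in anscombe.groupby("dataset"):
    print(name,
          round(g.x.mean(), 2), round(g.y.mean(), 2),
          round(g.y.var(), 2), round(g.x.corr(g.y), 3))
# All four groups print essentially the same numbers (~9.0, 7.5, 4.13, 0.816),
# yet only one of them is a clean linear relationship.
```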
You might be familiar with the no-free-lunch theorem: sometimes lunches are actually very expensive, not just not free. And whenever you're working with these kinds of tools for feature selection, always maintain a healthy dose of skepticism; that's never wrong. But then, how can we perform feature selection? Well, there are three main classes of algorithms that can do that for us, namely filter methods, wrapper methods, and embedded methods, and I'm going to tell you a little more about each in the next few slides. Filter methods are the simplest form. They usually perform univariate analysis between the features and the target, or on the features themselves. A clear example would be a variance threshold: you discard all features that have less than some amount of variance. This happens all the time to me; I always get handed data sets that contain at least 10% of features that just don't change, or that are all missing. They're clearly not really useful, at least not as they are. Some examples of filter techniques would be the F-test, ANOVA, Granger causality if you're working with time series, LDA — techniques like that.
Then there are wrapper methods. These are a bit more sophisticated. The way they work is that they train on subsets of features and look for which features are useful: for different subsets of features you train a model, usually a classifier or a regressor, and you determine which models perform best. So this is a bit more sophisticated than the first type of algorithm, but it's somewhat prone to overfitting, because you don't really have fine control over what the models do. And they have a big problem: you have to train a lot of models. So if you have a small data set, these are fine, but if you have a big data set, it's probably not the best idea.
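A minimal sketch of a wrapper method, using scikit-learn's recursive feature elimination on synthetic data (again, not the talk's code):

```python
# RFE repeatedly refits the estimator and prunes the weakest features,
# so the number of model fits grows with the number of features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=5, random_state=0)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print([i for i, kept in enumerate(rfe.support_) if kept])  # surviving feature indices
```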
Then, as I mentioned, there are embedded methods. These try to take the best of both worlds, filter and wrapper methods.
The way they work is that they perform a classification or regression, but they also have their own internal way of performing feature selection. A few examples would be decision trees or random forests, which basically use their own internal representation of the features to determine the best splits and which features are useful. These are quite good: they are less prone to overfitting, they usually yield good results, and they usually perform their search using cross-validation.
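A minimal sketch of an embedded method: a random forest ranks features as a by-product of choosing its splits, and SelectFromModel keeps the ones above the mean importance (synthetic data; the API shown is scikit-learn's, not the speaker's library):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=5, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(SelectFromModel(forest, prefit=True).transform(X).shape)  # e.g. (500, 5)
```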
All right, let's close this small parenthesis and try to remember why we're here. Remember the knight in shining armor? I mean, come on, we have a dragon to defeat.
We have to slay that dragon, and the way we're going to fight it is with code. I'm going to show you a toy example, let's say, on the data set I showed you earlier. We're going to perform a small feature selection exercise using an embedded — sorry, an ensemble — system, because we want to make things a little more spicy and complicated; otherwise anybody would be able to do it. Word of advice: this code is formatted for slides and presentation, so don't just take it, use it, and expect correct results. I'm omitting a few things; I actually have part of this code embedded in a personal library, but there's a lot I'm leaving out here, like validation, error handling, and so on. Take inspiration by all means, but I would advise against using it as it is.
My requirements for this type of library were that it shouldn't be a free-for-all algorithm — it had to have some guidance as to what it was supposed to do — and it had to be extensible, so you should be able to just add new selectors to the ensemble algorithm and expect it to work. There are also a few ways we can combine the results at the end, because we're using different feature selection techniques and then combining their results, so we can add a weighing mechanism: for example, you might want results from a specific algorithm to be evaluated as more important than others.
That's not shown here, but that was the idea. So let's have a look at the code. Let's start simple: we just need to import the things we need. These will be the building blocks of the class and of the selector, and we're going to use just five techniques for this one, because it's enough for a demo. We'll use random forests, decision trees, and variance thresholds, in flavors for classification and regression, because we might want to work in both worlds and they do work in slightly different ways, so it's a good idea to keep them separate.
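The slide code isn't in the transcript, so here is a hypothetical sketch of the same building blocks — a handful of scikit-learn estimators keyed by problem type; the registry name is mine, not the speaker's:

```python
# Not the speaker's library: a minimal stand-in for the imports and the
# per-problem-type registry of selector techniques.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.feature_selection import VarianceThreshold

MODELS = {
    "classification": [RandomForestClassifier, DecisionTreeClassifier],
    "regression":     [RandomForestRegressor, DecisionTreeRegressor],
    "generic":        [VarianceThreshold],   # target-free filter method
}
```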
Then we set up our class based on the type of problem we want to tackle: we might have a regression problem, a classification problem, or a generic problem, like when we don't have any target but still have to sort out all the garbage. In that case we're pretty much doomed to use filter methods and see which features are just not relevant; here I'm just using a variance threshold, which has to be used carefully, because it's only meaningful if all the features are on the same scale, especially if you're using a hard-set threshold. But this library takes care of that in the background.
So we're all good there. Then we have to initialize our class. We need to know how many features we want to get from each of those models, and we want to set a few thresholds: the actual library I'm using is able to compare train results and test results, compare the scores, and see if they differ by a certain amount, and if that happens, it raises a warning. If you get 90% accuracy on your training set and 20% accuracy on your test set, that's not a good sign, and you should review what you're doing. But again, we don't want to block results; we just want to
be notified of it. Then we need to select the models based on the analysis type: in this case, the analysis-type keyword would identify regression, classification, other, or a combination of them, and based on that we select the models from those available in our class, instantiate them, and put them somewhere safe. In this case, "somewhere safe" is just a list of live estimators — the ones that were chosen. By the way, if you have any questions about the code or about anything else, just raise your hand and ask. No? Okay. We have our set of models in our internal object, and now we need to fit them on the data.
This is actually quite easy: we just go over all of them with the data and fit them. At this point we have a set of models, or techniques, that are fitted and know about our data, and we can extract results from them.
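A hypothetical skeleton of those two steps, building on the MODELS registry sketched above — the class and attribute names are mine, not the speaker's library:

```python
class EnsembleFeatureSelector:
    def __init__(self, analysis_type="classification", n_features=5,
                 feature_names=None):
        self.n_features = n_features
        self.feature_names = feature_names
        # Instantiate every technique registered for this problem type.
        self.estimators = [cls() for cls in MODELS[analysis_type]]

    def fit(self, X, y=None):
        for est in self.estimators:
            est.fit(X, y)   # filter methods like VarianceThreshold ignore y
        return self
```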
We can see what they consider important in the data set. The way I did it — there are many other ways — is to select the relevant attribute from the model itself: some models have an attribute called feature importances, some have variances, some have scores, and so on. For the models I was dealing with, these were the important ones, and I didn't need to add anything else. I get the score or importance for each feature from each model and combine them, and, as you can see in the last row at the bottom, I only take the first n features from each model; this is specified in the init.
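A sketch of that extraction step (the helper name is mine): different scikit-learn estimators expose their per-feature score under different attribute names, so we probe for each in turn.

```python
import numpy as np

def top_features(estimator, names, n):
    # Probe the importance attributes used by the models in this demo.
    for attr in ("feature_importances_", "variances_", "scores_"):
        scores = getattr(estimator, attr, None)
        if scores is not None:
            order = np.argsort(scores)[::-1][:n]      # n highest-scoring features
            return [(names[i], scores[i]) for i in order]
    raise AttributeError("no known importance attribute on this estimator")
```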
So if I only want, for example, five features from each model, you will still end up with more than five once you combine them, because there's no guarantee that all the results will be exactly the same set; actually, that's quite a rare occurrence. Then we have to combine the results, and the way we do that is by having the models cast votes. I have a list of models with their results — the features they consider important, with their scores — and I count how many times each feature is identified by one of those models. Features that get more votes, that have been chosen by more models, get a higher value. And this is where a good weighing system might come into play, because some models might not be that relevant for a classification problem: a variance threshold is probably not as good at identifying useful variables as a decision tree, for example. Maybe; it depends on your data and on your problem.
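A sketch of the voting step under the same hypothetical names: each estimator nominates its top-n features, and a feature's score is how many estimators chose it.

```python
from collections import Counter

def vote(selector, names):
    votes = Counter()
    for est in selector.estimators:
        chosen = top_features(est, names, selector.n_features)
        votes.update(name for name, _ in chosen)   # one vote per nomination
    # A weighted variant would add a per-estimator weight here instead of 1.
    return votes  # e.g. Counter({'x07': 3, 'x21': 2, ...})
```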
Well, that's the basic implementation; we don't really need anything more at this point, so we are good to go. Let's see if it works. Remember, our data set was those 500-odd rows and 30 columns, with those column names. It's really hard to see my cursor there. We want to instantiate our ensemble feature selector, and we treat this as a generic classification problem, so we're going to use all the classifiers and all the other techniques. We set the names of the variables so that we have something more explicit than just indices, and we want to get five features from each model. Then we fit on the data set and we get the — sorry — we get the votes. What it looks like is this: nine features were selected, these here, and these are the numbers of votes they got. You can actually go inside the library, inside the object, and examine what scores those features got, so if you want to know how important the single features were according to the models, here are the single scores, five features per model.
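A rough sketch of that run, reusing the hypothetical class and vote helper above (the x00–x29 names and five-per-model setting come from the talk; the data and API are stand-ins):

```python
from sklearn.datasets import make_classification

names = [f"x{i:02d}" for i in range(30)]
X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=5, random_state=0)
selector = EnsembleFeatureSelector(analysis_type="classification",
                                   n_features=5, feature_names=names).fit(X, y)
print(vote(selector, names).most_common())  # features with their vote counts
```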
But since each model selects five features, you end up with more than five; we have nine. All right. So, are you satisfied? It works. Are you satisfied? You shouldn't be. Okay, it works, but does it work? I mean — sorry, excuse me — is this a correct result? Is this a useful result? We don't know. It works, but does it work? So here's what I did.
I trained the same type of algorithm, a logistic regression (this was a binary classification), first using the whole data set and then using just the subset that my ensemble identified, and I collected the results. What would you expect: better, same, worse? And by how much? It doesn't have to be a precise number. Twice as good? No, it's not that good; I'm sorry, I'll be a disappointment to you guys. Anyway, I'm using the same seed for both runs, and the model score using nine features is actually a tad better than the model using all the features.
But again, both those scores are really high, so I couldn't expect to get a 50% increase there. Full disclosure: I've run this a few times, and it's not always better, because there's a random component to it; using different seeds, all the results I could find were always compatible with each other, within two or three percent. Sometimes it's better, sometimes it's worse, but I think the worst score I've seen — using eight features, that time — was in the range of 94.8, something like that. So, still good.
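A minimal sketch of that kind of before/after sanity check — synthetic data and made-up feature indices, not the talk's actual run:

```python
# Same model, same seed: all 30 features vs. only the selected subset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=5, random_state=42)
selected = [0, 1, 2, 5, 8, 11, 14, 20, 25]      # stand-in for the nine voted features

full = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
sub = cross_val_score(LogisticRegression(max_iter=1000), X[:, selected], y, cv=5).mean()
print(f"all 30 features: {full:.3f}  |  9 selected: {sub:.3f}")
```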
"Yes, but why?" Sorry, I don't quite follow, so maybe find me later and we can discuss it.
Anyway, as I was saying, these results are always comparable, and it strongly depends on how the features are selected. This is a small example: we're working with fewer than 600 rows and 30 features. But imagine that you could safely drop a third of your data set and get comparable results, and imagine that your data set is something big, like millions of rows and thousands of features. There would be a huge speed-up in your training and in your work. I've used this in a couple of projects, and it actually worked. How well? Well enough: I got results comparable with the ones I got without the feature selection.
To sum up, the main points are these. You should always take care to know what's inside your data; you should always spend time getting to know it intimately. If you have a situation like the one I showed you earlier, then good luck, you have all my sympathy. You should always take care to select the features that are relevant for your problem, because you might have stuff in there that is really important for something else but that you don't need here.

Feature selection essentially simplifies the models, because it takes the garbage, or the stuff you're not interested in right now, out of the picture. And if you have a very specific problem, you can build your own feature selector, because you know what metric you're looking for, and you can stick it into something like the ensemble I've shown you and give it a very high weight. I've done that a couple of times and it really helped me; maybe it will help you too.

Feature selection increases generalization, because again you get rid of noise, and especially if you're working with linear models, you don't want too much noise in there, particularly if you have highly collinear variables, because that will throw all your coefficients off the roof and you won't be able to use them directly; that's a big problem. Sometimes it helps you avoid the curse of dimensionality: as you know, the higher the number of dimensions, the less meaningful distances become — that's essentially what the curse of dimensionality is — and therefore it's hard to compare objects and records with one another. In general it removes noise and simplifies everything. And again, if you have to choose between something complex and clever and something simple and clever, you should definitely choose simple and clever. That would be it for me; I'm open for questions now. Thank you very much.
Thank you for your talk. I'm also very interested in this kind of automated machine learning, where you don't have any clue about the features; it's an interesting sport. But have you looked at papers like the Data Science Machine, or at H2O, which try to do this automated machine learning?

I have, and personally I prefer to work on features myself, because at least I get to know the data; worst-case scenario, I get to know what's in there, even if I don't really know what a feature represents. I don't completely trust automatic feature selection when it comes from the outside, because I want to know what's going on under the hood. But that's one of my idiosyncrasies, so there's nothing wrong with that.
Hi, thanks for the talk. Do you have any particular class of models for which you think feature selection is really important compared to others — say, linear models compared to random forests?

Some models do perform feature selection within the implementation itself: if you're using random forests, you are performing feature selection before actually running the classification. That being said, some models are very sensitive to features and some not so much, but I would say that going through the effort is a worthwhile investment of time, because you get to know the data, and that's paramount in my opinion.

I agree, but do you think there are particular classes of models for which it's incredibly important to do feature selection?

Depending on what you're trying to do, definitely generalized linear models; that's one class I'm always very afraid to touch without having a fair understanding of the data.

Any more questions? They're making you travel today.
So, did you try regularization, like ridge in scikit-learn? Because that's usually quite okay at dealing with a lot of variables.

Mm-hmm, that's another technique that actually falls under the umbrella of embedded methods — ridge and lasso as well — because you can use the coefficients you get as a measure of how the variables are related to the target. I've done that, and it's a good technique. But again, there's no free lunch, so it might work very well on some types of data and be completely off on another.
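A minimal sketch of that idea — L1 regularization zeroes out weak coefficients, and SelectFromModel keeps the rest (scikit-learn on synthetic data, not the speaker's code):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=5, random_state=0)
# L1-penalized logistic regression: most coefficients shrink to exactly zero.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print(SelectFromModel(lasso, prefit=True).transform(X).shape)
```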
And which was the data set? Because I have my own pipeline, so I want to try it. — So what do you think the data set was? — It's a famous one. — No. I work in healthcare. Diabetes.
Thanks for the talk, very interesting. In machine learning, if I preprocess my data, I split it first and then apply the same transformations from the training set to the test set. Do you think the same should be done for feature selection as well — that you select your features only on the training set, so that leakage does not occur?

That's what I usually do. When I perform feature selection, I usually work just on my training set, and I usually have not just a test set but also a validation set that I put safely away long before I start working on it. That being said, you can work around this and have a look at different permutations of the data set, which I do sometimes when I want to test the stability of my selection methods.
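A minimal sketch of that leak-free pattern (synthetic data; SelectKBest stands in for whatever selector you actually use):

```python
# Fit the selector on the training split only, then apply the same mask
# to the test split, so no information from the test set leaks in.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

skb = SelectKBest(f_classif, k=5).fit(X_tr, y_tr)  # sees training data only
X_tr_sel, X_te_sel = skb.transform(X_tr), skb.transform(X_te)
print(X_tr_sel.shape, X_te_sel.shape)              # (400, 5) (100, 5)
```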
But that only works with data that is not time-dependent.

It often happens that not just a selection of features but also a selection of rows might help: since feature selection is essentially eliminating features that are duplicates or near-duplicates, you could also eliminate rows that are near-duplicates, so you would essentially have a smaller data set to learn your model on, without the noise. Is that something you look at as well, with a training curve?

It depends. Removing exact duplicates from the training set is always something you should take care of, because they don't really add extra information —
unless you're trying to oversample for imbalanced classes. Removing almost-duplicates, or near-duplicates, is a bit more complex, because they might actually add information; you just don't realize it by looking at the records. Some algorithms are more sensitive to this than others: if you were using deep learning and had a very deep neural network, I would probably leave all the near-duplicates in, because they might carry nuances and dependencies that you don't really see just by looking at them. But if you have 500 records and you're using a decision tree, you can safely remove duplicates.
Just to continue what I asked before: if you have very many features but very limited data — say you have 800 features but only 40 rows — what kind of feature selection can you do?

That's a very tough problem. I can't say I've had to work in situations like that, so I don't know exactly what I would do, but personally I would definitely look at the variance first, because with very few rows and a lot of columns there's bound to be — hopefully — a good chunk of features that don't really vary much, and I would probably start by eliminating those. After that, you kind of have to play it by ear.

Okay, so thank you for your talk.
I just want to continue the question about regularization. Did you compare your findings with traditional lasso, and did you use lasso regularization in your linear models after doing this initial feature selection?

There are two things I want to say about this. I have used lasso and ridge for feature selection, in a technique like this, in my own work; I haven't included them in this example because of time.
It would have added an extra layer of complexity, because their interface for getting the coefficients is slightly different, and I didn't really want to tackle that during the talk. But yes, it's definitely something you should investigate if you're interested in it; it works.

But do you happen to have a comparison? I mean, did you run a comparison of this voting system versus just using lasso, to say, okay, this voting system works better — or is it just something you wanted in order to have more control?

No, I haven't run a comparison strictly like that, although I have used results from lasso within algorithms like this, so I didn't really need to compare those results with the final one.

Thanks. Anyone else? I can't see any more hands; wave if you have a question. No? All right, okay. Well, can we thank our speaker?