Data is not flat
Formal Metadata

Title | Data is not flat
Number of Parts | 132
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers | 10.5446/44991 (DOI)
Language | English
Transcript: English (auto-generated)
00:04
So, hi everybody. Today I will talk about a topic that is rather unofficial in data science, called feature engineering. There is no discipline for it that you can study at university, and you will hardly find
00:24
a course on it. It is simply considered something you should be able to do, but it's not always that simple. By the end of the presentation you will see what I mean by saying that data is not flat.
00:40
The structure of this presentation will be as follows. I will briefly mention what feature engineering is and what it is not, and then concentrate on one particular kind of feature engineering, since there are several types. Then I will walk you through a classification example, a really silly project of mine, and
01:03
you will probably like it, at least I hope so. In the conclusion I will summarize how to use feature engineering, what you can do almost automatically, and what actually helps and works in most cases. Afterwards I hope we will still have time for a Q&A session. So what is feature engineering?
01:21
Wikipedia gives a pretty accurate definition: it is the process of using domain knowledge of the data to create features that make machine learning algorithms work. What does that mean? You have probably all heard of the concept of garbage in, garbage out: if you have garbage
01:41
data that doesn't even reflect the situation you're trying to model, you can try whatever model you can come up with, and it still won't produce any good results. So feature engineering is the art of preparing data in a way that lets any algorithm perform better.
02:00
As the definition says, feature engineering relies on expert domain knowledge. By that I mean that someone who drives a taxi probably knows a bit more about the whole industry and how these things work than some college student does. The process can be both automated and manual.
02:23
I won't talk about the automated process; that covers machine learning approaches such as feature learning, for instance as part of knowledge transfer. There are also a lot of Python tools where, if you have multiple datasets, a whole
02:41
lot of tables, you can join them in some way and just run automatic functions that create new features as transformations of the existing columns. I won't talk about that either; I will talk explicitly about manual creation. However, all of these approaches are quite expensive.
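To give a flavour of what those automatic functions boil down to, here is a minimal sketch of blind column transformations in pandas. The table and the column names are invented, and this mimics no particular tool's API:

```python
import numpy as np
import pandas as pd
from itertools import combinations

# Invented numeric table, standing in for "a whole lot of tables".
df = pd.DataFrame({"trip_km": [1.2, 5.4, 3.3],
                   "trip_min": [7.0, 22.0, 15.0]})

numeric = list(df.select_dtypes("number").columns)
for col in numeric:
    df[f"log_{col}"] = np.log1p(df[col])   # blind unary transformation
for a, b in combinations(numeric, 2):
    df[f"{a}_per_{b}"] = df[a] / df[b]     # blind pairwise ratio feature

print(df.columns.tolist())
```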
03:02
You need time to identify what you're working with. You need time to identify which features might be useful. Sometimes it is important to first take a look at your model; sometimes that doesn't tell you anything at all. And as garbage in, garbage out states, data quality is probably far more important
03:24
than the actual model. Feature engineering is not part of the data engineering process. Data engineers are the people who collect the data initially, do some initial cleaning, such as taking completely insane samples out and filtering the data if necessary, and then make the data available.
03:46
They often have no idea what the data will actually be used for, so they are not responsible for digging up specific domain knowledge. Feature engineering is also not part of pre-processing, even though it's done by data scientists. Again, data science is an iterative process.
04:03
You create a model and feed some data in. If you don't yet know how your model will behave, you just watch what happens, and then you start tweaking parameters on the data side and on the model side. This iteration repeats itself many times. But feature engineering comes into question when your model does not perform well at all,
04:25
when even additional models do not perform well, and you have a sense that the data is at fault. As I said earlier, there are pretty much two main approaches: manual and automated.
04:43
For the automated one, there are tons of articles; just search for feature selection, automatic feature engineering, or feature creation in Python, and you will get all kinds of blog posts and GitHub repositories to play around with. It's pretty powerful, but the thing with those approaches is that you need quite
05:06
a lot of data, or high-dimensional data; that is where they work best. Basically, you try to reduce the amount of noise or useless information fed into your model. The second approach is used when you don't have enough data.
05:22
We are talking about either a very low number of samples, or very low dimensionality of the data, or both. What I want to show you today concentrates on a case where I have almost no data, about a thousand samples, which is nothing,
05:45
and at the same time only one dimension. So what can you do with this? Again, I quite like the Wikipedia article on feature engineering. It has quite a lot of resources and links to further sources to read.
06:07
It also describes a process for how to actually approach feature engineering: brainstorming or testing features, deciding which features to create, creating the features, checking how the features work with the model, and improving them if needed.
06:22
Improving means you probably have to scale a feature down, or, if it's a categorical feature, define the categories in a way the model will actually understand, without a skewed distribution. So you need to really know your data, and repeat it and repeat it and repeat it.
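As a sketch of that create-check-improve loop, assuming scikit-learn's logistic regression as a stand-in model (the talk uses a custom network instead), it might look like this:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def evaluate(X, y):
    """Fit a throwaway model and return test accuracy for one feature set."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

# Given some dataframe df with a target column, try each brainstormed
# feature set and keep what helps:
# for features in candidate_feature_sets:
#     print(features, evaluate(df[features], df["target"]))
```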
06:42
This is the process we're all going to go through today. As I said, it's a free-time project of mine. I have a friend; he's an informatics student, and he has very weird sleeping cycles. You know, those typical informatics people who go to bed at, in the best case, 7 am
07:06
and wake up at 8. He's actually quite a big part of our friend group, and everybody is always interested in when he is awake. On Telegram everybody keeps asking: is he awake? Is he awake? And as part of the meme, or the gag you could say, I was dared to write a model
07:28
that would actually try to learn his sleeping patterns, so that instead of asking and waiting for a response, we would get an approximate answer right away. For that, since it was a dare project, I wrote yet another wrapper
07:45
around fully connected neural networks. I did it because I needed it to be configurable my way. It's nothing fancy, just NumPy with some extra features.
08:03
The package, NGenerator, is available on PyPI if you want to play around with it. The feature I needed was that the whole model should be configurable with JSON, so that I would be able to run slightly different models simultaneously,
08:22
or at least have a clear understanding of where I came from and where I am now. You just specify an architecture; it can be flat. Currently only the sigmoid, meaning binary classification, is implemented as the last layer, and before it you can have as many layers as you want,
08:41
each with its own number of nodes. You have activations for each layer, and something like a confidence threshold: you don't get a 0 or 1 right away, you get a probability, and you need to interpret it, so you can set the threshold to whatever you want. You can set different learning rates; pretty much all the hyperparameters.
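A config matching that description could look like the sketch below. The key names are guesses, since the wrapper's actual JSON schema is not shown in the talk:

```python
import json

# Hypothetical configuration; the real schema may differ.
config = json.loads("""
{
    "layers": [16, 8, 1],
    "activations": ["relu", "relu", "sigmoid"],
    "learning_rate": 0.01,
    "confidence": 0.5,
    "epochs": 500
}
""")
```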
09:06
Originally, I was given just 1,000 samples. These were timestamps, plus the state of whether the guy was awake or not, in 30-minute steps.
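The raw material therefore looks roughly like this; the labels below are invented stand-ins for the real file:

```python
import pandas as pd

# One awake/asleep flag per 30-minute step, about 1,000 samples.
df = pd.DataFrame({
    "timestamp": pd.date_range("2019-01-01", periods=1000, freq="30min"),
    "awake": (([0] * 16 + [1] * 32) * 21)[:1000],  # fake day/night pattern
})
```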
09:24
This is what we're going to use. Pretty much, we're going to create an input structure suitable for building the model, create and train the model, and just see what the prediction and the accuracy are in comparison to the testing set. I don't have a validation set because it's just a thousand samples,
09:42
and for the purpose of this presentation it isn't needed. Originally I thought: what will happen if I just throw the raw time series in and see what happens? Well, nothing good. You can see that the model performed at 30% accuracy,
10:04
which is way worse than even a coin flip; I could just wait and see whether the guy is awake or not. And I thought: maybe it's not really a prediction problem, not a regression; it might be a classification problem. And in that case, I obviously lack data.
10:22
But the good thing is, I have time data, and time has certain attributes to it. You can describe a certain point in time by what date it is, or not even the date, but what day of the week it is. And for us humans that is important, because from Monday to Friday
10:41
we tend to work, meaning we have a certain behavioural pattern, and on the weekends our behavioural pattern might not end, but change. In this case, that difference is important. It also matters whether it's late in the evening, early in the morning, or lunchtime, so you can see where I'm going with this.
11:03
So I tried to unwrap the timestamp into a vector of such features describing each timestamp. Basically, I went through brainstorming and testing features and deciding which features to create: the day of the week and the hour.
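Unwrapping those two features from the timestamp is a one-liner each in pandas, continuing the sketch from above:

```python
# Two brainstormed features derived from the raw timestamp.
df["day_of_week"] = df["timestamp"].dt.dayofweek   # 0 = Monday ... 6 = Sunday
df["hour"] = df["timestamp"].dt.hour

X = df[["day_of_week", "hour"]].to_numpy()
y = df["awake"].to_numpy()
```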
11:23
I created them, and I also checked how they work. By the way, I'm using a two-layer fully connected neural network. This proved to be the final best model, because after I did some feature engineering I also played with different architectures,
11:40
and this one had the best performance. So, taking it from the very beginning: we run the model with two features, the day of the week and the hour. The model suddenly jumps to 81% accuracy, which is really not too shabby. And that got me thinking: what else do I know about this guy?
12:04
I know that he's a student, and that he's a human. I know that he also lives in Hamburg, where I come from, and like any living being in Hamburg we are affected by the weather, which tends to be a little better in summer and is completely disastrous in late winter and early spring.
12:24
So the season might also be important. In this case I said, okay, let's roll the season into it as a third attribute, and now we have a three-dimensional vector as input data. Now the model performed 8% better than
12:43
the previous one. The model is the same; the data is just unwrapped further. So 89% is already way better than what I started with originally, but at the same time, 95% and above is what should be considered production-ready.
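The season feature might be derived like this; the 0-to-3 coding is an assumption, as the talk does not show its exact encoding:

```python
# Third feature: season, derived from the month (Northern hemisphere).
season_of_month = {12: 0, 1: 0, 2: 0,    # winter
                   3: 1, 4: 1, 5: 1,     # spring
                   6: 2, 7: 2, 8: 2,     # summer
                   9: 3, 10: 3, 11: 3}   # autumn
df["season"] = df["timestamp"].dt.month.map(season_of_month)
```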
13:01
So I wanted to go further, I wanted to push the limits. What do I know about this guy again? He's a student. Is that important? Well, actually, yes, because he has an exam phase, a lecture phase, and vacation. And I know myself: I tended to behave
13:20
completely differently during exams and during vacation. During vacation I slept the whole day through, the whole night through. So why should it be different for him? This is a really artificial feature, because it is static information taken from the website of the university
13:40
where he and I are students. You could probably automate collecting such information if you know how, or if you trust the sources that can provide it. Anyhow, this is more or less static, and now we have a four-feature model.
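A hand-copied academic calendar can be turned into a feature along these lines; the date ranges and the coding below are invented placeholders, not the real calendar:

```python
import pandas as pd

# Fourth feature: university phase, from a static, hand-copied calendar.
phases = [
    ("2019-01-01", "2019-02-15", "exams"),
    ("2019-02-16", "2019-03-31", "vacation"),
    ("2019-04-01", "2019-07-15", "lectures"),
]
phase_code = {"lectures": 0, "exams": 1, "vacation": 2}

def phase_of(ts):
    for start, end, name in phases:
        if pd.Timestamp(start) <= ts <= pd.Timestamp(end):
            return phase_code[name]
    return phase_code["vacation"]   # fall back outside known ranges

df["phase"] = df["timestamp"].map(phase_of)
```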
14:03
It performs at 95% accuracy. For me, 95 meant industry-ready; I was done with it. But if I take a step back and look at it, all I did was try to reconstruct behavioural attributes,
14:22
or time attributes, of this guy's behaviour. I started originally with only a timestamp, and I ended up with pretty good results by feature engineering alone. You will just have to believe me, though you can obviously try it out yourself: tweaking the model parameters
14:43
and taking different models didn't perform well at all. I never got better than 45% accuracy on timestamps alone. I tried LSTMs; they didn't perform. I tried some models from classic econometrics like ARIMA,
15:01
but that is not exactly the problem I was approaching. So let's check whether this guy is awake now. No, he is not. It's like three; yeah, he will probably be awake in two or three hours. Ping me on Telegram or whatever and ask whether he really is awake,
15:21
if you're interested. So this is pretty much what I wanted to show you. It is very simple, yet very powerful. And all I used was myself as the expert: I know this guy. I pretty much simulated my own train of thought:
15:40
what do I know about the guy, and what information do I use? Then, as a programmer, I tried to implement or collect that data. And that's everything, based only on a timestamp. Imagine what you can do if you have more data to unwrap.
16:00
Another case I can think of: there is, for instance, a competition on Kaggle about taxis. You have to predict the taxi waiting time based on three or four years of taxi data from Mexico or Ecuador.
16:21
The file they provide contains fairly ordinary data: geo coordinates of pickup and dropoff, timestamps of pickup and dropoff, the distance actually driven, and how long
16:40
the whole ride lasted, and things like that. But you can throw in something more. For instance, you can again automatically unwrap the timestamps into a date, a day of the week, and an hour.
17:01
Based on the day and the hour you can, for instance, decide whether it's rush hour. And you know that during rush hour, cars tend to have longer driving and waiting times than normal because the roads are full. You can also define something like holidays, national holidays.
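A rush-hour flag on top of the unwrapped pickup time could look like this sketch; the exact time windows and the tiny stand-in dataframe are assumptions, not part of the dataset description:

```python
import pandas as pd

# Minimal stand-in for the Kaggle taxi data, already unwrapped into
# day_of_week and hour; the real dataset has many more columns.
taxi = pd.DataFrame({"day_of_week": [0, 2, 5], "hour": [8, 13, 18]})

def is_rush_hour(day_of_week, hour):
    weekday = day_of_week < 5                       # Monday to Friday
    return int(weekday and (7 <= hour <= 9 or 16 <= hour <= 19))

taxi["rush_hour"] = [is_rush_hour(d, h)
                     for d, h in zip(taxi["day_of_week"], taxi["hour"])]
```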
17:21
This is where expert domain knowledge really comes in, and this is purely manual. I can't imagine an automatic approach coming close to this train of thought and this performance, because the data is simply not there.
17:41
At least it looks that way. So yes, it is costly. It took me about three days: one day to build the model, and one day just thinking, okay, what can I do? I only have this much data, and only this data.
18:01
What should I do? What helped me was, again, reconstructing my own thoughts. Even the model I chose, the fully connected neural network, is a reconstruction of how my brain works. For instance,
18:21
feature importance plays a role there, and the layers reflect the personal attributes that I value more. For instance, I know that exams are not that important for this guy, but what hour it is matters a lot.
18:40
And you could see that after adding just the day and the hour we gained about 50 percentage points of accuracy, while adding the season gained only another five to seven. The approach is also very error-prone. Because it relies on expert domain knowledge, you will often, especially in industry, have to drag in someone
19:01
who knows their stuff, for instance engineers or even taxi drivers. And in that communication between the expert and the programmer, data engineer, or data scientist, some information will be lost. Also, again, it is not really clear
19:21
at what scale and which features should be incorporated into the model or into the data set. Sometimes you are so happy about the results, about the progress you're making, that you throw in features that are based on each other, which lands you in a multicollinearity situation
19:42
where you have quite a lot of noise but not much meaningful data. So a sanity check is always a good thing: create a few features, see how they perform, maybe fine-tune the model, try to get the best out of this small scenario, this local optimum, and then think again.
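One cheap sanity check is a pairwise correlation matrix over the candidate features, continuing the dataframe from the earlier sketches; the 0.9 threshold is arbitrary:

```python
import numpy as np

# Highly correlated feature pairs add noise rather than information.
features = df[["day_of_week", "hour", "season", "phase"]]
corr = features.corr()
print(corr.round(2))

# Flag off-diagonal pairs above the (arbitrary) threshold.
mask = (corr.abs() > 0.9) & ~np.eye(len(corr), dtype=bool)
print(corr.where(mask).stack())
```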
20:03
Every brainstorming session will be affected by the previous one, and at some point you will run out of ideas. So, feature engineering is not knowledge transfer, but it can be used as a part of it.
20:21
Basically, you train models on some amount of data that at first glance has nothing to do with your task, but you have to train the first model somehow, so feature engineering can be useful for that step as well. By unwrapping features,
20:41
you create even more features, obviously, and then you can use feature selection, automatic feature creation, and whatever else, because you have much more space to do that. I think it's very powerful. You can argue with that, but I think it's powerful. And it worked with all models. I tried some neural networks,
21:02
I tried classical approaches like decision trees, and they all performed way better when they got more data. It's like, yeah, duh, but every time you see it in the code and in the results, it's a wow. So this is pretty much everything I wanted to say about feature engineering.
21:21
Thank you for your attention.

So thanks for the great talk. There's time for questions.

Hi, thanks for the talk.
21:40
I was just wondering, have you at all compared automated feature selection to your insightful feature selection, and whether they perform better or worse?

I wouldn't use feature engineering when I have enough data to actually use feature selection.
22:01
Feature selection comes in when you already have the data to search through, to take some features out and prepare for training. Feature engineering pretty much enables feature selection, so I wouldn't compare them on that level.
22:26
Thank you for the talk. How did you select the frequency of your data, for example?

I was given it. I was pretty much just handed a txt file, and I knew nothing more than that.
22:41
So in your case, you want to know at what hour your friend is awake, not in what minute or second?

It would depend on the task, obviously. In this case it was enough: 30 minutes is fine for humans wanting to know whether somebody is awake or available. A real-world application would be an automated system for remote teams
23:03
that allocates slots for meetings or discussions based on people's working patterns, on when they tend to be available. There you could use, for instance, multi-class models with states like: I want to do my correspondence, I am available for a talk,
23:21
I am focusing on new features, and so on and so forth. In such a real-time or real-world example you would probably need a different frequency of data. In my case it was more than enough. And again, depending on the frequency of the data, different features will be more or less important.
23:41
For instance, the exams: I was just lucky that I had two university phases in the data sample; otherwise it wouldn't have made any difference. Okay, some more questions.
24:08
Hi, thanks for the talk. You seem to be advocating adding more and more features until you've improved your model enough. Is there some point where that starts to become counterproductive
24:21
and you hit the curse of dimensionality by adding more and more dimensions to your model?

In my case that was rather hard, because honestly, at four features I was already thinking: ah, what else can I add? But if you start not with one dimension but with more,
24:43
then yes, basically you can hit the problem where you're trying to gather all the data, simulating the whole real world, which is the point where you start noticing patterns or behaviour, if we are talking about classification, that are not typical for your case but are more general.
25:03
For instance, you know there are historical cycles or whatever, and it's not only the amount of data that matters, but things that show exactly when something happens and why it happens, so that you can learn from the features. And this is where it starts to be counterproductive. This is why I say: define no more than four features.
25:25
First test them out, try to replace them; four features is a pretty small amount of work to start with, unless you have a really giant model, but in that case something probably already went wrong before you started
25:42
with feature engineering on such a complex model. And after a while, I personally tend to say that it depends on the case, on how complex the thing you want to learn is. If you want to learn just the sleeping cycles of a dude,
26:01
then you don't need much data. If you're trying to build a self-driving car, oh boy, you're going to need quite a lot of sensors, quite a lot of data, quite a lot of features. So basically, the complexity of the task dictates the complexity of the data
26:21
and, directly, the number of features needed. You can estimate yourself how complex your data should or could be. Halve that, and then start working. Okay, any more questions?
26:41
Come on, I don't bite. Then maybe I have a question. I'm not a data scientist, so I don't have much of a clue about this, but I've heard that as you add more information to your model, there's a risk of overfitting. Did you do some sort of cross-validation?
27:02
Because your precision goes up, but that is sort of to be expected anyway, isn't it?

True, in this case I didn't, because I had way too little data; I was always in underfitting territory, and the number of features was totally under control.
27:23
If I had started throwing in things like, does this guy live with his parents, does he have a pet, and so on and so forth, then I probably would have overfit at some point. So basically, in order not to overfit, or to keep the risk of overfitting low,
27:42
you need to keep your data low-dimensional, so as not to overcomplicate the description of the situation.
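For the cross-validation the questioner asks about, a minimal sketch with scikit-learn, again using logistic regression as a stand-in for the talk's custom network and the X, y arrays from the earlier sketches:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation as a quick overfitting check.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```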
28:03
Okay, one last question, or there's still time.

What if we come up with a feature that really isn't helping the accuracy? Do we just drop it, or how can we observe that?

There was such a case. When I initially started creating features, I didn't think
28:22
about the scale at all at first, but that was okay because I had like four categories. At some point I tried to scale them. For instance, you saw the two- and three-feature versions here: those are basically scaled, not the raw version
28:41
from zero to whatever, but more of a normalized one, from minus two to two or so. And those models did not always perform much better. At one point I tried to encode the season
29:02
as minus two, minus one, zero, one, and it decreased the performance by seven percent. Also, some features, as I said earlier, are just a slightly different form of already existing ones, so they won't really decrease your performance;
29:21
they might, but they won't help either. Also, always check your data to see whether the feature you want to create will even be present. I had another case where I was really interested in whether some national holidays affect taxi waiting times.
29:41
I spent not that much time, maybe four hours, really digging out all the information, and it turned out the data just didn't have those sample points. So, time wasted. Okay, one last question.
30:03
Thanks. So when you're dealing with these small data sets and low-dimensional data, how do you differentiate between the case where there just isn't enough information to build a useful model and the case where you just haven't found the right manually engineered features yet?
30:22
Mostly gut feeling and common sense, honestly. It may differ. Again, in this case it was pretty obvious: I simulated my own thoughts about the problem and figured out what I personally would call a feature, what my brain defined as a feature. And that is pretty much where I stopped.
30:42
In more complex situations I would probably start with the model first and not the data, given at least four features or so. After a while I might notice that no model tuning actually works,
31:02
but the data is clean: I don't have missing values; it is sane. Then I would ask what my model is trying to describe. And as a human, I am smarter than the model, because I know more and I have seen more. That is where I would start trying to dig deeper
31:23
and bring some features in, which the model might find sufficient, or might not. Thank you very much.

Okay, so give a big hand to Alicia again.