Everyone can do Data Science in Python
Formal Metadata
Title: Everyone can do Data Science in Python
Title of Series: EuroPython 2015
Part Number: 114
Number of Parts: 173
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/20129 (DOI)
Production Place: Bilbao, Euskadi, Spain
Transcript: English (auto-generated)
00:19
I live in London, but I'm from Spain, from the north, also, quite close, more or less,
00:27
to here. Where are you from?
01:47
All right. Hello, everyone, and welcome to our last session before lunch. Please join me welcoming on stage Ignacio, talking about how everyone can do data science in Python.
02:06
Hi, everyone. Thanks for being here. My name is Ignacio Elola, and I'm going to talk a bit about how to do data science in Python, and what data science is for me. So, a quick overview, a summary of what we are going to do.
02:21
I'm going to talk about who I am, what I do, so why I'm here, actually, talking about this. A bit of an overview about what data science means for me, what is the, let's say, flavor of data science that I'm going to be talking about. And then we will do a quick overview of the data science cycle with some examples in
02:42
Python: data acquisition, cleaning, processing, and also using that data to predict some stuff. So that's me, with a bit less of fossil here. And who I am: I'm not a software developer by training. I studied physics, actually, so I came more from the maths background,
03:06
or point of view. I've done some research in systems biology, complex systems, always very interested in how things work with each other, and things like that. That drew my attention to big and small data not so long ago, and I started
03:25
coding in Python around three years ago. You need to bear in mind that all my previous coding experience was doing Fortran 77 during university, and I'm not kidding, it was not so long ago, and probably they
03:42
are still teaching Fortran 77 in physics, I'm sure. And yes, 77, not even Fortran 90. I became, obviously, in love with Python very easily, and I also became engaged in the startup world, doing a lot of data science, and those kinds of things.
04:06
I'm also a huge advocate of pragmatism and simplicity, and you will see that in everything that I'm talking about today, that's why this talk is also pretty much a beginner's talk in data science, because I believe that with very little tools, you can do a lot actually.
04:24
You cannot solve everything, that's for sure; there are still problems and things that will need very clever people to work on them for a lot of time, but most of the stuff actually can be solved quite quickly by most of us.
04:41
Now, contrary to saying that I'm a big advocate of pragmatism, I've done, for the very first time, all these slides in a Python notebook, because, well, I thought, you know, it's a Python conference, I should give it a go and do all my slides in Python. It made sense, but it took me forever, so it was not very pragmatic,
05:02
but I'm actually quite proud of the result, even if it doesn't look as good as if, you know, I had used PowerPoint or whatever. One more thing: I'm also the man standing between you and the food for lunch, so I will try to be a bit fast and make this a bit fun, because, yeah,
05:21
I'm looking forward to the food after all the introduction earlier today about it. I also work at Import.io. This is relevant because of some of the stuff that I will be talking about, and also because of the vision of data that I have and the kind of data science that I do. And what is Import.io? Import.io is a platform that has two different sides.
05:45
On one hand, it has a set of tools, free tools, for people to use to get data from the web, so to do web scraping without having to code. It just has a UI, and you can interact with it without really a lot of technical knowledge and get data from the web, even doing crawlers or
06:03
things like that. And it's also, on the other hand, an enterprise platform for just getting data. So we use our own tool and other things and we just generate very big data sets that we sell. I've been working at Import.io for a couple of years as a data scientist and
06:20
more recently as the head of data operations. So heading, basically, the data services that put those data sets together and deliver them to customers. Now let's go into the topic: what we talk about when we talk about data science. There is a lot of hype around data science,
06:44
which obviously comes with good things and bad things. When you have hype, there are some good things about it. There are a lot of jobs around it, so it's easy to find a data science job. You can get very well paid to do it, but there are also some bad connotations to it.
07:03
So usually a lot of roles are ill-defined, so you can find, under the same tag, things that are really, really different. And expectations sometimes can be quite unfair to what the job actually is.
07:24
To define what I mean with data science, I'm going actually to just talk about it, to just talk about what is the cycle of data science for me, as it could be the cycle of development. And we will just see it on the go, what I mean with data science.
07:42
And I'm going to start that introduction, cycling around this nice picture. This is called The Hero's Journey, which I took from Wikipedia probably. And I'm not even sure if the context of this image was talking about movies or books or whatever.
08:00
But it's a very nice metaphor for, I think, most development cycles, and a very, very good one for data science. That thing that is called the call to adventure in that diagram is what I call the problem to solve, or the business question.
08:22
Everything needs to start with that. All pieces of work that we do in data science need to start with a business question, with a problem that you need to solve. Otherwise, you are just doing things for the sake of it. And I will be coming back to this theme probably two or three times over the presentation, because it kind of obsessed me,
08:45
because I see the opposite a lot of times. So yeah, here is where my pragmatic side comes in. That's always the starting point. Then that threshold between the known and the unknown is when we start actually
09:02
collecting data and cleaning data to try to solve that problem, all those questions. We then need to do exploratory data analysis, which is usually what drives us to some kind of revelation, where we can actually start to have some insights and know what we can do, what we cannot do, and so
09:20
on in the framework of the business that we are working on. Then come the algorithms and machine learning, so trying to use that stuff to make some predictions. And the last thing, but not the least important, is that at the end we need to answer those questions that we tried to solve, or to do a kind of MVP.
09:42
And we need to remember that this is a cycle. When you arrive at your first model, it's just the first step into making it better. It's just the first step into actually solving that issue. You might then realize that you have learned something, but you have learned that that model is not the correct model that you need to use,
10:02
or that you need to change the kind of data that you were using. As long as you have learned something from the first iteration of the cycle, you're going in the right direction. I also want to mention that when we talk about data science,
10:21
especially in tech talks like this, most of the time we just focus on the machine learning and the algorithms, which is fine, because it's a lot of fun. And if you're talking with people that came from mathematical backgrounds or from programming, they will get really deep into this kind of stuff,
10:43
because we find it fun, myself included. We find it fun to be playing with Google's deep dream code, or to do stuff like that. Now, actually, most of the time that we do data science or something similar, we are not playing with those kind of stuff, and we are doing many other things.
11:02
Like data cleaning or exploratory data analysis usually takes much longer than playing with algorithms or tweaking them, and not everybody talks about those kind of stuff. And usually a lot of the pitfalls are there. So I'm not gonna read all of these things, but
11:21
I think it's a very nice list of sentences, and I agree with most of them. And I will just highlight a few things. The data is never clean. Yeah, most of the tasks will not require deep learning or things like that. Most of the tasks actually could be done with very easy tricks, and
11:41
we will see that. And yeah, this is basically a lot of the things I believe. I didn't write this, I quote there the person who wrote this. But it's very pragmatic, I like it a lot. I think there's a lot of truth about data science there.
12:01
So let's go inside that cycle and let's see some examples, and let's try to do some stuff and see how that goes. This is a cycle which basically is you get data, you process data, digest the data, and then you use it, and that's like a mantra. We need to be a bit careful with that mantra, because if you go deep into it, you can be biased by yourself,
12:25
biased by the data that you have. And then because I have this kind of data, I'm going to predict these kind of things because that's what I can do. Or biased by, I really like to do a neural network right now, so I'm going to do that. Those kind of things happen, and happen all the time.
12:42
And actually what you should be biased to is through the business to saying, okay, I'm trying to solve this issue, I'm trying to predict this thing. So what data do I need for that? And what is the kind of algorithm or model that I need to make that prediction? And that's the right approach.
13:02
Sometimes you might end up using, yeah, the data you have and doing that cool neural network. Other times you might be doing a very simple regression or just drafting some KPIs, but that's fine. The goal always is actually to have an action after what you have done.
13:20
Your goal is that when you have finished your work, something is going to change. Something is going to change in your business, or something is going to change in how people use your product, or in how you see your product, or whatever. But there needs to be an action. If it's just work for the sake of it, something is going wrong, and
13:41
you need to fix it. So let's go into getting data. This is a very important part. I'm not going to stop a lot on it, but it's a very important part, because we can also be biased in getting data. Not a lot of people talk about this, but we can get data from our
14:01
internal data store, which could be a MySQL database. And getting data then might mean doing a SQL command or a series of SQL commands and putting that into maybe your Python code or a file that you then are going to process and make predictions on. Now, this is very important because usually when then you are going
14:23
into the machine learning and doing cool stuff with the data, you don't think again about how did you get the data. And if you have done a mistake, or if there is some kind of bias in how you get the data, you will be conditioned for the whole rest of the cycle.
14:41
This is the very first step of the funnel. So you need to be sure that you are doing it right, or that, if you are doing something where you have questions, you at least have written down those question marks, so you know where to go in the future if you need to review this. As I was saying, we can get data from internal sources,
15:05
like the database where you have data around your web page or around your customers or something like that. Or you can get those external sources, which for me, and obviously I'm biased here because I work on this, can be things like web data.
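For the internal-source case he mentions, a SQL query feeding a Python analysis, here is a minimal sketch of what that acquisition step often looks like; the connection string, table, and column names are hypothetical, not anything from the talk:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string and table; the point is just SQL -> DataFrame.
engine = create_engine("mysql+pymysql://user:password@localhost/analytics")
customers = pd.read_sql("SELECT user_id, signup_date, country FROM users", engine)

# Optionally persist it so the rest of the pipeline can consume a plain file.
customers.to_csv("users.csv", index=False)
```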
15:22
That external web data is what you get from crawling or things like that. The next step is to process the data. When I talk about processing data, I mean digesting the data: getting from the data that you got from a SQL query, let's say, or whatever that is, into the actual
15:44
ndarray that you're going to use in Python to make a prediction or to make a plot. That's when the data is ready. And there are steps in between where things can go wrong or where things just can take time. So we are going to do a very simple example.
16:02
This is a web page called Speakerpedia, which I found by pure coincidence some time ago. It's basically like a Wikipedia, a list of speakers from around the world, on all kinds of topics. You can find, I don't know, Obama there, or things like that.
16:22
And how much they cost if you want to put them in your conference. Basically, this was a surprise for me, because I didn't know people charged to speak in places, but apparently some people do. So I crawled the whole site and made a database of this stuff just to do some analysis and get some quick fun insights into how
16:44
that strange world of people who receive money for speaking works. I've done that with Import.io, but I'm not going to go into how I crawled the site. It's pretty easy, and if someone is interested I can show it to you. It probably takes like ten minutes or so to set it up.
17:03
And I'm using pandas to look at the data and also to clean it a little bit. As you can see, here I'm just consuming the CSV that was, let's imagine, the output of my crawling. It actually was. And we got more than 70,000 speakers, and we got a lot of information.
17:24
I'm just plotting here some of the ones, sorry, showing here some of the columns that we have: we have the speaker name, the fee, we have the location, tags, stuff. There are a lot of things to clean here, which is very common when getting data from the web.
17:40
And in some cases you can just do the cleaning while you extract the data. It's the same whether you are querying a database or crawling. If I had used the right regex, let's say, I could have turned those fees ahead of time into a number that would be read as a float here and not as a string, because I have these cases.
18:02
But I've done it very plainly and naively, just to showcase how these kinds of things happen and how we need to deal with them. The same thing happens for the Twitter column, where we have that inside a list, and many other things. I'm actually showing only a few columns here, but I have many others.
18:22
So I'm showing here how we can clean, for example, the fee data, because if we are going to do something simple, the very first thing that I would like to see is how much people charge for speaking, and how many people actually charge, and things like that. So I can very easily replace those cases with zeros in the string and
18:42
then reload, let's say, that column of the DataFrame as a float. And then we have this ready to be used, to be consumed. That's basically what I'm calling processing the data: getting it ready for use.
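A minimal sketch of the loading and cleaning step he describes, assuming a hypothetical speakers.csv with a fee column scraped as text; the file and column names are illustrative, not the actual ones from the talk:

```python
import pandas as pd

# Load the crawled data; file and column names are hypothetical.
df = pd.read_csv("speakers.csv")

# Fees were scraped as strings; strip the non-numeric cases and cast to float.
df["fee"] = (
    df["fee"]
    .astype(str)
    .str.replace(r"[^\d.]", "", regex=True)  # drop currency symbols, commas, text
    .replace("", "0")                        # empty / non-charging cases become 0
    .astype(float)
)

print(df[["speaker_name", "fee", "location"]].head())
```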
19:06
And there are a lot of things to do when using data before going into making predictions with it. And a good example is the data set that we just saw. That thing that is called exploratory data
19:21
analysis is basically knowing, okay, I have that data set, I thought that was cool, we now need to make something out of it. We need to know where we can start. And I'm breaking my rule here, I know. I have no business context or question in this problem, okay? This is just for fun. I just found that thing and did it for fun. I don't really have an objective as far as this goes.
19:43
We will see other examples later where I do have that objective and which are more like real-world examples. This is not the case here. But the exploratory data analysis point is very similar. You need to see what your data looks like. And if I want to see what my data looks like in the previous example,
20:02
well, I can print the average, the median, and the mode of the fees in that data set. And we see very easily here, well, we have an average fee of more than, what is that, $12,000, but the median and the mode are zero,
20:21
which is already telling us, okay, a lot of people actually charge zero. So that average is probably meaningless in that sense. If we do a box plot, we actually see that, but we see something else too. The box plot is not even a box, it's just a line, because there are so many things close to zero.
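A sketch of that quick look at the distribution, continuing from the hypothetical cleaned data of the previous sketch:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("speakers_clean.csv")  # hypothetical cleaned output of the earlier sketch

# Quick summary statistics of the fee column.
print("mean:  ", df["fee"].mean())
print("median:", df["fee"].median())
print("mode:  ", df["fee"].mode().iloc[0])

# The box plot collapses to a line near zero and exposes the extreme outliers.
df["fee"].plot(kind="box")
plt.ylabel("fee (USD)")
plt.show()
```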
20:43
And we see that that's also because we have like three outliers here. Three outliers that are, I don't know, a really crazy number. So crazy that I think it's probably not true. I don't know exactly how Speakerpedia works, but we can go back to the source and think again,
21:02
and this is why we need to think about this kind of stuff. Well, if Speakerpedia is actually like a Wikipedia and people can edit things, that might not be true. That might be someone putting in something crazy, because that's, what, 10 million or whatever. Or even if it's true, it's changing a lot everything that I do with my data set.
21:23
I have 70,000 people here, and just those three guys are going to change all my numbers. So I might want to exclude those outliers in any further analysis. And one more thing to comment here: I really love box plots. I think they are one of the most important things,
21:41
important plots that you can think of. And probably, if I could choose only a few plots to work with for the rest of my life, it would be like only three or four, and I think I could do it with those. Probably scatter plot, box plot, line plot, and a histogram, and who needs anything else? I don't know, maybe journalists, to plot pie charts, but
22:02
really not people who are doing actual stuff. Now, after saying this, probably tomorrow I'm going to use something else and see that this is super important, but that's what I think. We can go deeper into this and say, okay, let's actually see the histogram, but avoiding those crazy guys, to see how this is actually distributed.
22:24
The distribution is something like what we would expect, and if we again do the same thing of calculating the median, the mean, and the mode, we see that the average is much lower, but we still see the same thing. Because there are a lot of people charging zero. There are a lot of people, also on Speakerpedia, who are actually not charging. They are just there because it's a list where you see people by location and by topic and things like that.
22:41
They are just there because it's a list where you see people by location and people by topics and things like that. So what make even more sense to do is something like this. Where I'm seeing how many people do not charge anything, and how many people is charging, and what is the average for
23:01
those people, which is around $20,000 for a talk. But we see that only one in four people on Speakerpedia does that. Now, this is getting me back to my previous point of always knowing what your data sources are, and how you are biased from the very beginning.
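A sketch of those two views, the outlier-free histogram and the charging split, under the same assumptions about the hypothetical DataFrame (the outlier cutoff is illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("speakers_clean.csv")  # hypothetical cleaned data, as before

# Histogram of fees, excluding the handful of extreme outliers.
df.loc[df["fee"] < 100_000, "fee"].plot(kind="hist", bins=50)
plt.xlabel("fee (USD)")
plt.show()

# Split into people who charge and people who don't.
charging = df[df["fee"] > 0]
print("share who charge:", len(charging) / len(df))
print("average fee among those who charge:", charging["fee"].mean())
```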
23:20
Because the right conclusion here is that 25% of the speakers on Speakerpedia charge an average of $20,000. It's not that 25% of all speakers charge. Because no, most speakers overall don't charge. It's just that they are not on Speakerpedia; I'm not. And that's a very important point.
23:40
It's kind of obvious in this case, and maybe it's not so obvious when you are working with your database on Hadoop, but it's actually the same, and you need to have it clear. Other things that we could do here, and we are not going to, are things like repeating this kind of analysis per speaker topic, and seeing how different topics charge differently, maybe.
24:04
Or have a different ratio between people who charge and people who don't charge. That's something very easy; we already have a column for the topic. We can do, I don't know, location versus fee: how the fee correlates with the location of the speaker.
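A sketch of those follow-up cuts with a pandas groupby, assuming hypothetical topic and location columns in the same DataFrame:

```python
import pandas as pd

df = pd.read_csv("speakers_clean.csv")  # hypothetical cleaned data, as before

# Average fee, median fee, and share of charging speakers per topic.
by_topic = df.groupby("topic")["fee"].agg(["mean", "median", lambda s: (s > 0).mean()])
by_topic.columns = ["mean_fee", "median_fee", "share_charging"]
print(by_topic.sort_values("mean_fee", ascending=False).head())

# Same idea per location.
print(df.groupby("location")["fee"].mean().sort_values(ascending=False).head())
```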
24:21
All those kinds of crazy things, very interesting. Basically, when we do exploratory data analysis, we always want to do that kind of thing: knowing what our median is, what our mean is, what our mode is, what the percentiles are, plotting the data to see how it actually looks, which outliers we have, and also which variables correlate or
24:43
can correlate with others. I'm not going to speak a lot about correlation, but I'm going to give you at least one comic about it, which I think is kind of important. We could do a whole talk just about this.
25:02
But I think the comic probably makes the point even better. So okay, we were using data. That was an example of a very quick and dirty exploratory data analysis. Another thing, before we go into predictions, is KPIs, key performance indicators.
25:20
So what are the metrics of the thing that you are trying to solve or the thing that you are measuring? Because sometimes just monitoring the right metrics can save your business and very simple things can have a huge impact. So we shouldn't be afraid of going sometimes for simple tools to do simple jobs.
25:43
Every tool is right for one job. And we shouldn't be afraid of things like Excel. Just the fact that we can consume data from pandas and do really cool stuff, that doesn't mean that sometimes, I don't know, Excel is not the right tool.
26:00
And I'm saying this because it's actually how most people consume data. CSVs are how most people consume data and how most people are also going to read your data. So a lot of times the output of an analysis or of a report or whatever is going to be a CSV. And it's also important that we know, not just how to work with those
26:21
tools, which is not so difficult, but how to make good use of them. There is even a whole book written by John Foreman called, I think, Data Smart, which is just about how to do data science only in Excel. And it has a lot of stuff about modeling and machine learning only in Excel.
26:43
When I'm talking about Excel here, I'm talking just about something that can give you a graphical interface for viewing and editing a CSV. Not really about Microsoft Excel, even if I chose that picture, because I think it's kind of amazing. Okay, let's go now into making actual predictions,
27:02
into doing some machine learning and modeling. And I'm going to do super simple stuff here, but I'm going to use different examples and a whole bunch of different algorithms. First of all, when we get to this step is when we separate the data into what
27:22
is called a train set and a test set, and this means a whole world. This means everything in data science, because this is the basis of how you will be able to, in theory, prove that your predictions are correct. This means that all the data that we were preparing before,
27:41
we are going to split it into two pieces. And one piece is going to be used to train our algorithm, train our machine learning model. And the other one is the one that we will use only to test the results. So it's the one that we are going to test in the model and then see if we were right or not. Because we know the answers for that one, so
28:02
we can see what the answers of our algorithm are and whether they match. And we can have some kind of accuracy for our predictions. It's very easy to get biased by this. It's also very easy for your data set not to be representative enough. You have a sample set that is actually not good enough for
28:21
the problem that you are trying to solve. But then you divide it, you train your model, you test it with your test set, and suddenly you say, wow, I have 90% accuracy. And when you then go to a real data set outside your fairly big data set, the accuracy is completely wrong. That happens a lot of times; it's a very big problem.
28:43
So we need to be doing this all the time. The train set and the test set are what is going to tell us how good our algorithm is, but it's not like a magic thing. It's still biased by what your first data set was and where and how you got it.
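A minimal sketch of that split with scikit-learn; the data here is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 3 features, one numeric target.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hold out 25% of the rows purely for checking the trained model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```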
29:01
After doing that, we have basically only one question to answer from my very simplistic approach, which is, do I want to predict a category or do I want to predict a number? If I want to predict a category, I'm in a classification problem. If I want to predict a number, this is just a regression. So there are only basically two things to do.
29:23
Being simplistic, and leaving edge cases aside, we can put almost everything into those two buckets, which are very, very differentiated, and they depend on what the output is. Is it going to be a number or is it going to be a category? Let's start with regressions, because I think it's what everybody has done.
29:45
Everybody in high school has used least squares. And least squares is a machine learning algorithm: it will predict new points from some training data that we have.
30:03
There are others, things like LASSO or support vector regression, for example. We will see an example, but a least squares fit is basically a machine learning algorithm, and any other regressions that we do are basically going to be the same. Or the same in theory.
30:22
The only thing that will change most of the time is how we define the distance between the dots and the line or curve we fit to those dots. How you define this distance, whether it's this thing or that thing or any other crazy thing, is what makes the difference between having a very simple
30:45
algorithm here or having a more complex one. But in the end, we are basically doing this. Maybe we are doing it for 20 dimensions and not for two, and we maybe have a whole bunch of other problems. But in the end, this is what we are doing.
31:02
And I'm going to do another example here. The data that I'm going to use now is more business oriented. It's hard drive prices that I also scraped from the Internet. So I have a whole CSV with features of hard drives and their prices.
31:23
And I can very easily do a linear regression, which is least squares, this thing that I'm doing here. After dividing my data into test and train sets, I can see, basically, more or less what the variance score is for
31:40
that linear regression, and see how it looks. And we can very easily, using scikit-learn, do more complex regressions. A support vector machine is just two lines.
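A rough sketch of what those few lines tend to look like in scikit-learn; the hard-drive CSV and its column names are hypothetical stand-ins for the data used in the talk:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Hypothetical file and columns: numeric features plus a price target.
drives = pd.read_csv("hard_drives.csv")
X = drives[["capacity_gb", "rpm", "cache_mb"]]
y = drives["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ordinary least squares.
ols = LinearRegression().fit(X_train, y_train)
print("linear R^2:", ols.score(X_test, y_test))

# Support vector regression: the same two-line pattern.
svr = SVR(kernel="rbf", C=100).fit(X_train, y_train)
print("SVR R^2:   ", svr.score(X_test, y_test))
```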
32:01
It's just two lines to train, two lines to print a score, and probably, again, 20 lines to make a plot. But in the end, it's very easy to do, and we can get some results. We see that the results here are not much better than the results from least squares, just like a 5% improvement or something like that.
32:20
Which might mean a whole world in a business context, but it's actually not a lot. Very quickly, some classification problems. As an example, let's try to get our heads into how people are using a platform.
32:40
And here, again, I'm doing a real-world problem. I'm trying to get to know better the users of Import.io, the free tool, the free platform, plotting and dividing how they use our product. So I'm going to be looking into how much people use the platform, what volume of queries they do, and how often they do that usage or
33:04
that volume of queries, and I can try to divide that into clusters. That can tell me something that I didn't know about that data set and hopefully help me make better decisions in the future. We again load some stuff from scikit-learn, we load the data with pandas,
33:25
we do a quick model using MeanShift, which is one way to do clustering, one algorithm to do clusters. We plot it, and I don't like how it looks, because we have bands of stuff.
33:41
So basically, the only clustering that it has done is along one of the axes, which kind of doesn't sound right. So I say, yeah, let's do k-means. If you Google for clustering, most people do k-means, so let's try it. And we find basically the same thing.
34:00
And the issue here, which is very obvious to anybody who has done some clustering before, or even some machine learning before, but not to the real beginner, is that you cannot do this. This is absolutely wrong. You cannot be working with one axis that goes from zero to I-don't-know-what and another that goes from zero to one. That's never going to work, especially in clustering.
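As he explains next, the fix is to put both variables on the same scale before clustering; a minimal sketch with synthetic usage data (the columns and cluster count are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical usage data: query volume (0..huge) and usage frequency (0..1).
rng = np.random.RandomState(0)
usage = np.column_stack([
    rng.exponential(10_000, size=500),  # total queries per user
    rng.uniform(0, 1, size=500),        # fraction of days active
])

# Scale both columns to [0, 1] so one axis does not dominate the distances.
scaled = MinMaxScaler().fit_transform(usage)

# MeanShift from sklearn.cluster could be dropped in here the same way.
labels = KMeans(n_clusters=4, random_state=0).fit_predict(scaled)
print(np.bincount(labels))
```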
34:22
So we need to clean the data. I'm not going to do it here, but we basically just need to normalize the two variables that we were trying to plot, and then we just repeat the same thing. We now have two axes that go from zero to one. And we actually get some kind of clustering that makes more sense visually, but also when I go to the data, because if I now use this stuff
34:41
to see, okay, which user is this? I see, with real examples, that it makes a lot of sense. And one of these users can be, I don't know, the user who uses Python and has connected an application to our API and is doing millions of queries, versus the guy who is using the UI to do crawling without even knowing what
35:00
crawling is. And making that prediction might be very valuable, because you can implement it into your, I don't know, your help desk system. And the customer support guy working in your company can know ahead, when a support ticket comes in, whether that person is actually a very technical user or a less technical one, or has this kind of usage or that kind of usage.
35:20
And that will improve the experience for the user and the support that they get, and also the life of your friend at the support desk. The last thing that I'm talking about, very briefly, because we're running out of time, is a web page classifier, using a decision tree, which is another way to classify things.
35:41
In this case, the context is that I'm basically trying to know which kind of website a website is just by looking at very simple attributes of that website, and by type of website I mean classifying the content. So trying to know, okay, this is an e-commerce website, or this is a map,
36:01
or this is a jobs application board, or this is events data, things like that. For that, it's very easy again with scikit-learn: just two or three lines to make a decision tree, and also to plot it. We plot this thing here, and again I'm making a very naive mistake here,
36:22
which is when you see something like this, a decision tree is supposed to be simple to read and simple to interpret, simple to know what it's telling you. When you see something as big as this, it's because you're doing something very wrong. You're overfitting your whole data set into a lot of very small conditions
36:41
that drop into this huge list of categories and decisions to then make the classification. We can very easily change that, by doing a lot of things actually, but the simplest one is that you can just say, no, the maximum number of leaf nodes that I want is this. And then you get a much simpler decision tree.
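A sketch of that decision-tree step, with max_leaf_nodes capping the tree size; the synthetic features stand in for the simple website attributes described above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for "simple attributes of a website" and its category.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Capping the number of leaves keeps the tree readable and fights overfitting.
tree = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
print(tree.predict(X_test[:5]))  # prediction in one line
```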
37:03
You can read it and try to see if it makes sense, and you can make a prediction very easily, also in only one line, with your test data, and see how it actually works out. And that's it. The recap: always know what problem we are trying to solve. Clean your data and get it ready to use.
37:20
Beware of very common problems like overfitting, I tried to show an example of that, or normalization of your data, I also tried to show an example of that. And always try to have an output which is something actionable, something where you say, okay, we finished this analysis, and now we need to change this in our business.
37:42
Now we need to change how we do support with our people, or how we are doing this in our product, or how we are dealing with this data. If there is no such action, basically the whole thing has failed, and you need to learn from that cycle, go into the loop again, and make it better. So that was it. Just telling you that we are hiring a lot at Import.io.
38:03
So there are a lot of different positions: DevOps, front-end, QA, even Python with a lot of data connotations in the role. So anyone who wants to talk about that, or about data science, or about Python, or about web scraping, I will be here for the next few days.
38:23
And I will be very happy to engage in any conversation. Thanks for your attention.
38:42
Do we have any questions? I've just seen that you jumped over the abyss in the adventure cycle.
39:04
The abyss, like the death and the rebirth. Is there something like that in data science too? In the hero cycle, at the beginning. And the cycle, sorry, what was the question around the cycle? I cannot hear you very well. You didn't reference the abyss, the rebirth, at all.
39:23
The what? The rebirth and the abyss, like the death of a friend. Huh? Yeah. You're referring, sorry.
39:40
You're referring to this, the very bottom. Oh, sorry, now I know what you mean. Yeah, I didn't refer to that. But I think that's precisely the moment, I have actually worked through all the things there, so I have the metaphor very well in my head. And the abyss basically is that moment of realization where you know what kind of
40:03
problem you are really trying to solve from a mathematical point of view. So, what algorithm is going to work? Because when we are just doing exploratory data analysis, or when we are doing the data cleaning, we might not even know at that moment, for a complex problem. We might not know at that point if we are going to do a regression or a classification. We might not.
40:20
And even less, what kind of algorithm is better for that classification problem or for that regression? That's the point of the revelation basically, when you think you have an idea of how to solve that. And then you just need to apply it, which is much easier.
40:45
And what was your experience with scikit-learn when you were a beginner? Do I have to keep trying different parameters until I get a result? Or do I have to know the internals of the algorithms?
41:04
Scikit-learn is very easy to use. Basically, on the documentation page, there is a tutorial on how to approach it, in the sense of: depending on what kind of problem you have, what algorithm do you need to use? It's like a great map into how to do machine learning with it.
41:21
And once you know what algorithm you're going to use, which is usually just a few lines of code to put in there, there is the question of knowing which parameters you need to use. If we are objective, it's a very hard problem, and basically the whole thing around this is how you fit those parameters. But from a simplistic point of view, it's not so much.
41:43
You can basically just use some defaults, or some almost random values. You can basically do a loop and iterate through different parameters and see how it looks. You always need to have an output from your model which is either a plot or a prediction, or even better, both of them.
42:03
So you can see, okay, I put in these parameters, this is my output, do I like it or not? Let's change the parameters until we fit something that we think makes sense. That would be a simplistic approach to how to change parameters and fit the right things using scikit-learn.
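A sketch of that simple loop-over-parameters idea (scikit-learn's GridSearchCV automates the same pattern); the data and the parameter values here are purely illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Naive parameter sweep: try a few values of C and watch the held-out score.
for C in (0.1, 1, 10, 100):
    model = SVR(C=C).fit(X_train, y_train)
    print(f"C={C:>5}: test R^2 = {model.score(X_test, y_test):.3f}")
```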
42:22
Thanks. All right, do we have one last question? No. Thank you, Ignacio, for a good talk. Let's all head out for the apparently fabulous lunch. Thank you very much.