
Top 15 Python Tips for Data Cleaning/ Understanding


Formal Metadata

Title
Top 15 Python Tips for Data Cleaning/ Understanding
Subtitle
With two bonus tips!
Number of Parts
130
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared, also in adapted form, only under the conditions of this license.

Content Metadata

Abstract
Data cleaning is one of the most important tasks in data science, but it is unglamorous, underappreciated and under-discussed. Common tasks involved in data cleaning include (but are not limited to):
- Merging/appending
- Checking completeness of data
- Checking for valid values
- De-duplication
- Handling of missing values
- Recoding
Most, if not all, of the time, the datasets we have to analyze are unclean, i.e. they are not necessarily complete, accurate or valid, and this will impact the accuracy of our analysis if we do not clean them properly. This talk covers how to perform data cleaning and understanding using primarily pandas and NumPy. If you're new to data analytics/data science and are interested in how to use Python to perform analysis, or if you're an Excel user hoping to move to Python, this talk might be for you. Participants should be at least familiar with the basics of Python programming.
Transcript: English (auto-generated)
So, this talk is the top 15 Python tips for data cleaning and understanding. Good luck, you're ready to go. Thank you.
Hi everyone, thanks for tuning in. Today I'll be talking about what I do for 70 to 80% of my time at work, because as a data scientist, as someone who works on data science projects, 70 to 80% of our time is spent on data cleaning and data understanding. I feel this topic is underrated. A lot of people don't talk about it, maybe because it's boring, because you don't see any fantastic product coming out of the process itself. But whenever we are learning how to work with data, how to understand and analyze it, these are the commands that have been really useful for me, and I hope that by the end of this talk you'll find these commands and tasks useful for you too. You can think of this as a consolidation of my experience, and of course I'm happy to hear from you if you have experience working on different projects.
If there's anything you feel I have missed out, I'm happy to add it in as well. A little bit about me: I'm currently working at a global media agency called Essence, where I'm helping to build the data architecture. We are building a data warehouse and laying the foundation for more analytics capabilities. That's where we do ETL (extract, transform and load), and there's a lot of data cleaning involved, as well as data understanding: trying to make sense of which data are valid, which data need cleaning, and how to clean them. Outside of work, I run a data science blog called Data Double Confirm; if there's a little time towards the end, I'll share a bit about it. I have actually been working with data since my undergrad days.
Back then in school, the datasets we worked with were mostly clean, partly because the focus of the curriculum was on statistical analysis and modelling, so data cleaning and data understanding got slightly less attention. Most of my experience with cleaning data comes from work, right on the job: learning about the data itself, how it's collected, and the rules governing it, for example why certain data are collected at a certain frequency, or why they are coded in a certain way. A lot of this depends on the business context, and that's something we might not get to work with in school; academic datasets are not really business-centric, which is why they can be much cleaner than the real-world data we eventually work with. Over seven years of working with real-world data, and also with open data, I kept going back to the same Stack Overflow answers to find out how to do certain tasks and operations, which is why I eventually came up with this framework.
It's a process, a guide, which I'm going to walk you through as 15 tips. So without further ado, I'll give a high-level overview of the 15 tasks first, before going into the code. Whenever we get a dataset, we're interested in knowing what variables are in it, so usually the first thing we query is the column names. Subsequently, we also want to know whether Python loaded the data correctly. This can be in various UIs, say Spyder, or, in my case, Jupyter Notebook: when I read the data in, I want to check that all the records have been read in.
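As an illustration of these first two checks, a minimal sketch; the file and variable names are my own assumptions, not the speaker's exact notebook code:

    import pandas as pd

    # hypothetical file name for one of the mock datasets
    df_campaigns = pd.read_csv("campaigns.csv")

    print(df_campaigns.columns.tolist())  # tip 1: what variables are in the dataset
    print(df_campaigns.shape)             # tip 2: (rows, columns), did all records load?
    df_campaigns.head()                   # eyeball the first few rows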
Next, I want to understand what the data types of my variables are. When you read the data in, or to be more specific, when you're using the pandas library, the library automatically tries to detect the data type of each column. It's smart in a way: if a variable holds strings, it's detected as object, and integers and floats are likewise detected accordingly. But in certain cases it can go wrong, which is why we also want to check whether the data types are correct.
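A quick way to run this check, reusing the hypothetical df_campaigns from above:

    df_campaigns.dtypes  # pandas' guess per column: object, int64, float64, ...
    df_campaigns.info()  # dtypes plus non-null counts in a single view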
Next, sometimes we want to see the unique values of each variable, and this is really one way to identify unclean data. For example, for a particular categorical variable, say gender, you might expect only "female" and "male". But if during the data collection process the data isn't captured through a system, if it allows manual entry, you can end up with people keying in "M" or "F". "male" and "M" refer to the same thing, but they are currently treated as two separate categories in terms of unique values. This is where we eventually want to do some correction, and that's why we need to know what unique values there are.
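A check along these lines would surface the stray codes; the gender column here is from the speaker's example, not the mock datasets:

    df_campaigns["gender"].unique()                    # e.g. ['female', 'male', 'F', 'M']
    df_campaigns["gender"].value_counts(dropna=False)  # frequency per category, incl. NaN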
Next, for continuous variables, we want to understand the range of values we have for them. In certain cases a variable can only take positive values, but somehow its distribution shows a negative value. That's probably an incorrect input, or something that happened along the way during data collection, and it's something we want to investigate further, which is why we do this task.
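A minimal sketch of this range check, assuming a second dataframe df_metrics with a numeric impressions column (names are illustrative):

    df_metrics["impressions"].describe()   # min, max and quartiles of a continuous variable
    (df_metrics["impressions"] < 0).sum()  # impressions cannot be negative: count offenders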
Subsequently, we also get counts of values by different groups, by different levels. Instead of univariate checks, just getting the unique values at one level, you overlay another level and get the counts. This is more related to data understanding.
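A sketch of grouped counts, assuming market and vertical columns as in the speaker's later example:

    df_campaigns.groupby(["market", "vertical"]).size()   # rows per market x vertical
    df_campaigns.groupby(["market", "vertical"]).count()  # non-null counts per column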
Then, in certain cases you want to rename columns before you do any merging or appending, which is why I also included this as a tip. And there are cases where people input values in an Excel spreadsheet coded as currency, with a dollar sign and commas, so when you read the dataset in with pandas, the variable is identified as object rather than the integer or float you'd like it to be. In these cases we want to remove the dollar sign and the commas from the variable.
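Both steps might look like this; the column names are assumptions for illustration:

    # rename a column so it matches its counterpart in another dataset
    df_campaigns = df_campaigns.rename(columns={"campaign name": "campaign"})

    # strip the currency formatting so the column can become numeric
    df_campaigns["spend"] = (
        df_campaigns["spend"]
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
    )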
Relating back to data types, whenever we want to convert a variable to its correct format, there are cases where we convert string to numeric and string to date, so that's one of the tips I'm going through as well. In other cases you want to replace values with other values; going back to the male and female example, you want to replace "M" with "male" and "F" with "female".
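A sketch of these conversions and replacements, with assumed column names:

    # string -> numeric and string -> date
    df_campaigns["spend"] = pd.to_numeric(df_campaigns["spend"], errors="coerce")
    df_campaigns["start_date"] = pd.to_datetime(df_campaigns["start_date"])

    # replace values with other values
    df_campaigns["gender"] = df_campaigns["gender"].replace({"M": "male", "F": "female"})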
I'm also including a slightly more complex transformation here: identifying which variables are similar or different across datasets. This is a way to understand what keys we can merge on, if we want to, and also, if you do a left or right join, which records are going to be dropped because they appear in only one dataset rather than both.
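One way to compare variables across datasets is plain set arithmetic on the column names; df_h1 and df_h2 stand in for the two metrics files:

    cols_h1, cols_h2 = set(df_h1.columns), set(df_h2.columns)
    print(cols_h1 & cols_h2)  # variables in both: candidate merge keys
    print(cols_h2 - cols_h1)  # only in H2, e.g. clicks and measurable impressions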
Similarly, if you have a lot of data collected across time, you want to append datasets to make your analysis more robust and more meaningful. Say you have quarter one, quarter two, quarter three, and each quarter the same kind of data is collected; every time you get a new dataset, you want to add them together. This is very common when working on data science projects, when you have that much data to work with.
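The talk demonstrates this with DataFrame.append, which has since been removed from pandas; pd.concat is the current equivalent:

    # stack the two halves of the year into one dataframe
    df_metrics = pd.concat([df_h1, df_h2], ignore_index=True)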
And if you are appending a lot of data, there can be cases of duplication. These are the times when we want to de-duplicate our data, and there are different ways to do it, such as choosing to drop the earlier record. Pandas can handle this very easily.
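A sketch of keeping only the latest record per key; the subset column is an assumption:

    df_metrics = df_metrics.drop_duplicates(subset=["campaign"], keep="last")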
Next, similar to appending: say you have data from different sources and you want to combine them, for example demographic data from one database and transaction data from another. You want to match them together based on a certain key, a customer key for example. That's something very useful as well, which I'm also going to touch on.
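Using the speaker's demographic/transaction example, a merge might look like this; the names are hypothetical:

    df_combined = pd.merge(
        df_demographics, df_transactions,
        on="customer_key",  # the shared key to match on
        how="left",         # keep every customer, even those without transactions
    )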
The last tip before the two bonus tips is about recoding. In these cases we want to change, or rather include, a new variable based on existing variables, and pandas allows us to do that very easily as well.
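A minimal recoding sketch with NumPy, deriving a new flag from an existing column; the threshold and names are invented for illustration:

    import numpy as np

    # new variable based on an existing one
    df_campaigns["big_budget"] = np.where(df_campaigns["spend"] > 10_000, "yes", "no")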
There are two other tasks, the bonus tips, that I do sometimes, not super often, but I think they are very interesting and very useful as well, so I've included them. To give a bit of context on what we're going to go through later, I've created a scenario that is relevant to my current work at Essence.
We want to investigate certain factors behind campaign success, advertising campaign success, or do some post-campaign analysis, and we have data from different sources. We have data related to the campaign set-up, in terms of which market it's run in, how many days it runs, that kind of thing. We also have viewability metrics, relating to impressions and clicks. And then we have results from the different platforms that help us run brand lift studies. With data from so many different sources, there are going to be common issues, such as inconsistent naming of variables or fields across datasets, where columns refer to the same thing but are termed differently in different datasets, and also inappropriate data formats, or invalid, duplicate or missing values. All of these relate to how clean the data can be.
The materials for today's talk can be found here; I'll be opening up this Google Drive later on, so you can check it out. The two libraries I'm going to go through today are mainly pandas and NumPy. As you can see if you search the web for learning Python for data science, these are the two most essential libraries you need. Before you go into fancy modelling or deep-learning algorithms, you first need to be able to process and clean the data, and these are the two libraries that lay the groundwork. Pandas is built on NumPy, and NumPy is basically a library that handles array-based data structures.
I have created three datasets; they are all mock data, and this is a preview. The first dataset, campaigns, has information about the different advertising campaigns: you have spend, and information on which market and platform they are run on. The other two datasets relate more to viewability metrics: impressions, and different kinds of impressions, such as measurable impressions, collected across different periods. H1 you can think of as the first half of the year and H2 as the second half, with more data being collected in the second half. The thing is that different teams can collect the data differently, so there can be some inconsistencies here and there when it comes to naming. If you look here, we have "campaign name" in one dataset but just "campaign" in the metrics dataset, yet they refer to the same thing. All of this reflects the data challenges I face at work; due to confidentiality, I'm creating mock, fake data, because we can't share the data we have.
For the rest of my talk today, I'll be using Jupyter to go through the code. Jupyter Notebook is a very friendly UI, and if you go to the drive later on, there's an HTML version as well. If you just run through the notebook, you can see all the different commands, structured and tied to the task numbers I went through just now in my presentation. Very easily, after you run the Python code, the output shows you exactly what column names you have in each dataset. Subsequently, you can also run .shape. All of these commands are tied to either NumPy or pandas, and if you import them at the start, you can run them straight away. By default, if you install Anaconda, the notebook and pandas come bundled within the distribution itself, so you don't need to run any further pip install for NumPy or pandas.
One thing I want to highlight is that all these commands are repeated for all three datasets, so you'll keep seeing the same thing, but really it's a process of understanding our data: whenever we receive a dataset, we go through the same process, the same treatment, for every dataset. One thing to note here: I've created a scenario in my dataset whereby people record the platforms in different ways. "Facebook" and "FB" are actually the same thing, and for YouTube you can have people not capitalising the T, or just abbreviating it as "YT". They all refer to the same thing, so this is something we want to clean as well.
Here we can make use of the describe function to look at variables that are numeric in nature, and for variables that are categorical in nature we can make use of value_counts. Similarly, if you want a more detailed breakdown by different levels, here by market and by vertical, you can make use of the .count() function. And instead of outputting all the columns, which all show the same numbers because there are three records behind them, we can choose just one particular column to output. The numbers here are the same as what you see at the top.
In cases where we want to rename our columns, this is how we do the renaming, directly replacing the names in the dataset. Then, to remove symbols in values, we can make use of the replace function; now the spend data is clean, since we've removed those unnecessary symbols. Similarly, for converting string to numeric and string to date, there is the to_numeric function in pandas. Here, the part of the code that says errors equals coerce means that whenever there is a missing value, or a format that cannot be converted to a number, it will be output as missing. And here we want to clean our values for Facebook and YouTube, and again we make use of the replace function.
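To make the coerce behaviour concrete, a small standalone example, plus a mapping along the lines of the platform clean-up described above:

    import pandas as pd

    raw = pd.Series(["1,234", "", "abc"])
    pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")
    # -> 1234.0, NaN, NaN: unparseable values become missing instead of raising an error

    # harmonise the platform labels
    df_metrics["platform"] = df_metrics["platform"].replace(
        {"FB": "Facebook", "YT": "YouTube", "Youtube": "YouTube"}
    )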
In this particular line of code, what I'm trying to do is add underscores between the words in the values of the different columns. I'm overwriting this campaign column because the keys I want to match on have to be corrected first, before I can match with the other datasets. Here I'm looking at which variables are similar or different across datasets: the column name shared across all three datasets is campaign, and, just to go back a little, clicks and measurable are present in metrics H2 but not in metrics H1, which you saw above.
Concatenating, or appending, is very straightforward: you just make use of the append function, and now you can see there is a total of 36 rows. In fact, when I created these datasets, I deliberately included some duplicates, so here I am keeping the last, the latest, record; I don't want the earlier records, which are unclean to me. Next we do some merging across the three datasets. Then I want to add a new column called days, which is the number of days between the start and end date, so that I can do some further checking, like identifying different priority groups, and then I do some recoding based on the number of days. This is where I cover recoding.
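A sketch of the days column and a recode based on it; df stands for the merged dataset, and the bucket edges and labels are my own illustration:

    # number of days between start and end date (both already converted to datetime)
    df["days"] = (df["end_date"] - df["start_date"]).dt.days

    # recode the continuous variable into groups
    df["duration_group"] = pd.cut(df["days"], bins=[0, 30, 90, 365],
                                  labels=["short", "medium", "long"])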
Next, this particular one is a new package which you have to install, and it's very powerful, I feel. It gives us an overview of all the variables, their distributions and unique counts. From here we can get a sense of the data: for example, you can see that days has negative values, which is something wrong, so from here you can identify unclean data. It also gives you a correlation matrix, which I think is really useful if you want some first-hand preliminary understanding.
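The talk doesn't name the package here, but the description (per-variable distributions, unique counts, correlation matrix) matches pandas-profiling, now published as ydata-profiling; a sketch:

    # pip install ydata-profiling  (formerly pandas-profiling)
    from ydata_profiling import ProfileReport

    ProfileReport(df).to_notebook_iframe()  # one-command overview of every variable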
It shows the relationships between variables, though of course some of the correlations here might not make sense, because this is mock data. In terms of imputing missing values, if I'm interested in knowing which rows contain missing values, I can first filter them using this particular code. Then I impute the missing values based on the mean proportion of measurable impressions to impressions.
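A sketch of that ratio-based imputation; df is the merged dataset and the column names are assumed:

    mask = df["measurable_impressions"].isna()
    df[mask]  # in a notebook this displays the rows with missing values

    # mean proportion of measurable impressions to impressions among complete rows
    ratio = (df.loc[~mask, "measurable_impressions"] / df.loc[~mask, "impressions"]).mean()
    df.loc[mask, "measurable_impressions"] = df.loc[mask, "impressions"] * ratio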
After the imputation, I have all these values filled in; they were originally missing, as you can see from the top. So let me jump back to my slides, just to wrap up. There are actually more exercises relating to Python that you can find on my blog if you're interested, so do check it out. Thank you.