
Machine learning on non curated data


Formal Metadata

Title
Machine learning on non curated data
Subtitle
Dirty data made easy
Number of Parts
118
Author
Gaël Varoquaux
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
According to industry surveys [1], the number one hassle of data scientists is cleaning the data to analyze it. Textbook statistical modeling is sufficient for noisy signals, but errors of a discrete nature break standard tools of machine learning. I will discuss how to easily run machine learning on data tables with two common dirty-data problems: missing values and non-normalized entries. On both problems, I will show how to run standard machine-learning tools such as scikit-learn in the presence of such errors. The talk will be didactic and will discuss simple software solutions. It will build on the latest improvements to scikit-learn for missing values and the DirtyCat package [2] for non-normalized entries. I will also summarize theoretical analyses in recent machine learning publications. This talk targets data practitioners. Its goals are to help data scientists be more efficient when analyzing data with such errors and to understand their impact. With missing values, I will use simple arguments and examples to outline how to obtain asymptotically good predictions [3]. Two components are key: imputation and adding an indicator of missingness. I will explain theoretical guidelines for these, and I will show how to implement these ideas in practice, with scikit-learn as a learner, or as a preprocessor. For non-normalized categories, I will show that using their string representations to "vectorize" them, creating vectorial representations, gives a simple but powerful solution that can be plugged into standard statistical analysis tools [4].
[1] Kaggle, The State of ML and Data Science 2017. https://www.kaggle.com/surveys/2017
[2] https://dirty-cat.github.io/stable/
[3] Josse, Julie, Prost, Nicolas, Scornet, Erwan, and Varoquaux, Gaël (2019). "On the consistency of supervised learning with missing values". https://arxiv.org/abs/1902.06931
[4] Cerda, Patricio, Varoquaux, Gaël, and Kégl, Balázs. "Similarity encoding for learning with dirty categorical variables." Machine Learning 107.8-10 (2018): 1477. https://arxiv.org/abs/1806.00979
Transcript: English (auto-generated)
Thank you very much for the nice introduction. So today I will not talk about brain imaging. This is a new area of research that we've started in the team, on dirty data. And the reason we've started this is that, as we all know, data science is
80% of the time spent on preparing the data and 20% of the time spent complaining about the need to prepare the data. So let's address those 20% of the time. And really the thing is, once again, with modern machine learning tools such as scikit-learn,
machine learning is easy and fun and we like to do it, but the problem is really getting the data into the learner. And industry surveys show this. This was an industry survey by Kaggle a few years ago, and it asked, you know, what are the most blocking
aspects of running a data science project in your organization. Dirty data came on top, you know, above things like hiring the right talent. So dirty data: you know, seeing this we thought, well, let's tackle the problem. And when we thought let's tackle the problem, we didn't know what it meant. I'm not sure we know
these days. I guess everybody has their own dirty data problem, but at least we've understood a few things. And one thing that we've understood is that every machine learning research paper starts with "let X be a numerical matrix that lives in a matrix space". And if we're going to implement this, it's going to be, well, you know,
give me your data as a NumPy array. And we've always said, you know, sure, you're going to have to transform your data from the kind of data you have to the NumPy array, but that's your job, not ours. So yes, in real life the data,
best case, comes like this. Often it's a pandas data frame. So it's not exactly a numerical matrix. And the first thing is that we will need to transform the different columns in different ways to cast this to a numerical array. And I want to talk a bit about how to do this with scikit-learn, because scikit-learn has gotten much more pleasant in the last
few years to do this. But then we're going to hit a set of hard problems. And one of them is the fact that one of these columns is not a well formatted categorical column. And for machine learning it falls a bit between the cracks. And
that also raises problems. So the outline of my talk is going to be that I'll talk a bit about transforming columns with Scikit-learn. And here I just want to
emphasize a bit the things that are feasible with modern scikit-learn and that can make your life easier. And this is just mainly scikit-learn. Then I'll talk about the problem of dirty categories. And this is more of a research-y talk, even though we do have software that you can use. And then I'll talk about the problem of
learning with missing values. And this is more of a statistical talk, but there will be take-home messages. So, column transforming. The goal is to start with pandas data frames and come out with a well-formatted NumPy array that can easily be plugged into statistical tools such as
Scikit-learn. So it's a pre-processing problem. So often the way we get our data is we read it from a CSV file. So we could do this with pandas and we'll get a data frame that has different types and different columns. And our goal is going to be to convert all those values to numerical values.
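To make this concrete, here is a minimal, hypothetical sketch; the file name and column names are illustrative, loosely following the employee salaries example used in the talk:

```python
import pandas as pd

# Hypothetical file and columns, loosely following the employee salaries example
df = pd.read_csv("employee_salaries.csv")
print(df.dtypes)
# Typically something like:
#   gender                      object   <- categorical, needs encoding
#   date_first_hired            object   <- dates stored as strings
#   employee_position_title     object   <- many distinct, "dirty" entries
#   current_annual_salary       float64  <- already numerical
```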
And let's look at gender. So gender is a categorical column, and we're going to transform it to a numerical column. The standard way to do this is to use one-hot encoding. So in scikit-learn, we'll use sklearn.preprocessing.OneHotEncoder and then we'll call the fit_transform method on the column gender. It's going to output indicator columns with zeros and ones that indicate the different genders, okay? Now for dates, we could use the pandas datetime support. Pandas deals quite well natively with this kind of strings. It knows how to convert them to datetime objects. And once we have the datetime objects, we can take their value as a float; it's the time since the epoch (nanoseconds in pandas). So it's a numerical value that is reasonably well ordered, and hopefully we can learn from it, hopefully.
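A minimal sketch of both steps, with toy values and illustrative column names:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "date_first_hired": ["09/22/1986", "03/12/2007", "11/19/1999", "05/06/2014"],
})

# One-hot encoding: one indicator column of 0s and 1s per category
ohe = OneHotEncoder()
gender_encoded = ohe.fit_transform(df[["gender"]]).toarray()

# Dates: let pandas parse the strings, then take the underlying integer
# (nanoseconds since the epoch) as a numerical, ordered value
dates = pd.to_datetime(df["date_first_hired"])
date_numeric = dates.to_numpy().astype("int64").astype(float)
```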
So something I'd like to stress is that in Scikit-learn, we like to work with things that we call transformers. And if we look at the one-hot encoder, we can actually split the fitting of the one-hot encoder and the transforming. The idea is that during the fitting,
we're storing which categories are present in the data. And during the transforming, we're encoding this data accordingly. So this separation between fit and transforming is quite important because it avoids data leakage between the train and the test set when we're evaluating the pipeline. And we can also store the fitted
transformer and apply it to new data at predict time, for production, for instance. And it can be used with a bunch of tools in scikit-learn, such as Pipeline or cross_val_score, which is used to do cross-validation.
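A small sketch of this fit/transform separation on toy data:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_df = pd.DataFrame({"gender": ["M", "F", "F"]})
test_df = pd.DataFrame({"gender": ["F", "X"]})   # "X" was never seen at train time

# Fit on the train set only: this stores which categories are present
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(train_df[["gender"]])

# Reuse the same fitted encoder on new data: no leakage from the test set,
# and a category unseen at fit time is simply encoded as all zeros
X_train = ohe.transform(train_df[["gender"]]).toarray()
X_test = ohe.transform(test_df[["gender"]]).toarray()
```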
So for dates, it might be useful to shoehorn our pandas code into such a process. And for this, we can use the FunctionTransformer. We can define a small function that takes as input the pandas data frame, or the pandas column we're interested in, and returns as output a 2D array of numerical values. For this, it's just, you know, taking the code that we were doing with pandas, putting it in a proper function, and making sure that we're returning a 2D output. And then once we have this, we can use sklearn.preprocessing.FunctionTransformer, give it this function, and tell it that we don't want validation, because with validation it's going to try to check that the input data is a well-formatted numerical array, and it's not, so it will complain. FunctionTransformer can be a bit more clever: you can tell it how to inverse transform. It's a more sophisticated tool.
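A minimal sketch of such a function wrapped in a FunctionTransformer; the helper name and column name are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def date_to_float(column):
    # Parse the date strings and return a 2D array of floats
    dates = pd.to_datetime(column.squeeze())
    return dates.to_numpy().astype("int64").reshape(-1, 1).astype(float)

# validate=False: do not check that the input is a well-formatted numerical
# array, since we are deliberately feeding it raw strings
date_transformer = FunctionTransformer(date_to_float, validate=False)

df = pd.DataFrame({"date_first_hired": ["09/22/1986", "03/12/2007"]})
X_dates = date_transformer.fit_transform(df[["date_first_hired"]])
```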
And I won't go into details. What I just want to stress here is that it can be useful to look at the modern pre-processing documentation of Scikit-learn, because it has many useful tools
for this purpose. And once again, pipelines are good. Now, how do we put a pandas data frame in a pipeline and apply different transformers to the different columns? For this, we can use the column transformer object. The column transformer object will take a list of pairs of
transformers and selectors of columns. And selectors of columns can be, for instance, column names, okay? So here, with this code, I'm telling that I want to apply a one-hot encoder to the gender and employee position title columns and
my date transformer to the date first hired. And now I can call the column transformer on a data frame. It will do all the magic, and out comes a NumPy array. And so I can build complicated pipelines using this kind of pattern to get my raw data, at least my
raw data frame, in and then use Scikit-learn on this. So this is useful for cross-validation, for instance. The benefit really is that we can use all the tools in Scikit-learn for model selection, such as, for instance, we could pipeline this column transformer with a
fast gradient boosting classifier that's new in 0.21. And then just apply cross-validation on the raw data, okay? So if you're not using it, you should probably be using it. If you think it can be improved, file an issue.
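Putting the pieces together, here is a hedged sketch of such a pipeline; the column names, the date helper, and the target y are illustrative, and HistGradientBoostingClassifier is the fast gradient boosting introduced in scikit-learn 0.21:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score
# On scikit-learn 0.21-0.23 this experimental import is also needed:
# from sklearn.experimental import enable_hist_gradient_boosting  # noqa

def date_to_float(column):
    # Same helper as above: date strings -> 2D float array
    return pd.to_datetime(column.squeeze()).to_numpy().astype("int64").reshape(-1, 1).astype(float)

preprocessing = ColumnTransformer([
    ("one_hot", OneHotEncoder(handle_unknown="ignore"),
     ["gender", "employee_position_title"]),
    ("dates", FunctionTransformer(date_to_float, validate=False),
     ["date_first_hired"]),
])

# Pipeline: raw data frame in, predictions out
model = make_pipeline(preprocessing, HistGradientBoostingClassifier())

# df is the raw pandas data frame and y the prediction target (assumed to exist):
# scores = cross_val_score(model, df, y, cv=5)
```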
Now, if we do this on the example that I'm using, we're going to hit a problem. And the problem is with the employee position title. And really, the reason is that there are many, many different entries in this title.
For 10,000 rows, there are 400 unique entries. So that will lead to a bunch of different problems. Some of them are computational: it's just going to take a lot of time to run. But some of them are statistical. And the reason they're statistical is that we might have some rare categories.
There's only one instance of Architect 3 in the data set. We might have some overlapping categories. We have different instances of police officers. And the link between those instances is not obvious if we don't look at the string content and just consider these things as discrete categories. And
finally, it's a detail, but it's a real problem in practice. We might have new categories in the test set. So basically, one hot encoder doesn't work well at all with this kind of data. And sometimes we have this kind of data. So the standard practice to do this is to use,
to resort to data curation. Cleaning your data mostly relies on techniques from database normalization. And so one thing that we could do is feature engineering: we could try to separate the position from the rank. And maybe we could separate the position, the rank, and the department. So this would require us
building rules that we might apply in Pandas and strings to separate those things out. The problem is it's going to take a little while to build those rules. And they usually have to be handcrafted. Another related problem, for instance, in a different database. Here we have a database where we have company names and we have the same company
that's expressed under different names. Now that's a canonical problem of database curation, and it's known as deduplication or record linkage. And the goal being to output a clean database. Basically to merge those different entries, those different entities, and represent them as the same entity.
Now this is quite difficult to do in general without supervision. You usually need an expert that shows a set of mergers to have an algorithm learn how to do the mergers. And one problem is that it can be suboptimal.
Because here, the data set here, the challenge is to detect fraud with payments to doctors. And it's a real question on whether we should merge the Pfizer Hong Kong branch with the Pfizer Korean branch.
Maybe they should be considered as the same entity, and maybe not. That really depends on the question at hand. So, the problem with this view is that the goal is to output a clean database. Which is a question specific point of view. What is a clean database? And in general, it's something super hard.
So really these things, these ideas, are hard to make automatic and turnkey. And I'd like to claim that they are much harder than supervised learning. Supervised learning, so supervised machine learning, is a toolbox that works quite well as long as you have a supervision signal.
Database cleaning is a hard problem, and you will need a supervision signal, but that supervision signal is basically a clean database. So usually clean, the database cleaning, you first have somebody clean part of the database, then you learn rule from this, and then you clean the rest of the database. So our goal here is not database cleaning,
it's working directly on the dirty data, and doing good machine learning on the dirty data. Really the point being that the statistical questions, so the supervised learning problem, should inform the curation. And ideally we shouldn't even have to curate.
So the first work we did with Patricio Cerda, I should stress that this part is really the work of Patricio Cerda, who's doing a PhD in my group. So the first thing that we did is that we took one-hot encoding, and we relaxed it. And basically instead of having zeros and ones,
we added string distances between the representations of the categories, and we encoded with string distances instead of zeros and ones. And that really tackles the problem of new categories in the test set, because if there's a new category in the test set that's not represented in the train set, I can just look at the string distances to the categories in the train set. And it also allows us to link categories.
If, for instance, I have typos in my columns, which is something that does happen, the typos are going to give me very small string distances, and those two entries are going to look very similar. So
maybe the most well-known one is the Levenshtein distance. The Levenshtein distance is basically the number of edits that we need to do to one string to match the other. It's really a classic one. There's the Jaro-Winkler distance: it's the number of matching characters, renormalized by the number of character transpositions.
It's well used in the database community. And there's what I call the n-gram or Jaccard similarity. We define an n-gram as a group of n consecutive characters; so for instance, if I have London, the first n-gram will be L-O-N, the second n-gram will be O-N-D, the third n-gram N-D-O, and so on.
So we're basically taking all those n-grams, here these are 3-grams, and then to compute the similarity between two strings, we're looking at the number of n-grams in common between the two strings divided by the total number of n-grams.
Okay, so if the two strings are the same, they have all the n-grams in common. So this is one. So this is a similarity. If they're completely different, they have no n-gram in common. So this is zero, okay? So these are three classic string similarities.
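A minimal sketch of this n-gram (Jaccard) similarity in plain Python:

```python
def ngrams(string, n=3):
    # Set of character n-grams, e.g. "London" -> {"Lon", "ond", "ndo", "don"}
    return {string[i:i + n] for i in range(len(string) - n + 1)}

def ngram_similarity(a, b, n=3):
    # Jaccard similarity of the n-gram sets: 1 if identical, 0 if nothing in common
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

print(ngram_similarity("Police Officer II", "Police Officer III"))  # close to 1
print(ngram_similarity("Police Officer II", "Crossing Guard"))      # close to 0
```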
So because this is a Python conference, we're giving you a Python implementation. We have this software that we call DirtyCat for dirty category, and it allows me to put pictures of cats on my slides. It's crucial. It's available online, BSD license and everything.
It's something in between a research quality software and production quality software. I think it's reasonably good quality. It's not as high quality as scikit-learn, but it comes with documentation, examples, and everything. You can look at it. It also comes with
example data sets. And it provides similarity encoding. The SimilarityEncoder is just an encoder; it works like scikit-learn. You can instantiate it, saying which similarity you want to use, and then transform the column of the data frame you're interested in. So it's a drop-in replacement for one-hot encoding in scikit-learn.
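Here is a minimal usage sketch, assuming the dirty_cat package is installed; the exact constructor arguments may differ between versions:

```python
import pandas as pd
from dirty_cat import SimilarityEncoder  # pip install dirty_cat

df = pd.DataFrame({"employee_position_title": [
    "Police Officer III", "Police Officer II", "Crossing Guard", "Manager II",
]})

# Drop-in replacement for OneHotEncoder: each entry is encoded by its string
# similarity to the categories seen during fit (n-gram similarity by default)
encoder = SimilarityEncoder()
X = encoder.fit_transform(df[["employee_position_title"]])
print(X.shape)  # (4, 4): one column per category seen at fit time
```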
Now, I'll show you how it performs on real data, but before that, let me present another approach that has been around for quite a while, called the target encoder.
It's not known well enough. The idea being that we're going to represent each category by the average target. So for instance, we're going to represent the police officer 3 by the average salary of the police officer 3 in our database if we're trying to predict the salary.
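As a rough illustration of the idea (toy data, hypothetical column names), target encoding amounts to a group-by mean computed on the training data:

```python
import pandas as pd

# Toy data with hypothetical column names
df = pd.DataFrame({
    "employee_position_title": ["Police Officer III", "Crossing Guard",
                                "Police Officer III", "Manager II"],
    "salary": [70000.0, 25000.0, 74000.0, 120000.0],
})

# Each category is replaced by the average target observed for it
category_means = df.groupby("employee_position_title")["salary"].mean()
df["position_encoded"] = df["employee_position_title"].map(category_means)
```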
So this gives us a 1D representation of all the categories shown here. All our categories are embedded in one dimension, which is the average salary. So we have the average salary, and you can see that in the database the person who makes the least amount of money is the crossing guard, and
the person who makes the most amount of money is the manager 3, manager 2 actually. So by the way, this is maybe a bit surprising. The order in managers doesn't make sense, right? We have manager 3 who makes less money than manager 1, who makes less money than manager 2. Why is that? Because those are average
salaries, and we might have people with different levels of experience, or I don't know what. It is also telling us that this signal is not a perfect signal. It's a noisy signal, this embedding. But it's useful because it's embedding all the categories close by when they have the same link to Y. So that's helping us build a simple
decision function to do prediction from this representation, okay? Now it comes with drawbacks. The first one is it doesn't know how to deal with a new category. If you give me a category that I've never seen, I don't know the average salary, so I can't represent it. I can represent it by the average salary of everybody, but that's
losing a bit of information. And the other thing is it's absolutely not using the string structure of the category. So typos, for instance, it will not find the links between typos unless it sees enough of those typos to see that they basically link
to the target in the same way. So I'd like to say really it's a complementary approach to our approach. It takes a different point of view, and is very interesting, too. It's also available in DirtyCat because our goal in DirtyCat is not to
sell the methods that we developed, but to help solve a problem, which is dirty categories. So it's target encoder, oops, and I was editing this too late yesterday evening, and target encoder does not take a similarity argument. So Patricio Cerda did
numerical benchmarks on real-life data sets, here using seven real-life data sets to compare the different approaches, and we benchmarked linear models and gradient boosted trees. And what I'm showing you here is the average
rank of the different methods across the different data sets. So one would be that the method was always the best predictor across all the data sets. And so what you can see, and we, so there's more in the paper, we benchmarked many other methods, but I'm really giving the executive summary because many of the methods that we benchmarked were not helpful.
So what you can see is that target encoder helps: with gradient boosted trees, it helps compared to one-hot encoding. One thing that is not visible in those numbers is that gradient boosted trees do much better
than linear models, so I would advise you to focus on gradient boosted trees in practice. They're much more useful for this kind of data set. So target encoding helps a lot, and then in the similarity encoding, what we found is that the three-gram distance, the three-gram similarity,
was really the most helpful, and the others are not as helpful. So our take-home message is really that we can focus on similarity encoding with the 3-gram distance. Though it might be useful, for instance, to build a pipeline that stacks
both a target encoding and similarity encoding because these two objects capture different information in the data. And that's easy to do, by the way, with a column transformer. You just select the column twice and send it in the two different
encoders. Okay? Now, in practice, we're going to hit a problem, which is that in many, though not all, databases, the number of different categories grows with the amount of data. Here, that's the second work that we did with Patricio; we've gathered more data sets, and now we've moved to 17 data sets.
It's actually hard to find data sets that are not curated and with an open license. People do not like to share their non-curated data set. So please, please do share your non-curated data set. That's the only way we can develop better methods.
So what you're seeing here is across many data sets, as we increase the number of rows, the number of different entries that we're seeing in a given column, increases. And increases sometimes very fast, sometimes only
slightly faster, but it gives a problem because it means that if we're going to use the similarity encoder, we're going to blow up the dimension, and we're going to end up running gradient boosting on things that have a hundred thousand features, which is not only bad statistically,
but also will take a lot of time. So really the, oh, and yeah, so there's, this is, this is related to problems in, for instance, natural language processing, where as the corpus of the text gets bigger,
the number of different words that we see keeps increasing. So it's quite related to classical natural language processing problems. So, we need to tackle this. That's where we can't give this to you as a turnkey method that you can apply to large databases.
So both the similarity encoding and one-hot encoding are prototype methods. What I mean by prototype method is that they compare the data to a set of prototypes, and by default, it's all the prototypes on the training set. The challenge now is to choose a small number of prototypes to be able to scale.
So we can take all the training set. That's what we're taking by default. It blows up. We can take the most frequent, but it's a strategy that's easy to game. You can easily have a data set that breaks the strategy.
And one of the problems is that the most natural prototypes may not be in the training set. For instance, if my training set is made of big cat, fat cat, big dog, fat dog, I probably want to break this in big and fat, and cat and dog. And none of these original entries actually have
the right terms. Okay, so I need basically to break down my categories. So now I'll tell you how we estimate those prototypes. And the thing that is going to save us is that when those different strings grow, they have common information.
Here I'm showing you the growth in the number of three grams as I increase the number of strings, and what you can see is that it's a smaller growth than the number of different strings. And this makes sense because, for instance, if this dirtiness, this diversity of the string, is made
from typos, then typos actually modify a small fraction of the string. So yes, I will have new three grams, but most of the three grams will be in common. In practice, if I look at my data sets, you can really clearly see this, that the substrings are in common. For instance,
in this drug name dataset, I can see that I have many different versions of alcohol, but they're all versions of alcohol, so there's alcohol in common everywhere. In my employee salary problem, I have substrings that are really meaningful. Police is in common, officer is in common, technician,
senior. So the challenge is going to grab this information and capture those meaningful substrings. And for this, we're going to use techniques from topic modeling and a natural language processing, and we're going to apply topic modeling on substrings.
So what we're going to do is represent all the strings by their substrings, using an n-gram representation. Here I'm showing a 3-gram representation, but in practice we're doing something slightly more sophisticated than this: we're taking the 2-grams, the 3-grams, the 4-grams, and also the words, which we split on a set of
separating characters for which we have default values, but you can change them. So then we build a big matrix that represents each entry by its substrings, and then we apply matrix factorization on this.
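One way to sketch this idea with standard scikit-learn building blocks (not necessarily the exact model used in this work) is to count character n-grams and factorize the count matrix, for instance with a non-negative matrix factorization:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

entries = [
    "Police Officer III", "Police Officer II", "Police Sergeant",
    "Legal Services Manager", "Legal Secretary",
    "Information Technology Technician", "Engineering Technician",
]

# Big matrix: each entry represented by its character 3-gram counts
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
counts = vectorizer.fit_transform(entries)

# Factorize it: activations say which latent categories are present in each
# entry, components say which substrings describe each latent category
nmf = NMF(n_components=3, max_iter=500)
activations = nmf.fit_transform(counts)  # shape (n_entries, n_latent_categories)
components = nmf.components_             # shape (n_latent_categories, n_3grams)
```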
Really matrix factorization, what it's doing here is to say, I will separate this matrix in two matrices. One matrix that is what I call the descriptions of the latent categories, and it tells me what substrings are present in a latent category. And another matrix,
which is what latent categories are present in a given entry, okay? So I'm really factorizing in the description of latent categories, categories that I'm inferring from the data or prototypes, and how those are expressed in the data. So to give you an example of
the result: by the way, we're using the activation matrix, the one that expresses which latent categories are present in an entry, to represent the data. And so this is what I'm showing here. These are the employee position titles, and I've run the model with a
dimensionality of eight, and these are the loadings that are shown. So if you squint your eyes, what you can see here is that it has detected something like a technician,
like legal, police; it has detected those substrings, okay? And so one thing I'm not showing here that I should be is that we're using a heuristic to give a name to those columns, and the name is really the three words that are most represented in those columns.
So this is useful because it gives you, it's giving you feature names. We're encoding this with feature names. And so if we compare it to a similarity encoder, it's much more marked, much more present, much more interpretable. And then we can do data science, interpretable data science, and for instance, we can look at permutation importances of
gradient boosted trees, for instance, with the string, with the categories that were inferred from the data. And this is what I get here. So what I'm showing you here is that I've inferred from this messy data,
I've inferred latent categories that make sense and on which I can do an analysis and present it to you. And it also, by the way, predicts well. In the paper, we're showing that it gives you good prediction, okay? So you don't have to clean your data anymore.
So now I want to talk about one last thing, which is learning with missing values. So we've dealt with this non-formatted categorical data, and now we need to deal with the fact that some of our values are missing. And so why doesn't the bloody machine learning toolkit work on this? There is a fundamental reason: machine learning models in general
tend to need entries in a vector space, or at least a metric space, or at least an ordered space. It's just easier for machine learning to draw analogies if it knows links between data points, and a missing value fits nowhere in there. So it's slightly more than an implementation problem.
There's a fundamental problem there. There is a very, very advanced thorough literature on missing values in statistics, and let me summarize it really quickly for you. The canonical model is that we have, A, a
generating process for the complete data, and B, a random process that occludes the entries. This is really the conceptual model on which the classic results stand upon. And then there is a really classic
situation, which is known as missing at random, MAR, that says, hand-waving, that for non-observed values the probability of missingness does not depend on the non-observed value. This might seem a bit mind-blowing. If you look at the actual definition, it's even more mind-blowing
and people simplify it because it doesn't really make sense. And it's true, it doesn't really make sense. The reason there is this definition is that it allows a likelihood framework to prove, and that was proven by Rubin 40 years ago,
that maximizing the likelihood of the observed data while ignoring, marginalizing in technical terms, the unobserved values will give the maximum likelihood of model A, the model of the complete data generative process. Okay, so it means that if you are modeling your data, if you're doing classic statistics,
you're modeling your data with likelihood models that you believe, and you believe you have an occluding process, you can still solve the problem when you're in a missing at random situation. Missing completely at random is a special case of this situation, where the missingness is independent of the data, and
it's an easier, so it's a special case of missing at random, and it's easier to understand, and the theorem still applies. Now, conversely, if you're in a missing not at random, if you're not in this situation, then missingness is not ignorable. If you try to maximize the likelihood while ignoring the missing data,
you will have problems. In practice, what does it look like? I've shown you complete data; then missing completely at random, where I've basically sub-sampled (here I'm deleting my missing values, they're not in the data set); and then missing not at random, where what you're seeing is that we have some form of censoring process, and part of the data distribution is not well represented.
Okay? So this will give problems. Now, I would like to say that this classic statistical point of view is not of interest to us here, at least not completely of interest, and we shouldn't take those results as
fundamental results for machine learning. The two reasons, one is there is not always a non-observed value, for instance, what is the age of the spouse of people who are single? So even this assumption is broken in many, many, many data sets. And the second one is that we're not trying to maximize likelihoods, we're trying to predict.
Now, based on this, we can just do machine learning. But the bloody machine learning toolkit still doesn't work. I've given you theory, not practice. Okay, practice. I'll come back to this theory later. Practice. We can impute, and this goes back to the theory before.
Imputing means we're going to fill in the information. We're going to guess things for those values we haven't seen. And once again, there's a large statistical literature, but it's focused on in-sample testing, doesn't tell you how to complete the test set, and doesn't tell you what to do with the prediction. So let me
cover a bit the tools we have in scikit-learn. There is mean imputation, which is a special case of univariate imputation, and we can, for instance, replace the missing values with the mean of the feature. So this is done with the simple imputer. There is conditional imputation, the idea being that you're modeling one feature as a function of the other, and then you can learn predictive models across features,
and then you can predict missing values, okay? There are classic implementations in R, and we now have an implementation in scikit-learn that can do this with linear models or random forests or other things.
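A minimal sketch of both imputers in scikit-learn; note that IterativeImputer is experimental and needs an explicit enabling import:

```python
import numpy as np
from sklearn.impute import SimpleImputer
# IterativeImputer is still experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [5.0, 8.0]])

# Univariate imputation: replace each missing value by the column mean
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Conditional imputation: model each feature as a function of the others
X_iterative = IterativeImputer().fit_transform(X)
```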
The classic point of view tells you that mean imputation is a very, very, very bad thing because it will distort the distribution. So as you can see here, I've imputed the missing data with the mean, and you can see that we're collapsing the variance of the data along one direction. So we shouldn't be doing that.
Classic point of view. And there are conditions that are known as congeniality conditions on an imputation that tell you that a good imputation must preserve the data properties used by the later analysis step. Now, we've looked at supervised learning in this setting, and we've shown, we've proven that if the learner is powerful enough,
like a random forest or a gradient boosted tree, imputing both the test and the train set with the mean of the train set is consistent, in the sense that it converges to the best possible prediction. And the reason is, A, we're not trying to maximize likelihoods; B, the learner will learn to recognize the imputed entries and will compensate for them.
So the learner basically learns those biases in the distribution and fixes them. So we don't have to worry about the classical results. In practice, you can see it here. I'm comparing mean imputation and iterative imputation. And what we can see is that if I have enough data, they perform as well. If I don't have enough data,
then the iterative imputed is better. The notebooks are online, and the slides are online. The conclusion is when we have enough data, iterative imputation is not necessary. Mean imputation is enough, but when we don't have enough data, iterative imputation helps.
Now, it may not be enough. Imputation may not be enough, and here's a pathological example. Why? What I'm trying to predict depends only on whether the data is missing or not. Suppose I'm trying to predict fraud, and the only signal about fraud is that people have not filled in some information.
So this will fall into missing, not at random situations. And in such a situation, imputing makes the prediction impossible. Okay, so if I impute, I'm losing this information, and I can't predict anymore. So what's the solution? The solution is to add a missingness
indicator, an extra column that tells me whether or not the data was present, so I can impute, but also expose to the learner whether or not the data was present. And if I do this, so this is another simulation where we have specific censoring in the data, and if I do this, what you can see is
that both the mean and the iterative imputer are consistent. They converge to the best prediction. If there is the indicator, but the iterative imputer doesn't work well at all if there is not the indicator.
And also what we can see is that here, so adding this indicator, this mask, is absolutely crucial. And the other thing that we can see is that iterative imputation in this situation is actually detrimental because it's making it harder for the learner to see this missingness pattern.
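In scikit-learn, one way to expose this indicator alongside the imputed values is the add_indicator option of the imputers; a minimal sketch:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 6.0]])

# add_indicator=True appends one binary column per feature that has missing
# values, so the learner can see where the data was missing
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_imputed = imputer.fit_transform(X)
# First columns: imputed values; last columns: the missingness mask
```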
Alright? So basically we have two situations. One where the missingness is not informative, in which case the iterative imputer is better. One where missingness is informative, in which case actually iterative imputer can harm because it makes it harder for the learner to learn this informative missingness. Okay? Now to wrap up,
learning on dirty data: first take-home message, prepare the data via a column transformer. That's easy. Second take-home message, use gradient boosting. In my experience, it really works well on this kind of data. It's robust to all kinds of weird entries in the data. First thing you should try, probably.
Dirty categories: so we're interested in statistical modeling on non-curated categorical data. Please help us and give us your dirty data with a prediction task. It helps us benchmark what we do. It's very important. And we have similarity encoding and more work that's coming up really soon.
Supervised learning with missing data, mean imputation with a missing indicator is actually a pretty good choice. There are many more results in the paper. And in general, if you're interested in this area of research, we have this research project that we call dirty data where there is ongoing research and there will be more. Thank you.
Thank you, Gael. We have five minutes for questions. Please come to the microphones in the aisles. Thanks, liked the talk a lot. A little bit,
maybe not the best question for EuroPython. Is there a version of dirty cat also for R? And if not, do you think it would be easy to port it? Dirty cat should be fairly easy to code. Well, dirty cat. Dirty cat is several things and it will grow.
For target encoding, regarding the implementation, one of our colleagues, Joris Van den Bossche, who is also a pandas developer, found that there was a better way to do target encoding. So we're going to fix this; we're going to improve target encoding. But both target encoding and similarity encoding are fairly easy to code. Code the n-gram version for similarity encoding.
Don't bother about the other ones. But yeah, please do it. Go ahead. There's one in Spark. Hi, thank you for the talk. It was very interesting. I have two questions. The first one is why three? Why the n-gram number three? Did you test the other numbers or is three the
gold standard that everyone should use? No, three was more for didactic reasons. In practice, what we're using these days is the 2-grams, the 3-grams, the 4-grams, and the substrings that are separated by specific characters such as spaces. And this we did benchmark, but we only have 17 data sets.
So our benchmarks are not fully trustworthy. We need more benchmarks to, more data sets to do more benchmarks. And the second question is what if you have missing data in the dirty category? What if you do not know if it's a police officer or a janitor or something? Good question. Yeah, I forgot.
Yeah, I should have mentioned this. So missing data is more of a problem for continuous values. For categorical values, I would advise in general to basically just add an indicator to represent
the missing value as a specific value in your encoding, which could be zero zero zero zero by the way. Thank you. Very interesting talk, practical as well. Thanks. Do you have a plan to look into active learning at some point as well? I mean, I think on real-world problems that might be interesting.
That is not our research agenda; our research agenda is to put the human out of the loop. But it is true that active learning for database curation is extremely useful, and it probably complements what we're doing. Thanks. Please give a round of applause to Gael. Thanks so much.