
Fuzzy Matching


Formal Metadata

Title
Fuzzy Matching
Subtitle
Smart Way of Finding Similar Names Using Fuzzywuzzy
Title of Series
Number of Parts
132
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Matching strings is one of the first natural language processing problems humans have encountered since we started using computers to handle data. Unlike numerical values, which can be compared with exact logic, it is very hard for a computer to say how alike two strings are. One may compare them character by character and get an idea of how many characters in a pair of strings are the same. Unfortunately, in most applications we need the computer to perceive strings the way we do, and therefore we have to use fuzzy matching. Fuzzy matching on names is never straightforward, though; the definition of how "different" two names are really depends on the case. For example, with restaurant names, matches on words like "cafe", "bar" and "restaurant" are considered less valuable than matches on other, less common words. Also, do we consider company names that match partly (like "Happy Unicorn company" and "Happy Unicorn co.") to be the same? In the first half of the talk, the Levenshtein distance, a measure of the similarity between two strings, will be explained. Different functions in Fuzzywuzzy, like "partial_ratio" and "token_sort_ratio", will also be explored and compared. It is very important to understand our tools and choose the right one for our task. In the second half, we will tackle the example problem of matching company names. We will show that besides using Fuzzywuzzy, we also have to handle problems like finding and avoiding matches on common words, and speeding up the matching process by grouping the names. Combining all the tricks and techniques demonstrated, we will also evaluate how efficient this method is and what its advantages are. This talk is for people at all levels of Python experience who would like to learn a trick or two and be able to solve similar problems in the future. The theory of how the library works will be explained, and it is easy to pick up even for beginners.
Transcript: English(auto-generated)
So, yeah, thank you very much for joining my session right after lunch, welcome. This talk will be beginner-friendly, so I won't dive deep into the details of how the package works. If you are after something really technical, I'm sorry, this might be lacking, but I will tell you how I use the package and what the theory is behind how I use it, just some tips for people who may be using it in the future. It's PyData style, you know. Who here identifies as a data scientist or an analyst working with data? Yeah, great. So, I hope you will find this useful. That's my Twitter handle there, and my GitHub, and you will find the code on GitHub, actually. Also, remember to tag EuroPython and please rate my talk afterwards, thank you.
Okay. And then, yeah, I'm Cheuk, or you can call me Sherry, as my friends do. I'm working at Hotelbeds, and I'm organising an AI club for gender minorities in London, so yeah, I'm based in London. We are trying to promote gender equality in technology, trying to encourage and empower gender minorities, like many women who feel there's not enough support for them in the community, so we try to help.
Also, I'm a member of Python Sprints, so if you are, like me, based in London, next week we will have a sprint on the pandas documentation, which is for gender minorities. So if you are a minority like me and you want to start contributing to pandas, this is for you, and we will have other events for other libraries as well. So, if
you want to contribute and you don't know how to get started, you're welcome to join; or if you're very experienced, you're welcome to help out as well. Okay, with all of that out of the way: why are we doing this, matching company names, string matching? Well, I had a problem at work. There is a list of companies, my company's clients, and I want to find similar names in it. But of course it's not limited to my company, right? Why would we want to match names at all? I had a kind of funny encounter on Facebook, actually. Somewhere in China, I saw this picture. It looks familiar, right? Okay, what about this one? It's giving credit to a company in the USA, obviously. And this one I have to skip through quickly, because there's a very nasty typo in it. So, if I have a list of companies and some of them are like those, I want to know: are they the one, the coffee shop where we drink coffee and buy cakes, are they the same? So, can you tell me what the result of those comparisons would be? Yeah. All false.
So, I can't do it like that, right? Everybody knows that. So maybe I can be smart, maybe I can use some string methods. Who knows which string method I could use to make the first pair match, with some modification? Yeah, upper or lower. How about the second one? There are different ways, but I found one, which I will show you. And then the last one, let's talk about the last one. I will show you what I came up with: I replaced the space and I just trimmed off the S. It's a bit stupid, right? What if it's not "Starbucks" but "Starbucks" with an exclamation mark? Then it doesn't work, right? So we need something else to find all these similar strings, and I know
this one, it's called fuzzy matching. It's a funny name; "fuzzy" makes you think of cute animals. But yeah, that's Wikipedia's definition. It's too long, and I don't like reading long paragraphs, but basically what it means is matching two strings that are not exactly the same, while finding a way to score how similar they are. So, voila, we have something very smart coming up. This is named after a Russian scientist, I hope I pronounce it right; if you speak Russian, you can tell me. Vladimir Levenshtein, is it correct? Levenshtein, yeah, the Levenshtein distance. I'm probably still doing it wrong, forgive me. Basically, it is an integer that tells you how many alterations I have to make to change a string A into a string B. If I delete one character, that is one change. If I change a character, for example from an A to an E, that is also one alteration, and so is adding a character. So, if that number, the Levenshtein distance, is bigger, then the two strings are more different. It sounds very intuitive, right? So, how can I compute it in a program? With dynamic programming, which sounds smart, right? Actually, I got this picture, it's not mine, I should give credit: it's from GitHub. There is a JavaScript library that has implemented this. Of course I'm using Python, so I'm not talking about that package, but if you find my slides, you can click on it and go to their GitHub repo to have a look.
So, this is the graph, or matrix, however you want to call it. You can see that the top left-hand corner is zero, because there's no change: going from an empty string to an empty string. Then we go character by character. The first characters are the same, both S, so you don't need to change anything there. But if I go one step to the right, I'm adding one character of "Saturday"; spelling out "Saturday" adds one character each step, so the first row is one, two, three, up to eight. And if I go down, it's adding characters of "Sunday", so you see the same going down. So the first row and column go from zero up to the length of each word. Now, what if I have two characters? From "Saturday" I have "S-A", and if I want to change it to the start of "Sunday", "S-U", I have to change the A to the U. So at the intersection of the A and the U there is a one: one character changed. That's what this matrix is, and it's dynamic programming: you solve the small problems first and expand them until the end. At the end, you find that the minimal change is three, and then you can work your way back up and find the minimum path of edits you have to make to change "Saturday" into "Sunday". You can see there are some substitutions and some deletions, which, yeah, you can see from the graph.
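The dynamic-programming idea described here is short to write down. As a minimal sketch (my own illustration, not the library's code), here is the Levenshtein distance in Python, keeping only one row of the matrix at a time:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b (the previous row of the matrix).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]

print(levenshtein("Saturday", "Sunday"))  # 3
```

Keeping the full matrix instead of one row would additionally let you trace the path back and recover which edits were made, as described above.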
So, I'm not talking about that repo, because that is in JavaScript. I'm using Python, so I can use FuzzyWuzzy, which is also a funny name, which I love. The reason I like it so much is that it's not just the plain Levenshtein distance for comparing two strings; it also provides some very useful functions. For example, simple ratio, which is the basic one, right?
So, that one is not magic, that's basic, but we also have partial ratio. For example, my name is Cheuk, but if you include my middle name, it's Cheuk Ting. So, "Cheuk" and "Cheuk Ting", are they the same person? Is it both me? If I use partial ratio, they match, because "Cheuk Ting" contains "Cheuk", which is also my name, so it gives a score of 100: they are the same. But if you use simple ratio, it won't be 100, because you have to add "Ting", four more characters, to get my full name, so it gives you a lower score. So those first two give you a matching score on the characters. We also have token sort ratio, which means each word is a token and the order can change. For example, my name is Cheuk Ting Ho, where Ho is my surname. But in Chinese we put the surname first, so my name would actually be Ho Cheuk Ting in Chinese. It's the same person, so I use token sort ratio, and then all three names are still the same; it's just the order that changed, because Western culture usually puts the given name first and then the surname, but in Chinese it's different. So it will give you 100. And then for token set ratio: usually you would skip the middle name, so putting just my first name and my last name gives "Cheuk Ho". Is that the same as me? It does the token comparison, and if one set of tokens is a subset of the other, it will also pass and give you a score of 100. So that's very useful; which one really helps you out depends on the use case.
So, you can check out the GitHub repo. It's by SeatGeek. It's a very, very popular library; I used it and I really like it. So you can go there and check it out. They also have a blog post about the difference between these four functions. So, yeah.
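The four scorers discussed here are `fuzz.ratio`, `fuzz.partial_ratio`, `fuzz.token_sort_ratio` and `fuzz.token_set_ratio` in FuzzyWuzzy. As a rough sketch of what each one does, here are simplified stand-ins built on the standard library's `difflib` (the real library scores with Levenshtein distance, so exact numbers differ; the names in the examples follow the talk):

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Simple ratio: whole-string similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def partial_ratio(a, b):
    """Best score of the shorter string against any same-length
    slice of the longer one, so 'Cheuk' matches inside 'Cheuk Ting'."""
    short, long_ = sorted((a, b), key=len)
    if not short:
        return 100 if not long_ else 0
    return max(ratio(short, long_[i:i + len(short)])
               for i in range(len(long_) - len(short) + 1))

def token_sort_ratio(a, b):
    """Sort the words first, so 'Ho Cheuk Ting' matches 'Cheuk Ting Ho'."""
    sort_words = lambda s: " ".join(sorted(s.lower().split()))
    return ratio(sort_words(a), sort_words(b))

def token_set_ratio(a, b):
    """Compare against the shared words, so a subset like
    'Cheuk Ho' still scores 100 against 'Cheuk Ting Ho'."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(ta & tb))
    sa, sb = " ".join(sorted(ta)), " ".join(sorted(tb))
    return max(ratio(common, sa), ratio(common, sb), ratio(sa, sb))

print(ratio("Cheuk", "Cheuk Ting"))                        # lower than 100
print(partial_ratio("Cheuk", "Cheuk Ting"))                # 100
print(token_sort_ratio("Cheuk Ting Ho", "Ho Cheuk Ting"))  # 100
print(token_set_ratio("Cheuk Ting Ho", "Cheuk Ho"))        # 100
```

The point of the sketch is the behaviour: `partial_ratio` forgives extra text around a match, `token_sort_ratio` forgives word order, and `token_set_ratio` additionally forgives missing words.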
Okay, so my use case. Now I have the company data, but of course I won't use my company's clients' data, because that's confidential. So I downloaded an example dataset, from the openly licensed public database of all UK companies: if they have limited liability, they have to make this information public, so you can download it. There are a lot of companies in this country, so I only used Cambridge, because there are a lot of startups going on there, which is very exciting. But still, there are a lot of companies in that list, around 15K, so quite a lot. Okay, so trick one: if I just use FuzzyWuzzy and check all these names against each other, a lot of them will get a high score, because there are some words that all the companies use, right? I can set up my own whatever company, and everybody can set up a company with the word "company" in the name; it's very common, and "limited" is very common as well. So a match on those words is less meaningful. For example, with my clients there are lots of these, because my company is doing travel, hotel rooms and stuff, so a lot of my clients have "travel" in their name, which is less meaningful. So what I do is use a small trick: I build a count dictionary and look at the 30 most common words. This is also useful for other NLP work, it's a recurring theme, so it's very convenient; I just use the same idea here. Okay, and then another thing:
I came across a problem: remember, there are almost 16K companies, so that's a lot of names. If we match every single one of them with all the others, that's a lot of comparisons, and I don't want a big project, right? I don't want a GPU and a day of computation; no, that's not going to work, and Python is not super fast. So I use a trick. I want to match company names that are highly similar, and my thinking is: if somebody opens an account, types in a company name, makes a typo, sees it's wrong, and opens another account with the right name, then one person has two accounts with highly similar names because of a mistake in the first one. But I just assume they won't make the mistake on the first character, because that's very obvious: usually when you type something, you look at it, and if the first character is wrong, you catch it. So, yeah, that's the trick, and it's good enough for me. Okay, so, remember I found the 30 most common words, right?
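The first-character trick just described is a blocking step: only names in the same bucket ever get compared. A small sketch with made-up names:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample names for illustration.
names = ["Starbucks", "Starbuck", "Sunrise Cafe",
         "Happy Unicorn Co", "Happy Unicorn Company"]

# Under the assumption that typos never hit the first character,
# bucket by first character and only compare within each bucket.
buckets = defaultdict(list)
for name in names:
    buckets[name[0].lower()].append(name)

pairs = [pair for group in buckets.values() for pair in combinations(group, 2)]
print(len(pairs))  # 4 pairs instead of the 10 a full cross-comparison needs
```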
So, when I do fuzzy matching, I deduct 10 points for each of those words. It's like a game, right? Because you have this common word, I just deduct your points, because I check the score at the end, checking whether the pair has a high score. If a pair only has a high score because they share a common word, that's not a valid match, so I have to deduct the score to balance it out. Is it a very good trick? It may not be, but it works, and it's very simple and easy to implement.
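As a sketch, the two tricks together, counting words to find the common ones and then deducting points per shared common word, might look like this (the names, the top-2 cut-off and the penalty of 10 are illustrative; the talk uses the top 30 words, and `difflib` stands in for FuzzyWuzzy's scorer):

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical sample names; the talk uses ~16K UK company names.
names = ["Happy Unicorn Travel Limited", "Sunrise Travel Limited",
         "Cambridge Coffee Company Limited", "Moonlight Travel Limited"]

# Count every word and keep the most common ones as "less meaningful".
word_counts = Counter(word for name in names for word in name.lower().split())
common_words = {word for word, _ in word_counts.most_common(2)}

def adjusted_score(a, b, penalty=10):
    """Similarity score minus `penalty` points per shared common word."""
    base = round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())
    shared = set(a.lower().split()) & set(b.lower().split()) & common_words
    return base - penalty * len(shared)

print(common_words)  # {'limited', 'travel'} on this sample
```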
So, that's why I do it. I think in a lot of cases it will work. It sounds very simple, but it works. I also chose to use token sort ratio, because if somebody types in a name with the words swapped, the two names should still be highly similar, right? So I'm considering it word by word, and that's why I chose that one. Okay, so in the end, what did I find using that dataset? You can go to my GitHub and check it out; it's just a Jupyter notebook.
It's very simple, and it's an example that you can refer back to. So, the pairs that I caught: if they score more than 85, they are considered the same. So, what are they? Usually they are spelling mistakes, two names that differ by, for example, just an S, like "Starbuck" and "Starbucks", or by one missing L or one missing I, things like that. It's very easy to make those kinds of mistakes, because an I and an L are very easy to miss if you just look at them, so it makes sense as a human typo. It may not be, but that's what I suspect. There's also the case where one name is just another name with a number at the end. There I would suspect somebody having multiple accounts, right? For example, if I signed up for an account with my name as the username last week, then this week I can't have another one, so I just put a two at the end. That's the logic. There are also abbreviations; those could be intentional changes, so I wouldn't suspect they are necessarily human error. And there was one funny thing I found: some names actually look similar, like the one I have here, but they are not just different by one character. They could be two companies that coincidentally have very similar names. That's the interesting case that I would love to investigate.
You see, for example, if they were my clients, I would maybe ask my colleagues to talk to their clients: are these the same, do both accounts belong to you, or does somebody else coincidentally have a similar name? So we still need some human checking at the end, but I can show you that this hugely reduces the work needed to find out whether something is a typo or really two different clients. After I applied the matching, the pairs that got caught as highly similar are only about 1% of the total. Without it, among the 15,889 names, you would still have to go through the whole long list; of course my colleagues wouldn't call all of the clients, but even just looking through them is a lot. Now I only need to look at, and maybe investigate, 57 of them. So it really reduces the work that you or your colleagues have to do, and I think it's a really good trick. So, yeah, happy matching. I think I have a lot of time left, right? I went through it very quickly. So, yeah, just ask questions.
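Putting the pieces of the talk together, a rough end-to-end sketch of the approach (first-character buckets, token-sort scoring, a penalty for shared common words, and the 85 threshold) could look like this. The names and common-word list are made up, and `difflib` stands in for FuzzyWuzzy:

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

COMMON_WORDS = {"limited", "ltd", "company"}  # would come from the word-count step
THRESHOLD = 85
PENALTY = 10

def token_sort_score(a, b):
    """Token-sort similarity on a 0-100 scale (difflib stand-in)."""
    sa = " ".join(sorted(a.lower().split()))
    sb = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, sa, sb).ratio())

def find_suspects(names):
    # 1. Bucket by first character so we only compare within a bucket.
    buckets = defaultdict(list)
    for name in names:
        buckets[name[0].lower()].append(name)
    # 2. Score pairs, deducting points for shared common words.
    suspects = []
    for group in buckets.values():
        for a, b in combinations(group, 2):
            shared = set(a.lower().split()) & set(b.lower().split()) & COMMON_WORDS
            score = token_sort_score(a, b) - PENALTY * len(shared)
            if score > THRESHOLD:
                suspects.append((a, b, score))
    return suspects

names = ["Happy Unicorn Limited", "Happy Unicorn Limitd",
         "Happy Panda Limited", "Sunrise Cafe Limited"]
print(find_suspects(names))
```

On this toy list only the genuine typo pair gets flagged; "Happy Panda Limited" shares "Happy" and "Limited" with the others but drops below the threshold once the common-word penalty is applied.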
Thank you very much. So, we have time for questions. Yeah.
So, FuzzyWuzzy uses four metrics to estimate whether it's a match or not. What are, let's say, the coefficients or the priority of these four metrics? You mean the...? Like the token sort ratio?
Yeah, so, I can actually show you. They actually give it a score; it returns a score at the end, which is 100 if it's perfectly matched. There are actually examples. For example, here, the token sort matching, you can see, let me make it bigger. Yes, but how does it eventually generate this final score from those four metrics? Yeah, so basically, if it's the simple ratio, it computes the Levenshtein distance and then normalizes it into a score. I haven't really checked in the code, but it will give you a hundred if it's a total match, as you can see here. For example, a partial match will also give you a hundred if the matched part matches totally. And it's by proportion as well: if you have a longer phrase, the tolerance for differences is higher, because it's proportional. Oh, okay, so you pick which one you want to use? It's not like they eventually combine the four to generate one score? Yeah, you can choose which logic you want to use, because, for example, if I had applied the simple ratio in this case, it wouldn't be perfect, right? So it's the logic that you choose, yeah. Thank you.
Okay, we have another question. Hello, so I was curious whether Levenshtein only uses the letters. Does it work equally well in all languages, or are there some languages it has problems with, and do people use techniques like combining it with a dictionary of a language to get better results? Actually, it doesn't care what language it is; the idea is that it checks characters. So if it's not in English, if it's in French, it doesn't matter, right? If you have a word that is one character different, it will give you a Levenshtein distance of one, because there's one character of difference. It doesn't matter whether that's a change of a character, a deletion, or the addition of one character.
This is kind of more of a what-if question. Suppose I've got two super big tables of World of Warcraft players, from one server and another server, and I want to know if there might be matches, and let's assume these names are fuzzy. Not the best example, but let's assume that. You could do a full join and then check whether pairs are FuzzyWuzzy-matchable, but this won't scale very well unless you use some sort of heuristic. You already mentioned taking the first letter, for example, as a trick to make things faster. Any other tricks? For two tables of matching names, I haven't thought about something else at this level yet, but this already helps a lot, because, for example, if two names just have their words swapped in order, token sort already sorts that out for you. If you want it to be super quick, other than requiring the first character to match, maybe you can check the length as well. But there is a limitation: if the difference is not a changed character but an added or removed one, the lengths differ, so you can't require equal length. So there will be limitations. For me, my trick is more or less a patch, a quick way of filtering the candidates.
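The cheap pre-filters mentioned in this answer, a matching first character plus a bound on the length difference, can be sketched as below. The length check is sound for near-matches because the Levenshtein distance is always at least the difference in lengths:

```python
def maybe_similar(a: str, b: str, max_len_diff: int = 2) -> bool:
    """Cheap pre-filter before any fuzzy scoring: require the same
    first character and a small length difference. Levenshtein
    distance is at least abs(len(a) - len(b)), so a large length
    gap already rules out near-identical names."""
    if not a or not b:
        return False
    return (a[0].lower() == b[0].lower()
            and abs(len(a) - len(b)) <= max_len_diff)

print(maybe_similar("Starbucks", "Starbuck"))   # True
print(maybe_similar("Starbucks", "Moonbucks"))  # False, first letter differs
```

Note this filter is only safe under the talk's assumption that typos never hit the first character; a heavily reordered name would also be rejected before token-sort scoring ever sees it.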
So, yeah. Okay, another question? Hi, here we are comparing one to one, but what if you want to compare one to a set? Say you have 10,000 names and you want to figure out how rare or unique a specific name is compared to the whole set. Would you also use FuzzyWuzzy, or would you go with something else? So, if I have one name and I match it against the other thousands to see if there is a pair that is very similar, is that what you mean? No, actually, how would you measure how unique this name is compared to the whole dataset? Okay, yeah. It depends on which function you choose; the logic will be different. For example, if I choose token sort ratio, then names with the words in a different order are considered the same. Every single pair gets a score, right? So if, for example, 10 out of the 10,000 get a score higher than 90, then I would say the name is not unique. For it to be considered different from everything else, all of the scores need to be smaller than a certain threshold. So, yeah. I hope that answers your question. Do we have any more questions?
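The uniqueness check described in this answer can be sketched as: a name counts as unique if no other name in the set scores at or above the threshold. The sample names are made up, and `difflib` stands in for FuzzyWuzzy:

```python
from difflib import SequenceMatcher

def is_unique(name, others, threshold=90):
    """True if no name in `others` is fuzzily similar to `name`."""
    for other in others:
        score = round(100 * SequenceMatcher(None, name.lower(),
                                            other.lower()).ratio())
        if score >= threshold:
            return False
    return True

names = ["Happy Unicorn Ltd", "Sunrise Cafe", "Happy Unicorn Ltd2"]
print(is_unique("Blue Whale Consulting", names))  # True
print(is_unique("Happy Unicorn Ltd", names[1:]))  # False: 'Happy Unicorn Ltd2' is too close
```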
Okay. Yeah. So, before, you said you picked 85 as a threshold, if I remember correctly, right? Yeah, that's the threshold, yeah. So how do you know that matches with a score below 85 can be ignored? Did you run any tests to choose this parameter, 85, or how did you pick it? Yeah, I did run it a couple of times, actually. I didn't show it, but with 85 it gave me the 57 pairs that are considered matching, right? So I looked at them. I can also change it, for example to 75, and then it will give me, let's say, 200, and I can compare the two. Or, if I have prior knowledge that the mistakes won't be more than, say, 1%, then I can look at the number of matches and compare it to the total: is it less than 1% or more than 1%? So I still need to fine-tune it; it's not absolute. Okay, here we have another question.
Hi, did you ever try it on longer strings? Could you, for example, find a company name in a large text? Would the library scale to that? So, if you have a large text and you want to know whether a company name appears in this text, do you think you could use the library for that? Yeah, if it's not too big a paragraph, maybe you can use a sliding window, and then some partial ratio and so on, to see if the company name appears in that window. I haven't tried that out, but I do think the sliding window would work: if you have a paragraph and your company name consists of, say, three words, then you can take a window of, say, five words, slide it through, and see which windows have a high partial ratio, so that you can pick that chunk out and know that, yeah, it's a match. And so, the last question. Yes, have you tried phonetic algorithms? Because I think that would work very well for short names.
Because you only use it for short names, right? Not for a long corpus. Yeah, this one is for short names. So, what's the algorithm you mentioned? Phonetic algorithms. Oh, a phonetic algorithm, is it? Oh, yeah, I haven't tried that one, because I only tried this fuzzy matching, but I haven't tried that, yeah. Okay, and just to answer: maybe a phonetic method could be a good match for your question. And I think we don't have any more time for questions, so let's thank Cheuk once again. Thank you.