
Fuzzy Matching


Formal Metadata

Title
Fuzzy Matching
Subtitle
Smart Way of Finding Similar Names Using Fuzzywuzzy
Title of Series
Number of Parts
132
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Matching strings is one of the first natural language processing problems humans have encountered since we started using computers to handle data. Unlike numerical values, which can be compared with exact logic, it is very hard for a computer to say how alike two strings are. One may compare them character by character and get an idea of how many characters in a pair of strings are the same. Unfortunately, in most applications we need the computer to perceive strings the way we do, and therefore we have to use fuzzy matching. Fuzzy matching on names is never straightforward, though; the definition of how "different" two names are really depends on the case. For example, with restaurant names, matches on words like "cafe", "bar" and "restaurant" are considered less valuable than matches on other, less common words. Also, do we consider company names that match partly (like "Happy Unicorn company" and "Happy Unicorn co.") to be the same? In the first half of the talk, the Levenshtein distance, a measure of the similarity between two strings, will be explained. Different functions in Fuzzywuzzy, like "partial_ratio" and "token_sort_ratio", will also be explored and compared. It is very important to understand our tools and choose the right one for our task. In the second half, we will tackle the example problem of matching company names. We will show that besides using Fuzzywuzzy, we also have to handle problems like finding and avoiding matches on common words, and speeding up the matching process by grouping the names. Combining all the tricks and techniques demonstrated, we will also evaluate how efficient this method is and what its advantages are. This talk is for people at all levels of Python experience who would like to learn a trick or two and be able to solve similar problems in the future. The theory of how the library works will be explained, and it is easy to pick up even for beginners.
Transcript: English(auto-generated)
So, yeah, thank you very much for joining my session right after lunch, welcome. This talk will be beginner-friendly, so I won't dive deep into the details of how the package works. If you are after something really technical, I'm sorry, this might be lacking, but I will tell you how I use the package and what the theory is behind how I use it, just some tips for people who may be using it in the future. It's PyData style, you know. Who here identifies as a data scientist or an analyst working with data? Yeah, great. So, I hope you will find this useful. That's my Twitter handle there, and my GitHub, and you will find the code on GitHub, actually. Also, remember to tag EuroPython and please rate my talk afterwards, thank you.
Okay. And then, yeah, I'm Cheuk, or you can call me Sherry, as my friends do. I'm working at Hotelbeds, and I'm organising an AI club for gender minorities in London, so yeah, I'm based in London. We are trying to promote gender equality in technology, trying to encourage and empower gender minorities, like many women who feel there's not enough support for them in the community, so we try to help.
Also, I'm a member of Python Sprints, so if you are, like me, based in London, next week we will have a sprint on the pandas documentation, which is for gender minorities. So if you are a minority like me and you want to start contributing to pandas, this is for you, and we will have other events for other libraries as well. So, if
you want to contribute and you don't know how to get started, you're welcome to join; or if you're very experienced, you're welcome to help out as well. Okay, with all of that out of the way: why are we doing this, matching company names, string matching? Well, I had a problem at work. There is a list of companies, my company's clients, and I want to find similar names in it. But of course it's not limited to my company, right? Why would we want to match names at all? I had a kind of funny encounter on Facebook, actually. Somewhere in China, I saw this picture. It looks familiar, right? Okay, what about this one? It's giving credit to a company in the USA, obviously. And this one I have to skip through quickly, because there's a very nasty typo in it. So, if I have a list of companies and some of them are like those, I want to know: are they the one, the coffee shop where we drink coffee and buy cakes, are they the same? So, can you tell me what the result of those comparisons would be? Yeah. All false.
So, I can't do it like that, right? Everybody knows that. So maybe I can be smart, maybe I can use some string methods. Who knows which string method I could use to make the first pair match, with some modification? Yeah, upper or lower. How about the second one? There are different ways, but I found one, which I will show you. And then the last one, let's talk about the last one. I will show you what I came up with: I replaced the space and I just trimmed off the S. It's a bit stupid, right? What if it's not "Starbucks" but "Starbucks" with an exclamation mark? Then it doesn't work, right? So we need something else to find all these similar strings, and I know
this one, it's called fuzzy matching. It's a funny name; "fuzzy" makes you think of cute animals. But yeah, that's Wikipedia's definition. It's too long, and I don't like reading long paragraphs, but basically what it means is matching two strings that are not exactly the same, while finding a way to score how similar they are. So, voila, we have something very smart coming up. This is named after a Russian scientist, I hope I pronounce it right; if you speak Russian, you can tell me. Vladimir Levenshtein, is it correct? Levenshtein, yeah, the Levenshtein distance. I'm probably still doing it wrong, forgive me. Basically, it is an integer that tells you how many alterations I have to make to change a string A into a string B. If I delete one character, that is one change. If I change a character, for example from an A to an E, that is also one alteration, and so is adding a character. So, if that number, the Levenshtein distance, is bigger, then the two strings are more different. It sounds very intuitive, right? So, how can I compute it in a program? With dynamic programming, which sounds smart, right? Actually, I got this picture, it's not mine, I should give credit: it's from GitHub. There is a JavaScript library that has implemented this. Of course I'm using Python, so I'm not talking about that package, but if you find my slides, you can click on it and go to their GitHub repo to have a look.
So, this is the graph, or matrix, however you want to call it. You can see that the top left-hand corner is zero, because there's no change: going from an empty string to an empty string. Then we go character by character. The first characters are the same, both S, so you don't need to change anything there. But if I go one step to the right, I'm adding one character of "Saturday"; spelling out "Saturday" adds one character each step, so the first row is one, two, three, up to eight. And if I go down, it's adding characters of "Sunday", so you see the same going down. So the first row and column go from zero up to the length of each word. Now, what if I have two characters? From "Saturday" I have "S-A", and if I want to change it to the start of "Sunday", "S-U", I have to change the A to the U. So at the intersection of the A and the U there is a one: one character changed. That's what this matrix is, and it's dynamic programming: you solve the small problems first and expand them until the end. At the end, you find that the minimal change is three, and then you can work your way back up and find the minimum path of edits you have to make to change "Saturday" into "Sunday". You can see there are some substitutions and some deletions, which, yeah, you can see from the graph.
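The dynamic-programming idea described here is short to write down. As a minimal sketch (my own illustration, not the library's code), here is the Levenshtein distance in Python, keeping only one row of the matrix at a time:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b (the previous row of the matrix).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]

print(levenshtein("Saturday", "Sunday"))  # 3
```

Keeping the full matrix instead of one row would additionally let you trace the path back and recover which edits were made, as described above.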
So, I'm not talking about that repo, because that is in JavaScript. I'm using Python, so I can use FuzzyWuzzy, which is also a funny name, which I love. The reason I like it so much is that it's not just the plain Levenshtein distance for comparing two strings; it also provides some very useful functions. For example, simple ratio, which is the basic one, right?
So, that one is not magic, that's basic, but we also have partial ratio. For example, my name is Cheuk, but if you include my middle name, it's Cheuk Ting. So, "Cheuk" and "Cheuk Ting", are they the same person? Is it both me? If I use partial ratio, they match, because "Cheuk Ting" contains "Cheuk", which is also my name, so it gives a score of 100: they are the same. But if you use simple ratio, it won't be 100, because you have to add "Ting", four more characters, to get my full name, so it gives you a lower score. So those first two give you a matching score on the characters. We also have token sort ratio, which means each word is a token and the order can change. For example, my name is Cheuk Ting Ho, where Ho is my surname. But in Chinese we put the surname first, so my name would actually be Ho Cheuk Ting in Chinese. It's the same person, so I use token sort ratio, and then all three names are still the same; it's just the order that changed, because Western culture usually puts the given name first and then the surname, but in Chinese it's different. So it will give you 100. And then for token set ratio: usually you would skip the middle name, so putting just my first name and my last name gives "Cheuk Ho". Is that the same as me? It does the token comparison, and if one set of tokens is a subset of the other, it will also pass and give you a score of 100. So that's very useful; which one really helps you out depends on the use case.
So, you can check out the GitHub repo. It's by SeatGeek. It's a very, very popular library; I used it and I really like it. So you can go there and check it out. They also have a blog post about the difference between these four functions. So, yeah.
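The four scorers discussed here are `fuzz.ratio`, `fuzz.partial_ratio`, `fuzz.token_sort_ratio` and `fuzz.token_set_ratio` in FuzzyWuzzy. As a rough sketch of what each one does, here are simplified stand-ins built on the standard library's `difflib` (the real library scores with Levenshtein distance, so exact numbers differ; the names in the examples follow the talk):

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Simple ratio: whole-string similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def partial_ratio(a, b):
    """Best score of the shorter string against any same-length
    slice of the longer one, so 'Cheuk' matches inside 'Cheuk Ting'."""
    short, long_ = sorted((a, b), key=len)
    if not short:
        return 100 if not long_ else 0
    return max(ratio(short, long_[i:i + len(short)])
               for i in range(len(long_) - len(short) + 1))

def token_sort_ratio(a, b):
    """Sort the words first, so 'Ho Cheuk Ting' matches 'Cheuk Ting Ho'."""
    sort_words = lambda s: " ".join(sorted(s.lower().split()))
    return ratio(sort_words(a), sort_words(b))

def token_set_ratio(a, b):
    """Compare against the shared words, so a subset like
    'Cheuk Ho' still scores 100 against 'Cheuk Ting Ho'."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(ta & tb))
    sa, sb = " ".join(sorted(ta)), " ".join(sorted(tb))
    return max(ratio(common, sa), ratio(common, sb), ratio(sa, sb))

print(ratio("Cheuk", "Cheuk Ting"))                        # lower than 100
print(partial_ratio("Cheuk", "Cheuk Ting"))                # 100
print(token_sort_ratio("Cheuk Ting Ho", "Ho Cheuk Ting"))  # 100
print(token_set_ratio("Cheuk Ting Ho", "Cheuk Ho"))        # 100
```

The point of the sketch is the behaviour: `partial_ratio` forgives extra text around a match, `token_sort_ratio` forgives word order, and `token_set_ratio` additionally forgives missing words.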
Okay, so my use case. Now I have the company data, but of course I won't use my company's clients' data, because that's confidential. So I downloaded an example dataset, from the openly licensed public database of all UK companies: if they have limited liability, they have to make this information public, so you can download it. There are a lot of companies in this country, so I only used Cambridge, because there are a lot of startups going on there, which is very exciting. But still, there are a lot of companies in that list, around 15K, so quite a lot. Okay, so trick one: if I just use FuzzyWuzzy and check all these names against each other, a lot of them will get a high score, because there are some words that all the companies use, right? I can set up my own whatever company, and everybody can set up a company with the word "company" in the name; it's very common, and "limited" is very common as well. So a match on those words is less meaningful. For example, with my clients there are lots of these, because my company is doing travel, hotel rooms and stuff, so a lot of my clients have "travel" in their name, which is less meaningful. So what I do is use a small trick: I build a count dictionary and look at the 30 most common words. This is also useful for other NLP work, it's a recurring theme, so it's very convenient; I just use the same idea here. Okay, and then another thing:
I came across a problem: remember, there are almost 16K companies, so that's a lot of names. If we match every single one of them with all the others, that's a lot of comparisons, and I don't want a big project, right? I don't want a GPU and a day of computation; no, that's not going to work, and Python is not super fast. So I use a trick. I want to match company names that are highly similar, and my thinking is: if somebody opens an account, types in a company name, makes a typo, sees it's wrong, and opens another account with the right name, then one person has two accounts with highly similar names because of a mistake in the first one. But I just assume they won't make the mistake on the first character, because that's very obvious: usually when you type something, you look at it, and if the first character is wrong, you catch it. So, yeah, that's the trick, and it's good enough for me. Okay, so, remember I found the 30 most common words, right?
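The first-character trick just described is a blocking step: only names in the same bucket ever get compared. A small sketch with made-up names:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample names for illustration.
names = ["Starbucks", "Starbuck", "Sunrise Cafe",
         "Happy Unicorn Co", "Happy Unicorn Company"]

# Under the assumption that typos never hit the first character,
# bucket by first character and only compare within each bucket.
buckets = defaultdict(list)
for name in names:
    buckets[name[0].lower()].append(name)

pairs = [pair for group in buckets.values() for pair in combinations(group, 2)]
print(len(pairs))  # 4 pairs instead of the 10 a full cross-comparison needs
```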
So, when I do fuzzy matching, I deduct 10 points for each of those words. It's like a game, right? Because you have this common word, I just deduct your points, because I check the score at the end, checking whether the pair has a high score. If a pair only has a high score because they share a common word, that's not a valid match, so I have to deduct the score to balance it out. Is it a very good trick? It may not be, but it works, and it's very simple and easy to implement.
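As a sketch, the two tricks together, counting words to find the common ones and then deducting points per shared common word, might look like this (the names, the top-2 cut-off and the penalty of 10 are illustrative; the talk uses the top 30 words, and `difflib` stands in for FuzzyWuzzy's scorer):

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical sample names; the talk uses ~16K UK company names.
names = ["Happy Unicorn Travel Limited", "Sunrise Travel Limited",
         "Cambridge Coffee Company Limited", "Moonlight Travel Limited"]

# Count every word and keep the most common ones as "less meaningful".
word_counts = Counter(word for name in names for word in name.lower().split())
common_words = {word for word, _ in word_counts.most_common(2)}

def adjusted_score(a, b, penalty=10):
    """Similarity score minus `penalty` points per shared common word."""
    base = round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())
    shared = set(a.lower().split()) & set(b.lower().split()) & common_words
    return base - penalty * len(shared)

print(common_words)  # {'limited', 'travel'} on this sample
```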
So, that's why I do it. I think in a lot of cases it will work. It sounds very simple, but it works. I also chose to use token sort ratio, because if somebody types in a name with the words swapped, the two names should still be highly similar, right? So I'm considering it word by word, and that's why I chose that one. Okay, so in the end, what did I find using that dataset? You can go to my GitHub and check it out; it's just a Jupyter notebook.
It's very simple, and it's an example that you can refer back to. So, the pairs that I caught: if they score more than 85, they are considered the same. So, what are they? Usually they are spelling mistakes, two names that differ by, for example, just an S, like "Starbuck" and "Starbucks", or by one missing L or one missing I, things like that. It's very easy to make those kinds of mistakes, because an I and an L are very easy to miss if you just look at them, so it makes sense as a human typo. It may not be, but that's what I suspect. There's also the case where one name is just another name with a number at the end. There I would suspect somebody having multiple accounts, right? For example, if I signed up for an account with my name as the username last week, then this week I can't have another one, so I just put a two at the end. That's the logic. There are also abbreviations; those could be intentional changes, so I wouldn't suspect they are necessarily human error. And there was one funny thing I found: some names actually look similar, like the one I have here, but they are not just different by one character. They could be two companies that coincidentally have very similar names. That's the interesting case that I would love to investigate.
You see, for example, if they were my clients, I would maybe ask my colleagues to talk to their clients: are these the same, do both accounts belong to you, or does somebody else coincidentally have a similar name? So we still need some human checking at the end, but I can show you that this hugely reduces the work needed to find out whether something is a typo or really two different clients. After I applied the matching, the pairs that got caught as highly similar are only about 1% of the total. Without it, among the 15,889 names, you would still have to go through the whole long list; of course my colleagues wouldn't call all of the clients, but even just looking through them is a lot. Now I only need to look at, and maybe investigate, 57 of them. So it really reduces the work that you or your colleagues have to do, and I think it's a really good trick. So, yeah, happy matching. I think I have a lot of time left, right? I went through it very quickly. So, yeah, just ask questions.
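Putting the pieces of the talk together, a rough end-to-end sketch of the approach (first-character buckets, token-sort scoring, a penalty for shared common words, and the 85 threshold) could look like this. The names and common-word list are made up, and `difflib` stands in for FuzzyWuzzy:

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

COMMON_WORDS = {"limited", "ltd", "company"}  # would come from the word-count step
THRESHOLD = 85
PENALTY = 10

def token_sort_score(a, b):
    """Token-sort similarity on a 0-100 scale (difflib stand-in)."""
    sa = " ".join(sorted(a.lower().split()))
    sb = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, sa, sb).ratio())

def find_suspects(names):
    # 1. Bucket by first character so we only compare within a bucket.
    buckets = defaultdict(list)
    for name in names:
        buckets[name[0].lower()].append(name)
    # 2. Score pairs, deducting points for shared common words.
    suspects = []
    for group in buckets.values():
        for a, b in combinations(group, 2):
            shared = set(a.lower().split()) & set(b.lower().split()) & COMMON_WORDS
            score = token_sort_score(a, b) - PENALTY * len(shared)
            if score > THRESHOLD:
                suspects.append((a, b, score))
    return suspects

names = ["Happy Unicorn Limited", "Happy Unicorn Limitd",
         "Happy Panda Limited", "Sunrise Cafe Limited"]
print(find_suspects(names))
```

On this toy list only the genuine typo pair gets flagged; "Happy Panda Limited" shares "Happy" and "Limited" with the others but drops below the threshold once the common-word penalty is applied.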
Thank you very much. So, we have time for questions. Yeah.
So, FuzzyWuzzy uses four metrics to estimate whether it's a match or not. What are, let's say, the coefficients or the priority of these four metrics? You mean the...? Like the token sort ratio?
Yeah, so, I can actually show you. They actually give it a score; it returns a score at the end, which is 100 if it's perfectly matched. There are actually examples. For example, here, the token sort matching, you can see, let me make it bigger. Yes, but how does it eventually generate this final score from those four metrics? Yeah, so basically, if it's the simple ratio, it computes the Levenshtein distance and then normalizes it into a score. I haven't really checked in the code, but it will give you a hundred if it's a total match, as you can see here. For example, a partial match will also give you a hundred if the matched part matches totally. And it's by proportion as well: if you have a longer phrase, the tolerance for differences is higher, because it's proportional. Oh, okay, so you pick which one you want to use? It's not like they eventually combine the four to generate one score? Yeah, you can choose which logic you want to use, because, for example, if I had applied the simple ratio in this case, it wouldn't be perfect, right? So it's the logic that you choose, yeah. Thank you.
Okay, we have another question. Hello, so I was curious whether Levenshtein only uses the letters. Does it work equally well in all languages, or are there some languages it has problems with, and do people use techniques like combining it with a dictionary of a language to get better results? Actually, it doesn't care what language it is; the idea is that it checks characters. So if it's not in English, if it's in French, it doesn't matter, right? If you have a word that is one character different, it will give you a Levenshtein distance of one, because there's one character of difference. It doesn't matter whether that's a change of a character, a deletion, or the addition of one character.
This is kind of more of a what-if question. Suppose I've got two super big tables of World of Warcraft players, from one server and another server, and I want to know if there might be matches, and let's assume these names are fuzzy. Not the best example, but let's assume that. You could do a full join and then check whether pairs are FuzzyWuzzy-matchable, but this won't scale very well unless you use some sort of heuristic. You already mentioned taking the first letter, for example, as a trick to make things faster. Any other tricks? For two tables of matching names, I haven't thought about something else at this level yet, but this already helps a lot, because, for example, if two names just have their words swapped in order, token sort already sorts that out for you. If you want it to be super quick, other than requiring the first character to match, maybe you can check the length as well. But there is a limitation: if the difference is not a changed character but an added or removed one, the lengths differ, so you can't require equal length. So there will be limitations. For me, my trick is more or less a patch, a quick way of filtering the candidates.
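The cheap pre-filters mentioned in this answer, a matching first character plus a bound on the length difference, can be sketched as below. The length check is sound for near-matches because the Levenshtein distance is always at least the difference in lengths:

```python
def maybe_similar(a: str, b: str, max_len_diff: int = 2) -> bool:
    """Cheap pre-filter before any fuzzy scoring: require the same
    first character and a small length difference. Levenshtein
    distance is at least abs(len(a) - len(b)), so a large length
    gap already rules out near-identical names."""
    if not a or not b:
        return False
    return (a[0].lower() == b[0].lower()
            and abs(len(a) - len(b)) <= max_len_diff)

print(maybe_similar("Starbucks", "Starbuck"))   # True
print(maybe_similar("Starbucks", "Moonbucks"))  # False, first letter differs
```

Note this filter is only safe under the talk's assumption that typos never hit the first character; a heavily reordered name would also be rejected before token-sort scoring ever sees it.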
So, yeah. Okay, another question? Hi, here we are comparing one to one, but what if you want to compare one to a set? Say you have 10,000 names and you want to figure out how rare or unique a specific name is compared to the whole set. Would you also use FuzzyWuzzy, or would you go with something else? So, if I have one name and I match it against the other thousands to see if there is a pair that is very similar, is that what you mean? No, actually, how would you measure how unique this name is compared to the whole dataset? Okay, yeah. It depends on which function you choose; the logic will be different. For example, if I choose token sort ratio, then names with the words in a different order are considered the same. Every single pair gets a score, right? So if, for example, 10 out of the 10,000 get a score higher than 90, then I would say the name is not unique. For it to be considered different from everything else, all of the scores need to be smaller than a certain threshold. So, yeah. I hope that answers your question. Do we have any more questions?
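The uniqueness check described in this answer can be sketched as: a name counts as unique if no other name in the set scores at or above the threshold. The sample names are made up, and `difflib` stands in for FuzzyWuzzy:

```python
from difflib import SequenceMatcher

def is_unique(name, others, threshold=90):
    """True if no name in `others` is fuzzily similar to `name`."""
    for other in others:
        score = round(100 * SequenceMatcher(None, name.lower(),
                                            other.lower()).ratio())
        if score >= threshold:
            return False
    return True

names = ["Happy Unicorn Ltd", "Sunrise Cafe", "Happy Unicorn Ltd2"]
print(is_unique("Blue Whale Consulting", names))  # True
print(is_unique("Happy Unicorn Ltd", names[1:]))  # False: 'Happy Unicorn Ltd2' is too close
```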
Okay. Yeah. So, before, you said you picked 85 as a threshold, if I remember correctly, right? Yeah, that's the threshold, yeah. So how do you know that matches with a score below 85 can be ignored? Did you run any tests to choose this parameter, 85, or how did you pick it? Yeah, I did run it a couple of times, actually. I didn't show it, but with 85 it gave me the 57 pairs that are considered matching, right? So I looked at them. I can also change it, for example to 75, and then it will give me, let's say, 200, and I can compare the two. Or, if I have prior knowledge that the mistakes won't be more than, say, 1%, then I can look at the number of matches and compare it to the total: is it less than 1% or more than 1%? So I still need to fine-tune it; it's not absolute. Okay, here we have another question.
Hi, did you ever try it on longer strings? Could you, for example, find a company name in a large text? Would the library scale to that? So, if you have a large text and you want to know whether a company name appears in this text, do you think you could use the library for that? Yeah, if it's not too big a paragraph, maybe you can use a sliding window, and then some partial ratio and so on, to see if the company name appears in that window. I haven't tried that out, but I do think the sliding window would work: if you have a paragraph and your company name consists of, say, three words, then you can take a window of, say, five words, slide it through, and see which windows have a high partial ratio, so that you can pick that chunk out and know that, yeah, it's a match. And so, the last question. Yes, have you tried phonetic algorithms? Because I think that would work very well for short names.
Because you only use it for short names, right? Not for a long corpus. Yeah, this one is for short names. So, what's the algorithm you mentioned? Phonetic algorithms. Oh, a phonetic algorithm, is it? Oh, yeah, I haven't tried that one, because I only tried this fuzzy matching, but I haven't tried that, yeah. Okay, and just to answer: maybe a phonetic method could be a good match for your question. And I think we don't have any more time for questions, so let's thank Cheuk once again. Thank you.