Summary
As a developer, you have probably already faced the problem of string matching. Whether you want to build an automatic spell checker or match a query string against your database, you need a way to match similar strings even when they are not identical. These differences can be due to grammatical variability or to mistakes introduced by OCR engines.
In this article we will introduce and explain different ways of doing string matching, with Python snippets you can adapt to your favorite language. We will be using three Python libraries: difflib, fuzzywuzzy, and regex.
What metrics are used when dealing with string comparison?
The whole problem of partial string matching consists of finding a function that gives a meaningful similarity score between two strings.
There are plenty of ways of measuring string similarity, but we will discuss these three:
- The Jaccard distance
- The longest common substring percentage
- The Levenshtein similarity
The Jaccard distance
One of the simplest ones is to use the Jaccard distance.
The Jaccard similarity is the ratio between the number of unique characters common to both strings and the number of unique characters appearing in either string, i.e. the intersection over the union of the two character sets; the Jaccard distance is one minus this ratio.
Here is one possible implementation of the Jaccard similarity and distance in Python; a minimal sketch using built-in sets:
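```python
def jaccard_similarity(s1: str, s2: str) -> float:
    """Shared unique characters over all unique characters (intersection over union)."""
    set1, set2 = set(s1), set(s2)
    return len(set1 & set2) / len(set1 | set2)


def jaccard_distance(s1: str, s2: str) -> float:
    """One minus the Jaccard similarity."""
    return 1 - jaccard_similarity(s1, s2)


print(jaccard_similarity("fruit", "fruits"))  # 5/6 ≈ 0.83
print(jaccard_distance("silent", "listen"))   # 1 - 6/6 = 0.0
```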
For example, the Jaccard similarity between “fruit” and “fruits” is 5/6: the character sets {f, r, u, i, t} and {f, r, u, i, t, s} share five characters out of six unique ones.
How good is this metric? Well, it is quite easy and straightforward to implement; however, it does not take the order of the characters into account. For example, the Jaccard distance between SILENT and LISTEN is 1 - 6/6 = 0, meaning the metric considers the two words identical. So we need something more robust.
Pros:
- Easy to implement
- Fast
Cons:
- Does not take character ordering into account
- Not very reliable: anagrams such as LISTEN and SILENT get a perfect score
The longest common substring percentage
The longest common substring is the longest string contained in both strings. An obvious similarity measure is then the ratio between the length of the longest common substring and the length of the shorter string.
So in the examples above:
- “Fruit” and “Fruits” gives a 100% score, as the full word “Fruit” is the longest common substring and it covers the whole shorter word
- “Listen” and “Silent” gives 1/3, as the longest common substring (“en”) covers two characters out of six
Depending on your use case, you can also compute the ratio using the maximum length from both strings:
- Using minimum length: A score of 100% means that one of the two strings is completely included in the other.
- Using maximum length: A score of 100% is possible only when the two strings are exactly the same.
Here is a Python implementation of this method using difflib; a minimal sketch built on SequenceMatcher.find_longest_match that covers both normalization choices:
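```python
from difflib import SequenceMatcher


def lcs_ratio(s1: str, s2: str, use_max: bool = False) -> float:
    """Length of the longest common substring divided by the length of the
    shorter string (or of the longer one when use_max is True)."""
    matcher = SequenceMatcher(a=s1, b=s2, autojunk=False)
    match = matcher.find_longest_match(0, len(s1), 0, len(s2))
    denominator = max(len(s1), len(s2)) if use_max else min(len(s1), len(s2))
    return match.size / denominator


print(lcs_ratio("fruit", "fruits"))                # 5/5 = 1.0
print(lcs_ratio("fruit", "fruits", use_max=True))  # 5/6 ≈ 0.83
print(lcs_ratio("listen", "silent"))               # 2/6 ≈ 0.33
```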
However, what happens if we want to compare “goodbye” and “goozbye”? The longest common substring is “goo”, so the similarity would be 3/7, which is very low given that only one character differs.
Pros:
- Takes character ordering into account
Cons:
- Not easy to implement from scratch
- Very sensitive to typos: a single error in the middle of a word can drastically lower the score
Levenshtein similarity
The Levenshtein distance is the minimum number of modifications to apply to a string so that it matches another one.
It is a particular case of the EDIT distance. The generic EDIT distance allows you to define a weight for each type of modification, whereas the Levenshtein distance uses a weight of 1 for all of them.
So in the examples above:
- “Fruit” and “Fruits” gives an 80% score, as one error is introduced by the extra ‘s’ over the five characters of the shorter word, hence 1 - (1/5) = 80%
- “Listen” and “Silent” gives 33%, as the minimum number of operations to make them match is 4 (two replacements, one insertion and one deletion), hence 1 - (4/6) ≈ 33%
The EDIT distance gives more flexibility because it’s possible to fine-tune the weights in order to fit your problem better.
A modification can be of 3 types:
- Insert: Add an extra character
- Delete: Delete a character
- Replace: Replace a character
NB: Sometimes, the Replace modification is not used and is instead counted as a deletion plus an insertion. You can also find definitions that include a Transposition modification (swapping two adjacent characters).
To get a comparison score from the Levenshtein distance, as with the other methods, we can divide the distance by the length of either the shortest or the longest string and subtract the result from 1.
Here is one possible implementation of a comparison score based on the Levenshtein distance; a minimal dynamic-programming sketch whose per-operation weights can be tuned (leaving them all at 1 gives the plain Levenshtein distance):
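```python
def edit_distance(s1: str, s2: str, w_ins: int = 1, w_del: int = 1, w_sub: int = 1) -> int:
    """Weighted EDIT distance; with all weights equal to 1 this is the Levenshtein distance."""
    # previous[j] holds the distance between the current prefix of s1
    # and the first j characters of s2
    previous = [j * w_ins for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        current = [i * w_del]
        for j, c2 in enumerate(s2, start=1):
            insertion = current[j - 1] + w_ins
            deletion = previous[j] + w_del
            substitution = previous[j - 1] + (w_sub if c1 != c2 else 0)
            current.append(min(insertion, deletion, substitution))
        previous = current
    return previous[-1]


def levenshtein_score(s1: str, s2: str) -> float:
    """1 minus the distance normalized by the length of the shorter string."""
    return 1 - edit_distance(s1, s2) / min(len(s1), len(s2))


print(levenshtein_score("fruit", "fruits"))   # 1 - 1/5 = 0.8
print(levenshtein_score("listen", "silent"))  # 1 - 4/6 ≈ 0.33
```

If you prefer a ready-made score, the fuzzywuzzy library exposes a similar normalized ratio on a 0-100 scale (note that its normalization differs slightly from the sketch above):

```python
from fuzzywuzzy import fuzz

print(fuzz.ratio("fruit", "fruits"))  # 91
```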
Pros:
- Interpretable: the distance tells you how many edits are required to turn one string into the other
Cons:
- Harder to implement
How to search while allowing mistakes using regular expressions?
The regex package in Python allows searching with regular expressions and offers fast search in text data. It has a powerful feature that allows approximate (fuzzy) regex matching. We will introduce this feature and give a taste of its power in the following paragraphs.
Approximate matching with regular expressions
Regexes are used to define a search pattern and find matches inside strings. Approximate matching is possible with packages like regex in Python: it allows searching for a pattern while tolerating some acceptable errors.
You may be interested in searching for keywords in a scanned document containing OCR errors. OCR errors often show recurring patterns (such as “w” → “vv” or “v”, “O” → “0”, “y” → “v”), so by allowing some maximum number of errors, or by specifying the types of errors allowed (insertion, deletion, substitution), we can find those keywords, as in the examples below.
The identifier for allowing general errors is {e}. Written like this, it does not limit how many errors are tolerated, so to put an upper limit on the number of errors we use the “<=” sign; for example, for an upper limit of two errors we write {e<=2}.
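As a quick illustration, here is a sketch of searching with up to two errors; the OCR'd text and the keyword are hypothetical, chosen to mimic the “w” → “vv” pattern above:

```python
import regex

# Hypothetical OCR output in which "w" was read as "vv" ("low" -> "lovv").
ocr_text = "the lovv bridge"

# {e<=2}: tolerate up to two errors of any type while matching "low".
match = regex.search(r"(?:low){e<=2}", ocr_text, regex.BESTMATCH)
print(match.group(), match.fuzzy_counts)  # fuzzy_counts = (substitutions, insertions, deletions)
```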
Moreover, we can restrict which character may be involved in the error, whether it is introduced by substitution or by insertion, using an identifier such as {e<=2:[v]}.
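Continuing the same hypothetical example, we now only tolerate errors involving the character “v”:

```python
import regex

ocr_text = "the lovv bridge"  # same hypothetical OCR output as above

# {e<=2:[v]}: up to two errors, but only ones involving the character "v".
match = regex.search(r"(?:low){e<=2:[v]}", ocr_text, regex.BESTMATCH)
print(match.group(), match.fuzzy_counts)
```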
The upper limit on the number of errors can be specified even further, per error type. For example, if we want the sum of the number of substitutions and the number of insertions not to exceed 2, we can use the identifier {1s+1i<=2:[v]}, as in the example below.
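Here is the same hypothetical search with this per-type constraint; note that error types not mentioned in the constraint (here, deletions) are not allowed:

```python
import regex

ocr_text = "the lovv bridge"  # same hypothetical OCR output as above

# {1s+1i<=2:[v]}: substitutions (s) and insertions (i) of "v" are allowed,
# as long as their weighted sum (1 per substitution + 1 per insertion) stays <= 2.
match = regex.search(r"(?:low){1s+1i<=2:[v]}", ocr_text, regex.BESTMATCH)
print(match.group(), match.fuzzy_counts)
```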
Conclusion
As we have seen, there are a lot of ways to do approximate search and matching, ranging from simple methods such as the Jaccard distance to more complicated ones like the Levenshtein similarity. Approximate matching can also be leveraged with regular expressions through the Python regex library for fast search in text data.