Partial string matching

Reading time: 6 min
Published on: Jun 7, 2022

Mohamed Biaz
Khalid El Aaboudi

As a developer, you have probably already faced the problem of string matching. Whether you are building an automatic spell checker or matching a query string against your database, you need a way to match similar strings even when they differ. These differences can be due to grammatical variability or to mistakes introduced by OCR engines.

In this article we will introduce and explain different ways of doing string matching and provide Python snippets, so you can adapt them to your favorite language. We will be using three Python libraries: difflib, Levenshtein, and regex.

What metrics are used when dealing with string comparison?

The whole problem of partial string matching consists of finding a function that gives a meaningful similarity score between two strings.

There are plenty of ways of measuring string similarity, but we will be discussing the ones below:

String comparison

The Jaccard distance

One of the simplest ones is to use the Jaccard distance.

It basically computes the ratio between the number of unique characters shared by both strings and the number of unique characters appearing in either string.

Here is an implementation of the Jaccard similarity in Python:

def jaccard_similarity(s1: str, s2: str) -> float:
    """Computes the Jaccard similarity score between s1 and s2."""
    return len(set(s1) & set(s2)) / len(set(s1) | set(s2))

if __name__ == "__main__":
    s1 = "String matching is not easy"
    s2 = "Compare two strings is not easy"
    jaccard_sim = jaccard_similarity(s1, s2)
    print(f"The Jaccard similarity between {s1} and {s2} is {jaccard_sim}")

For example, the Jaccard similarity between “fruit” and “fruits” is 5/6.

Jaccard similarity

How good is this metric? Well, it is quite easy and straightforward to implement; however, it does not take the order of the characters into account. For example, the Jaccard distance between SILENT and LISTEN is 1 - 6/6 = 0, meaning the two words are considered identical. So we need something more robust.
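
As a quick sanity check, calling the function above on these two examples gives (a minimal sketch, assuming jaccard_similarity is in scope):

print(jaccard_similarity("fruit", "fruits"))   # 0.8333..., i.e. 5/6
print(jaccard_similarity("listen", "silent"))  # 1.0, i.e. 6/6, even though the words differ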

Pros:

  • Easy to implement
  • Fast

Cons:

  • Does not take character ordering into account
  • Not very reliable

The longest common substring percentage

The longest common substring is the longest string contained in both strings. One can immediately think of a similarity measure as the ratio between the length of the longest common substring and the length of the shorter of the two strings.

longest common substring

So in the examples above:

  • “Fruit” and “Fruits” gives a 100% score, as the full word “Fruit” is the longest common substring
  • “Listen” and “Silent” gives 1/3, as only two characters (“en”) out of six are common

Depending on your use case, you can also compute the ratio using the maximum length from both strings:

  • Using minimum length: A score of 100% means that one of the two strings is completely included in the other.
  • Using maximum length: A score of 100% is possible only when the two strings are exactly the same.

Here is a Python implementation of this method using difflib:

from difflib import SequenceMatcher

def longest_common_substring(s1: str, s2: str) -> str:
    """Computes the longest common substring of s1 and s2."""
    seq_matcher = SequenceMatcher(isjunk=None, a=s1, b=s2)
    match = seq_matcher.find_longest_match(0, len(s1), 0, len(s2))

    if match.size:
        return s1[match.a : match.a + match.size]
    else:
        return ""

def longest_common_substring_percentage(s1: str, s2: str) -> float:
    """Computes the longest common substring percentage of s1 and s2."""
    assert min(len(s1), len(s2)) > 0, "One of the given strings is empty"
    return len(longest_common_substring(s1, s2)) / min(len(s1), len(s2))
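
As a quick check, and as a sketch of the maximum-length variant described earlier (the function name and printed values are illustrative, assuming the functions above are in scope):

def longest_common_substring_percentage_max(s1: str, s2: str) -> float:
    """Same as above, but divides by the length of the longer string."""
    assert max(len(s1), len(s2)) > 0, "Both of the given strings are empty"
    return len(longest_common_substring(s1, s2)) / max(len(s1), len(s2))

print(longest_common_substring_percentage("fruit", "fruits"))      # 1.0, "fruit" is fully included
print(longest_common_substring_percentage_max("fruit", "fruits"))  # 0.8333..., i.e. 5/6
print(longest_common_substring_percentage("listen", "silent"))     # 0.3333..., "en" is the longest common substring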

However, what happens if I want to compare “goodbye” and “goozbye”? Well, the longest common substring is “goo”, so the similarity would be 3/7, which is very low given that only one character differs.

Pros:

  • Takes character ordering into account

Cons:

  • Not easy to implement from scratch
  • Does not take typos into account

Levenshtein similarity

The Levenshtein distance is based on the minimum number of modifications to apply to a string to match another one.

The Levenshtein distance is a particular case of the EDIT distance. The generic EDIT distance lets you define a weight for each type of modification applied to the strings, whereas the Levenshtein distance uses a weight of 1 for all of them.

Levenshtein distance

So in the examples above:

  • “Fruit” and “Fruits” gives an 80% score, as the only error is the added ‘s’ out of the five characters of the shorter word, hence 1 - (1/5) = 80%
  • “Listen” and “Silent” gives 33%, as the minimal number of operations to make them match is 4 (two replacements, one insertion, and one deletion), hence 1 - (4/6) ≈ 33%

The EDIT distance gives more flexibility because it’s possible to fine-tune the weights in order to fit your problem better.

A modification can be of 3 types:

  • Insert: Add an extra character
  • Delete: Delete a character
  • Replace: Replace a character

NB: Sometimes, the Replace modification is not used and is considered as a deletion plus an insertion. You can also find some definitions including the Transposition modification.
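
To make the weighting idea concrete, here is a minimal dynamic-programming sketch of a weighted EDIT distance (the function name and default weights are illustrative); with all weights set to 1 it reduces to the Levenshtein distance:

def weighted_edit_distance(s1: str, s2: str, w_ins: float = 1.0, w_del: float = 1.0, w_sub: float = 1.0) -> float:
    """Edit distance between s1 and s2 with per-operation weights."""
    m, n = len(s1), len(s2)
    # dp[i][j] = minimal cost of turning s1[:i] into s2[:j]
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * w_del
    for j in range(1, n + 1):
        dp[0][j] = j * w_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub_cost = 0.0 if s1[i - 1] == s2[j - 1] else w_sub
            dp[i][j] = min(
                dp[i - 1][j] + w_del,         # delete a character from s1
                dp[i][j - 1] + w_ins,         # insert a character into s1
                dp[i - 1][j - 1] + sub_cost,  # replace (or keep) a character
            )
    return dp[m][n]

print(weighted_edit_distance("listen", "silent"))  # 4.0 with the default unit weights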

To get a comparison score from the Levenshtein distance, as with the other methods, we can divide the distance by the length of either the shorter or the longer string.
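
The operation lists shown below are of the kind returned by the Levenshtein package's editops (a small sketch; each tuple is (operation, source position, destination position)):

from Levenshtein import editops

# list the edit operations needed to turn one string into the other
print(editops("listen", "silent"))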

s1 = "listen"
s2 = "silent"
[('replace', 0, 0), ('delete', 2, 2), ('replace', 3, 2), ('insert', 6, 5)]

s1 = "fruit"
s2 = "fruits"
[('insert', 5, 5)]

s1 = "string matching is not easy"
s2 = "compare two strings is not easy"

[('insert', 0, 0), ('insert', 0, 1), ('insert', 0, 2), ('replace', 0, 3), ('replace', 1, 4), ('replace', 3, 6), ('replace', 4, 7), ('replace', 5, 8), ('replace', 6, 9), ('replace', 7, 10), ('replace', 8, 11), ('replace', 9, 12), ('replace', 10, 13), ('replace', 11, 14), ('insert', 15, 18)]

# forget about the spacing 
string matching is not easy               # source string 
[c]string matching is not easy            # 1 - insert 'c'
c[o]string matching is not easy           # 2 - insert 'o'
co[m]string matching is not easy          # 3 - insert 'm'
com[p]tring matching is not easy          # 4 - replace 's' with 'p'
comp[a]ring matching is not easy          # 5 - replace 't' with 'a'
compar[e]ng matching is not easy          # 6 - replace 'i' with 'e'
compare [t]g matching is not easy         # 7 - replace 'n' with 't'
compare t[w] matching is not easy         # 8 - replace 'g' with 'w'
compare tw[o] atching is not easy         # 9 - replace 'm' with 'o'
compare two [s]tching is not easy         # 10 - replace 'a' with 's'
compare two sthing is not easy            # 11 - delete 'c'
compare two st[r]ing is not easy          # 12 - replace 'h' with 'r'
compare two string[s] is not easy         # 13 - insert an 's'

Here is an implementation of a comparison score using Levenshtein distance:

from Levenshtein import distance

def levenshtein_distance_percentage(s1: str, s2: str) -> float:
    """Computes a similarity score between s1 and s2 based on the Levenshtein distance."""
    assert min(len(s1), len(s2)) > 0, "One of the given strings is empty"
    return 1. - distance(s1, s2) / min(len(s1), len(s2))
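
Running it on the earlier examples reproduces the scores above (a quick check, assuming the Levenshtein package is installed):

print(levenshtein_distance_percentage("fruit", "fruits"))   # 0.8, i.e. 1 - 1/5
print(levenshtein_distance_percentage("listen", "silent"))  # 0.333..., i.e. 1 - 4/6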

Pros:

  • Helps in understanding how many edit operations are required to turn one string into another

Cons:

  • Harder to implement

How to search allowing mistakes using regular expressions?

The regex package in Python provides regular-expression search for fast lookups in text data. It also has a powerful feature that allows approximate (fuzzy) matching. We will introduce this feature and give a taste of its power in the following paragraphs.

Approximate matching with regular expressions

Regexes are used to define a search pattern and find matches inside strings. Approximate matching is possible with packages like regex in Python: it allows searching for a pattern while tolerating a number of acceptable errors.

You may be interested in searching for keywords in a scanned document that contains OCR errors. OCR errors often follow recurring patterns (such as “w” → “vv” or “v”, “O” → “0”, “y” → “v”), so by allowing a maximum number of errors, or by specifying the types of errors allowed (insertion, deletion, substitution), we can find those keywords, as in the examples below.

import regex

# search pattern for key word way
ptrn = "(?:way){e<=2:[v]}"

regex.findall(ptrn, "it is way better") 
['way']

regex.findall(ptrn, "it is wav better") 
['wav']

regex.findall(ptrn, "it is vvay better")
['vvay']

regex.findall(r"\\b(?:w{1s+1i<=2:[v]}ay)\\b", "it is vvay better")  
['vvay']

The identifier for allowing general errors is {e}. On its own it does not limit how many errors are tolerated, so to put an upper bound on the number of errors we use “<=”: for example, for an upper limit of two errors we write {e<=2}.

Moreover, we can specify which characters are allowed to introduce an error (by substitution or insertion), using an identifier like {e<=2:[v]}.

The upper limit on the number of errors can also be broken down by error type, as in the last example above: we wanted the sum of the number of substitutions and the number of insertions not to exceed 2, hence the identifier {1s+1i<=2:[v]}.
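
For instance, the tolerated errors can also be restricted to substitutions only (an illustrative sketch; the expected output is shown as a comment):

import regex

# allow at most one substitution, and only by the character 'v'
regex.findall(r"(?:way){s<=1:[v]}", "it is vay better")
# expected: ['vay']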

Conclusion

As we have seen, there are a lot of ways to do approximate search and matching, ranging from simple methods such as the Jaccard distance to more sophisticated ones like Levenshtein similarity. Approximate matching can also be leveraged through regular expressions, using the Python regex library for fast search in text data.

Photo by Agence Olloweb

OCR
Python
String Matching