Worldwide Handwritten Receipt OCR
We are delighted to introduce the V4 release of our Handwritten Receipt OCR API, which now handles handwritten character recognition on receipts worldwide. Thanks to months of work from our data science team, we were able to train Handwritten Text Recognition algorithms that deliver very high performance using the latest research in deep learning and computer vision. This article explains why we did it, how we did it, and what’s new in this release.
If you look at the distribution of receipts around the world, most of them contain printed text, which is easier to read than handwritten text. But if you want to optimize an expense management process and keep your users happy whatever their receipts look like, being able to read handwritten documents becomes very important. The main reason is that the split between handwritten and printed text varies a lot depending on the geography and the type of expense reported. If some of your users report only US restaurant receipts, most of their expenses will come with a handwritten receipt, and you don’t want to let them down.
In any expense management solution, the amount paid for an expense needs to be entered so that it can be validated down the road by a finance manager. More generally, that’s the case for any other useful information that can help pre-validate the expense, such as the tip amount or the type of expense. Using technology to help users extract the right data is valuable. That’s the main reason we wanted to include this new feature: to make sure software using our technology reduces the time companies spend on expense reports as much as possible.
In general, restaurant receipts make up a large proportion of receipts in expense management. When you go to a restaurant in the US, you will most of the time end up manually adding a tip on top of the price of your meal, then doing the math and writing down the final price.
Being able to automatically extract the total amount and the tip amount is very useful because:
- Total amount: This is the price the employee paid for this expense, and thus the amount that needs to be reimbursed.
- Tip: Tips are chosen by the employee before paying but should always be in the range of 10 to 30% of the price. Companies can have specific policies for tips, and extracting this value can help ensure the policy is respected.
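As an illustration of how the two extracted values could feed a policy check, here is a minimal Python sketch. The function names and the default bounds are assumptions made for this example: it treats the total amount as the final price including the tip, and uses the 10 to 30% range mentioned above as a hypothetical company policy. It is not part of the API.

```python
from decimal import Decimal


def tip_ratio(total_amount: Decimal, tip: Decimal) -> Decimal:
    """Return the tip as a fraction of the pre-tip price.

    Assumes total_amount is the final price, tip included.
    """
    pre_tip_price = total_amount - tip
    if pre_tip_price <= 0:
        raise ValueError("total amount must be greater than the tip")
    return tip / pre_tip_price


def tip_within_policy(
    total_amount: Decimal,
    tip: Decimal,
    low: Decimal = Decimal("0.10"),   # hypothetical lower bound
    high: Decimal = Decimal("0.30"),  # hypothetical upper bound
) -> bool:
    """Check that the tip falls inside the allowed range (bounds included)."""
    return low <= tip_ratio(total_amount, tip) <= high


# A 50.00 meal with a 10.00 tip (20%) passes; a 40.00 meal with a 20.00 tip (50%) does not.
print(tip_within_policy(Decimal("60.00"), Decimal("10.00")))
print(tip_within_policy(Decimal("60.00"), Decimal("20.00")))
```

Using `Decimal` rather than floats avoids rounding surprises when comparing monetary amounts against the policy bounds.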
Depending on your geography and the type of expense, you might sometimes end up getting a fully or partially handwritten receipt from your merchant. Sometimes it’s a blank page, filled in entirely by hand. Sometimes it’s a template where some fields are blank and need to be filled in by hand. In both cases, being able to extract the data is important. According to our study, restaurant and cab receipts are the most represented in this category. But it can come from any type of expense, as many merchants in all industries don’t use a machine to print their receipts. Here are a few examples of taxi receipts:
Let’s talk about the problem first: why is handwriting recognition so much more complicated than printed text recognition? The answer is simple: there is an almost infinite number of ways to write an amount with a pen, whereas printed text is far more normalized and thus much less variable. When using machine learning to solve a problem, it’s very important to train an algorithm that generalizes well. This means that the algorithm should learn to accurately perform tasks on data that look like, but are not the same as, the data used to train it.
That’s what makes this problem complicated: the extreme variability of handwriting styles makes it hard to generalize to every one of them. But let’s try to break it down a bit further.
Whether you just had your lunch or dinner at a restaurant or you are a cab driver just dropping off your customer, you are not in the best conditions to fill in the receipt with your best handwriting. As a result, your writing can be a bit messy and differ a lot from your usual handwriting. Let’s take a look at a few examples and try to identify what varies:
- Confusion: The tip is barely readable. Without any context, it’s difficult to say whether this is a 9 or a 2.
- Character size & confusion: Each digit doesn’t have the same size. Besides, the last two digits can be read either as 0 or 6.
- Spacing: This one is very well-written and readable. The problem is in the high spacing before the decimals.
- Spacing & confusion: Large spacing before the decimals, and the last digit can very easily be confused between 1 and 7. (Feel free to place a bet in the comments of this article for 1 or 7.)
- Strong confusion: We can assume that what’s written is 77.10 but we know that thanks to the decimals. We’ll talk a bit more about this problem in the next section.
Let’s try to explain why those aspects of handwriting are a problem for handwritten OCR.
OCR stands for Optical Character Recognition, while ICR stands for Intelligent Character Recognition. In a nutshell, ICR is used for complex writing such as handwriting, whereas OCR refers more generally to printed text. ICR can also refer to recognition algorithms that include a learning capability.
For handwritten text, adding intelligence to the reading process matters for machines just as it does for humans. To read a printed word, you can simply separate the characters and read them one by one. But for many of the reasons shown in our examples above, this is not enough for handwriting: you need to understand more of the context around each word to read it accurately. Here is why:
Let’s get back to example 5 of our previous section and take a look at each character separately. You can see how complex it can be to read each digit properly when it’s handwritten compared to printed digits.
When looking at the first two handwritten digits, it’s very hard to say whether they are “7” or “1”. The only thing that helps us, as humans, to infer that they are sevens is the fact there is a clear “1” for the third digit. Because we have no possible confusion for this third character, it’s then easy to understand that the first two characters are “7”, because it’s not how the person writes “1”.
This is an example of why handwritten text recognition can be considered more intelligent than a traditional OCR approach.
Non-uniform character sizing, overlapping, artefacts, blur, rotations, ink stains, inconsistent character spacing… Handwriting comes with many types of noise, in addition to unconventional writing styles. This makes it very hard for older computer vision techniques to recognize text.
The reason is that text recognition in the pre-deep learning era consisted of a set of pre-processing steps such as denoising, binarization, line detection, and character separation, with the end goal of producing a normalized set of features that could be compared character by character to normalized reference characters. Applying these techniques to handwriting doesn’t give accurate results, but deep learning does.
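To make the classical pipeline concrete, here is a minimal Python sketch of two of its steps, binarization and character separation, using NumPy. It illustrates the general pre-deep-learning approach and is not code from our product; the function names and the mean-based threshold are assumptions chosen for brevity.

```python
from typing import List, Optional, Tuple

import numpy as np


def binarize(gray: np.ndarray, threshold: Optional[float] = None) -> np.ndarray:
    """Return a boolean ink mask: True where a pixel is darker than the threshold.

    If no threshold is given, the image mean is used (a simplistic assumption).
    """
    if threshold is None:
        threshold = float(gray.mean())
    return gray < threshold


def split_characters(ink: np.ndarray) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges for each run of columns that contain ink."""
    has_ink = ink.any(axis=0)
    segments = []
    start = None
    for x, flag in enumerate(has_ink):
        if flag and start is None:
            start = x                     # a new character begins
        elif not flag and start is not None:
            segments.append((start, x))   # the character ended at the previous column
            start = None
    if start is not None:                 # a character touches the right border
        segments.append((start, len(has_ink)))
    return segments


# Toy 5x9 "image": white background (255) with two dark strokes (0).
image = np.full((5, 9), 255, dtype=np.uint8)
image[1:4, 1:3] = 0  # first character occupies columns 1-2
image[1:4, 5:8] = 0  # second character occupies columns 5-7

print(split_characters(binarize(image)))  # [(1, 3), (5, 8)]
```

This works on cleanly separated printed characters. On handwriting, digits that touch or overlap merge into a single segment, and wide gaps such as the spacing before the decimals in the examples above are hard to interpret, which is one reason this kind of pipeline does not carry over well.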
Deep learning models generalize so well that they are able to learn by themselves from examples. Compared to the former techniques, data scientists don’t have to define a complex pre-processing pipeline, because the deep learning algorithm will, in a way, learn it by itself.
Here is the paper we used as the basis for our handwritten text recognition algorithm. It was originally proposed as an approach for irregular text recognition:
The important thing in this model, compared to simpler models, is the addition of attention. Attention is a mechanism that lets the model make connections between different parts of the image, which enables it to learn patterns from those connections.
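To give an intuition of what “making connections between different parts of the image” means, here is a minimal NumPy sketch of scaled dot-product attention, a common generic formulation. This is an assumption made for illustration: the exact attention mechanism used in the paper and in our model may differ, and the region features below are random placeholders.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention(queries: np.ndarray, keys: np.ndarray, values: np.ndarray):
    """Scaled dot-product attention.

    Each query scores every key; the scores become weights that mix the values.
    Returns the attended features and the attention weights.
    """
    dim = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(dim)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights


# Placeholder features for 4 image regions, 8 dimensions each.
rng = np.random.default_rng(0)
regions = rng.normal(size=(4, 8))

# Self-attention: every region attends to every other region.
attended, weights = attention(regions, regions, regions)
print(attended.shape)        # (4, 8)
print(weights.sum(axis=-1))  # each row of weights sums to 1
```

Each row of `weights` says how much one region draws on every other region. In the “77.10” example, this is the kind of mechanism that lets an ambiguous digit be read in light of a clearly written one elsewhere in the amount.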
By slightly modifying the model from the paper and training it on a lot of data, we were able to achieve a very fast model (~10ms per image) with very high accuracy for reading handwritten amounts. We will explain exactly what we did to train the algorithm in a future blog post, but here is one of the tricky examples we used earlier (you can test it yourself by signing up on the platform, for free, no credit card required):
The release includes:
- Handwritten support for tips and total amounts
- Updated response schema with new extracted fields:
  - Total excluding taxes
  - Total amount (formerly total_incl)
  - The tax base is now extracted on each tax item
- Detection of 39 new currencies (44 currencies total now): EUR, GBP, CHF, USD, CAD, CZK, NOK, SEK, HUF, RON, PLN, RUB, DKK, XPF, TRY, MXN, COP, BRL, CLP, ARS, AED, SAR, QAR, ILS, OMR, CNY, PHP, SGD, HKD, JPY, MYR, KRW, TWD, THB, VND, INR, IDR, DZD, MAD, TND, XOF, ZAR, XAF, AUD
We’d be more than happy to collect your feedback on our brand-new handwritten receipt OCR. Let us know in the chat if you have any questions. You can also join our Slack community if you want to discuss it with the data scientists who worked on this release!