Invoice OCR line items extraction
An invoice is a document critical to every business because of the information it contains and the workflows it triggers. That is why Invoice OCR technologies are crucial to processing invoices efficiently.
Invoices are key to business operations:
- An invoice sums up the expenses of a business. It must be properly tracked to evaluate business spending and cash flow. This is the field of Accounting.
- Transactions between businesses rely on the payment of an invoice for the services or goods delivered. This is the field of Accounts Payable.
- Business relationships also rely on invoices: what is billed should match what has been ordered. This is the field of Procurement.
When it comes to Invoice processing, the first step is to get access to the information it contains thanks to Invoice OCR technologies. Each use case leverages different information on an Invoice document:
- For example, an Accounts Payable SaaS will require reading the `total amount` field, as this is mandatory to make a payment to a supplier.
- On the other hand, a Procurement SaaS focuses on matching various documents within the 3-way matching process. A Purchase Order (PO) is matched with a Delivery Note and an Invoice to make sure that goods ordered, delivered, and billed are the same. For this use case, having details about the content of the Invoice is essential. It requires an understanding of each item on the Invoice.
To cover those different use cases, Invoice OCR technologies should provide key information extraction for generic fields (Invoice Number, Total Amount, Supplier Information, etc.) and also for table-level information called line items. Being able to decipher the content of a table in an invoice to understand each item's attributes (such as price, quantity, description, etc.) is a challenge that we will cover in this article.
We will start with a couple of definitions about OCR, line items and why extracting line items from an invoice using OCR is strategic for businesses. Then we will focus on the problem: why it is a complex problem to tackle, and how at Mindee we have been solving it using Deep Learning.
One of the advantages of Mindee is its key information extraction. Here's an article to learn more about information management.
First, let’s clarify the difference between OCR and Invoice OCR.
OCR refers to the process of detecting and extracting words out of an image. It is generic, meaning it does not depend on the context: it can run on any type of image or document.
If you are curious to learn more about OCR technology, this is explained in this thorough article.
On the contrary, Invoice OCR refers to the detection and extraction of key information in a specific document: an Invoice. It does depend on the context.
Now let’s focus on key information extraction. What does that mean?
Let's say you are building software for Accounts Payable automation. The key information you are interested in extracting is probably the `total amount`. Accessing this data automatically thanks to an Invoice OCR API will save your users time: they won't have to manually input the invoice amount. This removes both friction and manual error.
As you look at the layout of the invoice below, you can notice that apart from the generic fields (`Total amount`, […]), there is one specific and central element: the table.
Table extraction or line item extraction is critical to a lot of different use cases. Let’s see why.
A line item refers to a line in an invoice that describes goods or services that have been purchased. It identifies the goods or services with information such as description, quantity, or price.
Let’s take an example.
You have ordered 5 computers from your favourite consumer electronics dealer. A purchase order was issued indicating that you ordered 5 computers at $1,500 each. The dealer has run out of stock and could not deliver the full order. Instead, you received 3 computers in a first delivery batch. The delivery note contains that information.
When receiving the Invoice, you want to make sure that you have been billed for 3 computers, not 5. This is what the 3-way matching process in procurement is all about: matching items from an Invoice with items that have been ordered and delivered, for verification, reconciliation, and validation purposes.
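As an illustration, a minimal 3-way matching check might look like the following sketch. The item schema and the `product_code` matching key are assumptions made for the example, not an actual procurement system:

```python
# Hypothetical sketch of 3-way matching: field names and the product_code
# matching key are assumptions, not a real procurement schema.

def three_way_match(po_items, delivered_items, invoice_items):
    """Flag discrepancies between what was ordered, delivered, and billed."""
    discrepancies = []
    for inv in invoice_items:
        code = inv["product_code"]
        ordered = next((i for i in po_items if i["product_code"] == code), None)
        delivered = next((i for i in delivered_items if i["product_code"] == code), None)
        delivered_qty = delivered["quantity"] if delivered else 0
        if inv["quantity"] != delivered_qty:
            discrepancies.append(
                f"{code}: billed {inv['quantity']}, delivered {delivered_qty}"
            )
        if ordered and inv["unit_price"] != ordered["unit_price"]:
            discrepancies.append(f"{code}: unit price mismatch")
    return discrepancies

po = [{"product_code": "PC-01", "quantity": 5, "unit_price": 1500}]
delivery = [{"product_code": "PC-01", "quantity": 3}]
invoice = [{"product_code": "PC-01", "quantity": 5, "unit_price": 1500}]
print(three_way_match(po, delivery, invoice))
# flags that 5 computers were billed but only 3 were delivered
```

With the computers example above, the check flags that 5 units were billed although only 3 were delivered.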
As a user, I don’t want to manually enter each item from an Invoice so that they can be matched with the content of the purchase order or the delivery note. This process is too tedious, time-consuming, and error-prone. If the Procurement software I use can scan the Invoice, and read the information for each item that I just need to validate then I am a happy user 🙂.
Let's take another example, in the Accounting field this time. The bookkeeping process is about recording expenses in the appropriate book based on their nature. To do so, the accountant who receives an invoice needs to scan each line of the invoice and attribute each item to the relevant book. Automatically extracting the content of lines in an invoice simplifies the accountant's work within their accounting software.
If you look at a wide variety of invoices, you will notice that when it comes to line items or tables, there is a set of common fields that is standard to most invoice templates.
To choose what fields need to be extracted, our approach is to understand what information is relevant to which use case.
For instance, for the Procurement use case, the purpose is to match an item from an Invoice with a pre-approved order. Therefore, we need to answer the following questions:
How can an item be identified?
- Product code or Description gives information about the item itself.
Does the billed item match what has been ordered?
- Quantity, Unit Price and/or Total Price, Discount rate/Discount Amount will help do the math and identify discrepancies.
For Accounting, tax information is important for tax recovery, then we needed to extract information to be able to answer the following question:
What is the tax applied to the item?
- Tax rate and/or tax amount
Having answered those key questions and with a deep understanding of the underlying use cases, we have decided to support the following fields in the Invoice OCR API Response:
- Product code
- Description
- Quantity
- Unit Price
- Unit of Measure
- Total Amount
- Tax Rate
- Tax Amount
- Discount Rate
- Discount Amount
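To make this concrete, a line item in an API response might look like the following illustrative sketch. The keys and values are simplified for the example and are not the exact Mindee schema:

```python
# Illustrative line-item payload; keys are simplified, not the exact schema.
line_item = {
    "product_code": "PC-01",
    "description": "Laptop computer 15\"",
    "quantity": 3,
    "unit_of_measure": "unit",
    "unit_price": 1500.00,
    "discount_rate": None,
    "discount_amount": None,
    "tax_rate": 20.0,
    "tax_amount": 900.00,
    "total_amount": 4500.00,
}
assert len(line_item) == 10  # the 10 supported line-item fields
```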
Understanding the problem of extracting line items starts with analysing a significant amount of documents.
Let’s take a look at a couple of examples:
- There are, on one hand, easy layouts: the table is structured, and there are lines and headers for all columns.
Unfortunately, there is also a wide variety of configurations where the table is more complicated to read, even for human eyes.
- In example 1, the description fits on 2 lines, but there is a demarcation in between that creates a separation.
- Sometimes columns are merged on some lines: a lack of consistency within a table makes it complicated to read, as displayed in example 2.
- A table without headers, such as in example 3, will require extra effort to understand the content of each line item.
- For example 4, we can wonder what a line is, and what the relevant content to extract is.
- When it comes to invoice processing, documents are sometimes scanned, as in example 5. Image deformation can make it harder to read a line that geometrically no longer looks like a straight line.
As we have seen in the examples above, being able to extract the information from a table is challenging because of the diversity of table configurations. To crack this complex problem, we have implemented complementary approaches, relying on a combination of computer vision with geometrical heuristics and business-specific rules. The illustration below represents the different steps for processing an invoice and extracting the relevant fields in a line item.
We will deep dive into each part of this pipeline, but first let's take a look at the definition of a table.
A table contains:
- headers: the title of a column, which defines its content (⚠️ some tables do not have headers)
- cells: the boxes holding each value
- lines and columns: the grid that organises the cells
Each element from the table is critical to each step of our pipeline:
- Step 1: the deep learning segmentation model will detect boxes that could be either cells or headers. They are called candidates.
- Step 2: OCR models will extract all words from the document and their coordinates (bounding boxes).
- Step 3: A Table reconstruction algorithm using dynamic programming will help reconstruct lines and columns.
- Step 4: Business rules post-processing will help select relevant outputs.
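The 4 steps above can be sketched as a single pipeline. This is a hypothetical, heavily stubbed outline: the function bodies below are trivial placeholders standing in for the real models, not Mindee's actual implementation:

```python
# Hypothetical sketch of the 4-step pipeline; each step is stubbed with a
# trivial placeholder so the end-to-end flow is runnable.

def segment(image):
    # step 1: deep learning segmentation -> candidate boxes with a class
    return [{"bbox": (0.1, 0.2, 0.4, 0.25), "class": "description"}]

def run_ocr(image):
    # step 2: text recognition -> words and their coordinates
    return [{"text": "Laptop", "bbox": (0.1, 0.2, 0.2, 0.25)}]

def reconstruct_table(candidates, words):
    # step 3: dynamic-programming reconstruction of lines and columns
    return [{"description": "Laptop"}]

def apply_business_rules(grid):
    # step 4: post-processing keeps only consistent line items
    return [row for row in grid if "description" in row]

def extract_line_items(image):
    candidates = segment(image)
    words = run_ocr(image)
    grid = reconstruct_table(candidates, words)
    return apply_business_rules(grid)

print(extract_line_items("invoice.jpg"))  # [{'description': 'Laptop'}]
```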
Our segmentation model has been trained with multiple classes for each category that we defined as a potential output (Product Code, Description, Unit Price, etc.). It learns to detect cells for each category and also column headers.
We analyzed the output of our segmentation models and identified different types of recurring errors that we needed to fix in order to improve end-to-end performance.
As illustrated in the image above, errors can be of 4 different types:
- Missing candidates: the segmentation model has not detected some cells in the table.
- Split columns: single columns are detected as multiple columns.
- Candidates overlapping lines: for example, 2 `description` cells on 2 different lines are detected as a single long cell.
- Category classification errors: some boxes are detected in the wrong category, for example a `description` box detected as another class such as `unit price`.
Being able to reconstruct the table, which means attributing cells to a grid of columns and lines, would help fix all of those issues. The approach taken for table reconstruction will be described further in step 3. Another input is required for the reconstruction algorithm: access to the document content, i.e. words and their coordinates.
The Text Recognition algorithm will be used:
- to feed the table reconstruction algorithm. Using the content and coordinates of candidate boxes will help to define line and column boundaries.
- to read words and feed the final output of the API response, as described in the example below.
To improve end-to-end performance, we use a combination of generic and specialised OCR models. For instance, we have trained a specialised OCR for amounts, one for dates, etc.
Our Invoice OCR API works on typed documents at the moment, but we plan in the future to support handwritten documents with a dedicated handwritten OCR. We already made this technology available on our Receipt OCR API to support handwritten fields such as tips and written amounts.
The next step in the pipeline is the table reconstruction that will use outputs from both the segmentation model and the text recognition models.
Table reconstruction means building a grid of cells with columns and lines. The algorithm defines boundaries for each column and line. Reconstructing a table based on its lines and columns is a recursive problem. To make sure we limit the complexity, we have decided to use dynamic programming, a standard computer programming technique dedicated to reducing the complexity of recursive problems.
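As a toy illustration of why dynamic programming helps here (this is not Mindee's actual reconstruction algorithm), consider choosing vertical column boundaries that slice through as few candidate boxes as possible. The recursion has overlapping subproblems, so memoization avoids re-solving them:

```python
from functools import lru_cache

# Toy dynamic-programming illustration only: pick k vertical cut positions
# that cross as few candidate boxes as possible, memoizing subproblems.

boxes = [(0, 10), (12, 30), (33, 50)]    # x-spans of candidate cells
cut_positions = [5, 11, 20, 31, 32, 40]  # possible column boundaries

def crossings(x):
    """Number of candidate boxes a vertical cut at x would slice through."""
    return sum(x0 < x < x1 for x0, x1 in boxes)

@lru_cache(maxsize=None)
def best_cuts(i, k):
    """Minimal total crossings choosing k cuts from cut_positions[i:]."""
    if k == 0:
        return 0
    if len(cut_positions) - i < k:
        return float("inf")
    take = crossings(cut_positions[i]) + best_cuts(i + 1, k - 1)
    skip = best_cuts(i + 1, k)
    return min(take, skip)

print(best_cuts(0, 2))  # 0: cuts at 11 and 31 cross no box at all
```

Without memoization the recursion would revisit the same `(i, k)` states exponentially many times; with it, each state is solved once.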
For each line, we look at segmentation candidates and OCR candidates, and we use rules to determine column and line boundaries.
There are some basic rules:
- 1 column = 1 class, and only one. There can't be a `description` cell in a column classified as `unit price`.
- A cell can only belong to one column.
- A `description` can span multiple lines within a cell.
- A price (e.g. `total amount`) can only be on a single line within a cell.
Using this grid will help solve the 4 issues identified after segmentation. Let’s take a look.
Sometimes the segmentation model has not detected a candidate in a line. This leaves an empty cell in our table.
Mapping the output from the OCR model within the grid helps identify when there is effectively relevant content in a cell left empty by the segmentation model. If that is the case, the gap can be filled and the line reconstructed properly.
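The gap-filling idea can be sketched as follows; the word and cell representations (center coordinates, normalized bounding boxes) are simplified assumptions for the example:

```python
# Hedged sketch of gap filling: when a grid cell has no segmentation
# candidate, look for OCR words whose center falls inside its boundaries.

def fill_missing_cell(cell_bbox, ocr_words):
    """Return the text of OCR words inside an empty grid cell, if any."""
    x0, y0, x1, y1 = cell_bbox
    inside = [
        w["text"] for w in ocr_words
        if x0 <= w["cx"] <= x1 and y0 <= w["cy"] <= y1  # word center in cell
    ]
    return " ".join(inside) or None

words = [{"text": "4500.00", "cx": 0.9, "cy": 0.42}]
print(fill_missing_cell((0.8, 0.4, 1.0, 0.45), words))  # "4500.00"
```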
If the segmentation model splits a column in 2 on one line but not on the others (like in the example below), it must be checked whether there is in fact 1 column or 2 for this area of the document. To make this decision, the grid is used to evaluate what constitutes a column.
Let's say the segmentation model has merged 2 lines of `description` together. There would be only 1 candidate for those 2 lines. Now, if we look at the generic OCR output, we can find that there are 2 candidates, 1 for each line of `description`. By mapping the segmentation candidates onto the grid, we can split the single segmentation candidate into 2 candidates. The 2 `description` lines can then be reconstructed properly.
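A minimal sketch of this splitting step, assuming line boundaries are simple vertical `(y0, y1)` intervals:

```python
# Sketch of splitting a merged candidate: clip the candidate's vertical
# span against each reconstructed grid line it overlaps.

def split_candidate(candidate_span, line_spans):
    """candidate_span and line_spans are (y0, y1) vertical intervals."""
    y0, y1 = candidate_span
    pieces = []
    for ly0, ly1 in line_spans:
        top, bottom = max(y0, ly0), min(y1, ly1)
        if bottom > top:  # the candidate overlaps this grid line
            pieces.append((top, bottom))
    return pieces

# one candidate overlapping 2 grid lines becomes 2 candidates
print(split_candidate((0.40, 0.50), [(0.40, 0.45), (0.45, 0.50)]))
```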
A column can be of one class and only one. In fact, in a `description` column, there can only be `description` fields; there can't be `unit price` cells. How can we determine the relevant category when multiple categories are displayed in a single column? First, we count the number of occurrences of each category within the column. The most frequent one is probably the relevant one. We double-check using the type of element, thanks to OCR and parsing results: if the type of the cell is `amount` and the category identified is `unit price`, then the choice is validated.
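This majority vote with a type double-check can be sketched as follows; the `EXPECTED_TYPE` lookup table is an assumption introduced for the example, not part of the real system:

```python
from collections import Counter

# Sketch of the column-class vote: the parsed "type" double-check is
# simplified to a lookup table (an assumption for the example).
EXPECTED_TYPE = {"unit_price": "amount", "description": "text"}

def column_class(cell_categories, parsed_type):
    """Pick the majority category, validated against the parsed value type."""
    winner, _ = Counter(cell_categories).most_common(1)[0]
    return winner if EXPECTED_TYPE.get(winner) == parsed_type else None

cats = ["unit_price", "unit_price", "description", "unit_price"]
print(column_class(cats, "amount"))  # "unit_price"
```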
The last step of the pipeline is what we call post-processing. It is an extra layer of business rules to make sure that candidates are relevant: the last mile of checks before attributing a candidate to a field in the API response.
Extra rules are based on business logic and a deep understanding of what an invoice is.
A couple of examples:
- Does the calculation of `total amount` hold for each line item? `unit price * quantity - discount = total amount`
- If we sum `total amount` for all the lines, does it match the invoice's overall `total amount`, which is already extracted by our Invoice OCR API? `sum(total amount) = Invoice Total Amount`
- Some lines are excluded when they do not fit the minimum pattern, which is having at least an item description (`description`) and pricing information (e.g. `unit price` or `total amount`).
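The two arithmetic checks can be sketched as follows (a simplified illustration: the field names and the rounding tolerance are assumptions for the example):

```python
# Sketch of the two consistency checks; a small tolerance absorbs rounding.

def line_total_ok(item, tol=0.01):
    """unit price * quantity - discount should equal the line total."""
    expected = item["unit_price"] * item["quantity"] - item.get("discount", 0)
    return abs(expected - item["total_amount"]) <= tol

def invoice_total_ok(items, invoice_total, tol=0.01):
    """The line totals should sum to the invoice's overall total amount."""
    return abs(sum(i["total_amount"] for i in items) - invoice_total) <= tol

items = [{"unit_price": 1500.0, "quantity": 3, "discount": 0, "total_amount": 4500.0}]
print(line_total_ok(items[0]), invoice_total_ok(items, 4500.0))  # True True
```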
When all those steps are executed in a very short time, the output is ready to use. Our Invoice OCR API gives access to all 10 fields for each line item identified in the table.
Line items extraction in invoices is a complex problem to tackle. At Mindee, we have combined computer vision, geometric heuristics, text recognition, and a set of business rules validations to make sure that extraction performance is optimised.
It is a key feature of our Invoice OCR API, as it powers many use cases in accounting, procurement, accounts payable, and more.
This feature will soon be publicly released. In the meantime, reach out to us using our chat or through our Slack community if you want to know more about our Invoice OCR API.