Optical Character Recognition (OCR) Explained
At some point, most of us have needed to scan an ID, invoice, or receipt and pull out the information printed on it. Similarly, you may have PDF documents containing information that you need to extract. Optical character recognition (OCR) is the technology that makes this possible.
This article will provide you with a clear picture of the potential of OCR, how it works, and the most prevalent (and helpful) uses of this technology.
OCR stands for Optical Character Recognition. The technical definition refers to software technologies capable of capturing text elements from images or documents and converting them into machine-readable text format.
Bare OCR technologies have a limited scope of use. Without any extra processing layer, they are usually used to retrieve machine-encoded text from images or scanned documents and store it in document management software.
For other use cases, such as key information extraction from documents, another layer of intelligence must be built on top of the OCR output, based on semantic information or other visual or natural language features. Regardless of the context, OCR is the first step whenever a workflow needs structured text input from scanned documents or photos.
Therefore, the acronym OCR is commonly used to refer to any OCR-based application such as a consumer mobile application turning a picture into an editable text document. For example, software capable of extracting key information from invoices is often called an “Invoice OCR”.
In elementary school, you learned to perform image-to-text conversion using your eyes as visual input and your knowledge of character shapes as a model. Admittedly, at that age you could read neatly written or typed words on a sheet of paper, but you might have struggled to read handwritten graffiti.
Before diving into OCR technology, it’s important to understand its main use cases and how it is used within software. Extracting structured text from images helps solve many real-world problems. We can divide the use cases into two distinct visual domains:
- OCR in the wild: a love message written on the sand of the beach
- OCR on documents: your train ticket
These two domains display significant differences in terms of element density (much higher in documents) and spatial/structural arrangements. In this article, we will focus on the more specific domain: OCR on documents.
Image-to-text conversion is the most common OCR use case. It refers to any web or mobile application that transforms a picture containing text into plain machine-readable text. The main goal of these tools is to help the user transcribe all the text captured in a photo in a few seconds, instead of manually typing it out. The resulting text can then be copied to the user’s clipboard and used on their device.
Also known as dematerialization, digitizing a document consists of creating a machine-readable text copy of a document to store it in document management software. The copy can be either plain text or an editable version with the same layout. In the second scenario, the OCR must detect the position of each word along with its text content in order to reconstruct the document’s layout.
Most digitization companies provide their clients with the appropriate hardware (scanners) to handle the conversion from paper documents to digital data.
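To illustrate why word positions matter for layout-preserving digitization, here is a minimal Python sketch that rebuilds text lines from positioned words. The `Word` structure, its coordinate convention, and the `line_tolerance` value are assumptions made for this example and do not reflect any particular OCR engine's output format.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word box (assumed page units)
    y: float  # vertical center of the word box (assumed page units)


def words_to_lines(words, line_tolerance=5.0):
    """Group words into lines by vertical position, then order each line left to right."""
    lines = []  # each entry is a list of words that share roughly the same y
    for word in sorted(words, key=lambda w: w.y):
        # Compare against the first word of the current line to decide whether to start a new one.
        if lines and abs(lines[-1][0].y - word.y) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]


# Hypothetical OCR output for a train ticket, deliberately out of reading order.
words = [
    Word("Paris", 120, 10), Word("Train", 10, 11), Word("to", 80, 9),
    Word("Seat", 10, 40), Word("12A", 60, 41),
]
print(words_to_lines(words))  # ['Train to Paris', 'Seat 12A']
```

A real layout reconstruction would also have to handle columns, tables, and skewed scans, which this sketch ignores.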
Any archive of unstructured documents can be transformed into machine-readable text in order to make each document searchable using a natural language query. Using only an OCR, you can simulate a CTRL/CMD+F search on the text contained in a scanned document. For more advanced use cases, you will likely need to build a search engine to look up different pieces of semantic information written in your documents. Adding key information extraction features on top of your OCR might be required before indexing the extracted data in a search engine.
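As a rough sketch of the CTRL/CMD+F scenario, the snippet below searches the OCR text of each page for a query string. It assumes the OCR step has already produced one plain-text string per page; the function name `find_in_pages` and the sample pages are made up for illustration.

```python
def find_in_pages(pages, query):
    """Return (page_number, character_offset) for each case-insensitive match of query."""
    hits = []
    needle = query.lower()
    if not needle:
        return hits
    for page_number, text in enumerate(pages, start=1):
        haystack = text.lower()
        start = haystack.find(needle)
        while start != -1:
            hits.append((page_number, start))
            start = haystack.find(needle, start + 1)
    return hits


# Hypothetical OCR output, one string per scanned page.
pages = [
    "Invoice 2024-001 Total due: 120.00 EUR",
    "Payment terms: total due within 30 days",
]
print(find_in_pages(pages, "total due"))  # [(1, 17), (2, 15)]
```

This only finds literal matches. Searching by meaning (for example, "all invoices from a given vendor") needs the extraction and indexing steps mentioned above.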
Document classification is the task of automatically assigning a new document a type from a predefined set of document classes. The role of the OCR, in this case, is to extract all the words and characters from a document so that they can be used as features for a downstream classifier. This classifier can be based on simple keyword detection rules (e.g. taxes for invoices or receipts, a passport number for passport documents) or on machine learning (ML) algorithms for more complex classification. Using an ML model is a real advantage for this task, as it doesn’t require an extremely robust OCR to achieve very high performance.
- In procurement workflows: Given a new invoice-like document, the actions taken in a procurement workflow are different depending on the document’s type. An order form needs to be approved, an invoice needs to be paid, and finally, a receipt needs to be uploaded to the accounting system.
- In expense management: Expense management software uses the type of each new receipt to apply the right approval process or to detect potential fraud. For instance, some companies only partially reimburse their employees’ expenses for gas or restaurants but fully reimburse parking or hotel receipts. Some also don’t reimburse their employees’ restaurant expenses if the time on the receipt falls outside a specific time window.
- In loan applications: For some loan application processes, the applicant sends a single PDF containing a set of different documents (IDs, tax returns, payslips, a certificate of incorporation for businesses, etc.). That PDF must be split into separate documents by type so that each one can be sent to the appropriate workflow.
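The keyword-rule approach to classification, and the PDF-splitting scenario from the loan application example, can be sketched in a few lines of Python. The keyword sets, the `classify` and `split_by_type` names, and the sample page texts are all hypothetical; a production system would more likely use a trained ML classifier.

```python
from itertools import groupby

# Hypothetical keyword rules: each document type maps to words that hint at it.
KEYWORD_RULES = {
    "invoice": {"invoice", "vat", "due"},
    "receipt": {"receipt", "cashier", "change"},
    "passport": {"passport", "nationality"},
}


def classify(ocr_text, rules=KEYWORD_RULES, default="unknown"):
    """Pick the type whose keywords appear most often among the OCR'd words."""
    tokens = [t.strip(".,:;()").lower() for t in ocr_text.split()]
    scores = {label: sum(t in keywords for t in tokens) for label, keywords in rules.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def split_by_type(page_texts):
    """Group consecutive pages with the same predicted type into (type, [page numbers])."""
    labeled = [(classify(text), number) for number, text in enumerate(page_texts, start=1)]
    return [(label, [n for _, n in group]) for label, group in groupby(labeled, key=lambda p: p[0])]


# Hypothetical OCR text for each page of a single uploaded PDF.
pages = [
    "Passport No. X123 Nationality: French",
    "INVOICE 42 VAT 20% Total due: 120.00",
    "Invoice 42 (continued) VAT details",
    "Receipt Cashier: 7 Change: 0.50",
]
print(split_by_type(pages))
# [('passport', [1]), ('invoice', [2, 3]), ('receipt', [4])]
```

Because the score only counts keyword hits, a few misread characters elsewhere on the page rarely change the predicted type, which is why classification tolerates a less robust OCR.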
Key information extraction is probably the most complicated and crucial use case in the workflow automation space. It consists of extracting specific pieces of information from documents and outputting them in a machine-readable format that can be used in other software.
Compared to the document classification problem, the performance of the OCR is extremely important for this use case, as it doesn’t rely on an overall interpretation of the text content but on specific, exact sequences of characters.
- In customer onboarding workflows: This involves getting structured data from legacy documents in order to populate customers’ account information in the required software. In contract management software, for example, users need their past and ongoing contracts to be stored with their key data to benefit from the product features. These key data mainly include the parties, contract type, start and end dates, and the renewal policy. Automating key information extraction (KIE) from contracts is important to ease the onboarding process.
- In procurement workflows: Accounts payable (AP) automation requires extracting the key data written in invoices to automate or ease the payment process of incoming invoices. The use of the term “invoice OCR” for this use case is widespread but it doesn’t correspond to the technical definition of an OCR. The set of data that needs to be extracted slightly depends on the workflow and the software specifications, but it generally includes:
- Vendor name (sometimes along with a company identifier such as a tax identification number (TIN), SIRET, or VAT number)
- Invoicing date
- Payment due date
- Invoice total amounts: one including taxes and one excluding them
- Invoice total taxes (sometimes with the tax line details)
- Vendor bank wire details
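As an illustration of what the output of such an extraction could look like, here is a minimal Python sketch. The `InvoiceFields` structure, the label patterns, and the sample invoice text are assumptions tied to one made-up template; regular expressions stand in here for the ML-based extraction layer a real system would typically need.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class InvoiceFields:
    vendor_name: Optional[str] = None
    invoice_date: Optional[str] = None
    due_date: Optional[str] = None
    total_excl_tax: Optional[float] = None
    total_tax: Optional[float] = None
    total_incl_tax: Optional[float] = None
    bank_details: Optional[str] = None  # not extracted in this sketch


# Hypothetical label patterns for a single invoice template.
PATTERNS = {
    "invoice_date": r"^Invoice date:\s*(\d{4}-\d{2}-\d{2})",
    "due_date": r"^Due date:\s*(\d{4}-\d{2}-\d{2})",
    "total_excl_tax": r"^Total excl\. tax:\s*(\d+(?:\.\d+)?)",
    "total_tax": r"^Tax:\s*(\d+(?:\.\d+)?)",
    "total_incl_tax": r"^Total incl\. tax:\s*(\d+(?:\.\d+)?)",
}


def extract_fields(ocr_text):
    """Fill an InvoiceFields record from OCR text using the label patterns above."""
    fields = InvoiceFields()
    lines = ocr_text.strip().splitlines()
    if lines:
        fields.vendor_name = lines[0].strip()  # assumption: the vendor name is the first line
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, ocr_text, flags=re.MULTILINE)
        if match:
            value = match.group(1)
            setattr(fields, name, float(value) if name.startswith("total") else value)
    return fields


sample = """ACME Supplies
Invoice date: 2024-03-01
Due date: 2024-03-31
Total excl. tax: 100.00
Tax: 20.00
Total incl. tax: 120.00
"""
print(extract_fields(sample))

# A single OCR misread (letter "O" instead of digit "0") is enough to lose a field.
noisy = sample.replace("Due date: 2024", "Due date: 2O24")
print(extract_fields(noisy).due_date)  # None
```

The second call shows why OCR accuracy matters so much more here than for classification: one wrong character in the due date makes the whole field unrecoverable by this rule.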
Extracting additional key data beyond the fields listed above can help cover more procurement use cases:
- Three-way matching consisting of associating an order form, an invoice, and a payment receipt using the purchase order number (PO) that needs to be extracted from the three documents;
- Accounts receivable (AR) and factoring automation requiring the automatic extraction of the client name and address;
- Procurement approval, which requires extracting the line items written in the invoices. Line item extraction doesn’t exactly fit the key information extraction problem, as it relies on table detection and structuring.
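Three-way matching, mentioned in the list above, can be sketched as a grouping step once the PO number has been extracted from each document. The document dictionaries, the type names, and the whitespace/case normalization below are assumptions made for this example.

```python
from collections import defaultdict

# The three document types that must share a PO number (names are hypothetical).
REQUIRED_TYPES = {"order_form", "invoice", "receipt"}


def three_way_match(documents):
    """Group documents by PO number and separate complete sets from incomplete ones."""
    by_po = defaultdict(dict)
    for doc in documents:
        po = doc["po_number"].strip().upper()  # assumed normalization of the extracted value
        by_po[po][doc["type"]] = doc["id"]
    matched = {po: docs for po, docs in by_po.items() if REQUIRED_TYPES <= set(docs)}
    pending = {
        po: sorted(REQUIRED_TYPES - set(docs))
        for po, docs in by_po.items()
        if po not in matched
    }
    return matched, pending


# Hypothetical extraction results for four documents.
documents = [
    {"id": "po-1", "type": "order_form", "po_number": "PO-1001"},
    {"id": "inv-1", "type": "invoice", "po_number": "po-1001 "},
    {"id": "rec-1", "type": "receipt", "po_number": "PO-1001"},
    {"id": "inv-2", "type": "invoice", "po_number": "PO-1002"},
]
matched, pending = three_way_match(documents)
print(matched)  # {'PO-1001': {'order_form': 'po-1', 'invoice': 'inv-1', 'receipt': 'rec-1'}}
print(pending)  # {'PO-1002': ['order_form', 'receipt']}
```

The match is only as reliable as the extracted PO numbers: a misread character would leave the document stranded under a PO number of its own.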
Mindee provides OCR APIs that developers use to parse documents. With a free API key, you can quickly start using these APIs to extract information from your documents.