In this article, we will demonstrate how to create ocrized PDFs from images, scanned PDFs, etc. to run word searches on them. Below is the breakdown of the two-step process that we use at Mindee to accomplish this task.
- First, we use an open-source tool called Mindee docTR to perform OCR (Optical Character Recognition) on the image or scanned PDF. The docTR OCR results are then exported as an XML file in hOCR format.
- Second, we convert the hOCR file to PDF using another open-source tool, OCRmyPDF.
Why would you want to ocrize your PDF?
We ocrize images, scanned documents, and PDFs so that we can search them for specific keywords or phrases. A few lines of code are all that's needed to do this. With the approach we present, we can also exhaustively ocrize text embedded within images, which is normally left out (logos, watermarks, etc.).
To better understand why we need to ocrize documents, let’s take a look at two use cases, which involve searching through a huge PDF and searching through a folder full of PDFs.
Below is a non-exhaustive list of documents that fall under the two use cases.
Searching through a huge PDF, such as:
- contracts (terms and conditions, loan contracts, employment contracts, etc.)
- specifications
- scientific and technical reports
- insurance policies
- request for information/request for quotation/request for proposals
Searching through a folder full of PDFs, such as:
- resumes (find a specific skill)
- questionnaires/forms (find a specific answer)
- invoices/receipts/quotations (find a specific item/customer/supplier)
- presentations (find any keyword)
- old scanned news articles (find specific news)
Creating an ocrized PDF makes it easier for non-developers and developers alike to search for a specific keyword in the various use cases listed above while using their favourite PDF reader. You can also check out our article on how to read the passport MRZ lines!
Why use docTR for ocrizing PDFs?
Quick catch-up with docTR
docTR is one of the best open-source OCR solutions available on the market. It uses state-of-the-art detection and recognition models to seamlessly process documents for Natural Language Understanding tasks. With just three lines of code, we can load a document and extract text with a predictor!
docTR offers pretrained models for both text detection and text recognition, such as the db_resnet50_rotation detection architecture. For more information on available architectures, please refer to the documentation page. Another major perk of using docTR over other open-source packages is that its models can be trained on text with small rotations, which makes docTR more robust for ocrization tasks. The list of supported vocabs can be found here.
docTR Performance
Using example datasets, the table below compares docTR against some alternative OCR solutions.
Note: The dataset used for the comparison could not be made public due to the sensitive information included in it.
We have also included some comparisons of public datasets FUNSD and CORD below.
The above OCR models have been evaluated using both the training and evaluation sets of FUNSD and CORD. For further information regarding the metrics being used, see Task evaluation.
Jumping to codebase!
To create lightweight ocrized PDFs using docTR and OCRmyPDF, we will start by installing Mindee docTR and OCRmyPDF.
# installing requirements
!pip install "python-doctr[tf]"
!pip install ocrmypdf
You can use this example or any image or scanned PDF of your choice, but for the sake of this tutorial, we are going to use the image below.
Below is our chosen image for the demo:
To download our sample image, you can run the following code:
Alternatively, you can download and save the image on your computer.
As mentioned earlier, we break the process into two steps:
- Define the output folders for the output PDF and the hOCR data related to the docTR results.
Then load the image.
Note: if you are using a scanned PDF, load it with the DocumentFile.from_pdf method instead of DocumentFile.from_images, then run OCR with docTR in the same way.
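The loading and OCR step described above can be sketched roughly as follows. This is a minimal sketch rather than the exact code used at Mindee: the folder layout (output/hocr and output/pdf) and the helper names make_output_dirs and run_doctr_ocr are hypothetical, and it assumes a recent docTR release in which both DocumentFile.from_images and DocumentFile.from_pdf return a list of page images.

```python
from pathlib import Path


def make_output_dirs(base_dir="output"):
    """Create and return the hOCR and PDF output folders (hypothetical layout)."""
    base = Path(base_dir)
    hocr_dir = base / "hocr"
    pdf_dir = base / "pdf"
    hocr_dir.mkdir(parents=True, exist_ok=True)
    pdf_dir.mkdir(parents=True, exist_ok=True)
    return hocr_dir, pdf_dir


def run_doctr_ocr(input_path):
    """Load an image or a scanned PDF and run docTR's pretrained OCR predictor."""
    # Imported lazily so that make_output_dirs works without docTR installed.
    from doctr.io import DocumentFile
    from doctr.models import ocr_predictor

    input_path = Path(input_path)
    if input_path.suffix.lower() == ".pdf":
        # Scanned PDF: one page image per PDF page (assumes a recent docTR release).
        pages = DocumentFile.from_pdf(str(input_path))
    else:
        # Single image file.
        pages = DocumentFile.from_images(str(input_path))

    predictor = ocr_predictor(pretrained=True)  # default detection + recognition models
    result = predictor(pages)
    return pages, result
```

Calling run_doctr_ocr on your own file returns the loaded pages together with the docTR result object. The pretrained weights are downloaded the first time the predictor is created.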
Below we can see the docTR result which shows the detected and highlighted text in the image.
- For the next step, export the docTR OCR results as an XML file in hOCR format.
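One possible way to write the hOCR export to disk is sketched below. It relies on the export_as_xml method of the docTR result, which returns one (XML bytes, ElementTree) pair per page. The helper name save_hocr and the doctr_page file-name prefix are hypothetical choices made for this illustration.

```python
from pathlib import Path


def save_hocr(result, hocr_dir, stem="doctr_page"):
    """Write one hOCR XML file per page of a docTR result and return the paths."""
    # export_as_xml() yields one (xml_bytes, element_tree) pair per page.
    xml_outputs = result.export_as_xml()
    paths = []
    for page_idx, (xml_bytes, _tree) in enumerate(xml_outputs):
        path = Path(hocr_dir) / f"{stem}_{page_idx}.xml"
        path.write_bytes(xml_bytes)  # hOCR is already serialized as bytes
        paths.append(path)
    return paths
```

Writing one file per page keeps each hOCR file paired with exactly one page image, which is the input the conversion step expects.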
After exporting the hOCR result of docTR as XML, we can use OCRmyPDF to convert it to an ocrized PDF.
Voilà! You have now created your ocrized PDF.
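The conversion can be sketched with the HocrTransform class from the ocrmypdf.hocrtransform module. Treat this as an assumption-laden sketch: HocrTransform is not part of OCRmyPDF's main public interface and its optional arguments have changed between releases, so only the hocr_filename, dpi, out_filename, and image_filename keywords are used here. The dpi value of 300, the helper names, and the output naming scheme are hypothetical.

```python
from pathlib import Path


def pdf_path_for(hocr_path, pdf_dir):
    """Derive the output PDF path from the hOCR file name (hypothetical naming scheme)."""
    return Path(pdf_dir) / (Path(hocr_path).stem + ".pdf")


def hocr_to_pdf(hocr_path, image_path, pdf_dir, dpi=300):
    """Combine one page image and its hOCR file into a searchable PDF."""
    # Imported lazily so that pdf_path_for works without OCRmyPDF installed.
    from ocrmypdf.hocrtransform import HocrTransform

    out_path = pdf_path_for(hocr_path, pdf_dir)
    # dpi=300 is an assumed resolution; it controls the physical page size.
    hocr = HocrTransform(hocr_filename=Path(hocr_path), dpi=dpi)
    hocr.to_pdf(out_filename=out_path, image_filename=Path(image_path))
    return out_path
```

If the call fails with an unexpected-argument error, check the HocrTransform signature of the OCRmyPDF version you have installed.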
How to search a folder with multiple ocrized PDFs
In an Ubuntu terminal, for example, you can use the pdfgrep command to search a folder full of digital or ocrized PDFs.
To do this, let's first install pdfgrep, which is available on Ubuntu as the pdfgrep package (sudo apt install pdfgrep).
Now we can use pdfgrep to search for any information using a keyword. We can do simple searches with an exact match or use a regex for more flexibility. Let’s look at some examples:
Below, we want to look for year-specific information using the keyword “Year.”
We can also search for a specific time span, say from 2010 to 2019, using a simple regex.
From the above examples, you can see how easy it is to leverage the search power of ocrized PDFs on a folder, using only a few commands.
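To stay in the same Python environment as the rest of the tutorial, the two searches can be sketched by calling pdfgrep through subprocess. The helper names and the folder name in the usage note are hypothetical; the flags used are pdfgrep's -r (search a folder recursively), -n (print page numbers), and -i (ignore case), and the pattern 201[0-9] is one simple regex covering the years 2010 to 2019.

```python
import subprocess

# Matches any year from 2010 to 2019.
YEAR_2010_TO_2019 = "201[0-9]"


def build_pdfgrep_command(pattern, folder, ignore_case=False):
    """Build the pdfgrep command line for a recursive search with page numbers."""
    cmd = ["pdfgrep", "-r", "-n"]
    if ignore_case:
        cmd.append("-i")
    cmd += [pattern, str(folder)]
    return cmd


def search_pdfs(pattern, folder, ignore_case=False):
    """Run pdfgrep and return its matching lines (requires pdfgrep to be installed)."""
    completed = subprocess.run(
        build_pdfgrep_command(pattern, folder, ignore_case),
        capture_output=True,
        text=True,
    )
    # Like grep, pdfgrep exits with status 1 when nothing matches.
    if completed.returncode not in (0, 1):
        raise RuntimeError(completed.stderr.strip())
    return completed.stdout.splitlines()
```

For example, search_pdfs("Year", "output/pdf") performs the exact-match search, and search_pdfs(YEAR_2010_TO_2019, "output/pdf") performs the regex search.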
Why OCRmyPDF?
OCRmyPDF is an application and library that adds text "layers" to images in PDFs, making scanned image PDFs searchable. It includes an image-oriented PDF optimizer, which by default runs with safe settings with the goal of improving compression with no loss of quality. Optimizations only occur after OCR and only if OCR succeeds. Optimization ranges from -O0 through -O3, where -O0 disables optimization and -O3 enables all options. In addition, it comes with tons of other options such as rotation correction, batch processing, selective ocrization, and so on!
This article helps overcome a major limitation of OCRmyPDF: it relies on the Tesseract OCR engine, which is not as accurate as a state-of-the-art OCR solution (you can test OCR accuracy with our benchmark tool). Poor-quality scans can therefore produce poor-quality OCR. That is the reason we went with docTR as a replacement for the default OCR engine of OCRmyPDF.
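For reference, the options mentioned above are also exposed through OCRmyPDF's Python API. The sketch below runs OCRmyPDF on its own, with its default Tesseract engine, so it is separate from the docTR pipeline in this article. The helper name ocr_scanned_pdf and the chosen defaults are hypothetical; optimize, rotate_pages, and skip_text are keyword arguments of ocrmypdf.ocr.

```python
def ocr_scanned_pdf(input_pdf, output_pdf, optimize_level=1):
    """Run OCRmyPDF (default Tesseract engine) with a chosen optimization level."""
    # Levels mirror the -O0 to -O3 command-line options.
    if optimize_level not in (0, 1, 2, 3):
        raise ValueError("optimize_level must be 0, 1, 2, or 3")

    # Imported lazily so the argument check works without OCRmyPDF installed.
    import ocrmypdf

    return ocrmypdf.ocr(
        input_pdf,
        output_pdf,
        optimize=optimize_level,  # 0 disables optimization, 3 enables all options
        rotate_pages=True,        # rotation correction
        skip_text=True,           # selective ocrization: skip pages that already contain text
    )
```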
Photo credit Canva Photo Collage