Create Ocrized PDFs In 2 Steps
In this article, we will demonstrate how to create ocrized PDFs from images, scanned PDFs, etc. to run word searches on them. Below is the breakdown of the two-step process that we use at Mindee to accomplish this task.
- First, we use an open-source tool called Mindee docTR to perform OCR (Optical Character Recognition) on the image or scanned PDF. The docTR OCR results are then exported as an XML file in hOCR format.
- Second, we convert the hOCR file to a PDF using another open-source tool, OCRmyPDF.
We ocrize images as well as scanned documents or PDFs so that we can search for specific keywords or phrases within them. A few lines of code are all that's needed to do this. With the approach we present, we're also able to exhaustively ocrize the text embedded within images, which is normally left out (logos, watermarks, etc.).
To better understand why we need to ocrize documents, let’s take a look at two use cases, which involve searching through a huge PDF and searching through a folder full of PDFs.
Below is a non-exhaustive list of documents that can be categorized under the two use-cases.
Searching through a huge PDF, such as:
- contracts (terms and conditions, loan contracts, employment contracts, etc.)
- scientific and technical reports
- insurance policies
- requests for information, requests for quotation, and requests for proposals
Searching through a folder full of PDFs:
- resumes (find a specific skill)
- questionnaires/forms (find a specific answer)
- invoices/receipts/quotations (find a specific item/customer/supplier)
- presentations (find any keyword)
- old scanned news articles (find specific news)
Creating an ocrized PDF makes it easier for non-developers and developers alike to search for a specific keyword in the various use cases listed above while using their favourite PDF reader.
docTR is one of the best open-source OCR solutions available on the market. It uses state-of-the-art detection and recognition models to seamlessly process documents for Natural Language Understanding tasks. With just 3 lines of code, we can load a document and extract text with a predictor!
pip install python-doctr

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
# PDF
doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
# Analyze
result = model(doc)
docTR offers pretrained backbones, such as db_resnet50_rotation, for both detection and recognition. For more information on available backbones, please refer to the documentation page. Another major perk of using docTR over the existing open-source packages is that its models can be trained with small rotations, which makes docTR more robust for ocrization tasks. The list of supported vocabs can be found here.
Using example datasets, the table below compares docTR against some alternative OCR solutions.
Note: The dataset used for the comparison could not be made public due to the sensitive information included in it.
|(docTR)db_resnet50 + master||79||81.42||65.57||69.86||51.34||52.9|
|(docTR)db_resnet50 + sar_resnet31||78.94||81.37||65.89||70.79||51.78||53.35|
|Gvision doc. text detection||68.91||59.89||63.2||52.85||43.7||29.21|
|(docTR)db_resnet50 + master||71.03||76.06||84.49||81.94|
|(docTR)db_resnet50 + sar_resnet31||71.25||76.29||84.5||81.96|
|Gvision doc. text detection||64||53.3||68.9||61.1|
# installing requirements
!pip install "python-doctr[tf]"
!pip install ocrmypdf
You can use this example or any image or scanned PDF of your choice. For the sake of this tutorial, we are going to use the image below:
To download our sample image, you can run the following code:
# sample input image
# create the target folder first, otherwise wget cannot write the file
!mkdir -p ./data/images
!wget https://pbs.twimg.com/media/B_UpX3WU8AA2j3r.jpg -O ./data/images/image.jpg
Alternatively, you can download and save the image on your computer.
As outlined earlier, we are breaking the process into two steps:
- Define the output folders for the output PDF and the hOCR data related to the docTR results.
import os

# define output folders
output_folder = "./output/"
output_hocr_folder = output_folder + "hocr/"
output_pdf_folder = output_folder + "pdf/"
os.makedirs(output_hocr_folder, exist_ok=True)
os.makedirs(output_pdf_folder, exist_ok=True)
Then load the image.
Note: if you are using a scanned PDF, you’ll need to use the DocumentFile.from_pdf method instead and run an OCR with docTR.
from doctr.models import ocr_predictor
from doctr.io import DocumentFile

# load image
image_path = "./data/images/image.jpg"
# extracting text from input image using docTR
docs = DocumentFile.from_images(image_path)
# load model
model = ocr_predictor(
    det_arch='db_resnet50',
    reco_arch='crnn_vgg16_bn',
    pretrained=True
)
result = model(docs)
# display ocr boxes
result.show(docs)
Below we can see the docTR result which shows the detected and highlighted text in the image.
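For the scanned-PDF case mentioned in the note above, the choice between the two loaders can be made from the file extension. The following is only a minimal sketch: `pick_loader`, `load_document`, and `IMAGE_SUFFIXES` are hypothetical helper names introduced here for illustration, and the list of image extensions is an assumption; only `DocumentFile.from_pdf` and `DocumentFile.from_images` come from docTR.

```python
from pathlib import Path

# Hypothetical helper names; only DocumentFile.from_pdf / from_images come from docTR.
# The set of image extensions below is an assumption for this sketch.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}


def pick_loader(path: str) -> str:
    """Return the name of the DocumentFile method suited to this file."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "from_pdf"
    if suffix in IMAGE_SUFFIXES:
        return "from_images"
    raise ValueError(f"Unsupported file type: {suffix or path}")


def load_document(path: str):
    """Load a scanned PDF or an image into the page list docTR expects."""
    # Imported lazily so the dispatch logic can be checked without docTR installed.
    from doctr.io import DocumentFile

    return getattr(DocumentFile, pick_loader(path))(path)


print(pick_loader("./data/images/image.jpg"))  # from_images
print(pick_loader("path/to/your/doc.PDF"))     # from_pdf
```

With a helper like this, `model(load_document(path))` would run the same predictor on either kind of input.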
- For the next step, export the docTR OCR results as an XML file in hOCR format.
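# export xml file
# export_as_xml() returns one (hOCR bytes, ElementTree) pair per page;
# our image has a single page, so we take the first pair
xml_outputs = result.export_as_xml()
with open(os.path.join(output_hocr_folder, "doctr_image_hocr.xml"), "w") as f:
    f.write(xml_outputs[0][0].decode())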
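To get a feel for what an hOCR file contains, it can help to read one back. The sketch below uses only the Python standard library and a small hand-written hOCR fragment (not real docTR output) that follows the usual hOCR conventions: words are elements with `class="ocrx_word"`, and their pixel coordinates sit in the `title` attribute as `bbox x0 y0 x1 y1`. The helper name `hocr_words` is hypothetical, and the exact markup docTR emits may differ between versions.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written hOCR fragment for illustration; real docTR output is much longer.
SAMPLE_HOCR = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="ocr_page" id="page_1" title="image; bbox 0 0 1000 1400; ppageno 0">
      <span class="ocr_line" id="line_1" title="bbox 100 50 420 90">
        <span class="ocrx_word" id="word_1" title="bbox 100 50 250 90; x_wconf 98">Publication</span>
        <span class="ocrx_word" id="word_2" title="bbox 270 50 420 90; x_wconf 95">Year</span>
      </span>
    </div>
  </body>
</html>"""

# Matches the "bbox x0 y0 x1 y1" part of an hOCR title attribute.
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def hocr_words(hocr_text: str):
    """Return (text, (x0, y0, x1, y1)) for every ocrx_word element."""
    root = ET.fromstring(hocr_text.encode("utf-8"))
    words = []
    for elem in root.iter():
        if elem.get("class") != "ocrx_word":
            continue
        match = BBOX.search(elem.get("title", ""))
        if match:
            box = tuple(int(v) for v in match.groups())
            words.append(((elem.text or "").strip(), box))
    return words


for text, box in hocr_words(SAMPLE_HOCR):
    print(text, box)
```

Each word carries its own bounding box, which is what lets the next step place invisible, searchable text exactly on top of the image.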
After exporting the hOCR result of docTR as XML, we can use OCRmyPDF to convert it to an ocrized pdf.
from ocrmypdf.hocrtransform import HocrTransform

output_pdf_path = output_pdf_folder + "hocr_output.pdf"
hocr = HocrTransform(
    # read the hOCR file exported in the previous step
    hocr_filename=os.path.join(output_hocr_folder, "doctr_image_hocr.xml"),
    dpi=300
)
# step to obtain ocrized pdf
hocr.to_pdf(
    out_filename=output_pdf_path,
    image_filename=image_path,
    show_bounding_boxes=False,
    interword_spaces=False,
)
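The `dpi` value matters because hOCR coordinates are in pixels while PDF pages are measured in points. Assuming the standard convention of 72 points per inch, the arithmetic can be sketched as follows. This illustrates the unit conversion only and is not OCRmyPDF's own code; `px_to_pt` and `page_size_pt` are hypothetical names.

```python
POINTS_PER_INCH = 72  # standard PDF unit: 1 inch = 72 points


def px_to_pt(pixels: float, dpi: float) -> float:
    """Convert a pixel length to PDF points for a given scan resolution."""
    return pixels / dpi * POINTS_PER_INCH


def page_size_pt(width_px: float, height_px: float, dpi: float):
    """Return the (width, height) of the page in PDF points."""
    return px_to_pt(width_px, dpi), px_to_pt(height_px, dpi)


# A 2550 x 3300 px scan declared at 300 dpi corresponds to US Letter.
print(page_size_pt(2550, 3300, 300))  # (612.0, 792.0)
# The same pixels declared at 150 dpi give a page twice as large.
print(page_size_pt(2550, 3300, 150))  # (1224.0, 1584.0)
```

If the declared resolution does not match the image, the resulting page size will be off by the same ratio.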
Voila! We have now created our ocrized PDF as desired.
In the Ubuntu terminal, for example, you can use the pdfgrep command to search a folder full of digital or ocrized PDFs.
To do this, let's first install pdfgrep:
# first let's install pdfgrep
sudo apt-get update
sudo apt-get install pdfgrep
Now we can use pdfgrep to search for any information using a keyword. We can do simple searches with an exact match or use a regex for more flexibility. Let’s look at some examples:
Below, we want to look for year-specific information using the keyword “Year.”
pdfgrep -r "Year"
./hocr_output.pdf: APPLICANTS ForPublication Year2015-2016
We can also search for a specific time range, say from 2010 to 2019, using a simple regex.
pdfgrep -r -P "\b201[0-9]\b"
./hocr_output.pdf: APPLICANTS ForPublication Year2015-2016
./hocr_output.pdf:andsubmittotheVarsitarianofficeonorbeforeMARCH:27,2015.
From the above examples, you can see how easy it is to search a folder of ocrized PDFs with only a few commands.
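The year regex can be checked against the two output lines shown above. This sketch uses Python's `re` module rather than the PCRE engine behind `pdfgrep -P`; for plain ASCII text like this, `\b` behaves the same way in both. The function name `find_years` is hypothetical.

```python
import re

# The two lines returned by the pdfgrep search above.
lines = [
    "APPLICANTS ForPublication Year2015-2016",
    "andsubmittotheVarsitarianofficeonorbeforeMARCH:27,2015.",
]

YEAR = re.compile(r"\b201[0-9]\b")


def find_years(line: str):
    """Return every standalone 2010-2019 year found in a line."""
    return YEAR.findall(line)


for line in lines:
    print(find_years(line))
# ['2016']
# ['2015']
```

Note that the first line matches through "2016" only: because the words are fused in the extracted text, "2015" directly follows the letter "r", so there is no word boundary in front of it. Dropping the leading `\b` would be one way to catch such cases, at the cost of more false positives.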
OCRmyPDF is an application and library that adds text "layers" to images in PDFs, making scanned image PDFs searchable. It includes an image-oriented PDF optimizer, which by default runs with safe settings with the goal of improving compression with no loss of quality. Optimizations only occur after OCR and only if OCR succeeds. Optimization ranges from -O0 through -O3, where 0 disables optimization and 3 implements all options. In addition, it comes with many other options such as rotation correction, batch processing, selective ocrization, and so on!
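The optimization level and the rotation correction described above are exposed as OCRmyPDF command-line flags (`--optimize`, `--rotate-pages`, `--deskew`). As a rough sketch, a command line could be assembled like this; `build_ocrmypdf_command` and the file names are hypothetical, and running the command uses OCRmyPDF's own default OCR pipeline rather than docTR.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, optimize=1,
                           rotate_pages=False, deskew=False):
    """Assemble an ocrmypdf command line as a list of arguments."""
    if optimize not in (0, 1, 2, 3):
        raise ValueError("optimize must be 0, 1, 2 or 3")
    cmd = ["ocrmypdf", "--optimize", str(optimize)]
    if rotate_pages:
        cmd.append("--rotate-pages")  # fix pages scanned sideways or upside down
    if deskew:
        cmd.append("--deskew")        # straighten slightly crooked scans
    cmd += [input_pdf, output_pdf]
    return cmd


cmd = build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf", optimize=2, rotate_pages=True)
print(shlex.join(cmd))
# ocrmypdf --optimize 2 --rotate-pages scan.pdf scan_ocr.pdf

# To actually run it (requires OCRmyPDF to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```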
This article helps overcome a major limitation of OCRmyPDF: it relies on the Tesseract OCR engine, which is not as accurate as state-of-the-art OCR solutions, so poor-quality scans can produce poor-quality OCR. That is why we used docTR as a replacement for the default OCR engine of OCRmyPDF.
Photo credit Canva Photo Collage