Extract data from passports with Mindee's API using Python 

 

In this tutorial, you will learn how to extract data from passports (PDFs or images) using Python. This can help you automatically process ID documents for user onboarding or KYC processes.

 

If you want more details about how to use the Mindee Python library for passports, please see the library documentation.

 

 

Tutorial

 

In this tutorial, you will learn how to extract passport data automatically using the Mindee Python helper library, how to convert a PDF into an image file, and how to highlight the extracted features on it to make human validation easier.

 

The final result looks like this:

 

 

 

API Prerequisites

 

  1. You’ll need a free Mindee account. Sign up and confirm your email to log in.
  2. A passport image or PDF. You can download one here or look for one on Google Images.

 

 

Set up the project

 

Create an empty directory anywhere on your machine (we’ll call ours “python_passport”) and create a “main.py” file in it.

 
We recommend creating a virtual environment before going to the next step, to avoid installing the required packages globally on your machine. To do so, run the following command:

 

(macOS and Linux)

python -m venv venv


(Windows)

py -m venv venv

 

Then, you'll need to activate your venv, so that pip will install packages in that folder.

 

(macOS and Linux)

source venv/bin/activate


(Windows)

.\venv\Scripts\activate

 

 

Now install the Mindee Python library from PyPI using pip, a package manager for Python.

 

pip install mindee

 

If the install throws an error, you might need to upgrade your version of pip. To do so, run:

 

pip install --upgrade pip 

 

Don't have pip installed? Try installing it, by running this from the command line:

 

curl https://bootstrap.pypa.io/get-pip.py | python

 

That’s it for the setup, let’s call the API.

 

 

Call the Mindee Passport API

 

 

Log in to the Mindee dashboard and enter the Passport API environment by clicking its card.

 

 


 

 

If you don’t have an API token yet for this API, go to the “credentials” section and create a new API Key. 

 

 


 

 

Open the “main.py” file you created a few minutes ago, and paste the following code:

 

 

from mindee import Client

filepath = "./path/to/passport.jpg"
passport_token = "my-token-here"

mindee_client = Client(passport_token=passport_token)

if __name__ == "__main__":

    response = mindee_client.parse_passport(filepath)

    print(response.passport)

 

 

Replace the “./path/to/passport.jpg” placeholder in the code with the path of the passport PDF or image you want to parse. Replace “my-token-here” with the API token you created previously on the platform.
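
Hard-coding the token is fine for a quick test, but you may prefer to keep it out of the source file. Here is a minimal sketch, assuming you export the token in an environment variable named MINDEE_PASSPORT_TOKEN. That name is a hypothetical one chosen for this example, not something the mindee library looks for itself:

```python
import os


def load_passport_token(env=os.environ, var_name="MINDEE_PASSPORT_TOKEN"):
    """Return the API token stored in an environment variable.

    MINDEE_PASSPORT_TOKEN is a hypothetical name used for illustration only.
    """
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return token


# Usage in main.py (assumes the variable was exported in your shell):
# passport_token = load_passport_token()
```

The env parameter only exists to make the helper easy to test with a plain dictionary.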

 

 

Go back to your console and run the script:

 

 

python main.py

 

 

You should see the details of the passport printed in your console:

 

-----Passport data-----
Filename: passport.jpg 
Full name: HENERT PUDARSAN 
Given names: HENERT 
Surname: PUDARSAN
Country: GBR
ID Number: 707797979
Issuance date: 2012-04-22
Birth date: 1995-05-20
Expiry date: 2017-04-22
MRZ 1: P<GBRPUDARSAN<<HENERT<<<<<<<<<<<<<<<<<<<<<<<
MRZ 2: 7077979792GBR9505209M1704224<<<<<<<<<<<<<<00
MRZ: P<GBRPUDARSAN<<HENERT<<<<<<<<<<<<<<<<<<<<<<<7077979792GBR9505209M1704224<<<<<<<<<<<<<<00
----------------------
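
The dates in this output are printed in YYYY-MM-DD form, so they can be turned into Python date objects with the standard library. As a small illustration for an onboarding or KYC flow, here is a sketch that checks whether a passport has expired. The is_expired helper is a name made up for this example and is not part of the mindee library; it assumes the expiry date is available as a string in that format.

```python
from datetime import date


def is_expired(expiry_value, today=None):
    """Return True when an ISO 'YYYY-MM-DD' expiry date is in the past."""
    today = today or date.today()
    return date.fromisoformat(expiry_value) < today


# Using the expiry date from the sample output, against fixed reference days
print(is_expired("2017-04-22", today=date(2024, 1, 1)))  # True
print(is_expired("2017-04-22", today=date(2015, 1, 1)))  # False
```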

 

Now let’s take a look at the passport object.

 

 

Extracted passport data

 

The API returns a list of fields extracted from the passport (given names, surname, date of birth, …). You can find this list in the Extracted passport fields documentation.

 

The passport object, available through the response.passport attribute, holds every extracted field along with its coordinates in the image, which we will use later to highlight the results.

 

Here is a piece of code to show you how to get some of them:

 

from mindee import Client

filepath = "./path/to/passport.jpg"
passport_token = "my-token-here"

mindee_client = Client(passport_token=passport_token)

if __name__ == "__main__":
    response = mindee_client.parse_passport(filepath)

    print(response.passport)

    # To get the list of names
    print(response.passport.given_names)

    # Loop on each given name
    for given_name in response.passport.given_names:
        # To get the name string
        print(given_name.value)

    # To get the passport's owner surname (string)
    print(response.passport.surname.value)

    # To get the passport's owner gender (string among {"M", "F"})
    print(response.passport.gender.value)

    # To get the passport's owner full name (string)
    print(response.passport.full_name.value)

    # To get the passport's owner birth place (string)
    print(response.passport.birth_place.value)

    # To get the passport's owner date of birth (string)
    print(response.passport.birth_date.value)

    # To get the passport expiry date (string)
    print(response.passport.expiry_date.value)

    # To get the passport date of issuance (string)
    print(response.passport.issuance_date.value)

    # To get the passport first line of machine readable zone (string)
    print(response.passport.mrz1.value)

    # To get the passport second line of machine readable zone (string)
    print(response.passport.mrz2.value)

    # To get the passport full machine readable zone (string)
    print(response.passport.mrz.value)

    # To get the passport id number (string)
    print(response.passport.id_number.value)

    # To get the passport country code (string), using the same attribute as the final script
    print(response.passport.country.value)

 

Run the script and check out the results in your console!
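
Line 2 of a passport MRZ carries check digits for the document number, the birth date, and the expiry date, as defined by the ICAO 9303 standard. Verifying them is a cheap sanity check on the extracted MRZ. The sketch below is not part of the mindee library: mrz_check_digit and check_mrz2 are hypothetical helpers, and the code assumes the usual 44-character passport (TD3) layout.

```python
def mrz_check_digit(data):
    """Compute the ICAO 9303 check digit for one MRZ field."""
    weights = (7, 3, 1)
    total = 0
    for index, char in enumerate(data):
        if char.isdigit():
            value = int(char)
        elif char == "<":
            value = 0  # filler characters count as zero
        else:
            value = ord(char.upper()) - ord("A") + 10  # A=10 ... Z=35
        total += value * weights[index % 3]
    return total % 10


def check_mrz2(mrz2):
    """Check the three per-field digits in line 2 of a passport (TD3) MRZ."""
    if len(mrz2) < 28:
        raise ValueError("MRZ line 2 is too short")
    fields = {
        "id_number": (mrz2[0:9], mrz2[9]),
        "birth_date": (mrz2[13:19], mrz2[19]),
        "expiry_date": (mrz2[21:27], mrz2[27]),
    }
    return {
        name: digit.isdigit() and mrz_check_digit(data) == int(digit)
        for name, (data, digit) in fields.items()
    }


# MRZ 2 from the sample output shown earlier
print(check_mrz2("7077979792GBR9505209M1704224<<<<<<<<<<<<<<00"))
# {'id_number': True, 'birth_date': True, 'expiry_date': True}
```

In your own script you could call check_mrz2(response.passport.mrz2.value), assuming the value is a plain string as the comments above indicate.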

  

Now we are going to add a piece of code that highlights the extracted features on the document. It can help you or your users validate the data extraction very quickly.

 

 

Convert pdf into an image

 

If you’re using a PDF, we first need to convert it into an image, as an image is more convenient for highlighting the features. If your passport is already an image, you can skip this part.
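
Deciding whether this step applies comes down to detecting whether the input file is a PDF. Here is a minimal sketch using only the standard library (looks_like_pdf is a hypothetical helper name). It checks the file extension first and falls back to the %PDF- signature that PDF files start with:

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if the file has a .pdf extension or starts with '%PDF-'."""
    path = Path(path)
    if path.suffix.lower() == ".pdf":
        return True
    try:
        with path.open("rb") as handle:
            return handle.read(5) == b"%PDF-"
    except OSError:
        return False  # missing or unreadable file: treat as "not a PDF"


print(looks_like_pdf("scans/passport.PDF"))  # True (extension match, any case)
```

Unlike a plain endswith(".pdf") test, this also catches upper-case extensions and PDF files saved without an extension.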

 

To convert the PDF into an image, we are going to use the pdf2image Python library. First, install it in your environment:

 

pip install pdf2image

 

Important note: Dealing with PDFs is not easy, and this library might not work out of the box because it relies on the poppler utilities, which have to be installed separately on your system. Check out the project's GitHub page to make sure you installed it correctly.

 

Once installed, we are going to create an image out of our pdf:

 

 

from pdf2image import convert_from_path

# filepath is the path to your passport PDF, as defined in main.py
images = convert_from_path(filepath)

 

 

The images object is a list of Pillow images, one per page. Pillow is an image library for Python. As we want to use OpenCV to highlight the features on the image, we need to convert the Pillow image into an OpenCV image. OpenCV images are numpy arrays, so we are going to use numpy directly for the conversion.

 

If you don’t have numpy installed on your environment:

 

 

pip install numpy

 

 

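Now let’s convert the Pillow image into an OpenCV image using numpy. Pillow stores colors in RGB order while OpenCV expects BGR, so we also reverse the channel order:

import numpy as np
from pdf2image import convert_from_path


def pdf_to_cv_image(filepath, page=0):
    # Render the PDF and keep the requested page as an RGB Pillow image
    image = convert_from_path(filepath)[page].convert("RGB")
    # Reverse the last axis (RGB -> BGR) to match OpenCV's channel order
    return np.ascontiguousarray(np.asarray(image)[:, :, ::-1])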

 



 

Highlight features on the image

 

Let’s try to highlight the features as if someone did it with a pen.

 

First, you’ll need to install OpenCV, a computer vision library for Python, if it isn’t already installed in your environment. To do so, run:

 

 

pip install opencv-python

 

Each field extracted in the Response.passport object has a Field.bbox object. It's a simple list of the bounding box's vertices, expressed in relative coordinates (fractions of the image width and height).

 

Check out what's inside mrz1.bbox, for example, by adding this line after the API call:

 

print(response.passport.mrz1.bbox)

 

Now that we know how to access the bounding box coordinates, we are going to create a highlighter function that accepts the bounding boxes of any fields and works in several steps:

 

  1. Check if the file is a PDF or an image and load it as an OpenCV image in both scenarios
  2. Create a mask image
  3. Loop over each feature's coordinates and draw the feature rectangle on our mask
  4. Overlay the mask and the original image with alpha blending
  5. Display the image to the user

 

Note: each coordinate returned by the API is relative to the image size (a fraction of its width or height). You'll see a relative-to-absolute conversion in the code.
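
To make the conversion concrete, here is a small standalone sketch. The to_pixel_box helper and the sample values are made up for illustration; the only assumption taken from this tutorial is that a bbox is a list of (x, y) vertices expressed as fractions of the image width and height.

```python
def to_pixel_box(bbox, width, height):
    """Convert relative (x, y) vertices into an absolute pixel rectangle.

    Returns ((x_min, y_min), (x_max, y_max)), or None for an empty bbox.
    """
    if not bbox:
        return None
    xs = [int(width * x) for x, _ in bbox]
    ys = [int(height * y) for _, y in bbox]
    return (min(xs), min(ys)), (max(xs), max(ys))


# Hypothetical bounding box with four vertices (values invented for the example)
sample_bbox = [[0.25, 0.5], [0.75, 0.5], [0.75, 0.625], [0.25, 0.625]]
print(to_pixel_box(sample_bbox, width=1200, height=800))
# ((300, 400), (900, 500))
```

Taking the minimum and maximum over all vertices means the result does not depend on the order in which the vertices are listed.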

 

Here is the code step by step:

 

import cv2


def highlight_features(img_path, coordinates):
    # Step 1: load the file as an OpenCV image, whether it is a PDF or an image
    # (pdf_to_cv_image is the helper defined in the previous section)
    if img_path.lower().endswith(".pdf"):
        cv_image = pdf_to_cv_image(img_path)
    else:
        cv_image = cv2.imread(img_path)
    if cv_image is None:
        raise FileNotFoundError(f"Could not read image: {img_path}")

    # Step 2: create the mask image
    overlay = cv_image.copy()
    h, w = cv_image.shape[:2]

    # Step 3: loop over each feature's coordinates and draw its rectangle on the mask
    for coord in coordinates:
        if len(coord):
            # Convert relative vertices to absolute pixel coordinates
            pt1 = (int(w * coord[0][0]), int(h * coord[0][1]))
            pt2 = (int(w * coord[2][0]), int(h * coord[2][1]))
            cv2.rectangle(overlay, pt1, pt2, (70, 230, 244), cv2.FILLED)

    # Step 4: overlay the mask and the original image with alpha blending
    final_image = cv2.addWeighted(overlay, 0.5, cv_image, 0.5, 0)

    # Step 5: display the image to the user
    cv2.imshow("highlighted_image", cv2.resize(final_image, (600, int(600 * h / w))))
    cv2.waitKey(0)
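
The blending in step 4 is what produces the "highlighter pen" look: cv2.addWeighted computes a weighted sum of the two images, pixel by pixel. The sketch below reproduces that arithmetic for a single pixel in plain Python, without OpenCV. The blend_pixel helper is a made-up name for this illustration.

```python
def blend_pixel(overlay_px, original_px, alpha=0.5):
    """Blend two colour tuples: alpha * overlay + (1 - alpha) * original."""
    return tuple(
        max(0, min(255, round(alpha * o + (1 - alpha) * p)))
        for o, p in zip(overlay_px, original_px)
    )


highlight = (70, 230, 244)  # fill colour used in highlight_features
print(blend_pixel(highlight, (250, 250, 250)))  # light paper: (160, 240, 247)
print(blend_pixel(highlight, (30, 30, 30)))     # dark ink: (50, 130, 137)
```

Light background pixels take on the highlight colour, while dark text pixels stay noticeably darker, so the text remains readable under the highlight.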


 

Finally, we just have to modify the code a bit to call the highlighting function and print the results. Here is the final code:

 

 

import cv2
import numpy as np
from pdf2image import convert_from_path
from mindee import Client


filepath = "./path/to/passport.jpg"
passport_token = "my-token-here"


def pdf_to_cv_image(filepath, page=0):
    # Render the PDF and keep the requested page as an RGB Pillow image
    image = convert_from_path(filepath)[page].convert("RGB")
    # Reverse the last axis (RGB -> BGR) to match OpenCV's channel order
    return np.ascontiguousarray(np.asarray(image)[:, :, ::-1])


def highlight_features(img_path, coordinates):
    # Step 1: load the file as an OpenCV image, whether it is a PDF or an image
    if img_path.lower().endswith(".pdf"):
        cv_image = pdf_to_cv_image(img_path)
    else:
        cv_image = cv2.imread(img_path)
    if cv_image is None:
        raise FileNotFoundError(f"Could not read image: {img_path}")

    # Step 2: create the mask image
    overlay = cv_image.copy()
    h, w = cv_image.shape[:2]

    # Step 3: loop over each feature's coordinates and draw its rectangle on the mask
    for coord in coordinates:
        if len(coord):
            pt1 = (int(w * coord[0][0]), int(h * coord[0][1]))
            pt2 = (int(w * coord[2][0]), int(h * coord[2][1]))
            cv2.rectangle(overlay, pt1, pt2, (70, 230, 244), cv2.FILLED)

    # Step 4: overlay the mask and the original image with alpha blending
    final_image = cv2.addWeighted(overlay, 0.5, cv_image, 0.5, 0)

    # Step 5: display the image to the user
    cv2.imshow("highlighted_image", cv2.resize(final_image, (600, int(600 * h / w))))
    cv2.waitKey(0)


mindee_client = Client(passport_token=passport_token)

if __name__ == "__main__":
    response = mindee_client.parse_passport(filepath)

    print(response.passport)

    highlight_features(
        filepath,
        [
            response.passport.surname.bbox,
            response.passport.given_names[0].bbox,
            response.passport.mrz1.bbox,
            response.passport.mrz2.bbox,
            response.passport.birth_place.bbox,
            response.passport.country.bbox,
            response.passport.issuance_date.bbox,
            response.passport.expiry_date.bbox,
            response.passport.birth_date.bbox,
            response.passport.id_number.bbox
        ]
    )
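
On a real scan, some fields may not be found, and indexing given_names[0] fails when the list is empty. One way to guard against that is to gather the boxes through a small helper before calling highlight_features. This is only a sketch: collect_bboxes is a hypothetical helper, and the objects below are stand-ins that mimic nothing more than the .bbox attribute used in this tutorial.

```python
from types import SimpleNamespace


def collect_bboxes(fields):
    """Return the non-empty bounding boxes of the given fields.

    Fields that are None, or whose bbox is missing or empty, are skipped.
    """
    boxes = []
    for field in fields:
        bbox = getattr(field, "bbox", None)
        if bbox:
            boxes.append(bbox)
    return boxes


# Stand-in objects, not real mindee fields
surname = SimpleNamespace(bbox=[[0.1, 0.2], [0.4, 0.2], [0.4, 0.25], [0.1, 0.25]])
birth_place = SimpleNamespace(bbox=[])  # field not found on the document
print(len(collect_bboxes([surname, birth_place, None])))  # 1
```

In the final script, the same helper could receive a list such as [response.passport.surname, *response.passport.given_names, response.passport.mrz1], which avoids indexing given_names directly.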


 

And the final result!

 


 

 

 

Conclusion

 

And there you have it! Mindee's passport API parsed a passport in under 2 seconds, matching nearly all of the sections of the passport quickly and accurately. Give it a try with our free tier, and let us know in the chat how it works for you!