Extract data from passports with Mindee's API using Python
In this tutorial, you will learn how to extract data from passports (PDFs or images) using Python. This can help you automatically process ID documents for user onboarding or KYC processes.
For more details about how to use the Mindee Python library for passports, please see:
- 1. Getting started with passport parsing in python
- 2. Extracted passport fields
- 3. Save and restore passport API responses
- 4. Passport data validations
- 5. Benchmark Passport OCR performances
Tutorial
In this tutorial, you will learn how to extract passport data automatically using the Mindee Python helper library, convert a PDF into an image file, and highlight the extracted features on it to make human validation easier.
The final result looks like this:
API Prerequisites
- You’ll need a free Mindee account. Sign up and confirm your email to log in.
- A passport image or PDF. You can download one here or find one on Google Images.
Set up the project
Create an empty directory anywhere on your machine (we’ll call ours “python_passport”) and create a “main.py” file in it.
We recommend creating a virtual environment before going to the next step, to avoid installing the required packages globally on your machine. To do so, run the following command:
(macOS and Linux)
python -m venv venv
(Windows)
py -m venv venv
Then, you'll need to activate your venv, so that pip will install packages in that folder.
(macOS and Linux)
source venv/bin/activate
(Windows)
.\venv\Scripts\activate
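If you are not sure whether the activation worked, one quick check is to compare `sys.prefix` with `sys.base_prefix` from inside Python. The helper name `in_virtualenv` below is ours and not part of any library; treat it as a minimal sketch for environments created with `venv`.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

If this prints `False`, pip would install packages into the base Python installation rather than the `venv` folder.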
Now install the Mindee Python library from PyPI using pip, a package manager for Python.
pip install mindee
If the install throws an error, you might need to upgrade your version of pip. To do so, run:
pip install --upgrade pip
Don't have pip installed? Try installing it, by running this from the command line:
$ curl https://bootstrap.pypa.io/get-pip.py | python
That’s it for the setup, let’s call the API.
Call the Mindee Passport API
Log in to the Mindee dashboard and enter the Passport API environment by clicking the following card.
If you don’t have an API token for this API yet, go to the “credentials” section and create a new API key.
Open the “main.py” file you created a few minutes ago, and paste the following code:
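from mindee import Client

# Path of the passport file to parse and the API token from the dashboard
filepath = "./path/to/passport.jpg"
passport_token = "my-token-here"

mindee_client = Client(passport_token=passport_token)

if __name__ == "__main__":
    # Send the file to the Passport API and print a summary of the result
    response = mindee_client.parse_passport(filepath)
    print(response.passport)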
Replace the “./path/to/passport.jpg” placeholder in the code with the path of the passport PDF or image you want to parse. Replace “my-token-here” with the API token you created previously on the platform.
Go back to your console and run the script:
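Hardcoding the token is fine for a quick test, but you may prefer to keep it out of the source file. Here is a minimal sketch that reads it from an environment variable instead. The variable name `MINDEE_PASSPORT_TOKEN` and the helper `load_passport_token` are assumptions made for this example, not names defined by the Mindee library.

```python
import os


def load_passport_token(env=None, var_name="MINDEE_PASSPORT_TOKEN"):
    """Read the API token from an environment variable (the variable name is an assumption)."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable to your Mindee API token")
    return token


if __name__ == "__main__":
    try:
        passport_token = load_passport_token()
        print("Token loaded, length:", len(passport_token))
    except RuntimeError as error:
        print(error)
```

The returned string can then be passed to `Client(passport_token=...)` exactly like the hardcoded value.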
python main.py
You should see the details of the passport printed in your console:
-----Passport data-----
Filename: passport.jpg
Full name: HENERT PUDARSAN
Given names: HENERT
Surname: PUDARSAN
Country: GBR
ID Number: 707797979
Issuance date: 2012-04-22
Birth date: 1995-05-20
Expiry date: 2017-04-22
MRZ 1: P<GBRPUDARSAN<<HENERT<<<<<<<<<<<<<<<<<<<<<<<
MRZ 2: 7077979792GBR9505209M1704224<<<<<<<<<<<<<<00
MRZ: P<GBRPUDARSAN<<HENERT<<<<<<<<<<<<<<<<<<<<<<<7077979792GBR9505209M1704224<<<<<<<<<<<<<<00
----------------------
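The second MRZ line printed above embeds check digits, which give a cheap sanity check on the OCR result. The sketch below follows the ICAO Doc 9303 scheme for passports (repeating weights 7, 3, 1, with letters valued 10 to 35 and `<` valued 0). The function names are ours, and this is an illustration rather than a replacement for the library's own validations.

```python
def mrz_char_value(char):
    """Map one MRZ character to its numeric value (ICAO Doc 9303)."""
    if char in "0123456789":
        return int(char)
    if "A" <= char <= "Z":
        return ord(char) - ord("A") + 10
    if char == "<":
        return 0
    raise ValueError(f"Invalid MRZ character: {char!r}")


def mrz_check_digit(field):
    """Compute the check digit of an MRZ field with the repeating 7, 3, 1 weights."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(c) * weights[i % 3] for i, c in enumerate(field))
    return total % 10


def check_td3_line2(mrz2):
    """Check the ID number, birth date and expiry date digits of a 44-character line."""
    if len(mrz2) != 44:
        raise ValueError("A passport (TD3) MRZ line has 44 characters")
    # field name -> (slice holding the field, 0-indexed position of its check digit)
    layout = {
        "id_number": (slice(0, 9), 9),
        "birth_date": (slice(13, 19), 19),
        "expiry_date": (slice(21, 27), 27),
    }
    return {
        name: mrz_check_digit(mrz2[field]) == mrz_char_value(mrz2[pos])
        for name, (field, pos) in layout.items()
    }


if __name__ == "__main__":
    sample = "7077979792GBR9505209M1704224<<<<<<<<<<<<<<00"
    print(check_td3_line2(sample))
```

For the sample passport above, all three checks pass. A `False` value suggests that a character in that field was misread.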
Now let’s take a look at the passport object.
Extracted passport data
The API returns a list of different fields in the passport (given names, surname, date of birth…). You can find this list in the Extracted passport fields documentation.
The passport object under the Response.passport attribute has everything you need to get the passport data, along with the coordinates in the image used to highlight the results.
Here is a piece of code to show you how to get some of them:
from mindee import Client

filepath = "./path/to/passport.jpg"
passport_token = "my-token-here"

mindee_client = Client(passport_token=passport_token)

if __name__ == "__main__":
    response = mindee_client.parse_passport(filepath)
    print(response.passport)

    # To get the list of names
    print(response.passport.given_names)

    # Loop on each given name
    for given_name in response.passport.given_names:
        # To get the name string
        print(given_name.value)

    # To get the passport's owner surname (string)
    print(response.passport.surname.value)

    # To get the passport's owner gender (string among {"M", "F"})
    print(response.passport.gender.value)

    # To get the passport's owner full name (string)
    print(response.passport.full_name.value)

    # To get the passport's owner birth place (string)
    print(response.passport.birth_place.value)

    # To get the passport's owner date of birth (string)
    print(response.passport.birth_date.value)

    # To get the passport expiry date (string)
    print(response.passport.expiry_date.value)

    # To get the passport date of issuance (string)
    print(response.passport.issuance_date.value)

    # To get the passport first line of machine readable zone (string)
    print(response.passport.mrz1.value)

    # To get the passport second line of machine readable zone (string)
    print(response.passport.mrz2.value)

    # To get the passport full machine readable zone (string)
    print(response.passport.mrz.value)

    # To get the passport id number (string)
    print(response.passport.id_number.value)

    # To get the passport country code (string)
    print(response.passport.country.value)
Run the script and check out the results in your console!
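Since the date fields come back as strings, a typical KYC check such as "is this passport expired?" needs a conversion first. Here is a minimal sketch, assuming the values use the `YYYY-MM-DD` format shown in the console output above. The helper `is_expired` is hypothetical and not part of the Mindee library.

```python
from datetime import date


def is_expired(expiry_value, today=None):
    """Return True when a YYYY-MM-DD expiry date is strictly before the reference day."""
    today = date.today() if today is None else today
    return date.fromisoformat(expiry_value) < today


if __name__ == "__main__":
    # Expiry date of the sample passport used in this tutorial
    print(is_expired("2017-04-22"))
```

In the script above you would pass `response.passport.expiry_date.value` as the first argument. The `today` parameter only exists to make the function easy to test with a fixed date.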
Now we are going to add a piece of code that highlights the extracted features on the document. This helps you or your users validate the data extraction very quickly.
Convert a PDF into an image
If you’re using a PDF, we first need to convert it into an image, as an image is more convenient for highlighting features. If your passport is already an image, you can skip this part.
To convert the PDF into an image, we are going to use the pdf2image Python library. First, install it in your environment:
pip install pdf2image
Important note: Dealing with PDFs is not easy, and this library might not work out of the box because it sometimes requires extra dependencies. Check out its GitHub page to make sure you installed it correctly.
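The extra dependency that most often goes missing is Poppler: pdf2image is a wrapper around its command-line tools, including `pdftoppm`. As a rough sketch, you can check whether that executable is reachable before calling the library. The helper name `poppler_available` is ours.

```python
import shutil


def poppler_available():
    """Return True when Poppler's pdftoppm executable can be found on the PATH."""
    return shutil.which("pdftoppm") is not None


if __name__ == "__main__":
    if poppler_available():
        print("Poppler found, pdf2image should be able to convert PDFs")
    else:
        print("Poppler not found on PATH, install it before using pdf2image")
```

This only checks the PATH, so it can report a missing Poppler even when one is installed in a custom location.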
Once installed, we are going to create an image out of our pdf:
from pdf2image import convert_from_path
images = convert_from_path(filepath)
The images object is a list of Pillow images, one per page. Pillow is an image library for Python. As we want to use OpenCV to highlight the features on the image, we need to convert the Pillow image into an OpenCV image. OpenCV images are NumPy arrays, so we are going to do the conversion with NumPy directly.
If you don’t have numpy installed in your environment:
pip install numpy
Now let’s convert the Pillow image into the OpenCV format using numpy:
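import numpy as np
from pdf2image import convert_from_path

def pdf_to_cv_image(filepath, page=0):
    # Render the requested page of the PDF as a Pillow image
    image = convert_from_path(filepath)[page]
    # Pillow orders channels as RGB while OpenCV expects BGR, so reverse the channel axis
    return np.ascontiguousarray(np.asarray(image)[:, :, ::-1])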
Highlight features on the image
Let’s try to highlight the features as if someone did it with a pen.
First, you’ll need to install the computer vision python library OpenCV if you don’t have it already installed in your env. To do so, run:
pip install opencv-python
Each field extracted in the Response.passport object has a Field.bbox attribute. It's a simple list of the bounding box vertices, expressed relative to the image (as a fraction of the image width and height).
Check out what's inside mrz1.bbox, for example, by running:
print(response.passport.mrz1.bbox)
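Before drawing anything, it can be useful to confirm that a bounding box has the shape the rest of this tutorial relies on: four (x, y) vertices with relative values between 0 and 1. The sketch below uses made-up coordinates for illustration, not real API output, and the helper `is_valid_bbox` is hypothetical.

```python
def is_valid_bbox(bbox):
    """Check that a bounding box is a list of four (x, y) vertices with values in [0, 1]."""
    if len(bbox) != 4:
        return False
    return all(
        len(vertex) == 2 and all(0.0 <= value <= 1.0 for value in vertex)
        for vertex in bbox
    )


if __name__ == "__main__":
    # Made-up values for illustration only
    example_bbox = [[0.05, 0.82], [0.95, 0.82], [0.95, 0.88], [0.05, 0.88]]
    print(is_valid_bbox(example_bbox))
    # An empty list, as for a field that was not found on the document
    print(is_valid_bbox([]))
```

A box that fails this check is best skipped rather than drawn.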
Now that we know how to access the bounding box coordinates, we are going to create a highlighter function that accepts any list of field bounding boxes and works in the following steps:
- Check if the file is a PDF or an image and load it as an OpenCV image in both scenarios
- Create a mask image
- Loop on each feature coordinates and draw the feature rectangle on our mask
- Overlay the mask and original image with alpha
- Display image to the user
Note: each coordinate returned by the API is relative (in % of the image). You'll see there is a relative to absolute conversion in the code.
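The relative-to-absolute conversion is just a multiplication by the image size. Here is a small standalone sketch of it, with helper names of our own choosing. It also derives the two rectangle corners from the minimum and maximum of all vertices, which does not depend on the order in which the vertices are listed.

```python
def to_absolute(bbox, width, height):
    """Convert relative (x, y) vertices into integer pixel coordinates."""
    return [(int(width * x), int(height * y)) for x, y in bbox]


def corner_points(bbox, width, height):
    """Return the top-left and bottom-right pixel corners enclosing all vertices."""
    points = to_absolute(bbox, width, height)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys)), (max(xs), max(ys))


if __name__ == "__main__":
    # Made-up relative vertices on an 800 x 400 pixel image
    bbox = [[0.25, 0.5], [0.75, 0.5], [0.75, 0.625], [0.25, 0.625]]
    print(to_absolute(bbox, 800, 400))
    print(corner_points(bbox, 800, 400))
```

The two corners returned by `corner_points` are in the form that `cv2.rectangle` expects for its `pt1` and `pt2` arguments.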
Here is the code step by step:
import cv2

def highlight_features(img_path, coordinates):
    # step 1: Check if the file is a pdf or an image and load it as an OpenCV image in both scenarios
    if img_path.endswith(".pdf"):
        cv_image = pdf_to_cv_image(img_path)
    else:
        cv_image = cv2.imread(img_path)

    # step 2: create mask image
    overlay = cv_image.copy()
    h, w = cv_image.shape[:2]

    # step 3: Loop on each feature coordinates and draw the feature rectangle on our mask
    for coord in coordinates:
        if len(coord):
            # Create points absolute coordinates
            pt1 = (int(w*coord[0][0]), int(h*coord[0][1]))
            pt2 = (int(w*coord[2][0]), int(h*coord[2][1]))
            cv2.rectangle(overlay, pt1, pt2, (70, 230, 244), cv2.FILLED)

    # step 4: Overlay the mask and original image with alpha
    final_image = cv2.addWeighted(overlay, 0.5, cv_image, 0.5, 0)

    # step 5: Display image to the user
    cv2.imshow("highlighted_image", cv2.resize(final_image, (600, int(600*h/w))))
    cv2.waitKey(0)
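Step 4 is what produces the "highlighter pen" effect. `cv2.addWeighted` computes, for every pixel, `overlay * alpha + image * beta + gamma`. The pure-Python sketch below reproduces that arithmetic for a single pixel with alpha = beta = 0.5 and gamma = 0. It is only meant to illustrate the formula; the function `blend_pixel` is ours.

```python
def blend_pixel(overlay_pixel, image_pixel, alpha=0.5):
    """Blend two BGR pixels: overlay * alpha + image * (1 - alpha)."""
    return tuple(
        int(round(o * alpha + i * (1 - alpha)))
        for o, i in zip(overlay_pixel, image_pixel)
    )


if __name__ == "__main__":
    highlight = (70, 230, 244)   # BGR color drawn on the mask
    light_gray = (200, 200, 200)  # a pixel of the original image
    print(blend_pixel(highlight, light_gray))
```

With alpha at 0.5, the text under the rectangle stays readable because half of each original pixel value is kept. A higher alpha makes the highlight more opaque.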
Finally, we just have to modify the code a bit to call the highlighting function and print the results. Here is the final code:
import cv2
import numpy as np
from pdf2image import convert_from_path
from mindee import Client

filepath = "./path/to/passport.jpg"
passport_token = "my-token-here"

def pdf_to_cv_image(filepath, page=0):
    image = convert_from_path(filepath)[page]
    # Pillow orders channels as RGB while OpenCV expects BGR, so reverse the channel axis
    return np.ascontiguousarray(np.asarray(image)[:, :, ::-1])

def highlight_features(img_path, coordinates):
    # step 1: Check if the file is a pdf or an image and load it as an OpenCV image in both scenarios
    if img_path.endswith(".pdf"):
        cv_image = pdf_to_cv_image(img_path)
    else:
        cv_image = cv2.imread(img_path)

    # step 2: create mask image
    overlay = cv_image.copy()
    h, w = cv_image.shape[:2]

    # step 3: Loop on each feature coordinates and draw the feature rectangle on our mask
    for coord in coordinates:
        if len(coord):
            pt1 = (int(w * coord[0][0]), int(h * coord[0][1]))
            pt2 = (int(w * coord[2][0]), int(h * coord[2][1]))
            cv2.rectangle(overlay, pt1, pt2, (70, 230, 244), cv2.FILLED)

    # step 4: Overlay the mask and original image with alpha
    final_image = cv2.addWeighted(overlay, 0.5, cv_image, 0.5, 0)

    # step 5: Display image to the user
    cv2.imshow("highlighted_image", cv2.resize(final_image, (600, int(600 * h / w))))
    cv2.waitKey(0)

mindee_client = Client(passport_token=passport_token)

if __name__ == "__main__":
    response = mindee_client.parse_passport(filepath)
    print(response.passport)
    highlight_features(
        filepath,
        [
            response.passport.surname.bbox,
            response.passport.given_names[0].bbox,
            response.passport.mrz1.bbox,
            response.passport.mrz2.bbox,
            response.passport.birth_place.bbox,
            response.passport.country.bbox,
            response.passport.issuance_date.bbox,
            response.passport.expiry_date.bbox,
            response.passport.birth_date.bbox,
            response.passport.id_number.bbox
        ]
    )
And the final result!
Conclusion
And there you have it! Mindee's passport API parsed a passport in under 2 seconds, matching nearly all of the sections of the passport quickly and accurately. Give it a try with our free tier, and let us know in the chat how it works for you!