Extract data from invoices with Mindee's API using Python 

 

In this tutorial, you will learn how to extract data from invoices (PDFs or images) using Python. This can help you process invoices automatically in any accounting software.

 

If you want to get more details about how to use the mindee python library:

 

 

Tutorial

 

The final script can be used within any Python middleware application or REST API serving your front-end interfaces.

 

By the end of this tutorial, you will also know how to convert a PDF into an image file and how to highlight the extracted fields on it to make human validation easier.

 

The final result looks like this:

 

extract data invoices python mindee api

 

 

API Prerequisites

 

  1. A free Mindee account. Sign up and confirm your email to log in.
  2. An invoice. Look in your mailbox, download one here, or search Google Images.

 

 

Set up the project

 

Create an empty directory anywhere on your laptop; we’ll call ours “python_invoice”. Inside it, create an empty file named “main.py”, which will hold the script.

 
We recommend creating a virtual environment before going to the next step, to avoid installing the required packages globally on your machine. To do so, run the following command:

 

(macOS and Linux)

python -m venv venv


(Windows)

py -m venv venv

 

Then, you'll need to activate your venv, so that pip will install packages in that folder.

 

(macOS and Linux)

source venv/bin/activate


(Windows)

.\venv\Scripts\activate
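
If you are unsure whether the virtual environment is actually active, one way to check from Python itself is to compare sys.prefix with sys.base_prefix, which differ inside a venv. Here is a minimal sketch; the helper name in_virtualenv is ours and not part of any library:

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

If this prints False after activation, pip would most likely install packages globally rather than in your venv folder.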

 

 

Now install the Mindee Python library from PyPI using pip, a package manager for Python.

 

pip install mindee

 

If the install throws an error, you might need to upgrade your version of pip. To do so, run:

 

pip install --upgrade pip 

 

Don't have pip installed? Try installing it, by running this from the command line:

 

curl https://bootstrap.pypa.io/get-pip.py | python

 

That’s it for the setup, let’s call the API.

 

 

Call the Mindee Invoice API

 

 

Log in to the [Mindee dashboard](https://platform.mindee.net/) and enter the Invoice API environment by clicking the following card.

 

 

mindee invoice api card

 

 

If you don’t have an API token for this API yet, go to the “credentials” section and create a new API key.

 

 

new api token

 

 

 

Open the “main.py” file in your project directory and paste in the sample code. The path to your invoice is stored in the variable "filepath", since we use it several times. Your main.py file should look like this:

 

 

from mindee import Client

filepath = "/path/to/your/file"
invoice_token = "my-token-here"

mindee_client = Client(invoice_token=invoice_token)


if __name__ == "__main__":
     
     response = mindee_client.parse_invoice(filepath)

     print(response.invoice)

 

 

Replace the “/path/to/your/file” placeholder in the code with the path of the invoice PDF or image you want to parse. Replace “my-token-here” with the API token you created previously on the platform.
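
If you would rather not hardcode the token in main.py, one common option is to read it from an environment variable. The sketch below is only an illustration: the variable name MINDEE_INVOICE_TOKEN and the helper load_invoice_token are our own choices, not something the Mindee library defines.

```python
import os

# Hypothetical variable name; pick whatever fits your deployment.
TOKEN_ENV_VAR = "MINDEE_INVOICE_TOKEN"


def load_invoice_token(env=os.environ):
    """Return the API token found in the environment, or raise a clear error."""
    token = env.get(TOKEN_ENV_VAR, "").strip()
    if not token:
        raise RuntimeError(f"Set the {TOKEN_ENV_VAR} environment variable first")
    return token
```

You could then write invoice_token = load_invoice_token() and pass it to Client exactly as in the script above.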

 

 

Go back to your console and run the script:

 

 

python main.py

 

 

You should see the details of the invoice appearing in your console:

 

-----Invoice data-----
Filename: 1p_invoice_b.pdf 
Invoice number: A-1414 
Total amount including taxes: 2608.2 
Total amount excluding taxes: 2415.0 
Invoice date: 2018-09-25
Supplier name: DESIGNS TURNPIKE CO
Taxes: 193.2 8.0%
Total taxes: 193.2
----------------------
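
Before pushing values like these into accounting software, it can be worth checking that they agree with each other. The following sketch works on plain floats copied from the sample output above; the function names and the 0.01 tolerance are our own assumptions.

```python
import math


def totals_are_consistent(total_excl, total_tax, total_incl, tolerance=0.01):
    """Check that excl. + tax matches incl. within a small absolute tolerance."""
    return math.isclose(total_excl + total_tax, total_incl, abs_tol=tolerance)


def effective_tax_rate(total_excl, total_tax):
    """Return the tax rate in percent, rounded to two decimals."""
    return round(100 * total_tax / total_excl, 2)


# Values copied from the sample output above
print(totals_are_consistent(2415.0, 193.2, 2608.2))
print(effective_tax_rate(2415.0, 193.2))
```

For the sample invoice, the sum matches the total including taxes and the computed rate agrees with the 8.0% reported on the Taxes line.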

 

Now let’s take a look at the Invoice object.

 

 

Extracted invoice data

 

The API returns a list of fields extracted from the invoice (total amount, taxes, date …). You can find this list in the “documentation” section of the Mindee platform.

 

The Invoice object under the Response.invoice attribute has all the features you need to get the invoice data, as well as the coordinates in the image needed to highlight the results.

 

Here is a piece of code to show you how to get some of them:

 

from mindee import Client

filepath = "/path/to/your/file"
invoice_token = "my-token-here"

mindee_client = Client(invoice_token=invoice_token)


if __name__ == "__main__":

    response = mindee_client.parse_invoice(filepath)

    # To get the total amount including taxes value (float), ex: 14.24
    print(response.invoice.total_incl.value)

    # To get the total amount excluding taxes value (float), ex: 10.21
    print(response.invoice.total_excl.value)

    # To get the total tax amount value (float), ex: 4.03
    print(response.invoice.total_tax.value)

    # To get the list of taxes
    print(response.invoice.taxes)

    # Loop on each Tax field
    for tax in response.invoice.taxes:
        # To get the tax amount
        print(tax.value)

        # To get the tax code from a tax object
        print(tax.code)

        # To get the tax rate
        print(tax.rate)

    # To get the invoice date of issuance (string)
    print(response.invoice.invoice_date.value)

    # To get the invoice due date (string)
    print(response.invoice.due_date.value)

 

Run the script and check out the results in your console!
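
If you plan to hand the extracted values to another system, it may be convenient to gather them into a plain dictionary first. The sketch below only assumes that each field exposes a .value attribute, as in the snippet above; the helper invoice_to_dict is ours, and the stand-in object exists purely so the example runs without calling the API.

```python
from types import SimpleNamespace

# Field names used elsewhere in this tutorial
FIELD_NAMES = ("invoice_number", "invoice_date", "due_date", "supplier",
               "total_excl", "total_tax", "total_incl")


def invoice_to_dict(invoice, field_names=FIELD_NAMES):
    """Collect the .value of each named field; missing fields become None."""
    record = {}
    for name in field_names:
        field = getattr(invoice, name, None)
        record[name] = getattr(field, "value", None)
    return record


# Stand-in object for illustration only; in the tutorial you would pass response.invoice
fake_invoice = SimpleNamespace(
    invoice_number=SimpleNamespace(value="A-1414"),
    total_incl=SimpleNamespace(value=2608.2),
)
print(invoice_to_dict(fake_invoice))
```

Fields that are absent from the object simply come back as None, which keeps the downstream code free of attribute errors.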

  

Now we are going to add a piece of code that highlights the extracted features on the PDF. This helps you or your users validate the data extraction very quickly.

 

 

Convert pdf into an image

 

If you’re using a PDF, we first need to convert it into an image, as an image is more convenient for highlighting the features. If your invoice is already an image, you can skip this part.

 

To convert the PDF into an image, we are going to use the pdf2image Python library. First, install it in your environment:

 

pip install pdf2image

 

Important note: Dealing with PDFs is not easy, and this library might not work out of the box because it relies on extra dependencies (the poppler utilities). Check out the project's GitHub page to make sure you installed it correctly.
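
pdf2image relies on the poppler command-line tools, so a quick way to spot a missing dependency is to look for the pdftoppm executable on your PATH. This is a rough sketch with a helper name of our own; it does not cover setups where poppler lives in a custom location.

```python
import shutil


def poppler_available():
    """Return True if poppler's pdftoppm executable is found on the PATH."""
    return shutil.which("pdftoppm") is not None


if __name__ == "__main__":
    if poppler_available():
        print("poppler found, pdf2image should be able to render PDFs")
    else:
        print("pdftoppm not found on PATH, install poppler first")
```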

 

Once installed, we are going to create an image out of our pdf:

 

 

from pdf2image import convert_from_path, convert_from_bytes
images = convert_from_path(filepath)

 

 

The images object is a list of Pillow images, one per page. Pillow is an image library for Python. As we want to use OpenCV for highlighting the features on the image, we need to convert the Pillow image into an OpenCV image. OpenCV images are numpy arrays, so we are going to use numpy directly for the conversion.

 

If you don’t have numpy installed on your environment:

 

 

pip install numpy

 

 

Now let’s convert the Pillow image into the OpenCV format using numpy:

 

 

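import numpy as np
from pdf2image import convert_from_path

def pdf_to_cv_image(filepath, page=0):
   # Render the requested page as a Pillow image and make sure it is in RGB mode
   image = convert_from_path(filepath)[page].convert("RGB")
   # OpenCV expects BGR channel order, so reverse the channel axis
   return np.ascontiguousarray(np.asarray(image)[:, :, ::-1])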

 



 

Highlight features on the image

 

Let’s try to highlight the features as if someone did it with a pen.

 

First, you’ll need to install the computer vision python library OpenCV if you don’t have it already installed in your env. To do so, run:

 

 

pip install opencv-python

 

Each field extracted in the Response.invoice object has a Field.bbox attribute. It is a simple list of the bounding box's vertices, expressed in relative coordinates (relative to the image width and height).

 

Check out what's inside total_incl.bbox, for example, by running:

 

print(response.invoice.total_incl.bbox)

 

Now that we know how to access the bounding box coordinates, we are going to create our highlighter function, which accepts the bounding boxes of any fields and works in the following steps:

 

  1. Check if the file is a pdf or an image and load it as an openCV image in both scenarios
  2. Create a mask image
  3. Loop on each feature coordinates and draw the feature rectangle on our mask
  4. Overlay the mask and original image with alpha
  5. Display image to the user

 

Note: each coordinate returned by the API is relative (in % of the image). You'll see there is a relative to absolute conversion in the code.
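
To make the relative-to-absolute conversion concrete, here is a small standalone sketch. It assumes a bounding box is a list of (x, y) vertices with values relative to the image size, as described above; the helper bbox_to_rect and the example box are hypothetical.

```python
def bbox_to_rect(bbox, width, height):
    """Convert relative (x, y) vertices to an absolute (top-left, bottom-right) pair."""
    xs = [int(width * x) for x, _ in bbox]
    ys = [int(height * y) for _, y in bbox]
    return (min(xs), min(ys)), (max(xs), max(ys))


# Hypothetical bounding box: four vertices, clockwise from the top-left corner
example_bbox = [[0.25, 0.5], [0.75, 0.5], [0.75, 0.625], [0.25, 0.625]]
print(bbox_to_rect(example_bbox, 1000, 800))  # ((250, 400), (750, 500))
```

Taking the minimum and maximum over all vertices, rather than picking two fixed ones, keeps the rectangle correct even if the vertices arrive in a different order.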

 

Here is the code step by step:

 

import cv2

def highlight_features(img_path, coordinates):
   # step 1: Check if the file is a pdf or an image and load it as an OpenCV image in both scenarios
   if img_path.endswith(".pdf"):
       cv_image = pdf_to_cv_image(img_path)
   else:
       cv_image = cv2.imread(img_path)

   # step 2: create mask image
   overlay = cv_image.copy()
   h, w = cv_image.shape[:2]

   # step 3: Loop on each feature coordinates and draw the feature rectangle on our mask
   for coord in coordinates:
       if len(coord):
           # Convert two opposite vertices from relative to absolute pixel coordinates
           pt1 = (int(w*coord[0][0]), int(h*coord[0][1]))
           pt2 = (int(w*coord[2][0]), int(h*coord[2][1]))
           cv2.rectangle(overlay, pt1, pt2, (70, 230, 244), cv2.FILLED)

   # step 4: Overlay the mask and original image with alpha
   final_image = cv2.addWeighted(overlay, 0.5, cv_image, 0.5, 0)

   # step 5: Display image to the user
   cv2.imshow("highlighted_image", cv2.resize(final_image, (600, int(600*h/w))))
   cv2.waitKey(0)
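
Step 4 is what produces the "highlighter pen" look: cv2.addWeighted computes a weighted sum of the two images, channel by channel. The pure-Python sketch below reproduces that arithmetic for a single pixel so you can see the numbers; it leaves out the rounding and clipping OpenCV applies afterwards, and the helper blend_pixel is ours.

```python
def blend_pixel(overlay_px, image_px, alpha=0.5, beta=0.5, gamma=0.0):
    """Weighted sum of two pixels, channel by channel."""
    return tuple(o * alpha + i * beta + gamma for o, i in zip(overlay_px, image_px))


highlight = (70, 230, 244)   # highlight colour used in highlight_features (BGR order)
white = (255, 255, 255)      # a blank area of the page
print(blend_pixel(highlight, white))  # (162.5, 242.5, 249.5)
```

With both weights at 0.5, the highlight is mixed half-and-half with the page, so the text underneath stays readable. Raising alpha (and lowering beta accordingly) would make the highlight more opaque.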


 

Finally, we just have to modify the code a bit to call the highlighting function and print the results. Here is the final code:

 

 

from mindee import Client
import cv2
import numpy as np
from pdf2image import convert_from_path


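def pdf_to_cv_image(filepath, page=0):
   # Render the requested page as a Pillow image and make sure it is in RGB mode
   image = convert_from_path(filepath)[page].convert("RGB")
   # OpenCV expects BGR channel order, so reverse the channel axis
   return np.ascontiguousarray(np.asarray(image)[:, :, ::-1])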


def highlight_features(img_path, coordinates):
   # step 1: Check if the file is a pdf or an image and load it as an OpenCV image in both scenarios
   if img_path.endswith(".pdf"):
       cv_image = pdf_to_cv_image(img_path)
   else:
       cv_image = cv2.imread(img_path)

   # step 2: create mask image
   overlay = cv_image.copy()
   h, w = cv_image.shape[:2]

   # step 3: Loop on each feature coordinates and draw the feature rectangle on our mask
   for coord in coordinates:
       if len(coord):
           pt1 = (int(w*coord[0][0]), int(h*coord[0][1]))
           pt2 = (int(w*coord[2][0]), int(h*coord[2][1]))
           cv2.rectangle(overlay, pt1, pt2, (70, 230, 244), cv2.FILLED)

   # step 4: Overlay the mask and original image with alpha
   final_image = cv2.addWeighted(overlay, 0.5, cv_image, 0.5, 0)

   # step 5: Display image to the user
   cv2.imshow("highlighted_image", cv2.resize(final_image, (600, int(600*h/w))))
   cv2.waitKey(0)


filepath = "/path/to/your/file"
invoice_token = "my-token-here"

mindee_client = Client(invoice_token=invoice_token)


if __name__ == "__main__":

    response = mindee_client.parse_invoice(filepath)

    print(response.invoice)

    highlight_features(
        filepath,
        [
            response.invoice.total_incl.bbox,
            response.invoice.total_excl.bbox,
            response.invoice.invoice_number.bbox,
            response.invoice.supplier.bbox,
            response.invoice.invoice_date.bbox,
            response.invoice.due_date.bbox,
        ],
    )


 

And the final result!

 


 

Conclusion

 

In just a few seconds, an invoice was parsed and the features were highlighted.
 

If you want to use this kind of script to display results to your users, I'd advise you to do the highlighting in the front-end application, as sending images back from your middleware is not the best option because of payload sizes. Another solution would be to store the final image using cv2.imwrite(...), but it would make your client download the result.
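
As a rough illustration of the front-end approach, the middleware could return only the values and their relative bounding boxes as JSON and let the browser draw the rectangles. The payload shape and the helper build_highlight_payload below are hypothetical; adapt them to whatever your front-end expects.

```python
import json


def build_highlight_payload(fields):
    """Serialize {field name: (value, bbox)} pairs for a front-end to draw."""
    payload = [
        {"name": name, "value": value, "bbox": bbox}
        for name, (value, bbox) in fields.items()
        if bbox  # skip fields that came back without coordinates
    ]
    return json.dumps({"highlights": payload})


# Hypothetical data; in the tutorial these would come from response.invoice fields
example_fields = {
    "total_incl": (2608.2, [[0.25, 0.5], [0.75, 0.5], [0.75, 0.625], [0.25, 0.625]]),
    "due_date": (None, []),
}
print(build_highlight_payload(example_fields))
```

Because the coordinates stay relative, the front-end can scale them to whatever size it renders the invoice at, and the response remains a few hundred bytes instead of a full image.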

 

If you have questions, please reach out to us in the chat widget in the bottom right.