Performance benchmark
In a few steps, you can use the mindee Python library to benchmark passport extraction performance against your own test dataset.
- Perform API calls for each file in your test set
- Create mindee.Passport objects from your data (csv, json or whatever)
- Perform the benchmark and analyze the results
1. Perform API calls for each file in your test set
Before running the benchmark, we need to collect all the passport predictions from the Mindee API and store them, so we can run the benchmark later without calling the API again.
To do so, assuming that all your files are stored in a "./passport_data" folder and that we want to save all the responses into a "./passport_responses" folder, here is the code:
import os

from tqdm import tqdm
from mindee import Client

TEST_DATA_PATH = "./passport_data"
RESPONSES_PATH = "./passport_responses"

# Make sure the output folder exists before writing responses into it
os.makedirs(RESPONSES_PATH, exist_ok=True)

mindee_client = Client(passport_token="your_passport_token_here")

# Loop over all your test files
for test_filename in tqdm(os.listdir(TEST_DATA_PATH)):
    # Get the current file path
    test_file_path = os.path.join(TEST_DATA_PATH, test_filename)

    # To make sure we don't stop the process if an error occurs
    try:
        # Parse the current file
        mindee_response = mindee_client.parse_passport(test_file_path)

        # Store the response inside a json file to be restored later.
        # In this example we use the test file name in the json filename
        # to be able to retrieve the corresponding file
        response_filepath = os.path.join(RESPONSES_PATH, test_filename + ".json")
        mindee_response.dump(response_filepath)
    except Exception as e:
        # If an error occurs, print the filename so you can understand
        # what happened
        print(test_filename, e)
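If the collection loop above is interrupted, rerunning it calls the API again for files that already have a stored response. The following is a minimal sketch of one way to avoid that, using only the standard library. The helper name pending_files is hypothetical and not part of the mindee library; it relies only on the "<test filename>.json" naming convention used above.

```python
import os


def pending_files(test_dir, responses_dir):
    """Return the test filenames that have no stored JSON response yet.

    Hypothetical helper (not part of the mindee library). It assumes the
    naming convention used above: the response for "x.jpg" is "x.jpg.json".
    """
    # A missing responses folder simply means nothing has been processed yet
    done = set(os.listdir(responses_dir)) if os.path.isdir(responses_dir) else set()
    return sorted(
        name
        for name in os.listdir(test_dir)
        if os.path.isfile(os.path.join(test_dir, name))
        and name + ".json" not in done
    )
```

Under these assumptions, the loop could iterate over pending_files(TEST_DATA_PATH, RESPONSES_PATH) instead of os.listdir(TEST_DATA_PATH), so that only the missing responses are requested.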
2. Create mindee.Passport objects from your data (csv, json, whatever...)
The mindee.Passport class contains a compare() method that takes two mindee.Passport objects as inputs. Before running our final script, we now need to create a mindee.Passport object containing the true labels for each field.
We'll use a csv file and the pandas library in this example.
labels.csv header
,filename,birth_date,birth_place,country,expiry_date,gender,given_names,id_number,issuance_date,mrz1,mrz2,surname
To construct a Passport object from this dummy csv example, you can simply do:
import pandas as pd
from mindee import Passport

ground_truth_df = pd.read_csv("./labels.csv")


def passport_from_csv_row(df_row):
    # For variable-length features, split the string to get a list
    given_names = df_row["given_names"].split("|")
    return Passport(
        birth_date=df_row["birth_date"],
        birth_place=df_row["birth_place"],
        country=df_row["country"],
        expiry_date=df_row["expiry_date"],
        gender=df_row["gender"],
        given_names=given_names,
        id_number=df_row["id_number"],
        issuance_date=df_row["issuance_date"],
        mrz1=df_row["mrz1"],
        mrz2=df_row["mrz2"],
        surname=df_row["surname"]
    )


for index, df_row in ground_truth_df.iterrows():
    passport_truth = passport_from_csv_row(df_row)
    print(passport_truth)
Running this code should print in your console something like this:
-----Passport data-----
Filename: passport.png
Full name: HENERT PUDARSAN
Given names: HENERT
Surname: PUDARSAN
Country: GBR
ID Number: 707797979
Issuance date: 2012-04-22
Birth date: 1995-05-20
Expiry date: 2017-04-22
MRZ 1: P<GBRPUDARSAN<<HENERT<<<<<<<<<<<<<<<<<<<<<<<
MRZ 2: 7077979792GBR9505209M1704224<<<<<<<<<<<<<<00
MRZ: P<GBRPUDARSAN<<HENERT<<<<<<<<<<<<<<<<<<<<<<<7077979792GBR9505209M1704224<<<<<<<<<<<<<<00
----------------------
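One caveat with the conversion function above: it calls .split("|") on the given_names cell, which fails when that cell is empty, because pandas reads an empty csv cell as NaN (a float) by default. Here is a minimal sketch of a more tolerant splitter. The name split_given_names is hypothetical, and the "|" separator is the one assumed by the example csv.

```python
import math


def split_given_names(value, sep="|"):
    """Turn a 'NAME1|NAME2' cell into a list of names.

    Hypothetical helper: returns an empty list for missing cells
    (None, NaN or an empty string) instead of raising an error.
    """
    if value is None:
        return []
    # pandas represents an empty cell as float NaN by default
    if isinstance(value, float) and math.isnan(value):
        return []
    # Drop surrounding whitespace and ignore empty fragments
    return [name.strip() for name in str(value).split(sep) if name.strip()]
```

Inside passport_from_csv_row, this would replace the direct call to split. Another option is to load the file with pd.read_csv("./labels.csv", dtype=str, keep_default_na=False), which keeps every cell as a string, so empty cells stay empty strings and values such as id_number are not converted to integers.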
3. Perform the benchmark and see the results
As a last step, we need to wrap it all up.
The Benchmark class has two methods, Benchmark.add and Benchmark.save, for adding a comparison between two Passport objects and for saving the final metrics.
import pandas as pd
from mindee import Response, Passport, Benchmark
import os

RESPONSES_PATH = "./passport_responses"
BENCHMARK_PATH = "./benchmark_passport"

benchmark = Benchmark(BENCHMARK_PATH)
ground_truth_df = pd.read_csv("./labels.csv")


def passport_from_csv_row(df_row):
    given_names = df_row["given_names"].split("|")
    return Passport(
        birth_date=df_row["birth_date"],
        birth_place=df_row["birth_place"],
        country=df_row["country"],
        expiry_date=df_row["expiry_date"],
        gender=df_row["gender"],
        given_names=given_names,
        id_number=df_row["id_number"],
        issuance_date=df_row["issuance_date"],
        mrz1=df_row["mrz1"],
        mrz2=df_row["mrz2"],
        surname=df_row["surname"]
    )


# Loop over each file in our csv
for index, df_row in ground_truth_df.iterrows():
    try:
        # Create ground truth passport object
        ground_truth_passport = passport_from_csv_row(df_row)

        # Load the mindee Response for the current file
        mindee_response = Response.load(os.path.join(RESPONSES_PATH, df_row["filename"] + ".json"))

        # Add the comparison between the two passports to the benchmark
        benchmark.add(
            Passport.compare(mindee_response.passport, ground_truth=ground_truth_passport),
            df_row["filename"]
        )
    except Exception as e:
        print(df_row["filename"], e)

benchmark.save()
Inside the benchmark folder, you should see that a new directory was created, containing a metrics.png file with the different metrics.
The Passport benchmark runs on the 12 fields as shown above. For each of them, you get:
- accuracy: the proportion of correct predictions
- precision: the proportion of correct predictions among all the non-null predictions
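To make the two definitions concrete, here is a minimal sketch that computes both metrics for a single field from a list of (predicted, expected) pairs. It only illustrates the definitions above and is not the library's implementation; the function name field_metrics and the use of None for a null prediction are assumptions.

```python
def field_metrics(pairs):
    """Compute (accuracy, precision) for one field.

    pairs: list of (predicted, expected) values. A predicted value of None
    stands for "no prediction" (an assumption of this sketch).
    """
    total = len(pairs)
    # Accuracy counts every document, including those without a prediction
    correct = sum(1 for predicted, expected in pairs if predicted == expected)
    # Precision only looks at documents where a value was actually predicted
    non_null = [(p, e) for p, e in pairs if p is not None]
    correct_non_null = sum(1 for p, e in non_null if p == e)

    accuracy = correct / total if total else None
    precision = correct_non_null / len(non_null) if non_null else None
    return accuracy, precision


# Two correct predictions, one wrong prediction, one missing prediction
country_pairs = [("GBR", "GBR"), ("FRA", "GBR"), (None, "GBR"), ("USA", "USA")]
accuracy, precision = field_metrics(country_pairs)
```

In this example accuracy is 2/4 = 0.5 and precision is 2/3: a missing prediction lowers accuracy but leaves precision unchanged.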