Learnings from building AI products as a PM
As a Product Manager, the mission is always the same: building products that bring value to your users. However, the path to building one can vary a lot from one product to another. What happens if your product does not fit the Agile manifesto? What if weekly or bi-weekly sprints do not apply to your product's delivery time scale? What if building an MVP requires months of effort? What would you do? Where would you start?
At Mindee, we are developing deep learning based products. We apply computer vision to extract key information out of documents. That requires tons of data and complex algorithms, which means development can take months, yet the outcome is uncertain. You can spend weeks working on an algorithm that will fail at its mission. As a PM, you won't be able to remove this uncertainty entirely before shipping your product, but you want to reduce it as much as you can.
As a product team, we wanted to improve our velocity by shipping smaller products/features more frequently and limit the “tunnel effect” that we were seeing from our latest projects. We also wanted to reduce risks: what if we were spending months on a project that never ships?
On average, we identified that it took us between 4 and 6 months to build a new complex feature or product. We wanted to find ways to go below this 4-month threshold. That was our target to reach.
In this article we will explain the journey we went through on the latest Invoice OCR API iteration, which consists of extracting line items from invoices – a complex problem to tackle from a deep learning perspective. First, we will give more context on this feature, and then we will deep dive into what we implemented to reduce our delivery time and how we managed to ship a first version within 3 months.
From understanding business needs to deploying new models in production it can take months. But why does it take that much time?
- Use case analysis: you need to deeply understand the business use cases to help make decisions later on (on performance expectations, for instance).
- Data exploration is the phase where you deep dive into the data to understand distributions, patterns and edge cases.
- Then, you need to start an Annotation project to get labeled data to train your model. This requires having collected data beforehand.
- Once the data is annotated, it is split into 3 different sets: the training set (data used for training), the validation set (data used for testing and fine-tuning your models) and the test set (data used one and only one time, for the final evaluation of the model). The Modelling phase can then start using the training set. It includes choosing, training and fine-tuning models.
- Once a model is ready, we need to test it to evaluate its performance and analyse where it makes mistakes. That is the Evaluation phase.
- When the model is ready, it can be released. That is the Release phase.
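The three-way split described above can be sketched in a few lines. This is a minimal illustration, not Mindee's actual pipeline; the 80/10/10 ratios and the random shuffle are assumptions for the example:

```python
import random

def split_dataset(samples, train=0.8, val=0.1, seed=42):
    """Shuffle annotated samples and split them into train / validation / test sets."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)  # fixed seed for reproducibility
    n = len(samples)
    n_train = int(n * train)
    n_val = int(n * val)
    return (
        samples[:n_train],                 # training set: used to fit the model
        samples[n_train:n_train + n_val],  # validation set: tuning and model selection
        samples[n_train + n_val:],         # test set: used once, for final evaluation
    )

train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))  # → 800 100 100
```

Keeping the test set untouched until the very end is what makes the final evaluation trustworthy.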
Training models can take days, and it is an iterative process, as you might need to improve your model based on its evaluation. For each new version of the model, we analyse its performance and keep iterating and fine-tuning until it matches business expectations.
All those steps combined can take months, because most of them cannot easily be done in parallel. For instance, you have to wait until enough data is annotated before you can start training your models. When you start a project, it is impossible to give a fair estimate of how long it is going to take:
- because you need extremely accurate annotations, and this will require a couple of reviews and iterations to make sure they reach the quality required for the project.
- because you don't know how many iterations on your model will be necessary for it to reach the expected performance.
- because, as we mentioned above, with deep learning, sometimes the approach you choose will fail and you will have to start from scratch again with a new approach.
At Mindee, we are building products for Product Managers and Developers who want to automatically extract information out of documents. We develop in-house deep learning algorithms tailored to each type of document we want to process. Our AI technology is then exposed through APIs to make it accessible to any developer or PM who wants to implement it within their software. We support invoices, receipts, passports and custom documents. In this article we will focus on our Invoice OCR API, which extracts key information from invoices.
The existing version of the Invoice OCR API supports generic fields such as supplier name, etc. The exhaustive list is available in our documentation.
From the example above, we can see that the table and its lines are not extracted yet. Line items extraction refers to the extraction of these lines, as highlighted in orange in the illustration below.
We had identified that extracting line items using deep learning would be a challenging problem to crack. The biggest caveat we wanted to prevent was spending too much time in delivery and spreading our energy across multiple sub-problems. Working on this new feature was an opportunity to challenge our delivery process and improve how we work as a team.
It all starts with discovery 🙂 The first thing to understand was where there was friction in our organisation and where, as Product Managers, we could help the Data science team in the delivery process.
I have worked hand in hand with Rémy, our head of Data, to identify:
- how we could break down the line items extraction problem in smaller problems to solve
- how we could set intermediary goals that would prevent the “tunnel effect”
Moreover, during this phase, we also identified that Data scientists were facing a lot of questions during the model evaluation phase:
- what are customers expectations with this product?
- what are the metrics expectations, when do we know that the model is ready for production?
- shall we support this case or not? should we keep improving or not?
Helping them with information ahead of those questions would prevent back and forth and save us a lot of time. That led to the following proposal:
- Having an overview of how customers would use it and how performance would impact their use case
- Building tailor-made metrics that will help evaluate performances for this specific problem we are trying to solve
- Being able to anticipate algorithm issues (where the model might struggle)
- Having a clear understanding, right from the start, of the performance expectations
- Knowing the priorities and which problems they should focus on vs. the edge cases they should not care about
That led to building this 4-step action plan:
- 🔍 Use cases analysis: Deep dive into customer and industry use cases to extract key features and understand performance expectations.
- 🔬 Data exploration: Build a deep understanding of what are line items in documents. Identify common cases from edge cases.
- ⚡️ MVP scoping: From data exploration, build a list of priorities, scope a potential MVP, and define the Minimum Viable Metrics required to launch a 1st version.
- 🏗 Delivery: Define a delivery rhythm where Data Science and Product teams sit together to look at both quantitative evaluation of model performances (Data science metrics called recall and precision) and qualitative performances from model errors analysis.
The first part of our process was to build thorough requirements, from use case understanding to data exploration, to be able to set the right expectations. That is different from a traditional product requirements document (PRD), as the data exploration part is specific to deep learning based products.
Invoices are processed in a lot of use cases, such as Accounts Payable, Accounting, Procurement and many more. As an example, I will deep dive into the procurement use case to explain what we learned and how it helped build our requirements. For procurement, the goal is to match what has been ordered to what is delivered and then billed. This process is called three-way matching, as displayed in the illustration below.
Let's say that you need to buy 5 computers. Once you accept a quote from a dealer, a Purchase Order (PO) is issued containing all the information about the items ordered. Unfortunately, your dealer does not have enough inventory, so only 3 computers are delivered. The delivery note contains this information. The last step of the process is the invoicing: you want to make sure that the invoice contains the right quantity of goods delivered at the right price.
This gives a clear understanding of why line items extraction is business critical for procurement: you need the line details for each item to do the matching. Moreover, this use case also gives an idea of the information we need to extract for each line:
- a code or description that identifies the item
- unit price
- total amount
We did the same for the other main use cases, and combined this with all the customer feedback we had gathered in Productboard, which was really exhaustive. That helped us build the list of the 10 fields that we aimed at extracting, such as unit of measure and discount amount.
- With this phase, we exhaustively documented customer feedback and use case analyses to define the fields we wanted to support. That gave the team a clear understanding of business expectations.
- That helped build our PRD template for upcoming Data science iteration on our products.
As this is really specific to deep learning based products, we will spend a bit more time on this part.
Data exploration is the process of looking at a sufficient amount of data so that:
- it is as representative as possible of what the model will encounter in production
- you find out what the main use cases and potential edge cases are
Why is it an essential part of our requirement definition?
- Because it helps map real-life documents to end-user expectations.
- It helps focus on the right problems to solve, based on occurrence statistics.
- Because it helps identify main cases and edge cases. You can quickly see where the algorithm might struggle to find relevant information or where it will fail.
As this was my first time doing such an exhaustive and meticulous data exploration, I worked hand in hand with the Data science team to understand what we needed to look at in the exploration dataset. Together, we created a small checklist of things to look for, which we iterated on.
- Patterns identification
- [ ] main use cases – show examples
- [ ] edge cases – occurrences and examples
- Generic information on document configuration
- [ ] Image deformation distribution
- [ ] Number of lines
- Writing output
- [ ] for each example write user expected output
Before reviewing any data, we defined the size of the sample dataset to minimize the error margin – in data science everything is about statistics, and being thorough is key. We built a 400+ document exploration dataset, ensuring roughly a 5% error margin. And then it's time to open your eyes 👀, as you have to look at this dataset a couple of times. This is the most tedious part of the process; however, it is also the most important, as we will see.
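The ~5% figure can be sanity-checked with the standard margin-of-error formula for an estimated proportion. A sketch, assuming a 95% confidence level (z = 1.96) and the worst-case proportion p = 0.5:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Worst-case margin of error for a proportion estimated from n samples."""
    return z * math.sqrt(p * (1 - p) / n)

# With ~400 documents the margin is just under 5 percentage points.
print(round(margin_of_error(400), 3))  # → 0.049
```

In other words, any frequency observed on a 400-document sample is within about ±5 points of its true value, which is precise enough to prioritise edge cases.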
- Patterns identification
The first time you look at those exploration documents, the idea is to understand which patterns you can identify. There are the common cases, which you encounter multiple times and which look simple to extract, such as the one below: a single-line table with a clear header and a standard layout.
And then there are all the edge cases you can see. I have described a couple of them below.
The approach is to identify, for each edge case, its occurrence: how frequently it repeats itself in the dataset. For those examples, you know you will have to be careful about how the model performs on those configurations.
- Generic information on document configuration
All of this was documented in a Notion database containing a list of the various cases identified within the dataset. Here is an extract below, with the example of the number of lines. It was one piece of information we were keen to understand: on average, how many lines are there in the invoices' tables?
For instance here, having a clear distribution gives an idea of what needs to be supported: if we support tables with up to 10 lines, we cover 98.2% of use cases. For a first iteration, there is no need to spend energy making sure we can extract tables with 50 lines. It is also critical to understand this early in the process, as this parameter can impact computation time.
Let's take a look at another example: image deformation. Sometimes documents are scanned, resulting in tilted, blurred or folded documents that can be tricky to read. From experience, we knew that tables would be difficult to extract from this kind of document.
What type of deformation do we need to support?
Exploring data gives us an idea of the types of deformation and their distribution within our sample dataset. That also gives direction for an MVP. Thanks to this analysis, we decided that no effort should be made to solve issues on severely tilted documents in the 1st iteration.
- Writing output
A last learning on this: to go the extra mile, for each example we put in, i.e. for each table, I wrote what the expected output would be from an end-user perspective. It is super effective for identifying new edge cases and making new problems emerge. Putting yourself in the customer's shoes is always a good idea 😉. It also gives a sense of what is acceptable from a user perspective and what is not, and it helps share this information with the Data science team for decision-making purposes.
In our requirements, it looks like this:
Let's be clear: data exploration is time-consuming and requires a lot of rigour. However, it has been key to the project's success and to improving our delivery.
- It saved us a lot of time by clearly stating which problems to focus on and which ones to ignore. On one hand, we had a clear vision of the main cases. On the other hand, for each edge case that we identified, we had a statistic on its occurrence. That helped us make data-driven decisions: thanks to those statistics, we skipped some image deformations and focused on solving the extraction of 10-line tables. We definitely gained a lot of time, effort and focus.
- We built new tooling such as a checklist of configurations to look at on documents that we can re-use in the future
- We organised a deep dive session with the team to share the learnings from this exploration. It helped them get on board with the requirements synthesis, which was dense, and gave them an overview of where we were heading.
In data science, we are obsessed with how we evaluate the performance of our models. This is somehow our north star, but we are also very careful, because if you look only at those metrics you can miss important limits of your model. Working on deep learning problems is open-ended: you can always improve (under certain conditions); it is not a binary works / does not work problem. It is therefore super important to define acceptance criteria for your model so that you know when to stop.
A bit of context: how do we evaluate model performance? Are you familiar with recall and precision? If not, I suggest you read this excellent post by Jonathan, our CEO. You can't look at one without looking at the other; however, for each feature we tend to focus on the one that is most significant to the use case. And the metric is tightly linked to the user experience.
- Pick a data science metric to focus on for model iteration
Let's take an example: let's say that I am the end user of a procurement software. I have downloaded the PO and the delivery note, and now I am downloading the corresponding invoice that contains the table below. A deep learning based OCR runs in this procurement software to extract key information from the invoice, so that I do not have to fill in all the information manually. But you know, sometimes algorithms make mistakes.
As a user, what type of errors do I prefer? What are the impacts on my end?
Recall vs precision: what does it mean for line items?
- Recall: do I prefer to correct wrong information because the model made a mistake? It means that the model would rather predict a line item even if it makes a mistake, because it is OK for me to make modifications. If you look at the table above, that means it would be OK if the model extracted the line "Billing period: 01/01/21 – 31/01/21" even though it is not a line item according to our definition (a line item is an item description combined with pricing, and there is no pricing associated with this description). In that case, recall would be the metric to look at.
- Precision: instead, do I prefer to manually input information, because I do not want to make those modifications? This means that the model would rather not predict when it is not "sure enough" of the result. So for instance here, focusing on precision could mean that the only real line item, "Monthly subscription […]", might be missed under certain conditions.
As a summary, we use metrics to help define the type of errors we’d rather accept vs the ones we want to prevent.
For line items extraction, we focused on maximizing our recall, as it seems easier to correct errors than to manually type all the information from a line.
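The trade-off above can be made concrete with a toy, exact-match evaluation. This is a simplified sketch (real line-item evaluation is more involved, with per-field matching); the example lines echo the "Billing period" case discussed earlier:

```python
def precision_recall(predicted, expected):
    """Exact-match precision and recall over sets of extracted line items."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall

# One real line item in the ground truth...
expected = {("Monthly subscription", 1, 59.0)}
# ...but the model also extracted a non-item line ("Billing period"):
# recall stays perfect, precision drops - the trade-off we accepted.
predicted = {("Monthly subscription", 1, 59.0), ("Billing period", None, None)}

p, r = precision_recall(predicted, expected)
print(p, r)  # → 0.5 1.0
```

A recall-first model accepts this kind of spurious extra line; a precision-first model would instead risk dropping the real "Monthly subscription" line.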
- Define your MVP
As our product did not really fit how we imagine a traditional MVP, we decided to define what an MVP looked like for us. Our definition of an MVP is a combination of 2 acceptance criteria:
- on one hand, it is quantitative and based on data science metrics. We defined our Minimum Viable Metrics: what is the minimum precision we need to reach to launch? What is acceptable from a user perspective?
- on the other hand, it is qualitative: are the main cases supported? Which edge cases are already supported? What types of errors is the model making? Are they acceptable?
By defining what the model must support for launch, both in terms of document configurations and performance metrics, we managed to break the project down into smaller iterations.
To do so, we identified:
- on one hand, priorities to solve the various configurations identified within the dataset
- on the other hand, we also split our 10 fields to extract into 3 different priorities based on our understanding from use cases. The idea was to be able to ship only with P1 fields if P2 and P3 were performing poorly.
- Manually analyse model errors and iterate
But working with metrics is not enough. You want to understand where the model fails to extract information: are there specific configurations performing badly? The only way is to look at the documents where the model performs poorly, in your validation set. ⚠️ The idea is not to fix those on a case-by-case basis, as this would introduce bias; rather, it gives trends on configurations that might be difficult for our models. Going back to your data exploration can give an idea of how frequently a configuration occurs and whether you should put time and effort into solving it.
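Looking for trends rather than individual failures amounts to grouping evaluation errors by the configurations identified during data exploration. A minimal sketch; the configuration tags and error records are hypothetical:

```python
from collections import Counter

# Hypothetical validation-set errors, each tagged with the document
# configuration identified during data exploration.
errors = [
    {"doc_id": 1, "config": "tilted_scan"},
    {"doc_id": 2, "config": "multi_page_table"},
    {"doc_id": 3, "config": "tilted_scan"},
    {"doc_id": 4, "config": "no_header_row"},
    {"doc_id": 5, "config": "tilted_scan"},
]

# Count errors per configuration: we look for trends, not case-by-case fixes.
by_config = Counter(e["config"] for e in errors)
print(by_config.most_common())  # most frequent failing configurations first
```

Cross-referencing the top failing configurations with their occurrence statistics from data exploration tells you whether a failure mode is worth the engineering effort.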
From what we learned during data exploration, scoping problems into priorities helped us build our own kind of MVP. It gave us a clear framework to use during delivery.
- Definition of the key metric to focus on to accelerate evaluation
- MVP Scoping: mapping each field to extract with P1, P2 or P3 priority
- MVP Scoping: for each configuration (edge case), set the urgency to solve it (high to low)
To make sure we shipped as fast as we could, we set regular touch points. The idea of those touch points was to analyse:
- the model performances (key metrics)
- the model errors
We kept on referring back to our MVP definition – and challenged what we initially wanted to support:
- we realised early that P3 fields (such as discount amount) would not be supported in our first iteration, because they were performing poorly. By having set those priorities, we did not spend any time trying to improve their performance; we stayed focused on P1 and P2 fields.
- we discovered new edge cases for which we did not have statistics. So we did another round of data exploration to make sure we had the right statistics to make informed decisions.
That gave us more velocity and a clear target to reach between each iteration.
At this stage, the gain of time is really based on previous steps. Because we had a clear understanding of what to solve first, what to focus on, we were able to make decisions quickly.
🎯 We reached our goal: it took us approximately 3 months to ship the first version of the line items extraction feature on our Invoice OCR API. From a deep learning perspective, extracting tables from invoices was a complex problem to crack; it is a real team success! 🎉 If you are curious to understand how we cracked it, you can read this article that describes our approach.
🔥 What worked well:
- spending time on data exploration gave us a deep understanding of what could go wrong, where we would struggle, where we should put our efforts and, conversely, where we should let go. It helped us make faster decisions.
- pushing an MVP approach by setting clear priorities for each field and each configuration to support helped us define when to continue iterating on our models and when it was OK to stop. We ultimately shipped only 7 of the 10 identified fields in the 1st version, really pushing to get a 1st version of the feature out as soon as possible.
📑 What we discovered:
- Data exploration is a never-ending story: we kept coming back to it for each new problem we encountered. It is a powerful way to put statistics in front of problems, and it enables data-driven decisions. We aim at systematically using data exploration output to drive decisions pragmatically.
Being a PM working on deep tech products coming from advanced deep learning research is very different from a SaaS product role. Deep learning based products imply longer development cycles and tailor-made development processes. Moreover, the model you have spent weeks developing might fail, and you might need to start again. As a PM, this requires building an approach different from existing product management methodologies.
As a summary, here are a couple of takeaways from what we experimented with to accelerate the delivery of our latest Invoice OCR API:
- Understanding the data is key. It is time-consuming, but the ROI is maximum. It saved us a lot of time and energy.
- If you want to communicate efficiently with the Data science team, you need to deeply understand the metrics and make sure that there are clear targets to reach.
- Having set clear acceptance criteria gave us a clear vision on where we should be heading. We combined quantitative criteria (data science metrics targets) with qualitative criteria (document configuration and edge cases to support) to help set our fastest path towards launch.
- Setting clear priorities on what to support helped us navigate deep learning uncertainties and reduce release scope when needed. We did not hesitate to narrow down our feature set for launch.
- When standard methodologies do not apply to your product, build your own: understand with your engineering team where the gaps are and work backwards. This is how we built our data exploration process and requirements.
As we are now close to launching this 1st iteration, we are looking forward to learning how we can improve what we have implemented so far and how we can iteratively deliver the next version of our models. We are looking for ideas to get even better. And you, how do you work on your AI products? Any tips to share with us in the comments?