Learnings from building AI products as a PM
As a Product Manager, the mission is always the same: building products that bring value to your users. However, the path to building one can vary a lot from one product to another. What happens if your product does not fit the Agile manifesto? What if weekly or bi-weekly sprints do not apply to your product's delivery time scale? What if building an MVP requires months of effort? What would you do? Where would you start?
At Mindee, we are developing deep learning based products. We apply computer vision to extract key information out of documents. That requires tons of data and complex algorithms, which means development can take months, yet the outcome is uncertain. You can spend weeks working on an algorithm that will fail at its mission. As a PM, you won't be able to remove this uncertainty entirely before shipping your product, but you want to reduce it as much as you can.
As a product team, we wanted to improve our velocity by shipping smaller products/features more frequently and limit the “tunnel effect” that we were seeing from our latest projects. We also wanted to reduce risks: what if we were spending months on a project that never ships?
On average, we identified that it took us between 4 and 6 months to build a new complex feature or product. We wanted to find ways to go below this 4-month threshold. That was our target to reach.
In this article we will explain the journey we went through on the latest Invoice OCR API iteration, which consists of extracting line items from invoices – a complex problem to tackle from a deep learning perspective. First, we will give more context on this feature, and then we will deep dive into what we implemented to reduce our delivery time and how we managed to ship a first version within 3 months.
From understanding business needs to deploying new models in production it can take months. But why does it take that much time?
- Use case analysis: you need to deeply understand the business use cases to help make decisions later on (on performance expectations, for instance).
- Data exploration is the phase where you deep dive into the data to understand distributions, patterns and edge cases.
- Then, you need to start an Annotation project to get labeled data to train your model. This requires having collected data beforehand.
- Once the data is annotated, it is split into 3 different sets: the training set (data used for training), the validation set (data used for testing and fine-tuning your models) and the test set (data used one and only one time, for the final evaluation of the model). The Modelling phase can then start using the training set. It includes choosing, training and fine-tuning models.
- Once a model is ready, we need to test it to evaluate its performance and analyse where it makes mistakes. That is the Evaluation phase.
- When the model is ready, it can be released. That is the Release phase.
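The three-way split described above can be sketched in a few lines. This is a minimal illustration, not Mindee's actual pipeline; the 80/10/10 ratios and the random shuffle are assumptions for the example:

```python
import random

def split_dataset(samples, train=0.8, val=0.1, seed=42):
    """Shuffle annotated samples and split them into train / validation / test sets."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)  # fixed seed for reproducibility
    n = len(samples)
    n_train = int(n * train)
    n_val = int(n * val)
    return (
        samples[:n_train],                 # training set: used to fit the model
        samples[n_train:n_train + n_val],  # validation set: tuning and model selection
        samples[n_train + n_val:],         # test set: used once, for final evaluation
    )

train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))  # → 800 100 100
```

Keeping the test set untouched until the very end is what makes the final evaluation trustworthy.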
Training models can take days, and it is an iterative process, as you might need to improve your model based on its evaluation. For each new version of the model, we analyse its performance and keep iterating and fine-tuning until it matches business expectations.
All those steps combined can take months, because most of them cannot easily be done in parallel. For instance, you have to wait until enough data is annotated before you can start training your models. When you start a project, it is impossible to give a fair estimate of how long it is going to take:
- because you need extremely accurate annotations, and this will require a couple of reviews and iterations to make sure they reach the quality required for the project.
- because you don't know how many iterations on your model will be necessary for it to reach the expected performance.
- because, as we mentioned above, with deep learning, sometimes the approach you choose will fail and you will have to start from scratch again with a new approach.
At Mindee, we are building products for Product Managers and Developers who want to automatically extract information out of documents. We develop in-house deep learning algorithms tailored to each type of document we want to process. Our AI technology is then exposed through APIs to make it accessible to any developer or PM who wants to implement it within their software. We support invoices, receipts, passports and custom documents. In this article we will focus on our Invoice OCR API, which extracts key information from invoices.
The existing version of the Invoice OCR API supports generic fields such as supplier name, etc. The exhaustive list is available in our documentation.
From the example above, we can see that the table and its lines are not extracted yet. Line items extraction refers to the extraction of these lines, as highlighted in orange in the illustration below.
We had identified that extracting line items using deep learning would be a challenging problem to crack. The biggest caveat we wanted to prevent was spending too much time in delivery and spreading our energy across multiple sub-problems. Working on this new feature was an opportunity to challenge our delivery process and improve how we work as a team.
It all starts with discovery 🙂 The first thing to understand was where there was friction in our organisation and where, as Product Managers, we could help the Data science team in the delivery process.
I have worked hand in hand with Rémy, our head of Data, to identify:
- how we could break down the line items extraction problem in smaller problems to solve
- how we could set intermediary goals that would prevent the “tunnel effect”
Moreover, during this phase, we also identified that Data scientists were facing a lot of questions during the model evaluation phase:
- what are customers expectations with this product?
- what are the metrics expectations, when do we know that the model is ready for production?
- shall we support this case or not? should we keep improving or not?
Helping them with information ahead of those questions would prevent back and forth and save us a lot of time. That led to the following proposal:
- Having an overview of how customers would use it and how performance would impact their use case
- Building tailor-made metrics that will help evaluate performances for this specific problem we are trying to solve
- Being able to anticipate algorithm issues (where the model might struggle)
- Having a clear understanding, right from the start, of the performance expectations
- Knowing the priorities and which problems they should focus on vs. the edge cases they should not care about
That led to building this 4-step action plan:
- 🔍 Use cases analysis: Deep dive into customer and industry use cases to extract key features and understand performance expectations.
- 🔬 Data exploration: Build a deep understanding of what are line items in documents. Identify common cases from edge cases.
- ⚡️ MVP scoping: From data exploration, build a list of priorities, scope a potential MVP, and define the Minimum Viable Metrics required to launch a 1st version.
- 🏗 Delivery: Define a delivery rhythm where Data Science and Product teams sit together to look at both quantitative evaluation of model performances (Data science metrics called recall and precision) and qualitative performances from model errors analysis.
The first part of our process was to build thorough requirements, from use case understanding to data exploration, to be able to set the right expectations. That is different from a traditional product requirements document (PRD), as the data exploration part is specific to deep learning based products.
Invoices are processed in a lot of use cases, such as Accounts Payable, Accounting, Procurement and many more. As an example, I will deep dive into the procurement use case to explain what we learned and how it helped build our requirements. For procurement, the goal is to match what has been ordered to what is delivered and then billed. This process is called three-way matching, as displayed in the illustration below.
Let's say that you need to buy 5 computers. Once you accept a quote from a dealer, a Purchase Order (PO) is issued containing all the information about the items ordered. Unfortunately, your dealer does not have enough inventory, so only 3 computers are delivered. The delivery note contains this information. The last step of the process is the invoicing: you want to make sure that the invoice contains the right quantity of goods delivered at the right price.
This gives a clear understanding of why line items extraction is business critical for procurement: you need the line details for each item to do the matching. Moreover, this use case also gives an idea of the information we need to extract for each line:
- a code or description that identifies the item
- unit price
- total amount
We did the same for the other main use cases, and combined this with all the customer feedback we had gathered in Productboard, which was really exhaustive. That helped us build the list of the 10 fields that we aimed at extracting, such as unit of measure and discount amount.
- With this phase, we exhaustively documented customer feedback and use case analyses to define the fields we wanted to support. That gave the team a clear understanding of business expectations.
- That helped build our PRD template for upcoming Data science iteration on our products.
As this is really specific to deep learning based products, we will spend a bit more time on this part.
Data exploration is the process of looking at a sufficient amount of data so that:
- it is as representative as possible of what the model will encounter in production
- you find out what the main use cases and potential edge cases are
Why is it an essential part of our requirement definition?
- Because it helps map real-life documents to end-user expectations.
- It helps focus on the right problems to solve, based on occurrence statistics.
- Because it helps identify main cases and edge cases. You can quickly see where the algorithm might struggle to find relevant information or where it will fail.
As this was my first time doing such an exhaustive and meticulous data exploration, I worked hand in hand with the Data science team to understand what we needed to look at in the exploration dataset. Together, we created a small checklist of things to look for, which we iterated on.
- Patterns identification
- [ ] main use cases – show examples
- [ ] edge cases – occurrences and examples
- Generic information on document configuration
- [ ] Image deformation distribution
- [ ] Number of lines
- Writing output
- [ ] for each example write user expected output
Before reviewing any data, we defined the size of the sample dataset to minimize the error margin – in data science everything is about statistics, and being thorough is key. We built a 400+ document exploration dataset, ensuring roughly a 5% error margin. And then it's time to open your eyes 👀, as you have to look at this dataset a couple of times. This is the most tedious part of the process; however, it is also the most important, as we will see.
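The ~5% figure can be sanity-checked with the standard margin-of-error formula for an estimated proportion. A sketch, assuming a 95% confidence level (z = 1.96) and the worst-case proportion p = 0.5:

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Worst-case margin of error for a proportion estimated from n samples."""
    return z * math.sqrt(p * (1 - p) / n)

# With ~400 documents the margin is just under 5 percentage points.
print(round(margin_of_error(400), 3))  # → 0.049
```

In other words, any frequency observed on a 400-document sample is within about ±5 points of its true value, which is precise enough to prioritise edge cases.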
- Patterns identification
The first time you look at those exploration documents, the idea is to understand which patterns you can identify. There are the common cases, which you encounter multiple times and which look simple to extract, such as the one below: a single-line table with a clear header and a standard layout.
And then there are all the edge cases you can see. I have described a couple of them below.
The approach is to identify, for each edge case, its occurrence: how frequently it repeats itself in the dataset. For those examples, you know you will have to be careful about how the model performs on those configurations.
- Generic information on document configuration
All of this was documented in a Notion database containing a list of the various cases identified within the dataset. Here is an extract below, with the example of the number of lines. It was one piece of information we were keen to understand: on average, how many lines are there in the invoices' tables?
For instance here, having a clear distribution gives an idea of what needs to be supported: if we support tables with up to 10 lines, we cover 98.2% of use cases. For a first iteration, there is no need to spend energy making sure we can extract tables with 50 lines. It is also critical to understand this early in the process, as this parameter can impact computation time.
Let's take a look at another example: image deformation. Sometimes documents are scanned, resulting in tilted, blurred or folded documents that can be tricky to read. From experience, we knew that tables would be difficult to extract from this kind of document.
What type of deformation do we need to support?
Exploring data gives us an idea of the types of deformation and their distribution within our sample dataset. That also gives direction for an MVP. Thanks to this analysis, we decided that no effort should be made to solve issues on severely tilted documents in the 1st iteration.
- Writing output
A last learning on this: to go the extra mile, for each example we put in, i.e. for each table, I wrote what the expected output would be from an end-user perspective. It is super effective for identifying new edge cases and making new problems emerge. Putting yourself in the customer's shoes is always a good idea 😉. It also gives a sense of what is acceptable from a user perspective and what is not, and it helps share this information with the Data science team for decision-making purposes.
In our requirements, it looks like this:
Let's be clear: data exploration is time-consuming and requires a lot of rigour. However, it has been key to the project's success and to improving our delivery.
- It saved us a lot of time by clearly stating which problems to focus on and which ones to ignore. On one hand, we had a clear vision of the main cases. On the other hand, for each edge case that we identified, we had a statistic on its occurrence. That helped us make data-driven decisions: thanks to those statistics, we skipped some image deformations and focused on solving the extraction of 10-line tables. We definitely gained a lot of time, effort and focus.
- We built new tooling such as a checklist of configurations to look at on documents that we can re-use in the future
- We organised a deep dive session with the team to share the learnings from this exploration. It helped them get on board with the requirements synthesis, which was dense, and gave them an overview of where we were heading.
In data science, we are obsessed with how we evaluate the performance of our models. This is somehow our north star, but we are also very careful, because if you look only at those metrics you can miss important limits of your model. Working on deep learning problems is open-ended: you can always improve (under certain conditions); it is not a binary works / does not work problem. It is therefore super important to define acceptance criteria for your model so that you know when to stop.
A bit of context: how do we evaluate model performance? Are you familiar with recall and precision? If not, I suggest you read this excellent post by Jonathan, our CEO. You can't look at one without looking at the other; however, for each feature we tend to focus on the one that is most significant to the use case. And the metric is tightly linked to the user experience.
- Pick a data science metric to focus on for model iteration
Let's take an example: let's say that I am the end user of a procurement software. I have downloaded the PO and the delivery note, and now I am downloading the corresponding invoice that contains the table below. A deep learning based OCR runs in this procurement software to extract key information from the invoice, so that I do not have to fill in all the information manually. But you know, sometimes algorithms make mistakes.
As a user, what type of errors do I prefer? What are the impacts on my end?
Recall vs precision: what does it mean for line items?
- Recall: do I prefer to correct wrong information because the model made a mistake? It means that the model would rather predict a line item even if it makes a mistake, because it is OK for me to make modifications. If you look at the table above, that means it would be OK if the model extracted the line "Billing period: 01/01/21 – 31/01/21" even though it is not a line item according to our definition (a line item is an item description combined with pricing, and there is no pricing associated with this description). In that case, recall would be the metric to look at.
- Precision: instead, do I prefer to manually input information, because I do not want to make those modifications? This means that the model would rather not predict when it is not "sure enough" of the result. So for instance here, focusing on precision could mean that the only real line item, "Monthly subscription […]", might be missed under certain conditions.
As a summary, we use metrics to help define the type of errors we’d rather accept vs the ones we want to prevent.
For line items extraction, we focused on maximizing our recall, as it seems easier to correct errors than to manually type all the information from a line.
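The trade-off above can be made concrete with a toy, exact-match evaluation. This is a simplified sketch (real line-item evaluation is more involved, with per-field matching); the example lines echo the "Billing period" case discussed earlier:

```python
def precision_recall(predicted, expected):
    """Exact-match precision and recall over sets of extracted line items."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall

# One real line item in the ground truth...
expected = {("Monthly subscription", 1, 59.0)}
# ...but the model also extracted a non-item line ("Billing period"):
# recall stays perfect, precision drops - the trade-off we accepted.
predicted = {("Monthly subscription", 1, 59.0), ("Billing period", None, None)}

p, r = precision_recall(predicted, expected)
print(p, r)  # → 0.5 1.0
```

A recall-first model accepts this kind of spurious extra line; a precision-first model would instead risk dropping the real "Monthly subscription" line.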
- Define your MVP
As our product did not really fit how we imagine a traditional MVP, we decided to define what an MVP looked like for us. Our definition of an MVP is a combination of 2 acceptance criteria:
- on one hand, it is quantitative and based on data science metrics. We defined our Minimum Viable Metrics: what is the minimum precision we need to reach to launch? What is acceptable from a user perspective?
- on the other hand, it is qualitative: are the main cases supported? Which edge cases are already supported? What types of errors is the model making? Are they acceptable?
By defining what the model must support for launch, both in terms of document configurations and performance metrics, we managed to break the project down into smaller iterations.
To do so, we identified:
- on one hand, priorities to solve the various configurations identified within the dataset
- on the other hand, we also split our 10 fields to extract into 3 different priorities based on our understanding from use cases. The idea was to be able to ship only with P1 fields if P2 and P3 were performing poorly.
- Manually analyse model errors and iterate
But working with metrics is not enough. You want to understand where the model fails to extract information: are there specific configurations performing badly? The only way is to look at the documents where the model performs poorly, in your validation set. ⚠️ The idea is not to fix those on a case-by-case basis, as this would introduce bias; rather, it gives trends on configurations that might be difficult for our models. Going back to your data exploration can give an idea of how frequently a configuration occurs and whether you should put time and effort into solving it.
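Looking for trends rather than individual failures amounts to grouping evaluation errors by the configurations identified during data exploration. A minimal sketch; the configuration tags and error records are hypothetical:

```python
from collections import Counter

# Hypothetical validation-set errors, each tagged with the document
# configuration identified during data exploration.
errors = [
    {"doc_id": 1, "config": "tilted_scan"},
    {"doc_id": 2, "config": "multi_page_table"},
    {"doc_id": 3, "config": "tilted_scan"},
    {"doc_id": 4, "config": "no_header_row"},
    {"doc_id": 5, "config": "tilted_scan"},
]

# Count errors per configuration: we look for trends, not case-by-case fixes.
by_config = Counter(e["config"] for e in errors)
print(by_config.most_common())  # most frequent failing configurations first
```

Cross-referencing the top failing configurations with their occurrence statistics from data exploration tells you whether a failure mode is worth the engineering effort.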
From what we learned during data exploration, scoping problems into priorities helped us build our own kind of MVP. It gave us a clear framework to use during delivery.
- Definition of the key metric to focus on to accelerate evaluation
- MVP Scoping: mapping each field to extract with P1, P2 or P3 priority
- MVP Scoping: for each configuration (edge case), set the urgency to solve it (high to low)
To make sure we shipped as fast as we could, we set regular touch points. The idea of those touch points was to analyse:
- the model performances (key metrics)
- the model errors
We kept on referring back to our MVP definition – and challenged what we initially wanted to support:
- we realised early that P3 fields (such as discount amount) would not be supported in our first iteration, because they were performing poorly. By having set those priorities, we did not spend any time trying to improve their performance; we stayed focused on P1 and P2 fields.
- we discovered new edge cases for which we did not have statistics. So we did another round of data exploration to make sure we had the right statistics to make informed decisions.
That gave us more velocity and a clear target to reach between each iteration.
At this stage, the gain of time is really based on previous steps. Because we had a clear understanding of what to solve first, what to focus on, we were able to make decisions quickly.
🎯 We reached our goal: it took us approximately 3 months to ship the first version of the line items extraction feature on our Invoice OCR API. From a deep learning perspective, extracting tables from invoices was a complex problem to crack; it is a real team success! 🎉 If you are curious to understand how we cracked it, you can read this article that describes our approach.
🔥 What worked well:
- spending time on data exploration gave us a deep understanding of what could go wrong, where we would struggle, where we should put our efforts and, conversely, where we should let go. It helped us make faster decisions.
- pushing an MVP approach by setting clear priorities for each field and each configuration to support helped us define when to continue iterating on our models and when it was OK to stop. We ultimately shipped only 7 of the 10 identified fields in the 1st version, really pushing to get a 1st version of the feature out as soon as possible.
📑 What we discovered:
- Data exploration is a never-ending story: we kept coming back to it for each new problem we encountered. It is a powerful way to put statistics in front of problems, and it enables data-driven decisions. We aim at systematically using data exploration output to drive decisions pragmatically.
Being a PM working on deep tech products coming from advanced deep learning research is very different from a SaaS product role. Deep learning based products imply longer development cycles and tailor-made development processes. Moreover, the model you have spent weeks developing might fail, and you might need to start again. As a PM, this requires building an approach different from existing product management methodologies.
As a summary, here are a couple of takeaways from what we experimented with to accelerate the delivery of our latest Invoice OCR API:
- Understanding the data is key. It is time-consuming, but the ROI is maximum. It saved us a lot of time and energy.
- If you want to communicate efficiently with the Data science team, you need to deeply understand the metrics and make sure that there are clear targets to reach.
- Having set clear acceptance criteria gave us a clear vision on where we should be heading. We combined quantitative criteria (data science metrics targets) with qualitative criteria (document configuration and edge cases to support) to help set our fastest path towards launch.
- Setting clear priorities on what to support helped us navigate deep learning uncertainties and reduce release scope when needed. We did not hesitate to narrow down our feature set for launch.
- When standard methodologies do not apply to your product, build your own: understand with your engineering team where the gaps are and work backwards. This is how we built our data exploration process and requirements.
As we are now close to launching this 1st iteration, we are looking forward to learning how we can improve what we have implemented so far and how we can iteratively deliver the next version of our models. We are looking for ideas to get even better. And you, how do you work on your AI products? Any tips to share with us in the comments?