Continue Training your API while Using It: Annotations


In recent posts, we’ve walked through how to create a new document extraction API using the Document Builder. Once you have trained your API with the first training set (currently 20 documents), you can begin using it to extract data from your documents.

Every additional set of documents you train the API with will further improve your results, but at some point, you need to get the model into production.  What if you could actually get your model in production AND have your users work to better train your model? 

You can! We call the process of feeding corrected extraction results back into the model annotation. In this post, we’ll annotate the US W-9 tax form model described previously.

 

 

Testing your Document API

 

When your users test a document with your custom API, they make an API call similar to this one (in this case made with cURL):

 

curl -X POST 'https://api.mindee.net/v1/products/doug1/us_w9/v1/predict' \
  -H 'Authorization: Token {apitoken}' \
  -H 'content-type: multipart/form-data' \
  -F 'document=@harry_potter.pdf'

 

This uploads Harry Potter’s W-9 to the API, and we quickly get a response back.

NOTE: For space reasons, this is a partial response, showing just a few of the predictions.

 

{
  "city": {
    "confidence": 0.7,
    "page_id": 0,
    "values": [
      {
        "confidence": 0.62,
        "content": ",",
        "polygon": [
          [0.204, 0.347],[0.207, 0.347],[0.207, 0.362],[0.204, 0.362]
        ]

      },
      {
        "confidence": 0.79,
        "content": "CA",
        "polygon": [
          [0.212, 0.347],[0.23, 0.347],[0.23, 0.362],[0.212, 0.362]
        ]
      }
    ]

  },
  "name": {
    "confidence": 0.98,
    "page_id": 0,
    "values": [
      {
        "confidence": 1.0,
        "content": "Harry",
        "polygon": [
          [0.1, 0.12],[0.133, 0.12],[0.133, 0.133],[0.1, 0.133]
        ]

      },
      {
        "confidence": 0.96,
        "content": "Potter",
        "polygon": [
          [0.14, 0.12],[0.18, 0.12],[0.18, 0.132],[0.14, 0.132]
        ]
      }
    ]
  }
}

 

Looking at the prediction, the extracted city name, “, CA”, is not correct (as Harry Potter fans know, it should read “Little Whinging”). In the snippet above, the extracted name is correct (and in the full response, the other fields were also correct). You can also see that the algorithm’s confidence is quite low for the city name (0.7) and very high for “Harry Potter” (0.98). Generally, a confidence above 0.9 means that the algorithm has found the correct value.
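Programmatically, reconstructing each field’s text from the prediction is just a matter of joining the content of the entries in its values array. A minimal Python sketch over an abridged copy of the response above (the dict is trimmed to the fields shown; a real response has more):

```python
import json

# Abridged prediction response, trimmed from the example above
response = json.loads("""
{
  "city": {"confidence": 0.7,
           "values": [{"confidence": 0.62, "content": ","},
                      {"confidence": 0.79, "content": "CA"}]},
  "name": {"confidence": 0.98,
           "values": [{"confidence": 1.0, "content": "Harry"},
                      {"confidence": 0.96, "content": "Potter"}]}
}
""")

def field_text(prediction: dict) -> str:
    """Join the content of every extracted value into one string."""
    return " ".join(v["content"] for v in prediction["values"])

print(field_text(response["city"]))  # ", CA"
print(field_text(response["name"]))  # "Harry Potter"
```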

 

When a prediction has a low confidence level, it is worthwhile to have a human check the file to ensure that the values were extracted correctly before an error makes its way into your database. We can use that same intervention to better train the model as well, meaning fewer poor predictions in the future.
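One way to wire that check in is a simple confidence gate: fields below a threshold get routed to a human review queue instead of straight into the database. A sketch, using the 0.9 rule of thumb from above (the threshold and field dicts are illustrative, not part of the Mindee API):

```python
REVIEW_THRESHOLD = 0.9  # rule of thumb: confidence above 0.9 is usually correct

def needs_review(prediction: dict, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Flag a field for human review when its confidence falls below threshold."""
    return prediction["confidence"] < threshold

# Field-level confidences from the example response
fields = {
    "city": {"confidence": 0.7},
    "name": {"confidence": 0.98},
}

to_review = [name for name, pred in fields.items() if needs_review(pred)]
print(to_review)  # ['city']
```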


 

Alternate values for the prediction

 

While manually fixing the entries will keep your database accurate, retraining the model requires other candidate values that can be sent back to the API. You can request these by adding the annotations=true parameter to the prediction API call URL:

 

curl -X POST 'https://api.mindee.net/v1/products/doug1/us_w9/v1/predict?annotations=true' \
  -H 'Authorization: Token {apitoken}' \
  -H 'content-type: multipart/form-data' \
  -F 'document=@harry_potter.pdf'

 

This tells the API to provide alternative candidates for each prediction (note, the API response is going to get a LOT bigger):

 

The first thing you’ll notice is that each prediction now has a ‘candidate_key’:

 

{
  "city": {
    "confidence": 0.7,
    "values": [
      {
        "candidate_key": "0151bc9f",
        "confidence": 0.62,
        "content": ",",
        "polygon": [
          [0.204, 0.347],[0.207, 0.347],[0.207, 0.362],[0.204, 0.362]
        ]
      },
      {
        "candidate_key": "8d26e17e",
        "confidence": 0.79,
        "content": "CA",
        "polygon": [
          [0.212, 0.347],[0.23, 0.347],[0.23, 0.362],[0.212, 0.362]
        ]
      }
    ]
  },
  "name": {
    "confidence": 0.98,
    "values": [
      {
        "candidate_key": "e02bb231",
        "confidence": 1.0,
        "content": "Harry",
        "polygon": [
          [0.1, 0.12],[0.133, 0.12],[0.133, 0.133],[0.1, 0.133]
        ]
      }
    ]
  }
}

 

If the API has correctly identified the value, we just return the predicted candidate keys (like for Harry’s first name).
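Collecting those keys in code is just reading candidate_key off each predicted value. A sketch over an abridged copy of the annotated response above:

```python
# Abridged annotated response: each predicted value carries a candidate_key
response = {
    "name": {"values": [{"candidate_key": "e02bb231", "content": "Harry"}]},
    "city": {"values": [{"candidate_key": "0151bc9f", "content": ","},
                        {"candidate_key": "8d26e17e", "content": "CA"}]},
}

def predicted_keys(prediction: dict) -> list:
    """Collect the candidate keys the model itself selected for a field."""
    return [v["candidate_key"] for v in prediction["values"]]

print(predicted_keys(response["name"]))  # ['e02bb231']
```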

 

However, we do not want to use the value “, CA” for the city label; we want “Little Whinging” instead, so we need to find the candidate keys for the strings “Little” and “Whinging”. In the API response with annotations, there is a new set of assets, “OCR -> Candidates”, available for each page and each element that is detected. This JSON contains all of the OCR candidates for each label in the document. For the city prediction alone, this JSON is 22,552 lines long, because it contains every string in the W-9 document that matches the “city” requirements (a string with no digits).
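Scanning 22,000-odd lines by hand is impractical; in practice you search the candidate list for the text you want. A minimal sketch, assuming the candidates have been parsed into a Python list of dicts with content and key fields, as in the response:

```python
# A tiny slice of the OCR candidate list for the city label
candidates = [
    {"content": "Little", "key": "7832ef34"},
    {"content": "Whinging", "key": "6495fd94"},
    {"content": "CA", "key": "8d26e17e"},
    # ...thousands more entries in the real response
]

def find_keys(candidates: list, *wanted: str) -> list:
    """Return the candidate key for each wanted string, in order."""
    by_content = {c["content"]: c["key"] for c in candidates}
    return [by_content[w] for w in wanted]

print(find_keys(candidates, "Little", "Whinging"))  # ['7832ef34', '6495fd94']
```

Note that in a real document the same string can appear in several places; there you would also compare the polygon coordinates to pick the right occurrence, rather than matching on content alone.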

 

In our case, we know what we’re looking for:

[
  {
    "content": "Little",
    "key": "7832ef34",
    "polygon": [
      [0.098, 0.347],[0.13, 0.347],[0.13, 0.362],[0.098, 0.362]
    ]
  },
  {
    "content": "Whinging",
    "key": "6495fd94",
    "polygon": [
      [0.136, 0.347],[0.202, 0.347],[0.202, 0.362],[0.136, 0.362]
    ]
  }
]

 

 

To send this annotation back to Mindee for training, we POST to the annotation endpoint of the API:

 

curl -X POST 'https://api.mindee.net/v1/products/doug1/us_w9/documents/<document_id>/annotations'

 

The document ID is provided in the JSON response along with the annotations. Since all of the other labels were correct, I reused the candidate keys from the prediction API, only modifying those for city:

 

curl -X POST 'https://api.mindee.net/v1/products/doug1/us_w9/documents/<document_id>/annotations' \
  -H 'Authorization: Token {apitoken}' \
  -H 'content-type: application/json' \
  -d '{
  "labels": [
    {
      "page_id": 0,
      "feature": "name",
      "selected": [
        "e02bb231",
        "89202491"
      ]
    },
    {
      "page_id": 0,
      "feature": "city",
      "selected": [
        "7832ef34",
        "6495fd94"
      ]
    },
    {
      "page_id": 0,
      "feature": "ssn",
      "selected": [
        "7dc15c77",
        "c48edb4a",
        "cd75f470",
        "d31e5e04",
        "dd411217",
        "8bebe180"
      ]
    },
    {
      "page_id": 0,
      "feature": "state",
      "selected": [
        "8d26e17e"
      ]
    },
    {
      "page_id": 0,
      "feature": "street_address",
      "selected": [
        "7ea04e96",
        "01abf396",
        "448cd325"
      ]
    },
    {
      "page_id": 0,
      "feature": "street_address",
      "selected": [
        "48804de1"
      ]
    }
  ]
}'

 

This results in a 200 response, indicating that the annotations were added to the next training cycle for the API.
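Building that payload in code amounts to starting from the model’s own candidate keys and swapping in the corrected ones for any field a reviewer fixed. A minimal sketch, abridged to two fields and using the example keys from above:

```python
import json

# Candidate keys the model predicted, per feature (abridged to two fields)
predicted = {
    "name": ["e02bb231", "89202491"],
    "city": ["0151bc9f", "8d26e17e"],   # wrong: ", CA"
}

# Reviewer corrections: feature -> replacement candidate keys
corrections = {
    "city": ["7832ef34", "6495fd94"],   # "Little Whinging"
}

# Corrected features take the reviewer's keys; others keep the model's
labels = [
    {"page_id": 0, "feature": feature, "selected": corrections.get(feature, keys)}
    for feature, keys in predicted.items()
]

payload = json.dumps({"labels": labels})
```

The resulting payload string is what goes in the body of the POST to the annotations endpoint.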

 

Fixing an annotation error

 

Should you detect an error in the initial POST, you can resend the data with a PUT request to overwrite what was uploaded initially. Should you wish to remove the document from training, send a DELETE request to the same API endpoint, and all label annotations for the document will be removed.


 

Conclusion

 

With this simple interaction, we were able to make document predictions with our API AND feed the training information back to the algorithm, ensuring that the API’s predictions will continue to improve. In this way, you can add up to 1,000 annotations to your training dataset, making your model more robust as it learns from your users’ feedback.