Extraction Quality Varies - How Our Models Improve Over Time

Our document extraction engine is designed to automatically extract structured financial data such as invoice numbers, dates, totals, and other key fields from a wide variety of unstructured financial documents. It uses a combination of AI-driven models and rule-based logic to deliver high-quality results across many document types.

The system continues to improve, but it cannot guarantee perfect accuracy in every case, because both machine learning and heuristic approaches have inherent limitations.

This guide explains how the engine works, outlines the three core models involved, and describes how we evaluate updates to maintain quality and consistency.

System Architecture: The Three Models

The engine consists of three primary components, each playing a distinct role in the extraction process:

Templating Model, which learns from user corrections
Neural Network Model, which generalizes across document types
Heuristic Model, which uses structured rules and keyword matching

Each model has its own update process and evaluation criteria, as described below.

1. Templating Model - Learns from User Corrections

The templating model adapts in response to feedback submitted through the MyPaperFlow application. When a user corrects a value in a document, the system captures that correction. When it later encounters similar documents, the model applies what it learned from the correction, improving accuracy on recurring formats.

Characteristics:

Update speed: Fast; updates take effect between consecutive documents
Primary benefit: Highly responsive to user input
Ideal use case: Repeating document structures and layouts
Limitation: Dependent on the quality of submitted corrections

Important Consideration:
Because the templating model learns directly from user input, it is essential that users submit only accurate corrections. Mistaken inputs can inadvertently teach the model incorrect behavior, which may then affect similar documents going forward. Careful and accurate feedback is key to ensuring continued model improvement.
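To make the correction loop concrete, here is a minimal sketch in Python. It is not the engine's actual implementation: the TemplateStore class, the layout key, and the idea of storing "which label holds this field" are all assumptions made for illustration. It shows how one correction can change the result for the next document with the same layout, and why an incorrect correction would carry over in the same way.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TemplateStore:
    """Hypothetical store of user corrections (illustrative only).

    Each rule maps (layout_key, field_name) to the label on the page
    where the user said the correct value is found.
    """

    rules: dict[tuple[str, str], str] = field(default_factory=dict)

    def record_correction(self, layout_key: str, field_name: str, source_label: str) -> None:
        # A later correction for the same layout and field replaces the earlier one.
        self.rules[(layout_key, field_name)] = source_label

    def apply(
        self,
        layout_key: str,
        labeled_text: dict[str, str],
        extracted: dict[str, str],
    ) -> dict[str, str]:
        # Start from the baseline extraction and override only the fields
        # that have a stored rule for this layout.
        result = dict(extracted)
        for (key, field_name), source_label in self.rules.items():
            if key == layout_key and source_label in labeled_text:
                result[field_name] = labeled_text[source_label]
        return result


store = TemplateStore()

# A user corrects the invoice number on one document: the right value
# sits under "Ref No", not under "Order". (All names are made up.)
store.record_correction("acme-invoice-v1", "invoice_number", "Ref No")

# The next document with the same layout benefits immediately.
next_doc = {"Ref No": "A-1002", "Order": "PO-77"}
baseline = {"invoice_number": "PO-77"}
corrected = store.apply("acme-invoice-v1", next_doc, baseline)
print(corrected)
```

In this sketch, a mistaken correction (for example, pointing invoice_number at "Order") would be stored and replayed just as faithfully, which is why accurate feedback matters.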

2. Neural Network Model - Generalized AI Extraction

The neural network model forms the backbone of our extraction system and is responsible for the majority of field extractions. It has been trained on a large dataset covering a wide range of document types, formats, and languages. This model offers broad generalization and robustness.

Characteristics:

Update frequency: Periodic, typically once every few months
Update requirements: Deployed only if it improves overall performance across a diverse test set
Evaluation scope: Includes thousands of documents across languages, layouts, and categories
Limitation: Slow to update; requires extensive validation to avoid regressions

Even if a model update resolves a specific issue, we may choose not to deploy it if it degrades performance elsewhere. Maintaining consistent quality across all documents is our priority.
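The deployment rule described above can be sketched as a simple gate. This is an illustration under stated assumptions, not our actual evaluation code: the should_deploy function, the per-category accuracy scores, the category names, and the max_regression tolerance are all hypothetical. The gate accepts a candidate model only if its average score improves and no individual category gets worse by more than the tolerance.

```python
from __future__ import annotations


def should_deploy(
    current: dict[str, float],
    candidate: dict[str, float],
    max_regression: float = 0.0,
) -> bool:
    """Hypothetical deployment gate (illustrative only).

    Both arguments map a test-set category to an accuracy score.
    """
    if set(current) != set(candidate):
        raise ValueError("both models must be scored on the same categories")

    mean_current = sum(current.values()) / len(current)
    mean_candidate = sum(candidate.values()) / len(candidate)

    # Requirement 1: overall performance must improve.
    if mean_candidate <= mean_current:
        return False

    # Requirement 2: no category may drop by more than the tolerance.
    return all(current[c] - candidate[c] <= max_regression for c in current)


# Made-up scores for three made-up categories.
current = {"invoices_en": 0.94, "invoices_de": 0.91, "receipts": 0.88}

# Better on average, but receipts get worse: rejected.
fixes_one_issue = {"invoices_en": 0.99, "invoices_de": 0.95, "receipts": 0.85}

# Small gains everywhere: accepted.
broad_gain = {"invoices_en": 0.95, "invoices_de": 0.92, "receipts": 0.89}

print(should_deploy(current, fixes_one_issue))
print(should_deploy(current, broad_gain))
```

The first candidate is the case described above: it resolves a specific issue but degrades performance elsewhere, so a gate like this would hold it back.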
