Deploying machine learning to find information in documents has become a key initiative for many companies. As AI capabilities have advanced over the years, the variety of document types that can be automatically processed has increased, which has led to substantial cost efficiencies and time savings.
However, not every document type can be processed using the same methodology. Some documents are highly structured (forms), others are unstructured (lease contracts), some have handwriting (correspondence), others have signatures or notary stamps, etc. Furthermore, the goals of data extraction vary depending on use case, business process, and desired outcomes.
For example, an operations team may be highly focused on finding key customer, timeline, or requirements data within a sales contract, while the legal department would look at the same contract and want to pull certain legal terms and verify the presence of required signatures. Finding both sets of information in the same document requires several different data extraction techniques and the right intelligent capture strategy.
Let’s talk about 7 of the most common types of data extraction, how they work, and where they can be applied.
We’ll use a mortgage application package as an example throughout, as this process is frequently complicated due to its variety of document types. However, these 7 techniques can be used across any set of documents in any industry.
Static zone extraction is the most straightforward and commonly used technique. The system looks at a single page and pulls the data within a specified area of that page. Setup is relatively simple: a user defines a set of coordinates by drawing a zone on the page where the data can always be found. As you’d expect, this works best for highly structured forms, like a 1099 form that seldom changes its layout.
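To make the idea concrete, here is a minimal sketch of a static zone in Python. It assumes an OCR step has already produced words with page coordinates; the `(text, x, y)` tuples, the zone values, and the function name `extract_static_zone` are all made up for illustration and don't come from any particular capture product.

```python
def extract_static_zone(words, zone):
    """Return the text of every word whose position falls inside the zone.

    words: list of (text, x, y) tuples from a hypothetical OCR step.
    zone:  (x0, y0, x1, y1) rectangle drawn once by the user.
    """
    x0, y0, x1, y1 = zone
    hits = [(y, x, text) for text, x, y in words
            if x0 <= x <= x1 and y0 <= y <= y1]
    hits.sort()  # reading order: top to bottom, then left to right
    return " ".join(text for _, _, text in hits)


# Made-up OCR output for one region of a structured form.
page = [
    ("PAYER'S", 40, 50), ("name", 110, 50),
    ("Acme", 40, 70), ("Lending", 90, 70), ("LLC", 160, 70),
    ("1,250.00", 420, 70),
]

PAYER_ZONE = (30, 60, 300, 90)  # assumed coordinates for the payer name box
payer_name = extract_static_zone(page, PAYER_ZONE)
print(payer_name)
```

The zone is nothing more than a fixed rectangle, which is why this approach is easy to set up and also why it breaks as soon as the layout moves.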
Often, the coordinates defined for static zones shift slightly from one document to the next. Sometimes this is caused by a poorly captured file (blurry fax, skewed scanned image) or by a form getting slight updates over the years. The W-9 form falls into this category, as it has undergone a few revisions that have subtly modified its layout. Dynamic zones can account for these adjustments.
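One way a dynamic zone might work is to tie the zone to a printed label (an anchor) and move the zone by however far that anchor has drifted from its position on the template. The sketch below assumes that approach; real products may use other alignment methods, and the names `shift_zone` and `extract_dynamic_zone`, along with all coordinates, are hypothetical.

```python
def shift_zone(zone, template_anchor, found_anchor):
    """Move the zone by the same offset the anchor has moved."""
    dx = found_anchor[0] - template_anchor[0]
    dy = found_anchor[1] - template_anchor[1]
    x0, y0, x1, y1 = zone
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)


def extract_dynamic_zone(words, zone, anchor_text, template_anchor):
    """Locate the anchor word, shift the zone to match, then read the zone."""
    found = next(((x, y) for text, x, y in words if text == anchor_text), None)
    if found is not None:
        zone = shift_zone(zone, template_anchor, found)
    x0, y0, x1, y1 = zone
    hits = sorted((y, x, text) for text, x, y in words
                  if x0 <= x <= x1 and y0 <= y <= y1)
    return " ".join(text for _, _, text in hits)


# On the template, the label "Name" sits at (40, 100) and the value just below it.
TEMPLATE_ANCHOR = (40, 100)
NAME_ZONE = (40, 110, 300, 130)

# Made-up scan of the same form, shifted 12 units right and 18 units down.
skewed_scan = [
    ("Name", 52, 118),
    ("Jane", 52, 136), ("Doe", 95, 136),
    ("Business", 52, 160),
]

name = extract_dynamic_zone(skewed_scan, NAME_ZONE, "Name", TEMPLATE_ANCHOR)
print(name)
```

Reading the unshifted zone on this scan would pick up the label "Name" instead of the value, which is the kind of error the anchor offset is meant to avoid.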
Table extraction pulls data based on user-defined columns and rows. The system looks for data at the intersection of a specific column and row, even if the columns or rows switch order from time to time. Common use cases include closing disclosures, invoices, and monthly statements for mortgage servicing.
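The column-and-row intersection can be sketched by looking cells up by name instead of by position. This example assumes the table has already been recognized and parsed into rows of strings with a header row; the function `table_cell` and the sample statements are invented for illustration.

```python
def table_cell(rows, label_column, row_label, value_column):
    """Return the cell where the named row meets the named column.

    rows: list of lists, first row is the header. Lookups go through the
    header text, so column order and row order do not matter.
    """
    header, *body = rows
    label_idx = header.index(label_column)
    value_idx = header.index(value_column)
    for row in body:
        if row[label_idx] == row_label:
            return row[value_idx]
    return None


# Two made-up monthly statements with the same data in a different order.
statement_a = [
    ["Description", "Amount Due", "Due Date"],
    ["Principal", "850.00", "06/01"],
    ["Escrow", "310.25", "06/01"],
]
statement_b = [
    ["Due Date", "Description", "Amount Due"],
    ["06/01", "Escrow", "310.25"],
    ["06/01", "Principal", "850.00"],
]

for statement in (statement_a, statement_b):
    print(table_cell(statement, "Description", "Escrow", "Amount Due"))
```

Both statements return the same escrow amount even though their columns and rows are arranged differently.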
Key-Value Pairs (KVP)
Moving up a level of sophistication, the KVP technique allows a system to look for a specific keyword (“Key”) and then find the data (“Value”) associated with that keyword. For example, a Key would be the form field “First Name” and the required Value would be “William,” or a Key “Date of Birth” and the Value “07/24/76.”
Good machine learning systems can take a first pass at identifying potential KVPs on a page without the user having to specify all the relationships up front, and they can support variations of the Key (e.g., Date of Birth, DOB, D.O.B.). The use of KVP is common on forms with various layouts, like a loan estimate or HUD-1.
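Here is a small rule-based sketch of KVP extraction with Key variations, using regular expressions from Python's standard library. It is a stand-in for the learned behavior described above, not a machine learning model, and the alias table `KEY_ALIASES` and function `extract_kvp` are hypothetical.

```python
import re

# Assumed alias table: each output field lists the Key spellings to accept.
KEY_ALIASES = {
    "first_name": ["First Name"],
    "date_of_birth": ["Date of Birth", "DOB", "D.O.B."],
}


def extract_kvp(text, aliases=KEY_ALIASES):
    """Find each Key (in any listed spelling) and return the token after it."""
    results = {}
    for field, keys in aliases.items():
        # Try longer spellings first so "Date of Birth" wins over "DOB".
        ordered = sorted(keys, key=len, reverse=True)
        alternatives = "|".join(re.escape(k) for k in ordered)
        pattern = re.compile(rf"(?:{alternatives})\s*[:\-]?\s*(\S+)", re.IGNORECASE)
        match = pattern.search(text)
        if match:
            results[field] = match.group(1)
    return results


form_text = "First Name: William\nD.O.B.: 07/24/76"
print(extract_kvp(form_text))
```

This version only captures a single token as the Value, so a multi-word Value (a full street address, for instance) would need a wider capture rule.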
Navigational Key-Value Pairs
Sometimes the Key and Value aren’t side-by-side on a page but instead are somewhat removed from each other. This typically happens in less structured documents like a mortgage note, where the specified interest rate (“Value”) is usually stated within a paragraph under the section header Interest (“Key”). A good extraction system can identify the Key and then navigate to the Value, even if it’s buried in several sentences of text.
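A rough sketch of that navigation step: find the section header first, then search only the text that follows it for something shaped like the Value. The sample note text, the 400-character window, and the function `extract_after_key` are assumptions made for this example.

```python
import re


def extract_after_key(text, key_pattern, value_pattern, window=400):
    """Find the Key, then return the first Value-shaped match that follows it."""
    key = re.search(key_pattern, text, re.MULTILINE)
    if key is None:
        return None
    # Only look at a limited stretch of text after the Key.
    region = text[key.end():key.end() + window]
    value = re.search(value_pattern, region)
    return value.group(0) if value else None


# Made-up excerpt in the style of a mortgage note.
note = (
    "1. BORROWER'S PROMISE TO PAY\n"
    "In return for a loan that I have received, I promise to pay U.S. $250,000.00.\n"
    "2. INTEREST\n"
    "Interest will be charged on unpaid principal until the full amount has been paid. "
    "I will pay interest at a yearly rate of 6.125%.\n"
)

rate = extract_after_key(
    note,
    key_pattern=r"^\d+\.\s+INTEREST\s*$",   # numbered section header (the Key)
    value_pattern=r"\d+(?:\.\d+)?\s?%",     # a percentage (the Value)
)
print(rate)
```

Limiting the search to a window after the header keeps the system from picking up an unrelated percentage elsewhere in the document.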
An even more sophisticated technique than KVP is entity extraction. This method is generally applied to unstructured documents where a user can’t readily see patterns in the layout – a first payment letter, deed of trust, vesting deed, or legal property description. Entity extraction requires the use of predefined libraries of data that a system can reference to infer if it has found the required entities. For example, the system can look for all the proper names, or addresses, or social security numbers in a document – as long as the entity is defined by its characters or format.
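As a simplified illustration, entities that are defined by their format can be matched with patterns, and entities that come from a reference library can be matched by lookup. The patterns, the tiny `STATE_LIBRARY`, and the sample sentence below are invented for this sketch; a production system would rely on much larger libraries or trained models.

```python
import re

# Entities defined by their character format.
ENTITY_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "date": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "zip_code": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}

# Entities defined by membership in a predefined library (tiny sample here).
STATE_LIBRARY = {"Ohio", "Texas", "Florida"}


def extract_entities(text):
    """Return every entity found in the text, grouped by entity type."""
    found = {label: pattern.findall(text)
             for label, pattern in ENTITY_PATTERNS.items()}
    found["state"] = [word for word in re.findall(r"[A-Za-z]+", text)
                      if word in STATE_LIBRARY]
    return found


deed = ("This Deed of Trust is made on 03/15/2021 by Maria Lopez, "
        "SSN 123-45-6789, for property located in Columbus, Ohio 43215.")
entities = extract_entities(deed)
print(entities)
```

Notice that nothing here depends on where the entity sits on the page, which is what makes the approach usable on documents with no consistent layout.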
The highest form of extraction is contextual entity extraction. It not only finds the required entity, but it can also infer something about that entity based on the surrounding context. Perhaps you want to find the borrower’s name in a title policy or rider. A good extraction system would find all the proper names in the document and then present the name that’s closest to the defined term “Borrower” in the text. Generally, machine learning systems can review samples of documents and build their own rules about how to find the required entity and compare the extracted results with an authoritative source such as a system of record.
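The "closest name to the defined term" idea can be sketched with a simple character-distance rule. This is a hand-written heuristic, not the learned rules a machine learning system would build, and the crude two-capitalized-words `NAME_PATTERN`, the function `closest_name_to_term`, and the sample rider text are all assumptions.

```python
import re

# Crude stand-in for a name recognizer: two capitalized words in a row.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")


def closest_name_to_term(text, term):
    """Return the candidate name that sits nearest to the defined term."""
    term_match = re.search(re.escape(term), text)
    candidates = list(NAME_PATTERN.finditer(text))
    if term_match is None or not candidates:
        return None

    def distance(candidate):
        # Gap in characters between the candidate and the term.
        if candidate.end() <= term_match.start():
            return term_match.start() - candidate.end()
        return max(0, candidate.start() - term_match.end())

    return min(candidates, key=distance).group(0)


rider = ('This rider is made by Dana Whitfield ("Borrower") in favor of '
         'First Harbor Bank, whose officer is Alan Pruitt.')
borrower = closest_name_to_term(rider, "Borrower")
print(borrower)
```

The naive pattern also treats "First Harbor" as a candidate name, and it is the surrounding context (distance to "Borrower") that lets the right one win.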
The seven techniques mentioned here aren’t exhaustive, and new methods arise continuously. I imagine this blog will look very different in a few years! However, I hope this gives you a good perspective about the various methods used to pull data out of documents – it can be as much art as it is science for different use cases.
A final note – good intelligent capture platforms can deploy all these extraction techniques in a straightforward manner without the use of tons of complicated rules or custom code. After all, the goal of data extraction is to save time and money, which is difficult to achieve if your system is expensive to implement and maintain! Free your organization of the tedious, time-wasting tasks with a solid intelligent capture platform.