Data Extraction Overview

Data Extraction is the component of the Document Understanding Framework that identifies the specific pieces of information you are interested in within each of your document types.

The information that Data Extraction can target is defined in the project Taxonomy, as the list of fields defined for each document type. A field that does not appear in your project's taxonomy cannot be configured for automatic data extraction.

The Data Extraction step of the Document Understanding Framework ensures that the configured extractors are called in the right order, for the right list of fields, for the right page range of the file being processed. This means that if, in the same file, there are two or more document types identified (for different page ranges), it is recommended that the Data Extraction step is executed multiple times, once for each classification result. Executing data extraction for one classification result with a certain page range will ensure data is targeted for extraction only from those pages and only for that document type.
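The idea of running extraction once per classification result can be pictured with a minimal Python sketch. It is a conceptual model only and not the UiPath API: ClassificationResult, extract_for, the 0-based page indexing, and the sample data are all hypothetical names and assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassificationResult:
    """Hypothetical stand-in for one classification result."""
    document_type: str
    first_page: int  # inclusive, 0-based (an assumption of this sketch)
    last_page: int   # inclusive, 0-based


def extract_for(pages, result, taxonomy):
    """Run one mock extraction pass for a single classification result.

    Only pages inside the result's page range are examined, and only the
    fields that the taxonomy defines for the result's document type.
    """
    page_slice = pages[result.first_page:result.last_page + 1]
    fields = taxonomy.get(result.document_type, [])
    extracted = {}
    for field_name in fields:
        # Mock "extractor": take the first value found on the targeted pages.
        for page in page_slice:
            if field_name in page:
                extracted[field_name] = page[field_name]
                break
    return extracted


# Assumed sample data: a 3-page file with an invoice (pages 0-1) and a receipt (page 2).
taxonomy = {"Invoice": ["invoice_no", "total"], "Receipt": ["total"]}
pages = [
    {"invoice_no": "INV-1"},
    {"total": "100.00"},
    {"total": "12.50"},
]
results = [
    ClassificationResult("Invoice", 0, 1),
    ClassificationResult("Receipt", 2, 2),
]

# One extraction pass per classification result.
per_document = [extract_for(pages, r, taxonomy) for r in results]
```

Because each pass sees only its own page range, the receipt total on the last page cannot leak into the invoice result, and the invoice fields are never requested for the receipt.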

How to Use the Data Extraction Component

Data Extraction is performed through the Data Extraction Scope activity. The scope activity configures and executes one or more data extraction algorithms (extractors) and offers a single, unified place to configure all of them.

In short, this is what the Data Extraction Scope does:

Provides all extractors (extraction algorithms) the necessary configurations and inputs for them to run.
Accepts one or more extractors.
Allows for field level activation, taxonomy mapping, and minimum confidence threshold settings at extractor level.
Reports extracted data in a unified manner, irrespective of the extractor that reported that particular data.

You configure the Data Extraction Scope through the Configure Extractors wizard. You can customize:

which fields are requested from each extractor,
what is the minimum confidence threshold for a given data point extracted by each extractor,
what is the taxonomy mapping, at field level, between the project taxonomy and the extractor's internal taxonomy (if any).
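The three settings above can be modeled, purely for illustration, as a small per-extractor configuration record. This is a hedged sketch and not how the wizard stores its settings: ExtractorConfig, its attribute names, the extractor names, and the use of a 0.0 to 1.0 confidence scale are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractorConfig:
    """Hypothetical per-extractor settings mirroring the three wizard options."""
    name: str
    min_confidence: float   # assumed scale: 0.0 to 1.0
    requested_fields: set   # project taxonomy fields requested from this extractor
    # Project taxonomy field -> extractor-internal field name (if any).
    taxonomy_mapping: dict = field(default_factory=dict)

    def internal_name(self, project_field):
        # Fall back to the project field name when no mapping is defined.
        return self.taxonomy_mapping.get(project_field, project_field)


# Assumed example: an ML-style extractor with its own internal taxonomy,
# and a regex-style extractor that is only asked for one field.
ml = ExtractorConfig(
    name="ml_extractor",
    min_confidence=0.80,
    requested_fields={"invoice_no", "total"},
    taxonomy_mapping={"invoice_no": "invoice-id"},
)
regex = ExtractorConfig(
    name="regex_extractor",
    min_confidence=0.0,
    requested_fields={"invoice_no"},
)
```

Keeping the mapping separate from the set of requested fields reflects the "(if any)" in the list above: an extractor without an internal taxonomy simply uses the project field names.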

You can mix and match extractors in a hybrid approach, requesting some fields from one extractor and other fields from a different extractor.

You can even implement fall-back rules for data extraction: if a certain extractor does not report an acceptable value for a given field, a back-up extractor is called for that field.

The order of the extractors in the Data Extraction Scope matters:

extractors are executed in priority order, from left to right;
an extracted value for a field is accepted only if it has a confidence equal to or above the minimum confidence threshold set for that extractor;
an extractor is executed only for the provided classification page range, and only for the fields that are requested of it according to the Data Extraction Scope configuration and that have not already received an acceptable result from a previous extractor.

Important: If the Data Extraction Scope does not request any field from a given extractor, then that extractor is not executed. This can happen when an extractor is not configured for the incoming document type, or when an extractor is used as a fall-back and the previous extractors have already reported all expected data.
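The ordering, threshold, and fall-back rules above can be summarized in a short Python sketch. It is a simplified model under stated assumptions, not the activity's actual implementation: run_extraction_scope, the three mock extractors, and the confidence values are hypothetical, and page ranges are left out for brevity.

```python
def run_extraction_scope(extractors, requested, min_confidence):
    """Simplified model of the scope's execution rules.

    extractors:     ordered list of (name, callable); the callable takes a list
                    of field names and returns {field: (value, confidence)}.
    requested:      {extractor name: [fields requested from it]}.
    min_confidence: {extractor name: threshold}.
    """
    accepted = {}   # field -> (value, name of the extractor that supplied it)
    executed = []   # names of extractors that were actually called
    for name, extract in extractors:  # priority order, "left to right"
        pending = [f for f in requested.get(name, []) if f not in accepted]
        if not pending:
            continue  # no field requested: the extractor is not executed
        executed.append(name)
        for field_name, (value, confidence) in extract(pending).items():
            # Accept only requested values at or above this extractor's threshold.
            if field_name in pending and confidence >= min_confidence[name]:
                accepted[field_name] = (value, name)
    return accepted, executed


# Hypothetical mock extractors with fixed results.
def primary(fields):
    data = {"invoice_no": ("INV-1", 0.95), "total": ("100.00", 0.40)}
    return {f: data[f] for f in fields if f in data}


def backup(fields):
    data = {"invoice_no": ("INV-l", 0.60), "total": ("100.00", 0.90)}
    return {f: data[f] for f in fields if f in data}


def last_resort(fields):
    return {}


extractors = [("primary", primary), ("backup", backup), ("last_resort", last_resort)]
requested = {
    "primary": ["invoice_no", "total"],
    "backup": ["invoice_no", "total"],
    "last_resort": ["total"],
}
min_confidence = {"primary": 0.8, "backup": 0.7, "last_resort": 0.5}

accepted, executed = run_extraction_scope(extractors, requested, min_confidence)
```

In this run the primary extractor supplies invoice_no, its low-confidence total is rejected, the backup extractor is asked only for total, and last_resort is never called because nothing is left to request from it.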

Available Extractors

Based on the requirements of the use case, you can choose from several data extraction algorithms, called extractors.

You can use any extractor that is available in the UiPath.IntelligentOCR.Activities package, in other UiPath packages (UiPath.DocumentUnderstanding.ML.Activities), or in third-party packages (UiPath.Abbyy.Activities).

The available Extractors are:

Regex Based Extractor
Form Extractor
Intelligent Form Extractor
Machine Learning Extractor
FlexiCapture Extractor

You can always build your own extractor by using the public Document Processing Contracts, which lets you implement any algorithm that fits your use case.