Data Extraction is the component of the Document Understanding Framework that identifies the specific pieces of information you are interested in within your documents.
The information that can be targeted for data extraction is defined in the project taxonomy, as the list of fields of each document type. A field that does not appear in your project's taxonomy cannot be configured for automatic data extraction.
The Data Extraction step of the Document Understanding Framework ensures that the configured extractors are called in the right order, for the right list of fields, and for the right page range of the file being processed. This means that if two or more document types are identified in the same file (for different page ranges), you should run the Data Extraction step multiple times, once for each classification result. Running data extraction for one classification result with a given page range ensures that data is extracted only from those pages and only for that document type.
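As a rough illustration of running extraction once per classification result, the following Python sketch plans one extraction run per result. It is not UiPath code: `ClassificationResult`, `plan_extraction_runs`, and the sample taxonomy are hypothetical names chosen only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClassificationResult:
    """Hypothetical stand-in for one classification result."""
    document_type: str
    first_page: int  # inclusive
    last_page: int   # inclusive


def plan_extraction_runs(
    results: List[ClassificationResult],
    taxonomy: Dict[str, List[str]],
) -> List[dict]:
    """Plan one extraction run per classification result.

    Each run targets only the fields the taxonomy defines for that
    document type, and only the pages of that classification result.
    """
    runs = []
    for result in results:
        runs.append({
            "document_type": result.document_type,
            "pages": (result.first_page, result.last_page),
            # Fields missing from the taxonomy can never be requested.
            "fields": list(taxonomy.get(result.document_type, [])),
        })
    return runs


# Illustrative data: one file containing an invoice followed by a receipt.
sample_taxonomy = {
    "Invoice": ["invoice_number", "total"],
    "Receipt": ["merchant", "amount"],
}
sample_results = [
    ClassificationResult("Invoice", 1, 3),
    ClassificationResult("Receipt", 4, 4),
]
planned_runs = plan_extraction_runs(sample_results, sample_taxonomy)
```

Here the file yields two planned runs: one for the invoice fields on pages 1 to 3, and one for the receipt fields on page 4.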
Data extraction is performed through the Data Extraction Scope activity. To extract data from documents, you can use one or more extractors: the scope activity configures and executes one or more data extraction algorithms and gives you a single, unified place to configure all of them.
In short, this is what the Data Extraction Scope does:
You configure the Data Extraction Scope by using the Configure Extractors wizard. You can customize
You can mix and match extractors in a hybrid approach, in which some fields are requested from one extractor while other fields are requested from a different extractor.
You can even implement fall-back rules for data extraction: if a certain Extractor does not report an acceptable value for a given field, then call a back-up extractor.
The order of the extractors in the Data Extraction Scope matters:
an extractor is executed only for the provided classification page range, and only for the fields that are requested from it in the Data Extraction Scope configuration and that have not already received an acceptable result from a previous extractor.
Important: If the Data Extraction Scope does not request any field from a given extractor, then that extractor is not executed. This can happen when an extractor is not configured for the incoming document type, or when an extractor is used as a fall-back and the previous extractors have already reported all the expected data.
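The ordering and fall-back behavior described above can be sketched in Python as follows. This is a minimal illustration and not the activity's actual implementation: `run_extractors`, the extractor callables, and the use of a per-field minimum confidence to decide whether a value is "acceptable" are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Tuple

PageRange = Tuple[int, int]
# Hypothetical extractor: receives the requested fields and the page range,
# and returns {field: (value, confidence)} for whatever it found.
Extractor = Callable[[List[str], PageRange], Dict[str, Tuple[str, float]]]


def run_extractors(
    configured: List[Tuple[str, Extractor, Dict[str, float]]],
    page_range: PageRange,
) -> Dict[str, Tuple[str, str]]:
    """Run extractors in their configured order.

    `configured` holds (name, extractor, {field: minimum confidence}).
    Returns {field: (value, name of the extractor that supplied it)}.
    """
    accepted: Dict[str, Tuple[str, str]] = {}
    for name, extract, requested in configured:
        # Request only fields without an acceptable result so far.
        pending = [field for field in requested if field not in accepted]
        if not pending:
            continue  # no field requested, so the extractor is not executed
        found = extract(pending, page_range)
        for field in pending:
            if field in found:
                value, confidence = found[field]
                # Assumption: "acceptable" means meeting a minimum confidence.
                if confidence >= requested[field]:
                    accepted[field] = (value, name)
    return accepted


calls: List[Tuple[str, List[str]]] = []  # which extractor ran, for which fields


def primary(fields, pages):
    calls.append(("primary", sorted(fields)))
    return {"invoice_number": ("INV-1", 0.9), "total": ("12.00", 0.3)}


def secondary(fields, pages):
    calls.append(("secondary", sorted(fields)))
    return {"total": ("12.50", 0.8)}


def backup(fields, pages):
    calls.append(("backup", sorted(fields)))
    return {}


outcome = run_extractors(
    [
        ("primary", primary, {"invoice_number": 0.5, "total": 0.5}),
        ("secondary", secondary, {"total": 0.5}),
        ("backup", backup, {"invoice_number": 0.0, "total": 0.0}),
    ],
    page_range=(1, 3),
)
```

In this run, `primary` supplies `invoice_number`, its low-confidence `total` is rejected so `secondary` is asked only for `total`, and `backup` is never called because no field is left pending.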
Based on the requirements of the use case, you can choose from several data extraction algorithms, called extractors.
You can use any extractor that is available in the UiPath.IntelligentOCR.Activities package, in other UiPath packages (UiPath.DocumentUnderstanding.ML.Activities), or in third-party packages (UiPath.Abbyy.Activities).
The available Extractors are:
You can also build your own extractor by using the public Document Processing Contracts, which lets you implement any algorithm that fits your use case.