FormAnalyzer - system description
General Information
FormAnalyzer is a modular information system designed to digitize paper document archives and automate the form data capture process. It supports every step from the filing of paper documents to the release of electronic text data reflecting the content of the originals. The system allows you to scan, control image quality, recognize and verify any kind of paper form or documentation. It can process correspondence and archival documents as well.
The system can process both documents with a predefined field layout and any number of pages, and non-standardized documents such as official correspondence and archival documents. Documents can include attachments that do not contain fields for recognition. The system uses ICR / OCR character recognition technology to recognize the content of documents filled in with printed text or handwritten block letters. Documents can also include barcodes and manually filled OMR (optical mark recognition) boxes.
Various types of forms can be processed in a single batch in any order. The different document (form) types are recognized from their field layout and sorted automatically.
The FormAnalyzer system can also process informal documents such as correspondence, automatically analyzing their structure to extract metadata based on various custom-defined criteria.
The FormAnalyzer processing steps include document scanning, document classification, automatic recognition, manual verification, image control and data export.
The FormAnalyzer system supports two modes of document processing:
- Document processing
This is the basic mode of document processing. It allows independent processing of documents at every stage of the work. The documents are grouped into virtual folders, which have a hierarchical structure. Folders serve a similar function to directories in a file system.
- Box (file) processing
Documents processed in this mode are grouped into structured objects called boxes. Each box has a name, which may be assigned by the user or generated automatically. All documents in a box are processed in the same way as documents in the previous mode. The distinguishing feature of this mode is that the user has access to all documents in the box at every stage of the work. The second difference is that the box has attributes that define its processing stage: a box moves to the next processing stage only after all of its documents have completed the previous stage. In addition, documents in a box may contain fields whose values are propagated within that box. This reduces data entry time for fields whose values are the same across the box, e.g. the first and last name of a person in a box of customer correspondence.
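The two box rules described above (stage transition only when every document is ready, and propagation of shared field values) can be illustrated with a minimal Python sketch. The stage names, class names and field names below are assumptions made for illustration; they are not taken from the FormAnalyzer implementation.

```python
from dataclasses import dataclass, field

# Hypothetical stage names; the real FormAnalyzer stages may differ.
STAGES = ["scanned", "recognized", "verified", "exported"]


@dataclass
class Document:
    name: str
    stage: str = "scanned"
    fields: dict = field(default_factory=dict)


@dataclass
class Box:
    name: str
    documents: list = field(default_factory=list)
    stage: str = "scanned"

    def set_shared_field(self, key, value):
        """Propagate one field value to every document in the box."""
        for doc in self.documents:
            doc.fields[key] = value

    def try_advance(self):
        """Advance the box only if all documents reached the next stage."""
        nxt = STAGES.index(self.stage) + 1
        if nxt >= len(STAGES):
            return False  # already at the final stage
        if all(STAGES.index(d.stage) >= nxt for d in self.documents):
            self.stage = STAGES[nxt]
            return True
        return False


box = Box("customer-0001", [Document("letter-1"), Document("letter-2")])
box.set_shared_field("last_name", "Kowalski")

box.documents[0].stage = "recognized"
first_try = box.try_advance()   # one document is still only scanned

box.documents[1].stage = "recognized"
second_try = box.try_advance()  # every document has now been recognized
```

In this sketch the box stays in its current stage until the slowest document catches up, which mirrors the rule that a box changes stage as a whole.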
The system consists of the following modules:
- FormAnalyzer Designer
- FormAnalyzer Scan & Administrator
- FormAnalyzer Production Manager
- FormAnalyzer Engine (OCR/ICR BarCode)
- FormAnalyzer Verifier
- FormAnalyzer ImageControl
- FormAnalyzer Export
- FormAnalyzer Report
- FormAnalyzer Database
The actual structure of a system installation is determined by the required system capacity and characteristics of processed documents/forms.
FormAnalyzer example system configuration chart
The FormAnalyzer system has been developed by Arhat for over 15 years. Its functionality is constantly being expanded and reflects the needs of many users.
FormAnalyzer is in continuous use in both public and private organizations, such as: Agricultural Market Agency, PZU - insurance, Aviva - Commercial Union, ERA, FNTI - American document imaging company, where it has processed many millions of pages of documents per month.
The company owns the source code of the system and employs qualified personnel, which allows it to maintain dedicated versions that meet the requirements of a particular user.
FormAnalyzer Designer
FormAnalyzer Designer is used to define templates for form recognition and metadata processing. Templates contain the information necessary for the recognition and verification process, such as:
- layout of fields and metadata field names
- layout and type of positioning tags
- characteristics used for automatic identification of the form type
- definitions of scripts used to validate metadata
- links to external DLLs and WebServices
- links to metadata value dictionaries
- definitions of data export formats
The input to the program consists of electronic images of samples of individual groups of forms. As output, it generates (saves) templates of document recognition (processing) schemas. The templates are written to binary configuration files. Additionally, the configuration files are saved as XML to provide interoperability with other IT systems. Templates are recorded in the system database (FormAnalyzer DataBase), which allows the other modules to be configured automatically.
Templates are then used by the modules:
- FormAnalyzer Engine
- FormAnalyzer Verifier
- FormAnalyzer ImageControl
- FormAnalyzer Export
- FormAnalyzer Production Manager
Specifications:
Number of processing templates defined: | Unlimited |
Initial image processing: | rotation, deskewing, image quality improvement (noise reduction, line removal, background removal), color dropout, cropping |
Type of form field: | OCR / ICR / barcode / text / OMR |
Number of form fields: | Unlimited |
Type of the positioning tag: | CropMark/FixedPattern |
Template scaling: | on the basis of the image resolution and / or on the basis of the positioning marks |
Number of the positioning tags: | Unlimited |
Special features of OCR / ICR fields: | numeric / date / text |
Special features of barcode fields: | Automatic recognition of one of the following codes: EAN 8/13, UPC (A, E 6 digits), Code39 (CDT, SST, FA), Code128, Codabar, ITF, Code93, Code32, PDF417, Postnet code, Data Matrix, QR code, Royal Post, Australian Post, Intelligent Mail |
Validation scripting: | Validation scripts are written in VBScript or JScript. |
Validation script features: | Validation scripts may modify field values (creation, deletion) when used in the FormAnalyzer Engine and/or Verifier. Validation scripts may evaluate the validity of one or more fields when used in the FormAnalyzer Verifier. |
External DLL services: | Libraries are used to correct automatically recognized data in the FormAnalyzer Engine module and to validate form values in the FormAnalyzer Verifier module. |
WebService: | WebService-based functions may be evaluated automatically for metadata validation and modification. |
Dictionaries of values: | A dictionary may be used to correct automatically recognized form fields to the closest value from the dictionary (FormAnalyzer Engine), and to validate fields in the FormAnalyzer Verifier module by checking whether a specified value occurs in the dictionary. |
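The two dictionary uses listed in the last row (correction to the closest value in the Engine, membership check in the Verifier) can be sketched in Python as follows. This is only an illustration built on the standard library `difflib` module; the similarity measure, the `cutoff` value and the sample town list are assumptions, and the matching method actually used by FormAnalyzer is not described in this document.

```python
import difflib


def correct_with_dictionary(recognized, dictionary, cutoff=0.6):
    """Return the closest dictionary entry, or the input if none is close."""
    matches = difflib.get_close_matches(recognized, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else recognized


def is_in_dictionary(value, dictionary):
    """Verifier-style check: the value must occur in the dictionary."""
    return value in dictionary


# Hypothetical dictionary of town names.
towns = ["Warszawa", "Wroclaw", "Krakow", "Gdansk"]

corrected = correct_with_dictionary("Warszaua", towns)  # one misread character
unchanged = correct_with_dictionary("Xyz", towns)       # nothing similar enough
```

A higher `cutoff` makes the correction more conservative, at the cost of leaving more values for manual verification.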
FormAnalyzer Designer - defining the recognition template.
The quality of the recognition results can be improved by using one of the following mechanisms (tools / software options):
- trigrams
Based on statistics of character sequence occurrences in the lexicon of a given language (English, Polish, …), the system chooses, from a set of equally plausible candidates, the character that gives the highest probability for a sequence of three (or more) adjacent characters. This may be used for fields containing natural-language expressions, e.g. names, occupations, street names, etc.
- dictionaries
The FormAnalyzer Engine module selects the closest match from the dictionary, and the FormAnalyzer Verifier module verifies that a specified value occurs in the dictionary. These options can be used independently of each other. Dictionaries may be updated; the updated dictionary data come from the document verification stage. Multi-column (hierarchical) dictionaries may be applied during document validation. Each column in the dictionary is then associated with a different field, and the FormAnalyzer system checks the consistency of the data entered across all fields belonging to the dictionary. For example, given a table of all postal codes and the corresponding town names, this mechanism can be used to validate that the postal code and town fields are in the correct relation.
- validation scripts
For each field a script can be written in VBScript or JScript (Microsoft's scripting dialects of Visual Basic and JavaScript). The script may calculate the value of the field based on the recognition results (FormAnalyzer Engine and Verifier), or check whether the value of the field is valid (FormAnalyzer Verifier).
- multi-field scripts
As in the previous case, but the script is associated with a group of fields, so you can verify the correlation between the fields. For example, the Polish ID number encodes the person's sex and date of birth, so the ID, sex and date of birth fields can be validated together with a multi-field script.
- external DLLs
External DLLs are used in the same way as validation scripts but are written e.g. in C++ and compiled into a DLL, which makes them run faster.
- external WebService
This mechanism provides centralized validation of documents by the different systems used in the company. WebServices are most often used to validate fields that require access to external client databases.
- mask fields
Each field may contain a mask that defines the data entry format. Masks can be used in fields whose data format is fixed, for example an ID number, VAT number or DOB (date of birth).
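To make the trigram mechanism from the list above more concrete, here is a small Python sketch. It assumes the recognizer returns a list of candidate characters for each position, builds trigram counts from a tiny hypothetical lexicon, and picks the reading with the highest score. The scoring (sum of logarithms of add-one counts) is a crude stand-in chosen for brevity, not the statistical model used by FormAnalyzer.

```python
import itertools
import math
from collections import Counter


def trigram_counts(lexicon):
    """Count character trigrams over a word list, with start/end padding."""
    counts = Counter()
    for word in lexicon:
        padded = f"^^{word.lower()}$"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


def score(text, counts):
    """Crude plausibility score: sum of log(count + 1) over all trigrams."""
    padded = f"^^{text.lower()}$"
    return sum(math.log(counts[padded[i:i + 3]] + 1)
               for i in range(len(padded) - 2))


def best_reading(candidates, counts):
    """Try every combination of candidate characters and keep the best."""
    readings = ("".join(chars) for chars in itertools.product(*candidates))
    return max(readings, key=lambda reading: score(reading, counts))


# Hypothetical lexicon of surnames standing in for language statistics.
counts = trigram_counts(["kowalski", "nowak", "kowalczyk", "nowakowski"])

# The recognizer could not decide between "o"/"0" and between "l"/"1".
candidates = [["k"], ["o", "0"], ["w"], ["a"], ["l", "1"]]
best = best_reading(candidates, counts)
```

Enumerating all combinations is fine for short fields with a few ambiguous positions; longer fields would call for a dynamic-programming search instead.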
Additional metadata features definable using FormAnalyzer Designer
- obligatory
- acceptance of a field value with a warning when the field is inconsistent with a prescribed validation criterion
- masks for known data formats
- retype - when a field is marked for retype, validation consists of:
  - manually correcting the data recognized by the FormAnalyzer Engine module
  - accepting the revised value of the field (after acceptance the field is displayed as blank)
  - filling in the value of the field again
  - comparing the two results and, if they match, moving to the next field
- comments - field-level text messages that automatically appear in the FormAnalyzer Verifier module to inform the operator of custom directives that matter when entering a given field
FormAnalyzer Designer allows you to specify how operators shall respond when a document does not meet a critical validation criterion. A list of rejection reasons may be defined and, if needed, extended by operators.
FormAnalyzer Scan & Administrator
FormAnalyzer Scan&Administrator module supports the process of scanning and system administration.
Scanning
The program allows you to scan documents and to search, view and print the scanned document images.
The input to the program consists of paper forms. As output, ordered lists of scanned documents and/or boxes are registered in the FormAnalyzer DataBase. The images are stored on file system volumes as compressed binary images. FormAnalyzer Scan&Administrator may automatically initiate further processing steps for scanned documents.
Depending on the type of scanner, the program allows you to work in simplex mode (scanning one side of the form) or duplex mode (scanning both sides of each sheet simultaneously).
The scanning process can be software-enhanced with the following functions:
- barcode detection and recognition
- detection of blank pages as separators of documents and/or for blank-page marking
- digital image processing to improve image quality (line removal, noise removal, cropping to the actual dimensions, background color dropout)
- image rotation and deskewing
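As an illustration of blank-page detection used for document separation, the following Python sketch treats a bitonal page as rows of 0/1 pixels (1 = black) and calls a page blank when the share of black pixels stays below a threshold. The representation, the threshold value and the function names are assumptions; the detection criteria actually used by FormAnalyzer are not given here.

```python
def dark_pixel_ratio(page):
    """Fraction of black pixels in a bitonal page given as rows of 0/1."""
    total = sum(len(row) for row in page)
    if total == 0:
        return 0.0
    return sum(sum(row) for row in page) / total


def is_blank(page, max_dark_ratio=0.005):
    """A page counts as blank if almost no pixels are black (assumed threshold)."""
    return dark_pixel_ratio(page) <= max_dark_ratio


def split_on_blank_pages(pages, max_dark_ratio=0.005):
    """Group pages into documents, using blank pages as separators."""
    documents, current = [], []
    for page in pages:
        if is_blank(page, max_dark_ratio):
            if current:
                documents.append(current)
                current = []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents


# Synthetic pages: an empty sheet and a sheet with 10% black pixels.
blank_page = [[0] * 100 for _ in range(100)]
text_page = [[1 if x % 10 == 0 else 0 for x in range(100)] for _ in range(100)]

documents = split_on_blank_pages([text_page, text_page, blank_page, text_page])
page_counts = [len(doc) for doc in documents]
```

A small non-zero threshold tolerates scanner noise on otherwise empty sheets; in practice it would be tuned per scanning template.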
Specifications:
Type of scanned documents/forms: | Single- or multi-page forms/documents, scanned single-sided or duplex. |
scan mode: | Simplex and duplex (black and white, grayscale, color) |
automatic indexing: | Up to 2,147,483,647 documents |
document folders: | Tree structure up to 2,147,483,644 folders |
document boxes: | Up to 2,147,483,644 boxes. Each box has a unique name within a folder. Box name up to 64 characters. |
Scanning interface: | ISIS, ISIS-TWAIN |
Compression: | CCITT G3, G4, JPEG, LZW |
Scan Mode: | Single or multi-stream: simultaneous scanning of bitonal and color images, which allows the type (color mode) of the target document to be chosen later. |
Exemplary window FormAnalyzer Scan&Administrator (scanning).
Administration
FormAnalyzer Scan&Administrator provides all the functions needed to support the system administrator in system configuration (scanning templates, rejection reasons, access rights), administration, document search and review. It lets the administrator monitor the audit trails of document-related activities and manually direct given document groups to the required processing steps.
The program ensures the security of data stored in the system. It is used by a trained system administrator. The system administrator has the ability to monitor access to the system and define access rights for individual users or groups of users.
Specifications:
- automatic control over the whole processing workflow
- viewing the processing status of individual boxes and/or documents
- retrieving statistical data on individual boxes
- creating, modifying and deleting users
- creating, modifying and deleting user groups
- defining access rights to functions (create, select, delete) for individual users or groups
- managing users and passwords (password lifetime, password validation, forced password changes, account locking after a given number of failed attempts, account suspension, etc.)
- registering and managing scanning templates for document processing. A scanning template contains all technical scanning definitions: resolution, image size, double-feed detection, document separation definitions, and processing definitions
- data volume definition and management
Exemplary window FormAnalyzer Scan&Administrator (administration).
FormAnalyzer Production Manager
FormAnalyzer Production Manager is used to manage the work of verifiers and to define how to process documents in the system.
Defining how documents are processed in the system includes:
- Defining the way a document is processed in the system, together with exception handling
- Introducing double verification of documents or of selected fields, and the conditions under which double verification is required (e.g. depending on the value of a given document field: double-verify the field “amount” if its value is greater than 10000)
- Defining the requirements for approval of a document by authorized operators
- Defining the processing path for poorly recognized and rejected documents
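The conditional double verification rule from the list above (the “amount” example with the 10000 threshold) could be expressed along these lines. The sketch is in Python rather than in the Production Manager's own configuration, and the queue names and the decision to double-verify unreadable amounts are assumptions.

```python
from decimal import Decimal, InvalidOperation

AMOUNT_THRESHOLD = Decimal("10000")  # threshold taken from the example above


def needs_double_verification(fields):
    """True if the "amount" field exceeds the threshold.

    Assumption: an amount that cannot be parsed is treated as risky
    and is also sent to double verification.
    """
    try:
        return Decimal(str(fields.get("amount", "")).strip()) > AMOUNT_THRESHOLD
    except InvalidOperation:
        return True


def route(fields):
    """Pick a (hypothetical) verification queue for a document."""
    if needs_double_verification(fields):
        return "double_verification"
    return "single_verification"


small = route({"amount": "950.50"})
limit = route({"amount": "10000"})      # not greater than the threshold
large = route({"amount": "12500.00"})
garbled = route({"amount": "12 5OO"})   # unreadable recognition result
```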
The management of the verification operators includes:
- Defining tasks for verifiers on the basis of a set of templates and the current state of the system load
- Assigning verifiers to carry out specific tasks
- Displaying statistics that show the current load on the system
Document workflow design
Operator workload management
FormAnalyzer Engine (OCR / ICR / barcode)
FormAnalyzer Engine is a service responsible for automatic document classification and data extraction.
The input to the program consists of electronic images of documents. As output, the results of automatic document classification and/or data extraction are written in binary format to the FormAnalyzer DataBase. Further circulation of the document depends on the predefined processing scheme.
Specifications:
Scalability: | Simultaneous operation of several modules installed at different stations within one system |
Total recognized templates: | Unlimited |
Interpretation of the results of the recognition: | Date / numeric data / text / custom |
Character recognition confidence level: | the position of each character and its confidence level are stored for later use during metadata verification |
Image pre-processing: | rotation, deskewing, image quality improvement (noise reduction, line removal, background removal, character smoothing), color dropout, cropping |
Form registration: | CropMark/FixedPattern |
OCR technology: | OmniFont |
ICR technology: | recognition of text (upper case, lower case, mixed), numerals and symbols |
Barcode recognition: | Automatic recognition of one of the following codes: EAN 8/13, UPC (A, E 6 digits), Code39 (CDT, SST, FA), Code128, Codabar, ITF, Code93, Code32, PDF417, Postnet code, Data Matrix, QR code, Royal Post, Australian Post, Intelligent Mail |
Exemplary window FormAnalyzer Engine.
FormAnalyzer Verifier
FormAnalyzer Verifier is designed to allow the operator to verify and/or enter document metadata. The images of the documents are presented automatically.
Additionally, the FormAnalyzer Verifier performs the following functions:
- Manual classification of documents: assigning a processing template name to documents, manually splitting and merging documents, deleting blank pages, changing the order of pages, rotating pages
- Dual verification of documents and arbitration in the event of inconsistent results
- Approval of documents which, according to their content and the conditions defined in the processing template, must be approved by an authorized user (e.g. bank transfers with a value above a fixed amount)
- Quality control of the verifiers' work. Information about errors made by employees is aggregated and used to assess the overall quality of the data and of individual operators' work.
- Updating dictionary values from the actual content of the verified documents
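Dual verification with arbitration, mentioned in the list above, can be pictured as comparing two independent passes field by field and asking an arbiter only about the fields that differ. The Python sketch below uses hypothetical function and field names and a trivial arbiter; it is not the Verifier's actual logic.

```python
def conflicting_fields(first, second):
    """Names of fields whose values differ between the two operators."""
    names = sorted(set(first) | set(second))
    return [name for name in names if first.get(name) != second.get(name)]


def resolve(first, second, arbitrate):
    """Keep agreed values; call the arbiter for each conflicting field."""
    result = {}
    for name in sorted(set(first) | set(second)):
        a, b = first.get(name), second.get(name)
        result[name] = a if a == b else arbitrate(name, a, b)
    return result


operator_a = {"last_name": "Nowak", "amount": "1200"}
operator_b = {"last_name": "Nowak", "amount": "1280"}

conflicts = conflicting_fields(operator_a, operator_b)

# Stand-in arbiter: a third operator who confirms the second value.
final = resolve(operator_a, operator_b, lambda name, a, b: b)
```

The list of conflicts is also the kind of information that could feed the quality control statistics on verifiers' work.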
The input to the program consists of the results of automatic document recognition obtained by the FormAnalyzer Engine. As output, the verified metadata are stored in the database in binary form, where they await export to an external system.
Specifications:
- automatic verification procedure to protect against the introduction of data that do not meet certain criteria
- simultaneous operation of multiple modules installed at different stations in one system
- direct switching between fields that have a low confidence level or do not meet the validation criteria
- automatic selection of results with low confidence level
- automatic evaluation of single-field or multi-field validation criteria:
  - once entered/accepted by the operator, the value of a metadata field is automatically validated against the specified validation criteria (either script-based or predefined)
  - check, if the value of the multi-field validation criterion
  - automatic execution of the multi-field validation criterion once all required fields have been entered/accepted by the operator
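A multi-field criterion that runs only once all of its fields are available can be sketched as follows. The example reuses the Polish ID case from the Designer section and assumes that this number is the PESEL, whose public layout encodes the date of birth in the first six digits, the sex in the tenth digit and a checksum in the last digit. The field names are hypothetical, and the sketch is in Python, whereas FormAnalyzer validation scripts are written in VBScript or JScript.

```python
from datetime import date

PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)
# The month value is shifted to encode the century of birth.
CENTURY_BY_MONTH_OFFSET = {80: 1800, 0: 1900, 20: 2000, 40: 2100, 60: 2200}
REQUIRED = ("id_number", "sex", "date_of_birth")  # hypothetical field names


def decode_pesel(pesel):
    """Return (date_of_birth, sex) encoded in a PESEL, or None if invalid."""
    if len(pesel) != 11 or not (pesel.isascii() and pesel.isdigit()):
        return None
    digits = [int(ch) for ch in pesel]
    checksum = sum(w * d for w, d in zip(PESEL_WEIGHTS, digits))
    if (10 - checksum % 10) % 10 != digits[10]:
        return None
    year, month, day = int(pesel[0:2]), int(pesel[2:4]), int(pesel[4:6])
    offset = (month // 20) * 20
    try:
        born = date(CENTURY_BY_MONTH_OFFSET[offset] + year, month - offset, day)
    except ValueError:
        return None  # impossible calendar date
    sex = "M" if digits[9] % 2 == 1 else "F"
    return born, sex


def check_person_fields(fields):
    """None while a required field is missing, otherwise True or False."""
    if any(not fields.get(name) for name in REQUIRED):
        return None
    return decode_pesel(fields["id_number"]) == (fields["date_of_birth"], fields["sex"])


# Sample number with a valid checksum, used here only for illustration.
complete = {"id_number": "44051401359", "sex": "M", "date_of_birth": date(1944, 5, 14)}
consistent = check_person_fields(complete)
mismatch = check_person_fields({**complete, "sex": "F"})
pending = check_person_fields({"id_number": "44051401359"})
```

Returning `None` for incomplete input keeps "not yet evaluated" distinct from "evaluated and failed", which matches the rule that the criterion is executed only after all required fields have been entered.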
FormAnalyzer Verifier metadata entering/checking.
In verification mode the right side of the window displays the image of the document. The part of the image corresponding to the field being verified is automatically highlighted. The elements of the interface can be adjusted individually by each operator.
The upper left corner of the window displays a magnified image of the field currently being verified. Below it is a list of the names of the fields to be verified, with the results automatically extracted by the FormAnalyzer Engine. Recognition results for the field currently being entered are highlighted in yellow, and characters recognized with a low confidence level are highlighted in red. These characters are verified manually.
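The highlighting of low-confidence characters described above relies on the per-character confidence levels stored by the Engine. A minimal Python sketch of the idea, assuming confidence values between 0 and 1 and an arbitrary threshold of 0.8 (both assumptions), could look like this; square brackets stand in for the red highlighting.

```python
def low_confidence_positions(chars, threshold=0.8):
    """Indexes of characters whose confidence is below the threshold."""
    return [i for i, (_, conf) in enumerate(chars) if conf < threshold]


def mark_for_review(chars, threshold=0.8):
    """Render the text with uncertain characters wrapped in brackets."""
    return "".join(f"[{ch}]" if conf < threshold else ch for ch, conf in chars)


# Hypothetical recognition result: (character, confidence) pairs.
recognized = [("K", 0.99), ("o", 0.97), ("w", 0.55), ("a", 0.95), ("1", 0.40)]

positions = low_confidence_positions(recognized)
marked = mark_for_review(recognized)
```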
FormAnalyzer Export
FormAnalyzer Export is used to export the revised metadata and the images of pages of documents.
The input to the program consists of verified metadata. As output, the program creates files containing metadata and additional files with images. In addition, the exported files may contain selected system information about the document (e.g. the document ID or the document folder from the database). The exported file format is defined in the FormAnalyzer Designer module.
Supported export file formats:
- CSV - text format in which fields are separated by a selected separator
- XML - files conforming to the XML standard
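The two export formats can be illustrated with a short Python sketch based on the standard `csv` and `xml.etree.ElementTree` modules. The record layout, the field names and the XML element names are assumptions; the real export layout is whatever has been defined for the template in FormAnalyzer Designer.

```python
import csv
import io
import xml.etree.ElementTree as ET


def export_csv(records, field_names, separator=";"):
    """Serialize records as CSV text using the selected separator."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=field_names,
                            delimiter=separator, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def export_xml(records):
    """Serialize records as XML with one <document> element per record."""
    root = ET.Element("documents")
    for record in records:
        doc = ET.SubElement(root, "document")
        for name, value in record.items():
            ET.SubElement(doc, "field", name=name).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical verified metadata; the amount contains the separator itself.
records = [{"document_id": "1", "last_name": "Nowak", "amount": "1200;50"}]

csv_text = export_csv(records, ["document_id", "last_name", "amount"])
xml_text = export_xml(records)
```

Note that the CSV writer quotes a value that contains the separator, so a consumer using a standard CSV parser still reads three fields.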
There are two modes of export:
- automatic mode, in which the program exports the verified data according to a predefined schedule. In this mode, only valid and ready documents are exported
- on-demand, where the export operator selects the document source folder and the required status of the document processing.
Specifications:
- simultaneous export of data from different types of forms
- simultaneous operation of multiple modules installed at different stations within a single system
- bitonal images exported as TIFF with CCITT G3/G4 compression
- color images exported as TIFF with LZW compression or JPEG
- export of document pages in the stream of subsequent images or a multi-page TIFF
- metadata export: Windows-ASCII, UTF-8 or XML
FormAnalyzer Report
FormAnalyzer Report is used to create predefined or custom reports. The program allows you to generate reports based on custom Crystal Reports templates.
The FormAnalyzer Report contains several predefined reports that allow you to control the operation of the system:
- work performed over a period of time
- processing time
- evaluation of the users
- number of controlled documents
- assessment of recognition results
- number of documents processed
- number of documents in the database
- list and properties of tasks
- export sessions
- archiving sessions
Generated reports may be printed or saved in the following formats:
- Crystal Reports (rpt)
- Adobe Acrobat (pdf)
- Microsoft Word (doc)
- Microsoft Excel (xls)
- Rich Text Format (rtf)
The user can register custom report templates in the FormAnalyzer DataBase. Such templates are then available to all system administrators.
FormAnalyzer Report, report generation.
FormAnalyzer Database
All FormAnalyzer modules use a centralized relational database, which provides mechanisms for synchronizing access to the data and for data protection and integrity. The database stores the recognized content of documents and information such as the time the document was entered into the database, user data, the status of the processed document, etc. The database ensures reliable operation of the system, which can concurrently support several hundred users.
The FormAnalyzer DataBase is supplied and installed with the system. Supported operating platforms are Microsoft Windows Server 2003, 2008 R2, 2012 and newer. The database natively supports 64-bit operating systems.