Alpha Forms

Alpha Forms is designed for business documents and forms in which most fields appear at predefined positions on the page.

Steps for a typical setup:
  1. Use Boxes Editor to set up bounding boxes on a reference image.
  2. Run and test recognition.
  3. Deploy OCR on a server or desktop in production.

Steps #2 and #3 can be done on Windows or Linux, but the initial step #1 can only be done on Windows.

Requirements

Lower-resolution images (150 dpi and below) can produce good results, but 300 dpi images are recommended. If you are not sure about your image resolution, check the image dimensions in pixels. A US Letter page (8.5 x 11 inches) scanned at 300 dpi is about 2550 pixels wide and 3300 pixels high.
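The dimension check can be turned into a quick estimate. The sketch below is only an illustration and is not part of Alpha Forms. It assumes the image shows a full, uncropped US Letter page, and the function name estimate_dpi is made up for this example.

```python
# Assumption: the image is a full, uncropped US Letter page (8.5 x 11 inches).
US_LETTER_INCHES = (8.5, 11.0)


def estimate_dpi(width_px, height_px, page_inches=US_LETTER_INCHES):
    """Return (dpi along the short side, dpi along the long side)."""
    # Sorting both pairs makes the estimate work for landscape scans too.
    short_px, long_px = sorted((width_px, height_px))
    short_in, long_in = sorted(page_inches)
    return short_px / short_in, long_px / long_in


print(estimate_dpi(2550, 3300))  # (300.0, 300.0)
print(estimate_dpi(1275, 1650))  # (150.0, 150.0)
```

If the two numbers differ noticeably, the image was probably cropped or is not US Letter, and the estimate should not be trusted.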

Note that PDF is not a raster image format. It is a container. Think of an envelope that holds your image: that is what a PDF is. You cannot easily find the image dpi and other properties without looking inside the PDF with special tools.

If you are on Windows 10, right-click the raster image file (TIFF, JPG, GIF, PNG), click Properties, and open the "Details" tab. The image resolution and size should be listed there.
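On other platforms the resolution can be read from the file itself. The sketch below is not part of Alpha Forms and handles PNG files only. It reads the optional pHYs chunk that the PNG format uses to store pixel density. The function name png_dpi is made up for this example, and many PNG files do not carry this chunk at all, in which case the function returns None.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dpi(data):
    """Return (x_dpi, y_dpi) from a PNG's pHYs chunk, or None if unavailable."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte length, 4-byte type, body, 4-byte CRC.
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"pHYs" and len(body) == 9:
            x_ppu, y_ppu, unit = struct.unpack(">IIB", body)
            if unit != 1:
                return None  # unit 1 means pixels per meter; otherwise only the aspect ratio is known
            # Convert pixels per meter to pixels per inch (1 inch = 0.0254 m).
            return round(x_ppu * 0.0254), round(y_ppu * 0.0254)
        if chunk_type in (b"IDAT", b"IEND"):
            return None  # pHYs, when present, comes before the image data
        pos += 12 + length
    return None
```

To use it, read the file in binary mode and pass the bytes, for example png_dpi(open("scan.png", "rb").read()), where scan.png is a placeholder file name.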

Do not resize images to get more pixels. Typical resizing algorithms have no understanding of text or symbols. They simply enlarge pixels using fixed mathematical formulas. As a result, resized images lose quality and become blurry.

OCR Engines

Alpha Forms comes with two OCR engines to choose from: Ink OCR and Tesseract OCR.

Tesseract is an open source OCR engine that has been in development since 1985. It is a solid choice for most recognition tasks and supports multiple languages.
Ink OCR is a relatively young OCR engine, in development since 2015. It is designed with business documents in mind.

Which OCR is better depends on the nature and quality of your documents.

On Linux, Tesseract OCR is not included in our installation package. Download and install it with your distribution's package tools, such as apt or rpm. Once Tesseract is installed, it should be on the system path.
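A quick way to confirm that the install worked is to look the executable up on the path. The sketch below assumes the executable is named tesseract, which is the usual name in distribution packages. The helper name find_on_path is made up for this example.

```python
import shutil


def find_on_path(executable="tesseract"):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(executable)


location = find_on_path()
if location is None:
    print("tesseract was not found on PATH; install it or adjust PATH")
else:
    print("tesseract found at", location)
```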

Ink OCR

How well Ink OCR recognizes text depends very much on how closely your images match the requirements listed above.
Low-resolution, skewed, or misaligned document images with many random dots ("noise") present a major problem for processing. When symbols overlap or touch, they form shapes that OCR cannot separate and cannot recognize.

Ink OCR addresses these challenges using various techniques. If it fails to recognize a specific symbol, "_" (underscore) is produced in the output instead of the actual symbol. Each field gets a score.
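The underscore placeholder makes unrecognized symbols easy to locate in the output. The sketch below is only an illustration: the function name and the sample field value are made up. Note that it cannot tell a placeholder from a genuine underscore printed on the document.

```python
def unrecognized_positions(field_text, placeholder="_"):
    """Return the indexes of symbols reported as unrecognized."""
    return [i for i, ch in enumerate(field_text) if ch == placeholder]


text = "INV-20_4-00_7"  # hypothetical field value with two unrecognized symbols
print(unrecognized_positions(text))  # [6, 11]
```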

Two scores are reported: Confidence and ConfidenceAverage.
Both scores range from 0 to 100. Confidence is the score for each letter recognized in the field.

ConfidenceAverage is a convenient average of the scores of all the symbols in the field, so it is calculated over the whole field. Because the value is averaged, it tends to stay close to 100 on fields with many symbols. For example, an address field may have one letter wrong, so its average confidence would be 98 or 97 depending on how many other, correctly recognized symbols the address contains.
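The effect of field length on the average can be checked with simple arithmetic. The sketch below assumes a plain arithmetic mean of per-symbol scores, and the score of 40 for the doubtful symbol is invented for illustration. The exact formula used by Ink OCR may differ.

```python
def confidence_average(symbol_scores):
    """Arithmetic mean of per-symbol scores (assumed formula)."""
    if not symbol_scores:
        raise ValueError("field has no symbols")
    return sum(symbol_scores) / len(symbol_scores)


short_field = [100, 100, 100, 40]  # 4 symbols, one doubtful
long_field = [100] * 29 + [40]     # 30 symbols, one doubtful

print(confidence_average(short_field))  # 85.0
print(confidence_average(long_field))   # 98.0
```

The same doubtful symbol lowers the short field to 85 but the long field only to 98, which is why a long field can contain a wrong letter and still score high.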

The current recommendation is to treat any score below 100 as a warning and any score below 80 as a recognition error. This might change as we work to improve processing.
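In code, the recommendation could look like the sketch below. The function name and the returned labels are made up for this example. The thresholds are parameters because the recommendation may change.

```python
def classify_score(score, warning_below=100, error_below=80):
    """Map a 0-100 score to 'ok', 'warning', or 'error'."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score < error_below:
        return "error"
    if score < warning_below:
        return "warning"
    return "ok"


for score in (100, 97, 79):
    print(score, classify_score(score))
```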

Why bounding boxes?

Bounding boxes restrict recognition to selected parts of the page image. This improves processing performance because OCR does not have to scan the full image looking for useful data and recognizable text. OCR runs faster because the image area bounded by the boxes is usually smaller than the full image. Bounding boxes also map parts of the image to the extracted text, because each box has a name.
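To get a feel for how much image area the boxes remove from processing, consider the sketch below. The box names, coordinates, and the (x, y, width, height) layout are all hypothetical and do not reflect the format that Boxes Editor saves. The calculation also assumes that boxes do not overlap.

```python
PAGE_WIDTH, PAGE_HEIGHT = 2550, 3300  # US Letter at 300 dpi

# Hypothetical named boxes as (x, y, width, height) in pixels.
boxes = {
    "invoice_number": (1900, 200, 500, 80),
    "invoice_date": (1900, 300, 500, 80),
    "customer_address": (200, 600, 1000, 300),
}


def boxed_fraction(boxes, page_width, page_height):
    """Fraction of the page area covered by the boxes (no overlaps assumed)."""
    boxed_area = sum(w * h for (_x, _y, w, h) in boxes.values())
    return boxed_area / (page_width * page_height)


print(f"{boxed_fraction(boxes, PAGE_WIDTH, PAGE_HEIGHT):.1%}")  # 4.5%
```

In this made-up layout OCR only has to look at about 4.5% of the page, and each recognized value comes back under its box name.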

Since boxes are named, they become structured data fields. Structured data can be loaded into relational databases. It becomes easy to search for specific IDs, numbers, or dates because the data is mapped to specific fields.
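As an illustration of loading named fields into a relational database, the sketch below uses an in-memory SQLite database. The table, the field names, and the values are invented for this example, and the dictionary stands in for whatever output your recognition step produces.

```python
import sqlite3

# Hypothetical recognition result: one value per named box.
fields = {
    "invoice_number": "INV-2044",
    "invoice_date": "2021-03-15",
    "total": "118.50",
}


def load_invoice(conn, fields):
    """Store one recognized document as a row, one column per named field."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS invoices "
        "(invoice_number TEXT, invoice_date TEXT, total TEXT)"
    )
    conn.execute(
        "INSERT INTO invoices VALUES (:invoice_number, :invoice_date, :total)",
        fields,
    )


def find_invoice(conn, invoice_number):
    """Look a document up by its invoice number."""
    return conn.execute(
        "SELECT invoice_date, total FROM invoices WHERE invoice_number = ?",
        (invoice_number,),
    ).fetchone()


conn = sqlite3.connect(":memory:")
load_invoice(conn, fields)
print(find_invoice(conn, "INV-2044"))  # ('2021-03-15', '118.50')
```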

OCR can take some processing "shortcuts" to increase accuracy because fields have types. For example, a numeric field is not expected to contain letters, so OCR can restrict the range of expected values to digits only and increase accuracy.
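The idea can be shown with a toy model. The sketch below is not how Ink OCR or Tesseract work internally. It assumes each symbol position comes with a few candidate readings and scores (all invented here) and shows how limiting the candidates to digits resolves look-alike symbols such as "O" and "0".

```python
import string


def best_symbol(candidates, allowed=None):
    """Pick the highest-scoring candidate, optionally limited to allowed symbols."""
    pool = {s: score for s, score in candidates.items()
            if allowed is None or s in allowed}
    if not pool:
        return "_"  # nothing acceptable: report the symbol as unrecognized
    return max(pool, key=pool.get)


# Invented candidate scores for a three-symbol field.
ambiguous = [{"O": 55, "0": 45}, {"l": 50, "1": 48, "I": 2}, {"7": 99}]

print("".join(best_symbol(c) for c in ambiguous))                         # Ol7
print("".join(best_symbol(c, allowed=string.digits) for c in ambiguous))  # 017
```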

Setting up bounding boxes takes time. It also restricts recognition to the document types that have been set up.

No bounding boxes

OCR without bounding boxes can still process an image, but the extracted data is not structured as named fields. It is still useful because it makes documents stored as images searchable, so it is applicable in a number of other business domains.
However, all extracted information is just "data". It has no extra meaning attached to it, such as "invoice number" or "patient address".
If you need structured data and fields but run OCR without bounding boxes, you (as the developer) would have to build additional software to structure the extracted text, that is, to attach names to it.
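Such additional software often starts with pattern matching on the raw text. The sketch below is one simple approach, not a feature of Alpha Forms. The field names, patterns, and sample text are invented, and real documents usually need more robust rules.

```python
import re

# Hypothetical patterns that attach a field name to a piece of raw text.
PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "invoice_date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
}


def extract_fields(raw_text):
    """Return a dict of field name to matched value, or None when not found."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(raw_text)
        fields[name] = match.group(1) if match else None
    return fields


raw = "ACME Corp\nInvoice No: INV-2044\nDate: 2021-03-15\nTotal due 118.50"
print(extract_fields(raw))
```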
