Ideally, OCR accuracy would match human-level character recognition. Unfortunately, human-level accuracy is
hard to achieve, because humans use contextual knowledge of the real world to recognize symbols.
For example, if the lowercase letters "r" and "n" sit very close together and touch at the top, they look like the letter "m" (rn). Cases like this are very hard for software to detect and fix, especially in address fields, names, or codes, where spell checking cannot help.
This is only one example. The letters "O" and "o", the digit "0", and many other look-alike combinations are also challenging for reliable OCR.
Most images are compressed. There are two different compression types: lossy and lossless.
Lossy compression alters some of the image data in order to achieve a better compression ratio. It relies on the fact that humans usually do not notice the difference in the image even when a number of bits have been altered.
With lossless compression, every single bit of data that was originally in the file remains after the file is uncompressed.
From an OCR point of view, lossless compression is better, because lossy compression discards pixel data and alters the original image.
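The difference can be illustrated with a minimal sketch using only the Python standard library. The pixel values below are made up, and the "lossy" step is a simple quantization chosen for illustration; real lossy codecs such as JPEG work differently, but the consequence is the same: discarded data cannot be recovered.

```python
import zlib

# A tiny 8-bit grayscale "image" (made-up values): dark strokes (20) on a
# light background (235) with a few in-between edge pixels.
pixels = bytes([235, 235, 130, 20, 20, 110, 235, 235,
                235, 140, 20, 20, 20, 20, 120, 235])

# Lossless: compress, decompress, and compare byte for byte.
restored = zlib.decompress(zlib.compress(pixels))
lossless_ok = restored == pixels

# Simulated lossy step: keep only the top 2 bits of every pixel.
# This is NOT how JPEG works; it only shows that pixel values change.
quantized = bytes(p & 0b11000000 for p in pixels)
changed = sum(1 for a, b in zip(pixels, quantized) if a != b)

print("lossless round trip identical:", lossless_ok)
print("pixels changed by lossy step:", changed, "of", len(pixels))
```

The edge pixels between stroke and background are exactly the ones a recognizer relies on to separate touching characters, which is why altering them tends to hurt OCR.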
JPG is a lossy format, and scanned PDF files often default to lossy (JPEG) compression for the embedded page images. GIF compresses losslessly but is limited to a 256-color palette, so it can still discard color detail. TIFF and PNG default to lossless compression. If you have access to the scanner, consider scanning to TIFF or PNG.
Scanner software may have an option to set the compression type and level even for JPG and PDF files. If you have a choice, use lossless compression. Your image files will be larger, but OCR accuracy should improve.
A number of scanners support color dropout, in which pixels of a certain color are not transferred to the resulting image during scanning. Usually those pixels belong to table borders and other seemingly non-essential data. This tends to improve accuracy with OCR engines that do not use bounding boxes, and in some cases it is true for Alpha Forms OCR as well.
However, table borders can also be used for page alignment. The current Alpha Forms implementation uses these additional pixels, when available, to perform more accurate page alignment and deskew. Depending on the specifics of the images you receive, it might be better to leave the color in and get more precise page alignment.
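Color dropout is normally done by the scanner or its driver, but the idea can be sketched in software. The example below is an illustration only: the image representation (rows of RGB tuples), the `drop_color` function, and the tolerance value are all assumptions, not part of any scanner or Alpha Forms API.

```python
WHITE = (255, 255, 255)
BLACK = (0, 0, 0)
RED = (255, 0, 0)

def drop_color(image, target, tolerance=60):
    """Replace pixels whose RGB channels are all within `tolerance`
    of `target` with white. `image` is a list of rows of (r, g, b) tuples."""
    def is_close(pixel):
        return all(abs(c - t) <= tolerance for c, t in zip(pixel, target))
    return [[WHITE if is_close(p) else p for p in row] for row in image]

# One row: a reddish table-border pixel, a black text pixel, white paper.
row = [(230, 30, 40), BLACK, WHITE]
cleaned = drop_color([row], target=RED)
print(cleaned)
```

Because the border pixels are gone after this step, they are no longer available for alignment and deskew, which is the trade-off described above.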
The XML and JSON outputs produced by OCR contain confidence levels and additional processing information.
Use them for more fine-grained post-processing.
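As a rough sketch of confidence-based post-processing, the snippet below flags low-confidence fields for manual review. The JSON layout, the field names, and the 0 to 1 confidence scale are hypothetical; check the actual output of your OCR run and adjust the keys and threshold accordingly.

```python
import json

# Hypothetical OCR output; real key names and the confidence scale may differ.
sample_json = """
{
  "fields": [
    {"name": "last_name", "value": "Smith",    "confidence": 0.98},
    {"name": "zip_code",  "value": "9021O",    "confidence": 0.61},
    {"name": "phone",     "value": "555-0142", "confidence": 0.93}
  ]
}
"""

def fields_needing_review(ocr_json, threshold=0.90):
    """Return the names of fields whose confidence is below `threshold`."""
    document = json.loads(ocr_json)
    return [f["name"] for f in document["fields"] if f["confidence"] < threshold]

to_review = fields_needing_review(sample_json)
print(to_review)
```

A threshold like this is best tuned on a sample of your own documents, since a value that is too strict sends nearly everything to manual review.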
You may find that some fields in a document are always problematic: the recognizer may consistently favor one character over another. Cases like that can sometimes be solved in a post-processing step, where a script or program applies a fixed set of rules to clean up specific fields.
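A fixed rule set of this kind might look like the following sketch. The field names and the substitution tables are assumptions made up for illustration; the right rules depend on which confusions you actually observe in your own documents.

```python
# Hypothetical look-alike substitutions for fields known to be numeric
# or alphabetic. Extend or trim these based on observed errors.
TO_DIGIT = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})
TO_LETTER = str.maketrans({"0": "O", "5": "S", "8": "B"})

# Hypothetical mapping from field name to the rule that applies to it.
FIELD_RULES = {
    "zip_code": TO_DIGIT,
    "last_name": TO_LETTER,
}

def clean_field(name, value):
    """Apply the field's substitution table, or return the value unchanged
    when no rule is registered for that field."""
    table = FIELD_RULES.get(name)
    return value.translate(table) if table else value

print(clean_field("zip_code", "9O2l0"))
print(clean_field("last_name", "5mith"))
```

Rules like these are only safe for fields with a known character class; applying them to free-text fields would introduce new errors.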
CSV output does not contain any extra processing information except "_" (underscores) in place of unrecognized symbols. However, CSV is easier to import directly into relational databases or spreadsheets.
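Before importing a CSV file, it can help to list the rows that contain the underscore marker. The sketch below assumes a header row and uses made-up column names; it also assumes that a legitimate underscore never appears in your data, which may not hold for every form.

```python
import csv
import io

# Hypothetical CSV export; the column names are made up for illustration.
sample_csv = "id,last_name,zip_code\n1,Smith,90210\n2,J_nes,1000_\n3,Lee,60614\n"

def rows_with_unrecognized(csv_text, marker="_"):
    """Return (line_number, column_names) for each row containing `marker`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    flagged = []
    for row in reader:
        bad = [col for col, val in row.items() if marker in (val or "")]
        if bad:
            # line_num counts physical lines read so far, header included.
            flagged.append((reader.line_num, bad))
    return flagged

flagged_rows = rows_with_unrecognized(sample_csv)
print(flagged_rows)
```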
If images still have low recognition accuracy (low image quality, large skew, overlapping symbols, or other technical challenges), OCR might still be able to fill a secondary role in production by focusing only on key fields, for example by running as a verification tool for manual data entry.
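One way to picture the verification role is a comparison of manually keyed values against the OCR result for a few key fields. This is a minimal sketch: the field names, the record layout (plain dictionaries), and the `find_mismatches` helper are hypothetical.

```python
# Hypothetical set of key fields worth double-checking.
KEY_FIELDS = ("account_number", "zip_code")

def find_mismatches(manual, ocr, key_fields=KEY_FIELDS):
    """Return {field: (manual_value, ocr_value)} for key fields that differ."""
    mismatches = {}
    for field in key_fields:
        typed = manual.get(field, "").strip()
        read = ocr.get(field, "").strip()
        if typed != read:
            mismatches[field] = (typed, read)
    return mismatches

manual_entry = {"account_number": "4471-0092", "zip_code": "90210", "name": "Ann Lee"}
ocr_result = {"account_number": "4471-0092", "zip_code": "90218", "name": "Ann Lce"}

mismatches = find_mismatches(manual_entry, ocr_result)
print(mismatches)
```

A mismatch does not say which side is wrong; it only marks the record for a second look. Fields outside the key set, such as the name here, are ignored.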