Editor

The Editor is designed to help you set up and test a recognition project. Using boxes makes processing faster, since OCR does not need to scan the whole document looking for text and numbers.

You can also set box types to limit the range of recognizable symbols. This increases recognition accuracy, since similar-looking letters and numbers can be excluded from the match.

The Boxes Editor is an important part of the OCR tools group. Use it to set up the fields you want to extract from images, adjust field types, and test recognition.

Each project needs a reference image. It should be the image that most closely matches other document images of that particular type.

Example of the main Editor screen.

For example, if you want to process medical claims on the HCFA 1500 form, pick an image of an unskewed, filled-in, scanned form with all the fields clearly in their place. Then import that image into the Boxes Editor using the "Image" button. Make sure to keep that reference image. If you decide to move OCR processing to another computer or server, copy both the project ".box" file and the reference image file.
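When moving a project to another machine, the copy step can be scripted so that neither file is forgotten. The following is a minimal sketch; the function name `copy_project` and any file names passed to it are hypothetical and not part of the product.

```python
import shutil
from pathlib import Path


def copy_project(project_file: str, reference_image: str, target_dir: str) -> list[Path]:
    """Copy a ".box" project file and its reference image into target_dir."""
    sources = [Path(project_file), Path(reference_image)]

    # Check both files up front so a partial copy is never left behind.
    for source in sources:
        if not source.is_file():
            raise FileNotFoundError(f"Missing project file: {source}")

    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    # copy2 also preserves timestamps, which helps when comparing copies later.
    return [Path(shutil.copy2(source, target / source.name)) for source in sources]
```

The function refuses to copy anything if either file is missing, because a project without its reference image cannot be aligned during processing.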

During processing, the reference image is compared against each incoming image. This is done to align boxes and perform other important OCR steps.

Formats

Currently supported image formats are PDF, TIFF, JPG, GIF, and PNG.
When a PDF or TIFF has multiple pages, only the first page is processed.

Start Project
Typical steps to start a new project:

Tip: there is a faster way to run recognition inside the Editor. Use the F5 keyboard shortcut for "Quick Run". Quick Run simply reuses the last known file names for your project.

Working With Boxes

Once a box is added in the Editor, a unique field name is assigned to it. You can rename the field. Do not use special symbols or punctuation in the field name. Some output formats, such as XML or JSON, do not allow certain special symbols in names, so it is best to stick to letters and numbers when renaming fields.
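If field names come from an external list (for example, labels typed from a paper form), they can be cleaned before being entered in the Editor. This is a small sketch of the "letters and numbers only" rule; the function name and the "Field" prefix are assumptions for illustration, not something the Editor provides.

```python
import re


def sanitize_field_name(name: str, fallback: str = "Field") -> str:
    """Keep only ASCII letters and digits in a field name."""
    cleaned = re.sub(r"[^A-Za-z0-9]", "", name)
    # XML element names cannot start with a digit, so add a prefix in that case.
    if not cleaned or cleaned[0].isdigit():
        cleaned = fallback + cleaned
    return cleaned


print(sanitize_field_name("Patient's Name (Last)"))  # PatientsNameLast
print(sanitize_field_name("1a. Insured ID#"))        # Field1aInsuredID
```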

Tip: there is a faster way to add boxes. Press the Ctrl-A keyboard shortcut to switch to "Add" mode, then click on the reference image.

Example of adding a new box on the main Editor screen.

Tip: there is a faster way to rename fields. Select the box, then right-click it. The Rename dialog will be displayed.

When documents are scanned via an Automatic Document Feeder (ADF), pages sometimes get slightly misaligned. On forms that have fields in very tight table cells, part of the table frame may get extracted and processed by OCR as part of the field.

If this happens, try adjusting the bounding box. If that does not help, another option is to do the opposite: make the box wider to capture the left and right walls of the table cell, then run post-processing on the recognized fields and trim the first and last characters from the OCR output.
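The trimming step can be as simple as the sketch below. It assumes the box was deliberately widened so that the first and last characters of the OCR output are always the cell walls; the function name is hypothetical.

```python
def trim_cell_walls(text: str) -> str:
    """Drop the first and last characters, assumed to be table-cell walls read by OCR."""
    stripped = text.strip()
    # Slicing also handles very short strings: fewer than 3 characters yields "".
    return stripped[1:-1].strip()


print(trim_cell_walls("|12345|"))         # 12345
print(trim_cell_walls("I 04/12/2019 l"))  # 04/12/2019
```

Apply this only to fields whose boxes include both walls. On a box that captures just the text, it would remove real characters.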

Adjustments are essential when working with complex forms. It is also important to choose the right image for the initial template. The initial image should not be skewed. If it is, every input image will be skewed to match the template, and OCR will run on skewed fields, resulting in a reduced recognition rate.

Box Types

Box type information is used by Ink OCR. It is not used by Tesseract OCR.

Box types restrict the range of recognized characters. If you can limit the range of symbols that can appear in a given box, you can improve recognition and reduce errors.

For example, if you know that a certain box can only contain whole numbers, set its type to Integer. Recognition will then be limited to a specific set of numeric symbols and will not mistake "0" (zero) for "O" (uppercase O). The same applies to many other cases where symbols are very similar. Setting box types improves recognition.

The default box type is Alphanumeric. It is the least restrictive and should be changed to a stricter type whenever possible.

For Alphanumeric boxes, OCR performs context analysis of the text itself but can still easily confuse symbols that look similar. This is an especially hard problem with lower-resolution images that have "noise" (random dots). For example, "D" (uppercase D) can look similar to "O" (uppercase O) in certain fonts in low-resolution images.

Many business forms are printed with text in uppercase, and some fields can only contain numbers or dates. So it makes sense to restrict box types to only the expected symbol ranges.

Box type symbol ranges:
  • Alphanumeric: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvxyz1234567890-@#$&()+=%;:?.,\"
  • Alphanumeric Uppercase: ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890-@#$&()+=%;:?.,\"
  • Text: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvxyz
  • Text Uppercase: ABCDEFGHIJKLMNOPQRSTUVWXYZ
  • ID: ABCDEFGHIJKLMNPQRSTUVWXYZ1234567890-
  • Integer: 1234567890-
  • Number: 1234567890-$()+%.,
  • Date: 1234567890-/:.
  • Checkbox: X

Instead of a symbol, OCR may produce "_" (underscore) for an unrecognized fragment of the image.
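The symbol ranges and the underscore placeholder can also be used to sanity-check recognized text after processing, for example to flag fields for manual review. The sketch below copies a subset of the ranges from the list above; the function name, the result layout, and the choice to ignore whitespace are assumptions, not part of the product.

```python
# Symbol ranges copied from the box type list (a subset, for brevity).
BOX_TYPE_SYMBOLS = {
    "ID": "ABCDEFGHIJKLMNPQRSTUVWXYZ1234567890-",
    "Integer": "1234567890-",
    "Number": "1234567890-$()+%.,",
    "Date": "1234567890-/:.",
    "Checkbox": "X",
}

UNRECOGNIZED = "_"  # placeholder OCR may produce for an unreadable fragment


def check_field(text: str, box_type: str) -> dict:
    """Report unrecognized fragments and symbols outside the box type's range."""
    allowed = set(BOX_TYPE_SYMBOLS[box_type])
    unrecognized = text.count(UNRECOGNIZED)
    # Assumption: whitespace between symbols is tolerated and not reported.
    unexpected = sorted(
        {ch for ch in text
         if ch not in allowed and ch != UNRECOGNIZED and not ch.isspace()}
    )
    return {
        "unrecognized": unrecognized,
        "unexpected": unexpected,
        "ok": unrecognized == 0 and not unexpected,
    }


print(check_field("04/12/2019", "Date")["ok"])  # True
print(check_field("12_45", "Integer")["ok"])    # False
```

A field with `ok` set to False either contains an underscore placeholder or a symbol that should not be possible for its box type, so it is a good candidate for a second look.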

Page Alignment And Skew

Depending on the source of the images, documents might be misaligned or heavily skewed. It is important to pick the right image as the reference image for the project. The reference image will be used to align and deskew incoming documents.

The deskew process can handle images skewed up to 20 degrees. Alignment adjusts boxes on the incoming image to match the boxes on the reference image. It helps to expand bounding boxes into the unused white space around the text field you want to extract. An expanded box encompasses the text and a margin of white space around it, so in case of slight misalignment the pixels that make up the symbols are still captured by the box.

You can switch Debug Mode ON to see the actual box images extracted during processing. This helps you adjust and expand the white space padding inside the boxes without grabbing parts of other text or graphics on the page.
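To illustrate the idea of padding, the sketch below expands a box by a fixed number of pixels on every side and clamps it to the page. In the Editor this is done by dragging the box edges; the `Box` type, the function name, and the pixel values here are hypothetical and do not reflect the ".box" file format.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def expand_box(box: Box, padding: int, image_width: int, image_height: int) -> Box:
    """Grow a box by `padding` pixels on each side without leaving the image."""
    return Box(
        max(0, box.left - padding),
        max(0, box.top - padding),
        min(image_width, box.right + padding),
        min(image_height, box.bottom + padding),
    )


# A field box on a 2550 x 3300 pixel page, padded by 10 pixels.
print(expand_box(Box(100, 50, 300, 80), 10, 2550, 3300))
# Box(left=90, top=40, right=310, bottom=90)
```

The padding should stay smaller than the gap to the nearest neighboring text or table line, otherwise the expanded box starts capturing unrelated pixels.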
