This is a general technical blog on the challenges of building, testing and maintaining OCR systems. The problems discussed here apply to document processing as well as pattern matching, computer vision and machine learning.
Paper is still here. Even after the digital-everything revolution, paper documents are still all around us. The paperless office is becoming a reality, but strangely enough the amount of paper printed worldwide does not decrease.
Why is paper so pervasive? There are a few possible explanations:
According to Wikipedia:
An affordance is a quality of an object, or an environment, which allows an individual to perform an action. For example, a knob affords twisting, and perhaps pushing, while a cord affords pulling.
Paper is cheap. It is easy to handle in small volumes: print, ship, bend, copy, store, carry around, etc. I am not defending the use of paper. Its unique affordances simply make it "sticky" and hard to replace in production systems.
What makes paper great at low volumes also makes it hard to handle at medium or large volumes. Books and magazines stored on one modern hard drive can fill entire rooms if converted back into paper format.
Researchers and software developers realized that paper is here to stay. So at least in this transition period it would be nice to have some bridges that take electronic formats into paper and back. The first bridge is an already solved problem: printers put documents on paper as part of the normal workflow. Scanners and OCR software should act as the second bridge by taking printed documents and bringing the data back. Unfortunately this appears to be an unsolved problem with no clear solution even after decades of research.
Accuracy is the major issue with OCR software. Human-level accuracy is not here yet. In some specific domains it is close: books scanned with custom hardware at high resolution. But in most cases it is far from perfect: forms with lots of graphical content, low resolution images, handwritten documents.
OCR has a long history of research. The need for reliable OCR comes up in surprisingly diverse domains. Example: reading street signs or license plates on cars.
This is a little detour into the world of data and information. At the base level there is structured and un-structured (free-form) data. Many pieces of data we deal with every day might be structured or free-form.
Structured data has fields or at least (in)visible labels attached to it. Example: web pages have headers, paragraphs and other HTML formatting. It is structured. Excel spreadsheets, CSV files, relational database records all filled with structured data.
Sometimes line between structured and un-structured data is blurry. Example: storing free-form text files inside big variable length text field in the relational database.
Having data inside structured fields has benefits. If you want to find patient's date of birth you can just search database records with field having date of birth. In case data is un-structured you would have to search everything for text that looks like date and only then somehow check that it is really patient's date of birth not service date or hospital admission date.
Benefits of structured data grow with dataset size.
If data is un-structured free-form then everything is just "text".
This is seemingly small but important point. If everything is just text it is hard to tell apart invoice numbers from purchase order numbers, invoice dates from shipping dates, etc.
Most off-the-shelf commercial OCRs treat all data as text. Take a simple example: an invoice date, which cannot contain letters. It gets processed as text, so wrong recognition can put letters inside the date. Let's say you have a date "01/02/2010" and some extra black dots (noise) are in the image. OCR reads the date but spits out "D1/D2/2010". Two zeros got replaced with the letter "D".
Accuracy is lost if OCR treats all data as just "text".
There are two solutions to this problem:
An intelligent field type switch is hard to build. Business forms come with all kinds of codes, IDs, dates and address fields. One example is ISBN numbers for books. A 10-digit ISBN contains only digits, but some also contain the letter "X" in place of the very last digit. Tip: the last digit of a 10-digit ISBN is actually a check digit and it can have values from 0 to 10. When the check digit has value 10 it is represented by "X".
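To make the check digit concrete, here is a minimal sketch of validating a 10-digit ISBN. The function name `isbn10_is_valid` is made up for this post. The rule it applies is the standard ISBN-10 one: multiply the characters by weights 10 down to 1, and the total must be divisible by 11.

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Return True if the string is a 10-digit ISBN with a correct check digit."""
    # Hyphens and spaces are only separators, so drop them first.
    chars = [c for c in isbn if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for position, ch in enumerate(chars):
        if ch in "0123456789":
            value = int(ch)
        elif ch in "Xx" and position == 9:
            value = 10  # "X" is only legal as the check digit
        else:
            return False
        total += (10 - position) * value  # weights run 10, 9, ..., 1
    return total % 11 == 0


print(isbn10_is_valid("0-8044-2957-X"))
print(isbn10_is_valid("0-306-40615-3"))
```

An OCR field that knows it holds an ISBN can use a check like this to reject a misread instead of passing it on as plain text.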
Same issue applies to other IDs and numbers: letter or special symbol ($) can be valid inside specific field. This and other issues make automatic field type detection hard.
If your OCR allows setting field types then use it. Setting specific field types should improve recognition accuracy.
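If your OCR does not support field types, part of the benefit can be approximated after recognition. The sketch below is an illustration only: the look-alike table `DIGIT_LOOKALIKES` and both function names are assumptions made for this post, not part of any OCR product.

```python
# Hypothetical table of letters that OCR tends to return in place of digits.
DIGIT_LOOKALIKES = {"D": "0", "O": "0", "Q": "0", "I": "1", "l": "1", "S": "5", "B": "8"}


def coerce_date_field(raw: str) -> str:
    """Replace look-alike letters with digits in a field known to hold a date."""
    return "".join(DIGIT_LOOKALIKES.get(ch, ch) for ch in raw)


def looks_like_date(text: str) -> bool:
    """Check the MM/DD/YYYY shape only; this does not validate the calendar."""
    parts = text.split("/")
    return (
        len(parts) == 3
        and [len(p) for p in parts] == [2, 2, 4]
        and all(p.isdigit() for p in parts)
    )


fixed = coerce_date_field("D1/D2/2010")
print(fixed, looks_like_date(fixed))
```

This only works because the field is known to be a date. Applied to free-form text the same substitution would corrupt ordinary words.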
Noise is all around us. It manifests itself as random dots in scanned images. Some noise also appears as small pieces chipped off the letters.
So it both adds information to the image and removes it.
If you repeatedly print and scan the same document, more and more noise will creep in. Extra noise blurs similar looking symbols, making them hard to distinguish: "O" becomes "0" or "D", "i" becomes "l" and so on.
Noise in digital communications is handled by adding error checking and/or error correcting codes to the transmission. Extra bits added to the message let you check if noise damaged the message, and maybe even let an algorithm make a correction to fix it. I am simplifying this of course. But the idea is simple - extra payload in the message makes for much more reliable communications.
Romans and Greeks designed pretty alphabet without any expectations for upcoming digital age. Error correcting codes are not in the alphabet. If you are scanning typical text spell checking or dictionary can help. It does not help if you are scanning business forms since names, address information, IDs and codes are not in the dictionary.
There should be other means to check and correct OCR errors. One way is to have OCR with two or more recognition algorithms in it. If multiple OCR algorithms (in essence, multiple OCR engines) agree on specific field value then recognition confidence is high. If multiple recognition algorithms do not agree then confidence is low.
OCR with multiple recognition algorithms improves accuracy.
There is an even more advanced variation of this: OCR with a voting system in it. A few recognition algorithms are combined and each contributes with a different weight to the final confidence score.
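A minimal sketch of such a voting scheme is shown below. The function name `weighted_vote` and the weights are illustrative assumptions. Real weights would come from measuring each algorithm on known data, and the input list is assumed to be non-empty with positive weights.

```python
from collections import defaultdict


def weighted_vote(readings):
    """readings: list of (text, weight) pairs, one per recognition algorithm.

    Returns the text with the highest total weight and its share of all weight.
    """
    totals = defaultdict(float)
    for text, weight in readings:
        totals[text] += weight
    winner = max(totals, key=totals.get)
    confidence = totals[winner] / sum(totals.values())
    return winner, confidence


readings = [("01/02/2010", 0.5), ("D1/D2/2010", 0.2), ("01/02/2010", 0.3)]
print(weighted_vote(readings))
```

When all algorithms agree the confidence is 1.0. The more they disagree, the lower it gets, which is a useful signal for sending a field to a human operator.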
There are different problem domains hiding under one large umbrella called "OCR". Books, magazines, legal documents, business forms, checks, hand written documents all have specific requirements.
Different types of documents need different OCR engine designed for them. One-size-fits-all approach usually does not work well. OCR designed for books can not deal with business forms.
There is a tendency in the industry to use general purpose packages for all printed documents no matter the type. The hope is that OCR will scan the whole document and magically give you the data in it. Why is it such a bad idea?
General purpose OCR will have to read every pixel of the image and try to interpret it. Even if you only care about a dozen fields on the invoice, the whole image will be processed, wasting CPU time on data you do not care about.
Some packages are smart enough to separate images and lines from symbols. If not then every group of pixels will be converted to symbols.
The image will likely have noise (small groups of pixels that appear during the scanning process), and then even a small group of dots will become a symbol. OCR will spit out tons of symbols that do not appear to be in the image.
All data will be un-structured leaving you with the job of finding and parsing out important fields you need from blocks of text.
Summary on general purpose OCR:
General purpose OCR is good for certain domains. Example: indexing and search. You are scanning books or magazines and want to be able to search in them. Everything is just text, and searching it is all you need.
Use right tool for the job.
Designing good OCR requires balancing several competing factors. One of them: accuracy vs speed.
Given infinite time OCR could grind on the image until it produces better output. There is just one small problem: no one will be interested in software that is much, much slower than a human operator. There are obvious constraints, and time is one of them.
Computer hardware is changing. Fifteen years ago multi-core architecture was uncommon. Now computers with 4 cores are commonplace. The number of cores per computer is likely to increase in the near future.
Many OCR packages used today were built over fifteen years ago, at a time when multi-core support was non-essential. Imagine you have a 4-core computer. If you run a program that does not have multi-core support it will run on 1 core only. It will use only 25% of the available processing power (100% / 4 = 25%).
If that is the case, you can only take advantage of a multi-core computer by running multiple instances of that OCR program.
Make sure hardware is fully utilized.
Our OCR products are built with multi-core computers in mind. We use programming language that has great support for concurrent and parallel execution.
Most OCR software companies recommend scanning at a minimum resolution of 300 dots per inch (DPI) for effective data extraction.
At 300 DPI the scanner captures 300 dots horizontally and 300 dots vertically for every square inch of paper, or 300 x 300 = 90,000 dots per square inch. If you use a 150 DPI setting instead of 300 DPI you will only get 22,500 dots per square inch as opposed to 90,000. Big difference. At higher resolution the OCR engine simply has more data to work with.
Higher image resolution results in improved OCR accuracy.
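The arithmetic is easy to check, and it can be extended to the size of the letters themselves. The sketch below assumes the usual typographic convention that one point is 1/72 of an inch. Both function names are made up for this post.

```python
def dots_per_square_inch(dpi: int) -> int:
    """Dots captured per square inch at a given scanner resolution."""
    return dpi * dpi


def em_height_pixels(point_size: float, dpi: int) -> float:
    """Nominal height in pixels of a font of the given point size (1 pt = 1/72 inch)."""
    return point_size / 72 * dpi


for dpi in (150, 300):
    print(dpi, dots_per_square_inch(dpi), round(em_height_pixels(10, dpi), 1))
```

Halving the resolution halves the height of each letter in pixels but cuts the number of dots describing it to a quarter.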
You may think that resizing a low resolution image should help. It does not. Resize algorithms use pixel interpolation. There are a few different interpolation techniques. In all cases new pixels are inserted based on the values of the pixels around them, so no new information is added. It works like the "digital zoom" feature on a photo camera: the image becomes more and more blurry and blocky as you zoom in.
Social Security Administration W2 and W3 forms are good examples of forms prepared for OCR color drop-out. The forms are pre-printed on paper in a specific color (usually red). Then text is printed on that paper in a different color. In some cases the forms are filled in by hand. Because the text added later and the pre-printed form frames have different colors, they are easy to separate.
This is an old technique. One of the ways to help OCR process business forms is to remove the form frames from the image, leaving only the text, just before it gets to OCR for recognition. Once the frames are removed (that color is removed from the image) OCR has a much easier time recognizing all the symbols. It does not need to parse geometry or try to align frames.
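As a rough illustration of the idea, the sketch below turns every "red enough" pixel of an RGB image into white. The image is a plain list of rows of (r, g, b) tuples. The cut-off values `min_red` and `margin` are guesses for illustration, not values taken from any real form specification.

```python
WHITE = (255, 255, 255)


def is_dropout_red(pixel, min_red=150, margin=60):
    """True if the pixel is clearly red: strong red channel, much weaker green and blue."""
    r, g, b = pixel
    return r >= min_red and r - g >= margin and r - b >= margin


def drop_out_color(image):
    """Return a copy of the image with the form color replaced by white."""
    return [[WHITE if is_dropout_red(px) else px for px in row] for row in image]


row = [(255, 255, 255), (220, 40, 40), (10, 10, 10)]  # paper, red frame, black text
print(drop_out_color([row]))
```

Black text and white paper have roughly equal channel values, so they pass through untouched while the red frame disappears.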
Color drop-out has a number of downsides. Forms have to be pre-printed with frames in a special color. Everyone who sends you paper documents has to first request forms from you. This is possible only if you are a powerful organization that can force everyone to obey your rules.
Modern OCRs should not require color drop-out.
When pages are printed or scanned, slight angles creep in. Text lines and form boxes may no longer be parallel to the boundaries of the page. This is not a problem for humans. Reading a whole book with pages at 10-25 degree angles gets annoying but does not decrease the accuracy of recognition at the human level.
It is a problem for OCR. One of the solutions is to deskew the page. Undo effect of the global page angle. One of the simplest ways is to measure the angle and then rotate the page. Dust off your school geometry books. Trigonometry is back.
Image skew is most obvious in the tables with long horizontal or vertical lines. This can be seen in the image above.
But how do you find the angle of a skewed page? One way is to perform a vertical projection.
A vertical projection is a histogram: the total number of black pixels in each row of the page image.
You basically write a for loop and sum each row's pixel count. If it is a book page, text lines naturally emerge as clumps of greater numbers with valleys of lower numbers between them. The peaks and valleys are sharpest when the text lines are exactly horizontal, so one common way to find the angle is to try a range of candidate angles and keep the one that gives the sharpest histogram. This works great on books, where text lines clearly give you valleys, but with less success on business forms and documents with graphic content.
Note: in our business form processing products we use other techniques to perform deskew.
After initial pre-processing most OCRs do not operate on color images. They do not even operate on grayscale images. Everything is just white and black.
Initial processing step performs image thresholding. Image is converted into binary black and white format and all future operations use this representation. There are exceptions. Grayscale images might provide additional insight into data and could give better results at the cost of more complex algorithms.
The two most common thresholding methods are global and local. Let's assume we have a grayscale image where white is represented by value 255 and black by value 0.
Binary thresholding seems on the surface like a simple operation: loop over all the pixels in the image and convert all non-white pixels into black. So everything that is not equal to 255 becomes 0. If you do that, all the random noise in the image will get amplified. Groups of barely visible random dots will emerge as blobs of black dots. It will also make existing symbols bolder and in some cases make them touch (this makes segmentation harder).
This is clearly not going to work. A better approach is to choose a clever cut-off value and mark all values above it as white and all values equal to or below it as black. Let's say we choose a gray value of 230 as our cut-off point. Everything lighter than that becomes white (gets erased) and everything darker becomes black. This is global thresholding. The question remains: how to cleverly choose the cut-off point?
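The sketch below shows global thresholding on a grayscale image stored as a list of rows of values 0-255. For choosing the cut-off it uses Otsu's method, a well-known technique that picks the value that best separates dark pixels from light ones. Otsu's method is brought in here as one possible answer to the question above, not as the only one, and the function names are made up for this post.

```python
def global_threshold(gray, cutoff=230):
    """Values above the cut-off become white (255), everything else black (0)."""
    return [[255 if value > cutoff else 0 for value in row] for row in gray]


def otsu_cutoff(gray):
    """Pick the cut-off that maximizes the between-class variance (Otsu's method)."""
    histogram = [0] * 256
    for row in gray:
        for value in row:
            histogram[value] += 1
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    best_cutoff, best_variance = 0, -1.0
    dark_count, dark_sum = 0, 0
    for cutoff in range(256):
        dark_count += histogram[cutoff]
        dark_sum += cutoff * histogram[cutoff]
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue  # one class is empty, nothing to separate
        dark_mean = dark_sum / dark_count
        light_mean = (sum_all - dark_sum) / light_count
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_cutoff, best_variance = cutoff, variance
    return best_cutoff


gray = [[250, 240, 20], [30, 245, 25]]
print(global_threshold(gray, otsu_cutoff(gray)))
```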
Global thresholding has limitations. It is not flexible. If the image has a non-uniform background or uneven lighting, global thresholding will work fine on part of the image but will not produce a good result on the parts with darker background pixels. This is less of a problem for paper scans but a big problem for other types of OCR. The alternative to global thresholding is local adaptive thresholding. Instead of one fixed cut-off value, local thresholding uses a variable value that changes throughout the image.
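One simple variant of local thresholding, sketched below, compares every pixel with the average of its neighborhood. The window `radius` and the `offset` are illustrative parameters that would need tuning on real images. Other local methods use different statistics.

```python
def local_threshold(gray, radius=1, offset=10):
    """A pixel becomes black if it is darker than its neighborhood mean minus an offset."""
    height, width = len(gray), len(gray[0])
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            # Collect the window around (x, y), clipped at the image borders.
            window = [
                gray[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            cutoff = sum(window) / len(window) - offset
            out_row.append(255 if gray[y][x] > cutoff else 0)
        result.append(out_row)
    return result


# Background gets darker from left to right; two darker strokes sit on top of it.
row = [240, 225, 110, 195, 180, 165, 50, 135, 120]
print(local_threshold([row, row, row])[0])
```

Both strokes come out black even though the background on the right is much darker than on the left, because each pixel is only compared with its immediate surroundings.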
OCR engines have many processing steps. When first looking at the problem, researchers and software developers usually assume that pattern recognition of the symbol is the most important step.
From our experience, once you have the exact blob of pixels that constitutes a symbol, pattern matching can be done with modern machine learning techniques. It is not a simple problem, but it is solvable.
Sometimes during printing and scanning some symbols touch. They can touch at the top. For example: "r" will touch "n" (rn) forming letter "m". This is extreme example. But other less extreme examples are also common.
Segmentation is much harder problem. Can we just ignore it? Unfortunately, no. Depending on quality of the images you are getting there can be many symbols that need to be split.
Letters "y" and "w" clearly touch in the image above. There is also a tiny gray pixel bridge from "w" to "h". It may not be clearly visible in this image. Depending on what the image thresholding step inside OCR does, this bridge may disappear (get converted to white). If thresholding does not erase it, the three letters "ywh" form a single connected component and segmentation will have to deal with it.
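The effect of such a bridge is easy to reproduce. The sketch below labels connected components with a breadth-first search using 8-connectivity (diagonal neighbors count as touching) and counts the blobs for two different cut-off values. The tiny test image and the function names are made up for illustration.

```python
from collections import deque


def connected_components(binary):
    """Group black pixels (1) into components; returns a list of lists of (x, y)."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            component = []
            while queue:
                cx, cy = queue.popleft()
                component.append((cx, cy))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            components.append(component)
    return components


def count_blobs(gray, cutoff):
    """Threshold a grayscale image, then count its connected components."""
    binary = [[1 if value <= cutoff else 0 for value in row] for row in gray]
    return len(connected_components(binary))


# Two black blobs joined by a light gray (200) bridge in the middle row.
gray = [
    [0, 0, 255, 255, 0, 0],
    [0, 0, 200, 200, 0, 0],
    [0, 0, 255, 255, 0, 0],
]
print(count_blobs(gray, 230), count_blobs(gray, 128))
```

With the cut-off at 230 the bridge survives and the two blobs merge into one component. With the cut-off at 128 the bridge is erased and they stay separate.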
OCR engines have to perform a segmentation step and split letters before pattern recognition can begin. Even a little thinking reveals that it is a big problem:
Many approaches to solving this problem have been tried, with limited success.
One of the approaches is to over-segment. Super short explanation: cut the blob into more pieces than needed, then use the recognition part of the OCR engine to attach probabilities to a small graph. Each edge of the graph is a hypothesis with a probability. Each vertex is a segment of a symbol or a full symbol. Choose the path through the graph that scores the highest total probability.
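A minimal sketch of the path search follows. It uses one possible encoding, which is an assumption of this example: cut positions are numbered 0..N from left to right, and every hypothesis says "the pieces between cut `start` and cut `stop` form this symbol with this probability". The score of a path is the product of its probabilities. The probabilities below are invented, in a real engine they would come from the recognizer.

```python
import math


def best_segmentation(last_cut, hypotheses):
    """hypotheses: {(start, stop): (label, probability)} with start < stop.

    Returns (labels, probability) for the best path from cut 0 to last_cut,
    or None if no path covers the whole blob.
    """
    best = {0: (0.0, [])}  # cut index -> (log probability, labels so far)
    for end in range(1, last_cut + 1):
        for (start, stop), (label, probability) in hypotheses.items():
            if stop != end or start not in best or probability <= 0:
                continue
            score = best[start][0] + math.log(probability)
            if end not in best or score > best[end][0]:
                best[end] = (score, best[start][1] + [label])
    if last_cut not in best:
        return None
    log_score, labels = best[last_cut]
    return labels, math.exp(log_score)


# A blob cut into two pieces: is it "r" followed by "n", or a single "m"?
hypotheses = {(0, 1): ("r", 0.6), (1, 2): ("n", 0.7), (0, 2): ("m", 0.5)}
print(best_segmentation(2, hypotheses))
```

Here "r" followed by "n" scores 0.6 x 0.7 = 0.42, which loses to the single "m" at 0.5. Lower the "m" probability to 0.3 and the two-letter reading wins instead.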
Segmentation is one of the main problems.
Pattern recognition is well covered subject in papers and books on machine learning. Most recent research is focused on neural networks.
Hardware is getting faster. Oops. Let me correct this statement.
Hardware is getting more cores. It is getting parallelized.
CPUs designed for general computing do not increase much in speed. When you buy new powerful computer you are likely getting more cores and more RAM.
But cores themselves are not much faster.
Hardware is getting faster for software that can take advantage of multi-core architecture.
If you can take advantage of the increasing number of cores you can build much more powerful neural networks. Different neural networks enter the scene: convolutional (CNN), time delay (TDNN), radial basis function (RBF), recurrent neural networks (RNN), and many more. Many of them are simply experiencing a renaissance. Hardware got fast enough to run and train them.
Some network configurations are hard to parallelize, especially during training, because typical backpropagation methods propagate error values back through the network in a sequential fashion. There are some advanced techniques for doing it.
With so many different neural network configurations it is easy to get lost trying to pick one that works for you. Start with the simplest one that requires least amount of time to train but gives decent results. You want to get "your feet wet" first.
Start small. Prefer simplicity first.
A subject that is not well covered in the literature on neural networks and OCR is feature vectors. It is what you feed the neural network and how. What features are important and how do you encode them as inputs for the neural network?
Typical neural networks (NN) have a fixed number of input nodes. Images of scanned paper are two dimensional. A group of your algorithms runs and gets to the point where you have blobs of pixels in 2D space representing symbols to be recognized. Since you can have different size letters on the page, 2D blobs can have different sizes. But the number of NN input nodes is fixed. Do you see the problem?
Do you just rescale input blobs to fit a fixed grid and then convert the grid into a vector? Then train the neural network with it and hope that the "magic" of NN pattern recognition will work?
The problem with rescaling is that image rescaling is usually based on pixel interpolation algorithms. They operate the same way on all images. If you simply downscale images of symbols into a fixed grid you get fusing effects. Suddenly the letter "i" gets its dot fused to the main body and becomes "l". If you upscale images into a very large grid then the interpolation algorithm creates pixels that do not exist in the original image.
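The naive approach and its fusing effect can be sketched in a few lines. The function below averages the blob's pixels into a fixed grid and flattens the grid into a vector. It is shown only to illustrate the problem, not as a recommended feature design, and the name `blob_to_feature_vector` and the grid size are assumptions.

```python
def blob_to_feature_vector(blob, grid=4):
    """Rescale a binary blob (rows of 0/1) to grid x grid cells and flatten it.

    Each output value is the fraction of black pixels in the source area
    covered by that cell. Small blobs are upscaled by repeating pixels.
    """
    height, width = len(blob), len(blob[0])
    vector = []
    for gy in range(grid):
        y0 = gy * height // grid
        y1 = max(y0 + 1, (gy + 1) * height // grid)  # always cover at least one row
        for gx in range(grid):
            x0 = gx * width // grid
            x1 = max(x0 + 1, (gx + 1) * width // grid)
            cells = [blob[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            vector.append(sum(cells) / len(cells))
    return vector


# A crude letter "i": a dot, a one-pixel gap, then the stem.
letter_i = [[1, 1], [0, 0]] + [[1, 1]] * 6
vector = blob_to_feature_vector(letter_i)
print(vector[:4], vector[4:8])
```

The dot and the gap land in the same grid row and average out to 0.5. The gap that separates the dot from the stem is no longer visible in the vector, which is exactly the "i" turning into "l" effect.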
There is no silver bullet here. Designing feature vectors that result in better pattern recognition is an art.
The step where the 2D blob representing a symbol as pixels gets converted into the feature vector that is fed into the neural network is critical.
There are a number of papers published on OCR and its more general cousin - computer vision. If you are planning on developing OCR systems it is a good idea to read them.
A big source of inspiration is the International Conference on Document Analysis and Recognition (ICDAR). It is an international academic conference held every two years. It covers a number of topics including character recognition, printed/handwritten text recognition, graphics analysis and recognition, and document analysis.
Ideas are good. They give you food for thought. But ideas without testable code are just that - ideas.
Take academic papers with a grain of salt.