Prediction of the Optical Character Recognition Accuracy based on the Combined Assessment of Image Binarization Results

Abstract: In the paper, the problem of reliable evaluation of the effects of image binarization is discussed in view of image recognition accuracy. The accuracy of Optical Character Recognition methods, typically used for document images obtained by cameras or scanners, is strongly dependent on the results of image binarization. Unfortunately, metrics typically used for the evaluation of binarization results, such as Peak Signal to Noise Ratio, Distance Reciprocal Distortion or Misclassification Penalty Metric, are not always well correlated with the recognition accuracy of individual characters. Therefore, a novel approach based on the use of a combined metric for the assessment of binarization results is proposed and verified for binary images obtained using some popular histogram-based methods from original images of degraded quality. For the experimental prediction of the character recognition accuracy, the popular open-source engine supported by Google, known as Tesseract, has been used.


I. INTRODUCTION
The role of Optical Character Recognition (OCR) algorithms among image analysis and recognition methods is still important, regardless of the fact that more and more sophisticated methods based on the analysis of words and phrases are included in modern OCR systems. Since the application area of such image-based text recognition methods is still growing, mainly due to the development of mobile devices equipped with relatively cheap cameras, there is still a need to develop fast and reliable algorithms which are able to recognize individual characters properly in the presence of various distortions or different lighting conditions. A proper recognition of text from a document image captured by a smartphone's camera is not always an easy task, assuming unknown lighting conditions and limited available resources.
Nevertheless, one of the most relevant elements of text recognition workflows is still the image binarization step. Several more or less complicated algorithms can be applied for this purpose, such as the popular methods proposed by Otsu [1], Sauvola [2], Niblack [3], Rosin [4] and Kapur [5]. Their performance differs significantly for lower quality input images, noticeably influencing the text recognition accuracy.
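As an illustration of the binarization step, the sketch below implements a global threshold selection in the spirit of Otsu's method [1], i.e. choosing the grey level which maximizes the between-class variance of the histogram. It is a minimal sketch and not the implementation used in the experiments; the function names `otsu_threshold` and `binarize`, the 256-level assumption and the convention that dark text is mapped to 0 are assumptions made here for illustration.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t maximizing the between-class variance.

    `pixels` is a flat iterable of integer grey levels in [0, levels).
    Pixels with value <= t form the dark class, the rest the bright class.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class
    sum0 = 0  # sum of grey levels in the dark class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # bright class empty, nothing left to split
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Between-class variance up to the constant factor 1 / total**2.
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, t):
    """Map a 2-D list of grey levels to 0 (dark, <= t) and 1 (bright, > t)."""
    return [[1 if p > t else 0 for p in row] for row in image]
```

For a scanned page stored as a 2-D list `image`, the threshold would be computed from the flattened pixel list and then passed to `binarize`. Methods such as Kapur's or Rosin's differ only in the criterion used to select `t` from the histogram.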

II. EVALUATION OF BINARIZATION RESULTS
One of the unsolved issues related to this field is still the evaluation of binary images in view of their usefulness for further text recognition. Since the process of image binarization is ambiguous and various algorithms lead to different results, there is a need for a reliable comparison of the outputs of the binarization algorithms. Unfortunately, there are no "blind" methods for this purpose, although such methods, which do not require the knowledge of the original image, are increasingly popular in the image quality assessment area. For the evaluation of binarization results the original "ground-truth" image has to be provided, since all the metrics are calculated on the basis of a relatively simple comparison of binary values for corresponding pixels. Therefore, both compared images must be geometrically matched to each other in order to obtain proper results.
Considering the problem of prediction of the OCR accuracy, an ideal solution would be the use of a "blind" (no-reference) metric, but the development of such a metric would be possible only using a large image dataset with results of the character recognition. Moreover, such a metric would probably be suitable only for a limited number of image distortions, specific binarization methods and recognition algorithms.
The first step towards such a solution should be the development of a full-reference image binarization evaluation metric, similarly as for general image quality assessment purposes, which would be well correlated with the character recognition accuracy using different algorithms in the presence of various distortions. Nevertheless, the verification of such a metric also requires a dataset of images containing various distortions, subjected to binarization using different methods, together with the numerical results of the OCR accuracy for each obtained binary image.
For the verification of the idea proposed in the paper, a relatively small dataset of this kind has been prepared, consisting of three pristine images subjected to five types of distortions typical for machine printed documents. In order to simulate them, the original images have been printed on letterhead paper, on the other side of previously printed paper, on an older paper sheet or on colour paper, as well as subjected to wrinkling. The scanned images of such documents have been geometrically matched with the "ground-truth" images and subjected to binarization using three popular histogram-based methods, namely the Otsu, Kapur and Rosin algorithms. The "ground-truth" images used in the experiments are shown in Fig. 1, some results of distortions are presented in Fig. 2, whereas exemplary binarization results are illustrated in Fig. 3.
Since there are some metrics which are often used for the evaluation of the binarization results by the comparison with the "ground-truth" binary image [6]-[8], a natural solution seems to be their application also for this purpose. Unfortunately, the typically used well-known metrics which are fast to compute, such as Peak Signal to Noise Ratio (PSNR), Distance Reciprocal Distortion (DRD) [9] or Misclassification Penalty Metric (MPM) [10], turn out to be rather poorly correlated with the OCR accuracy. For those reasons, we have focused on the development of a combined metric which should be better correlated with the results of the recognition of individual characters.
In order to verify the validity and usefulness of the proposed approach, some experiments have been conducted with the use of Tesseract [11], which is probably the most accurate open-source OCR software, developed previously in HP Labs and now supported by Google.

Piotr Lech, Krzysztof Okarma

III. IDEA OF COMBINED METRIC FOR EVALUATION OF BINARIZATION OUTPUT
Recognition of individual characters on the binary image is strongly dependent on their shapes, which may be influenced, especially on the edges, by the improper choice of the threshold value. A general rule seems to be relatively simple: the higher the number of pixels differing between the result of binarization and the "ground-truth" image, the more errors occur during the character recognition. The results of the OCR using the Tesseract engine for each test image have been compared with the proper results obtained from the original document file and the number of errors has been calculated for each of them, as well as the recognition accuracy defined as

$$\mathrm{Accuracy} = \frac{N_{total} - N_{err}}{N_{total}},$$

where N_err denotes the number of errors and N_total stands for the total number of characters in the text. It is worth mentioning that the recognition accuracy achieved using Tesseract for all "ground-truth" images has been equal to 1.
The obtained recognition accuracy values as well as the values of the three image binarization metrics (PSNR, DRD and MPM) have been stored in vectors consisting of 14 elements each (one image has been removed from the experiments due to an improper binarization result, in order to prevent its impact on the obtained results). Next, Pearson's linear correlation coefficients (PCC) with the recognition accuracy have been calculated for each metric.
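The accuracy computation can be sketched as follows, taking the accuracy as (N_total - N_err) / N_total. The way of counting the errors is an assumption made here for illustration: N_err is taken as the Levenshtein (edit) distance between the reference text and the OCR output, which may differ from the counting procedure actually used in the experiments. The function names `levenshtein` and `ocr_accuracy` are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions turning string `a` into string `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ocr_accuracy(reference, recognized):
    """Accuracy = (N_total - N_err) / N_total.

    N_total is the number of characters in the reference text and N_err
    is (by assumption) the edit distance to the recognized text.
    """
    n_total = len(reference)
    if n_total == 0:
        raise ValueError("reference text must not be empty")
    n_err = levenshtein(reference, recognized)
    return (n_total - n_err) / n_total
```

With this definition an output identical to the reference text gives an accuracy of 1, consistent with the value reported for the "ground-truth" images.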
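For reference, the Pearson's linear correlation coefficient between a vector of metric values and the vector of recognition accuracies can be computed as in the sketch below. The function name `pearson` is chosen here for illustration; the formula itself is the standard sample correlation coefficient.

```python
import math


def pearson(x, y):
    """Pearson's linear correlation coefficient of two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)
```

In the setting described above, `x` would hold the 14 values of a single metric and `y` the 14 corresponding accuracy values.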
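Individual metrics are defined as

$$\mathrm{PSNR} = 10 \log_{10} \frac{C^2}{\mathrm{MSE}}, \qquad \mathrm{MSE} = \frac{1}{MN} \sum_{x=1}^{M} \sum_{y=1}^{N} \left( GT(x,y) - BW(x,y) \right)^2,$$

where GT is the "ground-truth" image, BW denotes the result of binarization, M × N is the image size and C is the difference between the foreground and background values,

$$\mathrm{DRD} = \frac{\sum_{k=1}^{K} \mathrm{DRD}_k}{NU},$$

where NU is the number of non-uniform (neither fully black nor fully white) 8 × 8 blocks in the "ground-truth" image, K is the number of flipped pixels and, for the k-th flipped pixel,

$$\mathrm{DRD}_k = \sum_{i=-2}^{2} \sum_{j=-2}^{2} \left| GT_k(i,j) - BW_k(x,y) \right| \cdot W(i,j),$$

where W is the 5 × 5 normalized weight matrix [9], whereas

$$\mathrm{MPM} = \frac{MP_{FN} + MP_{FP}}{2}, \qquad MP_{FN} = \frac{\sum_{i=1}^{FN} d_{FN}^{i}}{D}, \qquad MP_{FP} = \frac{\sum_{j=1}^{FP} d_{FP}^{j}}{D},$$

where D is the sum of all the pixel-to-contour distances of the ground truth object, and FN and FP are the numbers of false negatives and false positives, respectively, for which the distances d can be calculated.
In order to increase the correlation of the metrics with the OCR accuracy, the Combined Binarization Evaluation Metric (CBEM) has been proposed, in which the three metrics are combined using the parameters a, b and c, whose values are obtained by optimization. Such an idea comes from general image quality assessment, where a similar approach has been successfully applied [12]-[14], leading to a significant increase of the correlation of metrics with the subjective quality evaluations which are available in several dedicated databases.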
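The two pixel-based metrics can be sketched as below for images stored as 2-D lists with values 0 and 1. This is a minimal illustration and not the code used in the experiments. Several details are assumptions: PSNR is computed as 10 log10(1 / MSE), i.e. with a unit difference between foreground and background, the weight matrix follows the usual reciprocal-distance construction (zero at the centre, 1 / distance elsewhere, normalized to sum to 1), neighbours falling outside the image are skipped, and incomplete border blocks are included when counting non-uniform blocks. MPM is omitted because it additionally requires contour distance computations.

```python
import math


def psnr(gt, bw):
    """PSNR for 0/1 images; the MSE equals the fraction of flipped pixels."""
    h, w = len(gt), len(gt[0])
    flipped = sum(gt[y][x] != bw[y][x] for y in range(h) for x in range(w))
    if flipped == 0:
        return math.inf
    mse = flipped / (h * w)
    return 10 * math.log10(1 / mse)


def drd_weights(size=5):
    """Normalized reciprocal-distance weight matrix (assumed construction)."""
    c = size // 2
    raw = [[0.0 if (i == c and j == c) else 1 / math.hypot(i - c, j - c)
            for j in range(size)] for i in range(size)]
    total = sum(map(sum, raw))
    return [[v / total for v in row] for row in raw]


def non_uniform_blocks(gt, block=8):
    """Count blocks of the ground truth containing both 0 and 1 pixels."""
    h, w = len(gt), len(gt[0])
    count = 0
    for y0 in range(0, h, block):
        for x0 in range(0, w, block):
            values = {gt[y][x]
                      for y in range(y0, min(y0 + block, h))
                      for x in range(x0, min(x0 + block, w))}
            if len(values) > 1:
                count += 1
    return count


def drd(gt, bw):
    """Distance Reciprocal Distortion of `bw` with respect to `gt`."""
    h, w = len(gt), len(gt[0])
    weights = drd_weights()
    total = 0.0
    for y in range(h):
        for x in range(w):
            if gt[y][x] == bw[y][x]:
                continue  # only flipped pixels contribute
            for i in range(-2, 3):
                for j in range(-2, 3):
                    yy, xx = y + i, x + j
                    if 0 <= yy < h and 0 <= xx < w:  # skip outside pixels
                        total += abs(gt[yy][xx] - bw[y][x]) * weights[i + 2][j + 2]
    nu = non_uniform_blocks(gt)
    if nu == 0:
        raise ValueError("ground truth has no non-uniform blocks")
    return total / nu
```

Note that higher PSNR values indicate a better binarization result, whereas for DRD lower values are better.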

IV. EXPERIMENTAL RESULTS
The maximum value of Pearson's correlation coefficient between the Tesseract OCR accuracy and the proposed CBEM results has been obtained using the MATLAB functions fminsearch and fminunc. The obtained values of the parameters of the combined metric are equal to a = -1.39, b = -0.83 and c = -4.44, leading to a significant increase of the PCC value from 0.6158 for the best single metric to 0.7145 for the CBEM. The detailed results are presented in Table I, together with the result obtained for the unweighted combined metric (without optimization of its parameters). An additional illustration of the advantages of the proposed approach is provided in Fig. 4 to Fig. 7, where the scatter plots illustrating the values of the metrics and the obtained recognition accuracy for each image are presented. The improved correlation of the proposed CBEM with the recognition accuracy is obtained only due to the optimization of the weighting coefficients, as the PCC value achieved for the unweighted version of the CBEM is even lower than for the single DRD or PSNR metric. Analysing the scatter plots presented in Fig. 4 to Fig. 7, a much more linear relationship between the proposed metric and the OCR accuracy, in comparison to the three single metrics, can be observed.
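The optimization of the parameters can be sketched as below. Two points are assumptions made only for illustration: the CBEM is taken here as a product of powers, PSNR^a · DRD^b · MPM^c, and a simple derivative-free pattern search stands in for the MATLAB routines. The values in `TOY_METRICS` and `TOY_ACCURACY` are made-up numbers which merely exercise the code and are not measurements from the experiments; all function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson's linear correlation coefficient of two vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


def cbem(psnr, drd, mpm, a, b, c):
    """Assumed product-of-powers combination (requires positive metrics)."""
    return (psnr ** a) * (drd ** b) * (mpm ** c)


def pcc_for(params, metrics, accuracy):
    """PCC between the combined metric and the accuracy for given (a, b, c)."""
    a, b, c = params
    try:
        scores = [cbem(p, d, m, a, b, c) for p, d, m in metrics]
        return pearson(scores, accuracy)
    except (ValueError, OverflowError, ZeroDivisionError):
        return -math.inf  # treat numerically invalid parameters as worst case


def pattern_search(objective, start, step=0.5, min_step=1e-3, max_iter=500):
    """Maximize `objective` by trying +/- step along each coordinate and
    halving the step whenever no coordinate move improves the value."""
    best = list(start)
    best_val = objective(best)
    for _ in range(max_iter):
        if step < min_step:
            break
        improved = False
        for k in range(len(best)):
            for delta in (step, -step):
                trial = list(best)
                trial[k] += delta
                val = objective(trial)
                if val > best_val:
                    best, best_val, improved = trial, val, True
        if not improved:
            step /= 2
    return best, best_val


# Made-up (PSNR, DRD, MPM) triples and accuracies, for illustration only.
TOY_METRICS = [(18.0, 2.0, 0.8), (20.0, 1.5, 0.6), (15.0, 4.0, 1.5),
               (22.0, 1.0, 0.4), (17.0, 3.0, 1.0), (19.0, 2.5, 0.9)]
TOY_ACCURACY = [0.95, 0.97, 0.88, 0.99, 0.92, 0.93]
```

Calling `pattern_search(lambda p: pcc_for(p, TOY_METRICS, TOY_ACCURACY), [1.0, 1.0, 1.0])` starts from the unweighted combination and returns the parameters found together with the resulting PCC. Since only improving moves are accepted, the result is never worse than the starting point, but, as with fminsearch, only a local maximum is guaranteed.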

V. CONCLUSIONS
The novel approach based on the application and optimization of the combined metric proposed in the paper has led to a noticeably higher correlation with the OCR accuracy than any of the single metrics. Since such methods have not been applied before in OCR applications and binary image analysis, it may be an interesting stimulation for the development of new OCR algorithms for document images of degraded quality.
The additional verification conducted for more demanding images available in the DIBCO'2011 dataset, widely used by the community for the verification of document image binarization algorithms, has led to similar conclusions. However, due to some problems caused by the presence of serious degradations as well as some historical gothic font shapes, the OCR accuracy values are not representative and therefore they have not been presented in the paper.
Nevertheless, the obtained results are encouraging for further research, which should concentrate on the development of a larger database annotated with the results of character recognition, as well as on the development of metrics (preferably no-reference) even better correlated with the OCR accuracy, at least for some typical font shapes and typical distortions.

Fig. 2. Original images used during the experiments subjected to exemplary distortions.

Fig. 3. Exemplary binarization results, from top to bottom: using the Otsu, Kapur and Rosin algorithms.
Since most of the metrics typically used for the evaluation of the binarization algorithms are based on the similar

Fig. 4. Scatter plot illustrating the correlation between the DRD metric and the OCR accuracy.

Fig. 6. Scatter plot illustrating the correlation between the PSNR metric and the OCR accuracy.