Introduction

Recent developments in machine learning (ML) have enabled the automation of increasingly complex tasks across multiple fields. For example, the explosion of deep learning has made extracting valuable information from data much more practical, flexible, and robust, thanks to its ability to identify complex patterns in unstructured data. One very practical area where we’ve seen a lot of improvement recently is so-called Intelligent Document Processing (IDP). The underlying goal of IDP is to extract relevant information from documents and convert it into structured data that can later be used either by a person or by another intelligent algorithm. This technology offers increased productivity, faster processes, and reduced errors, and consequently brings more value to the customer.

For scanned or handwritten documents, extracting text accurately is a crucial step in IDP workflows, as the text cannot be read directly from the file. Such extraction can be achieved with Optical Character Recognition (OCR) - a technique that detects and classifies characters in an image to recover the content of a document. Although OCR is one of the earliest computer vision tasks to be addressed, recent developments in deep learning-based algorithms have led to much greater accuracy of detected words and sentences. This trend developed in parallel with the popularisation of cloud computing, and as the Big 3 cloud platforms (Microsoft Azure, Google Cloud Platform and Amazon Web Services) competed for market share, integrating the latest ML capabilities became part and parcel of their offerings.

The cost of using the OCR services from the Big 3 is comparable and relatively low, so they are primarily competing on performance. Surprisingly, however, we haven’t found any good comparisons, either from the vendors themselves or from a third party. In fact, the vendors do not even publish the performance of their own services. This makes it difficult to decide which vendor to choose when integrating an OCR-related feature into a product. This post provides a comparison across the major OCR services. Curvestone conducted experiments to assess their performance over three different datasets, and we calculated metrics that indicate the effectiveness of both the detection of word bounding boxes and the correctness of the detected words.

Methodology

To compare OCR services, we needed to unify the method used to benchmark them. We were particularly interested in the performance of detection and the correctness of detected words. The general approach to assessing a single service on a single dataset was:

1. Flatten the dataset so that each page is a separate image

2. For each page:

       a. Detect words and their bounding boxes (using OCR service)

       b. Normalize format of bounding boxes

       c. Match detections with corresponding ground truth

3. Calculate metrics

4. Average the per-file metrics to assess service performance across the dataset

The rest of this section describes our approach to these steps and lists the OCR services we compared.
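As a rough illustration of step 4, the sketch below averages per-page metrics into one score per metric for a dataset. The names `evaluate_dataset` and `score_page` are hypothetical, and `score_page` stands in for steps 2 and 3 (calling an OCR service, matching boxes and computing metrics); none of this is taken from our actual benchmarking code.

```python
from statistics import mean
from typing import Callable, Dict, Iterable

PageMetrics = Dict[str, float]  # e.g. {"accuracy": 0.9, "mean_iou": 0.7}


def evaluate_dataset(
    pages: Iterable[object],
    score_page: Callable[[object], PageMetrics],
) -> PageMetrics:
    """Score every page, then average each metric across all pages.

    `score_page` is a hypothetical callable covering steps 2 and 3:
    it receives one page image and returns that page's metrics.
    """
    per_page = [score_page(page) for page in pages]
    if not per_page:
        return {}
    # Every page carries equal weight, regardless of how many words it has.
    return {name: mean(m[name] for m in per_page) for name in per_page[0]}
```

Averaging per page (rather than pooling all words) means a sparse page counts as much as a dense one, which is one possible design choice among several.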

Bounding boxes structure and matching

Different datasets and services use different bounding box formats. Before we matched detections, we needed to normalize their format. We transformed detections from each source so the resulting bounding boxes have the following format:

  • Each bounding box contained information about the word and its location on a page
  • We used the upper-left and lower-right corners to define the bounding box
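A minimal sketch of such a normalization is shown below. The three input layouts (a polygon of corner points, an origin plus width and height in pixels, and the same expressed as fractions of the page size) are assumptions about the kinds of formats one typically encounters; the class and function names are hypothetical, not part of any vendor SDK.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class WordBox:
    """A word and its axis-aligned box: upper-left and lower-right corners."""
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def from_polygon(text: str, points: Iterable[Tuple[float, float]]) -> WordBox:
    """Enclose a polygon (e.g. four corner points) in an axis-aligned box."""
    xs, ys = zip(*points)
    return WordBox(text, min(xs), min(ys), max(xs), max(ys))


def from_xywh(text: str, x: float, y: float, width: float, height: float) -> WordBox:
    """Convert an upper-left corner plus width and height, all in pixels."""
    return WordBox(text, x, y, x + width, y + height)


def from_relative_xywh(
    text: str,
    left: float,
    top: float,
    width: float,
    height: float,
    page_width: float,
    page_height: float,
) -> WordBox:
    """Convert a box given as fractions of the page size into pixels."""
    return from_xywh(
        text,
        left * page_width,
        top * page_height,
        width * page_width,
        height * page_height,
    )
```

Note that enclosing a rotated polygon in an axis-aligned box slightly enlarges it, which can lower the IoU for skewed scans.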

To match detections with the ground truth, we calculated the Intersection over Union (IoU, explained below) between all pairs of bounding boxes and matched those with an IoU larger than a threshold (we found 0.5 to work well).
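The matching step could be sketched as follows. Boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples, and the greedy one-to-one assignment (best IoU first) is our assumption for illustration; the description above only requires the IoU to exceed the threshold.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Area of intersection divided by area of union of two boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the boxes do not overlap
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def match_boxes(
    detections: List[Box],
    ground_truth: List[Box],
    threshold: float = 0.5,
) -> List[Tuple[int, int, float]]:
    """Return (detection index, ground-truth index, IoU) for matched pairs."""
    # Score all pairs and keep only those above the threshold.
    candidates = [
        (iou(d, g), i, j)
        for i, d in enumerate(detections)
        for j, g in enumerate(ground_truth)
    ]
    candidates = [c for c in candidates if c[0] > threshold]
    candidates.sort(key=lambda c: c[0], reverse=True)

    # Greedy one-to-one assignment: each box is used at most once.
    used_detections, used_truths, matches = set(), set(), []
    for score, i, j in candidates:
        if i in used_detections or j in used_truths:
            continue
        used_detections.add(i)
        used_truths.add(j)
        matches.append((i, j, score))
    return matches
```

Comparing all pairs is quadratic in the number of words on a page, which is usually acceptable for single pages.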

Metrics

We use the following metrics to compare OCR services:

  • Intersection over Union (IoU) - measures the overlap between two bounding boxes. It is calculated by dividing the area of their intersection by the area of their union.
  • Accuracy - indicates the performance of the detection. It is calculated by dividing the number of matched bounding boxes by the total number of bounding boxes on a page.
  • Levenshtein distance - indicates the difference between the detected word and the ground truth. Its value is the number of single-character edits (insertions, deletions or substitutions) needed to turn one word into the other.
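For reference, the accuracy and Levenshtein metrics can be sketched in a few lines. Two points are assumptions on our part for this illustration: that accuracy is taken relative to the number of ground-truth boxes, and that the per-page Levenshtein value is a plain average over matched word pairs. The function names are hypothetical.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the inner row short
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # drop a character from `a`
                current[j - 1] + 1,      # add a character from `b`
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def detection_accuracy(num_matched: int, num_ground_truth: int) -> float:
    """Share of ground-truth boxes that were matched (assumed denominator)."""
    return num_matched / num_ground_truth if num_ground_truth else 0.0


def mean_levenshtein(pairs: List[Tuple[str, str]]) -> float:
    """Average distance over (detected word, ground-truth word) pairs."""
    if not pairs:
        return 0.0
    return sum(levenshtein(d, g) for d, g in pairs) / len(pairs)
```

For example, `levenshtein("c1oud", "cloud")` is 1, because a single substitution fixes the misread character.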

Providers and services

We wanted to compare the three most widely used providers:

  • Microsoft Azure - cloud platform from Microsoft, it provides two OCR services:
      • Azure OCR - older service, presumably exists due to legacy reasons
      • Azure Read - newer service using state-of-the-art techniques
  • Google Cloud Platform
      • Cloud Vision API - only OCR service from Google, using state-of-the-art techniques
  • Amazon Web Services
      • Amazon Textract - only OCR service from Amazon, using state-of-the-art techniques

Experiments

The experiments were performed over three datasets to check the performance of the services on documents of varying quality. Two of them are datasets commonly used to benchmark Intelligent Document Processing tasks:

  • DocBank - The dataset is composed of documents created using LaTeX. It contains good-quality documents that we use to emulate digital documents (e.g., image-based PDF files)

Fig 1. Samples of documents from DocBank dataset.

  • FUNSD - The dataset contains noisy document scans; we use it to check the performance on poor-quality documents

Fig 2. Samples of documents from FUNSD dataset.

From DocBank, we derived a synthetic dataset by scaling each image down and then back up by a factor of 2 and introducing blur. We did this to emulate the scanning of physical documents.

Fig 3. Comparison between real DocBank (left) and synthetically degraded DocBank dataset (right).
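The degradation can be illustrated with a small dependency-free sketch that treats an image as a grid of grayscale values. The interpolation (block averaging down, pixel repetition up) and the blur (a box blur of radius 1) are assumptions chosen for simplicity, since the exact settings are not stated above; in practice an image library would do this work.

```python
from typing import List

Image = List[List[float]]  # rows of grayscale pixel values in 0..255


def downscale(image: Image, factor: int = 2) -> Image:
    """Shrink by averaging each factor x factor block (remainders dropped)."""
    height, width = len(image) // factor, len(image[0]) // factor
    return [
        [
            sum(
                image[y * factor + dy][x * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ) / factor ** 2
            for x in range(width)
        ]
        for y in range(height)
    ]


def upscale(image: Image, factor: int = 2) -> Image:
    """Enlarge by repeating every pixel `factor` times in both directions."""
    return [
        [value for value in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


def box_blur(image: Image, radius: int = 1) -> Image:
    """Replace each pixel by the mean of its neighbourhood, clipped at edges."""
    height, width = len(image), len(image[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(sum(window) / len(window))
        blurred.append(row)
    return blurred


def degrade(image: Image, factor: int = 2, radius: int = 1) -> Image:
    """Scale down, scale back up, then blur."""
    return box_blur(upscale(downscale(image, factor), factor), radius)
```

Scaling down discards fine detail that scaling up cannot restore, so the result keeps the original size but loses sharpness, much like a low-resolution scan.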

We refer to the datasets as:

  • Digital - DocBank
  • Degraded - DocBank with synthetically lowered quality
  • Noisy scan - FUNSD

Experiment 1 - Comparing services from Microsoft Azure

Microsoft Azure has two OCR services - Azure OCR and Azure Read. The former is an older service with an outdated model, presumably still available to support legacy projects. Azure Read uses a state-of-the-art OCR model, so it should perform better. However, Microsoft does not share any insights into the difference in the quality of detections between these two services, so we decided to check whether upgrading is worth the effort.

The numerical results in Table 1 indicate that Azure Read outperforms Azure OCR for all datasets. The difference between the services increases as the quality of the input data decreases. These results are in line with intuition: researchers are now focusing their efforts on developing networks that work well for data of increasingly poor quality.

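| Dataset    | Service    | Accuracy | Mean IoU | Levenshtein |
|------------|------------|----------|----------|-------------|
| Digital    | Azure OCR  | 0.855    | 0.662    | 0.153       |
| Digital    | Azure Read | 0.889    | 0.715    | 0.153       |
| Degraded   | Azure OCR  | 0.852    | 0.679    | 0.186       |
| Degraded   | Azure Read | 0.886    | 0.709    | 0.158       |
| Noisy scan | Azure OCR  | 0.575    | 0.647    | 0.584       |
| Noisy scan | Azure Read | 0.864    | 0.721    | 0.417       |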

Table 1. Numerical comparison of the performance of the two Microsoft Azure OCR services. Azure Read outperforms Azure OCR across all datasets, and the difference between them increases as the quality of the data decreases.

The distribution of mean IoU is visualised in Fig 4. It shows how closely the detections match the ground truth. We can observe that Azure Read not only outperforms Azure OCR, but its results are also more consistent - its performance is more stable across a wide range of documents. Fig 5 visualises the distribution of the Levenshtein distance between detected words and the ground truth. This metric indicates the correctness of detections: the lower the distance, the closer the detected word is to the ground truth. Again, Azure Read outperformed Azure OCR, especially for lower-quality documents. It is worth noting that Azure Read had a few outliers, but on average it performed much better.

Fig 4. The distribution of IoU for all files in a given dataset. Azure Read not only outperforms Azure OCR, but its results are also much more consistent.

Fig 5. The distribution of Levenshtein distance for all files in a given dataset. The difference between the two services increases with a decrease in data quality.

Experiment 2 - Comparing the providers

In the second experiment, we compared OCR services from the Big 3 providers: Azure Read from Microsoft Azure, Cloud Vision API from Google Cloud Platform and Amazon Textract from Amazon Web Services. We decided not to include Azure OCR, as it does not use a state-of-the-art model to detect text and performs worse than Azure Read.

The numerical results for this experiment are gathered in Table 2. For digital and degraded documents, none of the services clearly outperforms the others, although Cloud Vision API seems to perform worse than the services from Amazon and Microsoft. For noisy scans, Amazon Textract turned out to be better in terms of the correctness of detected words; it also achieves moderately better accuracy and mean IoU.

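| Dataset    | Service                   | Accuracy | Mean IoU | Levenshtein |
|------------|---------------------------|----------|----------|-------------|
| Digital    | Azure Read                | 0.889    | 0.715    | 0.153       |
| Digital    | Cloud Vision API (Google) | 0.876    | 0.754    | 0.384       |
| Digital    | AWS                       | 0.877    | 0.724    | 0.150       |
| Degraded   | Azure Read                | 0.886    | 0.709    | 0.158       |
| Degraded   | Cloud Vision API (Google) | 0.870    | 0.718    | 0.376       |
| Degraded   | AWS                       | 0.872    | 0.740    | 0.184       |
| Noisy scan | Azure Read                | 0.864    | 0.721    | 0.417       |
| Noisy scan | Cloud Vision API (Google) | 0.835    | 0.679    | 0.471       |
| Noisy scan | AWS                       | 0.870    | 0.751    | 0.370       |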

Table 2. The numerical comparison between the performance of OCR services. Results are similar across all providers for digital and degraded datasets. For noisy scans, Amazon Textract outperforms Azure Read and Cloud Vision API for all metrics.

The distributions of IoU and Levenshtein distance are visualized in Fig 6 and Fig 7, respectively. Although Amazon Textract and Azure Read yield similar mean IoU results, the latter seems to be more consistent across documents. In terms of correctness, Cloud Vision API lags behind its competitors. Surprisingly, it yields both the best mean IoU score and the worst Levenshtein distance for digital documents. This means that it locates words better than the other services but fails to read them correctly.

Fig 6. The distribution of IoU for all files in a given dataset.

Fig 7. The distribution of Levenshtein distance for all files in a given dataset.

Conclusions

We compared four Optical Character Recognition (OCR) services provided by the Big 3 cloud platforms: Microsoft, Google and Amazon. We used three document datasets (two original, one synthesized) to analyse performance across a range of document quality and emulate real-world situations. We found that the newer Azure Read, as expected, outperformed the older Azure OCR service, with the difference in performance increasing as document quality decreased. For consistently high-quality documents, a migration from the older Azure OCR to Azure Read might not be justified by performance alone, depending on the use case. It is also worth noting that other factors might affect the decision, e.g., pricing and asynchronous processing.

In terms of the cross-vendor comparison, Amazon Textract moderately outperformed the other services on noisy scanned documents, both in terms of word transcription and word location. For good-quality documents, the differences between services were negligible and probably shouldn’t affect the decision to adopt. Instead, other factors, such as the wider cloud platform ecosystem and pricing, would be more important.