October 12, 2017

Image Data Generation for Optical Character Recognition (OCR)


In today's world, data is everywhere, from your smartwatch monitoring your heart rate to Uber using your location to send you a cab. Data is the essential element in solving the problems people face, because it captures the patterns in how humans behave and interact with their environment. With giants like Google, Microsoft and other tech companies riding the ever-growing trends in Artificial Intelligence, we opted to take on a challenging task: teaching computers to read printed text. That's right!

We here at Cloudadic Inc are on our way to solving the problem of perception in artificial intelligence. But before we get there, this blog post walks you through the methodologies we are using.

Need for data!

Data is the most important part of solving any machine learning, deep learning or artificial intelligence problem. Getting the right data, and the right amount of it, makes the difference in reaching the best possible solution. We here at Cloudadic Inc are interested in the Optical Character Recognition problem in the field of computer vision, particularly for images of receipts, vouchers, and medicine/tablet names. You can find images of receipts and various tablets on the internet. They are good for seeing how diverse the data is and for understanding its distribution, meaning the font, color, size and styling of the text. The more diverse the data we can collect, understand and train our model on, the better the performance we can expect from it.


Our aim in generating synthetic data is to enable computer vision and deep learning algorithms to learn the diverse fonts, styling and distribution of characters found in English text. We generated the data using the PIL library in Python, which uses the font files available on the OS. Since we are focused only on reading receipt and medicine images, we searched for the fonts most likely to be used in receipts and collected around 20 different fonts. With the font files in hand, we wanted to generate data that is as realistic as possible. To do so, we first generate the text the image should contain and set how bold the text should be. In this case, we made a list of letters, numbers and special characters. Later, we combined the two to generate images containing both words and numbers. The words were chosen from the Gutenberg corpus from NLTK.

Images of text on a white background are the ideal data we dream about, but in reality images are subject to a lot of noise, smudges and other distortions. So, for our model to be robust to all of these, we also generated images with a grey background, added salt-and-pepper noise, and rotated the text in random directions to bring the data as close as possible to real-world data. In total, we generated 6.2 million images.
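To make two of these steps concrete, here is a minimal sketch of drawing a random label and adding salt-and-pepper noise to a grey canvas. It is not our production generator: it uses only the Python standard library and represents the image as a plain list of pixel rows instead of a PIL image, and the names CHARSET, random_label and add_salt_and_pepper, the character set and the 5% noise level are all assumptions made for illustration.

```python
import random
import string

# Hypothetical character set: letters, digits and a few special characters.
CHARSET = string.ascii_letters + string.digits + "$%.,:-/"


def random_label(rng, min_len=3, max_len=10):
    """Draw a random string that would be rendered as the text of one image."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(CHARSET) for _ in range(length))


def add_salt_and_pepper(image, amount, rng):
    """Return a copy of a grayscale image (rows of 0-255 ints) in which a
    fraction `amount` of the pixels is forced to black (pepper) or white (salt)."""
    noisy = [row[:] for row in image]
    height, width = len(noisy), len(noisy[0])
    n_noisy = int(amount * height * width)
    # Sample distinct pixel positions so exactly n_noisy pixels are touched.
    for index in rng.sample(range(height * width), n_noisy):
        y, x = divmod(index, width)
        noisy[y][x] = rng.choice((0, 255))
    return noisy


rng = random.Random(0)                        # fixed seed for repeatability
label = random_label(rng)
background = [[200] * 64 for _ in range(32)]  # grey canvas, 64 x 32 pixels
noisy = add_salt_and_pepper(background, amount=0.05, rng=rng)
```

In a real pipeline the label would first be drawn onto the canvas with a font, and the noise would be applied to the rendered image afterwards.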

OCR executed image

Generating all these images is not enough on its own, so we decided to go one step further and understand the text we can find in the wild. But before we get into the more complicated stuff, here is a glimpse of our approach to segmenting words in an image.

Segmenting the Words from a Sentence in an Image

The objective of the 'Wordinator' module is to segment words from an image of a single sentence. The task is carried out using a set of basic computer vision operations in OpenCV. Suppose we have an image like this and would like to segment each part of it that contains some kind of information.

OCR Image

First, we applied operations such as noise filtering and thresholding to enhance the text and binarize the image, so that it looks like this and is easier for the computer to understand.
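As a rough illustration of this step, the sketch below smooths a tiny grayscale image with a 3x3 median filter and binarizes it with Otsu's threshold. Both choices are assumptions, since this post does not name the exact filter or thresholding method, and the code is written in plain Python so the logic is visible. In OpenCV the corresponding calls would be cv2.medianBlur and cv2.threshold with the cv2.THRESH_OTSU flag. The function names median_filter3 and otsu_threshold are hypothetical.

```python
from statistics import median


def median_filter3(image):
    """Apply a 3x3 median filter; border pixels are left unchanged."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [image[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = int(median(window))
    return out


def otsu_threshold(image):
    """Return the grey level that maximises the between-class variance."""
    hist = [0] * 256
    for row in image:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_score = 0, -1.0
    n_dark, sum_dark = 0, 0
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        n_bright = total - n_dark
        if n_dark == 0:
            continue
        if n_bright == 0:
            break
        mean_dark = sum_dark / n_dark
        mean_bright = (sum_all - sum_dark) / n_bright
        score = n_dark * n_bright * (mean_dark - mean_bright) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


# Toy input: bright background, a dark 3x3 "character" and one noise speck.
image = [[220] * 9 for _ in range(9)]
for y in range(3, 6):
    for x in range(3, 6):
        image[y][x] = 30
image[1][1] = 30

smoothed = median_filter3(image)
threshold = otsu_threshold(smoothed)
# 1 marks ink (dark text), 0 marks background.
binary = [[1 if p <= threshold else 0 for p in row] for row in smoothed]
```

The isolated speck disappears after filtering while the centre of the character survives. Note that a median filter also rounds off sharp corners, which is one reason to keep the window small for thin strokes.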

OCR Image

In the next operation, we used the Run Length Smearing Algorithm (RLSA) to connect nearby contours, so that the characters of each word merge into a single connected region. The parameters of the algorithm were updated dynamically based on size and on the contours' width and height.
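The idea behind horizontal run length smearing can be sketched in a few lines: any run of background pixels shorter than a chosen limit that sits between two ink pixels is filled with ink. The version below is a simplified illustration and not the Wordinator code. The function name rlsa_horizontal and the parameter max_gap are hypothetical, and the limit is passed in as a fixed number instead of being derived from the contour sizes.

```python
def rlsa_horizontal(binary, max_gap):
    """Fill horizontal background runs of length <= max_gap that lie
    between two ink pixels (1 = ink, 0 = background)."""
    out = [row[:] for row in binary]
    for row in out:
        last_ink = None  # column of the most recent ink pixel in this row
        for x, value in enumerate(row):
            if value == 1:
                gap = 0 if last_ink is None else x - last_ink - 1
                if 0 < gap <= max_gap:
                    for fill in range(last_ink + 1, x):
                        row[fill] = 1
                last_ink = x
    return out


# One row holding two "words": a small gap inside the first word and a
# wider gap between the words.
row = [0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0]
smeared = rlsa_horizontal([row], max_gap=2)
```

With max_gap=2 the one-pixel gap inside the first word is closed while the four-pixel gap between the words stays open, so the two words remain separate regions. Choosing the limit is the delicate part: too small and letters stay apart, too large and neighbouring words fuse.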

Using contour analysis, the coordinates of each word can be obtained by drawing a rectangle around each contour. The coordinates of each rectangle give the location of a word in the image, and each word can then be cropped from the parent image.
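A minimal sketch of this step, assuming a binary image in which each word is already one connected region: find every connected region, take its bounding rectangle as (x, y, width, height), and crop that rectangle from the image. In OpenCV this is typically done with cv2.findContours followed by cv2.boundingRect. The plain-Python version below uses a flood fill instead, and the names bounding_boxes and crop are hypothetical.

```python
from collections import deque


def bounding_boxes(binary):
    """Return (x, y, width, height) for each 8-connected ink region,
    sorted from left to right."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if binary[y0][x0] != 1 or seen[y0][x0]:
                continue
            # Flood-fill one region while tracking its extent.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            min_x = max_x = x0
            min_y = max_y = y0
            while queue:
                x, y = queue.popleft()
                min_x, max_x = min(min_x, x), max(max_x, x)
                min_y, max_y = min(min_y, y), max(max_y, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((min_x, min_y, max_x - min_x + 1, max_y - min_y + 1))
    return sorted(boxes)


def crop(image, box):
    """Cut the rectangle (x, y, width, height) out of a list-of-rows image."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]


# Two smeared "words" on a two-row image.
smeared = [
    [0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
]
boxes = bounding_boxes(smeared)
words = [crop(smeared, box) for box in boxes]
```

Sorting the boxes by their x coordinate keeps the cropped words in reading order for a single-line sentence. In practice the crop would be taken from the original image, not from the smeared mask.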

These cropped words can then be used by the next module of the pipeline.

Future Work

You'll soon be hearing from us about some fantastic products related to deep learning and computer vision, and some cool applications in virtual reality and augmented reality. In the near future, we will also share the computer vision and deep learning methodologies we applied to overcome the challenge of enabling computers to read printed text. Stay tuned to learn more about our progress in this area. We will be discussing how to apply deep learning algorithms to computer vision problems.