Handwriting Recognition


Table of contents

0. Introduction

1. The EMNIST Dataset

2. Image Pre-processing

2.1 Importing the images
2.2 Converting the images to grayscale
2.3 Binarizing the images
2.4 Segmenting the images
2.5 Rescaling and centering the segmented images
2.6 Bringing everything together...

3. Prediction

3.1 Training our model
3.2 Measuring the accuracy
3.3 Predicting words

4. Conclusion and possible improvements

5. References and further reading

0. Introduction

This notebook aims to show how to create a simple Handwriting Recognition (HWR) application, able to recognize both letters and numbers. For that, we will fundamentally rely on two libraries: scikit-image, for the pre-processing of the images, and scikit-learn, for building and training the classifier.

The steps we will follow are:

1. Loading and exploring the EMNIST dataset.
2. Pre-processing the images: grayscale conversion, binarization, segmentation, and rescaling/centering.
3. Training a classifier, measuring its accuracy, and predicting entire words.

1. The EMNIST Dataset

For this example, we will use the EMNIST Dataset [1] which is "a set of handwritten character digits derived from the NIST Special Database 19 and converted to a 28x28 pixel image format and dataset structure that directly matches the MNIST dataset". You can find more information about it here.

However, if you wish to use a different dataset, the process should be very similar to what we will follow here.

The EMNIST Dataset offers six different splits. In our case, we will use the "EMNIST Balanced" split, which contains 131,600 characters (letters and numbers) and 47 balanced classes. We can get it from Kaggle.

Once downloaded, let's load it with pandas.
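A minimal sketch, assuming the CSV files from Kaggle (emnist-balanced-train.csv and emnist-balanced-test.csv; the exact file names are an assumption):

```python
import pandas as pd

# The Kaggle CSVs have no header row: column 0 is the class label,
# columns 1-784 are the flattened 28x28 pixel values.
train = pd.read_csv("emnist-balanced-train.csv", header=None)
test = pd.read_csv("emnist-balanced-test.csv", header=None)
```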

The format of the data is:


Class   Image data
4       [ 0 0 254 214 ... 214 154 45 0 0 ]
21      [ 188 0 0 179 ... 245 70 244 0 0 ]
8       [ 0 45 177 89 ... 80 154 90 0 45 ]
11      [ 0 252 196 200 ... 61 251 0 0 0 ]
...     ...

Let's separate the input from the target values:
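Something along these lines, given the layout above:

```python
import numpy as np

# Column 0 holds the class label; the rest is the flattened image.
y_train = train.iloc[:, 0].values
X_train = train.iloc[:, 1:].values
y_test = test.iloc[:, 0].values
X_test = test.iloc[:, 1:].values
```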

We can now plot one of the images to check that everything is working fine. We'll use matplotlib for that.

The image data, as we have mentioned previously, is stored in only one dimension. In addition, the images are mirrored horizontally and rotated 90º. Because of this, we cannot plot them directly. First, we need to take the following steps:

1. Reshape the 784-value vector into a 28x28 matrix.
2. Transpose the matrix, which undoes both the mirroring and the rotation.
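In code, using the variables defined above:

```python
import matplotlib.pyplot as plt

# Reshape the 784-value vector into a 28x28 matrix, then transpose it
# to undo the horizontal mirroring and the 90-degree rotation.
sample = X_train[0].reshape(28, 28).T

plt.imshow(sample, cmap="gray")
plt.title(f"Label: {y_train[0]}")
plt.show()
```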

As we said before, the labels will take values between 0 and 46. The correspondence between the labels and the characters is as follows:

Label   Character
0       '0'
...     ...
9       '9'
10      'A'
...     ...
35      'Z'
36      'a'
...     ...
46      't'

We might think that the conversion of the label into ASCII code would be as simple as carrying out an addition. For example, the label for 'A' is 10, and its ASCII code is 65. So, adding 55 to the label would be enough to get the character.

However, that's not quite right, since there are characters that share the same label, as we can see in the following image extracted from the EMNIST paper:

Balanced EMNIST Dataset

This is due to the similarity of certain lowercase and uppercase letters. For example, the characters 'o' and 'O', 'x' and 'X', 'w' and 'W', etc. Telling whether they are lower or uppercase in an isolated context is very complicated, if not impossible.

Due to the small irregularities just mentioned, the simplest way to establish the label-to-character correspondence is to define a dictionary.

Using the previous example...
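A sketch of such a dictionary; the eleven lowercase characters in the last string are the ones the Balanced split keeps separate from their uppercase counterparts (the exact set comes from the mapping in the EMNIST paper):

```python
# Label-to-character map for the EMNIST Balanced split.
label_map = {label: char for label, char in enumerate(
    "0123456789"                    # labels 0-9
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"    # labels 10-35
    "abdefghnqrt"                   # labels 36-46
)}

print(label_map[10])  # -> 'A'
print(label_map[46])  # -> 't'
```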

It would also be interesting to implement a function that would allow us to find all the occurrences of a certain character in the dataset.

For example, let's look for all the occurrences of 'C':
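One possible implementation (find_occurrences is a hypothetical helper name):

```python
import numpy as np

def find_occurrences(char, labels):
    """Return the indices of all samples labeled with `char`."""
    # Invert the label map to go from character to label.
    char_to_label = {c: l for l, c in label_map.items()}
    return np.where(labels == char_to_label[char])[0]

indices = find_occurrences('C', y_train)
print(f"'C' appears {len(indices)} times in the training set")
```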


2. Image Pre-processing

The goal of our application is to recognize not only individual characters, but entire words.

To do this, it is necessary to carry out a pre-processing of the image. This pre-processing will basically consist of: converting the image to grayscale, binarizing it, segmenting the individual letters, and rescaling and centering each of them.

Let's go step by step.


2.1 Importing the images

To import the images, we will use three libraries:
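A minimal sketch, assuming os to list the files in a local images/ folder, scikit-image's io module to read them, and matplotlib to display the result (the exact trio of libraries is an assumption):

```python
import os
import matplotlib.pyplot as plt
from skimage import io

IMAGES_DIR = "images"  # hypothetical folder with the word photos

images = [io.imread(os.path.join(IMAGES_DIR, fname))
          for fname in sorted(os.listdir(IMAGES_DIR))]

plt.imshow(images[0])
plt.show()
```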


2.2 Converting the images to grayscale

Scikit-image provides the rgb2gray() method for converting an image to grayscale:
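For instance, assuming image holds one of the imported photos:

```python
from skimage.color import rgb2gray

gray = rgb2gray(image)  # single channel, values in [0.0, 1.0]
```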


2.3 Binarizing the images

The next thing we have to do is to binarize the image, that is, to make the pixels of the image take only two values. In our case, we will make those two values 'True' or 'False'.

An important concept when binarizing an image is the threshold. Essentially, this value sets the limit between what will be 'True' and what will be 'False'.

Finding a threshold that is versatile enough to adapt to different images can be a complicated task. The scikit-image library provides a number of functions to find an appropriate threshold value. We will use threshold_otsu(), based on the Otsu method.

All pixels whose values are below the threshold will become True; the rest will be False. That is, the dark pixels (for example, those that correspond to a letter) will be True; the light pixels (the white background) will be False.
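A minimal sketch, using the grayscale image from the previous step:

```python
from skimage.filters import threshold_otsu

thresh = threshold_otsu(gray)
# Ink (dark pixels) becomes True; the light background becomes False.
binary = gray < thresh
```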


2.4 Segmenting the images

To separate the letters that make up the word, we will project the image horizontally and vertically.

We can collect the information from the projections in one-dimensional lists. The values will be boolean, since we just need to know if there is information in a certain column/row of pixels (True) or not (False).
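A minimal sketch of these projection functions (the names are assumptions):

```python
def vertical_projection(binary):
    """True for every column that contains at least one ink pixel."""
    return binary.any(axis=0)

def horizontal_projection(binary):
    """True for every row that contains at least one ink pixel."""
    return binary.any(axis=1)
```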

Once we know where there's information, and where there's not, thanks to the functions defined above, we want to know in which pixel intervals the information is contained. We can store the start and end points of the information in a list, as illustrated below:

List with the horizontal projections

With this information, we are now able to segment the image:
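Something like the following, reusing the projection functions above (intervals and segment are hypothetical names):

```python
def intervals(projection):
    """(start, end) indices of every run of consecutive True values."""
    runs, start = [], None
    for i, value in enumerate(projection):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(projection)))
    return runs

def segment(binary):
    """Crop each letter: split on columns of ink, then trim the rows."""
    letters = []
    for x0, x1 in intervals(vertical_projection(binary)):
        column = binary[:, x0:x1]
        for y0, y1 in intervals(horizontal_projection(column)):
            letters.append(column[y0:y1, :])
    return letters
```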


2.5 Rescaling and centering the segmented images

The resizing of the images can be done with the help of the resize() function from scikit-image. In order to avoid pixelated edges when resizing, we will first apply a Gaussian filter. Again, scikit-image provides a suitable function for this: gaussian().

To prevent the image from being distorted and the edges of the letter from sticking to the edge of the image, we will add a white frame around it, so that the letter is nicely centered in the image.

We will define a function that will add these borders and resize the image.

When returning, the image is normalized so that all its values are between 0 and 255 and are of type int, just like the samples of our dataset. This is necessary since the gaussian() function returns the image as float values between 0.0 and 1.0.
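A sketch of such a function, with the border width and the Gaussian sigma as assumptions:

```python
import numpy as np
from skimage.filters import gaussian
from skimage.transform import resize

def resize_and_center(letter, size=28, border=4):
    """Pad the letter with a blank frame, smooth it, and resize it."""
    padded = np.pad(letter.astype(float), border, constant_values=0.0)
    smoothed = gaussian(padded, sigma=1)   # avoids pixelated edges
    resized = resize(smoothed, (size, size))
    # gaussian() returns floats in [0.0, 1.0]; normalize back to ints
    # in [0, 255], like the samples of our dataset.
    return (resized * 255).astype(int)
```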

Let's create a function that segments the letters of the word, resizes them, and adds the white border to each of them:
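Reusing the helpers above, this can be as short as:

```python
def segment_and_resize(binary):
    """Segment a binarized word into resized, centered letter images."""
    return [resize_and_center(letter) for letter in segment(binary)]
```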


2.6 Bringing everything together...

Finally, we will define a function that gathers all the steps involved in the pre-processing of the image:
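A minimal version, chaining the previous steps (the function name is an assumption):

```python
from skimage import io
from skimage.color import rgb2gray
from skimage.filters import threshold_otsu

def preprocess(path):
    """Read an image and return its letters, ready for classification."""
    gray = rgb2gray(io.imread(path))
    binary = gray < threshold_otsu(gray)
    return segment_and_resize(binary)
```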


3. Prediction

To predict the letters, we will use a neural network as a classifier. The scikit-learn library provides us with multiple alternatives when it comes to classifiers. We will use the MLPClassifier, a classifier that implements a multilayer perceptron.

This classifier supports numerous hyper-parameters that we can adjust to achieve better prediction. In our case, we will only define the number of hidden layers and neurons, but you can try more advanced configurations to improve its accuracy.
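For example, a single hidden layer of 100 neurons (the size is an arbitrary choice):

```python
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(100,), verbose=True)
```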


3.1 Training our model

To train the classifier, since we already defined our training sets in Section 1, we only need to call our classifier's fit() method.
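With the sets from Section 1:

```python
clf.fit(X_train, y_train)
```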

Once our model is trained, we can save it so we don't lose it when we close the notebook. To do this, we are going to use the pickle library, which also allows us to load a pre-trained classifier.
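A sketch, with the file name as an assumption:

```python
import pickle

# Save the trained model to disk...
with open("mlp_classifier.pkl", "wb") as f:
    pickle.dump(clf, f)

# ...and load it back later, skipping the training step.
with open("mlp_classifier.pkl", "rb") as f:
    clf = pickle.load(f)
```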


3.2 Measuring the accuracy

Once we have trained our model, we will measure its accuracy. Let's define a function for that:
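One simple option is to rely on scikit-learn's score(), which returns the mean accuracy on the given samples:

```python
def measure_accuracy(classifier, X, y):
    """Fraction of samples whose prediction matches the true label."""
    return classifier.score(X, y)

print(f"Test accuracy: {measure_accuracy(clf, X_test, y_test):.3f}")
```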


3.3 Predicting words

Finally, we can try to predict our own words. Let's define a function that joins everything we have seen so far:
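A sketch of such a function; the transpose assumes the classifier was trained on the raw, untransposed EMNIST vectors, and the image path is hypothetical:

```python
def predict_word(path):
    """Pre-process an image of a word and predict its characters."""
    letters = preprocess(path)
    # Transpose each letter back to EMNIST's orientation, then flatten.
    samples = [letter.T.flatten() for letter in letters]
    labels = clf.predict(samples)
    return "".join(label_map[label] for label in labels)

print(predict_word("images/hello.png"))
```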


3.3.1 Image #1


3.3.2 Image #2


3.3.3 Image #3


3.3.4 Image #4


3.3.5 Image #5


3.3.6 Image #6


Note: You might have thought that the previous examples are cherry-picked... Well... you're somewhat right ;) But hey, we trained a very basic classifier, so it's not that bad, is it? That's why I challenge you to train better models, you might get much better results! You can learn more about the Multi-layer Perceptron (MLP) and scikit-learn's implementation of it here.


4. Conclusion and possible improvements

In this notebook we have walked through a simple implementation of Handwriting Recognition (HWR), covering the pre-processing of the images and the training of a simple classifier, and, last but not least, we have tested the whole thing with real-world examples.

One important thing to notice is that our implementation, precisely because it is so simple, has some serious limitations. For example, because of the way we separate the letters (through projections), our algorithm is unable to separate letters that are joined together, something that happens frequently in handwriting.

As for possible improvements, there are quite a few, although the most immediate would be:

1. A more robust segmentation method, able to handle letters that are joined together.
2. A more careful tuning of the classifier's hyper-parameters, or a more advanced model altogether.

5. References and further reading

[1] Cohen, G., Afshar, S., Tapson, J., & van Schaik, A. (2017). EMNIST: an extension of MNIST to handwritten letters. Retrieved from http://arxiv.org/abs/1702.05373.

Back to the beginning