Recognizing Handwritten Digits using scikit-learn

Handwriting Recognition

Sharique Tabassum
6 min read · Sep 20, 2021

Recognizing handwritten text is a problem that can be traced back to the first automatic machines that needed to recognize individual characters in handwritten documents. Think, for example, of the ZIP codes on letters at the post office and the automation needed to recognize those five digits: perfect recognition of these codes is necessary in order to sort mail automatically and efficiently. Another application that may come to mind is OCR (Optical Character Recognition) software, which must read handwritten text, or pages of printed books, and turn them into general electronic documents in which each character is well defined. The problem of handwriting recognition actually goes further back in time, more precisely to the early 20th century (the 1920s), when Emanuel Goldberg (1881–1970) began his studies on this issue and suggested that a statistical approach would be an optimal choice. To address this problem in Python, the scikit-learn library provides a good example for understanding the technique, the issues involved, and the possibility of making predictions.

The scikit-learn library (http://scikit-learn.org/) enables us to approach this type of data analysis in a way that is slightly different from previous projects. The data to be analyzed are no longer just numerical values or strings, but can also involve images and sounds.

Hypothesis to be tested

The scikit-learn library provides the Digits dataset, one of many datasets that are useful for testing problems of data analysis and prediction. Some scientists claim that a model trained on it predicts the digit accurately 95% of the time. We will perform data analysis to accept or reject this hypothesis.

The Digits Dataset

The scikit-learn library provides numerous datasets that are useful for testing many problems of data analysis and prediction. Among them is a dataset of images called Digits. This dataset consists of 1,797 images that are 8x8 pixels in size. Each image is a handwritten digit in grayscale.

Data Analysis

IMPORTING THE DATA

from sklearn import datasets
digits = datasets.load_digits()

DATASET DESCRIPTION

After loading the dataset, we can analyze its content. First, we can read a lot of information about the dataset by calling its DESCR attribute.

print(digits.DESCR)

A textual description of the dataset, along with the authors who contributed to its creation and the references, will appear.

.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 5620
:Number of Attributes: 64
:Attribute Information: 8x8 image of integer pixels in the range 0..16.
:Missing Attribute Values: None
:Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
:Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a total of 43 people, 30 contributed to the training set and different 13 to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of 4x4 and the number of on pixels are counted in each block. This generates an input matrix of 8x8 where each element is an integer in the range 0..16. This reduces dimensionality and gives invariance to small distortions.
For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,1994.

.. topic:: References

- C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
Graduate Studies in Science and Engineering, Bogazici University.
- E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
- Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
Linear dimensionalityreduction using relevance weighted LDA. School of
Electrical and Electronic Engineering Nanyang Technological University.
2005.
- Claudio Gentile. A New Approximate Maximal Margin Classification
Algorithm. NIPS. 2000.

The images of the handwritten digits are contained in the digits.images array. Each element of this array is an image represented by an 8x8 matrix of numerical values that correspond to a grayscale from white, with a value of 0, to black, with a value of 16.

digits.images[0]

We will get the following result:

array([[ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.],
       [ 0.,  0., 13., 15., 10., 15.,  5.,  0.],
       [ 0.,  3., 15.,  2.,  0., 11.,  8.,  0.],
       [ 0.,  4., 12.,  0.,  0.,  8.,  8.,  0.],
       [ 0.,  5.,  8.,  0.,  0.,  9.,  8.,  0.],
       [ 0.,  4., 11.,  0.,  1., 12.,  7.,  0.],
       [ 0.,  2., 14.,  5., 10., 12.,  0.,  0.],
       [ 0.,  0.,  6., 13., 10.,  0.,  0.,  0.]])

We can visually check the contents of this result using the matplotlib library.

import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation='nearest')

By launching this command, we will obtain the grayscale image shown in the figure below.

One of the 1,797 handwritten digits

The numerical values represented by the images, i.e., the targets, are contained in the digits.target array.

digits.target

We will get the following result:

array([0, 1, 2, ..., 8, 9, 8])

It was reported that the dataset consists of 1,797 images. We can check whether that is true.

digits.target.size

This will be the result:

1797

digits.data.shape

(1797, 64)
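The 64 columns of digits.data are simply the 8x8 images unrolled row by row. As a quick sanity check (a minimal sketch; NumPy is assumed to be available alongside scikit-learn), we can verify this relationship ourselves:

import numpy as np

# Flatten each 8x8 image into a 64-element row and compare with digits.data
flattened = digits.images.reshape((len(digits.images), -1))
print(np.array_equal(flattened, digits.data))  # expected to print True

If this prints True, the two representations carry exactly the same information, so we can train on digits.data while still reasoning in terms of 8x8 images.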

Learning and Predicting

Now that we have loaded the Digits dataset into our notebook, let us define an SVC estimator.

An estimator that is useful in this case is sklearn.svm.SVC, which uses the technique of Support Vector Classification (SVC). Thus, we have to import the svm module of the scikit-learn library. We can create an estimator of type SVC and choose an initial setting, assigning generic values to the parameters C and gamma. These values can then be adjusted during the course of the analysis.

from sklearn import svm

svc = svm.SVC(gamma=0.001, C=100.)
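As mentioned above, C and gamma can be adjusted later in the analysis. One possible way to do this (a sketch that is not part of the original analysis, assuming scikit-learn's GridSearchCV with 5-fold cross-validation) is a small grid search over candidate values:

from sklearn.model_selection import GridSearchCV

# Candidate values for C and gamma (illustrative only)
param_grid = {'C': [1., 10., 100.], 'gamma': [0.0001, 0.001, 0.01]}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(digits.data, digits.target)
print(search.best_params_)

For the rest of this article we simply keep gamma=0.001 and C=100, as in the original setting.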

We know that, once we define a predictive model, we must train it with a training set, i.e., a set of data for which we already know the class each element belongs to. Given the large number of elements contained in the Digits dataset, we should obtain a very effective model, i.e., one that is capable of recognizing handwritten numbers with good confidence. This dataset contains 1,797 elements, so we can consider the first 1,791 (indices 0 to 1790) as a training set and use the last six (indices 1791 to 1796) as a validation set. We can look at these six handwritten digits in detail using the matplotlib library:

import matplotlib.pyplot as plt
%matplotlib inline

plt.subplot(321)
plt.imshow(digits.images[1791], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(322)
plt.imshow(digits.images[1792], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(323)
plt.imshow(digits.images[1793], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(324)
plt.imshow(digits.images[1794], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(325)
plt.imshow(digits.images[1795], cmap=plt.cm.gray_r, interpolation='nearest')
plt.subplot(326)
plt.imshow(digits.images[1796], cmap=plt.cm.gray_r, interpolation='nearest')

This will produce an image with 6 digits as shown in the figure below.

The six digits of the validation set

Now we can train the svc estimator that we defined earlier.

svc.fit(digits.data[:1791], digits.target[:1791])

After a short time, the trained estimator will appear with text output.

SVC(C=100.0, gamma=0.001)

Now we have to test our estimator by making it interpret the six digits of the validation set.

svc.predict(digits.data[1791:1797])

We will obtain these results:

array([4, 9, 0, 8, 9, 8])

If we compare them with the actual digits:

digits.target[1791:1797]

array([4, 9, 0, 8, 9, 8])

we can see that the predictions all match.

Let's repeat the exercise on different ranges, this time fitting and predicting on the same samples:

svc.fit(digits.data[10:20], digits.target[10:20])

SVC(C=100.0, gamma=0.001)

svc.predict(digits.data[10:20])

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

digits.target[10:20]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

svc.fit(digits.data[1005:1020], digits.target[1005:1020])

SVC(C=100.0, gamma=0.001)

svc.predict(digits.data[1005:1020])

array([6, 9, 6, 1, 7, 5, 4, 4, 7, 2, 8, 2, 2, 5, 7])

digits.target[1005:1020]

array([6, 9, 6, 1, 7, 5, 4, 4, 7, 2, 8, 2, 2, 5, 7])

CONCLUSION

We can see that the predicted and target values are the same, and the svc estimator has learned correctly. We got 100% accurate predictions for the above cases. Hence, we accept the given hypothesis.
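As a complementary check of the 95% claim, here is a minimal sketch (assuming scikit-learn's train_test_split and accuracy_score utilities, which are not used in the original analysis) that measures accuracy on a held-out portion of the dataset rather than on a handful of samples:

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()

# Hold out 25% of the images for testing
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

svc = svm.SVC(gamma=0.001, C=100.)
svc.fit(X_train, y_train)
print(accuracy_score(y_test, svc.predict(X_test)))

If the printed accuracy is at least 0.95, this larger test also supports accepting the hypothesis.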

I am thankful to mentors at https://internship.suvenconsultants.com for providing awesome problem statements and giving many of us a Coding Internship Experience. Thank you www.suvenconsultants.com
