Introduction

With the ever-increasing demand for the creation of a digital world, many Optical Character Recognition (OCR) algorithms have been developed over the years. A script can be defined as the graphic form of the writing system used to write a statement. The availability of large numbers of scripts makes the development of a universal OCR a challenging task. This is because the features needed for character recognition are usually a function of structural script properties and the number of possible classes or characters. The extremely high number of available scripts makes the task quite daunting, and as a result, most OCR systems are script-dependent [1].

Fig. 1 Samples of newspapers used for the dataset

Fig. 2 Samples of handwritten documents used for the dataset

Script identification is the initial cognitive process that occurs when a human reads printed or handwritten text. Perceptual rules lead humans to focus on borders and corners, which ensures accurate identification. When it comes to automatic script identification, our aim is to utilize features rooted in cognitive principles to achieve optimal results.

In this paper, we propose using texture-based features on black-and-white images as the first step. These features emphasize borders and corners, in line with the cognitive principles above. The extracted features will then be fed into machine learning schemes for script identification. In the second step, we will leverage deep learning classifiers that emulate the interconnected nature of human cognitive processes to perform the same task.

Our approach involves comparing texture features and machine learning schemes with deep learning paradigms to establish a benchmark for the shared new multi-lingual MDIW-13 script identification database. This benchmark is expected to serve as a valuable resource for evaluating and comparing diverse script identification methods.

The approach for handling documents in a multi-lingual and multi-script environment is divided into two steps: first, the script of the document, block, line, or word is estimated, and second, the appropriate OCR is applied. This approach requires a script identifier and a bank of OCRs, with one OCR per possible script.

Many script identification algorithms have been proposed in the literature. A survey published in 2010 with a taxonomy of script identification systems can be found in [2], and a more recent global study on state-of-the-art script identification in [3]. In contrast, the survey in [4] focuses on Indic scripts. These surveys report the performance of script identification methods based on pattern recognition strategies.

Script identification can be conducted either offline, from scanned documents, or online if the writing sequence is available. Identification can also be classified either as printed or handwritten, with the latter being the more challenging. Script identification can be performed at different levels: page or document, paragraph, block, line, word, and character. An example for Indic scripts is given in [5].

As with any classical classification problem, the difficulty of script identification is a function of the number of possible classes or scripts to be detected. Furthermore, any similarity in the structure of scripts represents an added challenge: if two or more scripts are very similar, the identification complexity increases. For example, the Kannada and Telugu scripts are very similar and thus lend themselves to confusion in many cases. Although documents with two scripts represent the most common problem, documents with three or more scripts can also be found [6].

A unified approach based on local pattern analysis was proposed in [7] for script identification at the line level and improved in [8] for the word level. It was applied to video frames in [9]. In these cases, histograms of local patterns are used as features describing both the direction distribution and the global appearance of strokes. In a further step, Neural Networks have demonstrated their capacity to extract highly discriminant features from images when enough data is available. Consequently, Neural Networks with Deep Learning have been explored in many tasks that involve document analysis. Specifically, in [10], the authors proposed a Discriminative Convolutional Neural Network (DCNN). Their approach combines deep features obtained from three convolutional layers. Their results, with performance above 90% on a database with 13 scripts, demonstrate the feature extraction capacity of DCNNs for script identification tasks.

Other approaches have explored similar or optimized architectures like Discriminative CNN [10]. An example is given in [11], where the authors stated that addressing the script identification problem with state-of-the-art Convolutional Neural Network (CNN) classifiers is not straightforward, as they fail to address some key characteristics of scripts, e.g., their extremely variable aspect ratio. Instead of resizing input images to a fixed aspect ratio, the authors of [11] proposed a patch-based classification framework to preserve discriminative parts of the image. To this end, they used ensembles of conjoined networks to jointly learn discriminative stroke-part representations and their relative importance in a patch-based classification scheme.

CNNs have further been applied to handwritten script recognition, as proposed in [12]. In that work, an architecture composed of two convolutional layers was employed. The results in a database containing 5 scripts demonstrate the potential of CNNs in either handwritten or printed text. Recurrent Neural Networks (e.g., Long Short-Term Memory Networks) have been explored in the context of Arabic [13] and Indic [14] script identification. These network architectures allow capturing sequential information and achieving state-of-the-art performance. Also, a combination of individually trainable small CNNs with modifications in their architectures was used in [15] for multi-script identification.

Further, the authors in [16] introduced the extreme learning machine (ELM) technique, which generalizes the performance of neural networks. The authors studied this technique on 11 official Indic scripts and observed significant results when the sigmoidal activation function was used.

The power of CNN was also evidenced in [12] to identify Chinese, English, Japanese, Korean, or Russian scripts. The authors also evaluated whether the texts were handwritten or machine-printed and obtained excellent performances.

In summary, while most works claim identification rates exceeding 92%, each work uses different datasets with different script combinations. Therefore, it is difficult to carry out a fair comparison of these approaches. Moreover, the databases employed in related studies usually include two to four scripts; only a few include a higher number. The most popular scripts are Latin, Indian, Japanese, and Chinese, with Greek, Russian, and Hebrew also appearing occasionally [2]. A common database allowing a fair comparison of different algorithms would thus be desirable.

While building a dataset has become simpler and easier today than it used to be, the task remains arduous and laborious. For instance, documents from different scripts can be generated using the Google Translate application, as in [8]. However, in this case, the font, size, and background of the generated documents will be the same, which is unrealistic.

To alleviate this drawback, this paper aims to offer a database for script identification consisting of a wide variety of some of the most commonly used scripts, collected from real-life printed and handwritten documents. Further, along with the database, its benchmarking with texture-based features and deep learning is also showcased. The printed documents in the database were obtained from local newspapers and magazines and therefore comprise different fonts and sizes as well as cursive and bold text. A sample of the newspapers used can be seen in Fig. 1. The handwritten part was obtained from volunteers from all over the world, who scanned and shared their manuscripts. A few samples of the handwritten documents can be seen in Fig. 2.

The following three benchmarks of this database are provided for script identification using different handcrafted features: Local Binary Pattern [17], Quad-Tree Histogram of Templates [18], and Dense Multi-Block Local Binary templates with a Support Vector Machine as a classifier [19]. These script identifiers were used in a document analysis context in [4] and [5]. A benchmark with Deep Learning techniques is also included in our study to demonstrate the usefulness of this database to train deep models.

As a summary, the contributions of the work are listed as follows:

  1. A freely accessible multi-lingual database for script identification, called MDIW-13 (Multi-lingual and multi-script Document Identification in the Wild; 13 refers to the number of scripts in the dataset).

  2. The database provides the possibility of both handwritten and printed script identification.

  3. The database allows script identification at the document, line, and word levels.

  4. The database enables cross-training, e.g., training with printed samples and testing with handwriting, or training with lines and testing with words, among others.

  5. A benchmark with different standard parameters and classifiers is given for the sake of comparison.

Previous Works on Public Databases

The research community is interested in script identification as it can help in different document analysis tasks, such as OCR, handwriting recognition, document analysis or writer identification [20]. However, the number of script identification databases available is limited, so there is a significant need for publicly available databases.

Regarding the number of scripts, size, and availability of datasets for script identification, the most popular public databases contain only Roman and Arabic scripts. An example is the database of the Maurdor project [21], which is contemporary to the MALIS-MSHD [22]. Other databases can also be used for script identification although they were devised for writer recognition [23]. Databases of printed script also exist [24]. Roman, Bengali, and Devanagari databases were compiled in [25]; the authors proposed bi-script and tri-script word-level script identification benchmarks, studying the performance of several classifiers. The literature also considers databases with peculiar scripts that have not been thoroughly investigated in handwriting. An example can be seen in [26], where an Indic database includes the Meitei Mayek script. The SIW-13 [27] is a script identification benchmark composed of printed text obtained from natural scene images; it consists of 13 scripts: English, Greek, Hebrew, Russian, Arabic, Thai, Tibetan, Korean, Kannada, Cambodian, Chinese, Mongolian, and Japanese. Also, PHDIndic_11 [28] is a publicly available dataset focused on 11 official Indic scripts, which are used in the 22 official languages of India. Previously existing databases are summarized in Table 1.

Table 1 Summary of public script identification databases (H&P = Handwritten and Printed samples)

The new database built in this work, MDIW-13, represents a step forward in the field of script identification, with 13 scripts and over 87,000 handwritten and printed words. The main difference between our work and existing databases lies in the large number of scripts in the proposed dataset. Some of these 13 scripts are quite similar, whereas others are rather different. Also, some of them can be found in real applications in countries like India, where many Indian and even non-Indian scripts appear in border control, access control, courier companies, or document analysis. This property makes the MDIW-13 database more versatile and interesting in Indic environments. Furthermore, MDIW-13 is composed of text extracted from documents, carefully preprocessed to eliminate covariates from the background and acquisition protocols.

Another contribution of this paper is to provide a benchmark with well-known and easy-to-replicate script identifiers. In this case, the benchmark allows studying the impact on performance when the training set uses words, lines, pages, or a combination of all three. This kind of experiment is made possible by MDIW-13.

The rest of this paper is organized as follows: Sect. 2 describes the database and its different features. While Sect. 3 describes the proposed script identifiers for benchmarking purposes, Sect. 4 gives the benchmarking design and Sect. 5 the experimental results. Sections 6 and 7 close the paper with a discussion and a conclusion.

MDIW-13: A New Database for Script Identification

The proposed database consists of printed and handwritten samples from a total of 113 documents, which were scanned from local newspapers and handwritten letters and notes. From these documents, a total of 13,979 lines and 86,655 words in 13 different scripts were extracted. The database is offered both as the raw data from direct digitalization and after the preprocessing carried out here. It can be freely downloaded for research purposes.

Main Challenges in Data Collection

Probably, the main challenge in this work was obtaining the data, especially from newspapers, because of the wide variety of scripts involved.

Some documents for a given script may contain some sort of watermark, since each document came from a different original location. This poses the risk of the document watermark, rather than the script, being recognized, which could be the case with a deep learning-based classifier.

Segmenting text from the backgrounds of some documents was challenging. Even with state-of-the-art segmentation techniques, the result was sometimes unsatisfactory: it included a lot of salt-and-pepper noise or black patches, or some parts of the text were missing.

To avoid these drawbacks and provide a dataset for script recognition, all the documents were preprocessed and given a white background, while the foreground text ink was equalized. Furthermore, all documents were manually examined. Both original and processed documents are included in the database.

To conduct experiments on script recognition at different levels (i.e., document, line, and word), each document was divided into lines and each line into words. In this division, a line is defined as an image with two or more words, and a word is defined as an image with two or more characters. It is worth highlighting that whitespaces were left unaltered in all cases, given the importance of whitespace in script identification.

In the following subsections, specific challenges in digitalizing both printed and handwritten documents are highlighted.

Main Challenges in Digitizing Printed Documents

The printed part of the database was acquired from a wide range of local newspapers and magazines to ensure that the samples would be as realistic as possible. The newspaper samples were collected mainly from India (where a wide variety of scripts is used), Thailand, Japan, the United Arab Emirates, and Europe. A few examples of the printed documents used are shown in Fig. 1. The database includes 13 different scripts: Arabic, Bengali, Gujarati, Gurmukhi, Devanagari, Japanese, Kannada, Malayalam, Oriya, Roman, Tamil, Telugu, and Thai.

The newspapers were scanned at a resolution of 300 dpi. Paragraphs with only one script were selected for the database (a paragraph here means the headline and body text). These paragraphs included multiple fonts and letter styles, with italic or bold formats. Nevertheless, some newspapers mix different scripts in the same text; for instance, an Arabic numeral or a Latin character can be found within Devanagari text. In these cases, care was taken not to mix those scripts in a single part of the database.

Further, care was taken to ensure that the text lines were not skewed. All images were saved in png format, using the script_xxx.png naming convention, with script being an abbreviation for each script and xxx the file number, starting at 001 for each script. The scripts, abbreviations, and the number of documents for each script are given in Table 2. Further information about the dataset can be found in Tables 10, 11, and 12 in the Annexes.

Table 2 Database figures

Main Challenges in Digitizing Handwritten Documents

Similarly to the printed part, the handwritten database also includes 13 different scripts: Persian (counted as Arabic), Bengali, Gujarati, Punjabi (counted as Gurmukhi), Devanagari, Japanese, Kannada, Malayalam, Oriya, Roman, Tamil, Telugu, and Thai.

To collect them, invitations were sent to several native researchers and colleagues from different countries, who were capable of writing documents in their respective scripts, asking for handwritten letters. Each volunteer wrote a document with their own pen and with no restrictions on the paper type used. Next, they digitized these documents on unspecified devices and without any limitation on scanning settings, such as resolution, and then sent them to us by e-mail. Consequently, the documents have large variations in ink, sheet, and scanner quality. All these uncontrolled conditions make the database as close to "in the wild" conditions as possible. Note that the Roman sheets came from the IAM handwriting database [31]. Some examples are shown in Fig. 2.

Background and Ink Equalization

Due to the broad quality range of the documents, a two-step preprocessing was performed. In the first step, images are binarized by transforming the background into white, while in the second step, an ink intensity equalization is performed.

Because background texture, noise, and illumination conditions are primary factors that degrade document image binarization performance, an iterative refinement algorithm was used for binarization [32]. Specifically, the input image is initially transformed into a Bhattacharyya similarity matrix with a Gaussian kernel, which is subsequently converted into a binary image using a maximum entropy classifier. A run-length histogram is then used to estimate the character stroke width. After noise elimination, the output image is used for the next round of refinement, and the process terminates when the estimated stroke width is stable. However, some documents were not correctly binarized, and in such cases, a manual binarization was performed using local thresholds. All the documents were reviewed, and some noise was removed manually.

Sometimes, collaborators made mistakes while writing their letters. Such mistakes resulted in blurred handwriting with scribbles in some parts of the letters; these were identified and repaired by covering the scribbled parts of the documents with white boxes.

For ink equalization, the ink deposition model proposed in [33] was used. All the black pixels in the binarized images were considered ink spots and were correlated with a Gaussian of 0.2 mm width. Finally, the image was equalized to emulate fluid ink, as in [34]. The result can be seen in Fig. 3.
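As a rough illustration of this step, the following sketch spreads the binarized ink with a Gaussian; the conversion of 0.2 mm into pixels assumes the 300 dpi scanning resolution, and the exact deposition model of [33, 34] is not reproduced:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def equalize_ink(binary_img, dpi=300, ink_width_mm=0.2):
    """Sketch of ink equalization: binary ink pixels (1 = ink, 0 = background)
    are spread with a Gaussian approximating the ink deposition width."""
    sigma_px = (ink_width_mm / 25.4) * dpi       # 0.2 mm is roughly 2.4 px at 300 dpi
    spread = gaussian_filter(binary_img.astype(float), sigma=sigma_px)
    spread /= spread.max() + 1e-9                # darkest ink maps to 1
    return (255 * (1.0 - spread)).astype(np.uint8)  # white background, grey ink
```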

Fig. 3 Pre-processed database and line and word segmentation

Text Line Segmentation

For the lines from a document to be segmented, they must be horizontal; otherwise, a skew correction algorithm must be used [17].

Fig. 4 Line detection procedure

For line segmentation, each connected object/component of the image is detected, and its convex hull is obtained, as shown in Fig. 4. The result is dilated horizontally in order to connect the objects belonging to the same line (see Fig. 4), and each connected object is labelled. The next step is a line-by-line extraction, performed as follows:

  1. Select the top object of the dilated lines and determine its horizontal histogram.

  2. If its histogram has a single maximum, it should be a single line, and the object is used as a mask to segment the line (see Fig. 4).

  3. If the object has several peaks, it is assumed that there are several lines. To separate them, the following steps are followed:

     (a) The object is horizontally eroded until the top object contains a single peak.

     (b) The new top object is dilated to recover the original shape and is used as a mask to segment the top line.

  4. The top line is deleted, and the process is repeated from step 1 to the end.

This automatic segmentation procedure was initially used. Later, each line was manually examined. Any lines that had been wrongly segmented were manually repaired. The lines were saved as image files and named using the script_xxx_yyy.png format, where yyy is the line number, xxx is the document number and script is the abbreviation for the script, as previously mentioned. Figure 3 presents an example of a segmented line for handwriting. These images are saved in grayscale format. The number of lines per script can be seen in Table 2.
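As an illustration only, the following is a minimal sketch of the idea behind this procedure (horizontal dilation to merge the objects of a line, then one labelled component per line used as a mask); the convex-hull step and the multi-peak erosion of step 3 are omitted, and the dilation width is an arbitrary assumption:

```python
import numpy as np
from scipy.ndimage import binary_dilation, label

def segment_lines(binary_img, dilate_width=51):
    """Simplified line extraction: merge objects of the same line by
    horizontal dilation, then cut one line image per labelled component.
    binary_img: 2-D bool array, True = ink."""
    struct = np.ones((1, dilate_width), dtype=bool)
    merged = binary_dilation(binary_img, structure=struct)
    labels, n = label(merged)
    # Process components from top to bottom (by their first row of appearance)
    order = sorted(range(1, n + 1),
                   key=lambda k: np.argwhere(labels == k)[:, 0].min())
    lines = []
    for k in order:
        mask = labels == k
        rows = np.any(mask, axis=1)
        lines.append(np.where(mask, binary_img, False)[rows])  # crop to vertical extent
    return lines
```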

Word Segmentation

The words were segmented from the lines in two steps, with the first step being completely automatic. Each line was converted to a black-and-white image, a vertical projection histogram was obtained, and the points where the histogram value was zero were identified as gaps. Gaps wider than one-third of the line height were labelled as word separations.

In the second step, failed word segmentations were manually corrected. Each word was saved individually as a black and white image. The files were named using the script_xxx_yyy_zzz.png format, with zzz being the word number of the line script_xxx_yyy. For instance, a file named roma_004_012_004.png contains the black and white image of the fourth word on the 12th line of the 4th document in Roman script. An example of the segmentation result can be seen in Fig. 3. The number of words per script is shown in Table 2.
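A minimal sketch of the automatic first step, under the one-third-of-the-line-height rule described above (the manual correction of the second step is obviously not reproduced):

```python
import numpy as np

def segment_words(line_img, min_gap_ratio=1 / 3):
    """Words are separated at background gaps wider than one-third of the
    line height, measured on the vertical projection.
    line_img: 2-D bool array, True = ink."""
    min_gap = line_img.shape[0] * min_gap_ratio
    ink_cols = line_img.any(axis=0)
    words, start, gap_len = [], None, 0
    for x, has_ink in enumerate(ink_cols):
        if has_ink:
            if start is None:
                start = x                       # first ink column of a word
            elif gap_len >= min_gap:            # previous gap was a word break
                words.append(line_img[:, start:x - gap_len])
                start = x
            gap_len = 0
        elif start is not None:
            gap_len += 1
    if start is not None:
        words.append(line_img[:, start:])
    return words
```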

In Thai and Japanese, word segmentation was conducted heuristically because their lines consist of two or three long sequences of characters separated by larger spaces. This is because there is generally no gap between two words in these scripts, and contextual meaning is generally used to decide which characters comprise a word. Since we did not conduct text recognition and no contextual meaning is applied in the current database, the following approach for pseudo-segmentation of Thai and Japanese scripts was used, after seeking advice from native Thai and Japanese writers: for each sequence of characters, the first two characters are the first pseudo-word; the third to the fifth characters are the second pseudo-word; the sixth to the ninth characters are the third pseudo-word, and so on, up to the end of the sequence.
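The chunking rule can be written compactly; this short sketch simply groups 2, then 3, then 4 characters, and so on, until the sequence is exhausted:

```python
def pseudo_words(characters):
    """Pseudo-segmentation for Thai and Japanese: pseudo-words of 2, 3, 4, ...
    characters until the end of the character sequence."""
    words, i, size = [], 0, 2
    while i < len(characters):
        words.append(characters[i:i + size])
        i += size
        size += 1
    return words

# Example: a 12-character sequence is split into chunks of 2, 3, 4 and a final chunk of 3.
```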

It should be noted that our intention in this work is not to develop a new line/word segmentation system. Only a simple procedure is used to segment lines and words in order to build our database. In this way, a semi-automatic approach was adopted, with human verification and correction in the case of erroneous segmentation.

Script Identifiers

For database benchmarking, an automatic script identifier is required. For a more general benchmarking of the database, up to four automatic script identifiers are used. The first two are based on the classical feature-classifier structure, and the last two are based on deep learning. Our motivation in defining the benchmarks is that they should be easy for third parties to replicate, allowing them to establish a baseline in each case. To this aim, the systems are available in several toolboxes under different programming languages.

In feature-classifier script identifiers, the script feature extractors used in this section are based on local patterns. Specifically, we used Local Binary Patterns (LBP) [17], Quad-Tree histograms [18], and Dense Multi-Block LBPs [19]. Such features can be seen as constituting a unifying approach, thus bringing together the traditional appearance and structural approaches. When these techniques are applied to black and white images, local patterns can be considered as the concatenation of the binary gradient directions. The histogram of these patterns contains information on the distribution of the edges, spots, and other local shapes in the script image, which can be used as features for script detection. The following section describes the features used for script identification. The classifier used for script identification, which is a Support Vector Machine (SVM) [35], is also described.

For script identification based on deep learning, two popular state-of-the-art image recognition architectures based on Convolutional and Residual layers are used for benchmarking.

Local Binary Patterns for Script Detection

Local Binary Patterns: The original LBP [17] operator labels the pixels of an image by thresholding the \(3\times 3\) neighborhood around each pixel and combining the results binomially to form a number. Assume that a given image is defined as \(I(Z)=I(x,y)\). The LBP operator transforms the input image to LBP(Z) as follows:

$$\begin{aligned} LBP(Z_c)= \displaystyle \sum _{p=0}^7 s(I(Z_p) - I(Z_c)) 2^p \end{aligned}$$
(1)

where \(s(l) = {\left\{ \begin{array}{ll} 1 &{} l\ge 0 \\ 0 &{} l<0 \end{array}\right. }\) is the unit step function, \(I(Z_p)\) are the grey values of the 8-neighborhood around \(I(Z_c)\), and p indexes the considered neighbor. In this paper, the standard 8-neighbor configuration (\(p=0,\ldots ,7\)) is used.

LBP feature: The LBP(Z) code matrix contains information about the structure to which each pixel belongs: a stroke edge, a stroke corner, a stroke end, and so on. It is assumed that the distribution of these structures characterizes the script. The distribution is obtained as the histogram of LBP(Z), named \(h_{LBP}\). As the histogram is a function of the size of the image, it is normalized as \(hn_{LBP} = h_{LBP}/\sum h_{LBP}\). The length of this vector is 255, since the LBP value for the background is discarded.

The problem with the histogram is that it loses the spatial distribution of the structures. To include the spatial distribution in the LBP feature, the image is divided into a number of zones, the histogram \(hn_{LBP}\) is calculated in each zone, and the zone histograms are concatenated. After several experiments with a range of smaller and larger zone sizes, the best performance was obtained when dividing the lines and words into three equal horizontal regions with a 30% overlap. Thus, the vector \(H_{LBP}=[hn_{LBP}^1, hn_{LBP}^2, hn_{LBP}^3]\) of 765 components is worked out.

Finally, this 765-component vector is reduced to 255 components by calculating the DCT of \(H_{LBP}\) and selecting the second to the 256th components (i.e., discarding the DC component). This new vector is the LBP feature used to identify scripts in the case of lines and words. An example of this procedure is illustrated in Fig. 5.

In the case of a full document with several lines, the LBP features of all the lines were combined at the score level.
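For illustration, the following sketch follows the pipeline just described (8-neighbor LBP codes, three overlapping horizontal zones, normalized 255-bin histograms, DCT compression). The choice of which code corresponds to the background (255 for a uniform white region under Eq. 1) and the exact zone overlap are our own assumptions:

```python
import numpy as np
from scipy.fft import dct

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_codes(img):
    """8-neighbor LBP codes (Eq. 1) of a grayscale image; border pixels are skipped."""
    c = img[1:-1, 1:-1].astype(int)
    codes = np.zeros_like(c)
    for p, (dy, dx) in enumerate(OFFSETS):
        neigh = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx].astype(int)
        codes += ((neigh - c) >= 0).astype(int) << p
    return codes

def lbp_feature(img, n_zones=3, overlap=0.3):
    """Per-zone normalized LBP histograms (255 bins, background code discarded),
    concatenated (765 values) and compressed with a DCT to 255 components."""
    h = img.shape[0]
    zone_h = int(round(h / n_zones * (1 + overlap)))
    step = max(1, (h - zone_h) // (n_zones - 1))
    hists = []
    for z in range(n_zones):
        top = max(0, min(z * step, h - zone_h))
        codes = lbp_codes(img[top:top + zone_h])
        hist, _ = np.histogram(codes, bins=np.arange(257))
        hist = hist[:255].astype(float)          # drop code 255 (flat background)
        hists.append(hist / (hist.sum() + 1e-9))
    H = np.concatenate(hists)                    # 3 x 255 = 765 components
    return dct(H, norm='ortho')[1:256]           # keep the 2nd to 256th components
```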

Fig. 5 LBP features for script identification

Quad-Tree Histogram of Templates for Script Detection

In this section, a new and efficient feature for script identification is proposed. It is based on a quad-tree computation of the Histogram of Templates (HOT), which was introduced for signature verification in [18]. Specifically, this feature is an extension of the HOT, designed to highlight local directions.

The implementation of the HOT employs a set of 20 templates to describe segment orientations by comparing the positional relationship between a pixel and its neighborhood references. Specifically, a sliding window covering \(3 \times 3\) pixels is applied to the text image to count the number of pixels that fit each template. The resulting counts constitute the histogram of templates. In [18], the HOT is computed by considering both pixel and gradient information. The vector is calculated in the following steps:

  1. Pixel information-based HOT (P-HOT): There are 20 possible templates, and each template corresponds to a possible combination of adjacent pixels \(Z_1\) and \(Z_2\) with the pixel \(Z=(x,y)\). For each template and pixel Z, if the grey value I(Z) is greater than the grey values of the two adjacent pixels \(I(Z_1)\) and \(I(Z_2)\), i.e., if

     $$\begin{aligned} I(Z)> I(Z_1) \wedge I(Z) > I(Z_2), \end{aligned}$$
     (2)

     then 1 is added to the count of this template. The vector of the tallies of these 20 templates is termed the histogram of templates, which is the feature vector.

  2. Gradient information-based HOT (G-HOT): For each template, if the gradient magnitude Mag(Z) of a pixel \(Z=(x,y)\) is greater than the gradient magnitudes of the two adjacent pixels, i.e., if

     $$\begin{aligned} Mag(Z)>Mag(Z_1) \wedge Mag(Z)>Mag(Z_2), \end{aligned}$$
     (3)

     then 1 is added to the count of this template. There are 20 possible combinations of adjacent pixels \(Z_1\) and \(Z_2\) for each pixel Z, hence 20 templates. Similar to P-HOT, the vector of the tallies of the 20 templates is known as the gradient histogram of templates, which acts as the feature vector.

The HOT feature consists of the 20 values of the P-HOT concatenated with the 20 values of the G-HOT, for a total of 40 values. To facilitate the verification process, after the HOT calculation for each region, an L2 normalization is performed on the 40 values of the HOT to scale them between 0 and 1.

The quad-tree structure considers the spatial properties of a local shape by dividing it into four cells at different levels. The center of gravity of the pixels is used as the center of the equi-mass partition, which avoids empty cells, especially at deeper levels. The HOT is therefore computed locally at each level of the quad-tree structure, while the whole-image feature is obtained by concatenating all local HOT features.

Heuristically, the HOT features at the first and second quad-tree levels were used. The first level contains the full image, while the second contains four partitions. Hence, there are 5 HOT features, which yield a 200-dimensional feature vector. An example of this procedure is shown in Fig. 6.
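The sketch below follows this construction (a 40-D HOT per region, one region at level 1 and four at level 2, giving 200 dimensions). The exact 20 templates of [18] are not reproduced: a hypothetical set of neighbor-pair templates is used, and the equi-mass partition is simplified to a plain geometric split:

```python
import numpy as np

NEIGH = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
# Hypothetical template set: 20 pairs of neighbor offsets (Z1, Z2).
TEMPLATES = [(NEIGH[i], NEIGH[j]) for i in range(8) for j in range(i + 1, 8)][:20]

def hot(img):
    """40-D HOT: P-HOT on grey values (Eq. 2) plus G-HOT on gradient
    magnitudes (Eq. 3), L2-normalized."""
    gy, gx = np.gradient(img.astype(float))
    feats = []
    for ch in (img.astype(float), np.hypot(gx, gy)):
        hist = np.zeros(len(TEMPLATES))
        z = ch[1:-1, 1:-1]
        for t, ((dy1, dx1), (dy2, dx2)) in enumerate(TEMPLATES):
            z1 = ch[1 + dy1:ch.shape[0] - 1 + dy1, 1 + dx1:ch.shape[1] - 1 + dx1]
            z2 = ch[1 + dy2:ch.shape[0] - 1 + dy2, 1 + dx2:ch.shape[1] - 1 + dx2]
            hist[t] = np.count_nonzero((z > z1) & (z > z2))
        feats.append(hist)
    f = np.concatenate(feats)
    return f / (np.linalg.norm(f) + 1e-9)

def quadtree_hot(img):
    """200-D feature: HOT of the full image (level 1) plus the HOT of four
    quadrants (level 2), concatenated."""
    h, w = img.shape
    parts = [img,
             img[:h // 2, :w // 2], img[:h // 2, w // 2:],
             img[h // 2:, :w // 2], img[h // 2:, w // 2:]]
    return np.concatenate([hot(p) for p in parts])
```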

Fig. 6 Quad-Tree Histogram for script identification

Script Features Based on Dense Multi-Block LBP Features

Dense Multi-Block LBPs (D-LBP) are features that have recently been proposed for script identification; they are derived from the LBP of Equation (1) and are based on a spatial pyramidal architecture of the multi-block LBP (MBLBP) histograms proposed in [36]. We chose this descriptor for its performance, which makes it suitable for our benchmark. It is also well-suited to cognitive computation, approximating the human cognitive process of information selection. Additionally, our article includes other novel classifiers, allowing us to showcase a wide performance spectrum and analyze our database thoroughly.

Specifically, an image I of \(n_x\) rows and \(n_y\) columns, at level \(l=1,2,\ldots ,L\), is divided into \(N_l^x\) by \(N_l^y\) patches of height \(h_l\) and width \(w_l\). The patches are uniformly distributed in the image. For each patch, the histogram of MBLBP descriptors at different scales is worked out. The feature consists of all the concatenated histograms, which result in a feature of dimension \(\sum _{l=1}^L 256\,s\,N_l^x\,N_l^y\).

In our case, for script identification, \(L=2\) and \(s=4\) were heuristically chosen. At the first level, \(N_1^x=1\), \(N_1^y=1\), \(h_1=n_x\), and \(w_1=n_y\); at the second level, \(N_2^x=3\), \(N_2^y=3\), \(h_2=0.5\,n_x\), and \(w_2=0.5\,n_y\), so the 9 \((3\times 3)\) patches are 25% overlapped. Hence, the final feature vector dimension is 10,240. An example of the distribution of the patches is shown in Fig. 7 for a Gurmukhi word. This feature vector was implemented using the Scenes/Objects classification toolbox freely available on the MATLAB Central File Exchange.
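For clarity, the dimensionality follows directly from the formula above with these settings:

$$\begin{aligned} \sum _{l=1}^{2} 256\,s\,N_l^x\,N_l^y = 256\cdot 4\cdot (1\cdot 1) + 256\cdot 4\cdot (3\cdot 3) = 1024 + 9216 = 10{,}240. \end{aligned}$$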

Fig. 7 Example of the 12 overlapped patches on a word; red circles: patch centers

Classifier

A Support Vector Machine (SVM) was used as a classifier because of the large dimension of the feature vectors. An SVM is a popular supervised machine learning technique that performs an implicit mapping into a higher dimensional feature space. This is the so-called kernel trick. After the mapping is completed, the SVM finds a linear separating hyperplane with maximal margin to separate data from this higher dimensional space.

Least Squares Support Vector Machines (LS-SVM) are reformulations of standard SVMs that lead to solving a set of linear equations instead of a quadratic programming problem. Robustness, sparseness, and weightings can be imposed on LS-SVMs where needed, and a Bayesian framework with three levels of inference can then be applied [35].

Although new kernel functions are being proposed, the most frequently used kernel functions are the linear, polynomial, and Radial Basis Function (RBF). The present study uses the RBF kernel for LBP and Quad-Tree features and a linear kernel for Dense LBP.

An SVM or LS-SVM makes binary decisions; in this study, multi-class classification for script identification is carried out by adopting a one-against-all strategy. Grid searches with 2-fold cross-validation were carried out on the hyper-parameters to select them on the training sequence.
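As a replication aid, the following sketch implements a comparable classifier with scikit-learn in place of the LS-SVM toolbox actually used; the grid values are arbitrary assumptions, and the RBF kernel would be swapped for a linear one for the Dense Multi-Block LBP features:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_script_svm(X_train, y_train):
    """One-against-all RBF SVM with a 2-fold grid search over (C, gamma)."""
    base = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
    clf = OneVsRestClassifier(base)
    grid = {'estimator__svc__C': [1, 10, 100],
            'estimator__svc__gamma': ['scale', 0.01, 0.001]}
    search = GridSearchCV(clf, grid, cv=2, n_jobs=-1)
    search.fit(X_train, y_train)   # X_train: feature vectors, y_train: script labels
    return search.best_estimator_
```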

Deep Neural Network Architectures

Deep Neural Networks have demonstrated their potential in many computer vision tasks when sufficient data is available. They are used here to evaluate the usefulness of this new database for training deep learning architectures. The more than 30K labelled words included in this database constitute a valuable new resource for the scientific community.

Our experiments employed two popular state-of-the-art Convolutional Neural Network architectures based on the VGG [37] and ResNet [38] models. These architectures were chosen as examples of data-driven learning models employed in image classification challenges. During the last decade, deep convolutional neural networks have boosted the performance of computer vision applications, including text classification [39, 40]. The VGG architecture used in our experiments is based on traditional 2D convolutional layers. The ResNet model improves on traditional convolutional architectures by introducing residual connections between convolutional layers (i.e., shortcuts between layers). The residual connections improve the training process of the network, providing higher performance. The visual information of the strokes, such as directionality, curvature, frequency, or density, is critical for classifying the different scripts. In both cases (VGG and ResNet), the 2D convolutional filters learned during the training process have a great capacity to model such visual patterns.

Each input image is subsampled at the preprocessing step into 60\(\times\)60-pixel sub-images using a sliding window (50% overlap). In order to improve the generalization capability of the model, data augmentation techniques are applied (shear, zoom, width, and height shift).

The first architecture evaluated is a VGG-style architecture. This network is composed of two convolutional layers followed by one fully connected layer with dropout (0.25) and 13 units (softmax activation). The ReLU (Rectified Linear Unit) activation function was used in all hidden layers, and a max-pooling layer with a filter size of 2\(\times\)2 follows each convolutional layer. The first convolutional layer has 32 filters of size 3\(\times\)3 and stride 1, and the second convolutional layer has 64 filters of size 3\(\times\)3. This network comprises more than 3M parameters.
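A minimal Keras sketch of this VGG-style network, including the training configuration reported later; the grayscale 60×60 input is an assumption, and the exact layer ordering and parameter count may differ from the authors' implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_vgg_like(input_shape=(60, 60, 1), n_scripts=13):
    """Two convolutional blocks (32 and 64 filters of size 3x3, ReLU,
    2x2 max pooling) followed by dropout and a 13-way softmax output."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), strides=1, activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dropout(0.25),
        layers.Dense(n_scripts, activation='softmax'),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Training setup reported below: batch size 128, Adam (lr = 0.001),
# 30 epochs for handwritten samples and 10 for printed samples, with
# shear/zoom/shift augmentation on the 60x60 sub-images (50% overlap).
```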

The second architecture is a Residual Neural Network. This network comprises three convolutional blocks and a dense output layer (13 output units and softmax activation). The first convolutional block is composed of a convolutional layer (64 filters of size 7\(\times\)7) and a 3\(\times\)3 max-pooling layer (stride 2). The second and third blocks consist of identity and convolutional blocks. Our identity block includes a series of three convolutional layers with a bypass connection between the input of the identity block and the output of the third convolutional layer. The second convolutional block includes three convolutional layers (64, 64, and 256 filters of size 1\(\times\)1, 3\(\times\)3, and 1\(\times\)1, respectively), a convolutional shortcut layer (128 filters of size 1\(\times\)1), and two identity blocks (64, 64, and 256 filters of size 1\(\times\)1, 3\(\times\)3, and 1\(\times\)1, respectively) without the bypass connection. The third block has three convolutional layers (128, 128, and 512 filters of size 1\(\times\)1, 3\(\times\)3, and 1\(\times\)1, respectively) and three identity blocks with this same series of filters per convolution. Batch normalization and a ReLU activation function were employed after each convolutional layer. This network comprises more than 1.5M parameters.

The implementation details of the training for both architectures are as follows: batch size of 128, Adam optimizer with a 0.001 learning rate, random initialization of the weights, and 30 and 10 epochs for handwritten and printed samples, respectively (printed models converge faster than handwritten ones). Both VGG and ResNet models were trained from scratch (i.e., we did not use pre-trained models). The architectures (i.e., number of layers, number of neurons per layer, activation functions) and the hyperparameters (i.e., optimizer, batch size, epochs, etc.) presented in this work are the result of several experiments. During the experimentation, we prioritized the configuration with the best performance and the lowest number of parameters (i.e., fewer layers and neurons). Furthermore, we discarded the use of pre-trained models to guarantee a fair comparison between benchmarks (i.e., the same data was used to train all three benchmarks).

Benchmarking: Experiments

The benchmarking consists of classification experiments with the above-described techniques to estimate the script of a given document, line, or word among those included in the dataset. It should be borne in mind that the present benchmark attempts to measure the reach and range of the database with well-known state-of-the-art classifiers and that it does not aim to propose a new script identifier.

Three different benchmarks were constructed for this estimation. The first one uses a classifier based on a score-level combination of LBP and Quad-Tree features. The second one is based on Dense Multi-Block LBP features. It is worth pointing out that the combination of the LBP and Quad-Tree systems improves the performance by about 10%. Finally, the third is constructed with two popular Deep Neural Network (DNN) architectures. The three benchmarks are illustrated in Fig. 8; in the first benchmark, an LS-SVM is used for each of the two feature sets and their outputs are combined at the score level.

Fig. 8 Benchmarks constructed in the paper

Training Sequences

Defining the training sequences is paramount for a fair comparison of results. Thus, the classifiers for each printed and handwritten script should be trained as similarly as possible. However, a database of handwritten or printed documents is inherently unbalanced because each of its constituent documents contains a different number of lines, the lengths of the lines differ, and the word sizes differ between scripts. Therefore, training each classifier with a similar number of documents, lines, or words does not guarantee equal training conditions or a fair comparison of results. Consequently, instead of training each classifier with a given number of documents, lines, or words, it was decided to train them with a similar number of pixels. In this way, one classifier may be trained with 100 images and another with 150 if the training images of the second classifier contain less text than those of the first.

The primary reason is that our approach required training all classifiers with an equal amount of information. To quantify this information, we conducted tests using various entropy measures, such as Shannon entropy, on a subset of the database. Interestingly, our analysis revealed that the outcomes in terms of selecting the number of images for training and testing per script were comparable to the pixel count. Counting the number of pixels proved to be a more efficient and practical approach. Consequently, we opted to employ the pixel count as a criterion for determining the appropriate number of images to train each script.

In analyzing the database, it was heuristically decided to train each classifier with a number of images whose accumulated number of pixels would be approximately 2M. The numbers of documents used to train each classifier are shown in Table 4. This training sequence breaks down into the following proportions: 21.03% of handwritten words, 21.82% of handwritten lines, and 15.06% of handwritten documents. For the printed dataset, the training scenario comprised 51.06% of documents, 45.2% of lines, and 45.85% of words. Therefore, there is room for a statistically meaningful test. Further information about the training partition of the dataset can be found in Tables 13, 14, and 15 in the appendix of this article.

To ensure experimental repeatability, the training and test sequences were predetermined. The training images appear first and in numerical order (e.g., the first 18 Devanagari handwritten documents, the first 256 Arabic printed lines, or the first 1608 Bengali printed words), and the remaining images are used for testing.
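The pixel-budget rule can be sketched as follows; whether the count refers to all pixels or to text pixels only is our assumption, and the 2M figure comes from the paragraphs above:

```python
def select_training_images(image_pixel_counts, pixel_budget=2_000_000):
    """Take images in numerical file order until the accumulated number of
    pixels reaches the budget; the remaining images form the test sequence."""
    selected, total = [], 0
    for idx, n_pixels in enumerate(image_pixel_counts):
        if total >= pixel_budget:
            break
        selected.append(idx)
        total += n_pixels
    return selected
```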

Therefore, there were the following six training sequences: printed documents, printed lines, printed words, handwritten documents, handwritten lines, and handwritten words. Similarly, there were six testing sequences: printed documents, printed lines, printed words, handwritten documents, handwritten lines, and handwritten words. The twelve sequences were disjoint. It should also be noted that all the reported experiments were tested separately with each of these testing sequences.

Description of the Tasks

Each benchmark was evaluated by performing three tasks, which depend on the different training sequences and are summarized in Table 3. The test sequence is the same for each task, which is composed of six different data types: printed documents, printed lines, printed words, handwritten documents, handwritten lines, and handwritten words. These were the remaining specimens of the database, which were not used in training, as shown in Table 4.

Table 3 Description of the tasks per benchmark
Table 4 Number of documents, lines and words for training

TASK 1: This task aims to study the behavior of the database at the document, line, and word levels [41] for printed and handwritten documents separately. Hence, each classifier is oriented to a specific type of document (document, line, or word and printed or handwritten) per script.

Evaluation protocol of task 1: It requires as many classifiers as combinations of script and image type (document, line, and word, in both printed and handwritten modalities). As the database includes handwritten and printed specimens, the total number of classifiers used in this task is \(13\times 3\times 2=78\). These were individually trained with the number of images indicated in Table 4. Once the remaining images are tested, a \(13\times 13\) confusion matrix is worked out, with the performance given in percent (%) for each type of image per script. The final identification performance is then obtained as the average of the main diagonal of this matrix.

TASK 2: This task aims to study the database behavior when the script classifier is oriented to being printed or handwritten, regardless of the type of document. Consequently, the training of a particular classifier will include documents, lines, and words of a specific script and type of document: printed or handwritten.

Evaluation protocol of task 2: In this task, each classifier was trained with three training sequences of handwritten or printed documents for each script. In total, \(13\times 2=26\) classifiers were trained. It should be noted that the training words belong to the training lines, which in turn correspond to the training documents. The trained classifiers were tested with the six types of testing images, regardless of their type and modality. Then \(13\times 13\) confusion matrices were obtained in each case for each script, which were averaged in the same terms as in Task 1. Following the same strategy as task 1, the main diagonal values were averaged from the script confusion matrices to obtain the final performance.

TASK 3: The goal of this task is to study the database behavior independently of the input to the script classifier. This can be a printed or handwritten document or line or word.

Evaluation protocol of task 3: It requires 13 classifiers, one per script, which are trained with all types of documents, both printed and handwritten. After testing, the final performance is obtained in the same way as in tasks 1 and 2.

Used Metrics

To evaluate the experiments, we utilize Cumulative Matching Curves (CMC) [42], which measure the effectiveness of a recognition system in ranking correct matches against incorrect ones. The rank corresponds to the position at which the correct match is found within a list of potential matches. The accuracy values presented in the article’s tables correspond to the rank-1 in the CMC curve.
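A short sketch of how the CMC values (and hence the rank-1 accuracies) can be computed from a matrix of classifier scores:

```python
import numpy as np

def cmc(scores, true_labels, max_rank=13):
    """Cumulative Matching Curve: fraction of test samples whose correct
    script is within the top-k ranked scores; rank 1 equals the accuracy.
    scores: (n_samples, n_scripts) array; true_labels: (n_samples,) indices."""
    order = np.argsort(-scores, axis=1)                       # best score first
    ranks = np.argmax(order == true_labels[:, None], axis=1)  # 0-based rank of the true script
    return np.array([(ranks < k).mean() for k in range(1, max_rank + 1)])
```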

Experimental Results

In this section, the results of the different benchmarks are provided to give a comparative view of the outcomes of our experiments.

Benchmark 1: Handcrafted Feature Combination (LBP+quad-tree)

In the present benchmark, the classifier combines two script identifiers at the score level. The first script identifier is based on LBP features and a Support Vector Machine, while the second relies on Quad-Tree features and a Support Vector Machine. The score-level combination is carried out by weighting each score at 50%. The following are the three experiments conducted in this benchmark.

Table 5 displays the Hit Ratio of each script identifier in Benchmark 1 for the three tasks and the different training and test options.

For task 1, there are six options in training and six options in testing, which amounts to 36 different experiments, shown in Table 5. Their CMC curves are depicted in Fig. 9.

Table 5 Hit Ratio of each script identifier in Benchmark 1. The best performances for each task and training option are highlighted in bold. The results are obtained by combining two script identifiers at the score level: LBP features and a Support Vector Machine with Quad-Tree features and a Support Vector Machine
Fig. 9 CMC curves of the three tasks of Benchmark 1. These CMC curves correspond to the results in bold in Table 5

As expected, the performance with printed text was better than that with handwritten text, probably because of the lower variability in the printed text. Also, the line-based test offered the best performance, possibly because lines contain enough information laid out in a straightforward structure. Indeed, for both the printed and handwritten document cases, the classifiers trained with words work better with lines than with words. This could be because line features are more stable than word features.

There is a significant decrease in the hit ratio when the training and testing images do not belong to the same case. For this reason, it was decided to train the classifier with documents, lines, and words (task 2) to build a classifier more robust to the input type: document or line or word.

In the case of task 2, according to the evaluation protocol, 12 results were obtained and are given in Table 5, while the CMC curves are shown in Fig. 9.

On average, the result of the second experiment, i.e., the procedure for training a classifier for printed and handwritten text, including all documents, lines, and words from the training sequence, gives a better performance than for the first experimental protocol. Similar to the first experiment, the best results were obtained when testing with lines.

Moving on to task 3, the results of the six experiments carried out are given in Table 5, and the CMC curves are shown in Fig. 9.

Similar trends are found in the results: the best result is obtained at the line level, while printed text outperforms the handwritten scenario.

A confusion matrix is shown in Table 6. The main confusions seen here are between Kannada and Telugu, Telugu and Bengali, Gujarati and Thai, and Oriya and Bengali, as shown in Fig. 10.

We have prioritized Task 3 as it yielded the best results and is considered the most valuable. Therefore, we have conducted a detailed analysis of this specific task. Additionally, the analysis of the confusion matrices for tasks 1 and 2 led to similar conclusions.

Table 6 Confusion Matrix of Benchmark 1 Task 3 for Handwritten Lines, represented as a percentage of the accuracy rate
Fig. 10 Samples of the most confused scripts in Benchmark 1, Task 3. Arrows indicate the most common confusions

Benchmark 2: Handcrafted Feature (Dense Multi-Block LBP)

The second benchmark uses an SVM classifier with Dense Multi-Block LBP features. The three experiments performed in the previous benchmark were repeated here. All results from the second benchmark are given in Table 7, which, similarly to Benchmark 1, presents the Hit Ratio of each script identifier for the three tasks and the different training and test options.

Table 7 Hit Ratio of each script identifier in Benchmark 2. The best performances for each task and training option are highlighted in bold. The results are obtained using the SVM classifier with Dense Multi-Block LBP features

Regarding task 1, and similarly to Benchmark 1, the performance with printed text was better than with handwritten text because of the lower intra-class variability in the printed text. Moreover, the performance at the line level was better than at the document and word levels. Besides, in the cross-document scenario, a pattern similar to Benchmark 1 can be seen. On the other hand, the best results were obtained when training with printed text and testing with handwritten text. Overall, Benchmark 2 achieved better results than Benchmark 1.

In task 2, a pattern of results similar to Benchmark 1 was found, and the results achieved were mostly better than those of Benchmark 1.

In the third task of Benchmark 2, a similar pattern was again found, with better accuracy than in Benchmark 1.

Benchmark 3: Deep Neural Networks

The third benchmark was carried out with the above-mentioned DNN architectures. For a fair comparison, the experimental protocol proposed for the previous benchmarks was repeated. All the results obtained for this third benchmark are included in Table 9. The ResNet architecture clearly outperforms the VGG architecture with a performance improvement of 2-4% and 10% for printed and handwritten samples, respectively. The rest of the analysis will be focused on the performance of the ResNet model.

Task 1 with Deep Neural Networks showed a very competitive performance for printed samples. The results obtained outperformed the previous benchmarks for printed data. As in previous experiments, lines showed the best performance, followed by words and documents. When large databases are available, deep representations are capable of achieving almost 99% accuracy for printed patterns.

The performance obtained for handwritten samples was similar to that of Benchmarks 1 and 2. The gap between the performance obtained for printed samples and handwritten samples is caused by the large intra-class variability of the writers. The Deep Neural Networks are unable to reach a good generalization because of this larger variability. There is room for improvement, and training deep representations capable of dealing with writer variability is a key challenge in this area. MDIW-13 provides an extensive multi-lingual database to train and evaluate such models.

For the second task, Deep Neural Networks achieved the best performances with printed samples. Once again, the performance obtained for handwritten samples was poor in comparison with the other two benchmarks. The larger number of samples used here produced a slight improvement for printed samples.

In task 3, unlike the previous benchmarks, the results in the printed case did not improve, and there was a clear drop in performance in the handwritten case. These results suggest that handwritten and printed models should be trained separately for Deep Neural Networks. As noted before, writer variability is not well modelled by the DNNs. Comparing the three benchmarks, it is therefore clear that the best training strategy depends on the classifier and the features.

Finally, Table 8 compares the performance achieved by the two Deep Neural Network architectures evaluated. The Hit Ratio for each task and type of sample was obtained by averaging the Hit Ratios obtained when the training and test samples belong to the same class (e.g., handwritten documents). The averaged results in Table 8 correspond to the results highlighted in bold in Table 9. Similarly to Benchmarks 1 and 2, Table 9 displays the Hit Ratio of each script identifier in Benchmark 3 for the three tasks and the different training and test options. The results show the superior performance of the ResNet architecture, with a performance improvement of around 10% for handwritten experiments and 2-4% for experiments with printed samples. These results encourage us to find new Deep Neural Network architectures capable of modelling the variability in handwritten classification.

Table 8 Comparison of Hit Ratio for VGG and ResNet Architectures in Benchmark 3. Accuracies are obtained averaging the results marked with bold font in Table 9
Table 9 Hit Ratio of each script identifier in Benchmark 3 - ResNet Model. The best performances for each task and training option are highlighted in bold

Discussion

Globally speaking, this paper aimed to introduce a new multi-lingual and multi-script database that allows the development of new algorithms and applications, together with a simple and easily reproducible benchmark to facilitate comparison [43, 44].

The benchmarking reveals new possibilities for using the database. For instance, the division into documents, lines, and words enables training a script model at one level, for instance lines, and testing it at another, for instance words. The results obtained show that the technology requires improvement, owing to the lack of generalization of the identifiers when moving the test from one level, e.g., words, to another, e.g., documents.

Furthermore, the benchmarking highlights an interesting direction: training the model with images from all levels and testing with images of different levels. The model with the best identification rates at the three levels in the three conducted experiments is the one trained with documents, lines, and words combined. This suggests that general identifiers at the three levels are possible and indicates how to train them in practical applications, even if the lines are obtained from the documents and the words from the lines, or if artificial lines or documents are built up from words or lines.

In contrast, a global model for printed and handwritten text is still far from achieving reasonable results, mainly in the case of deep learning [45], at least with the well-established classifiers used in this work.

Regarding the benchmark, the idea of a simple and easily reproducible benchmark to facilitate comparison has its limitations. To this aim, training and testing sets have been defined, together with the procedures used to calculate the parameters and implement the classifiers. This favours reproducible research, since the methods used are easy to find in free scientific software packages. From now on, developing new state-of-the-art script identifiers and improving the database partition are tasks left to the researchers attracted by this new public database. Further work should explore novel data-driven learning frameworks. This research line includes novel architectures as well as new learning frameworks, including synthetic data to improve the generalization capacity of the models (e.g., Generative Adversarial Networks or Variational Autoencoders). Obviously, this database can also be enlarged with new scripts and more samples per script to make it more appealing.

Conclusion

A new multi-lingual and multi-script dataset (MDIW-13) for script identification, including printed and handwritten documents and freely distributed, is introduced in this paper. The handwritten part was collected from letters or notes written by volunteers living in the regions where the scripts are natively used. These volunteers scanned their documents and sent them in by e-mail. The printed samples were obtained from local newspapers and magazines and contain different fonts and sizes as well as cursive and bold text. The printed documents were scanned at 300 dpi.

Because the database targeted script identification tasks, the document background was converted to white, and the text ink equalized to avoid watermarks due to the local paper or ink textures, which could bias the results of a script identifier. This procedure was manually monitored.

MDIW-13 allows experiments with script identification at different levels (e.g., document, lines, and words). To this aim, the lines of each document were extracted from the documents and the words from the lines.

Three benchmarks were conducted. The first one relies on local descriptors such as LBP and Quad-Tree histograms with an SVM. The second one is based on Dense Multi-Block LBPs and produces excellent results due to their multi-scale and denser spatial description. The third benchmark is based on two Deep Neural Network architectures. The benchmark includes results at the document, line, and word levels, in addition to providing results for handwritten and printed text. Finally, results are also given for a script identifier that is independent of whether the text is handwritten or printed.

It is expected that this new multi-lingual database will elicit new script identifiers, open the door to new problems, such as writer-dependent or writer-independent script identification with the handwritten part of the dataset, artistic multi-character script identification [46], or advanced algorithms for segmenting handwritten and printed images, and allow new insights into script identification. The different scenarios in the present study, including handwritten and printed samples, reveal numerous challenges. The results reported for the three benchmarks could serve as a baseline for further research in script identification.

Future work with this database might include, but is not limited to: i) the analysis of hybrid models based on both statistical approaches and deep features; ii) the use of novel architectures (e.g., CNN-LSTM, VAE) to incorporate context into the learning process of visual features; and iii) the application of domain adaptation techniques to employ pre-trained models that take advantage of embedding spaces learned from similar domains (e.g., text classification).