Part I
Logical Structure and Segmentation
Chapter 1
Logical Structure Extraction from Digitized Books
Antoine Doucet
1.1 Introduction
Mass digitization projects, such as the Million Book Project, the efforts of the Open Content Alliance, and the digitization work of Google, are converting whole libraries by digitizing books on an industrial scale [5]. The process involves efficiently photographing books page by page and converting the image of each page into searchable text using optical character recognition (OCR) software.
Current digitization and OCR technologies typically produce the full text of digitized books with only minimal structure information. Pages and paragraphs are usually identified and marked up in the OCR, but more sophisticated structures, such as chapters, sections, etc., are not recognized. In order to enable systems to provide users with richer browsing experiences, it is necessary to make such additional structures available, for example, in the form of XML markup embedded in the full text of the digitized books.
The Book Structure Extraction competition aims to address this need by promoting research into automatic structure recognition and extraction techniques that could complement or enhance current OCR methods and lead to the availability of rich structure information for digitized books. Such structure information can then be used to aid user navigation inside books as well as to improve search performance [35].
The chapter is structured as follows. We start by placing the competition in the context of the work conducted at the Initiative for the Evaluation of XML Retrieval (INEX) Evaluation Forum [22]. We then describe the setup of the competition, including its goals and the task set for its participants. The book collection used in the task is also detailed. We next describe the ground-truth creation process and its outcome, together with the evaluation metrics used and the final results, alongside brief descriptions of the participants’ approaches. We conclude with a summary of the competition and how it could be built upon.
1.1.1 Background
Motivated by the need to foster research in areas relating to large digital book repositories (see, e.g., [21]), the Book Track was launched in 2007 [22] as part of INEX. Founded in 2002, INEX is an evaluation forum that investigates focused retrieval approaches [14], in which structure information is used to aid the retrieval of those parts of documents that are relevant to a search query. Focused retrieval over books presents a clear benefit to users, enabling them to gain direct access to the parts of books (of potentially hundreds of pages in length) that are relevant to their information needs.
One major limitation of digitized books is the fact that their structure is physical, rather than logical. As a consequence, the evaluation and relevance judgments built on the book corpus have essentially been restricted to whole books and selections of pages. This is unfortunate, considering that books seem to be the key application field for structured information retrieval (IR). The fact that, for instance, chapters, sections, and paragraphs are not readily available has been a source of frustration for the structured IR community gathered at INEX, because it prevents us from testing the techniques developed for collections of scientific articles and for Wikipedia.
Unlike digitally born content, the logical structure of digitized books is not readily available. A digitized book is often only split into pages, with possibly some paragraph, line, and word markup. This was also the case for the 50,000-book digitized collection of the INEX Book Search track [22]. The use of more meaningful structure, e.g., chapters, the table of contents (ToC), the bibliography, or the back-of-book index, to support focused retrieval has been explored for many years at INEX and has been shown to increase retrieval performance [35].
To encourage research aiming to provide the logical structure of digitized books, we created the Book Structure Extraction competition, which we later brought to the community of document analysis.
Starting in 2008, within the second round of the INEX Book Track, we created from scratch the entire methodology for evaluating the structure extraction process from digitized books: problem description, submission procedure, annotation procedure (and the corresponding software), metrics, and evaluation.
1.1.2 Context and Motivation
The overall goal of the INEX Book Track is to promote interdisciplinary research investigating techniques for supporting users in reading, searching, and navigating the full texts of digitized books and to provide a forum for the exchange of research ideas and contributions. In 2007, the Track focused on IR tasks [24].
However, since the collection was made of digitized books, the only structure that was readily available was that of pages, each page being easily identified from the fact that it corresponds to one and only one image file, as a result of the scanning process. In addition, a few other elements can easily be detected through OCR, as can be seen with the DjVu file format (an example of which is given in Figure 1.1). This markup denotes pages, words (detected as regions of text separated by horizontal space), lines (regions of text separated by vertical space), and “paragraphs” (regions of text separated by a significantly wider vertical space than other lines). These paragraphs, however, are only defined as internal regions of a page (by definition, they cannot span multiple pages).
Hence, there is a clear gap to be filled between research in structured IR, which relies on a logical structure (chapters, sections, etc.), and the digitized book collection, which contains only the physical structure. From a cognitive point of view, retrieving book pages may be sensible with a paper book, but it makes little sense with a digital book. The BookML format, of which we give an example in Figure 1.2, is a better attempt to capture the logical structure of books, but it remains clearly insufficient.
Figure 1.1. A sample DjVu XML document.
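To make this markup concrete, the following minimal Python sketch reconstructs per-page paragraph text from a DjVu XML file. It assumes the element and attribute names described above (OBJECT for a page, PARAM with name="PAGE" for the physical page counter, and PARAGRAPH, LINE, and WORD for the text regions); actual OCR exports may differ slightly.

import xml.etree.ElementTree as ET

def djvu_pages(path):
    # Yield (physical_page_counter, list_of_paragraph_texts) for each page.
    tree = ET.parse(path)
    for page in tree.iter("OBJECT"):          # one OBJECT element per scanned page
        # The physical page counter is stored in <PARAM name="PAGE" value="..."/>.
        page_id = next(
            (p.get("value") for p in page.iter("PARAM") if p.get("name") == "PAGE"),
            None,
        )
        paragraphs = []
        for para in page.iter("PARAGRAPH"):   # paragraphs never span pages
            lines = [
                " ".join(word.text or "" for word in line.iter("WORD"))
                for line in para.iter("LINE")
            ]
            paragraphs.append(" ".join(lines))
        yield page_id, paragraphs

Because the paragraphs are, by construction, confined to a single page, any structure that spans pages (a chapter, a section) has to be inferred on top of such output; this is precisely the gap that the Structure Extraction task addresses.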
1.1.2.1 Structured Information Retrieval Requires Structure
In the context of e-readers, even the concept of a page becomes questionable: What are pages, if not a practical arrangement to avoid printing a book on a single five-square-meter sheet of paper? For the moment, however, users still seem attached to the concept of a page, mostly as a convenient marker of “where did I stop last?”; but once they can bookmark any word, line, or fragment of a book, how long will they continue to bookmark pages?
It is important to remember that books as we know them are only one step in the history of reading devices, starting with the papyrus, a very long scroll containing a single sequence of columns of text, used for three millennia until the Roman codex introduced the concept of a page. The printing press in the 15th century allowed the shift from manual to mechanical copying, bringing books to the masses [36]. In terms of reading devices, having already switched from papyrus to paper, we are now living through another dramatic change, from the paper to the digital format; it is to be expected that the incidental constraints of the paper format will disappear in the long run. All physical structure is bound to disappear or become highly unstable. For instance, should pages remain, their content will vary every time the font size is changed, something that most e-readers allow.
Figure 1.2. A sample BookML document.
What will remain, however, is the logical structure, whose raison d’être is not practical convenience but an editorial choice made by the author to structure the work and to facilitate readers’ access. Unfortunately, it is exactly this part of the structure that the digitized book collection of INEX lacked. On the one hand, the collection seemed an ideal framework for structured IR; on the other, its logical structure was hardly usable. This motivated the design of the Book Structure Extraction competition, to bridge the gap between digitized books and the (structured) IR research community.
1.1.2.2 Context
In 2008, during the second year of the INEX Book Track, the Book Structure Extraction task was introduced [25] and set up with the aim of evaluating automatic techniques for deriving structure from the OCR texts and page images of digitized books.
The first round of the Structure Extraction task was run as a “beta” in 2008 and made it possible to set up the evaluation infrastructure, including guidelines, tools to generate ground-truth data, evaluation measures, and a first test set of 100 books built by the organizers. The second round was run both at INEX 2009 [26] and at the International Conference on Document Analysis and Recognition (ICDAR) [9], where it was accepted as an official competition. This allowed us to reach the document analysis community and bring a bigger audience to the effort, while inviting competitors to present their approaches at the INEX workshop. It further allowed us to build on the established infrastructure with an extended test set and a collaborative annotation procedure that greatly reduced the effort needed to build the ground truth. The competition was run again in 2010 at INEX [27] and in 2011 and 2013 at ICDAR [11, 12] (INEX runs every year, whereas ICDAR runs every second year).
In the next section, we will describe the full methodology that we put in place from scratch to evaluate the performance of Book Structure Extraction systems, as well as the challenges and contributions that this work involved.
1.2 Book Collection
The INEX Book Search corpus contains 50,239 digitized, out-of-copyright books, provided by Microsoft Live Search and the Internet Archive [22]. It consists of books of different genres, including history books, biographies, literary studies, religious texts and teachings, reference works, encyclopedias, essays, proceedings, novels, and poetry.
Each book is available in three different formats: image files in the portable document format (PDF); DjVu XML, containing the OCR text and basic structure markup as illustrated in Figure 1.1; and BookML, containing more elaborate structure constructed from the OCR and illustrated in Figure 1.2.
DjVu format. An <OBJECT> element corresponds to a page in a digitized book. A page counter, corresponding to the physical page number, is embedded in the @value attribute of the <PARAM> element that has the @name=“PAGE” attribute. The logical page numbers (as printed inside the book) can sometimes be found in the header or footer part of a page. Note, however, that headers/footers are not explicitly recognized in the OCR, i.e., the first paragraph on a page may be a header and the last one or more paragraphs may be part of a footer. Depending on the book, headers may include chapter/section titles and logical page numbers (although due to OCR error, the page number is not a...
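As a rough, purely illustrative sketch of how a logical page number might be recovered under these constraints, the heuristic below (hypothetical, not the method of any particular participant) inspects the first paragraph of a page, treated as a header candidate, for an isolated short number. OCR noise and footer-based numbering will defeat it, which is precisely why more robust structure extraction techniques are needed.

import re

def guess_logical_page_number(first_paragraph):
    # A printed page number often appears as an isolated short integer at the
    # beginning or end of the header line. Purely illustrative heuristic:
    # OCR errors (e.g., "1" read as "l") are not handled here.
    tokens = first_paragraph.split()
    for token in tokens[:1] + tokens[-1:]:
        if re.fullmatch(r"\d{1,4}", token):
            return int(token)
    return None

# Usage, building on the djvu_pages sketch given earlier:
# for physical_id, paragraphs in djvu_pages("book.xml"):
#     logical = guess_logical_page_number(paragraphs[0]) if paragraphs else None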