Part I
Logical Structure and Segmentation
Chapter 1
Logical Structure Extraction from Digitized Books
Antoine Doucet
1.1 Introduction
Mass digitization projects, such as the Million Book Project, the efforts of the Open Content Alliance, and the digitization work of Google, are converting whole libraries by digitizing books on an industrial scale [5]. The process involves efficiently photographing books page by page and converting the image of each page into searchable text using optical character recognition (OCR) software.
Current digitization and OCR technologies typically produce the full text of digitized books with only minimal structure information. Pages and paragraphs are usually identified and marked up in the OCR, but more sophisticated structures, such as chapters, sections, etc., are not recognized. In order to enable systems to provide users with richer browsing experiences, it is necessary to make such additional structures available, for example, in the form of XML markup embedded in the full text of the digitized books.
The Book Structure Extraction competition aims to address this need by promoting research into automatic structure recognition and extraction techniques that could complement or enhance current OCR methods and lead to the availability of rich structure information for digitized books. Such structure information can then be used to aid user navigation inside books as well as to improve search performance [35].
The chapter is structured as follows. We start by placing the competition in the context of the work conducted at the Initiative for the Evaluation of XML Retrieval (INEX) Evaluation Forum [22]. We then describe the setup of the competition, including its goals and the task set for its participants. The book collection used in the task is also detailed. We next describe the ground-truth creation process and its outcome, together with the evaluation metrics used and the final results, alongside brief descriptions of the participants’ approaches. We conclude with a summary of the competition and how it could be built upon.
1.1.1 Background
Motivated by the need to foster research in areas relating to large digital book repositories (see, e.g., [21]), the Book Track was launched in 2007 [22] as part of INEX. Founded in 2002, INEX is an evaluation forum that investigates focused retrieval approaches [14], in which structure information is used to aid the retrieval of those parts of documents that are relevant to a search query. Focused retrieval over books presents a clear benefit to users, enabling them to gain direct access to the parts of books (of potentially hundreds of pages in length) that are relevant to their information needs.
One major limitation of digitized books is the fact that their structure is physical, rather than logical. As a consequence, the evaluation and relevance judgments built on the book corpus have essentially been restricted to whole books and selections of pages. This is unfortunate, considering that books seem to be the key application field for structured information retrieval (IR). The fact that, for instance, chapters, sections, and paragraphs are not readily available has been a source of frustration for the structured IR community gathered at INEX, because it prevents us from testing the techniques developed for collections of scientific articles and for Wikipedia.
Unlike digitally born content, the logical structure of digitized books is not readily available. A digitized book is often only split into pages, with possibly some paragraph, line, and word markup. This was also the case for the 50,000-book digitized collection of the INEX Book Search track [22]. The use of more meaningful structure, e.g., chapters, the table of contents (ToC), the bibliography, or the back-of-book index, to support focused retrieval has been explored for many years at INEX and has been shown to increase retrieval performance [35].
To encourage research aiming to provide the logical structure of digitized books, we created the Book Structure Extraction competition, which we later brought to the community of document analysis.
Starting in 2008, within the second round of the INEX Book Track, we created from scratch the entire methodology for evaluating the structure extraction process from digitized books: problem description, submission procedure, annotation procedure (and the corresponding software), metrics, and evaluation.
1.1.2 Context and Motivation
The overall goal of the INEX Book Track is to promote interdisciplinary research investigating techniques for supporting users in reading, searching, and navigating the full texts of digitized books and to provide a forum for the exchange of research ideas and contributions. In 2007, the Track focused on IR tasks [24].
However, since the collection was made of digitized books, the only structure that was readily available was that of pages, each page being easily identified from the fact that it corresponds to one and only one image file, as a result of the scanning process. In addition, a few other elements can easily be detected through OCR, as can be seen with the DjVu file format (an example of which is given in Figure 1.1). This markup denotes pages, words (detected as regions of text separated by horizontal space), lines (regions of text separated by vertical space), and “paragraphs” (regions of text separated by a significantly wider vertical space than other lines). These paragraphs, however, are only defined as internal regions of a page (by definition, they cannot span multiple pages).
Hence, there is a clear gap to be filled between research in structured IR, which relies on a logical structure (chapters, sections, etc.), and the digitized book collection, which contains only the physical structure. From a cognitive point of view, retrieving book pages may be sensible with a paper book, but it makes little sense with a digital book. The BookML format, of which we give an example in Figure 1.2, is a better attempt to capture the logical structure of books, but it remains clearly insufficient.
Figure 1.1. A sample DjVu XML document.
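To make this markup concrete, the following minimal Python sketch reconstructs per-page paragraph text from a DjVu XML file. It assumes the element and attribute names described above (OBJECT for a page, PARAM with name="PAGE" for the physical page counter, and PARAGRAPH, LINE, and WORD for the text regions); actual OCR exports may differ slightly.

import xml.etree.ElementTree as ET

def djvu_pages(path):
    # Yield (physical_page_counter, list_of_paragraph_texts) for each page.
    tree = ET.parse(path)
    for page in tree.iter("OBJECT"):          # one OBJECT element per scanned page
        # The physical page counter is stored in <PARAM name="PAGE" value="..."/>.
        page_id = next(
            (p.get("value") for p in page.iter("PARAM") if p.get("name") == "PAGE"),
            None,
        )
        paragraphs = []
        for para in page.iter("PARAGRAPH"):   # paragraphs never span pages
            lines = [
                " ".join(word.text or "" for word in line.iter("WORD"))
                for line in para.iter("LINE")
            ]
            paragraphs.append(" ".join(lines))
        yield page_id, paragraphs

Because the paragraphs are, by construction, confined to a single page, any structure that spans pages (a chapter, a section) has to be inferred on top of such output; this is precisely the gap that the Structure Extraction task addresses.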
1.1.2.1 Structured Information Retrieval Requires Structure
In the context of e-readers, even the concept of a page becomes questionable: What are pages, if not a practical arrangement to avoid printing a book on a single five-square-meter sheet of paper? For the moment, however, users still seem attached to the concept of a page, mostly as a convenient marker of “where did I stop last?”; but once they can bookmark any word, line, or fragment of a book, how long will they continue to bookmark pages?
It is important to remember that books as we know them are only one step in the history of reading devices, starting with the papyrus, a very long scroll containing a single sequence of columns of text, used for three millennia until the Roman codex introduced the concept of a page. The printing press in the 15th century allowed the shift from manual to mechanical copying, bringing books to the masses [36]. In terms of reading devices, having already switched from papyrus to paper, we are now living through another dramatic change, from the paper to the digital format; it is to be expected that the incidental constraints of the paper format will disappear in the long run. All physical structure is bound to disappear or become highly unstable. For instance, should pages remain, their content will vary every time the font size is changed, something that most e-readers allow.
Figure 1.2. A sample BookML document.
What will remain, however, is the logical structure, whose raison d’être is not practical convenience but an editorial choice made by the author to structure the work and to facilitate readers’ access. Unfortunately, it is exactly this part of the structure that the digitized book collection of INEX lacked. On the one hand, the collection seemed an ideal framework for structured IR; on the other, its logical structure was hardly usable. This motivated the design of the Book Structure Extraction competition, to bridge the gap between digitized books and the (structured) IR research community.
1.1.2.2 Context
In 2008, during the second year of the INEX Book Track, the Book Structure Extraction task was introduced [25] and set up with the aim of evaluating automatic techniques for deriving structure from the OCR texts and page images of digitized books.
The first round of the Structure Extraction task was run as a “beta” in 2008 and made it possible to set up the evaluation infrastructure, including guidelines, tools to generate ground-truth data, evaluation measures, and a first test set of 100 books built by the organizers. The second round was run both at INEX 2009 [26] and at the International Conference on Document Analysis and Recognition (ICDAR) [9], where it was accepted as an official competition. This allowed us to reach the document analysis community and bring a bigger audience to the effort, while inviting competitors to present their approaches at the INEX workshop. It further allowed us to build on the established infrastructure with an extended test set and a collaborative annotation procedure that greatly reduced the effort needed to build the ground truth. The competition was run again in 2010 at INEX [27] and in 2011 and 2013 at ICDAR [11, 12] (INEX runs every year, whereas ICDAR runs every second year).
In the next section, we will describe the full methodology that we put in place from scratch to evaluate the performance of Book Structure Extraction systems, as well as the challenges and contributions that this work involved.
1.2 Book Collection
The INEX Book Search corpus contains 50,239 digitized, out-of-copyright books, provided by Microsoft Live Search and the Internet Archive [22]. It consists of books of different genres, including history books, biographies, literary studies, religious texts and teachings, reference works, encyclopedias, essays, proceedings, novels, and poetry.
Each book is available in three different formats: image files in the portable document format (PDF); DjVu XML, containing the OCR text and basic structure markup as illustrated in Figure 1.1; and BookML, containing more elaborate structure constructed from the OCR and illustrated in Figure 1.2.
DjVu format. An <OBJECT> element corresponds to a page in a digitized book. A page counter, corresponding to the physical page number, is embedded in the @value attribute of the <PARAM> element that has the @name=“PAGE” attribute. The logical page numbers (as printed inside the book) can sometimes be found in the header or footer part of a page. Note, however, that headers/footers are not explicitly recognized in the OCR, i.e., the first paragraph on a page may be a header and the last one or more paragraphs may be part of a footer. Depending on the book, headers may include chapter/section titles and logical page numbers (although due to OCR error, the page number is not a...
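As a rough, purely illustrative sketch of how a logical page number might be recovered under these constraints, the heuristic below (hypothetical, not the method of any particular participant) inspects the first paragraph of a page, treated as a header candidate, for an isolated short number. OCR noise and footer-based numbering will defeat it, which is precisely why more robust structure extraction techniques are needed.

import re

def guess_logical_page_number(first_paragraph):
    # A printed page number often appears as an isolated short integer at the
    # beginning or end of the header line. Purely illustrative heuristic:
    # OCR errors (e.g., "1" read as "l") are not handled here.
    tokens = first_paragraph.split()
    for token in tokens[:1] + tokens[-1:]:
        if re.fullmatch(r"\d{1,4}", token):
            return int(token)
    return None

# Usage, building on the djvu_pages sketch given earlier:
# for physical_id, paragraphs in djvu_pages("book.xml"):
#     logical = guess_logical_page_number(paragraphs[0]) if paragraphs else None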