Document Processing Using Machine Learning
  1. 168 pages
  2. English
eBook - ePub

About this book

Document Processing Using Machine Learning presents a set of resources for students and researchers who apply machine learning in the document image analysis (DIA) domain, and it covers multiple document processing problems. Starting with an explanation of how artificial intelligence (AI) plays an important role in this domain, the book then discusses how different machine learning algorithms can be applied to classification/recognition and clustering problems regardless of the type of input data: images or text.

In brief, the book offers comprehensive coverage of the most essential topics, including:

· The role of AI for document image analysis

· Optical character recognition

· Machine learning algorithms for document analysis

· Extreme learning machines and their applications

· Mathematical foundation for Web text document analysis

· Social media data analysis

· Modalities for document dataset generation

This book serves both undergraduate and graduate scholars in Computer Science/Information Technology/Electrical and Computer Engineering. Further, it is a great fit for early career research scientists and industrialists in the domain.

Document Processing Using Machine Learning, by Sk Md Obaidullah, KC Santosh, Teresa Goncalves, Nibaran Das and Kaushik Roy, is available in PDF and/or ePUB format and is listed under Computer Science & Computer Science General.

1

Artificial Intelligence for Document Image Analysis

Himadri Mukherjee, Payel Rakshit, Ankita Dhar, Sk Md Obaidullah, KC Santosh, Santanu Phadikar and Kaushik Roy
CONTENTS
1.1 Introduction
1.2 Optical Character Recognition
1.2.1 Dealing with Noise
1.2.2 Segmentation
1.2.3 Applications
1.2.3.1 Legal Industry
1.2.3.2 Banking
1.2.3.3 Healthcare
1.2.3.4 CAPTCHA
1.2.3.5 Automatic Number Recognition
1.2.3.6 Handwriting Recognition
1.3 Natural Language Processing
1.3.1 Tokenization
1.3.2 Stop Word Removal
1.3.3 Stemming
1.3.4 Part of Speech Tagging
1.3.5 Parsing
1.3.6 Applications
1.3.6.1 Text Summarization
1.3.6.2 Question Answering
1.3.6.3 Text Categorization
1.3.6.4 Sentiment Analysis
1.3.6.5 Word Sense Disambiguation
1.4 Conclusion
References

1.1 Introduction

Rapid technological development has aided the digitization of documents, and the number of digital documents has increased significantly over time [1, 2]. Information is now easily available on the Internet and can be distributed with ease. Such large volumes of documents demand efficient processing. Digitized documents can be broadly categorized into two types, namely document images and text documents. In the case of document images, it is first essential to recognize what is written, which requires optical character recognition (OCR) [3–5]. Once the characters are identified, approaches based on natural language processing (NLP) [6–8] are needed to understand the text. In the case of text documents, the characters are already available, so NLP approaches can be applied directly. Research in the fields of OCR and NLP started in the last century, and systems for languages like English are now commercially available [9–11], but there have not been significant developments for Indic languages. One reason for this is the complex nature of Indic scripts [12]. Another is the unavailability of standard (and free) datasets for research.

1.2 Optical Character Recognition

Optical character recognition [13, 14] refers to the task of decoding what is written in a document. It does not involve understanding the written text. Instead, it takes a scan or a picture of a document, identifies the characters and outputs the identified words and characters as text. The document can be either handwritten or printed. In the case of handwritten documents, several kinds of variation need to be considered prior to recognition. The first is slant: handwritten text often shows differing degrees of slant, and these must be dealt with while processing the document. A document with characters having multiple degrees of slant is presented in Figure 1.1.
FIGURE 1.1
A document depicting multiple degrees of slant for different characters.
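The excerpt does not say how slant is handled, so the following is only a minimal sketch of one common heuristic, not the authors' method. It assumes a binarized image stored as a list of 0/1 rows, tries several candidate shear values and keeps the one that concentrates ink into the fewest columns, since upright strokes give sharper column projections. The function names `shear_rows`, `column_score` and `estimate_shear`, the candidate list and the toy image are all hypothetical.

```python
import math


def shear_rows(image, shear):
    """Shift each row sideways in proportion to its height above the bottom row.

    image: list of rows (top to bottom), each a list of 0/1 pixels (1 = ink).
    shear: shift in pixels per row; positive values move upper rows to the left.
    """
    height, width = len(image), len(image[0])
    pad = math.ceil(abs(shear) * (height - 1))  # room for the largest shift
    out = [[0] * (width + 2 * pad) for _ in range(height)]
    for y, row in enumerate(image):
        # The bottom row keeps its position (plus padding); rows above move more.
        offset = pad - round(shear * (height - 1 - y))
        for x, pixel in enumerate(row):
            if pixel:
                out[y][x + offset] = 1
    return out


def column_score(image):
    """Sum of squared column ink counts; larger when strokes are vertical."""
    return sum(sum(column) ** 2 for column in zip(*image))


def estimate_shear(image, candidates):
    """Pick the candidate shear whose corrected image has the highest score."""
    return max(candidates, key=lambda s: column_score(shear_rows(image, s)))


# Toy 5x5 stroke leaning to the right: ink moves one pixel left per row going down.
slanted = [[1 if x == 4 - y else 0 for x in range(5)] for y in range(5)]
best = estimate_shear(slanted, [-1, -0.5, 0, 0.5, 1])
print(best)  # 1
```

Applying `shear_rows(slanted, best)` places every ink pixel of the toy stroke in a single column. A real page would need a finer grid of candidate shears and, because characters can have different slants as noted above, an estimate per word or per character instead of one per page.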
The second important factor which needs to be tackled is the similarity between different characters. For instance, the numeral “3” is similar to “ত” in Bangla. This is illustrated in Figure 1.2. Such cases must be handled carefully, because a single misread character can change the entire sentence.
FIGURE 1.2
Similarity between different characters in Bangla.
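The excerpt does not describe how such look-alike characters are resolved. As a hedged illustration only, one simple post-recognition check is script consistency: an ASCII digit sitting inside an otherwise Bangla token is a candidate misreading. The helper names `is_bengali` and `suspicious_digits` and the sample strings are assumptions made for this sketch, and a production system would rely on the recognizer's confidence scores and a language model as well.

```python
import unicodedata

ASCII_DIGITS = set("0123456789")


def is_bengali(ch):
    """True if ch lies in the Unicode Bengali block (U+0980 to U+09FF)."""
    return "\u0980" <= ch <= "\u09ff"


def suspicious_digits(token):
    """Return positions of ASCII digits inside a token that also has Bangla characters."""
    if not any(is_bengali(ch) for ch in token):
        return []  # purely non-Bangla tokens such as "2023" are left alone
    return [i for i, ch in enumerate(token) if ch in ASCII_DIGITS]


# The two look-alike glyphs are unrelated code points.
print(unicodedata.name("3"))       # DIGIT THREE
print(unicodedata.name("\u09a4"))  # BENGALI LETTER TA

# A four-character Bangla string, and the same string with its TA misread as "3".
clean = "\u09ae\u09be\u09a4\u09be"
misread = "\u09ae\u09be3\u09be"
print(suspicious_digits(clean))    # []
print(suspicious_digits(misread))  # [2]
```

The check only flags positions for review. It cannot decide on its own whether a digit is an error, since Bangla text may legitimately contain ASCII numerals.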
Another important aspect is inter-writer and intra-writer variation. Handwritten text often varies at the character level: the same character looks slightly different each time the same person writes it. This is known as intra-writer variation. A further variation is observed when two different writers write the same thing, since the handwriting of different people differs in most cases. This is known as inter-writer variation. The system should be able to handle both kinds of difference. Inter- and intra-writer variations for a Bangla text are presented in Figures 1.3 and 1.4.
F...

Table of contents

  1. Cover
  2. Half-Title
  3. Title
  4. Copyright
  5. Contents
  6. Preface
  7. Editors
  8. Contributors
  9. 1. Artificial Intelligence for Document Image Analysis
  10. 2. An Approach toward Character Recognition of Bangla Handwritten Isolated Characters
  11. 3. Artistic Multi-Character Script Identification
  12. 4. A Study on the Extreme Learning Machine and Its Applications
  13. 5. A Graph-Based Text Classification Model for Web Text Documents
  14. 6. A Study of Distance Metrics in Document Classification
  15. 7. A Study of Proximity of Domains for Text Categorization
  16. 8. Supervised Learning for Aggression Identification and Author Profiling over Twitter Dataset
  17. 9. The Effect of Using Features Computed from Generated Offline Images for Online Bangla Handwritten Character Recognition
  18. 10. Handwritten Character Recognition for Palm-Leaf Manuscripts
  19. Index