Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications brings together all the information, tools and methods a professional will need to efficiently use text mining applications and statistical analysis. Winner of a 2012 PROSE Award in Computing and Information Sciences from the Association of American Publishers, this book presents a comprehensive how-to reference that shows the user how to conduct text mining and statistically analyze results. In addition to providing an in-depth examination of core text mining and link detection tools, methods and operations, the book examines advanced preprocessing techniques, knowledge representation considerations, and visualization approaches. Finally, the book explores current real-world, mission-critical applications of text mining and link detection using real-world example tutorials in such varied fields as corporate, finance, business intelligence, genomics research, and counterterrorism activities.

The world contains an unimaginably vast amount of digital information, which is growing ever vaster ever more rapidly. This makes it possible to do many things that previously could not be done: spot business trends, prevent diseases, combat crime and so on. Managed well, textual data can be used to unlock new sources of economic value, provide fresh insights into science and hold governments to account. As the Internet expands and our natural capacity to process the unstructured text that it contains diminishes, the value of text mining for information retrieval and search will increase dramatically.

- Extensive case studies, most in a tutorial format, allow the reader to 'click through' the example using a software program, thus learning to conduct text mining analyses as quickly as possible

- Numerous examples, tutorials, PowerPoint slides and datasets available via the companion website on Elsevierdirect.com

- Glossary of text mining terms provided in the appendix

Publisher: Academic Press
Year: 2012
Print ISBN: 9780123869791
eBook ISBN: 9780123870117
Topic: Mathematics
Subtopic: Probability & Statistics
Index: Mathematics

Foreword 1

FOREWORD No. 1: for “Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications” from the viewpoint of a person trained and having a life-long internationally noted career in traditional statistics, especially Discriminant Analysis and the Jackknife procedure, and Co-Author Gary Miner’s mentor in statistical analysis:

Text mining and analysis is a new area of research that is less than 15 years old. So we can expect new editions of this book as the field matures. This book concerns the topics of extracting information from text and finding interesting patterns in the results. The book covers working with text documents. The introduction explains that the Library of Congress has many “documents” that include text (books, reports, emails, etc.), recordings (sounds), and images (including stills and motion pictures). Apparently text mining does not include mining sounds and images, although I can visualize it expanding to include scripts and lyrics. Of course, songs and musical contributions involve much more than text. This may be a direction for the future.

Much of text mining is observational, and as such, it may pay to be cautious about drawing conclusions. For example, in data mining one may wish to make inferences about consumer behavior (e.g., purchases in a market chain). However, since the market may not have all brands of all products, no information can be inferred about nonstocked brands. Also, words and phrases are inherently unordered, so many statistical methods that rely on continuous data will not apply. Methods based on tables and proportions will be the order of the day.
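The remark about tables and proportions can be made concrete with a small sketch. The toy documents, the labels, and the helper names `term_by_class_table` and `chi_square_2x2` are invented for illustration and do not come from the book. The sketch counts how often a term appears in each class of document, compares the two proportions, and computes a Pearson chi-square statistic for the resulting 2x2 table.

```python
from collections import Counter

# Toy labeled documents (invented for illustration, not from the book).
DOCS = [
    ("fraud", "claim filed twice for the same stolen vehicle"),
    ("fraud", "stolen jewelry claim with no receipt"),
    ("fraud", "vehicle reported stolen after policy lapse"),
    ("legit", "windshield cracked by hail on the highway"),
    ("legit", "water damage claim after pipe burst"),
    ("legit", "vehicle damaged by hail in parking lot"),
]


def term_by_class_table(docs, term, positive_label):
    """Return the 2x2 counts (a, b, c, d):
    a = positive docs containing term, b = positive docs without it,
    c = other docs containing term,    d = other docs without it."""
    counts = Counter()
    for label, text in docs:
        has_term = term in text.lower().split()
        counts[(label == positive_label, has_term)] += 1
    return (counts[(True, True)], counts[(True, False)],
            counts[(False, True)], counts[(False, False)])


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so no association can be measured
    return n * (a * d - b * c) ** 2 / denom


a, b, c, d = term_by_class_table(DOCS, "stolen", "fraud")
print("proportion of fraud docs with 'stolen':", a / (a + b))
print("proportion of other docs with 'stolen':", c / (c + d))
print("chi-square:", chi_square_2x2(a, b, c, d))
```

With only six documents the statistic is purely illustrative; counts this small would normally call for an exact test, and the caution above about observational data applies to any association found this way.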

There are always concerns about new areas of science. As scientists become familiar with the pitfalls and benefits, practices will change and new methods will be added. For example, the standard of evidence for an association in a text mining context might be quite different from that accepted as legal evidence. The accuracy of websites is one area in which concern has been noted. This is not an area that text mining has been concerned with, but some “facts” distributed on the web can be badly distorted. One need only examine political candidates’ websites. Additionally, one need only look at the letters to the editor of local newspapers to observe misinformation and disinformation. I hope that text mining can include some figures of merit on documents (based on accuracy of information, completeness of information, and currency of information, among other things). In meta-analysis, studies that don’t measure up are minimized or discarded. A key concern is the set of texts that aren’t included in the database.

The main steps are as follows:

• Define the purpose of the study: This is all too easy to omit. The researcher may be so familiar with the field that the purpose is very clear to him or her, but someone else in the field may have a different idea.

• Determine the availability of the data: This implies a full literature review and cataloguing of data sets, and knowing the definitions of the terms in each set is imperative as well. As the data sets are typically documents, each author will likely have his or her own definition.

• Prepare the data: Proper encoding of the documents makes life much easier. Methods discussed in this book make it much simpler to use. This will include feature extraction, entity extraction, and reduction of dimension. The last may not mean any fewer variables, but they will be combined in ways that reduce the number of items in the models—definitely a nontrivial task.

• Develop and assess the models: One must consider how the models apply in the context of the problem. Assessing model fit, for example, will evaluate how well the models predict the data.

• Evaluate the models: This is close to the previous bullet.

• Deploy the results: In this step, the researcher shares his or her work with the public. For example, with insurance fraud, one would demonstrate that fraud could be reduced using these models. The models would identify key words in fraudulent claims that could be used in future work—in some instances, keeping the results confidential to avoid alerting the bad guys.

All of these steps are detailed in the chapters.
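The "prepare the data" step in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used in the book's tutorials, which rely on packaged text mining software. The stopword list, the sample texts, and the function names `tokenize` and `document_term_matrix` are all hypothetical. Feature extraction is reduced here to word counts, and dimension reduction to dropping terms that appear in fewer than `min_doc_freq` documents.

```python
import re
from collections import Counter

# Assumed stopword list, kept tiny for the example.
STOPWORDS = {"the", "a", "an", "of", "and", "was", "in", "to", "for", "on"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def document_term_matrix(texts, min_doc_freq=2):
    """Build a count matrix, keeping only terms found in at least
    `min_doc_freq` documents (a crude form of dimension reduction)."""
    token_lists = [tokenize(t) for t in texts]
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vocab = sorted(term for term, df in doc_freq.items() if df >= min_doc_freq)
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


texts = [
    "The claim was filed for a stolen vehicle.",
    "Stolen vehicle claim filed twice.",
    "Hail damaged the vehicle in the parking lot.",
]
vocab, matrix = document_term_matrix(texts)
print(vocab)
for row in matrix:
    print(row)
```

Each row of the matrix is one document and each column one retained term, which is the structured form that the later modeling and evaluation steps expect.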

English (and any language) has a great deal of redundancy, so it is not necessary to mine each word. Many words carry great meaning in context, so there may not be a need to have all words encoded. For example, in a medical context, the terms tumor, cancer, and lesion may be used more or less interchangeably.
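One way to exploit this redundancy is to fold near-synonyms into a single term before counting. The sketch below assumes a hand-built synonym map; the map, the sample note, and the function name `normalize` are hypothetical, and in a real medical project the choice of which terms to merge would need domain review.

```python
from collections import Counter

# Hypothetical synonym map (an assumption for illustration).
SYNONYMS = {"tumor": "lesion", "cancer": "lesion"}


def normalize(tokens, synonyms=SYNONYMS):
    """Replace each token with its canonical form, if one is defined."""
    return [synonyms.get(token, token) for token in tokens]


notes = "tumor noted on scan lesion biopsied cancer confirmed".split()
print(Counter(notes).most_common(1))
print(Counter(normalize(notes)).most_common(1))
```

After normalization the three related words are counted as one term, so the vocabulary shrinks and the shared concept is no longer split across three sparse columns.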

The book has 28 tutorials, which should be read as the chapters are read. I also strongly recommend that the reader have access to one of the programs mentioned in the book. Find a general purpose text miner and work through the exercises.

Be alert for items of interest. Get some examples of your own; this will likely be the best way to learn text mining.

Peter A. Lachenbruch, Ph.D.

Oregon State University, Professor of Public Health (2006–present)

Past president of the American Statistical Association

Dr. Peter Lachenbruch received his Ph.D. in biostatistics from UCLA. He has held positions on the faculties of the University of North Carolina (1965–1976), the University of Iowa (1976–1985), and UCLA (1985–1994). He was employed by the FDA/CBER from 1994 to 2005 and retired from there as the director of the Division of Biostatistics. He is currently Professor of Public Health at Oregon State University (2006–present). He is a Fellow of the American Statistical Association and a former elected member of the International Statistical Institute. He has held many professional offices and was the president of the American Statistical Association for 2008.

He has statistical interests in discriminant analysis, two-part models, model-independent inference, statistical computing, and data analysis. He is known for making the jackknife re-sampling method an accepted procedure, and in recent years he has published on validation of neural networks using hybrid resampling methods. He has application interests in rheumatology, psychiatry, pediatrics, gerontology and accident epidemiology. He has more than 180 publications in these fields. Dr. Lachenbruch serves on the Editorial Boards of Statistics in Medicine, Methods of Information in Medicine, Journal of Biopharmaceutical Statistics, and Statistical Methods in Medical Research.

Foreword 2

Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications follows in the tradition of the popular Handbook of Statistical Analysis & Data Mining Applications (Elsevier, 2009), which was authored by three of this new text’s six authors: Robert Nisbet, John Elder, and Gary Miner. Three additional authors (Thomas Hill, Dursun Delen, and Andrew Fast) contributed to this book, each bringing substantial experience in text mining.

Practical Text Mining was written to be used as an application-oriented text mining handbook. At some thousand pages in length, it is a text that will, to my mind, serve as the standard resource in the field for many years. The authors have managed to provide readers with a comprehensive, yet thoroughly understandable, overview of the theory and scope of text mining. In addition, they provide 28 tutorials that assist readers in actually applying theory to a wide range of practical applications using up-to-date statistical techniques.

Text mining is one of those disciplines that are now emerging as cutting-edge technologies. Only a few years ago, standard desktop computers did not have the physical capability to handle huge amounts of textual material or to engage in the complex analysis of page-length material. Storage capacity and, in particular, processor speed and RAM were insufficient to appropriately analyze large or complex textual situations. Recently, however, PC-type computers have become powerful and fast enough to engage in sophisticated text mining analysis, which itself has the capability of assisting researchers to better understand a host of social and biological issues.

Several text mining applications now exist to enable researchers to engage in text mining activities. The book does not limit itself to using only one text mining package but rather provides step-by-step instructions on how to work with several of the foremost applications, including StatSoft’s STATISTICA Data Miner and Text Miner, SAS Enterprise Miner & Text Miner, and IBM-SPSS Modeler Premium. The authors also provide tutorials for readers using RapidMiner, Topsy, Weka, and Salford Systems’ STM™, CART®, and TreeNet®. For all of the tutorials, readers are guided through such text mining processes as classification, prediction, named entity extraction, feature selection, dimensionality reduction, singular value decomposition, clustering, focused web crawling, web mining, and other associated techniques.

Part I takes the reader through the historical background of text mining, provides general text mining theory, and presents common applications and their associated statistical and data management tools. Part II is called the text mining laboratory in which the 28 tutorials are provided. Each tutorial is authored by the book’s authors and/or by guest authors having special expertise in the subject of the tutorial and the software used for it. Code is provided for the tutorials where applicable, and annotated screenshots are displayed throughout in order to make it easier for readers to replicate the specific tutorial example, as well as to enable them to work through their own project analysis.

Text mining has an extremely wide range of applications. From its original applications (evaluating a text to determine whether it was written by the claimed author, whether more than one author was involved, or when it was written), text mining is now reaching into nearly every academic field. The tutorials presented in this book give the reader a thorough flavor of both what text mining can do now and solid hints as to its future capabilities. And importantly, they give clear instructions on how to actually engage in meaningful text mining and related statistical analysis.

Of the number of statistics books that are published each year by the major publishers, only a few books stand out as really being important, meaning that they positively influence how future research is done in the subject area of the text. I believe that Practical Text Mining is just such a book. The text offers the reader everything I believe is important for this type of book: the historical background of the subject, the basic principles, the methodological techniques, interpretations, caveats, and methods of comparative evaluation. It also provides the code and annotated screenshots needed to enable readers to replicate examples and to utilize the book as a reference for years to come. Miner and his colleagues have put together just the sort of text needed for understanding and learning the complex subject of text mining. Well done!

Joseph M. Hilbe, J.D., Ph.D.

Arizona State University and Jet Propulsion Laboratory, 16 October, 2011

Joseph M. Hilbe, Ph.D., is emeritus professor at the University of Hawaii, adjunct professor of statistics at Arizona State University, Solar System Ambassador with NASA/Jet Propulsion Laboratory at California Institute of Technology, and a faculty member with the Institute for Statistics Education (Statistics.com), for which he is also a member of the three-person advisory council.

Dr. Hilbe is an elected Fellow of the American Statistical Association (ASA) and was a founding member of the ASA Health Policy Statistics Section executive committee. He is also an elected member (Fellow) of the International Statistical Institute (ISI), the world association of statisticians, for which he chairs both the ISI Sports Statistics and the ISI Astrostatistics committees. In addition, he heads the International Astrostatistics Network, the global association of astrostatisticians. Dr. Hilbe has authored well over one hundred journal articles and is currently on the editorial boards of six statistics journals. He is also editor-in-chief of the Springer Series in Astrostatistics and has written many best-sellers, including Negative Binomial Regression (Cambridge University Press, 2007, 2011) and Logistic Regression Models (Chapman & Hall/CRC, 2009). With James Hardin, Hilbe is author of Generalized Estimating Equations (Chapman & Hall/CRC, 2002) and three editions of Generalized Linear Models and Extensions (Stata Press, 2001, 2007, 2011). He is also author of R for Stata Users (Springer, with R. Muenchen, 2010) and of the soon to be published Astrostatistical Challenges for the New Astronomy (Springer) and Methods of Statistical Model Estimation (Chapman & Hall/CRC, with Andrew Robinson).

Foreword 3

The field of text mining—the process of finding patterns in unstructured data—has been an active area of research for more than two decades. Of course, even defining text mining is not easy, and as this book so systematically lays out, text mining is defined broadly enough to include everything from basic search to natural language processing (NLP). Unfortunately, the predictive analytics algorithms most analysts use are designed to build models from structured data, not unstructured data. Because it is difficult to convert unstructured data into the more useful structured format, unstructured data are often left unexplored.

Today, with a growing collection of unstructured data, increasing computer power, and enhanced software capabilities, text mining now tops most lists of future trends in analytics. However, merely topping a list still leaves undone the actual building and deploying of text mining solutions. Even in the most straightforward of applications, text mining is a very complex process. At the core of these complexities lies the fact that interpreting the ambiguities and subtleties of language is difficult. Moreover, most analysts embarking on text mining projects do not have an academic background in linguistics or text analysis and therefore must learn “on the job.”

This book is a practitioner’s book and is organized similarly to the acclaimed Handbook of Statistical Analysis and Data Mining Applications by Nisbet, Elder, and Miner. It begins with an overview of text mining and NLP, a daunting task in its own right. The history chapter provides an excellent background of text mining. Most users will find “The Seven Practice Areas of Text Analytics” chapter insightful and useful for explaining text mining to peers and decision makers. But this book goes beyond providing only a survey of the field: It also provides practical steps for building models using textual data, including the important steps of data preparation and modeling.

Many of us learned data mining and text mining by reading textbooks, going to conferences, and taking courses. But we learn more deeply by “doing”—that is, by incorporating unstructured data in models for projects where we must find a solution. Pr...

Cover image
Title page
Table of Contents
Copyright
Dedication
Endorsements for Practical Text Mining & Statistical Analysis for Non-structured Text Data Applications
Foreword 1
Foreword 2
Foreword 3
Acknowledgments
Preface
About the Authors
Introduction
List of Tutorials by Guest Authors
Part I: Basic Text Mining Principles
Part II: Introduction to the Tutorial and Case Study Section of This Book
Part III: Advanced Topics
Glossary
Index
How to Use the Data Sets and the Text Mining Software on the DVD or on Links for Practical Text Mining

Frequently asked questions

Is Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications an online PDF/ePUB?

Yes, you can access Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications by Gary D. Miner, John Elder, Andrew Fast, Thomas Hill, Robert Nisbet, and Dursun Delen in PDF and/or ePUB format, as well as other popular books in Mathematics and Probability & Statistics. We have over 1.5 million books available in our catalogue for you to explore.
