
Taming Text

How to Find, Organize, and Manipulate It

Grant Ingersoll, Thomas S. Morton, Drew Farris


About This Book

Summary

Taming Text, winner of the 2013 Jolt Awards for Productivity, is a hands-on, example-driven guide to working with unstructured text in the context of real-world applications. This book explores how to automatically organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. The book guides you through examples illustrating each of these topics, as well as the foundations upon which they are built.
There is so much text in our lives, we are practically drowning in it. Fortunately, there are innovative tools and techniques for managing unstructured information that can throw the smart developer a much-needed lifeline. You'll find them in this book.

Taming Text is a practical, example-driven guide to working with text in real applications. This book introduces you to useful techniques like full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. You'll explore real use cases as you systematically absorb the foundations upon which they are built.

Written in a clear and concise style, this book avoids jargon, explaining the subject in terms you can understand without a background in statistics or natural language processing. Examples are in Java, but the concepts can be applied in any language. Written for Java developers, the book requires no prior experience with natural language processing.

Purchase of the print book comes with an offer of a free PDF, ePub, and Kindle eBook from Manning. Also available is all code from the book.

Winner of the 2013 Jolt Awards: The Best Books, one of five notable books every serious programmer should read.

What's Inside

  • When to use text-taming techniques
  • Important open-source libraries like Solr and Mahout
  • How to build text-processing applications

About the Authors
Grant Ingersoll is an engineer, speaker, and trainer, a Lucene committer, and a cofounder of the Mahout machine-learning project. Thomas Morton is the primary developer of OpenNLP and Maximum Entropy. Drew Farris is a technology consultant, software developer, and contributor to Mahout, Lucene, and Solr.

"Takes the mystery out of very complex processes." —From the Foreword by Liz Liddy, Dean, iSchool, Syracuse University

Table of Contents

  • Getting started taming text
  • Foundations of taming text
  • Searching
  • Fuzzy string matching
  • Identifying people, places, and things
  • Clustering text
  • Classification, categorization, and tagging
  • Building an example question answering system
  • Untamed text: exploring the next frontier


Information

Publisher: Manning
Year: 2012
Pages: 320
Language: English
ISBN: 9781638353867

Chapter 1. Getting started taming text

In this chapter
  • Understanding why processing text is important
  • Learning what makes taming text hard
  • Setting the stage for leveraging open source libraries to tame text
If you’re reading this book, chances are you’re a programmer, or at least in the information technology field. You operate with relative ease when it comes to email, instant messaging, Google, YouTube, Facebook, Twitter, blogs, and most of the other technologies that define our digital age. After you’re done congratulating yourself on your technical prowess, take a moment to imagine your users. They often feel imprisoned by the sheer volume of email they receive. They struggle to organize all the data that inundates their lives. And they probably don’t know or even care about RSS or JSON, much less search engines, Bayesian classifiers, or neural networks. They want to get answers to their questions without sifting through pages of results. They want email to be organized and prioritized, but spend little time actually doing it themselves. Ultimately, your users want tools that enable them to focus on their lives and their work, not just their technology. They want to control—or tame—the uncontrolled beast that is text.

But what does it mean to tame text? We’ll talk more about it later in this chapter, but for now taming text involves three primary things:
  • The ability to find relevant answers and supporting content given an information need
  • The ability to organize (label, extract, summarize) and manipulate text with little-to-no user intervention
  • The ability to do both of these things with ever-increasing amounts of input
This leads us to the primary goal of this book: to give you, the programmer, the tools and hands-on advice to build applications that help people better manage the tidal wave of communication that swamps their lives. The secondary goal of Taming Text is to show how to do this using existing, freely available, high quality, open source libraries and tools.
Before we get to those broader goals later in the book, let’s step back and examine some of the factors involved in text processing and why it’s hard, and also look at some use cases as motivation for the chapters to follow. Specifically, this chapter aims to provide some background on why processing text effectively is both important and challenging. We’ll also lay some groundwork with a simple working example of our first two primary tasks as well as get a preview of the application you’ll build at the end of this book: a fact-based question answering system. With that, let’s look at some of the motivation for taming text by scoping out the size and shape of the information world we live in.

1.1. Why taming text is important

Just for fun, try to imagine going a whole day without reading a single word. That’s right, one whole day without reading any news, signs, websites, or even watching television. Think you could do it? Not likely, unless you sleep the whole day. Now spend a moment thinking about all the things that go into reading all that content: years of schooling and hands-on feedback from parents, teachers, and peers; and countless spelling tests, grammar lessons, and book reports, not to mention the hundreds of thousands of dollars it takes to educate a person through college. Next, step back another level and think about how much content you do read in a day.
To get started, take a moment to consider the following questions:
  • How many email messages did you get today (both work and personal, including spam)?
  • How many of those did you read?
  • How many did you respond to right away? Within the hour? Day? Week?
  • How do you find old email?
  • How many blogs did you read today?
  • How many online news sites did you visit?
  • Did you use instant messaging (IM), Twitter, or Facebook with friends or colleagues?
  • How many searches did you do on Google, Yahoo!, or Bing?
  • What documents on your computer did you read? What format were they in (Word, PDF, text)?
  • How often do you search for something locally (either on your machine or your corporate intranet)?
  • How much content did you produce in the form of emails, reports, and so on?
Finally, the big question: how much time did you spend doing this?
If you’re anything like the typical information worker, then you can most likely relate to IDC’s (International Data Corporation) findings from their 2009 study (Feldman 2009):
Email consumes an average of 13 hours per week per worker... But email is no longer the only communication vehicle. Social networks, instant messaging, Yammer, Twitter, Facebook, and LinkedIn have added new communication channels that can sap concentrated productivity time from the information worker’s day. The time spent searching for information this year averaged 8.8 hours per week, for a cost of $14,209 per worker per year. Analyzing information soaked up an additional 8.1 hours, costing the organization $13,078 annually, making these two tasks relatively straightforward candidates for better automation. It makes sense that if workers are spending over a third of their time searching for information and another quarter analyzing it, this time must be as productive as possible.
Furthermore, this survey doesn’t even account for how much time these same employees spend creating content during their personal time. In fact, eMarketer estimates that internet users average 18 hours a week online (eMarketer) and compares this to other leisure activities like watching television, which is still king at 30 hours per week.
Whether it’s reading email, searching Google, reading a book, or logging into Facebook, the written word is everywhere in our lives.
We’ve seen the individual part of the content picture, but what about the collective picture? According to IDC (2011), the world generated 1.8 zettabytes of digital information in 2011 and “by 2020 the world will generate 50 times [that amount].” Naturally, such prognostications often prove to be low given we can’t predict the next big trend that will produce more content than expected.
Even if a good-size chunk of this data is due to signal data, images, audio, and video, the current best approach to making all this data findable is to write analysis reports, add keyword tags and text descriptions, or transcribe the audio using speech recognition or a manual closed-captioning approach so that it can be treated as text. In other words, no matter how much structure we add, it still comes back to text for us to share and comprehend our content. As you can see, the sheer volume of content can be daunting, never mind that text processing is also a hard problem on a small scale, as you’ll see in a later section. In the meantime, it’s worthwhile to think about what the ideal applications or tools would do to help stem the tide of text that’s engulfing us. For many, the answer lies in the ability to quickly and efficiently hone in on the answer to our questions, not just a list of possible answers that we need to then sift through. Moreover, we wouldn’t need to jump through hoops to ask our questions; we’d just be able to use our own words or voice to express them with no need for things like quotations, AND/OR operators, or other things that make it easier on the machine but harder on the person.
Though we all know we don’t live in an ideal world, one of the promising approaches for taming text, popularized by IBM’s Jeopardy!-playing Watson program and Apple’s Siri application, is a question answering system that can process natural languages such as English and return actual answers, not just pages of possible answers. In Taming Text, we aim to lay some of the groundwork for building such a system. To do this, let’s consider what such a system might look like; then, let’s take a look at some simple code that can find and extract key bits of information out of text that will later prove to be useful in our QA system. We’ll finish off this chapter by delving deeper into why building such a system as well as other language-based applications is so hard, along with a look at how the chapters to follow in this book will lay the foundation for a fact-based QA system along with other text-based systems.
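To make the idea of extracting key bits of information concrete, here is a minimal sketch that uses the open source OpenNLP library (which this book relies on heavily) and one of its pre-trained models to pull person names out of a sentence. This is not a listing from the book; the class name, the model file name, and the example sentence are illustrative assumptions.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;

public class PersonFinderSketch {
  public static void main(String[] args) throws Exception {
    // Load a pre-trained person-name model; the file name is an assumption
    // (the OpenNLP project distributes models such as en-ner-person.bin).
    try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
      TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
      NameFinderME finder = new NameFinderME(model);

      // Tokenize a sample sentence and ask the finder for name spans.
      String[] tokens = SimpleTokenizer.INSTANCE.tokenize("Bob's uncle is Paul.");
      Span[] spans = finder.find(tokens);

      // Convert the token spans back into strings and print each match.
      for (String name : Span.spansToStrings(spans, tokens)) {
        System.out.println("Found person: " + name);
      }
    }
  }
}
```

Run against a sentence such as "Bob's uncle is Paul.", this should report "Paul" (and possibly "Bob") as person names, which is exactly the kind of building block a question answering system needs when it goes looking for candidate answers.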

1.2. Preview: A fact-based question answering system

For the purposes of this book, a QA system should be capable of ingesting a collection of documents suspected to have answers to questions that users might ask. For instance, Wikipedia or a collection of research papers might be used as a source for finding answers. In other words, the QA system we propose is based on identifying and analyzing text that has a chance of providing the answer based on patterns it has seen in the past. It won’t be capable of inferring an answer from a variety of sources. For instance, if the system is asked “Who is Bob’s uncle?” and there’s a document in the collection with the sentences “Bob’s father is Ola. Ola’s brother is Paul,” the system wouldn’t be able to infer that Bob’s uncle is Paul. But if there’s a sentence that directly states “Bob’s uncle is Paul,” you’d expect the system to be able to answer the question. This isn’t to say that the former example can’t be attempted; it’s just beyond the scope of this book.
A simple workflow for building the QA system described earlier is outlined in figure 1.1.
Figure 1.1. A simple workflow for answering questions posed to a QA system
Naturally, such a simple workflow hides a lot of details, and it also doesn’t cover the ingestion of the documents, but it does allow us to highlight some of the key components needed to process users’ questions. First, the ability to parse a user’s question and determine what’s being asked typically requires basic functionality like identifying words, as well as the ability to understand what kind of answer is appropriate for a question. For instance, the answer to “Who is Bob’s uncle?” should likely be a person, whereas the answer to “Where is Buffalo?” probably requires a place-name to be returned. Second, the need to identify candidate answers typically involves the...
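As just described, an early step is deciding what kind of answer a question expects. The book develops this more rigorously when it builds the example question answering system; as a placeholder, here is a naive rule-based sketch in Java, with a class name and type labels that are purely illustrative rather than anything from the book's code.

```java
/** Naive mapping from a question's leading words to an expected answer type. */
public class AnswerTypeSketch {

  static String answerType(String question) {
    String q = question.trim().toLowerCase();
    if (q.startsWith("who"))      return "PERSON";   // "Who is Bob's uncle?"
    if (q.startsWith("where"))    return "LOCATION"; // "Where is Buffalo?"
    if (q.startsWith("when"))     return "TIME";
    if (q.startsWith("how many")) return "QUANTITY";
    return "OTHER";
  }

  public static void main(String[] args) {
    System.out.println(answerType("Who is Bob's uncle?")); // prints PERSON
    System.out.println(answerType("Where is Buffalo?"));   // prints LOCATION
  }
}
```

Even this toy version shows why answer typing matters: once the system knows it is looking for a person, the name spans produced by a component like the one sketched earlier become natural candidate answers.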
