eBook - ePub

Taming Text

Name: Taming Text
Author: Grant Ingersoll, Thomas S. Morton, Drew Farris

How to Find, Organize, and Manipulate It

Grant Ingersoll, Thomas S. Morton, Drew Farris

320 pages
English

About the book

Summary

Taming Text, winner of the 2013 Jolt Awards for Productivity, is a hands-on, example-driven guide to working with unstructured text in the context of real-world applications. This book explores how to automatically organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. The book guides you through examples illustrating each of these topics, as well as the foundations upon which they are built.

About this Book

There is so much text in our lives, we are practically drowning in it. Fortunately, there are innovative tools and techniques for managing unstructured information that can throw the smart developer a much-needed lifeline. You'll find them in this book.

Taming Text is a practical, example-driven guide to working with text in real applications. This book introduces you to useful techniques like full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. You'll explore real use cases as you systematically absorb the foundations upon which they are built. Written in a clear and concise style, this book avoids jargon, explaining the subject in terms you can understand without a background in statistics or natural language processing. Examples are in Java, but the concepts can be applied in any language.

Written for Java developers, the book requires no prior knowledge of GWT. Purchase of the print book comes with an offer of a free PDF, ePub, and Kindle eBook from Manning. Also available is all code from the book.

Winner of 2013 Jolt Awards: The Best Books, one of five notable books every serious programmer should read.

What's Inside

When to use text-taming techniques
Important open-source libraries like Solr and Mahout
How to build text-processing applications

About the Authors
Grant Ingersoll is an engineer, speaker, and trainer, a Lucene committer, and a cofounder of the Mahout machine-learning project. Thomas Morton is the primary developer of OpenNLP and Maximum Entropy. Drew Farris is a technology consultant, software developer, and contributor to Mahout, Lucene, and Solr.

"Takes the mystery out of very complex processes." From the Foreword by Liz Liddy, Dean, iSchool, Syracuse University

Table of Contents

Getting started taming text
Foundations of taming text
Searching
Fuzzy string matching
Identifying people, places, and things
Clustering text
Classification, categorization, and tagging
Building an example question answering system
Untamed text: exploring the next frontier

Information

Publisher

Manning

Year

2012

ISBN

9781638353867

Topic

Computer science

Category

Data processing

Chapter 1. Getting started taming text

In this chapter

Understanding why processing text is important
Learning what makes taming text hard
Setting the stage for leveraging open source libraries to tame text

If you’re reading this book, chances are you’re a programmer, or at least in the information technology field. You operate with relative ease when it comes to email, instant messaging, Google, YouTube, Facebook, Twitter, blogs, and most of the other technologies that define our digital age. After you’re done congratulating yourself on your technical prowess, take a moment to imagine your users. They often feel imprisoned by the sheer volume of email they receive. They struggle to organize all the data that inundates their lives. And they probably don’t know or even care about RSS or JSON, much less search engines, Bayesian classifiers, or neural networks. They want to get answers to their questions without sifting through pages of results. They want email to be organized and prioritized, but spend little time actually doing it themselves. Ultimately, your users want tools that enable them to focus on their lives and their work, not just their technology. They want to control—or tame—the uncontrolled beast that is text. But what does it mean to tame text? We’ll talk more about it later in this chapter, but for now taming text involves three primary things:

The ability to find relevant answers and supporting content given an information need
The ability to organize (label, extract, summarize) and manipulate text with little-to-no user intervention
The ability to do both of these things with ever-increasing amounts of input

This leads us to the primary goal of this book: to give you, the programmer, the tools and hands-on advice to build applications that help people better manage the tidal wave of communication that swamps their lives. The secondary goal of Taming Text is to show how to do this using existing, freely available, high quality, open source libraries and tools.

Before we get to those broader goals later in the book, let’s step back and examine some of the factors involved in text processing and why it’s hard, and also look at some use cases as motivation for the chapters to follow. Specifically, this chapter aims to provide some background on why processing text effectively is both important and challenging. We’ll also lay some groundwork with a simple working example of our first two primary tasks as well as get a preview of the application you’ll build at the end of this book: a fact-based question answering system. With that, let’s look at some of the motivation for taming text by scoping out the size and shape of the information world we live in.

1.1. Why taming text is important

Just for fun, try to imagine going a whole day without reading a single word. That’s right, one whole day without reading any news, signs, websites, or even watching television. Think you could do it? Not likely, unless you sleep the whole day. Now spend a moment thinking about all the things that go into reading all that content: years of schooling and hands-on feedback from parents, teachers, and peers; and countless spelling tests, grammar lessons, and book reports, not to mention the hundreds of thousands of dollars it takes to educate a person through college. Next, step back another level and think about how much content you do read in a day.

To get started, take a moment to consider the following questions:

How many email messages did you get today (both work and personal, including spam)?
How many of those did you read?
How many did you respond to right away? Within the hour? Day? Week?
How do you find old email?
How many blogs did you read today?
How many online news sites did you visit?
Did you use instant messaging (IM), Twitter, or Facebook with friends or colleagues?
How many searches did you do on Google, Yahoo!, or Bing?
What documents on your computer did you read? What format were they in (Word, PDF, text)?
How often do you search for something locally (either on your machine or your corporate intranet)?
How much content did you produce in the form of emails, reports, and so on?

Finally, the big question: how much time did you spend doing this?

If you’re anything like the typical information worker, then you can most likely relate to IDC’s (International Data Corporation) findings from their 2009 study (Feldman 2009):

Email consumes an average of 13 hours per week per worker... But email is no longer the only communication vehicle. Social networks, instant messaging, Yammer, Twitter, Facebook, and LinkedIn have added new communication channels that can sap concentrated productivity time from the information worker’s day. The time spent searching for information this year averaged 8.8 hours per week, for a cost of $14,209 per worker per year. Analyzing information soaked up an additional 8.1 hours, costing the organization $13,078 annually, making these two tasks relatively straightforward candidates for better automation. It makes sense that if workers are spending over a third of their time searching for information and another quarter analyzing it, this time must be as productive as possible.

Furthermore, this survey doesn’t even account for how much time these same employees spend creating content during their personal time. In fact, eMarketer estimates that internet users average 18 hours a week online (eMarketer) and compares this to other leisure activities like watching television, which is still king at 30 hours per week.

Whether it’s reading email, searching Google, reading a book, or logging into Facebook, the written word is everywhere in our lives.

We’ve seen the individual part of the content picture, but what about the collective picture? According to IDC (2011), the world generated 1.8 zettabytes of digital information in 2011 and “by 2020 the world will generate 50 times [that amount].” Naturally, such prognostications often prove to be low given we can’t predict the next big trend that will produce more content than expected.

Even if a good-size chunk of this data is due to signal data, images, audio, and video, the current best approach to making all this data findable is to write analysis reports, add keyword tags and text descriptions, or transcribe the audio using speech recognition or a manual closed-captioning approach so that it can be treated as text. In other words, no matter how much structure we add, it still comes back to text for us to share and comprehend our content. As you can see, the sheer volume of content can be daunting, never mind that text processing is also a hard problem on a small scale, as you’ll see in a later section. In the meantime, it’s worthwhile to think about what the ideal applications or tools would do to help stem the tide of text that’s engulfing us. For many, the answer lies in the ability to quickly and efficiently home in on the answer to our questions, not just a list of possible answers that we need to then sift through. Moreover, we wouldn’t need to jump through hoops to ask our questions; we’d just be able to use our own words or voice to express them with no need for things like quotations, AND/OR operators, or other things that make it easier on the machine but harder on the person.

Though we all know we don’t live in an ideal world, one of the promising approaches for taming text, popularized by IBM’s Jeopardy!-playing Watson program and Apple’s Siri application, is a question answering system that can process natural languages such as English and return actual answers, not just pages of possible answers. In Taming Text, we aim to lay some of the groundwork for building such a system. To do this, let’s consider what such a system might look like; then, let’s take a look at some simple code that can find and extract key bits of information out of text that will later prove to be useful in our QA system. We’ll finish off this chapter by delving deeper into why building such a system as well as other language-based applications is so hard, along with a look at how the chapters to follow in this book will lay the foundation for a fact-based QA system along with other text-based systems.

1.2. Preview: A fact-based question answering system

For the purposes of this book, a QA system should be capable of ingesting a collection of documents suspected to have answers to questions that users might ask. For instance, Wikipedia or a collection of research papers might be used as a source for finding answers. In other words, the QA system we propose is based on identifying and analyzing text that has a chance of providing the answer based on patterns it has seen in the past. It won’t be capable of inferring an answer from a variety of sources. For instance, if the system is asked “Who is Bob’s uncle?” and there’s a document in the collection with the sentences “Bob’s father is Ola. Ola’s brother is Paul,” the system wouldn’t be able to infer that Bob’s uncle is Paul. But if there’s a sentence that directly states “Bob’s uncle is Paul,” you’d expect the system to be able to answer the question. This isn’t to say that the former example can’t be attempted; it’s just beyond the scope of this book.
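To make the distinction between directly stated and inferred answers concrete, here is a minimal sketch in Java. It is not code from the book: the class name DirectFactQa, the answer method, and both regular expressions are assumptions made purely for illustration, and a real system would rely on far more robust text analysis than hand-written patterns. The sketch only handles questions shaped like “Who is Bob’s uncle?” and only finds answers in sentences that state the fact in the same words.

```java
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the QA system built in the book.
class DirectFactQa {

    // Assumed question shape: "Who is <name>'s <relation>?"
    // Accepts a straight or a curly apostrophe.
    private static final Pattern QUESTION =
        Pattern.compile("Who is (\\w+)['\\u2019]s (\\w+)\\?");

    // Returns the answer only if some sentence states the fact directly.
    static Optional<String> answer(String question, List<String> sentences) {
        Matcher q = QUESTION.matcher(question.trim());
        if (!q.matches()) {
            return Optional.empty(); // question shape not supported by this sketch
        }
        // Look for "<name>'s <relation> is <answer>" in the documents.
        Pattern fact = Pattern.compile(
            Pattern.quote(q.group(1)) + "['\\u2019]s "
                + Pattern.quote(q.group(2)) + " is (\\w+)");
        for (String sentence : sentences) {
            Matcher m = fact.matcher(sentence);
            if (m.find()) {
                return Optional.of(m.group(1));
            }
        }
        // No direct statement found; the sketch never chains facts together.
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> direct = List.of("Bob's uncle is Paul.");
        List<String> indirect =
            List.of("Bob's father is Ola.", "Ola's brother is Paul.");

        System.out.println(answer("Who is Bob's uncle?", direct));   // Optional[Paul]
        System.out.println(answer("Who is Bob's uncle?", indirect)); // Optional.empty
    }
}
```

With the directly stated sentence, the sketch returns Paul. With the two sentences about Bob’s father and Ola’s brother, it returns nothing, because answering would require combining facts, which is the kind of inference described above as out of scope.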

A simple workflow for building the QA system described earlier is outlined in figure 1.1.

Figure 1.1. A simple workflow for answering questions posed to a QA system

Naturally, such a simple workflow hides a lot of details, and it also doesn’t cover the ingestion of the documents, but it does allow us to highlight some of the key components needed to process users’ questions. First, the ability to parse a user’s question and determine what’s being asked typically requires basic functionality like identifying words, as well as the ability to understand what kind of answer is appropriate for a question. For instance, the answer to “Who is Bob’s uncle?” should likely be a person, whereas the answer to “Where is Buffalo?” probably requires a place-name to be returned. Second, the need to identify candidate answers typically involves the...