
Taming Text
How to Find, Organize, and Manipulate It
- 320 pages
- English
About this book
Summary
Taming Text, winner of the 2013 Jolt Awards for Productivity, is a hands-on, example-driven guide to working with unstructured text in the context of real-world applications. This book explores how to automatically organize text using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. The book guides you through examples illustrating each of these topics, as well as the foundations upon which they are built.
About this Book
There is so much text in our lives, we are practically drowning in it. Fortunately, there are innovative tools and techniques for managing unstructured information that can throw the smart developer a much-needed lifeline. You'll find them in this book.
Taming Text is a practical, example-driven guide to working with text in real applications. This book introduces you to useful techniques like full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. You'll explore real use cases as you systematically absorb the foundations upon which they are built. Written in a clear and concise style, this book avoids jargon, explaining the subject in terms you can understand without a background in statistics or natural language processing. Examples are in Java, but the concepts can be applied in any language.
Written for Java developers, the book requires no prior knowledge of GWT. Purchase of the print book comes with an offer of a free PDF, ePub, and Kindle eBook from Manning. Also available is all code from the book.
Winner of 2013 Jolt Awards: The Best Books, one of five notable books every serious programmer should read.
What's Inside
- When to use text-taming techniques
- Important open-source libraries like Solr and Mahout
- How to build text-processing applications
About the Authors
Grant Ingersoll is an engineer, speaker, and trainer, a Lucene committer, and a cofounder of the Mahout machine-learning project. Thomas Morton is the primary developer of OpenNLP and Maximum Entropy. Drew Farris is a technology consultant, software developer, and contributor to Mahout, Lucene, and Solr.
"Takes the mystery out of very complex processes." (From the Foreword by Liz Liddy, Dean, iSchool, Syracuse University)
Table of Contents
- Getting started taming text
- Foundations of taming text
- Searching
- Fuzzy string matching
- Identifying people, places, and things
- Clustering text
- Classification, categorization, and tagging
- Building an example question answering system
- Untamed text: exploring the next frontier
Chapter 1. Getting started taming text
- Understanding why processing text is important
- Learning what makes taming text hard
- Setting the stage for leveraging open source libraries to tame text
- The ability to find relevant answers and supporting content given an information need
- The ability to organize (label, extract, summarize) and manipulate text with little-to-no user intervention
- The ability to do both of these things with ever-increasing amounts of input
1.1. Why taming text is important
- How many email messages did you get today (both work and personal, including spam)?
- How many of those did you read?
- How many did you respond to right away? Within the hour? Day? Week?
- How do you find old email?
- How many blogs did you read today?
- How many online news sites did you visit?
- Did you use instant messaging (IM), Twitter, or Facebook with friends or colleagues?
- How many searches did you do on Google, Yahoo!, or Bing?
- What documents on your computer did you read? What format were they in (Word, PDF, text)?
- How often do you search for something locally (either on your machine or your corporate intranet)?
- How much content did you produce in the form of emails, reports, and so on?
Email consumes an average of 13 hours per week per worker... But email is no longer the only communication vehicle. Social networks, instant messaging, Yammer, Twitter, Facebook, and LinkedIn have added new communication channels that can sap concentrated productivity time from the information worker's day. The time spent searching for information this year averaged 8.8 hours per week, for a cost of $14,209 per worker per year. Analyzing information soaked up an additional 8.1 hours, costing the organization $13,078 annually, making these two tasks relatively straightforward candidates for better automation. It makes sense that if workers are spending over a third of their time searching for information and another quarter analyzing it, this time must be as productive as possible.
1.2. Preview: A fact-based question answering system
Figure 1.1. A simple workflow for answering questions posed to a QA system
