Text Processing with Ruby
eBook - ePub

Extract Value from the Data That Surrounds You

  1. 272 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

Most information in the world is in text format, and programmers often find themselves needing to make sense of the data hiding within. You want to do this efficiently, avoiding labor-intensive, manual work—and Ruby is ideally suited to this task.

Text Processing with Ruby takes a practical approach to working with text:

  • First, Acquire: Explore Ruby's core and standard library, and what's possible with IO and its derived classes like File. Extract text into your Ruby programs from the file system and standard input. Process delimited files such as CSVs, and write utilities that interact with other programs in text-processing pipelines. Process web pages with Nokogiri to pull out information from even the messiest of HTML, and decipher character encoding mysteries.
  • Second, Transform: Use regular expressions to match, extract, and replace patterns in text. Write a parser using Ruby's StringScanner library. Use Natural Language Processing techniques to extract keywords and implement fuzzy searching.
  • Finally, Load: Write the transformed text and data to standard output, files, and other processes. Serialize text into JSON, XML, and CSV, and use ERB to create more complex formats.

You'll soon be able to tackle even the most enormous and entangled text with ease, scything through gigabytes of data and effortlessly extracting the bits that matter.

Top Five Text Processing Tips
by Rob Miller, author of Text Processing with Ruby

Clean up your data first
Data in the real world is messy. It almost always pays off to take some
time to normalize different sources of data and to get them into the
same format before you begin whatever actual processing you need to do.
You'll have fewer exceptions and special cases in your code, and it'll be
a lot more resilient.
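
As a rough illustration of what that normalization might look like, here is a minimal sketch. The `normalize_name` helper and the sample data are hypothetical, not taken from the book; the point is only that three differently formatted inputs collapse into one canonical form before any real processing begins.

```ruby
# Hypothetical helper (not from the book): put a free-form name field
# into one canonical shape before any real processing happens.
def normalize_name(raw)
  raw.to_s
     .strip              # drop leading and trailing whitespace
     .gsub(/\s+/, " ")   # collapse runs of spaces and tabs
     .downcase           # make later comparisons case-insensitive
end

# The same person, as three different sources might record them.
raw_names = ["  Alice   SMITH ", "alice smith", "Alice\tSmith"]

normalized = raw_names.map { |name| normalize_name(name) }
unique_names = normalized.uniq

puts unique_names.inspect
```

Because every record passes through the same helper, the code that follows only ever has to handle one format.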

Master regular expressions
There are definitely some text processing problems that can't be solved
with regular expressions, but not that many. While they're not always
the best or most readable option, knowing regular expressions well will
get you out of many tight spots, and even more often than that will be
the first step towards a more robust solution.
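
To make that concrete, the sketch below uses one pattern to extract fields and another to replace text. The log line format, the `LOG_LINE` constant, and the group names are assumptions made up for this example; real logs will usually need a more forgiving pattern.

```ruby
# Assumed log format for illustration only:
#   <ip> - [<timestamp>] "<VERB> <path>"
LOG_LINE = /\A(?<ip>\d{1,3}(?:\.\d{1,3}){3}) - \[(?<time>[^\]]+)\] "(?<verb>[A-Z]+) (?<path>\S+)"/

line = '192.168.0.1 - [10/Oct/2015:13:55:36 +0000] "GET /index.html"'

# Match and extract: named groups become a hash keyed by group name.
match = LOG_LINE.match(line)
fields = match ? match.named_captures : {}

# Replace: mask anything that looks like an IPv4 address.
masked = line.gsub(/\d{1,3}(?:\.\d{1,3}){3}/, "x.x.x.x")

puts fields.inspect
puts masked
```

Named groups keep the extraction readable, which helps offset the usual complaint that regular expressions are hard to maintain.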

Break your problem into discrete steps
Almost all text processing tasks, no matter how complicated they seem on
the face of it, are really a series of small transformations. Figuring
out how to frame your problem in this way will make it easy to take a
pipeline approach, where your text flows through a series of small,
discrete steps, each of which transforms the data in a particular way
and then passes it on. Such programs are both easier to reason about and
easier to modify and extend.
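
One way to sketch such a pipeline in Ruby is as a list of small lambdas, each taking lines in and passing lines on. The step names and the input here are invented for illustration, and this is only one of several reasonable ways to structure it.

```ruby
# Each step is a small, independent transformation over an array of lines.
trim           = ->(lines) { lines.map(&:strip) }
drop_blank     = ->(lines) { lines.reject(&:empty?) }
strip_comments = ->(lines) { lines.reject { |line| line.start_with?("#") } }

# The pipeline is just the steps in the order they should run.
pipeline = [trim, drop_blank, strip_comments]

input = ["  # a comment", "alpha  ", "", "  beta"]

# Feed the output of each step into the next one.
result = pipeline.reduce(input) { |lines, step| step.call(lines) }

puts result.inspect
```

Adding, removing, or reordering a step means editing the `pipeline` array rather than untangling one large method.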

Figure out a strategy for missing data
Data in the real world, as well as being messy, frequently has gaps.
Decide early on how you're going to cope with that, including how
you'll represent the absence of particular fields or properties, and
you'll avoid messiness later on.
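
As one possible strategy, the sketch below maps every "empty-looking" value to `nil` as early as possible and applies defaults only at the point of output. The `MISSING` list and the `clean_field` helper are hypothetical conventions chosen for this example, not something the book prescribes.

```ruby
# Assumed convention: these placeholder strings all mean "no value".
MISSING = ["", "n/a", "-", "null"].freeze

# Hypothetical helper: return nil for missing values, trimmed text otherwise.
def clean_field(value)
  text = value.to_s.strip
  MISSING.include?(text.downcase) ? nil : text
end

row = { "name" => "Ada", "email" => " N/A ", "city" => "" }

# After this step, absence is always represented by nil and nothing else.
cleaned = row.transform_values { |value| clean_field(value) }

# Defaults are applied once, where the data leaves the program.
city_label = cleaned["city"] || "unknown"

puts cleaned.inspect
puts city_label
```

With a single representation for absence, later steps can test for `nil` instead of guessing which placeholder a given source used.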

Make the most of existing tools
There are hundreds of command-line tools that exist solely to process
textual data. Each of them is capable of performing a particular
transformation, which means you don't need to reinvent the wheel. If you
use existing tools for the parts of your problem that have already been
solved, all that remains is to solve the unique problem that you have.
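
For instance, rather than writing your own sort-and-deduplicate step, a Ruby script can hand that part to the standard `sort` utility. This sketch assumes a Unix-like system with `sort` on the PATH; the word list is made up.

```ruby
require "open3"

words = %w[pear apple pear fig apple]

# Send the words to `sort -u` on standard input and capture its output.
# Assumes the `sort` command is available on PATH.
output, status = Open3.capture2("sort", "-u", stdin_data: words.join("\n") + "\n")
raise "sort failed" unless status.success?

unique_sorted = output.split("\n")

puts unique_sorted.inspect
```

The Ruby code is left with only the parts that are specific to your problem: preparing the input and deciding what to do with the result.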

Table of contents

  1. Text Processing with Ruby