Using OpenRefine
eBook - ePub

  1. 114 pages
  2. English
  3. ePUB (mobile friendly)

About this book

In Detail

Data is supposed to be the new gold, but how can you unlock the value in your data? Managing large datasets used to be a task for specialists, but you don't have to worry about inconsistencies or errors anymore. OpenRefine lets you clean, link, and publish your dataset with ease.

Using OpenRefine takes you on a practical tour of all the handy features of this well-known data transformation tool. It is a hands-on recipe book that teaches you data techniques by example. Starting from the basics, it gradually transforms you into an OpenRefine expert.

This book will teach you all the necessary skills to handle any large dataset and to turn it into high-quality data for the Web. After you learn how to analyze data and spot issues, we'll see how we can solve them to obtain a clean dataset. Messy and inconsistent data is recovered through advanced techniques such as automated clustering. We'll then show how to extract links from keyword and full-text fields using reconciliation and named-entity extraction.

Using OpenRefine is more than a manual: it's a guide stuffed with tips and tricks to get the best out of your data.

Approach

The book is styled as a cookbook: its recipes, combined with free datasets, will turn readers into proficient OpenRefine users in the fastest possible way.

Who this book is for

This book is targeted at anyone who works on or handles a large amount of data. No prior knowledge of OpenRefine is required, as we start from the very beginning and gradually reveal more advanced features. You don't even need your own dataset, as we provide example data to try out the book's recipes.

Table of Contents

Using OpenRefine
Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example files
Errata
Piracy
Questions
1. Diving Into OpenRefine
Introducing OpenRefine
Recipe 1 – installing OpenRefine
Windows
Mac
Linux
Recipe 2 – creating a new project
File formats supported by OpenRefine
Recipe 3 – exploring your data
Recipe 4 – manipulating columns
Collapsing and expanding columns
Moving columns around
Renaming and removing columns
Recipe 5 – using the project history
Recipe 6 – exporting a project
Recipe 7 – going for more memory
Windows
Mac
Linux
Summary
2. Analyzing and Fixing Data
Recipe 1 – sorting data
Reordering rows
Recipe 2 – faceting data
Text facets
Numeric facets
Customized facets
Faceting by star or flag
Recipe 3 – detecting duplicates
Recipe 4 – applying a text filter
Recipe 5 – using simple cell transformations
Recipe 6 – removing matching rows
Summary
3. Advanced Data Operations
Recipe 1 – handling multi-valued cells
Recipe 2 – alternating between rows and records mode
Recipe 3 – clustering similar cells
Recipe 4 – transforming cell values
Recipe 5 – adding derived columns
Recipe 6 – splitting data across columns
Recipe 7 – transposing rows and columns
Summary
4. Linking Datasets
Recipe 1 – reconciling values with Freebase
Recipe 2 – installing extensions
Recipe 3 – adding a reconciliation service
Recipe 4 – reconciling with Linked Data
Recipe 5 – extracting named entities
Summary
A. Regular Expressions and GREL
Regular expressions for text patterns
Character classes
Quantifiers
Anchors
Choices
Groups
Overview
General Refine Expression Language (GREL)
Transforming data
Creating custom facets
Solving problems with GREL
Index

Using OpenRefine

Copyright © 2013 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: September 2013
Production Reference: 1040913
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78328-908-0
www.packtpub.com
Cover Image by Aniket Sawant

Credits

Authors
Ruben Verborgh
Max De Wilde
Reviewers
Martin Magdinier
Dr. Mateja Verlic
Acquisition Editor
Sam Birch
Commissioning Editor
Subho Gupta
Technical Editors
Anita Nayak
Harshad Vairat
Project Coordinator
Sherin Padayatty
Proofreader
Paul Hindle
Indexer
Hemangini Bari
Production Coordinator
Nilesh R. Mohite
Cover Work
Nilesh R. Mohite

Foreword

At the time I joined Metaweb Technologies, Inc. in 2008, we were building up Freebase in earnest; entity by entity, fact by fact. Now you may know Freebase through its newest incarnation, Google's Knowledge Graph, which powers the "Knowledge panels" on www.google.com.
Building up "the world's database of everything" is a tall order that machines and algorithms alone cannot do, even if raw public domain data exists in abundance. Raw data from multiple sources must be cleaned up, homogenized, and then reconciled with data already in Freebase. Even that first step of cleaning up the data cannot be automated entirely; it takes the common sense of a human reader to know that if both 0.1 and 10,000,000 occur in a column named cost, they are very likely in different units (perhaps millions of dollars and dollars respectively). It also takes a human reader to decide that UCBerkley means the same as University of California in Berkeley, CA, but not the same as Berkeley DB.
If these errors occurred often enough, we might as well have given up or just hired enough people to perform manual data entry. But these errors occur often enough to be a problem, and yet not often enough that anyone who has not dealt with such data thinks simple automation is sufficient. But, dear reader, you have dealt with data, and you know how unpredictably messy it can be.
Every dataset that we wanted to load into Freebase became an iterative exercise in programming mixed with manual inspection that led to hard-coding transformation rules, from turning two-digit years into four-digit years, to swapping given name and surname if there is a comma in between them. Even for most of us programmers, this exercise got old quickly, and it was painful to start over every time.
So, we created Freebase Gridworks, a tool for cleaning up data and making it ready for loading into Freebase. We designed it to be a database-spreadsheet hybrid; it is interactive like spreadsheet software and programmable like databases. It was this combination that made Gridworks the first of its kind.
In the process of creating and then using Gridworks ourselves, we realized that cleaning, transforming, and just playing with data is crucial and generally useful, even if the goal is not to load data into Freebase. So, we redesigned the tool to be more generic, and released its Version 2 under the name "Google Refine" after Google acquired Metaweb.
Since then, Refine has been well received in many different communities: data journalists, open data enthusiasts, librarians, archivists, hacktivists, and even programmers and developers by trade. Its adoption in the early days spread through word of mouth, in hackathons and informal tutorials held by its own users.
Having proven itself through early adopters, Refine now needs better organized efforts to spread and become a mature product with a sustainable community around it. Expert users, open source contributors, and data enthusiast groups are actively teaching how to use Refine on tours and in the classroom. Ruben and Max from the Free Your Metadata team have taken the next logical step in consolidating those tutorials and organizing those recipes into this handy missing manual for Refine.
Stepping back to take in the bigger picture, we...
