Table of Contents
Using OpenRefine
Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers and more
Why Subscribe?
Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example files
Errata
Piracy
Questions
1. Diving Into OpenRefine
Introducing OpenRefine
Recipe 1 – installing OpenRefine
Windows
Mac
Linux
Recipe 2 – creating a new project
File formats supported by OpenRefine
Recipe 3 – exploring your data
Recipe 4 – manipulating columns
Collapsing and expanding columns
Moving columns around
Renaming and removing columns
Recipe 5 – using the project history
Recipe 6 – exporting a project
Recipe 7 – going for more memory
Windows
Mac
Linux
Summary
2. Analyzing and Fixing Data
Recipe 1 – sorting data
Reordering rows
Recipe 2 – faceting data
Text facets
Numeric facets
Customized facets
Faceting by star or flag
Recipe 3 – detecting duplicates
Recipe 4 – applying a text filter
Recipe 5 – using simple cell transformations
Recipe 6 – removing matching rows
Summary
3. Advanced Data Operations
Recipe 1 – handling multi-valued cells
Recipe 2 – alternating between rows and records mode
Recipe 3 – clustering similar cells
Recipe 4 – transforming cell values
Recipe 5 – adding derived columns
Recipe 6 – splitting data across columns
Recipe 7 – transposing rows and columns
Summary
4. Linking Datasets
Recipe 1 – reconciling values with Freebase
Recipe 2 – installing extensions
Recipe 3 – adding a reconciliation service
Recipe 4 – reconciling with Linked Data
Recipe 5 – extracting named entities
Summary
A. Regular Expressions and GREL
Regular expressions for text patterns
Character classes
Quantifiers
Anchors
Choices
Groups
Overview
General Refine Expression Language (GREL)
Transforming data
Creating custom facets
Solving problems with GREL
Index
Copyright © 2013 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: September 2013
Production Reference: 1040913
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78328-908-0
www.packtpub.com
Authors
Ruben Verborgh
Max De Wilde
Reviewers
Martin Magdinier
Dr. Mateja Verlic
Acquisition Editor
Sam Birch
Commissioning Editor
Subho Gupta
Technical Editors
Anita Nayak
Harshad Vairat
Project Coordinator
Sherin Padayatty
Proofreader
Paul Hindle
Indexer
Hemangini Bari
Production Coordinator
Nilesh R. Mohite
Cover Work
Nilesh R. Mohite
When I joined Metaweb Technologies, Inc. in 2008, we were building up Freebase in earnest, entity by entity, fact by fact. You may now know Freebase through its newest incarnation, Google's Knowledge Graph, which powers the "Knowledge panels" on www.google.com.
Building up "the world's database of everything" is a tall order that machines and algorithms alone cannot do, even if raw public domain data exists in abundance. Raw data from multiple sources must be cleaned up, homogenized, and then reconciled with data already in Freebase. Even that first step of cleaning up the data cannot be automated entirely; it takes the common sense of a human reader to know that if both 0.1 and 10,000,000 occur in a column named cost, they are very likely in different units (perhaps millions of dollars and dollars respectively). It also takes a human reader to decide that UCBerkley means the same as University of California in Berkeley, CA, but not the same as Berkeley DB.
If these errors occurred often enough, we might as well have given up or just hired enough people to perform manual data entry. Instead, they occur often enough to be a problem, yet rarely enough that anyone who has not dealt with such data thinks simple automation is sufficient. But, dear reader, you have dealt with data, and you know how unpredictably messy it can be.
Every dataset that we wanted to load into Freebase became an iterative exercise in programming mixed with manual inspection that led to hard-coding transformation rules, from turning two-digit years into four-digit ones, to swapping given name and surname if there is a comma between them. Even for most of us programmers, this exercise got old quickly, and it was painful to start over every time.
So, we created Freebase Gridworks, a tool for cleaning up data and making it ready for loading into Freebase. We designed it to be a database-spreadsheet hybrid; it is interactive like spreadsheet software and programmable like databases. It was this combination that made Gridworks the first of its kind.
In the process of creating and then using Gridworks ourselves, we realized that cleaning, transforming, and just playing with data is crucial and generally useful, even if the goal is not to load data into Freebase. So, we redesigned the tool to be more generic, and released its Version 2 under the name "Google Refine" after Google acquired Metaweb.
Since then, Refine has been well received in many different communities: data journalists, open data enthusiasts, librarians, archivists, hacktivists, and even programmers and developers by trade. Its adoption in the early days spread through word of mouth, at hackathons and in informal tutorials held by its own users.
Having proven itself through early adopters, Refine now needs better organized efforts to spread and become a mature product with a sustainable community around it. Expert users, open source contributors, and data enthusiast groups are actively teaching how to use Refine on tours and in the classroom. Ruben and Max from the Free Your Metadata team have taken the next logical step in consolidating those tutorials and organizing those recipes into this handy missing manual for Refine.
Stepping back to take in the bigger picture, we...