eBook - ePub

Using OpenRefine

Name: Using OpenRefine
ISBN: 9781783289080

Ruben Verborgh, Max De Wilde

114 pages
English
ePUB (mobile friendly)

About this book

In Detail

Data is supposed to be the new gold, but how can you unlock the value in your data? Managing large datasets used to be a task for specialists, but you don't have to worry about inconsistencies or errors anymore. OpenRefine lets you clean, link, and publish your dataset in a breeze.

Using OpenRefine takes you on a practical tour of all the handy features of this well-known data transformation tool. It is a hands-on recipe book that teaches you data techniques by example. Starting from the basics, it gradually transforms you into an OpenRefine expert.

This book will teach you all the necessary skills to handle any large dataset and to turn it into high-quality data for the Web. After you learn how to analyze data and spot issues, we'll see how we can solve them to obtain a clean dataset. Messy and inconsistent data is recovered through advanced techniques such as automated clustering. We'll then show how to extract links from keyword and full-text fields using reconciliation and named-entity extraction.

Using OpenRefine is more than a manual: it's a guide stuffed with tips and tricks to get the best out of your data.

Approach

The book is styled as a cookbook: its recipes, combined with free datasets, will turn readers into proficient OpenRefine users in the fastest possible way.

Who this book is for

This book is targeted at anyone who works on or handles a large amount of data. No prior knowledge of OpenRefine is required, as we start from the very beginning and gradually reveal more advanced features. You don't even need your own dataset, as we provide example data to try out the book's recipes.

Information

Publisher: Packt Publishing
Year: 2013
eBook ISBN: 9781783289080
Topic: Computer Science
Subtopic: Data Mining

Using OpenRefine

Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
    Support files, eBooks, discount offers and more
        Why Subscribe?
        Free Access for Packt account holders
Preface
    What this book covers
    What you need for this book
    Who this book is for
    Conventions
    Reader feedback
    Customer support
        Downloading the example files
        Errata
        Piracy
        Questions

1. Diving Into OpenRefine
    Introducing OpenRefine
    Recipe 1 – installing OpenRefine
        Windows
        Mac
        Linux
    Recipe 2 – creating a new project
        File formats supported by OpenRefine
    Recipe 3 – exploring your data
    Recipe 4 – manipulating columns
        Collapsing and expanding columns
        Moving columns around
        Renaming and removing columns
    Recipe 5 – using the project history
    Recipe 6 – exporting a project
    Recipe 7 – going for more memory
        Windows
        Mac
        Linux
    Summary

2. Analyzing and Fixing Data
    Recipe 1 – sorting data
        Reordering rows
    Recipe 2 – faceting data
        Text facets
        Numeric facets
        Customized facets
        Faceting by star or flag
    Recipe 3 – detecting duplicates
    Recipe 4 – applying a text filter
    Recipe 5 – using simple cell transformations
    Recipe 6 – removing matching rows
    Summary

3. Advanced Data Operations
    Recipe 1 – handling multi-valued cells
    Recipe 2 – alternating between rows and records mode
    Recipe 3 – clustering similar cells
    Recipe 4 – transforming cell values
    Recipe 5 – adding derived columns
    Recipe 6 – splitting data across columns
    Recipe 7 – transposing rows and columns
    Summary

4. Linking Datasets
    Recipe 1 – reconciling values with Freebase
    Recipe 2 – installing extensions
    Recipe 3 – adding a reconciliation service
    Recipe 4 – reconciling with Linked Data
    Recipe 5 – extracting named entities
    Summary

A. Regular Expressions and GREL
    Regular expressions for text patterns
        Character classes
        Quantifiers
        Anchors
        Choices
        Groups
        Overview
    General Refine Expression Language (GREL)
        Transforming data
        Creating custom facets
        Solving problems with GREL

Index

Using OpenRefine

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: September 2013

Production Reference: 1040913

Published by Packt Publishing Ltd.

Livery Place

35 Livery Street

Birmingham B3 2PB, UK.

ISBN 978-1-78328-908-0

www.packtpub.com

Cover Image by Aniket Sawant

Credits

Authors: Ruben Verborgh, Max De Wilde
Reviewers: Martin Magdinier, Dr. Mateja Verlic
Acquisition Editor: Sam Birch
Commissioning Editor: Subho Gupta
Technical Editors: Anita Nayak, Harshad Vairat
Project Coordinator: Sherin Padayatty
Proofreader: Paul Hindle
Indexer: Hemangini Bari
Production Coordinator: Nilesh R. Mohite
Cover Work: Nilesh R. Mohite

Foreword

At the time I joined Metaweb Technologies, Inc. in 2008, we were building up Freebase in earnest, entity by entity, fact by fact. Now you may know Freebase through its newest incarnation, Google's Knowledge Graph, which powers the "Knowledge panels" on www.google.com.

Building up "the world's database of everything" is a tall order that machines and algorithms alone cannot do, even if raw public domain data exists in abundance. Raw data from multiple sources must be cleaned up, homogenized, and then reconciled with data already in Freebase. Even that first step of cleaning up the data cannot be automated entirely; it takes the common sense of a human reader to know that if both 0.1 and 10,000,000 occur in a column named cost, they are very likely in different units (perhaps millions of dollars and dollars respectively). It also takes a human reader to decide that UCBerkley means the same as University of California in Berkeley, CA, but not the same as Berkeley DB.

If these errors occurred often enough, we might as well have given up or just hired enough people to perform manual data entry. However, they occur often enough to be a problem, yet rarely enough that anyone who has not dealt with such data thinks simple automation is sufficient. But, dear reader, you have dealt with data, and you know how unpredictably messy it can be.

Every dataset that we wanted to load into Freebase became an iterative exercise in programming mixed with manual inspection that led to hard-coding transformation rules, from turning two-digit years into four-digit ones to swapping given name and surname if there is a comma between them. Even for most of us programmers, this exercise got old quickly, and it was painful to start over every time.
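To make the kind of hard-coded rule mentioned above concrete, here is a minimal Python sketch of the two transformations named in the text. It is an illustration only, not the code used for Freebase: the function names `expand_year` and `swap_name` are hypothetical, and the pivot that decides between 19xx and 20xx is an assumption, since the foreword does not say how the century was chosen.

```python
import re

# Assumption for illustration: two-digit years at or below this pivot are
# read as 20xx, all others as 19xx. The source text does not specify a rule.
PIVOT = 30


def expand_year(value: str, pivot: int = PIVOT) -> str:
    """Turn a two-digit year such as '87' into '1987'; leave other values alone."""
    text = value.strip()
    if not re.fullmatch(r"\d{2}", text):
        return value
    year = int(text)
    century = 2000 if year <= pivot else 1900
    return str(century + year)


def swap_name(value: str) -> str:
    """Turn 'Surname, Given' into 'Given Surname' when exactly one comma is present."""
    parts = [part.strip() for part in value.split(",")]
    if len(parts) != 2 or not all(parts):
        # No comma, several commas, or an empty half: too ambiguous to rewrite.
        return value
    surname, given = parts
    return f"{given} {surname}"
```

Both functions return any value they do not recognize unchanged, so a person can still inspect those cells by hand. That mix of fixed rules and manual inspection is the iterative exercise the paragraph describes.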

So, we created Freebase Gridworks, a tool for cleaning up data and making it ready for loading into Freebase. We designed it to be a database-spreadsheet hybrid; it is interactive like spreadsheet software and programmable like databases. It was this combination that made Gridworks the first of its kind.

In the process of creating and then using Gridworks ourselves, we realized that cleaning, transforming, and just playing with data is crucial and generally useful, even if the goal is not to load data into Freebase. So, we redesigned the tool to be more generic, and released its Version 2 under the name "Google Refine" after Google acquired Metaweb.

Since then, Refine has been well received in many different communities: data journalists, open data enthusiasts, librarians, archivists, hacktivists, and even programmers and developers by trade. Its adoption in the early days spread through word of mouth, in hackathons and informal tutorials held by its own users.

Having proven itself through early adopters, Refine now needs better organized efforts to spread and become a mature product with a sustainable community around it. Expert users, open source contributors, and data enthusiast groups are actively teaching how to use Refine on tours and in the classroom. Ruben and Max from the Free Your Metadata team have taken the next logical step in consolidating those tutorials and organizing those recipes into this handy missing manual for Refine.

Stepping back to take in the bigger picture, we...
