Introduces fundamental concepts of the main architecture of the web and databases and covers HTTP, HTML, XML, JSON, SQL.
Provides basic techniques to query web documents and data sets (XPath and regular expressions).
An extensive set of exercises are presented to guide the reader through each technique.
Explores both supervised and unsupervised techniques as well as advanced techniques such as data scraping and text management.
Case studies are featured throughout along with examples for each technique presented.
R code and solutions to exercises featured in the book are provided on a supporting website.

Frequently asked questions

Yes, you can cancel anytime from the Subscription tab in your account settings on the Perlego website. Your subscription will stay active until the end of your current billing period. Learn how to cancel your subscription.

At the moment all of our mobile-responsive ePub books are available to download via the app. Most of our PDFs are also available to download and we're working on making the final remaining ones downloadable now. Learn more here.

Perlego offers two plans: Essential and Complete

Essential is ideal for learners and professionals who enjoy exploring a wide range of subjects. Access the Essential Library with 800,000+ trusted titles and best-sellers across business, personal growth, and the humanities. Includes unlimited reading time and Standard Read Aloud voice.
Complete: Perfect for advanced learners and researchers needing full, unrestricted access. Unlock 1.4M+ books across hundreds of subjects, including academic and specialized titles. The Complete Plan also includes advanced features like Premium Read Aloud and Research Assistant.

Both plans are available with monthly, semester, or annual billing cycles.

We are an online textbook subscription service, where you can get access to an entire online library for less than the price of a single book per month. With over 1 million books across 1000+ topics, we’ve got you covered! Learn more here.

Look out for the read-aloud symbol on your next book to see if you can listen to it. The read-aloud tool reads text aloud for you, highlighting the text as it is being read. You can pause it, speed it up and slow it down. Learn more here.

Yes! You can use the Perlego app on both iOS or Android devices to read anytime, anywhere — even offline. Perfect for commutes or when you’re on the go.
Please note we cannot support devices running on iOS 13 and Android 7 or earlier. Learn more about using the app.

Yes, you can access Automated Data Collection with R by Simon Munzert,Christian Rubba,Peter Meißner,Dominic Nyhuis in PDF and/or ePUB format, as well as other popular books in Computer Science & Data Mining. We have over one million books available in our catalogue for you to explore.

Information

Publisher

Year

Print ISBN

eBook ISBN

Edition

Topic

Computer Science

Subtopic

Data Mining

Index

Computer Science

1
Introduction

Are you ready for your first encounter with web scraping? Let us start with a small example that you can recreate directly on your machine, provided you have R installed. The case study gives a first impression of the book's central themes.

1.1 Case study: World Heritage Sites in Danger

The United Nations Educational, Scientific and Cultural Organization (UNESCO) is an organization of the United Nations which, among other things, fights for the preservation of the world's natural and cultural heritage. As of today (November 2013), there are 981 heritage sites, most of which of are man-made like the Pyramids of Giza, but also natural phenomena like the Great Barrier Reef are listed. Unfortunately, some of the awarded places are threatened by human intervention. Which sites are threatened and where are they located? Are there regions in the world where sites are more endangered than in others? What are the reasons that put a site at risk? These are the questions that we want to examine in this first case study.

What do scientists always do first when they want to get up to speed on a topic? They look it up on Wikipedia! Checking out the page of the world heritage sites, we stumble across a list of currently and previously endangered sites at http://en.wikipedia.org/wiki/List_of_World_Heritage_in_Danger. You find a table with the current sites listed when accessing the link. It contains the name, location (city, country, and geographic coordinates), type of danger that is facing the site, the year the site was added to the world heritage list, and the year it was put on the list of endangered sites. Let us investigate how the sites are distributed around the world.

Wikipedia—information source of choice

While the table holds information on the places, it is not immediately clear where they are located and whether they are regionally clustered. Rather than trying to eyeball the table, it could be very useful to plot the locations of the places on a map. As humans deal well with visual information, we will try to visualize results whenever possible throughout this book. But how to get the information from the table to a map? This sounds like a difficult task, but with the techniques that we are going to discuss extensively in the next pages, it is in fact not. For now, we simply provide you with a first impression of how to tackle such a task with R. Detailed explanations of the commands in the code snippets are provided later and more systematically throughout the book.

To start, we have to load a couple of packages. While R only comes with a set of basic, mostly math- and statistics-related functions, it can easily be extended by user-written packages. For this example, we load the following packages using the library() function:¹

 R> library(stringr) R> library(XML) R> library(maps)

In the next step, we load the data from the webpage into R. This can be done easily using the readHTMLTable() function from the XML package:

We are going to explain the mechanics of this step and all other major web scraping techniques in more detail in Chapter 9. For now, all you need to know is that we are telling R that the imported data come in the form of an HTML document. R is capable of interpreting HTML, that is, it knows how tables, headlines, or other objects are structured in this file format. This works via a so-called parser, which is called with the function htmlParse(). In the next step, we tell R to extract all HTML tables it can find in the parsed object heritage_parsed and store them in a new object tables. If you are not already familiar with HTML, you will learn that HTML tables are constructed from the same code components in Chapter 2. The readHTMLTable() function helps in identifying and reading out these tables.

All the information we need is now contained in the tables object. This object is a list of all the tables the function could find in the HTML document. After eyeballing all the tables, we identify and select the table we are interested in (the second one) and write it into a new one, named danger_table. Some of the variables in our table are of no further interest, so we select only those that contain information about the site's name, location, criterion of heritage (cultural or natural), year of inscription, and year of endangerment. The variables in our table have been assigned unhandy names, so we relabel them. Finally, we have a look at the names of the first few sites:

This seems to have worked. Additionally, we perform some simple data cleaning, a step often necessary when importing web-based content into R. The variable crit, which contains the information whether the site is of cultural or natural character, is recoded, and the two variables y_ins and y_end are turned into numeric ones.² Some of the entries in the y_end variable are ambiguous as they contain several years. We select the last given year in the cell. To do so, we specify a so-called regular expression, which goes [[:digit:]]4$—we explain what this means in the next paragraph:

The locn variable is a bit of a mess, exemplified by three cases drawn from the data-set:

The variable contains the name of the site's location, the country, and the geographic coordinates in several varieties. What we need for the map are the coordinates, given by the latitude (e.g., 30.84167N) and longitude (e.g., 29.66389E) values. To extract this information, we have to use some more advanced text manipulation tools called “regular expressions”, which are discussed extensively in Chapter 8. In short, we have to give R an exact description of what the information we are interested in looks like, and then let R search for and extract it. To do so, we use functions from the stringr package, which we will also discuss in detail in Chapter 8. In order to get the latitude and longitude values, we write the following:

The first regular expression

 R> reg_y <-"[/][ -]*[[:digit:]]*[.]*[[:digit:]]*[;]" R> reg_x <-"[;][ -]*[[:digit:]]*[.]*[[:digit:]]*" R> y_coords <- str_extract(danger_table$locn, reg_y) R> y_coords <- as.numeric(str_sub(y_coords, 3, -2)) R> danger_table$y_coords <- y_coords R> x_coords <- str_extract(danger_table$locn, reg_x) R> x_coords <- as.numeric(str_sub(x_coords, 3, -1)) R> danger_table$x_coords <- x_coords R> danger_table$locn <- NULL

Do not be confused by the first two lines of code. What looks like the result of a monkey typing on a keyboard is in fact a precise description of the coordinates in the locn variable. The information is contained in the locn variable as decimal degrees as well as in degrees, minutes, and seconds. As the decimal degrees are easier to describe with a regular expression, we try to extract those. Writing regular expressions means finding a general pattern for strings that we want to extract. We observe that latitudes and longitudes always appear after a slash and are a sequence of several digits, separated by a dot. Some values start with a minus sign. Both values are separated by a semicolon, which is cut off along with the empty spaces and the slash. When we apply this pattern to the locn variable with the str_extract(...

Cover
Title Page
Copyright
Dedication
Preface
Chapter 1: Introduction
Part One: A Primer on Web and Data Technologies
Part Two: A Practical Toolbox for Web Scraping and Text Mining
Part Three: A Bag of Case Studies
References
General index
Package index
Function index
End User License Agreement

9781400888184,

9781526418463,

About this book

Frequently asked questions

Information

1.1 Case study: World Heritage Sites in Danger

Table of contents