What do scientists always do first when they want to get up to speed on a topic? They look it up on Wikipedia! Checking out the page of the world heritage sites, we stumble across a list of currently and previously endangered sites at http://en.wikipedia.org/wiki/List_of_World_Heritage_in_Danger. You find a table with the current sites listed when accessing the link. It contains the name, location (city, country, and geographic coordinates), type of danger that is facing the site, the year the site was added to the world heritage list, and the year it was put on the list of endangered sites. Let us investigate how the sites are distributed around the world.
Wikipedia—information source of choice
While the table holds information on the places, it is not immediately clear where they are located and whether they are regionally clustered. Rather than trying to eyeball the table, it could be very useful to plot the locations of the places on a map. As humans deal well with visual information, we will try to visualize results whenever possible throughout this book. But how to get the information from the table to a map? This sounds like a difficult task, but with the techniques that we are going to discuss extensively in the next pages, it is in fact not. For now, we simply provide you with a first impression of how to tackle such a task with R. Detailed explanations of the commands in the code snippets are provided later and more systematically throughout the book.
To start, we have to load a couple of packages. While R only comes with a set of basic, mostly math- and statistics-related functions, it can easily be extended by user-written packages. For this example, we load the following packages using the library() function:1
R> library(stringr)
R> library(XML)
R> library(maps)
In the next step, we load the data from the webpage into R. This can be done easily using the readHTMLTable() function from the XML package:
We are going to explain the mechanics of this step and all other major web scraping techniques in more detail in Chapter 9. For now, all you need to know is that we are telling R that the imported data come in the form of an HTML document. R is capable of interpreting HTML, that is, it knows how tables, headlines, or other objects are structured in this file format. This works via a so-called parser, which is called with the function htmlParse(). In the next step, we tell R to extract all HTML tables it can find in the parsed object heritage_parsed and store them in a new object tables. If you are not already familiar with HTML, you will learn that HTML tables are constructed from the same code components in Chapter 2. The readHTMLTable() function helps in identifying and reading out these tables.
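The parse-then-extract step described above can be tried without a network connection. The following is only a minimal sketch: the HTML string, its contents, and the object names demo_parsed and demo_tables are made up for illustration and merely stand in for the Wikipedia page, heritage_parsed, and tables.

```r
library(XML)

# A tiny, made-up HTML page with two tables (an assumption for illustration;
# the real page is the Wikipedia article named in the text)
html_text <- paste0(
  "<html><body>",
  "<h1>Demo page</h1>",
  "<table><tr><th>Name</th><th>Year</th></tr>",
  "<tr><td>Site A</td><td>1979</td></tr>",
  "<tr><td>Site B</td><td>2001</td></tr></table>",
  "<table><tr><th>Country</th></tr>",
  "<tr><td>Egypt</td></tr></table>",
  "</body></html>"
)

# Step 1: parse the HTML; asText = TRUE says the input is the document
# itself rather than a file name or URL
demo_parsed <- htmlParse(html_text, asText = TRUE)

# Step 2: extract every table found in the parsed document into a list
demo_tables <- readHTMLTable(demo_parsed, stringsAsFactors = FALSE)

# One list element per HTML table
length(demo_tables)
demo_tables[[1]]
```

For the real page, the same two calls would presumably be applied to the URL given above instead of an inline string.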
All the information we need is now contained in the tables object. This object is a list of all the tables the function could find in the HTML document. After eyeballing all the tables, we identify and select the table we are interested in (the second one) and write it into a new one, named danger_table. Some of the variables in our table are of no further interest, so we select only those that contain information about the site's name, location, criterion of heritage (cultural or natural), year of inscription, and year of endangerment. The variables in our table have been assigned unhandy names, so we relabel them. Finally, we have a look at the names of the first few sites:
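The select-and-relabel step can be sketched as follows. Everything in the toy data frame is hypothetical: the column names, their order, and the values are invented stand-ins for the table picked out of tables, and the new label name is an assumption (the text only mentions locn, crit, y_ins, and y_end).

```r
# Hypothetical stand-in for the table selected from `tables`
raw_table <- data.frame(
  Name       = c("Site A", "Site B"),
  Image      = c("", ""),
  Location   = c("Town A, Country A", "Town B, Country B"),
  Criteria   = c("Cultural:(iii)", "Natural:(vii)"),
  Area       = c("10", "20"),
  Year       = c("1979", "1981"),
  Endangered = c("2001", "1984-1992, 1996"),
  stringsAsFactors = FALSE
)

# Keep only the variables of interest
danger_demo <- raw_table[, c("Name", "Location", "Criteria", "Year", "Endangered")]

# Replace the unhandy names with short labels
colnames(danger_demo) <- c("name", "locn", "crit", "y_ins", "y_end")

# Look at the names of the first few sites
head(danger_demo$name, 3)
```

Selecting columns by name rather than by position is a design choice of this sketch; it keeps the selection readable if the column order changes.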
This seems to have worked. Additionally, we perform some simple data cleaning, a step often necessary when importing web-based content into R. The variable crit, which contains the information whether the site is of cultural or natural character, is recoded, and the two variables y_ins and y_end are turned into numeric ones.2 Some of the entries in the y_end variable are ambiguous as they contain several years. We select the last given year in the cell. To do so, we specify a so-called regular expression, which goes [[:digit:]]{4}$; we explain what this means in the next paragraph:
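A cleaning step along these lines might look as follows. This is a sketch under stated assumptions rather than the original listing: the toy values, the object name danger_demo, and the recoded labels "nat" and "cult" are invented for illustration.

```r
library(stringr)

# Hypothetical toy data mimicking the three variables to be cleaned
danger_demo <- data.frame(
  crit  = c("Cultural:(iii)", "Natural:(vii)"),
  y_ins = c("1979", "1981"),
  y_end = c("2001", "1984-1992, 1996"),
  stringsAsFactors = FALSE
)

# Recode the criterion: "nat" if the entry mentions "Natural", else "cult"
danger_demo$crit <- ifelse(str_detect(danger_demo$crit, "Natural"), "nat", "cult")

# Turn the year of inscription into a number
danger_demo$y_ins <- as.numeric(danger_demo$y_ins)

# Keep only the last year in each cell: exactly four digits
# directly before the end of the string
danger_demo$y_end <- as.numeric(str_extract(danger_demo$y_end, "[[:digit:]]{4}$"))

danger_demo
```

The pattern is anchored with $, so a cell that lists several years yields only the final one.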
The locn variable is a bit of a mess, exemplified by three cases drawn from the data set:
The variable contains the name of the site's location, the country, and the geographic coordinates in several varieties. What we need for the map are the coordinates, given by the latitude (e.g., 30.84167N) and longitude (e.g., 29.66389E) values. To extract this information, we have to use some more advanced text manipulation tools called “regular expressions”, which are discussed extensively in Chapter 8. In short, we have to give R an exact description of what the information we are interested in looks like, and then let R search for and extract it. To do so, we use functions from the stringr package, which we will also discuss in detail in Chapter 8. In order to get the latitude and longitude values, we write the following:
The first regular expression
R> reg_y <- "[/][ -]*[[:digit:]]*[.]*[[:digit:]]*[;]"
R> reg_x <- "[;][ -]*[[:digit:]]*[.]*[[:digit:]]*"
R> y_coords <- str_extract(danger_table$locn, reg_y)
R> y_coords <- as.numeric(str_sub(y_coords, 3, -2))
R> danger_table$y_coords <- y_coords
R> x_coords <- str_extract(danger_table$locn, reg_x)
R> x_coords <- as.numeric(str_sub(x_coords, 3, -1))
R> danger_table$x_coords <- x_coords
R> danger_table$locn <- NULL
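To see what the two patterns pick up, it can help to run them on a couple of strings by hand. The sketch below reuses reg_y and reg_x from the listing above; the two input strings are made up and only imitate the "/ latitude; longitude" part of the format, and the intermediate names y_raw and x_raw are introduced here for illustration.

```r
library(stringr)

# The two patterns from the listing above
reg_y <- "[/][ -]*[[:digit:]]*[.]*[[:digit:]]*[;]"
reg_x <- "[;][ -]*[[:digit:]]*[.]*[[:digit:]]*"

# Hypothetical location strings (not taken from the real table)
locn_demo <- c(
  "Abusir, Egypt / 30.84167; 29.66389 (Abu Mena)",
  "Somewhere / -18.5; -45.25 (Hypothetical site)"
)

# Raw matches still carry the delimiters: the slash and semicolon for
# the latitude, the semicolon for the longitude
y_raw <- str_extract(locn_demo, reg_y)
x_raw <- str_extract(locn_demo, reg_x)

# Drop the leading delimiter and space (and the trailing semicolon for
# the latitude), then convert to numbers
y_coords <- as.numeric(str_sub(y_raw, 3, -2))
x_coords <- as.numeric(str_sub(x_raw, 3, -1))

data.frame(y_coords, x_coords)
```

Printing y_raw and x_raw before the str_sub() calls shows which characters are trimmed away in the second step.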
Do not be confused by the two lines of code that define reg_y and reg_x. What looks like the result of a monkey typing on a keyboard is in fact a precise description of the coordinates in the locn variable. The information is contained in the locn variable as decimal degrees as well as in degrees, minutes, and seconds. As the decimal degrees are easier to describe with a regular expression, we try to extract those. Writing regular expressions means finding a general pattern for strings that we want to extract. We observe that latitudes and longitudes always appear after a slash and are a sequence of several digits, separated by a dot. Some values start with a minus sign. Both values are separated by a semicolon, which is cut off along with the empty spaces and the slash. When we apply this pattern to the locn variable with the str_extract(...