1.1 What is data?
Data and their statistical summaries and interpretations are ubiquitous. For example, we found these four articles during a typical day reading the paper:
• Example 1.1: To compile evidence to establish cause and effect
In an opinion piece, Joe Nocera [46] discusses the prevalence of guns in the movies (in anticipation of yet another “Die Hard” movie). He quotes a spokesperson from the Motion Picture Association of America as saying:
“There is a predominance of findings that show there is no consistent or convincing evidence that exposure [to gun violence in movies] causes people to be more violent.”
However, Nocera immediately counters this by quoting a professor from the University of Wisconsin: “There is tons of research on this.”
Clearly the collection and interpretation of data is crucial when making policy decisions. This isn’t an easy task, of course. A casual reader may think the above differences of opinion are a matter of political motivation, but this need not be the case. Variables can be related even when there is no cause-and-effect relationship between them. Finding convincing evidence in data often requires careful data collection before conclusions can be drawn.
• Example 1.2: Price of a hip replacement
In a news piece, Elisabeth Rosenthal [51] describes the research of Jaime Rosenthal, who called more than 100 hospitals, covering every state, in the summer of 2012, seeking the price of a hip replacement for a hypothetical, uninsured, 62-year-old female. The results were surprising:
Only about half the institutions could provide an estimate
Of those that could, the range of prices went from $11,000 to $125,798
Commentary in the article urges people to place the price data in the context of many other factors, such as infection rates and unexpected deaths. However, the article summarizes the primary researcher’s belief that there is little consistent correlation between higher prices and better quality in American health care.
Even in what is perhaps the most data-driven industry, there is a clear need for data and for context in which to place it. Further, this example hints at other difficulties in data collection. One is the question of what to do with missing data, as some values will often be unavailable. Another is that the mechanism for computing a price at one hospital may differ from that used at another.
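The missing-data issue raised above can be illustrated with a minimal R sketch. Only the two endpoint prices come from the article; the vector `prices`, the middle value, and the placement of the `NA` entries are made up for illustration and are not the researcher's data.

```r
# Hypothetical price quotes: NA marks a hospital that could not give an estimate.
# Only 11000 and 125798 are taken from the article; 42000 is invented.
prices <- c(11000, NA, 42000, NA, 125798)

# By default a single missing value makes the summary itself missing.
range(prices)

# Dropping the missing values first recovers the reported range.
price_range <- range(prices, na.rm = TRUE)
price_range

# Proportion of hospitals with no estimate in this toy vector.
prop_missing <- mean(is.na(prices))
prop_missing
```

Whether dropping missing values is appropriate depends on why they are missing; here it simply shows that the choice has to be made explicitly.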
• Example 1.3: Safety of the airline industry
In a front page article titled “Airline Industry at Its Safest Since the Dawn of the Jet Age,” authors Jad Mouawad and Christopher Drew [43] summarize the data collected by the Aviation Safety Network, pointing out that 2012 had only 23 deadly accidents and 475 fatalities. This may sound high, but expressing it as a rate helps give context: this is a risk of one death per 45 million flights. That is, a person could fly daily for an average of 123,000 years before being in a fatal plane crash.
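The 123,000-year figure follows from the quoted rate by simple arithmetic. The sketch below reproduces it in R, assuming that "flying daily" means one flight per day, 365 days a year; the variable names are ours, not the article's.

```r
# Rate quoted in the article: one death per 45 million flights.
flights_per_death <- 45e6

# Assumption: one flight per day, 365 days per year.
flights_per_year <- 365

# Years of daily flying needed to accumulate 45 million flights.
years_of_daily_flying <- flights_per_death / flights_per_year

# Round to the nearest thousand years, as the article does.
round(years_of_daily_flying, -3)
```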
The improvements in safety are not limited to advanced technologies, as the industry (regulators, pilots, and airlines) has created a culture of sharing data about flying hazards with the goal of preventing accidents.
This example shows how a focus on understanding the many factors that can contribute to a given statistic can help improve an area. It wasn’t enough that the airlines kept statistics; rather, they used their findings to address shortcomings.
• Example 1.4: Networking
On the business page, Andrew Sorkin [53] reports on a database containing the names of over two million deal makers, power brokers, and business executives, and in many cases the names of spouses, children, and associates, along with political donations, charity work, and more. This information, held by a company called Relationship Science, is compiled by more than 800 people.
The goal, of course, is to sell this information to people who plan to leverage the network of relationships. Other companies, such as Facebook and LinkedIn, have such information on their users, and the NSA seemingly has all the data it could ever need, but in this case the information is scraped from web sites: a person need not be a member of a social network or have a security clearance.
How such large databases get mined and what this means for personal privacy will likely continue to be a major topic of conversation for years to come. Though the statistical techniques of working with so-called “big data” are outside the scope of this text, many of the computational skills will be developed.
In this sampling of articles, we see the analysis of data used in many different ways:
Under the name “studies,” data is used to make a case about social policy (in two different ways!).
To investigate variability in prices and transparency, data is collected and summarized.
In an industry, data demonstrates that forward looking practices can have a substantial effect.
Data and the information it contains are mined to establish a financial advantage.
Data and its analysis is a very wide topic, so wide we couldn’t begin to describe it all. In this text we narrow our focus, looking at data with an eye towards statistical inference. This is the process of drawing conclusions about populations based on data collected from these populations. To do this, we will use the language of probability. This will give us the flexibility to describe concrete things using data subject to random variation. Exactly how this is done will require us to make models for our data. This text is roughly organized into three areas: the first develops techniques for exploring data, the second covers the basics of statistical inference, and the third covers the beginnings of modeling with data.
The rest of this chapter is focused on getting started with using R. We save more statistically oriented examples for Chapters 2 and beyond.