1.1 Introduction
The earliest hard drives on personal computers had the capacity to store roughly 5 MB; typical personal computers can now store around 500 GB, about 100,000 times more. Data and computing are not limited to supercomputing centers, mainframes, or personal computers, but have become more mobile and virtual. In fact, the very definition of “computer” is evolving to encompass more and more aspects of our lives, from transportation (with computers in our cars and cars that drive themselves) to everyday living with televisions, refrigerators, doorbells, thermostats, and many other devices that are Internet‐enabled and “smart.” The rise of the Internet of things, smart devices, and personal computing power shifting to mobile environments and devices, like our smartphones, has certainly created an unprecedented opportunity for survey and social researchers and government officials to track, measure, and better understand public opinion, social phenomena, and the world around us.
A review of the current social science and survey research literature reveals that these fields are indeed at a crossroads, transitioning from methods of active observation and data collection to a new landscape in which researchers are exploring, considering, and using some of these new data sources for measuring public opinion and social phenomena. Connelly et al. (2016) comment that “whilst there may be a ‘Big Data revolution’ underway, it is not the size or quantity of these data that is revolutionary. The revolution centers on the increased availability of new types of data which have not previously been available for social research.” In our view, advances in this new landscape are taking place in seven major dimensions:
- (1) reimagining traditional survey research by leveraging new machine learning methods (MLMs) that improve efficiencies of traditional survey data collection, processing, and analysis;
- (2) augmenting traditional survey data with nonsurvey data (administrative, social media, or other Big Data sources) to improve estimates of public opinion and official statistics;
- (3) enhancing official statistics or estimates of public opinion derived from Big Data or other nonsurvey data;
- (4) comparing estimates of public opinion and official statistics derived from survey data sources to those generated from Big Data or other nonsurvey data exclusively;
- (5) exploring new methods for enhancing the collection, processing, and analysis of survey and nonsurvey data;
- (6) adapting and modifying current methods for use with new data sources and developing new techniques suitable for design and model‐based inference with these data sources;
- (7) contributing survey data, methods, and techniques to the Big Data ecosystem.
Woven into this new landscape is the perspective of assimilating these new data sources into the process as substitutes for, supplements to, or auxiliary information for existing survey data. Computational social science continues to evolve as a social sciences subfield that uses computational methods, models, and advanced information technology to understand social phenomena. This subfield is particularly suited to take advantage of alternative data sources, including social media and other sources of Big Data such as digital footprint data generated as part of online social activities. Survey researchers are exploring the potential of these alternative data sources, especially social media data. Most recently, Burke et al. (2018) explored how social media data create opportunities not only for sampling and recruiting specific populations but also for understanding, by mining the data on social media sites, the growing proportion of the population that is active on them.
No matter the Big Data source, we need data science methods and approaches that are well suited to deal with these types of data. Although some have made the case that administrative data be considered as Big Data (Connelly et al. 2016), the general consensus in both the data science and survey research communities is that they are not. However, with the increased collection of paradata and the increased use of sensors and other similar peripherals in data collection, one could argue that survey data are getting bigger – not in the number of cases, but in the number of variables that are available per case. Put another way, some survey datasets are getting bigger because they are “wider” rather than “longer.” So we could also make the case that surveys themselves are creating bigger data and could benefit from applying these types of methods.
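The “wider” rather than “longer” distinction can be made concrete with a toy example. The sketch below is purely illustrative: the respondents, variable names, and the `add_paradata` helper are hypothetical, and the appended values are placeholders rather than real paradata. It shows only that appending per-case paradata variables increases the number of variables while the number of cases stays fixed.

```python
# Hypothetical illustration: three made-up survey cases with two
# questionnaire items each (not real data).
respondents = [
    {"case_id": 1, "q1": 4, "q2": 2},
    {"case_id": 2, "q1": 5, "q2": 1},
    {"case_id": 3, "q1": 3, "q2": 3},
]


def shape(rows):
    """Return (number of cases, number of variables per case)."""
    n_cases = len(rows)
    n_vars = len(rows[0]) if rows else 0
    return n_cases, n_vars


def add_paradata(rows, n_extra):
    """Append n_extra placeholder paradata variables to every case.

    The variables stand in for things like per-item timings or sensor
    readings; the 0.0 values are placeholders, not measurements.
    """
    widened = []
    for row in rows:
        new_row = dict(row)  # copy so the original cases are untouched
        for j in range(n_extra):
            new_row[f"paradata_{j}"] = 0.0
        widened.append(new_row)
    return widened


before = shape(respondents)
after = shape(add_paradata(respondents, 50))
print("cases, variables before:", before)
print("cases, variables after: ", after)
```

In this sketch the dataset keeps the same three cases throughout, while the variable count grows from 3 to 53, which is the sense in which a survey dataset becomes “wider” without becoming “longer.”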
This chapter explores how techniques from data science, known collectively as MLMs, can be leveraged in each phase of the survey research process. Success in the new landscape will require adapting some of the analytic and data collection approaches traditionally used by survey and social scientists to handle this more data‐rich reality. In his recent book Machine Learning, part of the MIT Press Essential Knowledge series, computer engineering professor Ethem Alpaydin notes, “Machine learning will help us make sense of an increasingly complex world. Already we are exposed to more data than what our sensors can cope with or our brains can process.” Although the use of MLMs seems fairly new in survey research, machines and technology have long been part of the survey and social sciences' DNA. At the 1957 AAPOR conference, Frederick Stephan pointed out, in the context of data collection, that “Computers will tax the ingenuity, judgment and skill of technically proficient people to (a) put the job on the machine and (b) put the results in form for comprehension of human beings; and determine the courses of action we might take based on what the machines have told us.” This adage still holds today, but we are moving from machines to MLMs.
This chapter provides a brief overview of MLMs and a deeper exploration of how these data science techniques are applied to the social sciences from the perspective of the survey research process including sample design and constructing sampling frames; questionnaire design and evaluation; survey recruitment and data collection; survey data coding and processing; sample weighting and survey adjustment; and data analysis and es...