Data Analytics for the Social Sciences
eBook - ePub

Applications in R

  1. 686 pages
  2. English
  3. ePUB (mobile friendly)
  4. Available on iOS & Android

About this book

Data Analytics for the Social Sciences is an introductory, graduate-level treatment of data analytics for social science. It features applications in the R language, arguably the fastest growing and leading statistical tool for researchers.

The book starts with an ethics chapter on the uses and potential abuses of data analytics. Chapters 2 and 3 show how to implement a broad range of statistical procedures in R. Chapters 4 and 5 deal with regression and classification trees and with random forests. Chapter 6 deals with machine learning models and the "caret" package, which makes available to the researcher hundreds of models. Chapter 7 deals with neural network analysis, and Chapter 8 deals with network analysis and visualization of network data. A final chapter treats text analysis, including web scraping, comparative word frequency tables, word clouds, word maps, sentiment analysis, topic analysis, and more. All empirical chapters have two "Quick Start" exercises designed to allow quick immersion in chapter topics, followed by "In Depth" coverage. Data are available for all examples and runnable R code is provided in a "Command Summary". An appendix provides an extended tutorial on R and RStudio. Almost 30 online supplements provide information for the complete book, "books within the book" on a variety of topics, such as agent-based modeling.

Rather than focusing on equations, derivations, and proofs, this book emphasizes hands-on work: obtaining output for various social science models and interpreting that output. It is suitable for all advanced-level undergraduate and graduate students learning statistical data analysis.

Chapter 1 Using and abusing data analytics in social science

DOI: 10.4324/9781003109396-1

1.1 Introduction

The use and abuse of data analytics (DA), data science, and artificial intelligence (AI) is of major concern in business, government, and academia. In late 2019, based on a survey of 350 US and UK executives involved in AI and machine learning, DataRobot (2019a, 2019b), itself a developer of machine learning automation platforms, issued a news release on its report, headlining “Nearly half of AI professionals are ‘very to extremely’ concerned about AI bias.” Critics think the percentage should be even higher. This chapter has a triple purpose. First, published literature in the social and policy sciences is used to illustrate the promise of big data and DA, highlighting a variety of specific ways in which DA is useful. However, the other two sections of this chapter are cautionary. The second section inventories threats to good research design common among researchers employing big data and DA. The third section inventories various ethical issues associated with big data and DA. The question underlying this chapter is whether, in terms of big data and DA, we are marching toward a better society or toward an Orwellian “1984”. As in all such questions, the answer is, “Some of both”.
Before beginning, a word about terminology is needed. The terms “data science”, “data analytics”, “machine learning”, and “artificial intelligence” overlap in scope. In this volume, these “umbrella” terms may be used interchangeably by the author and by other authors who are cited. However, connotations differ. Data science suggests work done by graduates of data science programs, which are dominated by computer science departments. DA connotes the application of data science methods to other disciplines, such as social science. Machine learning refers to any of a large number of algorithms which may be used for classification and prediction. AI refers to algorithms that adjust and hopefully improve in effectiveness across iterations, such as neural networks of various types. (In this book we do not refer to the broader popular meaning of artificial human intelligence as portrayed in science fiction.) The common denominator of all these admittedly fuzzy terms is what is often called “algorithmic thinking”, meaning reliance on computer algorithms to arrive at classifications, predictions, and decisions. All approaches may utilize “big data”, referring to the capacity of these methods to deal with enormous sets of mixed numeric, text, and even video data, such as may be scraped from the internet. Big data may magnify bias associated with algorithmic thinking but it is not a prerequisite for bias and abuse in the application of data science methods.
Official policy on ethics for information technology, including DA, is found in the 2012 “Menlo Report” of the Directorate of Science & Technology of the US Department of Homeland Security. This report was followed up by a “companion” document containing case studies and further guidance (Dittrich, Kenneally, & Bailey, 2013). The Menlo Report contains highly generalized guidelines for ethical practice in the domain of DA. In a nutshell, it sets out four principles:
  1. Respect for persons: DA projects should be based on informed consent of those participating in or impacted by the project.
    The problem, of course, is that the whole basis of “big data” approaches is that huge amounts of data are collected without realistic possibility of gathering true informed consent. Even when data are collected directly from the person, consent takes the form of a button click, giving “consent” to fine print in legalese. This token consent may even be obtained coercively as failure to click may deny the person the right to make a purchase or obtain some other online benefits.
  2. Beneficence: This is the familiar “do no harm” ethic with roots going back to the Hippocratic Oath for doctors. In practical terms, DA projects are called upon to undertake systematic assessments of risks and harms as well as benefits.
    The problem is that DA projects are mostly commissioned with deliverables set beforehand and with tight timetables. For the most part, the technocratic staff of DA projects is ill-trained to undertake true cost-benefit studies even if time constraints and work contracts permitted them. The Menlo Report itself provides a giant loophole, noting that there are long-term social benefits to having research. It is easy to see these benefits as outweighing diffuse costs which take the form of loss of confidentiality and privacy, violations of data integrity, and individual or group impairment of reputation. The reality is that few, if any, DA projects are halted due to lack of “beneficence”, though placing a privacy policy on one's website or obtaining pro forma “consent” is commonplace. The costs in time and money of challenging shortcomings in “beneficence” fall on the aggrieved person, who often finds that pro-business legislation and courts, not to mention the superior legal staff of corporations and governments, make the chance of success dim.
  3. Justice: The principle of information justice means that all persons are treated equally with regard to data selection without bias. Also, benefits of information technology are to be distributed fairly.
    The problem is that on the selection side, profiling is inherent in big data analysis. Profiling, in turn, is famously subject to bias. On the fair distribution side, the Menlo Report and DA projects generally interpret fairness in terms of individual need, individual effort, societal contribution, and overall merit. These fairness concepts are subjective and extremely vague. If information justice is considered at all, it is easily rationalized in a way that justifies existing DA practices without need for revision.
  4. Respect for law and the public interest: DA projects should be based on legal “due diligence” and on transparency with regard to DA methods and results, and they should be subject to accountability.
    DA projects lack “due diligence” if there is no evidence that some effort was undertaken to conform to relevant laws dealing with privacy and data integrity. The corporation or government agency which commissions a DA project is wise to have such evidence, usually in the form of an official privacy policy, a policy on data sharing, and so on. These policies are frequently posted on the web, giving evidence of “transparency”. The problem is that this primarily serves for legal protection of the corporation or government entity and is rarely a constraint on what the DA project actually does.
It is common in many domains for ethical guidelines to lack impact. An illustration at this writing is the ethical standards document of the American Society for Public Administration in the era of the Trump presidency and its many challenges to ethics. Like that document, the usefulness of the Menlo Report is primarily to call attention to ethical issues, not actually to regulate DA projects.
Ostensibly, every US federal agency has appointed a “data steward” responsible for each database it maintains. While this is not the same as having a steward for each algorithm-based program, most agencies have a data steward statement of responsibilities that often covers data privacy, transparency, and other values. An example is in the “Readings and References” section of the student Support Material (www.routledge.com/9780367624293) for this book. There may be a Data Stewardship Executive Policy Committee to oversee data stewardship, as there is in the US Census Bureau. A literature review by the author was unable to find even a single empirical study of the effectiveness of governmental data stewards, though prescriptive articles on what makes a data steward effective abound. “The proof is in the pudding” must be the investigatory rule here. Much of this chapter is devoted to illustrations of problems with the pudding.
Petrozzino (2020), addressing the Menlo Report, has argued that formal ethical principles do make a difference. Petrozzino, a Principal Cybersecurity Engineer within the National Security Engineering Center operated by MITRE for the US Department of Defense, concluded her analysis by writing, “The enthusiasm of organizations to use big data should be married with the appropriate analysis of potential impact to individuals, groups, and society. Without this analysis, the potential issues are numerous and substantively damaging to their mission, organization, and external stakeholders” (p. 17). As with Biblical principles of morality, it is largely up to the individual to act upon ethical principles. However, it is thought better for the DA project director to have principles than not to have them!

1.2 The promise of data analytics for social science

1.2.1 Data analytics in public affairs and public policy

The Menlo Report discussed earlier specifically calls attention to the societal value of basic research based on big data. Big data and DA have been applied to public policy problems as diverse as making health-care delivery more efficient (Sousa et al., 2019), improving the state of the art in biomedicine (Mittelstadt, 2019), advancing the techniques of forensic accounting (Zabihollah & Wang, 2019), improving crop selection in agriculture (Tseng, Cho, & Wu, 2019), estimating travel time in transportation networks (Bertsimas et al., 2019), and identifying trucks involved in illegal construction waste dumping (Lu, 2019). Likewise, Hauer (2019: 222) is one of many who have noted the sweeping scope of algorithms, which implement DA. He wrote, “Algorithms plan flights and then fly with planes. Algorithms run factories, the bank is a vast array of algorithms, evaluating our credit score, algorithms collect revenue and keep records, read medical images, diagnose cancer, drive cars, write scientific texts, compose music, conduct symphony orchestras, navigate drones, speak to us and for us, write film scenarios, invent chemical formulations for a new cosmetic cream, order, advise, paint pictures. Climate models decide what is a safe carbon dioxide level in the atmosphere. NSA algorithms decide whether you are a potential terrorist.”
In the same vein, Cathy Petrozzino has observed, “the public sector at every level – federal, state, local, and tribal – also has benefited from its creation of big data collections and applications of data science.” She gave such examples as the Care Assessment Needs (CAN) system of the Veterans Health Administration (VHA), and the Office of Anti-Fraud Program of the Social Security Administration (Petrozzino, 2020: 14).
Public health and the provision of medical care form one of the domains that have been a center of big data and DA activity. Garattini et al. (2019: 69), for instance, have noted many benefits of big data in medicine, where DA “offers the capacity to rationalize, understand and use big data to serve many different purposes, from improved services modelling to prediction of treatment outcomes, to greater patient and disease stratification. In the area of infectious diseases, the application of big data analytics has introduced a number of changes in the information accumulation models… Big data analytics is fast becoming a crucial component for the modeling of transmission – aiding infection control measures and policies – emergency response analyses required during local or international outbreaks.”

1.2.2 Data analytics in the social sciences

Given the DA revolution in public and private sectors, it would be surprising not to see a rapid gravitation of the social sciences in the same direction and, indeed, this is happening quickly in the current era. The work of Richard Hendra, director of the Manpower Demonstration Research Corporation's (MDRC) Center for Data Insights (https://www.mdrc.org/), exemplifies how a social scientist can employ data analytic methods to address some of the nation's toughest social policy challenges through leveraging already collected data to derive actionable insights to help improve well-being among low-income individuals and families. Illustrative projects include a nonprofit initiative that focuses on leveraging MIS data to improve program targeting and a national effort to improve DA capacity and infrastructure in the Temporary Assistance for Needy Families (TANF) system. Other application areas include employment, housing, criminal justice, financial inclusion, and substance abuse issues. Hendra's work centers on how data science fits within long-term learning agendas, using techniques like random forests and ensemble methods to complement the causal inference studies that MDRC is known for.

1.2.3 Data analytics in the humanities

Before closing this section, we would be remiss not to mention that DA and big data open up new opportunities for scholars working in the humanities, where text analysis is paramount. Thus boyd (sic) and Crawford (2012: 667) noted, “Big Data offers the humanistic disciplines a new way to claim the status of quantitative science and objective method. It makes many more social spaces quantifiable.”

1.3 Research design issues in data analytics

1.3.1 Beware the true believer

Almost a decade ago, boyd and Crawford (2012: 666) found “an arrogant undercurrent in many Big Data debates where other forms of analysis are too easily sidelined… This is not a space that has been welcoming to older forms of intellectual craft.” This intellectual arrogance continues to the present day. For instance, this author (Garson) has encountered data science students who had been taught that conventional statistical analysis would soon be a thing of the past. As boyd and Crawford noted, intellectual arrogance has the potential to “crystalize into new orthodoxies”, discouraging collaboration and inhibiting rather than promoting innovation. The deserved praise for the potential of big data and DA must be tempered with recognition that there are many quantitative, qualitative, and mixed paths to knowledge. Moreover, many of the “new” machine learning techniques, like deep learning with neural networks or text analytic content analysis, antedate the rise of modern data science. Correspondingly, data science texts today commonly present linear and logistic regression, cluster analysis, multidimensional scaling, and other “traditional” statistical approaches as integral to DA, albeit often done in R or Python rather than SPSS, SAS, or Stata.
Technocratic isolation encourages “true believership”; diversity mitigates it. Speaking of ethics and bias in the application of machine learning and AI in response to the COVID pandemic crisis of 2020, Sipior (2020) wrote, “A diversity of disciplines is the key to success in AI, especially to minimize risk associated with rapid deployment … Team membership should be well-rounded, from a wide range of backgrounds and skill sets, for complex problem-solving with innovative solutions and for recognizing the potential for bias. To address issues such as bias, ethics, and compliance, among others, roles such as an AI Ethicist, attorney, and/or review board, may be added.” She quotes Shellenbarger (2019), “The biases that are implicit in one team member are clear to, a...

Table of contents

  1. Cover Page
  2. Half Title Page
  3. Title Page
  4. Copyright Page
  5. Dedication Page
  6. Contents
  7. Acknowledgments
  8. Preface
  9. 1 Using and abusing data analytics in social science
  10. 2 Statistical analytics with R, Part 1
  11. 3 Statistical analytics with R, Part 2
  12. 4 Classification and regression trees in R
  13. 5 Random forests
  14. 6 Modeling and machine learning
  15. 7 Neural network models and deep learning
  16. 8 Network analysis
  17. 9 Text analytics
  18. Appendix 1: Introduction to R and RStudio
  19. Appendix 2: Data used in this book
  20. References
  21. Index

Frequently asked questions

Yes, you can access Data Analytics for the Social Sciences by G. David Garson in PDF and/or ePUB format, as well as other popular books in Psychology & Research & Methodology in Psychology. We have over one million books available in our catalogue for you to explore.