Knowledge Discovery in the Social Sciences

A Data Mining Approach

  1. 264 pages
  2. English

About this book

Knowledge Discovery in the Social Sciences helps readers find valid, meaningful, and useful information. It is written for researchers and data analysts as well as students who have no prior experience in statistics or computer science. Suitable for a variety of classes—including upper-division courses for undergraduates, introductory courses for graduate students, and courses in data management and advanced statistical methods—the book guides readers in the application of data mining techniques and illustrates the significance of newly discovered knowledge. 

Readers will learn to: 
• appreciate the role of data mining in scientific research 
• develop an understanding of fundamental concepts of data mining and knowledge discovery
• use software to carry out data mining tasks
• select and assess appropriate models to ensure findings are valid and meaningful
• develop basic skills in data preparation, data mining, model selection, and validation
• apply concepts with end-of-chapter exercises and review summaries
 

Knowledge Discovery in the Social Sciences by Prof. Xiaoling Shu is available in PDF and/or ePUB format and is listed under Social Sciences and Social Science Research & Methodology.

PART I

KNOWLEDGE DISCOVERY AND DATA MINING IN SOCIAL SCIENCE RESEARCH

Chapter 1

INTRODUCTION

ADVANCES IN TECHNOLOGY, including the internet, mobile devices, computers, digital sensors, and recording equipment, have led to exponential growth in the amount and complexity of data available for analysis. It has become difficult or even impossible to capture, manage, process, and analyze these data in a reasonable amount of time. We are at the threshold of an era in which digital data play an increasingly important role in the research process. In the traditional approach, hypotheses derived from theories are the driving forces behind model building. However, with the rise of big data and the enormous wealth of information and knowledge buried in this data mine, using data mining technologies to discover interesting, meaningful, and robust patterns has become increasingly important. This alternative method of research affects all fields, including the social sciences. The availability of huge amounts of data provides unprecedented opportunities for new discoveries, as well as challenges.
Today we are confronted with a data tsunami. We are accumulating data at an unprecedented scale in many areas of industry, government, and civil society. Analysis and knowledge based on big data now drive nearly every aspect of society, including retail, financial services, insurance, wireless mobile services, business management, urban planning, science and technology, social sciences, and humanities. Google Books has so far digitized 4 percent of all the books ever printed in the world, and the process is ongoing. The Google Books corpus contains more than 500 billion words in English, French, Spanish, German, Chinese, Russian, and Hebrew; reading only its English-language entries from the year 2000, continuously at a pace of 200 words per minute, would take a person eighty years (Michel et al. 2011). This entire corpus is available for downloading (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html), and Google also hosts another site to graph word usage over time, from 1800 to 2008 (https://books.google.com/ngrams). The Internet Archive, a digital library of internet sites and other cultural artifacts in digital form, provides free access to 279 billion web pages, 11 million books and texts, 4 million audio recordings, 3 million videos, 1 million images, and 100,000 software programs (https://archive.org/about/). Facebook generates 4 new petabytes of data and runs 600,000 queries and one million map-reduce jobs per day. Facebook’s data warehouse Hive stores 300 petabytes of data in 800,000 tables, as reported in 2014 (https://research.fb.com/facebook-s-top-open-data-problems/). The GDELT database monitors global cyberspace in real time, extracting news events from portals, print media, TV broadcasts, online media, and online forums in all countries of the world, together with key information such as the people, places, organizations, and event types related to those events.
The GDELT Event Database records over 300 categories of physical activities around the world, from riots and protests to peaceful appeals and diplomatic exchanges, georeferenced to the city or mountaintop. It covers the entire world from January 1, 1979, onward and is updated every 15 minutes. Since February 2015, GDELT has brought together 940 million messages from global cyberspace in a volume of 9.4 TB (https://www.gdeltproject.org/). A report by McKinsey (Manyika et al. 2011) estimated that corporations, institutions, and users stored more than 13 exabytes of new data in 2010, which is over 50,000 times larger than the amount of data in the Library of Congress. The value of global personal location data is estimated to be $700 billion, and these data can reduce costs by as much as 50 percent in product development and assembly.
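To put some of these scale figures on a common footing, the following is a minimal back-of-envelope sketch. It uses only the Facebook and GDELT numbers quoted above and assumes decimal (SI) units, where 1 TB is 10^12 bytes and 1 PB is 10^15 bytes. The function name `scale_summary` and the derived quantities are illustrative and do not come from the cited sources.

```python
# Assumption: decimal (SI) storage units.
GB = 10**9
PB = 10**15

# Figures quoted in the text.
FACEBOOK_NEW_BYTES_PER_DAY = 4 * PB     # 4 new petabytes per day
FACEBOOK_WAREHOUSE_BYTES = 300 * PB     # 300 petabytes stored in Hive
FACEBOOK_TABLES = 800_000               # number of Hive tables
GDELT_MESSAGES = 940_000_000            # messages collected since February 2015
GDELT_BYTES = 9_400 * GB                # 9.4 TB, kept as an exact integer


def scale_summary() -> dict:
    """Derive rough per-unit averages from the quoted totals."""
    return {
        # Average size of one GDELT message, in bytes.
        "gdelt_bytes_per_message": GDELT_BYTES / GDELT_MESSAGES,
        # Average size of one Hive table, in gigabytes.
        "warehouse_gb_per_table": FACEBOOK_WAREHOUSE_BYTES / FACEBOOK_TABLES / GB,
        # Days needed to produce the warehouse volume at the quoted daily rate.
        "days_to_generate_warehouse": FACEBOOK_WAREHOUSE_BYTES / FACEBOOK_NEW_BYTES_PER_DAY,
    }


if __name__ == "__main__":
    for name, value in scale_summary().items():
        print(f"{name}: {value:,.0f}")
```

Under these assumptions the averages work out to roughly 10,000 bytes per GDELT message, 375 GB per Hive table, and 75 days of new Facebook data to match the size of the warehouse. These are crude averages meant only to convey scale, not measurements.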
Both industry and academic demands for data analytical skills have soared rapidly and continue to do so. IBM projects that by 2020 the number of jobs requiring data analytical skills in the United States will increase by 15 percent, to more than 2.7 million, and job openings requiring advanced data science analytical skills will reach more than 60,000 (Miller and Hughes 2017). Global firms are focusing on data-intensive sectors such as finance, insurance, and medicine. The topic of big data has been covered in popular news media such as the Economist (2017), the New York Times (Lohr 2012), and National Public Radio (Harris 2016), and data mining has also been featured in Forbes (2015; Brown 2018), the Atlantic (Furnis 2012), and Time (Stein 2011), to name a few.
The growth of big data has also revolutionized scientific research. Computational social science has emerged as a new methodology, and it is growing in popularity as a result of dramatic increases in available data on human and organizational behaviors (Lazer et al. 2009). Astronomy has also been revolutionized by using a huge database of space pictures, the Sloan Digital Sky Survey, to identify interesting objects and phenomena (https://www.sdss.org/). Bioinformatics has emerged from biological science to focus on databases of genome sequencing, allowing millions or billions of DNA strands to be sequenced rapidly in parallel.
In the field of artificial intelligence (AI), scientists have developed AlphaGo, which was first trained to model expert players using a database of 30 million moves from recorded historical games and was later trained to learn new strategies for itself (https://deepmind.com/research/alphago/). AlphaGo has defeated Go world champions many times and is regarded as the strongest Go player in the game’s history. This is a major advancement over the old AI technology. When IBM’s Deep Blue beat chess champion Garry Kasparov in the late 1990s, it used brute-force AI that searched chess moves in a space that was just a small fraction of the search space for Go.
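The gap between the two search spaces can be illustrated with a rough game-tree estimate of the form b^d, where b is the typical number of legal moves per turn and d is the typical game length. The values used below (about 35 and 80 for chess, about 250 and 150 for Go) are commonly cited approximations and are assumptions of this sketch, not figures from the text above. The helper name `log10_tree_size` is likewise illustrative.

```python
import math


def log10_tree_size(branching: float, depth: int) -> float:
    """Return log10(branching ** depth), a crude count of possible move sequences.

    Working in log space avoids building astronomically large numbers.
    """
    return depth * math.log10(branching)


# Assumed approximations: (moves per turn, moves per game).
chess_exponent = log10_tree_size(35, 80)
go_exponent = log10_tree_size(250, 150)

print(f"chess: about 10^{chess_exponent:.1f} move sequences")
print(f"Go:    about 10^{go_exponent:.1f} move sequences")
print(f"ratio: about 10^{go_exponent - chess_exponent:.1f}")
```

With these assumed values, the Go tree is more than two hundred orders of magnitude larger than the chess tree. This suggests why exhaustive search that was workable for chess does not carry over to Go, and why AlphaGo relied on learning from game data instead.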
The Google Books corpus has made it possible to expand quantitative analysis into a wider array of topics in the social sciences and the humanities (Michel et al. 2011). By analyzing this corpus, social scientists and humanists have been able to provide insights into cultural trends that include English-language lexicography, the evolution of grammar, collective memory, adoption of technology, pursuits of fame, censorship, and historical epidemiology.
In response to this fast-growing demand, universities and colleges have developed data science or data studies majors. These fields have grown from the confluence of statistics, machine learning, AI, and computer science. They are products of a structural transformation in the nature of research in disciplines that include communication, psychology, sociology, political science, economics, business and commerce, environmental science, linguistics, and the humanities. Data mining projects not only require that users possess in-depth knowledge about data processing, database technology, and statistical and computational algorithms; they also require domain-specific knowledge (from experts such as psychologists, economists, sociologists, political scientists, and linguists) to combine with available data mining tools to discover valid and meaningful knowledge. On many university campuses, social sciences programs have joined forces to consolidate course offerings across disciplines to teach introductory, intermediate, and advanced courses on data description, visualization, mining, and modeling to students in the social sciences and humanities.
This chapter examines the major concepts of big data, knowledge discovery in databases, data mining, and computational social science. It analyzes the characteristics of these terms, their central features, components, and research methods.

WHAT IS BIG DATA?

The concept of big data was conceived in 2001 when the META analyst D. Laney (2001) proposed the famous “3V’s Model” to cope with the management of increasingly large amounts of data. Laney described the data as of large volume, growing at a high velocity, and having great variety. The concept of big data became popular in 2008 when Nature featured a special issue on the utility, approaches, and challenges of big data analysis. Big data has since become a widely discussed new topic in all areas of scientific research. Science featured a special forum on big data in 2011, further highlighting the enormous potential and great challenge of big data research. In the same year, McKinsey’s report “Big Data: The Next Frontier for Innovation, Competition, and Productivity” (2011) announced that the tsunami of data will bring enormous productivity and profits, adding enthusiasm to this already exciting development. Mayer-Schönberger and Cukier (2012) focused on the dramatic impacts that big data will have on the economy, science, and society and the revolutionary changes it will bring about in society at large.
A variety of definitions of big data all agree on one central feature of this concept: data enormity and complexity. Some define big data as data that are too large for traditional database technologies to store, access, manage, and analyze (Manyika et al. 2011). Others define big data by its four characteristic big V’s: (1) big volume, measured in terabytes or petabytes; (2) big velocity, with data growing rapidly and continuously; (3) big variety, which includes structured numerical data and unstructured data such as text, pictures, video, and sound; and (4) big value, which can be translated into enormous economic profits, academic knowledge, and policy insights. Analysis of big data uses computational algorithms, cloud storage, and AI to instantaneously and continuously mine and analyze data (Dumbill 2013).
There are just as many scholars who think big data is a multifaceted and complex concept that cannot be viewed simply from a data or technology perspective (Mauro, Greco, and Grimaldi 2016). A word cloud analysis from the literature shows that big data can be viewed from at least four dif...

Table of contents

  1. Title
  2. Copyright
  3. Dedication
  4. Contents
  5. Part I. Knowledge Discovery and Data Mining in Social Science Research
  6. Part II. Data Preprocessing
  7. Part III. Model Assessment
  8. Part IV. Data Mining: Unsupervised Learning
  9. Part V. Data Mining: Supervised Learning
  10. Part VI. Data Mining: Text Data and Network Data
  11. Index