Data Science and Big Data Analytics

Discovering, Analyzing, Visualizing and Presenting Data

About this book

Data Science and Big Data Analytics is about harnessing the power of data for new insights. The book covers the breadth of activities, methods, and tools that Data Scientists use. The content focuses on concepts, principles, and practical applications that apply to any industry and technology environment, and the learning is supported with examples that you can replicate using open-source software.

This book will help you:

  • Become a contributor on a data science team
  • Deploy a structured lifecycle approach to data analytics problems
  • Apply appropriate analytic techniques and tools to analyze big data
  • Learn how to tell a compelling story with data to drive business action
  • Prepare for EMC Proven Professional Data Science Certification

Get started discovering, analyzing, visualizing, and presenting data in a meaningful way today!

Information

Publisher: Wiley
Year: 2015
Print ISBN: 9781118876138
eBook ISBN: 9781118876053

Chapter 1
Introduction to Big Data Analytics

Key Concepts

  1. Big Data overview
  2. State of the practice in analytics
  3. Business Intelligence versus Data Science
  4. Key roles for the new Big Data ecosystem
  5. The Data Scientist
  6. Examples of Big Data analytics
Much has been written about Big Data and the need for advanced analytics within industry, academia, and government. Availability of new data sources and the rise of more complex analytical opportunities have created a need to rethink existing data architectures to enable analytics that take advantage of Big Data. In addition, significant debate exists about what Big Data is and what kinds of skills are required to make best use of it. This chapter explains several key concepts to clarify what is meant by Big Data, why advanced analytics are needed, how Data Science differs from Business Intelligence (BI), and what new roles are needed for the new Big Data ecosystem.

1.1 Big Data Overview

Data is created constantly, and at an ever-increasing rate. Mobile phones, social media, and the imaging technologies used to determine a medical diagnosis all create new data that must be stored somewhere for some purpose. Devices and sensors automatically generate diagnostic information that needs to be stored and processed in real time. Merely keeping up with this huge influx of data is difficult. Substantially more challenging is analyzing vast amounts of it to identify meaningful patterns and extract useful information, especially when it does not conform to traditional notions of data structure. These challenges of the data deluge present the opportunity to transform business, government, science, and everyday life.
Several industries have led the way in developing their ability to gather and exploit data:
  • Credit card companies monitor every purchase their customers make and can identify fraudulent purchases with a high degree of accuracy using rules derived by processing billions of transactions.
  • Mobile phone companies analyze subscribers' calling patterns to determine, for example, whether a caller's frequent contacts are on a rival network. If that rival network is offering an attractive promotion that might cause the subscriber to defect, the mobile phone company can proactively offer the subscriber an incentive to remain in her contract.
  • For companies such as LinkedIn and Facebook, data itself is their primary product. The valuations of these companies are heavily derived from the data they gather and host, which contains more and more intrinsic value as the data grows.
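The mobile phone example above can be made concrete with a small sketch. The Python snippet below is illustrative only: the call records, the network names, the `top_n` cutoff, and the function name `rival_share` are all assumptions invented for this example, not a description of any carrier's actual method.

```python
from collections import Counter

# Hypothetical call log for one subscriber: (contact_id, contact_network).
calls = [
    ("alice", "RivalTel"), ("alice", "RivalTel"), ("alice", "RivalTel"),
    ("bob", "HomeNet"), ("bob", "HomeNet"),
    ("carol", "RivalTel"), ("carol", "RivalTel"),
    ("dave", "HomeNet"),
]

def rival_share(calls, rival, top_n=3):
    """Fraction of the top_n most-called contacts that are on the rival network."""
    counts = Counter(contact for contact, _ in calls)
    network = dict(calls)  # contact -> network (assumes one network per contact)
    frequent = [contact for contact, _ in counts.most_common(top_n)]
    if not frequent:
        return 0.0
    on_rival = sum(1 for contact in frequent if network[contact] == rival)
    return on_rival / len(frequent)

share = rival_share(calls, "RivalTel")
print(f"{share:.2f}")  # prints 0.67
```

A high share could serve as one input to a retention decision, such as whether to offer the subscriber an incentive. A real system would process far larger call volumes and combine many more signals.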
Three attributes stand out as defining Big Data characteristics:
  • Huge volume of data: Rather than thousands or millions of rows, Big Data can be billions of rows and millions of columns.
  • Complexity of data types and structures: Big Data reflects the variety of new data sources, formats, and structures, including digital traces being left on the web and other digital repositories for subsequent analysis.
  • Speed of new data creation and growth: Big Data can describe high-velocity data, with rapid data ingestion and near-real-time analysis.
Although the volume of Big Data tends to attract the most attention, generally the variety and velocity of the data provide a more apt definition of Big Data. (Big Data is sometimes described as having 3 Vs: volume, variety, and velocity.) Due to its size or structure, Big Data cannot be efficiently analyzed using only traditional databases or methods. Big Data problems require new tools and technologies to store, manage, and realize the business benefit. These new tools and technologies enable creation, manipulation, and management of large datasets and the storage environments that house them. Another definition of Big Data comes from the McKinsey Global report from 2011:
"Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value."
McKinsey & Co.; Big Data: The Next Frontier for Innovation, Competition, and Productivity [1]
McKinsey's definition of Big Data implies that organizations will need new data architectures and analytic sandboxes, new tools, new analytical methods, and an integration of multiple skills into the new role of the data scientist, which will be discussed in Section 1.3. Figure 1.1 highlights several sources of the Big Data deluge.
Figure 1.1 What's driving the data deluge
The rate of data creation is accelerating, driven by many of the items in Figure 1.1.
Social media and genetic sequencing are among the fastest-growing sources of Big Data and examples of untraditional sources of data being used for analysis.
For example, in 2012 Facebook users posted 700 status updates per second worldwide, which can be leveraged to deduce latent interests or political views of users and show relevant ads. For instance, an update in which a woman changes her relationship status from “single” to “engaged” would trigger ads on bridal dresses, wedding planning, or name-changing services.
Facebook can also construct social graphs to analyze which users are connected to each other as an interconnected network. In March 2013, Facebook released a new feature called “Graph Search,” enabling users and developers to search social graphs for people with similar interests, hobbies, and shared locations.
Another example comes from genomics. Genetic sequencing and human genome mapping provide a detailed understanding of genetic makeup and lineage. The health care industry is looking toward these advances to help predict which illnesses a person is likely to get in his lifetime and take steps to avoid these maladies or reduce their impact through the use of personalized medicine and treatment. Such tests also highlight typical responses to different medications and pharmaceutical drugs, heightening risk awareness of specific drug treatments.
While data has grown, the cost to perform this work has fallen dramatically. The cost to sequence one human genome has fallen from $100 million in 2001 to $10,000 in 2011, and the cost continues to drop. Now, websites such as 23andme (Figure 1.2) offer genotyping for less than $100. Although genotyping analyzes only a fraction of a genome and does not provide as much granularity as genetic sequencing, it does point to the fact that data and complex analysis are becoming more prevalent and less expensive to deploy.
Figure 1.2 Examples of what can be learned through genotyping, from 23andme.com
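The sequencing cost figures quoted above can be turned into a rough rate of decline. The short Python calculation below uses only the two data points from the text ($100 million in 2001 and $10,000 in 2011). The assumption of a constant year-over-year decline is a simplification made for illustration, since actual costs did not fall at a steady rate.

```python
# Figures quoted in the text; the constant-annual-rate model is an assumption.
cost_2001 = 100_000_000  # USD per genome
cost_2011 = 10_000       # USD per genome
years = 2011 - 2001

fold_reduction = cost_2001 / cost_2011

# Constant yearly multiplier r such that cost_2001 * r**years == cost_2011.
yearly_multiplier = (cost_2011 / cost_2001) ** (1 / years)
yearly_decline = 1 - yearly_multiplier

print(f"{fold_reduction:,.0f}-fold reduction over {years} years")
print(f"average decline of about {yearly_decline:.0%} per year")
```

Under that simplifying assumption, a 10,000-fold reduction over ten years works out to a drop of roughly 60 percent each year.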
As illustrated by the examples of social media and genetic sequencing, individuals and organizations both derive benefits from analysis of ever-larger and more complex datasets that require increasingly powerful analytical capabilities.

1.1.1 Data Structures

Big data can come in multiple forms, including structured and non-structured data such as financial data, text files, multimedia files, and genetic mappings. Contrary to much of the traditional data analysis performed by organizations, most of the Big Data is unstructured or sem...

Table of contents

  1. Cover
  2. Foreword
  3. Introduction
  4. Chapter 1: Introduction to Big Data Analytics
  5. Chapter 2: Data Analytics Lifecycle
  6. Chapter 3: Review of Basic Data Analytic Methods Using R
  7. Chapter 4: Advanced Analytical Theory and Methods: Clustering
  8. Chapter 5: Advanced Analytical Theory and Methods: Association Rules
  9. Chapter 6: Advanced Analytical Theory and Methods: Regression
  10. Chapter 7: Advanced Analytical Theory and Methods: Classification
  11. Chapter 8: Advanced Analytical Theory and Methods: Time Series Analysis
  12. Chapter 9: Advanced Analytical Theory and Methods: Text Analysis
  13. Chapter 10: Advanced Analytics—Technology and Tools: MapReduce and Hadoop
  14. Chapter 11: Advanced Analytics—Technology and Tools: In-Database Analytics
  15. Chapter 12: The Endgame, or Putting It All Together
  16. End User License Agreement